Commit Graph

32663 Commits

Author SHA1 Message Date
Kamil Braun
5badf20c7a raft: server: use visit instead of holds_alternative+get
In `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
2022-08-22 18:47:48 +02:00
Avi Kivity
afa7960926 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()
2022-08-15 19:05:59 +03:00
Botond Dénes
d56dcb842c db/virtual_table: add virtual destructor to virtual_table
It should have had one, derived instances are stored and destroyed via
the base-class. The only reason this hasn't caused bugs yet is that
derived instances happen to not have any non-trivial members yet.

Closes #11293
2022-08-15 16:58:05 +03:00
Avi Kivity
73d4930815 Merge 'test/lib: various improvements to sstable test env' from Botond Dénes
A mixed bag of improvements developed as part of another PR (https://github.com/scylladb/scylladb/pull/10736). Said PR was closed so I'm submitting these improvements separately.

Closes #11294

* github.com:scylladb/scylladb:
  test/lib: move convenience table config factory to sstable_test_env
  test/lib/sstable_test_env: move members to impl struct
  test/lib/sstable_utils: use test_env::do_with_async()
2022-08-15 16:57:01 +03:00
Botond Dénes
92e5f438a4 querier: querier_cache: remove now unused evict_all_for_table() 2022-08-15 14:16:41 +03:00
Botond Dénes
2b1eb6e284 database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
Instead of querier_cache::evict_all_for_table(). The new method covers
all queriers and, in addition, any other inactive reads registered on the
semaphore. In theory, by the time we detach a table, no regular inactive
reads should be in the semaphore anymore, but if any remain, we had
better evict them before the table is destroyed, since they might attempt
to access it when destroyed later.
2022-08-15 14:16:41 +03:00
Botond Dénes
e55ccbde8f reader_concurrency_semaphore: add evict_inactive_reads_for_table()
Allowing for evicting all inactive reads that belong to a certain table.
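
The eviction described above can be sketched as follows. This is a minimal Python model, not the actual C++ `reader_concurrency_semaphore`: the `SemaphoreModel` and `InactiveRead` classes, the `register_inactive_read()` helper, and the string table names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InactiveRead:
    table: str            # name of the table this read belongs to
    evicted: bool = False


@dataclass
class SemaphoreModel:
    # Hypothetical stand-in for the semaphore's registry of inactive reads.
    inactive_reads: list = field(default_factory=list)

    def register_inactive_read(self, table: str) -> InactiveRead:
        read = InactiveRead(table)
        self.inactive_reads.append(read)
        return read

    def evict_inactive_reads_for_table(self, table: str) -> int:
        """Evict every inactive read of `table`; return how many were evicted."""
        kept, evicted = [], 0
        for read in self.inactive_reads:
            if read.table == table:
                read.evicted = True
                evicted += 1
            else:
                kept.append(read)
        self.inactive_reads = kept
        return evicted
```

Reads that belong to other tables stay registered, so only the table being detached is affected.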
2022-08-15 14:16:41 +03:00
Botond Dénes
c8ef356859 test/lib: move convenience table config factory to sstable_test_env
All users of `column_family_test_config()` get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
2022-08-15 11:23:59 +03:00
Botond Dénes
c0e017e0f7 test/lib/sstable_test_env: move members to impl struct
All present members of sstable_test_env are std::unique_ptr<>:s because
they require stable addresses. This makes their handling somewhat
awkward. Move all of them into an internal `struct impl` and make that
member a unique ptr.
2022-08-15 11:20:09 +03:00
Botond Dénes
a9f296ed47 test/lib/sstable_utils: use test_env::do_with_async()
Instead of manually instantiating test_env.
2022-08-15 11:19:27 +03:00
Botond Dénes
a9573b84c5 Merge 'commitlog: Revert/modify fac2bc4 - do footprint add in delete' from Calle Wilund
Fixes #11184
Fixes #11237

In the previous (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over
files (replay candidates) to the disk footprint on commitlog init.

This effectively prevents us from creating segments if we have tight limits. Since we nowadays do quite a few inserts _before_ commitlog replay (system.local, but...), we can end up in a situation where startup deadlocks because we cannot get to the actual replay that will eventually free things.

Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.

Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.

To further emphasize this, don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at least don't store
pointless string lists in shards 1 to X, and also we can potentially ask for the list twice.
More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.

Closes #11251

* github.com:scylladb/scylladb:
  commitlog: Make get_segments_to_replay on-demand
  commitlog: Revert/modify fac2bc4 - do footprint add in delete
2022-08-15 09:10:32 +03:00
Botond Dénes
8f10413087 Merge 'doc: describe specifying workload attributes with service levels' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11197

This PR adds a new page where specifying workload attributes with service levels is described and adds it to the menu.

Also, I had to fix some links because of the warnings.

Closes #11209

* github.com:scylladb/scylladb:
  doc: remove the reduntant space from index
  doc: update the syntax for defining service level attributes
  doc: rewording
  doc: update the links to fix the warnings
  doc: add the new page to the toctree
  doc: add the descrption of specifying workload attributes with service levels
  doc: add the definition of workloads to the glossary
2022-08-15 07:14:28 +03:00
Nadav Har'El
c8b5c3595e Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity
Increase readability in preparation for managing topology with
effective_replication_map (continuing 69aea59d9).

Closes #11290

* github.com:scylladb/scylladb:
  cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
2022-08-14 23:26:06 +03:00
Nadav Har'El
4a4231ea53 Merge 'storage_proxy: coroutinize some counter mutate functions' from Avi Kivity
In preparation for effective_replication_map hygiene, convert
some counter functions to coroutines to simplify the changes.

Closes #11291

* github.com:scylladb/scylladb:
  storage_proxy: mutate_counters_on_leader: coroutinize
  storage_proxy: mutate_counters: coroutinize
  storage_proxy: mutate_counters: reorganize error handling
2022-08-14 23:16:42 +03:00
Avi Kivity
8070cdbbf9 storage_proxy: mutate_counters_on_leader: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.
2022-08-14 17:36:58 +03:00
Avi Kivity
6e330d98d2 storage_proxy: mutate_counters: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.

This is probably a pessimization of the error case, but the error case
will be terrible in any case unless we resultify it.
2022-08-14 17:28:46 +03:00
Avi Kivity
105b066ff7 storage_proxy: mutate_counters: reorganize error handling
Move the error handling function where it's used so the code
is more straightforward.

Due to some std::move()s later, we must still capture the schema early.
2022-08-14 17:13:22 +03:00
Avi Kivity
fbaa280acd cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query()
Move the termination condition to the front of the loop so it's
clear why we're looping and when we stop.

It's less than perfectly clean since we widen the scope of some variables
(from loop-internal to loop-carried), but IMO it's clearer.
2022-08-14 15:40:45 +03:00
Avi Kivity
60c7c11c96 cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
Reindent after coroutinization. No functional changes.
2022-08-14 15:35:36 +03:00
Avi Kivity
492dc6879e cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
It's much easier to maintain this way. Since it uses ranges_to_vnodes,
it interacts with topology and needs integration into
effective_replication_map management.

The patch leaves bad indentation and an infinite-looking loop in
the interest of minimization, but that will be corrected later.

Note, the test for `!r.has_value()` was eliminated since it was
short-circuited by the test for `!rqr.has_value()` returning from
the coroutine rather than propagating an error.
2022-08-14 15:31:45 +03:00
Avi Kivity
973034978c cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
We use result_wrap() in two places, but that makes coroutinizing the
containing function a little harder, since it's composed of more lambdas.

Remove the wrappers, gaining a bit of performance in the error case.
2022-08-14 15:22:18 +03:00
Kamil Braun
b4c5b79f5e db: system_distributed_keyspace: don't call on_internal_error in check_exists
The function `check_exists` checks whether a given table exists, giving
an error otherwise. It previously used `on_internal_error`.

`check_exists` is used in some old functions that insert CDC metadata to
CDC tables. These tables are no longer used in newer Scylla versions
(they were replaced with other tables with different schema), and this
function is no longer called. The table definitions were removed and
these tables are no longer created. They will only exist in clusters
that were upgraded from old versions of Scylla (4.3) through a sequence
of upgrades.

If you tried to upgrade from a very old version of Scylla which had
neither the old nor the new tables to a modern version, say from 4.2 to
5.0, you would get `on_internal_error` from this `check_exists`
function. Fortunately:
1. we don't support such upgrade paths
2. `on_internal_error` in production clusters does not crash the system,
   only throws. The exception would be caught, printed, and the system
   would run (just without CDC - until you finished upgrade and called
   the proper nodetool command to fix the CDC module).

Unfortunately, there is a dtest (`partitioner_tests.py`) which performs
an unsupported upgrade scenario - it starts Scylla from Cassandra (!)
work directories, which is like upgrading from a very old version of
Scylla.

This dtest was not failing due to another bug which masked the problem.
When we try to fix the bug - see #11225 - the dtest starts hitting the
assertion in `check_exists`. Because it's a test, we configure
`on_internal_error` to crash the system.

The point of this commit is to not crash the system in this rare
scenario which happens only in some weird tests. We now throw
`std::runtime_error` instead of calling `on_internal_error`. In the
dtest, we already ignore the resulting CDC error appearing in the logs
(see scylladb/scylla-dtest#2804). Together with this change, we'll be
able to fix the #11225 bug and pass this test.

Closes #11287
2022-08-14 13:12:03 +03:00
Piotr Sarna
fe617ed198 Merge 'db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column' from Piotr Dulikowski
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:

- The `broadcast_rpc_address` is the address that the drivers are
  supposed to connect to. `rpc_address` is the address that the node
  binds to - it can be set for example to 0.0.0.0 so that Scylla listens
  on all addresses, however this gives no useful information to the
  driver.
- The `system.peers` table also has the `rpc_address` column and it
  already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.

Fixes: #11201

Closes #11204

* github.com:scylladb/scylladb:
  db/system_keyspace: fix indentation after previous patch
  db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
2022-08-12 16:24:28 +02:00
Piotr Sarna
1ab4c6aab3 Merge 'cql3: enable collections as UDA accumulators' from Wojciech Mitros
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values.

Instead of from_string()/to_string(), the CQL parser is used
for creating a value from a string, and to_parsable_string()
is used for converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #11250

* github.com:scylladb/scylladb:
  cql3: enable collections as UDA accumulators
  cql3: extend implementation of to_bytes for raw_value
2022-08-12 12:51:17 +02:00
Botond Dénes
ceb1cdcb7a Merge 'doc: fix the typo on the Fault Tolerance page' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/438
In addition, I've replaced "Scylla" with "ScyllaDB" on that page.

Closes #11281

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB on the Fault Tolerance page
  doc: fis the typo in the note
2022-08-12 06:58:39 +03:00
Nadav Har'El
c27f431580 test/alternator: fix a flaky test for full-table scan page size
This patch fixes the test test_scan.py::test_scan_paging_missing_limit
which failed in a Jenkins run once (that we know of).

That test verifies that an Alternator Scan operation *without* an explicit
"Limit" is nevertheless paged: DynamoDB (and also Scylla) wanted this page
size to be 1 MB, but it turns out (see #10327) that because of the details
of how Scylla's scan works, the page size can be larger than 1 MB. How much
larger? I ran this test hundreds of times and never saw it exceed a 3 MB
page - so the test asserted the page must be smaller than 4 MB. But now
in one run - we got to this 4 MB and failed the test.

So in this patch we increase the table to be scanned from 4 MB to 6 MB,
and assert the page size isn't the full 6 MB. The chance that this size will
eventually fail as well should be (famous last words...) very small for
two reasons: First because 6 MB is even higher than the maximum I saw
in practice, and second because empirically I noticed that adding more
data to the table reduces the variance of the page size, so it should
become closer to 1 MB and reduce the chance of it reaching 6 MB.

Refs #10327

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11280
2022-08-12 06:57:45 +03:00
Botond Dénes
2a39d6518d Merge 'doc: clarify the disclaimer about reusing deleted counter column values' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/857

Closes #11253

* github.com:scylladb/scylladb:
  doc: language improvemens to the Counrers page
  doc: fix the external link
  doc: clarify the disclaimer about reusing deleted counter column values
2022-08-12 06:56:28 +03:00
Botond Dénes
10371441c9 Merge 'docs: add a disclaimer about not supporting local counters by SSTableLoader' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/867
Plus some language, formatting, and organization improvements.

Closes #11248

* github.com:scylladb/scylladb:
  doc: language, formatting, and organization improvements
  doc: add a disclaimer about not supporting local counters by SSTableLoader
2022-08-12 06:55:00 +03:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Botond Dénes
69aea59d97 Merge 'storage_proxy: use consistent topology, prepare for fencing' from Avi Kivity
Replication is a mix of several inputs: tokens and token->node mappings (topology),
the replication strategy, replication strategy parameters. These are all captured
in effective_replication_map.

However, if we use effective_replication_map:s captured at different times in a single
query, then different uses may see different inputs to effective_replication_map.

This series protects against that by capturing an effective_replication_map just
once in a query, and then using it. Furthermore, the captured effective_replication_map
is held until the query completes, so topology code can know when a topology is no
longer in use (although this isn't exploited in this series).
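
The capture-once pattern described above can be sketched as follows. This is a minimal Python model under stated assumptions: `ReplicationState`, `run_query()`, and the `on_between_steps` hook are hypothetical names, and the real effective_replication_map holds far more than a token-to-replicas dictionary.

```python
from types import MappingProxyType


class ReplicationState:
    """Hypothetical holder of the current token -> replicas mapping."""

    def __init__(self, mapping):
        self._current = MappingProxyType(dict(mapping))

    def get_effective_replication_map(self):
        # Return the current read-only snapshot.
        return self._current

    def update(self, mapping):
        # Topology change: publish a new snapshot; old ones stay untouched.
        self._current = MappingProxyType(dict(mapping))


def run_query(state, token, on_between_steps=lambda: None):
    erm = state.get_effective_replication_map()  # captured once per query
    replicas_first_use = erm[token]
    on_between_steps()                           # topology may change here
    replicas_second_use = erm[token]             # still the same snapshot
    return replicas_first_use, replicas_second_use
```

Because the query keeps a reference to its snapshot until it returns, both uses see the same inputs even if `update()` runs in between.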

Only the simple read and write paths are covered. Counters and paxos are left for
later.

I don't think the series fixes any bugs - as far as I could tell everything was happening
in the same continuation. But this series ensures it.

Closes #11259

* github.com:scylladb/scylladb:
  storage_proxy: use consistent topology
  storage_proxy: use consistent replication map on read path
  storage_proxy: use consistent replication map on write path
  storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
  consistency_level: accept effective_replication_map as parameter, rather than keyspace
  consistency_level: be more const when using replication_strategy
2022-08-12 06:00:30 +03:00
Avi Kivity
a2c4f5aa1a storage_proxy: use consistent topology
Derive the topology from captured and stable effective_replication_map
instead of getting a fresh topology from storage_proxy, since the
fresh topology may be inconsistent with the running query.

digest_read_resolver did not capture an effective_replication_map, so
that is added.
2022-08-11 17:58:42 +03:00
Avi Kivity
883518697b storage_proxy: use consistent replication map on read path
Capture a replication map just once in
abstract_read_executor::_effective_replication_map_ptr. Although it isn't
used yet, it serves to keep a reference count on topology (for fencing),
and some accesses to topology within reads still remain, which can be
converted to use the member in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
01a614fb4d storage_proxy: use consistent replication map on write path
Capture a replication map just once in
abstract_write_handler::_effective_replication_map_ptr and use it
in all write handlers. A few accesses to get the topology still remain,
they will be fixed up in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
f1b0e3d58e storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
Allow callers to use consistent effective_replication_map:s across calls
by letting the caller select the object to use.
2022-08-11 17:58:42 +03:00
Avi Kivity
46bd0b1e62 consistency_level: accept effective_replication_map as parameter, rather than keyspace
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).

Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.

Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
2022-08-11 17:58:42 +03:00
Avi Kivity
1078d1bfda consistency_level: be more const when using replication_strategy
We don't modify the replication_strategy here, so use const. This
will help when the object we get is const itself, as it will be in
the next patches.
2022-08-11 17:58:42 +03:00
Wojciech Mitros
48bd752971 cql3: enable collections as UDA accumulators
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values. For example, a list with string elements:
'a, b', 'c' would be represented as a string 'a, b, c', while now
it is represented as "['a, b', 'c']".
Some types that were parsable are now represented in a different
way. For example, a tuple ('a', null, 0) was represented as
"a:\@:0", and now it is "('a', null, 0)".
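
The ambiguity described above can be reproduced with a small Python sketch. It is only an analogy: `repr()` and `ast.literal_eval()` stand in for to_parsable_string() and the CQL parser, and all function names here are hypothetical.

```python
import ast


def naive_to_string(values):
    # Joins elements with an unescaped separator, like the old format.
    return ", ".join(values)


def naive_from_string(text):
    # Cannot tell a separator from a ", " inside an element.
    return text.split(", ")


def parsable_to_string(values):
    # repr() quotes each element; stand-in for to_parsable_string().
    return repr(list(values))


def parsable_from_string(text):
    # ast.literal_eval() is a stand-in for the CQL parser.
    return ast.literal_eval(text)


original = ["a, b", "c"]
```

Round-tripping `original` through the naive pair yields three elements instead of two, while the quoted form round-trips unchanged.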

Instead of from_string()/to_string(), the CQL parser is used
for creating a value from a string, and to_parsable_string()
is used for converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-08-11 16:23:57 +02:00
Anna Stuchlik
f5a49688ae doc: replace Scylla with ScyllaDB on the Fault Tolerance page 2022-08-11 16:14:33 +02:00
Anna Stuchlik
7218a977df doc: fis the typo in the note 2022-08-11 16:09:49 +02:00
Botond Dénes
d407d3b480 Merge 'Calculate effective_replication_map: prevent stalls with everywhere_replication_strategy' from Benny Halevy
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
only once and reuse the result for all tokens.

Note that ideally the replication_map could contain only
a single token range for this case, but that doesn't seem to work yet.

Add `maybe_yield()` calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.
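
The optimization can be sketched as follows. This is a simplified Python model, not the actual C++ code: the function signature is an assumption, the endpoints are shared rather than copied per token, and the `maybe_yield()` calls are omitted.

```python
def calculate_effective_replication_map(tokens, calculate_natural_endpoints,
                                        has_uniform_natural_endpoints):
    """Build a token -> endpoints map, computing endpoints once if uniform."""
    replication_map = {}
    if has_uniform_natural_endpoints:
        # Same endpoints for every token: compute once, reuse everywhere.
        endpoints = calculate_natural_endpoints(tokens[0]) if tokens else ()
        for token in tokens:
            replication_map[token] = endpoints
    else:
        # General case: endpoints depend on the token.
        for token in tokens:
            replication_map[token] = calculate_natural_endpoints(token)
    return replication_map
```

With a uniform strategy, the potentially expensive endpoint calculation runs once instead of once per token.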

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about a similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Closes #11277

* github.com:scylladb/scylladb:
  abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
  abstract_replication_strategy: add has_uniform_natural_endpoints
2022-08-11 15:19:47 +03:00
Anna Stuchlik
1603129275 doc: remove the reduntant space from index 2022-08-11 12:36:16 +02:00
Anna Stuchlik
ee258cb0af doc: update the syntax for defining service level attributes 2022-08-11 12:32:38 +02:00
Petr Gusev
4bc6611829 raft read_barrier, retry over intermittent rpc failures
If the leader was unavailable during read_barrier,
closed_error occurs, which was not handled in any way
and eventually reached the client. This patch adds retries in this case.
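
A retry loop of this kind can be sketched as follows. This is a generic Python illustration, not the patch itself: `ClosedError`, `read_barrier_with_retry()`, and the `max_attempts` bound are hypothetical, and whether the real code bounds its retries is not stated here.

```python
class ClosedError(Exception):
    """Stand-in for the transient RPC failure described above."""


def read_barrier_with_retry(read_barrier, max_attempts=5):
    """Call `read_barrier`, retrying while it fails with ClosedError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_barrier()
        except ClosedError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
```

Only the transient error is retried; any other exception propagates immediately.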

Fix: scylladb#11262
Refs: #11278

Closes #11263
2022-08-11 13:31:19 +03:00
Amnon Heiman
5ac20ac861 Reduce the number of per-scheduling group metrics
This patch reduces the number of metrics ScyllaDB generates.

Motivation: The combination of per-shard with per-scheduling group
generates a lot of metrics. When combined with histograms, which require
many metrics, the problem becomes even bigger.

The two tools we are going to use:
1. Replace per-shard histograms with summaries
2. Do not report unused metrics.

The storage_proxy stats holds information for the API and the metrics
layer. We replaced timed_rate_moving_average_and_histogram and
time_estimated_histogram with the unified
timed_rate_moving_average_summary_and_histogram, which gives us the option
to report per-shard summaries instead of histograms.

All the counters, histograms, and summaries were marked as
skip_when_empty.

The API was modified to use
timed_rate_moving_average_summary_and_histogram.

Closes #11173
2022-08-11 13:31:19 +03:00
Benny Halevy
9167b857e9 abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
only once and reuse the result for all tokens.

Note that ideally the replication_map could contain only
a single token range for this case, but that doesn't seem to work yet.

Add maybe_yield() calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about a similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:35:29 +03:00
Benny Halevy
eb678e723b abstract_replication_strategy: add has_uniform_natural_endpoints
So that using calculate_natural_endpoints can be optimized
for strategies that return the same endpoints for all tokens,
namely everywhere_replication_strategy and local_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:34:14 +03:00
Calle Wilund
a729c2438e commitlog: Make get_segments_to_replay on-demand
Refs #11237

Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at least don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
2022-08-11 06:41:23 +00:00
Nadav Har'El
d03bd82222 Revert "test: move scylla_inject_error from alternator/ to cql-pytest/"
This reverts commit 8e892426e2 and fixes
the code in a different way:

That commit moved the scylla_inject_error function from
test/alternator/util.py to test/cql-pytest/util.py and renamed
test/alternator/util.py. I found the rename confusing and unnecessary.
Moreover, the moved function isn't even usable today by the test suite
that includes it, cql-pytest, because it lacks the "rest_api" fixture :-)
so test/cql-pytest/util.py wasn't the right place for it anyway.
test/rest_api/rest_util.py could have been a good place for this function,
but there is another complication: Although the Alternator and rest_api
tests both had a "rest_api" fixture, it has a different type, which led
to the code in rest_api which used the moved function to have to jump
through hoops to call it instead of just passing "rest_api".

I think the best solution is to revert the above commit, and duplicate
the short scylla_inject_error() function. The duplication isn't an
exact copy - the test/rest_api/rest_util.py version now accepts the
"rest_api" fixture instead of the URL that the Alternator version used.

In the future we can remove some of this duplication by having some
shared "library" code but we should do it carefully and starting with
agreeing on the basic fixtures like "rest_api" and "cql", without that
it's not useful to share small functions that operate on them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11275
2022-08-11 06:43:26 +03:00
Wojciech Mitros
42e0fb90ea cql3: extend implementation of to_bytes for raw_value
When called with a null_value or an unset_value,
raw_value::to_bytes() threw a std::get error for
the wrong variant. This patch adds a description for
the errors thrown, and adds a to_bytes_opt() method
that returns std::nullopt instead of throwing.
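
The difference between the two accessors can be sketched as follows. This is a Python analogy only: the `RawValue` class, its `kind` field, and the error text are hypothetical, with `None` standing in for std::nullopt.

```python
from typing import Optional


class RawValue:
    """Hypothetical model: holds bytes, or is null / unset."""

    def __init__(self, kind: str, data: bytes = b""):
        if kind not in ("value", "null", "unset"):
            raise ValueError(f"unknown kind: {kind}")
        self.kind = kind
        self.data = data

    def to_bytes(self) -> bytes:
        # Throws with a descriptive message when there are no bytes to return.
        if self.kind != "value":
            raise ValueError(f"to_bytes() called on {self.kind} raw_value")
        return self.data

    def to_bytes_opt(self) -> Optional[bytes]:
        # Returns None instead of throwing for null / unset values.
        return self.data if self.kind == "value" else None
```

Callers that can handle a missing value use the `_opt` variant; callers that require bytes get an error that names the actual problem.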
2022-08-10 16:40:22 +02:00
Avi Kivity
e9cbc9ee85 Merge 'Add support for empty replica pages' from Botond Dénes
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as a short one, even if the page is currently empty. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this, no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
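
The paging behavior described above can be sketched as follows. This is a toy Python model under stated assumptions: a partition is a flat list of `("row", value)` and `("tombstone", None)` items, positions are list indexes, and `read_page()` and `read_all()` are hypothetical names rather than ScyllaDB functions.

```python
def read_page(items, start, page_size, tombstone_limit):
    """Read one page from `items` starting at index `start`.

    Returns (live_rows, next_pos, is_short). The page is cut short,
    possibly with no rows at all, once `tombstone_limit` tombstones
    have been processed.
    """
    live_rows, tombstones, pos = [], 0, start
    while pos < len(items):
        kind, value = items[pos]
        pos += 1
        if kind == "row":
            live_rows.append(value)
            if len(live_rows) == page_size:
                return live_rows, pos, False
        else:
            tombstones += 1
            if tombstones == tombstone_limit:
                # Cut the page early and report where to continue from.
                return live_rows, pos, True
    return live_rows, pos, False


def read_all(items, page_size, tombstone_limit):
    """Collect all pages; each page continues from the last position."""
    pages, pos = [], 0
    while pos < len(items):
        rows, pos, _ = read_page(items, pos, page_size, tombstone_limit)
        pages.append(rows)
    return pages
```

Because every page returns the position it stopped at, an empty page still advances the query past the tombstones it has already processed.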

The upgrade sanity test was conducted as follows:
* Created a cluster of 3 nodes with RF=3 with the master version
* Wrote a small dataset of 1000 rows.
* Deleted a prefix of 980 rows.
* Started a read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started with `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.

Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op,  12.0 tasks/op,   43007 insns/op,        0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op,  12.0 tasks/op,   43181 insns/op,        0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instructions/op.

Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672

Closes #11053

* github.com:scylladb/scylladb:
  test/cql-pytest: add test for query tombstone page limit
  query-result-writer: stop when tombstone-limit is reached
  service/pager: prepare for empty pages
  service/storage_proxy: set smallest continue pos as query's continue pos
  service/storage_proxy: propagate last position on digest reads
  query: result_merger::get() don't reset last-pos on short-reads and last pages
  query: add tombstone-limit to read-command
  service/storage_proxy: add get_tombstone_limit()
  query: add tombstone_limit type
  db/config: add config item for query tombstone limit
  gms: add cluster feature for empty replica pages
  tree: don't use query::read_command's IDL constructor
2022-08-10 13:38:06 +03:00