Commit Graph

26839 Commits

Author SHA1 Message Date
Avi Kivity
3e3003fcc1 Merge 'cql3: limit the concurrency of indexed statements' from Piotr Sarna
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.

The fixed 4096 value can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.

Fixes #8799

Tests: unit(release),
       manual: observing metrics of modified index_paging_test

Closes #8814

* github.com:scylladb/scylla:
  cql3: limit the transitional result size for indexed queries
  cql3: return indexed pages after 1MB worth of data
  cql3: limit the concurrency of indexed statements
2021-06-07 18:00:51 +03:00
Gleb Natapov
5d15ecb7e5 raft: do not block io_fiber just because of a slow follower
Currently if append_message cannot be sent to one of the followers the
entire io_fiber will block which eventually stop the replication. The
patch changes message sending part of io_fiber to be non blocking. The
code adds a hash table that is used to keep track of append_request
sending status per destination. All the remaining futures are waited for
during abort.
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
2021-06-07 16:55:14 +02:00
Gleb Natapov
01b6a2eb38 raft: randomized_nemesis_test: tick virtual clock less aggressively
Currently each tick of the virtual clock immediately schedules the next one
at the end of the task queue, but this is too aggressive. If a tick
generates work that need two tasks to be scheduled one after another
such implementation will make the task queue grow to infinity. Considering
that in the debug mode even ready future causes preemption and task
queue shuffling may cause two or more ticks to be executed without any
other work done in the middle it is very easy to get to such situation.

The patch changes the virtual clock to tick only when a shard is idle.
Message-Id: <20210606140305.2930189-1-gleb@scylladb.com>
2021-06-07 16:54:56 +02:00
Piotr Sarna
df0d44486a cql3: limit the transitional result size for indexed queries
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by previous number of bytes
fetched. Once this number reached 1MB, the concurrency will not be
increased in consecutive queries to avoid overload.
2021-06-07 16:29:18 +02:00
Piotr Sarna
60e55b6c7f cql3: return indexed pages after 1MB worth of data
Currently there's no practical limit of the resulting page size
for an indexed query, because it simply translates a page worth
of base primary keys into base rows. In order to avoid sending
too large pages, the result is returned after hitting a 1MB limit.
2021-06-07 16:05:50 +02:00
Piotr Sarna
8eeac10ded cql3: limit the concurrency of indexed statements
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096.
The number can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.

Fixes #8799

Tests: unit(release),
       manual: observing metrics of modified index_paging_test
2021-06-07 15:56:15 +02:00
Benny Halevy
5f31beaf97 flat_mutation_reader: unify reader_consumer declarations
Put the reader_consumer declaration in flat_mutation_reader.hh
and include it instead of declaring the same `using reader_consumer`
declaration in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210607075020.31671-1-bhalevy@scylladb.com>
2021-06-07 16:11:18 +03:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
0048c404d2 Merge 'dht: token: make some cosmetic changes' from Michał Chojnowski
This is a set of a few cosmetic changes in dht/token. Mostly some comments and a simplification of `midpoint()`.

Closes #8803

* github.com:scylladb/scylla:
  dht: token: add a comment excusing the `const bytes&` constructor
  dht: token: simplify midpoint()
  dht: token: add a comment to normalize()
  dht: token: use {read,write}_unaligned instead of std::copy_n
  dht: token-sharding: fix a typo in a comment
2021-06-07 15:41:15 +03:00
Piotr Sarna
fa29b79c20 transport: close connections when too large requests arrive
Too large requests are currently handled by the CQL server by
skipping them and sending back an error response.
That's however wasteful and dangerous: bogus request sizes
will force Scylla to potentially skip gigabytes of data
- and skipping is done by simply reading from the socket,
so it may results in gigabytes of bandwidth wasted.
Even if the request size is not bogus, closing the connection
forces users to adjust their request sizes, which should be done
anyway.

Originally, there was a bug in handling too large requests which
only read their headers and then left the connection in a broken,
undefined state, trying to interpret the rest of the large request
as a next CQL header. It was later fixed to skip the request, but
closing the connection is a safer thing to do.

Fixes #8798

Closes #8800
2021-06-07 12:23:55 +03:00
Avi Kivity
e6c5a63581 Merge "Fix several issues on transport stop" from Pavel E
"
There's a bunch of issues with starting and stopping of cql_server with
the help of cql_controller.

fixes: #8796
tests: manual(start + stop,
              start + exception on cql_set_state()
	     )
       unit not run, they don't mess with transport controller
"

* 'br-transport-stop-fixes' of https://github.com/xemul/scylla:
  transport/controller: Stop server on state change failure too
  transport/controller: Rollback server start on state change failure too
  transport/controller: Do not leave _server uninitialized
  transport/controller: Rework try-catch into defers
2021-06-07 11:41:36 +03:00
Alejo Sanchez
bd168d57ff raft: fix vote reply handling in prevote
Do not register a reply to prevote as a real vote

Found and authored by @kostja.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210604122530.1975388-1-alejo.sanchez@scylladb.com>
2021-06-06 19:18:49 +03:00
Tomasz Grabiec
50d64646cd Merge "raft: replication test fixes and OOP refactor" from Alejo
Feature requests, fixes, and OOP refactor of replication_test.

Note: all known bugs and hangs are now fixed.

A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided

To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.

* alejo/raft-tests-replication-02-v3-30: (66 commits)
  raft: replication test: wait for log for both index and term
  raft: replication test: reset network at construction
  raft: replication test: use lambda visitor for updates
  raft: replication test: move structs into class
  raft: replication test: move data structures to cluster class
  raft: replication test: remove shared pointers
  raft: replication test: move get_states() to raft_cluster
  raft: replication test: test_server inside raft_cluster
  raft: replication test: rpc declarative tests
  raft: replication test: add wait_log
  raft: replication test: add stop and reset server
  raft: replication test: disconnect 2 support
  raft: replication test: explicit node_id naming
  raft: replication test: move definitions up
  raft: replication test: no append entries support
  raft: replication test: fix helper parameter
  raft: replication test: stop servers out of config
  raft: replication test: wait log when removing leader from configuration
  raft: replication test: only manipulate servers in configuration
  raft: replication test: only cancel rearm ticker for removed server
  ...
2021-06-06 19:18:49 +03:00
Piotr Sarna
cb17aa1e53 Merge 'test/alternator: rewrite run script to share code with cql-pytest's run script' from Nadav Har'El
In this small series, I rewrite test/alternator/run to Python using the utility
functions developed for test/cql-pytest. In the future, we should do the same to
test/redis/run and test/scylla-gdb/run.

The benefit of this rewrite is less code duplication (all run scripts start with
the same duplicate code to deal with temporary directories, to run Scylla IP
addresses, etc.), but most importantly - in the future fixes we do to cql-pytest
(e.g., parameters needed to start Scylla efficiently, how to shut down Scylla,
etc.) will appear automatically in alternator test without needing to remember
to change both.

Another benefit is that test/alternator/run will now be Python, not a shell
script. This should make it easier to integrate it into test.py (refs #6212) in
the future - if we want to.

Closes #8792

* github.com:scylladb/scylla:
  test/alternator: rewrite test/alternator/run script in Python
  test/cql-pytest: make test run code more general
2021-06-06 19:18:49 +03:00
Nadav Har'El
fe1fa9d72b docs: update Alternator's compatibility.md
In the last year, four new features were added to DynamoDB which we
don't yet support - Kinesis Streams, PartiQL, Contributor Insights and
Export to S3. Let's document them as missing Alternator features, and
point to the four newly-created issues about these features.

Refs #8786
Refs #8787
Refs #8788
Refs #8789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210603125825.1179171-1-nyh@scylladb.com>
2021-06-06 19:18:49 +03:00
Avi Kivity
872cd8f692 test: adjust copyright statement to use ScyllaDB rather than old name 2021-06-06 19:18:49 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Avi Kivity
b3ce1c8b40 gdb: prepare for Seastar's "smp: allow having multiple instances of the smp class"
scylladb/seastar@e6463df8a0 ("smp: allow
having multiple instances of the smp class") changes the type of
seastar::smp::_qs from a unique_ptr to a regular pointer. Adjust for
that change, with a fallback to support older versions.

Closes #8784
2021-06-06 19:18:49 +03:00
Nadav Har'El
48ff641f67 Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.

This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.

In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.

Fixes #8577

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8578

* github.com:scylladb/scylla:
  commitlog: segment_manager::shutdown: abort on errors
  commitlog: allocate_segment_ex: make_checked_file
2021-06-06 19:18:49 +03:00
Avi Kivity
8a4abe9895 cql3: expression: don't copy expression in has_supporting_index()
std::bind() copies the bound parameters for safekeeping. Here this
includes expr, which can be quite heavyweight. Use std::ref() to
prevent copying. This is safe since the bound expression is executed
and discarded before has_supporting_index() returns.

Closes #8791
2021-06-06 19:18:49 +03:00
Nadav Har'El
cee0340c89 scripts/pull_github_pr.sh: do not hard-code project name
The current pull_github_pr.sh hard-codes the project name "scylladb/scylla".
Let's determine it automaticaly, from the git origin url.

This will allow using exactly the same script in other Scylla subprojects,
e.g., Seastar.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318142624.1794419-1-nyh@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
2187a59089 treewide: move service::cas_request out from storage_proxy.hh
And remove all remaining inclusions of `storage_proxy.hh` in the
headers.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
142d3b5ad9 cdc: self-sufficient headers fixup
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Gleb Natapov
bb822c92ab raft: change raft::rpc api to return void for most sending functions
Most RAFT packets are sent very rarely during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send function does not serve any purpose. Change the raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes since
replication protocol is more efficient if there is no packet loss, so
it is better to pause a sender than dropping packets inside the rpc. Rpc
is still allowed to drop append_requests if overloaded.
2021-06-06 19:18:49 +03:00
Gleb Natapov
f5a54d6c05 raft: move ELECTION_TIMEOUT definition to a public header
Move ELECTION_TIMEOUT definition to be visible to outside modules.
2021-06-06 19:18:49 +03:00
Gleb Natapov
87844c0ce1 raft: remove unused clock type definition
RAFT uses logical clock now and this define is from older times.
2021-06-06 19:18:49 +03:00
Gleb Natapov
90ea71da54 raft: wait for io and applier fiber to stop before before aborting snapshots and waiters
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers,
2021-06-06 19:18:49 +03:00
Yaron Kaikov
6a447db8a8 scylla_util.py: Fix Azure support for machine-image
In https://github.com/scylladb/scylla/pull/7807 we added support for
Azure instance in Scylla.

The following changes are required in order machine-image to work:
1) fix wrong metadata URL and updating metadata path values (was
   intreduce in
f627fcbb0c)
2) fix function naming which been used my machine image
3) add missing function which are reuqired by mahcine-image
4) cleanup unused functions

Closes #8596
2021-06-06 09:21:23 +03:00
Asias He
2a7b855255 repair: Init repair metrics during startup
The _node_ops_metrics is thread local, it is constructed when it
is first accessed.

If there are no node operations, the metrics will not be shown. To make the
metrics more consistent, init during startup.

Refs #8311

Closes #8780
2021-06-06 09:21:23 +03:00
Benny Halevy
3f9bad0f0a test: compound_test: use tests::random
For reproducibility.

Test: compound_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-2-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
40e032ff8b test: compound_test: use to seastar test framework
Prepare for using tests::random instead of std::rand
for reproducibility.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Asias He
9b902fad79 gossiper: Update timestamp for nodes in ack and ack2 msg handler
In commit 425e3b1182 (gossip: Introduce
direct failure detector), the call to notify_failure_detector inside ack
and ack2 msg handler was removed since there is no need to update the
old failure detector anymore. However, the timestamp for endpoit_state
is also updated inside notify_failure_detector. With the new failure
detector we still need the timestamp for endpoit_state. Otherwise, nodes
might be removed from gossip wrongly.

For example, as we saw in issue #8702:

INFO  2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2
has been silent for 5000ms, removing from gossip

To fix, update the timestamp as we do before in ack and ack2 msg
handler.

Fixes #8702

Closes #8777
2021-06-06 09:21:23 +03:00
Benny Halevy
f081e651b3 memtable_list: rename request_flush to just flush
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
4f20cd3bea memtable_list: rename seal_active_memtable_immediate to seal_active_memtable
Now that there's no more seal_active_memtable_delayed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
ba65b90b34 memtable_list: get rid of seal_active_memtable_delayed
This path is unused since e5be3352cf.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Calle Wilund
3b55ef36d1 cf_prop_defs: Fix extensions merge to handle removal
Fixes #8773

When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).

This causes certain enterprise extensions to not be able
to disable themselves.

Fixed by filtering existing extensions by property keywords.
Unit test added.

Closes #8774
2021-06-06 09:21:23 +03:00
Nadav Har'El
f22ed3ff5c test/alternator: reduce very high timeout in one tracing test
In test_tracing.py::test_slow_query_log, the was what looked like
an innocent 30-second timeout, but this was in fact a 8 minute
timeout - because it started with sleeping 1 second, then 2 seconds,
then 3, ... until 30 seconds. Such a high timeout is frustrating when
trying to debug failures in the test - which is only expected to take
2 seconds (and all of it because of an artificial timeout).

So fix the loop to stop iterating after 60 seconds (a compromise
between 30 seconds and 8 minutes...), sleeping a constant amount
between iterations.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>
2021-06-06 09:21:23 +03:00
Avi Kivity
100d6f4094 build: enable -Wunused-function
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.

Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.

Closes #8767
2021-06-06 09:21:23 +03:00
Benny Halevy
6ce826206a sstables: use vector empty method rather than size
Testing std::vector::empty() is slightly more efficient
than testing for `size() > 0`.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601115552.155148-2-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
0565ba31a1 compaction_info: is_stop_requested: use sstring::empty rather than size
`!empty()` is slightly more efficient than `size() > 0`.

While at it, mark the function noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601115552.155148-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
948a9da832 table: do_apply: verify that _async_gate is open
Applying changes to the memtable after table::stop
is prohibited. Verify that by making sure that
the _async_gate is still open in `do_apply`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601055042.41380-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
82a263f672 database: apply_in_memory: run_when_memory_available under table::run_async
Make sure to apply the mutation under the table's _async_gate.

Fixes #8790

Test: unit(dev), view_build_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8794
2021-06-06 09:21:23 +03:00
Calle Wilund
131da30856 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

Closes #8766
2021-06-06 09:21:23 +03:00
Pavel Emelyanov
0944d69475 repair, streaming: Generalize consumer lambdas
Both streaming and repair call the distributed sstables writing with
equal lambdas each being ~30 lines of code. The only difference between
them is repair might request offstrategy compaction for new sstable.

Generalization of these two pieces save lines of codes and speeds the
release/repair/row_level.o compilation by half a minute (out of twelve).

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210531133113.23003-1-xemul@scylladb.com>
2021-06-06 09:21:23 +03:00
Lubos Kosco
777771df34 scylla_util.py: Relax GCE setup NVMe device checks
We don't want to fail I/O setup if there are more than one NVMe devices
mounted as root nor if there are no NVMe devices.

Fixes #8032

Closes #8444
2021-06-06 09:21:23 +03:00
Botond Dénes
b0056f88dc test.py: revamp coverage support
Instead of attempting to universally set the proper environment
necessary for tests to generate profiling data such that coverage.py can
process it, allow each Test subclass to set up the environment as needed
by the specific Test variant.
With this we now have support for all current test types, including cql,
cql-pytest and alternator tests.
2021-06-06 09:21:23 +03:00
Botond Dénes
438391b4cc scripts/coverage.py: check that --path is a directory
To detect a bad --path that would fail coverage generation early.
2021-06-06 09:21:23 +03:00
Botond Dénes
ca91fd0e34 scripts/coverage.py: update main()'s docstring with new --run modifiers
And fix a typo while there.
2021-06-06 09:21:23 +03:00
Botond Dénes
2ba3fc2e11 scripts/coverage.py: add --distinct-id parameter
Yet another modifier for `--run`, allowing running the same executable
multiple times and then generating a coverage report across all runs.
This will also be used by test.py for those test suites (cql test) which
run the same executable multiple times, with different inputs.
2021-06-06 09:21:23 +03:00