Commit Graph

23790 Commits

Tomasz Grabiec
864b2c5736 CMakeLists.txt: Add raft directory to source code directories
Needed for IDE integration. Not used for building currently.
Message-Id: <1601570008-19666-1-git-send-email-tgrabiec@scylladb.com>
2020-10-01 19:38:39 +03:00
Gleb Natapov
3e8dbb3c09 lwt: do not return unavailable exception from the 'learn' stage
An unavailable exception means that the operation was not started and can be
retried safely. If LWT fails in the learn stage, however, it most
likely means that its effect will already be observable. The patch
returns a timeout exception instead, which signals uncertainty.

Fixes #7258

Message-Id: <20201001130724.GA2283830@scylladb.com>
2020-10-01 17:16:52 +02:00
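The retriability rule described above can be sketched in isolation (hypothetical types, not Scylla's actual exception hierarchy):

```cpp
#include <stdexcept>

enum class lwt_stage { prepare, propose, learn };

struct unavailable_error : std::runtime_error {
    unavailable_error() : std::runtime_error("unavailable: safe to retry") {}
};
struct timeout_error : std::runtime_error {
    timeout_error() : std::runtime_error("timeout: outcome uncertain") {}
};

// Before the learn stage a failure is safely retriable; once the learn
// stage has started, the write may already be visible, so only an
// uncertain error (timeout) may be reported.
[[noreturn]] void fail(lwt_stage stage) {
    if (stage == lwt_stage::learn) {
        throw timeout_error{};      // effect may already be observable
    }
    throw unavailable_error{};      // not started: safe to retry
}
```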
Tomasz Grabiec
ca7f0c61f0 Merge "raft: initial implementation" from Gleb
This is the beginning of the raft protocol implementation. It only supports
log replication and the voter state machine. The main difference between
this one and the RFC (besides having the voter state machine) is that the
approach taken here is to implement raft as a deterministic state
machine and move all the IO processing away from the main logic.
To do that, some changes to the RPC interface were required: all verbs are now
one-way, meaning that sending a request does not wait for a reply and
the reply arrives as a separate message (or not at all; it is safe to
drop packets).

* scylla-dev/raft-v4:
  raft: add a short readme file
  raft: compile raft tests
  raft: add raft tests
  raft: Implement log replication and leader election
  raft: Introduce raft interface header
2020-10-01 17:09:52 +02:00
Konstantin Osipov
9a5f2b87dc raft: add a short readme file
The file has a brief description of the code status, usage and some
implementation assumptions.
2020-10-01 14:30:59 +03:00
Gleb Natapov
16cb009ea2 raft: compile raft tests
Compilation is not enabled by default as it requires coroutine support
and may require a special compiler (until the distribution's compiler
fixes all the bugs related to coroutines). To enable raft test
compilation, a new configure.py option is added (--build-raft).
2020-10-01 14:30:59 +03:00
Gleb Natapov
4959609589 raft: add raft tests
Add tests for currently implemented raft features. replication_test
tests replication functionality with various initial log configurations.
raft_fsm_test tests the voting state machine functionality.
2020-10-01 14:30:59 +03:00
Gleb Natapov
e1ac1a61c9 raft: Implement log replication and leader election
This patch introduces a partial Raft implementation. It only has log
replication and leader election support. Snapshotting and configuration
changes, along with other, smaller features, are not yet implemented.

The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, a time tick or a new append message), changes its
state and produces an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM are handled
by the raft::server class. It uses the raft::rpc interface to send and
receive messages and the raft::storage interface to implement persistence.
2020-10-01 14:30:59 +03:00
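The deterministic-FSM design described above can be sketched in isolation. All names below are hypothetical stand-ins, not the actual raft::fsm API: the state machine only consumes inputs and accumulates an output batch, and a server loop performs all IO (persist, send, apply) outside of it.

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct message { int from; std::string payload; };
struct tick {};
struct append { std::string entry; };
using input = std::variant<message, tick, append>;

struct output {
    std::vector<std::string> to_persist;  // state that must hit storage first
    std::vector<message> to_send;         // messages for the rpc layer
    std::vector<std::string> to_apply;    // committed entries for the state machine
};

class fsm {
    output _pending;
public:
    // Pure state transition: no IO here, only bookkeeping.
    void step(const input& in) {
        if (auto* a = std::get_if<append>(&in)) {
            _pending.to_persist.push_back(a->entry);
            _pending.to_apply.push_back(a->entry);
        }
    }
    // Drained by the server loop, which then does the actual IO.
    output poll() { return std::exchange(_pending, {}); }
};
```

Because the FSM never blocks or touches the network, it can be driven step-by-step in deterministic tests.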
Gleb Natapov
c073997431 raft: Introduce raft interface header
This commit introduces the public raft interfaces. raft::server represents a
single raft server instance. raft::state_machine represents a
user-defined state machine. raft::rpc, raft::rpc_client and raft::storage
are used to allow implementing custom networking and storage layers.

A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
2020-10-01 14:30:59 +03:00
Piotr Dulikowski
bfbf02a657 transport/config: fix cross-shard use of updateable_value
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.

The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.

This fix can be reverted if updateable_value becomes safe to use across
shards.

Tests: unit(dev)

Fixes: #7310
2020-10-01 14:10:56 +03:00
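The factory-per-shard pattern this fix describes might look roughly like this (simplified, hypothetical types; the real updateable_value is Scylla's own): instead of storing one cross-shard value in the shared config, the config stores a function that each shard calls locally, so the value it observes is backed by a source on its own shard.

```cpp
#include <cstdint>
#include <functional>

// Toy stand-in: the real updateable_value tracks a live-reloadable source.
template <typename T>
struct updateable_value {
    T current;
    T operator()() const { return current; }
};

struct cql_server_config {
    // Called on the shard constructing the server, rather than copying a
    // value that points at shard 0's updateable_value_source.
    std::function<updateable_value<uint32_t>()> get_max_concurrent_requests;
};
```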
Etienne Adam
98dc0dc03a redis: only create required keyspaces/tables
The 'redis_database_count' option already existed, but
was not used when initializing the keyspaces. This
patch merely uses it. I think it's better that way; it
seems cleaner not to create 15 x 5 tables when we
use only one redis database.

Also change a test to use a higher maximum number
of databases.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200930210256.4439-1-etienne.adam@gmail.com>
2020-10-01 10:27:03 +03:00
Wojciech Mitros
e79ad38425 tracing: add username to the session table
In order to improve observability, add a username field to the
system_traces.sessions table. The system table should be changed
while upgrading by running the fix_system_distributed_tables.py
script. Until the table is updated, the old behaviour is preserved.

Fixes #6737.
2020-10-01 04:46:40 +02:00
Nadav Har'El
d73cf589e7 docs: fix typos in docs/alternator/alternator.md
Discovered by running a spell-checker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930101046.76710-1-nyh@scylladb.com>
2020-10-01 04:46:40 +02:00
Nadav Har'El
8db01aeeb4 docs: fix typo in alternator/getting-started.md
Fix a typo reported by a user. Ran spell-checker to verify there are no
other obvious spelling mistakes in that file.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930084304.74776-1-nyh@scylladb.com>
2020-10-01 04:46:40 +02:00
Avi Kivity
701d24a832 Merge 'Enhance max concurrent requests code' from Piotr Sarna
This miniseries enhances the code from #7279 by:
 * adding metrics for shed requests, which will make it possible to pinpoint the problem if the max concurrent requests threshold is too low
 * making the error message more comprehensive by pointing at the variable used to set the max concurrent requests threshold

Example of an enhanced error message:
```
ConnectionException('Failed to initialize new connection to 127.0.0.1: Error from server: code=1001 [Coordinator node overloaded] message="too many in-flight requests (configured via max_concurrent_requests_per_shard): 18"',)})
```

Closes #7299

* github.com:scylladb/scylla:
  transport: make _requests_serving param uint32_t
  transport: make overloaded error message more descriptive
  transport: add requests_shed metrics
2020-10-01 04:46:40 +02:00
Piotr Sarna
876e9fe51a transport: make _requests_serving param uint32_t
It's not realistic for a shard to have over 4 billion concurrent
requests, so this value can be safely represented in 32 bits.
Also, since the current concurrency limit is represented in uint32_t,
it makes sense for these two to have matching types.
2020-09-30 08:20:52 +02:00
Piotr Sarna
d18f68f1c1 transport: make overloaded error message more descriptive
The message now mentions the config variable used to set the limit
of max allowed concurrent requests.
2020-09-30 08:20:51 +02:00
Piotr Sarna
792ff3757a transport: add requests_shed metrics
The counter shows the total number of requests shed due to overload.
2020-09-30 08:20:50 +02:00
Avi Kivity
fd1dd0eac7 Merge "Track the memory consumption of reader buffers" from Botond
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable and varies
wildly. For example, many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.

This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.

This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is, if all count resources are available.

Refs: #4176

Tests: unit(dev, debug, release)
"

* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
  test/manual/sstable_scan_footprint_test: run test body in statement sched group
  test/manual/sstable_scan_footprint_test: move test main code into separate function
  test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
  test/manual/sstable_scan_footprint_test: make clustering row size configurable
  test/manual/sstable_scan_footprint_test: document sstable related command line arguments
  mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
  test: simple_schema: add make_static_row()
  reader_permit: reader_resources: add operator==
  mutation_fragment: memory_usage(): remove unused schema parameter
  mutation_fragment: track memory usage through the reader_permit
  reader_permit: resource_units: add permit() and resources() accessors
  mutation_fragment: add schema and permit
  partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
  mutation_fragment: remove as_mutable_end_of_partition()
  mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
  mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
  mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
  mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
  flat_mutation_reader: make _buffer a tracked buffer
  mutation_reader: extract the two fill_buffer_result into a single one
  ...
2020-09-29 16:08:16 +03:00
Pekka Enberg
8f17ca2d1a scripts/refresh-submodules.sh: Add python3 submodule
Message-Id: <20200928075422.377888-1-penberg@scylladb.com>
2020-09-29 16:06:32 +03:00
Yaron Kaikov
d48df44f26 configure.py: build python3, jmx, tools and unified-tar only in relevant dist-{mode}
Today, whenever we build scylla in a single mode, we still
build jmx, tools and python3 for all of dev, release and debug.
Let's make sure we build them only in the relevant build mode.

Also add unified-tar to the ninja build.

Closes #7260
2020-09-29 15:41:52 +03:00
Juliusz Stasiewicz
0afa738a8f tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is an example of a failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such a case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

The fix is to leave the query empty if not found. Users can still
retrieve the query contents from the existing info.

Fixes #5843

Closes #7293
2020-09-29 13:24:39 +02:00
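The fix amounts to a tolerant lookup (standalone sketch, not the actual trace_keyspace_helper code): fall back to an empty string when the "query" parameter is absent, as with prepared batches that carry "query[0]", "query[1]", ... instead.

```cpp
#include <map>
#include <string>

// Return the "query" parameter if present, otherwise an empty string
// instead of failing the whole slow-query-log record.
std::string query_or_empty(const std::map<std::string, std::string>& params) {
    auto it = params.find("query");
    return it == params.end() ? std::string{} : it->second;
}
```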
Asias He
eedcee7f31 gossip: Reduce unnecessary VIEW_BACKLOG updates
The backlog (current and max) in VIEW_BACKLOG is not updated, but the
nodes are updating VIEW_BACKLOG all the time. For example:

```
INFO  2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO  2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO  2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO  2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO  2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO  2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO  2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO  2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```

The downside of such updates:

 - Introduces more gossip exchange traffic
 - Updates system.peers all the time

The extra unnecessary gossip traffic is fine for a cluster in good
shape, but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busier.

With this patch, VIEW_BACKLOG is updated only when the backlog has really
changed.

Btw, we could even make the update happen only when the change of the
backlog is greater than a threshold, e.g., 5%, which can reduce the
traffic even further.

Fixes #5970
2020-09-29 13:37:37 +03:00
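The "publish only on real change" rule can be sketched as follows (hypothetical types; the suggested percentage threshold would be an additional check in the same place):

```cpp
#include <cstdint>
#include <optional>

struct backlog { std::uint64_t current; std::uint64_t max; };

// Gossip the VIEW_BACKLOG app state only when it actually changed.
bool should_publish(const std::optional<backlog>& last, const backlog& now) {
    if (!last) {
        return true;  // nothing published yet: always send the first value
    }
    return last->current != now.current || last->max != now.max;
}
```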
Avi Kivity
6fdc8f28a9 Update tools/jmx submodule
* tools/jmx 45e4f28...25bcd76 (1):
  > install.sh: stop using symlinks for systemd units on nonroot mode

Fixes #7288.
2020-09-29 13:32:45 +03:00
Takuya ASADA
8504332e17 scylla_setup: skip offline warnings on nonroot mode
Since most of the scripts require root privileges, we don't show the
offline warning in nonroot mode.

Fixes #7286

Closes #7287
2020-09-29 13:30:13 +03:00
Eliran Sinvani
925cdc9ae1 consistency level: fix wrong quorum calculation when RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0; in this situation quorum should
also be 0.
This commit adds the missing corner case.

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296
2020-09-29 13:25:41 +03:00
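The corrected calculation amounts to a one-line rule (standalone sketch, not the actual Scylla function):

```cpp
#include <cstddef>

// Quorum is (rf / 2) + 1 for rf > 0, but 0 when the replication
// factor itself is 0 (the corner case fixed above).
std::size_t quorum_for(std::size_t rf) {
    return rf == 0 ? 0 : rf / 2 + 1;
}
```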
Takuya ASADA
ba29074c42 install.sh: stop using symlinks for systemd units on nonroot mode
On some environments, systemctl enable <service> fails when we use symlinks.
So just copy systemd units directly to ~/.config/systemd/user, instead of
creating symlinks.

Fixes #7288

Closes #7290
2020-09-29 12:20:41 +03:00
Piotr Sarna
9e5ce5a93c counters: remove unused 1.7.4 counter order code
After cleaning up old cluster features (253a7640e3)
the code for special handling of 1.7.4 counter order was effectively
only used in its own tests, so it can be safely removed.

Closes #7289
2020-09-29 12:16:58 +03:00
Avi Kivity
57f377e1fe Merge 'Add max concurrent requests configuration option to coordinator' from Piotr Sarna
This series approaches issue #7072 and provides a very simple mechanism for limiting the number of concurrent CQL requests being served on a shard. Once the limit is hit, new requests will be instantly refused and OverloadedException will be returned to the client.
This mechanism has many improvement opportunities:
 * shedding requests gradually instead of having one hard limit,
 * having more than one limit per different types of queries (reads, writes, schema changes, ...),
 * not using a preconfigured value at all, and instead figuring out the limit dynamically,
 * etc.

... and none of these are taken into account in this series, which only adds a very basic configuration variable. The variable can be updated live without a restart: update the .yaml file and trigger a configuration re-read by sending the SIGHUP signal to Scylla.

The default value for this parameter is a very large number, which translates to effectively not shedding any requests at all.

Refs #7072

Closes #7279

* github.com:scylladb/scylla:
  transport: make max_concurrent_requests_per_shard reloadable
  transport: return exceptional future instead of throwing
  transport,config: add a param for max request concurrency
  exceptions: make a single-param constructor explicit
  exceptions: add a constructor based on custom message
2020-09-29 12:14:03 +03:00
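The hard per-shard limit described in this series can be sketched as a plain counter check. This is a hypothetical, synchronous stand-in for the real asynchronous code path; the error text mirrors the example message quoted in the later merge of #7299.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

struct overloaded_exception : std::runtime_error {
    explicit overloaded_exception(uint32_t limit)
        : std::runtime_error("too many in-flight requests (configured via "
                             "max_concurrent_requests_per_shard): "
                             + std::to_string(limit)) {}
};

struct shard_limiter {
    uint32_t serving = 0;  // in-flight requests on this shard
    uint32_t limit;        // the (reloadable) configured maximum
    void admit() {
        if (serving >= limit) {
            throw overloaded_exception(limit);  // shed instead of queueing
        }
        ++serving;
    }
    void release() { --serving; }
};
```

The default limit is large enough that, in practice, nothing is shed unless the operator lowers it.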
Pekka Enberg
1adf2cc848 Revert "scylla_ntp_setup: use chrony on all distributions"
This reverts commit 8366d2231d because it
causes the following "scylla_setup" failure on Ubuntu 16.04:

  Command: 'sudo /usr/lib/scylla/scylla_setup --nic ens5 --disks /dev/nvme0n1  --swap-directory / '
  Exit code: 1
  Stdout:
  Setting up libtomcrypt0:amd64 (1.17-7ubuntu0.1) ...
  Setting up chrony (2.1.1-1ubuntu0.1) ...
  Creating '_chrony' system user/group for the chronyd daemon…
  Creating config file /etc/chrony/chrony.conf with new version
  Processing triggers for libc-bin (2.23-0ubuntu11.2) ...
  Processing triggers for ureadahead (0.100.0-19.1) ...
  Processing triggers for systemd (229-4ubuntu21.29) ...
  501 Not authorised
  NTP setup failed.
  Stderr:
  chrony.service is not a native service, redirecting to systemd-sysv-install
  Executing /lib/systemd/systemd-sysv-install enable chrony
  Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/scylla_ntp_setup", line 63, in <module>
  run('chronyc makestep')
  File "/opt/scylladb/scripts/scylla_util.py", line 504, in run
  return subprocess.run(cmd, stdout=stdout, stderr=stderr, shell=shell, check=exception, env=scylla_env).returncode
  File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 512, in run
  raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['chronyc', 'makestep']' returned non-zero exit status 1.
2020-09-29 11:23:23 +03:00
Piotr Sarna
4b856cf62d transport: make max_concurrent_requests_per_shard reloadable
This configuration entry is expected to be used as a quick fix
for an overloaded node, so it should be possible to reload this value
without having to restart the server.
2020-09-29 10:11:36 +02:00
Piotr Sarna
4da8957461 transport: return exceptional future instead of throwing
Throwing bears an additional cost, so it's better to simply
construct the error in place and return it.
2020-09-29 10:00:30 +02:00
Piotr Sarna
b4db6d2598 transport,config: add a param for max request concurrency
The newly introduced parameter - max_concurrent_requests_per_shard
- can be used to limit the number of in-flight requests a single
coordinator shard can handle. Each surplus request will be
immediately refused by returning an OverloadedException error to the client.
The default value for this parameter is large enough to never
actually shed any requests.
Currently, the limit is only applied to CQL requests - other frontends
like alternator and redis are not throttled yet.
2020-09-29 09:59:30 +02:00
Botond Dénes
2ee026f26f test/manual/sstable_scan_footprint_test: run test body in statement sched group
So that queries are processed in said scheduling group and thus they use
the user read concurrency semaphore.
2020-09-28 11:27:49 +03:00
Botond Dénes
272a54b81c test/manual/sstable_scan_footprint_test: move test main code into separate function 2020-09-28 11:27:49 +03:00
Botond Dénes
29861b068e test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
To avoid stalls.
2020-09-28 11:27:49 +03:00
Botond Dénes
daa9fa72f1 test/manual/sstable_scan_footprint_test: make clustering row size configurable
So that large-row workloads can be simulated too.
2020-09-28 11:27:49 +03:00
Botond Dénes
2ff326a41a test/manual/sstable_scan_footprint_test: document sstable related command line arguments 2020-09-28 11:27:49 +03:00
Botond Dénes
ceb308411c mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*() 2020-09-28 11:27:49 +03:00
Botond Dénes
ceb0b02ee8 test: simple_schema: add make_static_row() 2020-09-28 11:27:49 +03:00
Botond Dénes
63578bf0a7 reader_permit: reader_resources: add operator== 2020-09-28 11:27:49 +03:00
Botond Dénes
256140a033 mutation_fragment: memory_usage(): remove unused schema parameter
The memory usage is now maintained and updated on each change to the
mutation fragment, so it need not be recalculated on a call to
`memory_usage()`; hence the schema parameter is unused and can be
removed.
2020-09-28 11:27:47 +03:00
Botond Dénes
041d71bd6f mutation_fragment: track memory usage through the reader_permit
The memory usage of mutation fragments is now tracked throughout their
lifetime through a reader permit. This was the last major (to my current
knowledge) untracked piece of the reader pipeline.
2020-09-28 11:27:29 +03:00
Botond Dénes
52662f17ea reader_permit: resource_units: add permit() and resources() accessors 2020-09-28 11:27:29 +03:00
Botond Dénes
6ca0464af5 mutation_fragment: add schema and permit
We want to start tracking the memory consumption of mutation fragments.
For this we need the schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed to
the permit.

In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
2020-09-28 11:27:23 +03:00
Botond Dénes
54357221f0 partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
It is what its callers want anyway.
2020-09-28 10:53:56 +03:00
Botond Dénes
1e6285d776 mutation_fragment: remove as_mutable_end_of_partition()
There is nothing to mutate on a partition_end fragment.
2020-09-28 10:53:56 +03:00
Botond Dénes
5079b9ccf1 mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
We will soon want to update the memory consumption of a mutation fragment
after each modification done to it. To do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_partition_start() &&`.
2020-09-28 10:53:56 +03:00
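The mutate_as_*() pattern described in these rename commits can be sketched with hypothetical simplified types: callers can no longer touch the underlying data directly; they pass a lambda, and the wrapper re-reports memory usage after every modification.

```cpp
#include <cstddef>
#include <string>

struct clustering_row {
    std::string data;
    std::size_t memory_usage() const { return data.size(); }
};

class fragment {
    clustering_row _row;
    std::size_t _tracked = 0;  // stand-in for the reader_permit accounting
public:
    template <typename Func>
    void mutate_as_clustering_row(Func&& f) {
        f(_row);                          // caller's modification, via lambda
        _tracked = _row.memory_usage();   // accounting refreshed afterwards
    }
    std::size_t tracked() const { return _tracked; }
};
```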
Botond Dénes
72a88e0257 mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
We will soon want to update the memory consumption of a mutation fragment
after each modification done to it. To do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_range_tombstone() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
4f5ccf82cb mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
We will soon want to update the memory consumption of a mutation fragment
after each modification done to it. To do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
f2b9cad4c6 mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
We will soon want to update the memory consumption of a mutation fragment
after each modification done to it. To do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_static_row() &&`.
2020-09-28 10:53:56 +03:00