Commit Graph

25526 Commits

Botond Dénes
581edc4e4e reader_concurrency_semaphore: make inactive_read_handle a weak reference
Having the handle keep an owning reference to the inactive read led to
awkward situations, where the inactive read was destroyed during eviction
in certain situations only (querier cache) and not in others.
Although the users didn't notice anything from this, it led to very
brittle code inside the reader concurrency semaphore. Among other things,
the inactive read destructor had to be open-coded in evict(), which
already led to mistakes.
This patch goes back to the weak-pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore, but the handle now keeps a weak
pointer to them. When destroyed, the handle will destroy the inactive
read if it is still alive. When evicting the inactive read, the semaphore
will set the pointer in the handle to null.
2021-03-18 14:57:57 +02:00
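The ownership scheme this commit describes can be sketched in a few lines of C++ (a hypothetical simplification, with raw pointers standing in for the intrusive-list machinery; only `inactive_read` and `evict()` are names taken from the commit):

```cpp
#include <cassert>

int live_reads = 0;  // instrumentation for this sketch only

struct inactive_read;

struct inactive_read_handle {
    inactive_read* read = nullptr;  // weak: nulled when the read is evicted
    ~inactive_read_handle();
};

struct inactive_read {
    inactive_read_handle* handle = nullptr;  // back-pointer into the handle
    inactive_read() { ++live_reads; }
    ~inactive_read() { --live_reads; }
};

// Destroying the handle destroys the inactive read if it is still alive.
inactive_read_handle::~inactive_read_handle() {
    if (read) {
        read->handle = nullptr;
        delete read;
    }
}

// Evicting destroys the read and nulls the weak pointer in the handle, so
// the handle's destructor no longer needs any open-coded cleanup.
void evict(inactive_read* r) noexcept {
    if (r->handle) {
        r->handle->read = nullptr;
    }
    delete r;
}
```

Both paths leave no dangling pointer: whichever side dies first clears the other side's pointer.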
Botond Dénes
cbc83b8b1b reader_concurrency_semaphore: make evict() noexcept
In the next patch it will be called from a destructor.
2021-03-18 14:57:57 +02:00
Botond Dénes
2d348e0211 reader_concurrency_semaphore: update out-of-date comments 2021-03-18 14:57:57 +02:00
Gleb Natapov
32d386d0d8 raft: fix use after free during logging in append_entries_reply()
As the existing comment explains, a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.

Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
2021-03-17 09:59:22 +02:00
Dejan Mircevski
8db24fc03b cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker was bound to null. Throw
invalid_request_exception instead.

Fixes #8265

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8287
2021-03-17 09:59:22 +02:00
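The fix can be sketched as follows (`invalid_request_exception` matches the exception named above; representing the null bind with `std::optional` is an illustrative stand-in, not the actual cql3 value type):

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

struct invalid_request_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A bound IN list that arrived as null is modeled as a disengaged optional.
// Reject it with an invalid-request error instead of dereferencing it.
std::vector<int> in_values(const std::optional<std::vector<int>>& bound) {
    if (!bound) {
        throw invalid_request_exception("Invalid null value for IN restriction");
    }
    return *bound;
}
```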
Avi Kivity
1afd6fbe06 hashing: appending_hash: convert from enable_if to concepts
A little simpler to understand.

Closes #8288
2021-03-17 09:59:22 +02:00
Piotr Sarna
7961a28835 Merge 'storage_proxy: Include counter writes in...
...  `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz

With this change, the coordinator prefers itself as the "counter leader", so if
another endpoint is chosen as the leader, we know that the coordinator was not a
member of the replica set. With this guarantee we can increment the
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing a different leader (that metric used to neglect counter
updates).

The motivation for this change is to have a more reliable way of counting
non-token-aware queries.

Fixes #4337
Closes #8282

* github.com:scylladb/scylla:
  storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
  counters: Favor coordinator as leader
2021-03-17 09:59:22 +02:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

If we need to pre-fill the segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Dejan Mircevski
992d5c6184 cql3/expr: Improve column printing
Before this change, we would print an expression like this:

((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)

Now, we print the same expression like this:

(c = 0000007b)

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8285
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
40121621f6 Merge "Kill some get_local_migration_manager() calls" from Pavel Emelyanov
There are a bunch of such calls in schema-altering statements, and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework is needed.

The solution in this set is -- all statements' execute() methods are
called with the query processor as the first argument (currently the
storage proxy is there); the query processor references and provides the
migration manager for statements. Those statements that need the proxy
can get it from the query processor.

Afterwards, table_helper and thrift code can also stop using the global
migration manager instance, since they both have the query processor in
the needed places. While patching them, a couple of calls to the global
storage proxy also go away.

The new query processor -> migration manager dependency fits into the
current start-stop sequence: the migration manager is started early, and
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager(), it will _not_
call query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.

Another option would be to make the storage proxy reference the migration
manager, but this dependency doesn't look correct -- the migration manager
is a higher-level service than the storage proxy; it is the migration
manager that currently calls the storage proxy, not vice versa.

* xemul/br-kill-some-migration-managers-2:
  cql3: Get database directly from query processor
  thrift: Use query_processor::get_migration_manager()
  table_helper: Use query_processor::get_migration_manager()
  cql3: Use query_processor::get_migration_manager() (lambda captures cases)
  cql3: Use query_processor::get_migration_manager() (alter_type statement)
  cql3: Use query_processor::get_migration_manager() (trivial cases)
  query_processor: Keep migration manager onboard
  cql3: Pass query processor to announce_migration:s
  cql3: Switch to qp (almost) in schema-altering-stmt
  cql3: Change execute()'s 1st arg to query_processor
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
2065e2c912 partitioned_sstable_set: adjust select_sstable_runs() to work with compound set
The compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
Without this adjustment, selection of runs would fail because the
function would try to unconditionally retrieve a run which may
live somewhere else.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
02b2df1ea9 sstable_set: move select_sstable_runs() into partitioned_sstable_set
After the compound set is introduced, select_sstable_runs() will no longer
work, because the sstable runs live in the sstable_set, but they should
actually live in the sstable_set being written to.

Given that runs are a concept that belongs only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() there.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Avi Kivity
11308c05f4 Update tools/jmx submodule
* tools/jmx 15c1d4f...9c687b5 (1):
  > dist/redhat: add support SLES
2021-03-17 09:59:22 +02:00
Calle Wilund
a0745f9498 messaging_service: Enforce dc/rack membership iff required for non-tls connections
When internode_encryption is "rack" or "dc", we should enforce that incoming
connections are from the appropriate address spaces iff they arrive on a
non-tls socket.

This is implemented by having two protocol handlers: one for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask the
snitch whether the remote address is kosher, and refuse the connection otherwise.

Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"

Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with a dc/rack tls setting, we should warn them
that someone could potentially spoof their way past the
authentication.

Closes #8051
2021-03-17 09:59:22 +02:00
Avi Kivity
bcd41cb32d Merge 'Support installing our rpm to SLES' from Takuya ASADA
Basically, SLES support was already done in f20736d93d, but only for the offline installer.
This fixes a few more problems with installing our rpm on SLES.
After this change, we can install our rpm on both CentOS/RHEL and SLES from a single image, like the unified deb.
SLES uses its own package manager, 'zypper', but it does support yum repositories, so no change is required for the repo.

Closes #8277

* github.com:scylladb/scylla:
  scylla_coredump_setup: support SLES
  scylla_setup: use rpm to check package availability for SLES
  dist: install optional packages for SLES
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
cc0bb92afe Merge "raft: provide a ticker for each raft server" from Pavel Solodovnikov
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.

The patch set also changes several other things in order
for tickers to work:

1. Fix a bug in `raft_sys_table_storage` which caused an exception
   if `raft::server::start` was called without any persisted state.
2. `raft_services::add_server` now automatically calls
   `raft::server::start()` since a server instance should be started
   before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
   artificial restriction which is now lifted.
4. The Raft schema state machine now returns a ready future instead of
   throwing a "not implemented" exception in `abort()`.

* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
  raft/raft_services: provide a ticker for each raft server
  raft/raft_services: switch from plain `throw` to `on_internal_error`
  raft/raft_services: start server instance automatically in `add_server`
  raft: return ready future instead of throwing in schema_raft_state_machine
  raft: allow raft server to start with initial term 0
  raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
2021-03-17 09:59:22 +02:00
Nadav Har'El
e344f74858 Merge 'logalloc: improve background reclaim shares management' from Avi Kivity
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.

Fixes #8234.

Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)

Closes #8281

* github.com:scylladb/scylla:
  test: logalloc_test: harden background reclaim test against cpu overcommit
  logalloc: background reclaim: use default scheduling group for adjusting shares
  logalloc: background reclaim: log shares adjustment under trace level
  logalloc: background reclaim: fix shares not updated by periodic timer
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
aaea8c6c7d raft/raft_services: provide a ticker for each raft server
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.

Note that the timer should start only after the server's `start()`
method has been called, because otherwise it would crash,
since the fsm is not initialized yet.

Currently, the tick interval is hardcoded to be 100ms.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
1496a3559f raft/raft_services: switch from plain throw to on_internal_error
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
975c9a8021 raft/raft_services: start server instance automatically in add_server
A Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. the raft protocol state machine.
Otherwise, it will likely result in a crash.

Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.

In case some exception happens inside `add_server`,
the `init` function will de-initialize what it has already
initialized, i.e. the raft rpc verbs. This is important
since otherwise it would break the further initialization
process and, more importantly, would prevent raft
rpc verb deinitialization. That would cause a crash in
the `messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
0b3dba07bd raft: return ready future instead of throwing in schema_raft_state_machine
The current implementation throws an exception, which will cause
a crash when stopping scylla. The new behavior will be relied on in the
next patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
93c565a1bf raft: allow raft server to start with initial term 0
Prior to the fix, there was an assert in
`raft::server_impl::start` checking that the initial term is not 0.

This restriction is completely artificial and can be lifted
without any problems, as described below.

The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).

This particular check is based on the unobvious fact that the
term will never be 0 when `fsm::get_output` saves the
term and vote values, indicating that they need to be
persisted.

Vote and term can change independently of each other, so
checking only the term obscures even more what is
happening and why.

In either case term will never be 0, because:

1. If the term has changed, then it's naturally greater than 0,
   since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
   a vote request message. In such case we have already updated
   our term to the requester's term.

Switch to using an explicit optional in `fsm_output`, so that
a reader doesn't have to think about the motivation behind this `if`
and can just check that the `term_and_vote` optional is engaged.

Given the motivation described above, the corresponding

    assert(_fsm->get_current_term() != term_t(0));

in `server_impl::start` is removed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
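The sentinel-to-optional switch described above can be sketched like this (struct shapes are simplified; only `fsm_output` and `term_and_vote` are names from the commit):

```cpp
#include <cstdint>
#include <optional>
#include <utility>

using term_t = std::uint64_t;
struct server_id { std::uint64_t id; };

struct fsm_output {
    // Engaged exactly when term/vote changed and must be persisted.
    // Term 0 no longer has to serve as a magic "unset" value.
    std::optional<std::pair<term_t, server_id>> term_and_vote;
};

// io_fiber's check becomes a plain "is it engaged?" question instead of
// the unobvious `term != 0` condition.
bool needs_term_vote_persist(const fsm_output& out) {
    return out.term_and_vote.has_value();
}
```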
Pavel Solodovnikov
ae5f26adec raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fails with an
exception.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Juliusz Stasiewicz
f77d0f5439 storage_proxy: Include counter writes in writes_coordinator_outside_replica_set
The coordinator prefers itself as the "counter leader", so if another
endpoint is chosen as the leader, we know that the coordinator was
not a member of the replica set. We can use this information to
increment the relevant metric (which used to neglect counters
completely).

Fixes #4337
2021-03-16 12:07:16 +01:00
Juliusz Stasiewicz
5689106b92 counters: Favor coordinator as leader
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
2021-03-16 12:07:13 +01:00
Pavel Emelyanov
12e4269dce cql3: Get database directly from query processor
After the previous patches, some places in cql3 code take a
long path to get a database reference:

  query processor -> storage proxy -> database

The query processor can provide the database reference
by itself, so take this chance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:36:04 +03:00
Pavel Emelyanov
fb49550943 thrift: Use query_processor::get_migration_manager()
Thrift needs the migration manager to call announce_<something> on
it, and currently it grabs the global migration manager instance.

Since the thrift handler has a query processor reference onboard and
the query processor can provide the migration manager reference,
it's time to remove a few more globals from thrift code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:59 +03:00
Pavel Emelyanov
6dc9a16b4e table_helper: Use query_processor::get_migration_manager()
Now that the migration manager can be obtained from the query
processor, the table helper can also benefit from it and stop
calling for the global migration manager instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:53 +03:00
Pavel Emelyanov
a9646dd779 cql3: Use query_processor::get_migration_manager() (lambda captures cases)
There are a few schema-altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they are all continuations of make_ready_future<>()s, so the
query processor can simply be captured by reference and used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:48 +03:00
Pavel Emelyanov
50e4eacd08 cql3: Use query_processor::get_migration_manager() (alter_type statement)
This statement needs the query processor one step down the
stack from its .announce_migration method, so here's a
dedicated patch for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:43 +03:00
Pavel Emelyanov
464e58abf7 cql3: Use query_processor::get_migration_manager() (trivial cases)
Most of the schema-altering statement implementations can now
stop calling for the global migration manager instance and get it
from the query processor.

Here are the trivial cases, where the query processor is just
available at the place where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:36 +03:00
Pavel Emelyanov
1de235f4da query_processor: Keep migration manager onboard
The query processor sits above the migration manager
in the services layering; it's started after and (will be)
stopped before the migration manager.

The migration manager is needed in the schema-altering statements,
which are called with a query processor argument. They will
later get the migration manager from the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:58 +03:00
Pavel Emelyanov
1e8f0963f9 cql3: Pass query processor to announce_migration:s
Now that the only call to .announce_migration has the
query processor at hand -- pass it to the real statements.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
470928dd94 cql3: Switch to qp (almost) in schema-altering-stmt
The schema-altering statements all inherit from the same
base class, which declares a pure virtual .announce_migration()
method. All the real statements are called with a storage proxy
argument, while they need the migration manager. So, like in the
previous patch -- replace the storage proxy with the query processor.

While doing the replacement, also get the database instance from
the query processor, not from the proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
26c115f379 cql3: Change execute()'s 1st arg to query_processor
Currently the statement's execute() method accepts the storage
proxy as the first argument. This is enough for all of them
except the schema-altering ones, because the latter need to call
the migration manager's announce.

To provide the migration manager to those who need it, some
higher-level service than the proxy is needed. The
query processor seems to be a good candidate.

That said -- all the .execute()s now accept the query
processor instead of the proxy and get the proxy itself from
the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Avi Kivity
65fea203d2 test: logalloc_test: harden background reclaim test against cpu overcommit
Use thread CPU time instead of real time, so the test does not fail
when an overcommitted machine cannot supply enough CPU.
2021-03-15 13:54:49 +02:00
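The hardening technique -- budgeting thread CPU time rather than wall-clock time -- can be sketched with POSIX clocks (a generic illustration, not the test's actual code):

```cpp
#include <time.h>

// Returns this thread's consumed CPU time in seconds. Unlike wall-clock
// time, this does not advance while the thread is descheduled, so an
// overcommitted machine cannot make a CPU-time budget expire spuriously.
double thread_cpu_seconds() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
```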
Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently a no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better to be insulated against such details.
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
2021-03-15 13:54:37 +02:00
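The shape of the fix -- splitting the single conditional into an unconditional shares update plus a conditional wakeup -- can be sketched as follows (a hypothetical simplification of the logalloc code):

```cpp
struct background_reclaimer {
    float shares = 0;
    bool main_loop_running = false;
    int wakeups = 0;

    // Buggy version: skipping everything while the loop runs leaves
    // `shares` stuck at its old (possibly low) value.
    void adjust_shares_buggy(float s) {
        if (!main_loop_running) {
            shares = s;
            ++wakeups;
        }
    }

    // Fixed version: the update is unconditional; only the wakeup is not.
    void adjust_shares(float s) {
        shares = s;        // always reflect current memory demand
        if (!main_loop_running) {
            ++wakeups;     // only wake the loop if it is idle
        }
    }
};
```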
Tomasz Grabiec
bf6c4e0b24 Merge "raft: consolidate tests in raft directory" from Alejo
Move boost tests to tests/raft and factor out common helpers.

* alejo/raft-tests-reorg-5-rebase-next-2:
  raft: tests: move common helpers to header
  raft: tests: move boost tests to tests/raft
2021-03-15 11:59:16 +01:00
Takuya ASADA
e8cfd5114f scylla_coredump_setup: support SLES
SLES requires installing the systemd-coredump package and enabling
systemd-coredump.socket in order to use systemd-coredump.
2021-03-15 19:19:56 +09:00
Takuya ASADA
13871ff1f8 scylla_setup: use rpm to check package availability for SLES
Use rpm to check scylla packages installed on SLES.
2021-03-15 19:18:44 +09:00
Takuya ASADA
e3b5ffcf14 dist: install optional packages for SLES
Support SUSE's own package manager, 'zypper', in the pkg_install()
function.
2021-03-15 19:17:48 +09:00
Alejo Sanchez
88063b6e3e raft: tests: move common helpers to header
Move common test helper functions and data structures to a common
helpers.hh header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

If we need to pre-fill the segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00
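In POSIX terms, the close/re-open dance reads roughly like this (a standalone sketch; the real code goes through Seastar's file API, and the path and sizes here are illustrative):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Pre-fill a segment file without O_DSYNC, so the zero-writes do not each
// trigger a device flush, then reopen it with O_DSYNC for normal appends.
int open_prefilled_segment(const char* path, std::size_t size) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);  // "normal" mode
    if (fd < 0) {
        return -1;
    }
    std::vector<char> zeros(4096, 0);
    for (std::size_t off = 0; off < size; off += zeros.size()) {
        if (::write(fd, zeros.data(), zeros.size()) < 0) {
            ::close(fd);
            return -1;
        }
    }
    ::fsync(fd);   // one explicit flush once the file is filled
    ::close(fd);
    return ::open(path, O_WRONLY | O_DSYNC);  // reopen in durable mode
}
```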
Calle Wilund
ae3b8e6fdf commitlog: coroutinize allocate_segment_ex
To make further changes here easier to write and read.
2021-03-15 09:35:37 +00:00
Avi Kivity
f326a2253c Update tools/java submodule
* tools/java 2c6110500c...fdc8fcc22c (1):
  > sstableloader: Use compound "where" restrictions for clustering
2021-03-15 11:19:22 +02:00
Raphael S. Carvalho
7171244844 compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.

So the space requirement for cleanup to succeed can be up to the size of the
keyspace, increasing the chances of the node running out of space.

The node could also run out of memory if there are tons of tables in the keyspace.
The memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also, all tables being cleaned up in parallel compete for the same
disk and cpu bandwidth, making them all much slower; consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
2021-03-14 14:31:26 +02:00
Nadav Har'El
d73934372d storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
2021-03-14 14:11:11 +02:00