Having the handle keep an owning reference to the inactive read lead to
awkward situations, where the inactive read is destroyed during eviction
in certain situations only (querier cache) and not in other cases.
Although the users didn't notice anything from this, it lead to very
brittle code inside the reader concurrency semaphore. Among others, the
inactive read destructor has to be open coded in evict() which already
lead to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore but the handle now keeps a weak
pointer to them. When destroyed the handler will destroy the inactive
read if it is still alive. When evicting the inactive read, it will
set the pointer in the handle to null.
As the existing comment explains a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.
Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
Previously, we crashed when the IN marker is bound to null. Throw
invalid_request_exception instead.
Fixes#8265
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8287
... `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz
With this change, coordinator prefers himself as the "counter leader", so if
another endpoint is chosen as the leader, we know that coordinator was not a
member of replica set. With this guarantee we can increment
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing different leader (that metric used to neglect the counter
updates).
The motivation for this change is to have more reliable way of counting
non-token-aware queries.
Fixes#4337Closes#8282
* github.com:scylladb/scylla:
storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
counters: Favor coordinator as leader
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
Closes#8250
* github.com:scylladb/scylla:
commitlog: Make pre-allocation drop O_DSYNC while pre-filling
commitlog: coroutinize allocate_segment_ex
Before this change, we would print an expression like this:
((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)
Now, we print the same expression like this:
(c = 0000007b)
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8285
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework needed.
The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.
Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.
The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.
Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is higher-level service than the storage proxy is, it is migration
manager who currently calls storage proxy, but not the vice versa.
* xemul/br-kill-some-migration-managers-2:
cql3: Get database directly from query processor
thrift: Use query_processor::get_migration_manager()
table_helper: Use query_processor::get_migration_manager()
cql3: Use query_processor::get_migration_manager() (lambda captures cases)
cql3: Use query_processor::get_migration_manager() (alter_type statement)
cql3: Use query_processor::get_migration_manager() (trivial cases)
query_processor: Keep migration manager onboard
cql3: Pass query processor to announce_migration:s
cql3: Switch to qp (almost) in schema-altering-stmt
cql3: Change execute()'s 1st arg to query_processor
compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
without this adjustment, selection of runs would fail because
function would try to unconditionally retrieve the run which may
live somewhere else.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
after compound set is introduced, select_sstable_runs() will no longer
work because the sstable runs live in sstable_set, but they should
actually live in the sstable_set being written to.
Given that runs is a concept that belongs only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() to it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
When internode_encryption is "rack" or "dc", we should enforce incoming
connections are from the appropriate address spaces iff answering on
non-tls socket.
This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.
Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"
Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with dc/rack tls setting we should warn him that
there is a possibility that someone could spoof himself pass the
authentication.
Closes#8051
Basically SLES support is already done in f20736d93d, but it was for offline installer.
This fixes few more problems to install our rpm to SLES.
After this change, we can just install our rpm for both CentOS/RHEL and SLES in single image, like unified deb.
SLES uses original package manager called 'zypper', but it does support yum repository so no need to change required for repo.
Closes#8277
* github.com:scylladb/scylla:
scylla_coredump_setup: support SLES
scylla_setup: use rpm to check package availability for SLES
dist: install optional packages for SLES
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.
The patch set also changes several other things in order
for tickers to work:
1. A bug in `raft_sys_table_storage` which caused an exception
if `raft::server::start` is called without any persisted state.
2. `raft_services::add_server` now automatically calls
`raft::server::start()` since a server instance should be started
before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
throwing "not implemented" exception in `abort()`.
* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
raft/raft_services: provide a ticker for each raft server
raft/raft_services: switch from plain `throw` to `on_internal_error`
raft/raft_services: start server instance automatically in `add_server`
raft: return ready future instead of throwing in schema_raft_state_machine
raft: allow raft server to start with initial term 0
raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.
Fixes#8234.
Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)
Closes#8281
* github.com:scylladb/scylla:
test: logalloc_test: harden background reclain test against cpu overcommit
logalloc: background reclaim: use default scheduling group for adjusting shares
logalloc: background reclaim: log shares adjustment under trace level
logalloc: background reclaim: fix shares not updated by periodic timer
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.
Note that the timer should start after the server calls
its `start()` method, because otherwise it would crash
since fsm is not initialized yet.
Currently, the tick interval is hardcoded to be 100ms.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. raft protocol state machine.
Otherwise, it will likely result in a crash.
Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.
In case some exception happened inside `add_server`,
the `init` function will de-initialize what it already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break further initialization
process and, what is more important, will prevent raft
rpc verbs deinitialization. This will cause a crash in
`messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, which will be described below.
The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In such case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output` so that
a reader don't have to think about the motivation behind this `if`
and just checks that `term_and_vote` optional is engaged.
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fail with an
exception.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Coordinator prefers himself as the "counter leader", so if another
endpoint is chosen as the leader, we know that coordinator was
not a member of replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).
Fixes#4337
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
After previous patches some places in cql3 code take a
long path to get database reference:
query processor -> storage proxy -> database
The query processor can provide the database reference
by itself, so take this chance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Thrift needs migration manager to call announce_<something> on
it and currently it grabs blobak migration manager instance.
Since thrift handler has query processor rerefence onboard and
the query processor can provide the migration manager reference,
it's time to remove few more globals from thrift code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the migration manager can be obtained from the query
processor the table heler can also benefit from it and not
call for global migration manager instance any longer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they all are continuations of make_ready_future<>()s, so the
query processor can be simply captured by reference and used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This statement needs the query processor one step below the
stack from its .announce_migration method. So here's the
dedicated patch for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the schema altering statements implementations can now
stop calling for global migration manager instance and get it
from the query processor.
Here are the trivial cases when the query processor is just
avaiable at the place where it's needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor sits upper than the migration manager,
in the services layering, it's started after and (will be)
stopped before the migration manager.
The migration manager is needed in schema altering statements
which are called with query processor argument. They will
later get the migration manager from the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the only call to .announce_migration gas the
query processor at hands -- pass it to the real statements.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema altering statements are all inherited from the same
base class which delcares a pure virtual .announce_migration()
method. All the real statements are called with storage proxy
argument, while the need the migration manager. So like in the
previous patch -- replace storage proxy with query processor.
While doing the replacement also get the database instance from
the querty processor, not from proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the statement's execute() method accepts storage
proxy as the first argument. This is enough for all of them
but schema altering ones, because the latter need to call
migration manager's announce.
To provide the migration manager to those who need it it's
needed to have some higher-level service that the proxy. The
query processor seems to be good candidate for it.
Said that -- all the .execute()s now accept the querty
processor instead of the proxy and get the proxy itself from
the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.
This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.
Fix by splitting the conditional into two.
Move boost tests to tests/raft and factor out common helpers.
* alejo/raft-tests-reorg-5-rebase-next-2:
raft: tests: move common helpers to header
raft: tests: move boost tests to tests/raft
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.
So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.
Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.
This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.
Fixes#8247.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".
Refs #8089
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>