postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.
Make it return a future instead of stashing its completion somewhere.
This makes it easier to convert it to a coroutine.
When the repair master and followers have different shard counts, the repair
followers need to create multi-shard readers. Each multi-shard reader
creates one local reader on each shard, N (smp::count) local readers
in total.
There is a hard limit on the number of readers that can work in parallel.
When there are more readers than this limit, the readers start to
evict each other, causing buffers already read from disk to be dropped
and readers to be recreated, which is inefficient.
To reduce reader eviction overhead, a global reader permit is
introduced which accounts for the bloat caused by multi-shard readers.
With this patch, at any point in time, the number of readers created by
repair will not exceed the reader limit.
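A rough sketch of the accounting idea, not the actual implementation: assume each multi-shard reader is charged one unit per shard (smp::count) against a single global budget before it is created. The class `reader_budget` and its method names are hypothetical, and a real implementation would wait for units instead of refusing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical global budget for repair readers; not Scylla's actual class.
class reader_budget {
    std::size_t _limit;     // hard limit on concurrently active readers
    std::size_t _used = 0;  // units currently charged
public:
    explicit reader_budget(std::size_t limit) : _limit(limit) {}

    // A multi-shard reader creates one local reader per shard, so it is
    // charged `shard_count` units. Returns false if the limit would be exceeded.
    bool try_admit_multishard_reader(std::size_t shard_count) {
        if (shard_count > _limit - _used) {
            return false;
        }
        _used += shard_count;
        return true;
    }

    // Give the units back once the multi-shard reader is destroyed.
    void release_multishard_reader(std::size_t shard_count) {
        assert(shard_count <= _used);
        _used -= shard_count;
    }

    std::size_t active_readers() const { return _used; }
};
```

Under these assumptions, a budget of 10 admits only one 8-shard multi-shard reader at a time, so the readers never exceed the limit and never have to evict each other.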
Test Results:
1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2
shards, n2=8 shards, memory wanted =1
1.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:45:24,770] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:45:53,869] Repair session 1
[2022-11-23 17:45:53,869] Repair session 1 finished
real 0m30.212s
user 0m1.680s
sys 0m0.222s
1.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:46:07,507] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:46:30,608] Repair session 1
[2022-11-23 17:46:30,608] Repair session 1 finished
real 0m24.241s
user 0m1.731s
sys 0m0.213s
2) with stream sem 10, repair global sem no_limit, 5 ranges in
parallel, n1=2 shards, n2=8 shards, memory wanted =1
2.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:49:49,301] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:01,414] Repair session 1
[2022-11-23 17:52:01,415] Repair session 1 finished
real 2m13.227s
user 0m1.752s
sys 0m0.218s
2.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:52:19,280] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:42,387] Repair session 1
[2022-11-23 17:52:42,387] Repair session 1 finished
real 0m24.196s
user 0m1.689s
sys 0m0.184s
Comparing 1.1) and 2.1) shows that eviction played a major role here.
The patch gives a 133s / 30s = 4.4X speedup in this setup.
Comparing 1.1) and 1.2) shows that even if we limit the readers, starting
the repair on the node with fewer shards is faster: 30s / 24s = 1.25X
(the total number of multishard readers is lower).
Fixes #12157
Closes #12158
Split the simple (and common) case from the complex case,
and coroutinize the latter. Hopefully this generates better
code for the simple case, and it makes the complex case a
little nicer.
Closes #12194
* github.com:scylladb/scylladb:
cql3: select_statement: reindent process_results_complex()
cql3: select_statement: coroutinize process_results_complex()
cql3: select_statement: split process_results() into fast path and complex path
run_snapshot_list_operation() takes a continuation, so passing it
a lambda coroutine without protection is dangerous.
Protect the coroutine with coroutine::lambda so it doesn't lose its
contents.
Fixes #12192.
Closes #12193
Not a huge gain, since it's just a do_with, but still a little better.
Note the inner lambda is not a coroutine, so it isn't susceptible to
the lambda coroutine fiasco.
One of the prerequisites to make sstables reside on object-storage is not to let the rest of the code "know" the filesystem path they are located on (because sometimes they will not be on any filesystem path). This patch makes the methods that can reveal this path back private so that later they can be abstracted out.
Closes #12182
* github.com:scylladb/scylladb:
sstable: Mark some methods private
test: Don't get sstable dir when known
test: Use move_to_quarantine() helper
test: Use sstable::filename() overload without dir name
sstables: Reimplement batch directory sync after move
table, tests: Make use of move_to_new_dir() default arg
sstables: Remove fsync_directory() helper
table: Simplify take_snapshot()'s collecting sstables names
The test enables an error injection inside the Raft upgrade procedure
on one of the nodes which will cause the node to throw an exception
before entering `synchronize` state. Then it restarts other nodes with
Raft enabled, waits until they enter `synchronize` state, puts them in
RECOVERY mode, removes the error-injected node and creates a new Raft
group 0.
As soon as the other nodes enter `synchronize`, the test disabled the
error injection (the rest of the test was outside the `async with
inject_error(...)` block). There was a small chance that we disabled the
error injection before the node reached it. In that case the node also
entered `synchronize` and the cluster managed to finish the upgrade
procedure. We encountered this during next promotion.
Eliminate this possibility by extending the scope of the `async with
inject_error(...)` block, so that the RECOVERY mode steps on the other
nodes are performed within that block.
Closes #12162
There are several class sstable methods that reveal the internal directory
path to the caller. That's not object-storage-friendly. Fortunately, all the
callers of those methods have been patched not to work with full paths,
so these can be marked private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable_move_test creates sstables in its own temp directories and
then requests these dirs' paths back from sstables. The test can use
the paths it already has at hand, no need to ask sstables for them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places in tests move an sstable to the quarantine subdir by hand. There's
a class sstable method that does the same, so use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dir this place currently uses is the directory where the sstable was
created, so dropping this argument would just render the same path.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a table::move_sstables_from_staging() method that gets a bunch
of sstables and moves them from the staging subdir into the table's root
datadir. To avoid flushing the root dir for every sstable move, it asks
sstable::move_to_new_dir() not to flush, but collects staging dir names
and flushes them and the root dir at the end altogether.
In order to make it more friendly to object-storage and to remove one
more caller of sstable::get_dir() the delayed_commit_changes struct is
introduced. It collects _all_ the affected dir names in unordered_set,
then allows flushing them. By default the move_to_new_dir() doesn't
receive this object and flushes the directories instantly.
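A minimal sketch of this batching pattern, under stated assumptions: the `sync_fn` callback stands in for the real directory sync, the `commit()` method name and the signature of `move_to_new_dir()` here are hypothetical, and the real code is asynchronous.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>

// Stand-in for a directory sync call; a real implementation would fsync.
using sync_fn = std::function<void(const std::string&)>;

// Collects every directory touched by a batch of moves.
struct delayed_commit_changes {
    std::unordered_set<std::string> dirs;

    // Sync each affected directory exactly once, then forget them.
    void commit(const sync_fn& sync) {
        for (const auto& d : dirs) {
            sync(d);
        }
        dirs.clear();
    }
};

// Hypothetical move helper: without a `delay` object both directories are
// synced immediately; with one, they are only recorded for a later commit.
void move_to_new_dir(const std::string& old_dir, const std::string& new_dir,
                     const sync_fn& sync, delayed_commit_changes* delay = nullptr) {
    // ... the actual rename of the sstable files would happen here ...
    if (delay) {
        delay->dirs.insert(old_dir);
        delay->dirs.insert(new_dir);
    } else {
        sync(old_dir);
        sync(new_dir);
    }
}
```

Because the names are kept in a set, moving many sstables between the same two directories costs two syncs at commit time instead of two per move.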
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts a boolean flag telling whether or not it
should sync directories at the end. It's true in all cases but one, so
there's a default value for it. Make use of it.
Anticipating the suggestion to replace bool with bool_class -- next
patch will replace it with something else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper effectively wraps the existing seastar sync_directory() helper
in two io_check-s. It's simpler just to call the latter directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question "snapshots" all sstables it can find, then writes
their Datafile names into the manifest file. To get the list of file
names, it iterates over the sstables list again and does a silly conversion
of the full file path to a file name with the help of the directory path length.
This can all be made much simpler by collecting component names
directly at the time the sstable is hardlinked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_state shouldn't be moved once emplaced. Moving it could
theoretically cause a task's gate holder to have a dangling pointer to
compaction_state's gate, but it turns out gate's move ctor will actually
fail under this assertion:
assert(!_count && "gate reassigned with outstanding requests");
Cannot happen today, but let's make it more future proof.
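One common way to enforce "never moved once emplaced" at compile time is to delete the move operations and construct the object in place inside a node-based container. The sketch below illustrates that general technique with a hypothetical `pinned_state` type; it is an assumption about the approach, not the code from this patch.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <unordered_map>

// Hypothetical stand-in for a state object whose address is held by others.
struct pinned_state {
    int outstanding = 0;
    pinned_state() = default;
    // Deleting the move operations also suppresses the implicit copies.
    pinned_state(pinned_state&&) = delete;
    pinned_state& operator=(pinned_state&&) = delete;
};

static_assert(!std::is_move_constructible_v<pinned_state>);
static_assert(!std::is_copy_constructible_v<pinned_state>);

// Node-based maps construct the value in place, so a non-movable type works
// as long as it is inserted with try_emplace() rather than insert().
pinned_state& get_or_create(std::unordered_map<std::string, pinned_state>& m,
                            const std::string& key) {
    return m.try_emplace(key).first->second;
}
```

With this shape, any future code that tries to move the state fails to compile, and references to an element stay valid even when the map rehashes.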
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12167
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
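The destination check can be sketched as follows. This is a simplified illustration: `raft_id` is a plain integer here rather than the real ID type, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real Raft server ID type.
using raft_id = std::uint64_t;

enum class ping_response { ok, wrong_destination };

// Runs on the node that received the ping: answer `ok` only if the ping
// was really addressed to this server.
ping_response handle_ping(raft_id my_id, raft_id dst_id) {
    return dst_id == my_id ? ping_response::ok : ping_response::wrong_destination;
}

// Runs on the sender: only an `ok` reply marks the pinged server alive.
bool is_alive(ping_response r) {
    return r == ping_response::ok;
}
```

In the replace scenario, pings for both A and B arrive at B. B answers `ok` for its own ID and `wrong_destination` for A's ID, so the sender marks B alive and keeps A dead.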
Yet another reason: this removes the "Not ready to respond gossip echo
message" log spam during replace.
Closes #12107
* github.com:scylladb/scylladb:
service/raft: specialized verb for failure detector pinger
db: system_keyspace: de-staticize `{get,set}_raft_server_id`
service/raft: make this node's Raft ID available early in group registry
According to seastar/doc/lambda-coroutine-fiasco.md, a lambda that
co_awaits even once loses its capture frame. In the distributed_loader
code there's at least one of that kind.
fixes: #12175
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12170
The method became unused since 70e5252a (table: no longer accept online
loading of SSTable files in the main directory) and the whole concept of
reshuffling sstables was dropped later by 7351db7c (Reshape upload files
and reshard+reshape at boot).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12165
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.
Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.
We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.
This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.
Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened #12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.
Note that although the test will now be marked "xfail", it will actually
pass most of the time, so it will appear as "xpass" to people who run it.
I put a note in the xfail reason string as a reminder why this is
happening.
Fixes #12143
Refs #12166
Refs scylladb/seastar#1325
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12169
The field was not used for anything. We can keep a decommissioned server
in the `stopped` field.
In fact it caused us a problem: since recently, we're using
`ScyllaCluster.uninstall` to clean-up servers after test suite finishes
(previously we were using `ScyllaServer.uninstall` directly). But
`ScyllaCluster.uninstall` didn't look into the `decommissioned` field,
so if a server got decommissioned, we wouldn't uninstall it, and it left
us some unnecessary artifacts even for successful tests. This is now
fixed.
Closes #12163
Mainly this PR removes the global db::config and feature service that are used by sstables::test_env as dependencies for the embedded sstables_manager. Other than that, it drops unused methods, removes nested test_env-s and relaxes a few cases that use two temp dirs at a time for no gain.
Closes #12155
* github.com:scylladb/scylladb:
test, utils: Use only one tempdir
sstable_compaction_test: Dont create nested envs
mutation_reader_test: Remove unused create_sstable() helper
tests, lib: Move globals onto sstables::test_env
tests: Use sstables::test_env.db_config() to access config
features: Mark feature_config_from_db_config const
sstable_3_x_test: Use env method to create sst
sstable_3_x_test: Indentation fix after previous patch
sstable_3_x_test: Use sstable::test_env
test: Add config to sstable::test_env creation
config: Add constexpr value for default murmur ignore bits
There's a do_with_cloned_tmp_directory that makes two temp dirs to toss
sstables between them. Make it go with just one, especially since that
resembles the existing manipulations around the staging/ subdir.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The "compact" test case runs in sstables::test_env and additionally
wraps it with another instance provided by do_with_tmp_directory helper.
It's simpler to create the temp dir by hand and use the outer env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of objects that are used by test_env as sstables_manager
dependencies. Now that no other code needs those globals, they are better
placed on the test_env next to the manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently some places use global test config, but it's going to be
removed soon, so switch to using config from environment
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's in fact such. Other than that, next patch will call it with const
config at hand and fail to compile without this fix
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several cases there that construct sstables_manager by hand
with the help of a bunch of global dependencies. It's nicer to use
existing wrapper.
(indentation left broken until next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To make callers (tests) construct it with different options. In
particular, one test will soon want to construct it with custom large
data handler of its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The PR introduces shard_repair_task_impl which represents a repair task
that spans over a single shard repair.
repair_info is replaced with shard_repair_task_impl, since both serve
a similar purpose.
Closes #12066
* github.com:scylladb/scylladb:
repair: reindent
repair: replace repair_info with shard_repair_task_impl
repair: move repair_info methods to shard_repair_task_impl
repair: rename methods of repair_module
repair: change type of repair_module::_repairs
repair: keep a reference to shard_repair_task_impl in row_level_repair
repair: move repair_range method to shard_repair_task_impl
repair: make do_repair_ranges a method of shard_repair_task_impl
repair: copy repair_info methods to shard_repair_task_impl
repair: corutinize shard task creation
repair: define run for shard_repair_task_impl
repair: add shard_repair_task_impl
This is the core of dynamic IP address support in Raft, moving the
IP address sourcing from the Raft Group 0 configuration to gossip. At Raft
startup, the raft id <> IP address translation map subscribes to
gossiper notifications and learns the IP addresses of Raft hosts from them.
The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.
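A minimal sketch of a gossip-fed translation map, together with an RPC path that tolerates a missing address instead of throwing. All names (`address_map`, `on_gossip_update`, `try_send`) and the integer `raft_id` are hypothetical, and the real code is asynchronous and more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using raft_id = std::uint64_t;  // hypothetical stand-in for the real ID type

class address_map {
    std::unordered_map<raft_id, std::string> _ip_by_id;
public:
    // Called from a gossip notification; a newer address overwrites the old one.
    void on_gossip_update(raft_id id, std::string ip) {
        _ip_by_id.insert_or_assign(id, std::move(ip));
    }

    // Returns the last known IP of the server, if any was learned yet.
    std::optional<std::string> find(raft_id id) const {
        auto it = _ip_by_id.find(id);
        if (it == _ip_by_id.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

// Returns false (message dropped) rather than throwing when the IP is unknown.
bool try_send(const address_map& map, raft_id dst) {
    auto ip = map.find(dst);
    if (!ip) {
        return false;
    }
    // ... the actual RPC to *ip would be issued here ...
    return true;
}
```

Dropping the message is acceptable in this sketch because Raft retries, and the address becomes known once gossip delivers it.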
Closes #12035
* github.com:scylladb/scylladb:
raft: (rpc) do not throw in case of a missing IP address in RPC
raft: (address map) actively maintain ip <-> raft server id map