Intra-node migrations are scheduled for each node independently with
the aim to equalize per-shard tablet count on each node.
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if
RF=N then there is not even a possibility of moving tablets around to
fix the imbalance. The only way to bring the system to balance is to
move tablets within the nodes.
After scheduling inter-node migrations, the algorithm schedules
intra-node migrations. This means that across-node migrations can
proceed in parallel with intra-node migrations if there is free
capacity to carry them out, but across-node migrations have higher
priority.
Fixes#16594
Currently the load balancer is only generting an inter-node plan, and
the algorithm is embedded in make_plan(). The method will become even
harder to follow once we add more kinds of plan generating steps,
e.g. inter-node plan. Extract the inter-node plan to make it easier to
add other plans and see the grand flow.
target nodes
The node_load datastructure was not updated to reflect migration
decisions on the target node. This is not needed for inter-node
migration because target nodes are not considered as sources. But we
want it to reflect migration decisions so that later inter-node
migration sees an accurate picture with earlier migrations reflected
in node_load.
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
This writer is used by streaming, on tablet migration and
load-and-stream.
The caller of distribute_reader_and_consume_on_shards(), which provides
a sharder, is supposed to ensure that effective_replication_map is kept
alive around it, in order for topology coordinator to wait for any writes
which may be in flight to reach their shards before tablet replica starts
another migration. This is already the case:
1) repair and load-and-stream keep the erm around writing.
2) tablet migration uses autorefreshing_sharder, so it does not, but
it keeps the topology_guard around the operation in the consumer,
which serves the same purpose.
When sharder says that the write should go to multiple shards,
we need to consider the write as applied only if it was applied
to all those shards.
This can happen during intra-node tablet migration. During such migration,
the request coordinator on storage_proxy side is coordinating to hosts
as if no migration was in progress. The replica-side coordinator coordinates
to shards based on sharder response.
One way to think about it is that
effective_replication_map::get_natural_endpoints()/get_pending_endpoints()
tells how to coordinate between nodes, and sharder tells how to
coordinate between shards. Both work with some snapshot of tablet
metadata, which should be kept alive around the operation. Sharder is
associated with its own effective_replication_map, which marks the
topology version as used and allows barriers to synchronize with
replica-side operations.
Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.
The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.
The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.
The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
balance_tablets() is invoked in a loop, so only the first call will
see non-empty skiplist.
This bug starts to manifest after adding intra-node migration plan,
causing failures of the test_load_balancing_with_skiplist test
case. The reason is that rebalancing will now require multiple passes
before convergence is reached, due to intra-node migrations, and later
calls will not see the skiplist and try to balance skipped nodes,
vioating test's assertions.
For the purpose of scylla-gdb.py command "scylla
active-sstables". Before the patch, readers were located by scanning
the heap for live objects with vtable pointers corresponding to
readers. It was observed that the test scylla_gdb/test_misc.py::test_active_sstables started failing like this:
gdb.error: Error occurred in Python: Cannot access memory at address 0x300000000000000
This could be explained by there being a live object on the heap which
used to be a reader but now is a different object, and the _sst field
contains some other data which is not a pointer.
To fix, track readers explicitly in a linked list so that the gdb
script can reliably walk readers.
Fixes#18618.
Topology version may be updated, for example, by executing a RESTful
API call to move a tablet. If that is done concurrently with an
ongoing token metadata barrier executed by topology coordinator
(because there is active tablet migration, for example), then some
requests may fail due to being fenced out unnecessarily.
The problem is that barrier function assumes no concurrent topology
updates so it sets the fence version to the one which is current after
other nodes are drained. This patch changes it to set the fence to the
version which was current before other nodes were drained. Semantics
of the barrier are preserved because it only guarantees that topology
state from before the invocation of barrier is propagated.
Fixes#18699
This patch adds metrics that will be reported per-table per-node.
The added metrics (that are part of the per-table per-shard metrics)
are:
scylla_column_family_cache_hit_rate
scylla_column_family_read_latency
scylla_column_family_write_latency
scylla_column_family_live_disk_space
Fixes#18642
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closesscylladb/scylladb#18645
incremental_reader_selector is the mechanism for incremental comsumption
of disjoint sstables on range reads.
tablet_sstable_set was implemented, such that selector is efficient with
tablets.
The problem is selector is vnode addicted and will only consider a given
set exhausted when maximum token is reached.
With tablets, that means a range read on first tablet of a given shard
will also consume other tablets living in the same shard. That results
in combined reader having to work with empty sstable readers of tablets
that don't intersect with the range of the read. It won't cause extra
I/O because the underlying sstables don't intersect with the range of
the read. It's only unnecessary CPU work, as it involves creating
readers (= allocation), feeding them into combined reader, which will
in turn invoke the sstable readers only to realize they don't have any
data for that range.
With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables
per tablet, there will be this amount of readers (empty or not):
(100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billions.
~5000 times more readers, it can be quite significant additional cpu
work, even though I/O dominates the most in scans. It's an inefficiency
that we rather get rid of.
The behavior can be observed from logs (there's 1 sstable for each of
4 tablets, but note how readers are created for every single one of
them when reading only 1 tablet range):
```
table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}]
```
Fix is about making sure the tablet set won't select past the
supplied range of the read.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18556
Currently, all documentation links that feature anywhere in the help output of scylla-nodetool, are hard-coded to point to the documentation of the latest stable release. As our documentation is version and product (open-source or enterprise) specific, this is not correct. This PR addresses this, by generating documentation links such that they point to the documentation appropriate for the product and version of the scylladb release.
Fixes: https://github.com/scylladb/scylladb/issues/18276
- [x] the native nodetool is a new feature, no backport needed
Closesscylladb/scylladb#18476
* github.com:scylladb/scylladb:
tools/scylla-nodetool: make doc link version-specific
release: introduce doc_link()
build: pass scylla product to release.cc
There are two metrics to help observe base-write throttling:
* current_throttled_base_writes
* last_mv_flow_control_delay
Both show a snapshot of what is happening right at the time of querying
these metrincs. This doesn't work well when one wants to investigate the
role throttling is playing in occasional write timeouts.s Prometheus
scrapes metrics in multi-second intervals, and the probability of that
instant catching the throttling at play is very small (almost zero).
Add two new metrics:
* throttled_base_writes_total
* mv_flow_control_delay_total
These accumulate all values, allowing graphana to derive the values and
extract information about throttle events that happened in the past
(but not necessarily at the instant of the scrape).
Note that dividing the two values, will yield the average delay for a
throttle, which is also useful.
Closesscylladb/scylladb#18435
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard coded
estimation ratio is used.
This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.
It is set to 0.1 by default.
Fixes#18615Closesscylladb/scylladb#18634
Closesscylladb/scylladb#18616
* github.com:scylladb/scylladb:
replica: Make it explicit table's sstable set is immutable
replica: avoid reallocations in tablet_sstable_set
replica: Avoid compound set if only one sstable set is filled
There's a loop that calculates the number of shard matches over a tablet
map. The check of the given shard against optional<shard> can be made
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18592
As part of the unification process, alternator tests are migrated to the PythonTestSuite instead of using the RunTestSuite. The main idea is to have one suite, so there will be easier to maintain and introduce new features.
Introduce the prepare_sql option for suite.yaml to add possibility to run cql statements as precondition for the test suite.
Related: https://github.com/scylladb/scylladb/issues/18188Closesscylladb/scylladb#18442
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with large number of cores.
In such case iotune reports failure due to
unability to create files or to set up seastar
framework.
This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.
The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#18546
In b4e66ddf1d (4.0) we added a new batchlog_manager configuration
named delay, but forgot to initialize it in cql_test_env. This somehow
worked, but doesn't with clang 18.
Fix it by initializing to 0 (there isn't a good reason to delay it).
Also provide a default to make it safer.
Closesscylladb/scylladb#18572
* tools/cqlsh e5f5eafd...c8158555 (11):
> cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
> cqlshlib/sslhandling.py: don't use empty userkey/usercert
> Dockerfile: noninteractive isn't enough for answering yet on apt-get
> fix cqlsh version print
> cqlshlib/sslhandling: change `check_hostname` deafult to False
> Introduce new ssl configuration for disableing check_hostname
> set the hostname in ssl_options.server_hostname when SSL is used
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73
> github actions: update `cibuildwheel==v2.16.5`
Fixes: scylladb/scylladb#18590Closesscylladb/scylladb#18591
The code is based on similar idea as perf_simple_query. The main differences are:
- it starts full scylla process
- communicates with alternator via http (localhost)
- uses richer table schema with all dynamoDB types instead of only strings
Testing code runs in the same process as scylla so we can easily get various perf counters (tps, instr, allocation, etc).
Results on my machine (with 1 vCPU):
> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload read --duration 10 2> /dev/null
...
median 23402.59616090321
median absolute deviation: 598.77
maximum: 24014.41
minimum: 19990.34
> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write --duration 10 2> /dev/null
...
median 16089.34211320635
median absolute deviation: 552.65
maximum: 16915.95
minimum: 14781.97
The above seem more realistic than results from perf_simple_query which are 96k and 49k tps (per core).
Related: https://github.com/scylladb/scylladb/issues/12518Closesscylladb/scylladb#13121
* github.com:scylladb/scylladb:
test: perf: alternator: add option to skip data pre-population
perf-alternator-workloads: add operations-per-shard option
test: perf: add global secondary indexes write workload for alternator
test: perf: add option to continue after failed request
test: perf: add read modify write workload for alternator (lwt)
test: perf: add scan workload for alternator
test: perf: add end-to-end benchmark for alternator
test: perf: extract result aggregation logic to a separate struct
in 906700d5, we accepted 0 as well as the return code of
"nodetool <command> --help", because we needed to be prepared for
the newer seastar submodule while be compatible with the older
seastar versions. now that in 305f1bd3, we bumped up the seastar
module, and this commit picked up the change to return 0 when
handling "--help" command line option in seastar, we are able to
drop the workaround.
so, in this change, we only use "0" as the expected return code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18627
in the same spirit of d57a82c156, this change adds `dist-unified` as one of the default targets. so that it is built by default. the unified package is required to when redistributing the precompiled packages -- we publish the rpm, deb and tar balls to S3.
- [x] cmake related change, no need to backport
Closesscylladb/scylladb#18621
* github.com:scylladb/scylladb:
build: cmake: use paths to be compatible with CI
build: cmake build dist-unified by default
password_authenticator::create_default_if_missing() is a confusing mix of
coroutines and continuations, simplify it to a normal coroutine.
Closesscylladb/scylladb#18571
our CI workflow for publishing the packages expects the tar balls
to be located under `build/$buildMode/dist/tar`, where `$buildMode`
is "release" or "debug".
before this change, the CMake building system puts the tar balls
under "build/dist" when the multi-config generator is used. and
`configure.py` uses multi-config generator.
in this change, we put the tar balls for redistribution under
`build/$<CONFIG>/dist/tar`, where `$<CONFIG>` is "RelWithDebInfo"
or "Debug", this works better with the CI workflow -- we just need
to map "release" and "debug" to "RelWithDebInfo" and "Debug" respectively.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in the same spirit of d57a82c156, this change adds `dist-unified`
as one of the default targets. so that it is built by default.
the unified package is required to when redistributing the precompiled
packages -- we publish the rpm, deb and tar balls to S3.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Most of the time only main set is filled, so we can avoid one layer
of indirection (= compound set) when maintenance set is empty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently empty storage_groups are allocated for tablets that are
not on this shard.
Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.
Stop compaction group before it's deallocated.
Add a flag to table::cleanup_tablet deciding whether to deallocate
sgs and use it in commitlog tests.