If a node restarts just before it stores the bootstrapping node's IP, it will
not have an ID-to-IP mapping for the bootstrapping node, which may cause a failure
on the write path. Detect this and fail bootstrapping if it happens.
(cherry picked from commit 1faef47952)
(cherry picked from commit 27445f5291)
(cherry picked from commit 6853b02c00)
(cherry picked from commit f91db0c1e4)
Refs #18927
Closes scylladb/scylladb#19118
* github.com:scylladb/scylladb:
raft topology: fix indentation after previous commit
raft topology: do not add bootstrapping node without IP as pending
test: add test of bootstrap where the coordinator crashes just before storing IP mapping
schema_tables: remove unused code
Fetching only the first page is not the intuitive behavior users expect.
This causes flakiness in some tests which generate a variable number of
keys depending on execution speed and later verify that all keys were
written using a single SELECT statement. When the number of keys
becomes larger than the page size, the test fails.
Fixes #18774
(cherry picked from commit 2c3f7c996f)
Closes scylladb/scylladb#19130
Assigning to a member of an uninitialized optional
does not initialize the object before assigning to it.
This resulted in AddressSanitizer detecting an attempt
to double-free when the uninitialized string contained
an apparently bogus pointer.
The change emplaces the returned optional when needed,
without resorting to the copy-assignment operator,
so it is not susceptible to assigning to uninitialized
memory, and it is more efficient as well.
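As an illustration, a minimal sketch of the bug pattern and of the emplace-based fix (simplified toy code, not the actual Scylla sources):
```cpp
#include <optional>
#include <string>

struct row {
    std::string value;
};

std::optional<row> make_row(bool has_value) {
    std::optional<row> r;   // disengaged: the row inside has not been constructed
    if (has_value) {
        // Buggy pattern: `r->value = "x";` on a disengaged optional assigns into
        // an unconstructed string, whose garbage internals can later be freed
        // again -- the double-free AddressSanitizer reported.
        r.emplace();        // fix: construct the contained row in place first
        r->value = "x";
    }
    return r;
}
```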
Fixes scylladb/scylladb#19041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b2fa954d82)
Closes scylladb/scylladb#19117
Task manager's tasks stay in memory after they are finished.
Moreover, even if a child task is unregistered from the task manager,
it is still alive, since its parent keeps a foreign pointer to it. Also,
when a task has finished successfully there is no point in keeping
all of its descendants in memory.
The patch introduces folding of task manager's tasks. Whenever
a task which has a parent finishes, it is unregistered from the task
manager and the foreign_ptr to it (kept in its parent) is replaced
with its status. The statuses of the task's children are dropped unless
they or one of their descendants failed. So for each operation we
keep a tree of tasks which contains:
- the root task and its direct children (a status if they are finished, a task
otherwise);
- running tasks and their direct children (same as above);
- a path of statuses from the root to each failed task.
/task_manager/wait_task/ does not unregister tasks anymore.
Refs: https://github.com/scylladb/scylladb/issues/16694.
Backport reason: requires backport to 6.0, as the number of tasks exploded with tablets.
(cherry picked from commit 6add9edf8a)
(cherry picked from commit 319e799089)
(cherry picked from commit e6c50ad2d0)
(cherry picked from commit a82a2f0624)
(cherry picked from commit c1b2b8cb2c)
(cherry picked from commit 30f97ea133)
(cherry picked from commit fc0796f684)
(cherry picked from commit d7e80a6520)
(cherry picked from commit beef77a778)
Refs https://github.com/scylladb/scylladb/pull/18735
Closes scylladb/scylladb#19104
* github.com:scylladb/scylladb:
docs: describe task folding
test: rest_api: add test for task tree structure
test: rest_api: modify new_test_module
tasks: test: modify test_task methods
api: task_manager: do not unregister task in /task_manager/wait_task/
tasks: unregister tasks with parents when they are finished
tasks: fold finished tasks into their parents
tasks: make task_manager::task::impl::finish_failed noexcept
tasks: change _children type
If there is no mapping from host id to IP while a node is in bootstrap
state, there is no point adding it as a pending endpoint, since the write
handler will not be able to map it back to a host id anyway. If the
transition state requires double writes, though, we still want to fail.
If the state is write_both_read_old we fail the barrier, which will
cause the topology operation to roll back; in the write_both_read_new case
we assert, but this should not happen since the mapping is persisted by
this point (or we already failed in the write_both_read_old state).
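A rough sketch of the resulting decision, using hypothetical helper and enum names (the real logic lives in the raft topology path):
```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class transition_state { none, write_both_read_old, write_both_read_new };

// Hypothetical helper for illustration only.
void handle_bootstrapping_node(const std::optional<std::string>& ip, transition_state state) {
    if (ip) {
        return;   // mapping exists: the node can be added as a pending endpoint
    }
    switch (state) {
    case transition_state::write_both_read_old:
        // Fail the barrier so the topology operation rolls back.
        throw std::runtime_error("no IP mapping for bootstrapping node");
    case transition_state::write_both_read_new:
        // The mapping must already be persisted by this point.
        assert(false);
        break;
    default:
        // No double writes required: simply do not add the node as pending.
        break;
    }
}
```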
Fixes: scylladb/scylladb#18676
(cherry picked from commit 6853b02c00)
On the next boot there is no host ID to IP mapping, which causes the node to
crash again with the "No mapping for :: in the passed effective replication map"
assertion.
(cherry picked from commit 27445f5291)
The values of `tablets_enabled` were nonempty strings, so they
always evaluated to `True` in the if statement responsible for
enabling writing workers only if tablets are disabled. Hence, the
writing workers were always disabled.
The original commit, ea4717da65,
contains one more change, which is not needed (and conflicting)
in 6.0 because scylladb/scylladb#18898 has been backported first.
Closes scylladb/scylladb#19111
Retrieval of tablet stats must be serialized with mutations to token metadata, as the former requires tablet id stability.
If a tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort, as the tablet map requires that any id fed into it is lower than its current tablet count.
Fixes https://github.com/scylladb/scylladb/issues/18085.
(cherry picked from commit abcc68dbe7)
(cherry picked from commit 551bf9dd58)
(cherry picked from commit e7246751b6)
Refs https://github.com/scylladb/scylladb/pull/18287
Closes scylladb/scylladb#19095
* github.com:scylladb/scylladb:
topology_experimental_raft/test_tablets: restore usage of check_with_down
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
This doesn't apply to auth-v2, as we improved data placement and
removed the Cassandra quirk which was setting a different CL for some
operations involving the default superuser.
Fixes #18773
(cherry picked from commit 9adf74ae6c)
Closes scylladb/scylladb#18860
Currently, there is no indication of tablets in the logged KSMetaData.
Print the tablets configuration: either the `initial` number of tablets,
if enabled, or {'enabled':false} otherwise.
For example:
```
migration_manager - Create new Keyspace: KSMetaData{name=tablets_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"initial":0}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004d446a8}
migration_manager - Create new Keyspace: KSMetaData{name=vnodes_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"enabled":false}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004c33ea8}
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4fe700a962)
Closes scylladb/scylladb#19009
Tablet allocation does not guarantee fairness of
the first replica in the replica set across DCs.
The lack of this fix causes the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
Use the tablet_map get_primary_replica or get_primary_replica_within_dc,
respectively, to determine whether this node is the primary replica for each
tablet or not.
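A rough sketch of how these helpers are meant to be used, with hypothetical, simplified stand-ins for locator::tablet_map and tablet_replica (the real signatures differ):
```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins for illustration only.
struct tablet_replica {
    uint64_t host = 0;
    unsigned shard = 0;
};

struct tablet_map_sketch {
    std::vector<std::vector<tablet_replica>> replicas;   // per-tablet replica sets
    std::vector<std::string> dc_of_host;                  // toy DC lookup, indexed by host id

    std::size_t tablet_count() const { return replicas.size(); }

    // The primary replica is the first replica in the tablet's replica set.
    tablet_replica get_primary_replica(std::size_t tablet) const {
        return replicas[tablet].front();
    }

    // The primary replica within a DC: here, the first replica belonging to that DC.
    tablet_replica get_primary_replica_within_dc(std::size_t tablet, const std::string& dc) const {
        for (const auto& r : replicas[tablet]) {
            if (dc_of_host[r.host] == dc) {
                return r;
            }
        }
        return {};
    }
};

// Repair only the tablets for which this node is the primary replica,
// either globally or within the local DC.
std::vector<std::size_t> tablets_to_repair(const tablet_map_sketch& tmap, uint64_t my_host,
                                           const std::string& my_dc, bool within_dc) {
    std::vector<std::size_t> out;
    for (std::size_t t = 0; t < tmap.tablet_count(); ++t) {
        auto primary = within_dc ? tmap.get_primary_replica_within_dc(t, my_dc)
                                 : tmap.get_primary_replica(t);
        if (primary.host == my_host) {
            out.push_back(t);
        }
    }
    return out;
}
```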
Fixes https://github.com/scylladb/scylladb/issues/17752
No backport is required before 6.0, as tablets (and tablet repair) were introduced in 6.0.
(cherry picked from commit c52f70f92c)
(cherry picked from commit 2de79c39dc)
(cherry picked from commit 84761acc31)
(cherry picked from commit 009767455d)
(cherry picked from commit 18df36d920)
Refs #18784
Closes scylladb/scylladb#19068
* github.com:scylladb/scylladb:
repair: repair_tablets: use get_primary_replica
repair: repair_tablets: no need to check ranges_specified per tablet
locator: tablet_map: add get_primary_replica_within_dc
locator: tablet_map: get_primary_replica: do not copy tablet info
locator: tablet_map: get_primary_replica: return tablet_replica
Wait until the task is done in test_task::finish_failed and
test_task::finish to ensure that it is folded into its parent.
(cherry picked from commit 30f97ea133)
If /task_manager/wait_task/ unregisters the task, then there is no
way to examine children's failures, since their statuses can be checked
only through their parent.
(cherry picked from commit c1b2b8cb2c)
Currently, when a child task is unregistered, it is still kept by its parent. This leads
to excessive memory usage, especially when tasks are configured to be kept in the task
manager after they are finished (task_ttl_in_seconds).
Introduce a task_essentials struct which keeps only the data necessary for the task manager API.
When a task which has a parent finishes, the foreign pointer to it in its parent is replaced
with the respective task_essentials. Once a parent task finishes, it is also folded into
its parent (if it has one). The children's details of a folded task are lost, unless they
(or some of their subtrees) failed. That is, when a task is finished, we keep:
- the root task (until it is unregistered);
- task_essentials of the root's direct children;
- a path (of task_essentials) from the root to each failed task (so that the reason
for a failure can be examined).
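As a rough illustration of the folding idea, a minimal sketch with hypothetical, simplified types (the real task_manager code and names differ):
```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <variant>
#include <vector>

// "task_essentials": only the data the task manager API needs about a finished task.
struct task_essentials {
    std::string id;
    bool failed = false;
    std::vector<task_essentials> children;   // kept only on paths that lead to a failure
};

struct task {
    std::string id;
    // A child is either still alive (owned task) or already folded into its essentials.
    std::vector<std::variant<std::unique_ptr<task>, task_essentials>> children;

    // When a child finishes, it is unregistered and only its essentials are kept.
    void fold_child(std::size_t i, task_essentials ess) {
        children[i] = std::move(ess);
    }

    // Build this task's own essentials when it finishes and gets folded into its
    // parent: successful subtrees are dropped, failing ones are kept so the
    // failure reason can still be examined through the parent.
    task_essentials fold(bool failed) const {
        task_essentials ess{id, failed, {}};
        for (const auto& child : children) {
            if (const auto* folded = std::get_if<task_essentials>(&child)) {
                if (folded->failed || !folded->children.empty()) {
                    ess.children.push_back(*folded);
                }
            }
        }
        return ess;
    }
};
```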
(cherry picked from commit e6c50ad2d0)
storage_proxy is responsible for intersecting the range of the read
with tablets and calling the replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.
Note this is a performance issue, not a correctness one, as memtable
readers that don't intersect with the current range won't produce any
data, but CPU is wasted until that's realized (they're added to the list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc.).
That's also important for streaming, e.g. after decommission, which
consumes one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.
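Conceptually, the change amounts to something like the following sketch (toy stand-ins for the real Scylla types; hypothetical names):
```cpp
#include <vector>

// Toy token range; the real dht::token_range differs.
struct token_range {
    long first;
    long last;
    bool overlaps(const token_range& o) const {
        return first <= o.last && o.first <= last;
    }
};

struct memtable {
    token_range tablet_range;   // token range of the tablet this memtable belongs to
};

// Only memtables whose tablet range intersects the queried (single-tablet) range
// get readers; the others would produce no data but still cost allocations and
// CPU in the reader merger.
std::vector<const memtable*> memtables_for_range(const std::vector<memtable>& memtables,
                                                 const token_range& query_range) {
    std::vector<const memtable*> selected;
    for (const auto& mt : memtables) {
        if (mt.tablet_range.overlaps(query_range)) {
            selected.push_back(&mt);
        }
    }
    return selected;
}
```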
Refs #18904.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 832fb43fb4)
Closes scylladb/scylladb#18983
before this change, in order to avoid repeating/hardwiring the
compile options set by Seastar, we just inherited Seastar's compile
options for building Abseil, as Seastar exposes the
options needed to enable sanitizers.
this works fine, even though, strictly speaking, not all options
are necessary for building Abseil, as Abseil is not a Seastar
application -- it is just a C++ library.
but when we introduce dependencies which are only generated at
build time, and these dependencies are passed to the compiler
at build time, this breaks the build of Abseil, because these
dependencies are exposed by Seastar's .pc file and consumed
by Abseil. when Abseil is built, the build process
driven by ninja has apparently not started yet, so we are not able
to build Abseil with these settings due to the missing dependencies.
so instead of inheriting the compile options from Seastar, just
set the sanitizer-related compile options directly, to avoid
referencing these missing dependencies.
the upside is that we pass a much smaller set of compile options
to the compiler when building Abseil; the downside is that we hardwire
these sanitizer-related options manually, even though they are also
detected by Seastar's build system. but fortunately, these options are
relatively stable across the build environments we support.
Fixes #19055
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit c436dfd2db)
Closes scylladb/scylladb#19064
Tablet allocation does not guarantee fairness of
the first replica in the replica set across DCs.
The lack of this fix causes the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
Use the tablet_map get_primary_replica* functions to get
the primary replica for each tablet, possibly within a DC.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 18df36d920)
The code already turns off `primary_replica_only`
if `!ranges_specified.empty()`, so there's no need to
check it again inside the per-tablet loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 009767455d)
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2de79c39dc)
This is required by repair, which will start using get_primary_replica
in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
In order to avoid per-table tablet load imbalance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.
Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.
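A sketch of the changed selection (illustrative only; the real load balancer code differs):
```cpp
#include <cstddef>
#include <iterator>
#include <random>
#include <unordered_set>

// Old behaviour: *candidates.begin(), which is biased by the hash layout and can
// keep picking tablets of the same table. New behaviour: pick uniformly at random.
// Assumes candidates is non-empty.
template <typename T>
const T& pick_random_candidate(const std::unordered_set<T>& candidates, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
    auto it = candidates.begin();
    std::advance(it, dist(rng));   // linear walk; fine for illustrating the idea
    return *it;
}
```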
For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions. One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2. Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has an overcommit of 1.65. Overcommit is calculated as the max load
(tablet count per node) divided by the perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3.
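To relate these numbers to the overcommit formula: the smaller table has 256 tablets with RF=2, i.e. 512 tablet replicas spread over the 7 nodes listed, so the perfect average load is 512/7 ≈ 73 tablets per node; the 171 tablets on the most loaded node give a node overcommit of 171/73 ≈ 2.34, and the post-patch worst case of 121 tablets gives 121/73 ≈ 1.65.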
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
Refs #16824
(cherry picked from commit 3be6120e3b)
(cherry picked from commit c9bcb5e400)
(cherry picked from commit 7b1eea794b)
(cherry picked from commit 603abddca9)
Refs #18885
Closes scylladb/scylladb#19036
* github.com:scylladb/scylladb:
tablets: load balancer: Use random selection of candidates when moving tablets
test: perf: Add test for tablet load balancer effectiveness
load_sketch: Extract get_shard_minmax()
load_sketch: Allow populating only for a given table
A tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss, because the
error is propagated back to the coordinator, which decides to
retry the streaming stage. So the consequence is a delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so it
shouldn't have drastic consequences.
Fixes #18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)
Closes scylladb/scylladb#19008
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes #18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closes scylladb/scylladb#19014
... and replace it with a boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets it
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)
Closes scylladb/scylladb#19012
In order to avoid per-table tablet load imbalance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.
Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.
For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions. One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2. Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has an overcommit of 1.65. Overcommit is calculated as the max load
(tablet count per node) divided by the perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3.
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
Refs #16824
(cherry picked from commit 603abddca9)
Update docs for backup procedure to use `DESC SCHEMA WITH INTERNALS`
instead of plain `DESC SCHEMA`.
Add a note to use a proper cqlsh version (at least 6.0.19).
Closes scylladb/scylladb#18953
(cherry picked from commit 5b4e688668)
A separate keyspace which also behaves as a system one brings
little benefit while creating some compatibility problems,
such as a schema digest mismatch during rollback. So we decided
to move the auth tables into the system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098
Closes scylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized when
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
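In rough sketch form (hypothetical, simplified types; not the actual view-building code), the fix boils down to this:
```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Toy stand-in: each ERM carries its own topology, here an endpoint-to-DC map.
using topology_view = std::unordered_map<std::string, std::string>;

std::optional<std::string> dc_of(const topology_view& topo, const std::string& endpoint) {
    auto it = topo.find(endpoint);
    if (it == topo.end()) {
        return std::nullopt;
    }
    return it->second;
}

// When pairing base and view replicas, look up each endpoint's DC in the topology
// of the ERM that endpoint came from. Before the fix the base topology was used
// for both, and a view endpoint unknown to it caused a crash.
bool same_dc(const topology_view& base_topology, const std::string& base_endpoint,
             const topology_view& view_topology, const std::string& view_endpoint) {
    auto base_dc = dc_of(base_topology, base_endpoint);
    auto view_dc = dc_of(view_topology, view_endpoint);   // was: dc_of(base_topology, ...)
    return base_dc && view_dc && *base_dc == *view_dc;
}
```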
Fixes: #17786
Fixes: #18709
(cherry-picked from 519317dc58)
This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.
Fixes: https://github.com/scylladb/scylladb/issues/18941
(cherry-picked from 0de3a5f3ff)
Closes scylladb/scylladb#18974
This commit adds an explanation of how the `nodetool describering` command
works if tablets are enabled.
(cherry picked from commit 888d7601a2)
Closes scylladb/scylladb#18981
This change supports changing the replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablet replicas, through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.
refs: scylladb/scylladb#16723
* br-backport-alter-ks-tablets:
test: Do not check tablets mutations on nodes that don't have them
test: Fix the way tablets RF-change test parses mutation_fragments
test/tablets: Unmark RF-changing test with xfail
docs: document ALTER KEYSPACE with tablets
Return response only when tablets are reallocated
cql-pytest: Verify RF changes by at most 1 when tablets on
cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
Reject ALTER with 'replication_factor' tag
Implement ALTER tablets KEYSPACE statement support
Parameterize migration_manager::announce by type to allow executing different raft commands
Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Allow query_processor to check if global topo queue is empty
Introduce new global topo `keyspace_rf_change` req
New raft cmd for both schema & topo changes
Add storage service to query processor
tablets: tests for adding/removing replicas
tablet_allocator: make load_balancer_stats_manager configurable by name
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the test changes RF from 2 to 3, the extra node executes a "rebuild"
transition, which means that it streams tablet replicas from two other
peers. When doing so, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks whether the extra
node received the mutations notices two mutation fragments on the new
replica and erroneously fails by seeing that RF=3 is not equal to the
number of mutations found, which is 4.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>