Commit Graph

4597 Commits

Author SHA1 Message Date
Michał Jadwiszczak
4e236f3392 cql3/statements/create_service_level: forbid creating SL starting with $
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.

(cherry picked from commit d729d1b272)

Closes scylladb/scylladb#20198
2024-08-19 16:20:48 +03:00
Piotr Smaron
b10bf17df7 tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.

(cherry picked from commit ddb5204929)
2024-08-14 10:37:59 +00:00
Piotr Smaron
d0ceaa01ee cql: refactor rf_change indentation
(cherry picked from commit 0ea2128140)
2024-08-14 10:37:59 +00:00
Piotr Smaron
352c159460 Prevent ALTERing non-existing KS with tablets
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
   to execute this req,
2. global topo req is executed by topo coordinator, which reads data
   attached to the req.

The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126

Fixes: #19576
(cherry picked from commit 5b089d8e10)
2024-08-14 10:37:59 +00:00
Kamil Braun
a56f7ce21a storage_service: raft topology: warn when raft_topology_cmd_handler fails due to abort
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.

Turn exceptions from aborts into WARN level.

Also improve logging by printing the command that failed.

Fixes scylladb/scylladb#19754

(cherry picked from commit 7506709573)

Closes scylladb/scylladb#20072
2024-08-08 18:14:24 +02:00
Kamil Braun
4948029666 raft topology: improve logging
Add more logging for raft-based topology operations in INFO and DEBUG
levels.

Improve the existing logging, adding more details.

Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).

Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.

These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).

Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)

Closes scylladb/scylladb#20049
2024-08-08 11:59:34 +03:00
Emil Maskovsky
b99d87863d raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-08-01 19:37:02 +02:00
Emil Maskovsky
0770069dda raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-01 19:36:00 +02:00
Gleb Natapov
c437c8be36 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-18 15:34:17 +00:00
Gleb Natapov
1c04b95c68 paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.

Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
2024-07-18 15:34:16 +00:00
Michael Litvak
815a707b0a storage_proxy: remove response handler if no targets
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.

Currently, in that case, a write response handler is created for the
request and it remains active until it timeouts because it is not
removed anywhere, even though the write is completed successfuly after
storing the hint. The response handler should be removed usually when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.

In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.

Fixes scylladb/scylladb#19529

(cherry picked from commit a9fdd0a93a)

Closes scylladb/scylladb#19680
2024-07-15 08:24:18 +02:00
Wojciech Przytuła
a7fe9eeffd storage_proxy: fix uninitialized LWT contention counter
When debugging the issue of high LWT contention metric, we (the
drivers team) discovered that at least 3 drivers (Go, Java, Rust)
cause high numbers in that metrics in LWT workloads - we doubted that
all those drivers route LWT queries badly. We tried to understand that
metric and its semantics. It took 3 people over 10 hours to figure out
what it is supposed to count.

People from core team suspected that it was the drivers sending
requests to different shards, causing contention. Then we ran the
workload against a single node single shard cluster... and observed
contention. Finally, we looked into the Scylla code and saw it.

**Uninitialized stack value.**

The core member was shocked. But we, the drivers people, felt we always
knew it. It's yet another time that we are blamed for a server-side
issue. We rebuilt scylla with the variable initialized to 0 and the
metric kept being 0.

To prevent such errors in the future, let's consider some lints that
warn against uninitialized variables. This is such an obvious feature
of e.g. Rust, and yet this has shown to be cause a painful bug in 2024.

Fixes: scylladb/scylladb#19654
(cherry picked from commit 36a125bf97)

Closes scylladb/scylladb#19657
2024-07-09 11:41:10 +02:00
Michael Litvak
ad6eb1cadf view: drain view builder before database
The view builder is doing write operations to the database.
In order for the view builder to shutdown gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.

Fixes scylladb/scylladb#18929

(cherry picked from commit 9d9318c564)

Closes scylladb/scylladb#19636
2024-07-08 19:16:26 +02:00
Pavel Emelyanov
78f3fc8890 tablet_allocator: Put more info into failed-to-drain exception
When balancer fails to find a node to balance drained tablets into, it
throws an exception with tablet id and node id, but it's also good to
know more details about the balancing state that lead to failure

refs: #19504

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c3d9831c5f)

Closes scylladb/scylladb#19619
2024-07-05 11:17:37 +03:00
Gleb Natapov
724ec62e22 test: add test that checks that local address cannot expire between join request placemen and its processing
(cherry picked from commit 3f136cf2eb)
2024-07-01 10:44:31 +00:00
Gleb Natapov
a6c5f8192d storage_service: make node's entry non expiring in raft address map
Local address map entry should never expire in the address map.

(cherry picked from commit 5d8f08c0d7)
2024-07-01 10:44:31 +00:00
Piotr Smaron
6a1e0489c6 cql: forbid switching from tablets to vnodes in ALTER KS
This check is already in place, but isn't fully working, i.e.
switching from a vnode KS to a tablets KS is not allowed, but
this check doesn't work in the other direction. To fix the
latter, `ks_prop_defs::get_initial_tablets()` has been changed
to handle 3 states: (1) init_tablets is set, (2) it was skipped,
(3) tablets are disabled. These couldn't fit into std::optional,
so a new local struct to hold these states has been introduced.
Callers of this function have been adjusted to set init_tablets
to an appropriate value according to the circumstances, i.e. if
tablets are globally enabled, but have been skipped in the CQL,
init_tablets is automatically set to 0, but if someone executes
ALTER KS and doesn't provide tablets options, they're inherited
from the old KS.
I tried various approaches and this one resulted in the least
lines of code changed. I also provided testcases to explain how
the code behaves.

Fixes: #18795
(cherry picked from commit 758139c8b2)

Closes scylladb/scylladb#19540
2024-06-28 17:58:35 +03:00
Patryk Jędrzejczak
e129c4ad43 join_token_ring, gossip topology: update obsolete comment
The code mentioned in the comment has already been added. We change
the comment to prevent confusion.

(cherry picked from commit bcc0a352b7)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
37bb6e0a43 join_token_ring, gossip topology: fix indendation after previous patch
(cherry picked from commit 7735bd539b)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
e5e8b970ed join_token_ring, gossip topology: recalculate sync nodes in wait_alive
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
of the `wait_alive` loop.

(cherry picked from commit 017134fd38)
2024-06-21 12:05:42 +00:00
Gleb Natapov
0e49180cef topology coordinator: add more trace level logging for debugging
Add more logging that provide more visibility into what happens during
topology loading.

Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>

(cherry picked from commit fb764720d3)
2024-06-18 16:38:51 +02:00
Wojciech Mitros
d70cf46af0 mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The seconds one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.

This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severly impact other work.

Fixes: scylladb/scylladb#19232
(cherry picked from commit d31437b589)

Closes scylladb/scylladb#19302
2024-06-17 09:32:29 +03:00
Benny Halevy
6122f9454d storage_service: join_token_ring: reject replace on different dc or rack
Do not allow replacing a node on one dc/rack
with a node on a different dc/rack as this violates
the assumption of replace node operation that
all token ranges previously owned by the dead
node would be rebuilt on the new node.

Fixes #16858
Refs scylladb/scylla-enterprise#3518

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 34dfa4d3a3)

Closes scylladb/scylladb#19281
2024-06-14 07:43:58 +03:00
Michał Chojnowski
80ac0da11c storage_proxy: avoid infinite growth of _throttled_writes
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to into _throttled_writes.

Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.

The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.

This patch is intended to be a good-enough-in-practice fix for the problem.

Fixes #17476
Fixes #1834

(cherry picked from commit fee48f67ef)

Closes scylladb/scylladb#19180
2024-06-11 18:33:38 +03:00
Gleb Natapov
45ff4d2c41 group0, topology coordinator: run group0 and the topology coordinator in gossiper scheduling group
Currently they both run in streaming group and it may become busy during
repair/mv building and affect group0 functionality. Move it to the
gossiper group where it should have more time to run.

Fixes #18863

(cherry picked from commit a74fbab99a)

Closes scylladb/scylladb#19175
2024-06-10 10:34:29 +02:00
Piotr Dulikowski
e04378fdf0 Merge ' [Backport 6.0] db/hints: Use host ID to IP mappings to choose the ep manager to drain when node is leaving' from Dawid Mędrek
In [d0f5873](d0f58736c8), we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.

This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.

We also introduce error injections to be able to test that it indeed happens.

Fixes scylladb/scylladb#18761

(cherry picked from commit [745a9c6](745a9c6ab8))

(cherry picked from commit [e855794](e855794327))

Refs scylladb/scylladb#18764

Closes scylladb/scylladb#19114

* github.com:scylladb/scylladb:
  db/hints: Introduce an error injection to test draining
  db/hints: Ensure that draining happens
2024-06-10 09:11:07 +02:00
Tomasz Grabiec
f8243cbf19 Merge '[Backport 6.0] Serialize repair with tablet migration' from ScyllaDB
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.

Fixes #17658.
Fixes #18561.

(cherry picked from commit 6c64cf33df)

(cherry picked from commit 1513d6f0b0)

(cherry picked from commit 476c076a21)

(cherry picked from commit c45ce41330)

(cherry picked from commit e97acf4e30)

(cherry picked from commit 98323be296)

(cherry picked from commit 5ca54a6e88)

 Refs #18641

Closes scylladb/scylladb#19144

* github.com:scylladb/scylladb:
  test: pylib: Do not block async reactor while removing directories
  repair: Exclude tablet migrations with tablet repair
  repair_service: Propagate topology_state_machine to repair_service
  main, storage_service: Move topology_state_machine outside storage_service
  storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
  tablet_scheduler: Make disabling of balancing interrupt shuffle mode
  tablet_scheduler: Log whether balancing is considered as enabled
2024-06-09 00:20:44 +02:00
Kefu Chai
50d8fa6b77 topology_coordinator: handle/wait futures when stopping topology_coordinator
before this change, unlike other services in scylla,
topology_coordinator is not properly stopped when it is aborted,
because the scylla instance is no longer a leader or is being shut down.
its `run()` method just stops the grand loop and bails out before
topology_coordinator is destroyed. but we are tracking the migration
state of tablets using a bunch of futures, which might not be
handled yet, and some of them could carry failures. in that case,
when the `future` instances with failure state get destroyed,
seastar calls `report_failed_future`. and seastar considers this
practice a source a bug -- as one just fails to handle an error.
that's why we have following error:

```
WARN  2024-05-19 23:00:42,895 [shard 0:strm] seastar - Exceptional future ignored: seastar::rpc::unknown_verb_error (unknown verb), backtrace: /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c14e /home/bhalevy/.ccm/scylla-repository/local_tarball/libre
loc/libseastar.so+0x56c770 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56ca58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x38c6ad 0x29cdd07 0x29b376b 0x29a5b65 0x108105a /home/bhalevy/.ccm/scylla-repository/local_tarbal
l/libreloc/libseastar.so+0x3ff1df /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x400367 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff838 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36de58
 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d092 0x1017cba 0x1055080 0x1016ba7 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a 0x1015524
```
and the backtrace looks like:
```
seastar::current_backtrace_tasklocal() at ??:?
seastar::current_tasktrace() at ??:?
seastar::current_backtrace() at ??:?
seastar::report_failed_future(seastar::future_state_base::any&&) at ??:?
service::topology_coordinator::tablet_migration_state::~tablet_migration_state() at topology_coordinator.cc:?
service::topology_coordinator::~topology_coordinator() at topology_coordinator.cc:?
service::run_topology_coordinator(seastar::sharded<db::system_distributed_keyspace>&, gms::gossiper&, netw::messaging_service&, locator::shared_token_metadata&, db::system_keyspace&, replica::database&, service::raft_group0&, service::topology_state_machine&, seastar::abort_source&, raft::server&, seastar::noncopyable_function<seastar::future<service::raft_topology_cmd_result> (utils::tagged_tagged_integer<raft::internal::non_final, raft::term_tag, unsigned long>, unsigned long, service::raft_topology_cmd const&)>, service::tablet_allocator&, std::chrono::duration<long, std::ratio<1l, 1000l> >, service::endpoint_lifecycle_notifier&) [clone .resume] at topology_coordinator.cc:?
seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at main.cc:?
seastar::reactor::run_some_tasks() at ??:?
seastar::reactor::do_run() at ??:?
seastar::reactor::run() at ??:?
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ??:?
```

and even worse, these futures are indirectly owned by `topology_coordinator`.
so there are chances that they could be used even after `topology_coordinator`
is destroyed. this is a use-after-free issue. because the
`run_topology_coordinator` fiber exits when the scylla instance retires
from the leader's role, this use-after-free could be fatal to a
running instance due to undefined behavior of use after free.

so, in this change, we handle the futures in `_tablets`, and note
down the failures carried by them if any.

Fixes #18745
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 4a36918989)

Closes scylladb/scylladb#19139
2024-06-07 07:00:25 +03:00
Tomasz Grabiec
e518bb68b2 main, storage_service: Move topology_state_machine outside storage_service
It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service

(cherry picked from commit c45ce41330)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
af2caeb2de storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
Will be used later in a place which doesn't have access to storage_service
but has to toplogy_state_machine.

It's not necessary to start group0 operation around polling because
the busy() state can be checked atomically and if it's false it means
the topology is no longer busy.

(cherry picked from commit 476c076a21)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
d5ebfea1ff tablet_scheduler: Make disabling of balancing interrupt shuffle mode
Tests will rely on that, they will run in shuffle mode, and disable
balancing around section which otherwise would be infinitely blocked
by ongoing shuffling (like repair).

(cherry picked from commit 1513d6f0b0)
2024-06-06 13:01:18 +00:00
Tomasz Grabiec
3fec9e1344 tablet_scheduler: Log whether balancing is considered as enabled
(cherry picked from commit 6c64cf33df)
2024-06-06 13:01:18 +00:00
Kamil Braun
5d3dde50f4 Merge '[Backport 6.0] Fail bootstrap if ip mapping is missing during double write stage' from ScyllaDB
If a node restart just before it stores bootstrapping node's IP it will
not have ID to IP mapping for bootstrapping node which may cause failure
on a write path. Detect this and fail bootstrapping if it happens.

(cherry picked from commit 1faef47952)

(cherry picked from commit 27445f5291)

(cherry picked from commit 6853b02c00)

(cherry picked from commit f91db0c1e4)

 Refs #18927

Closes scylladb/scylladb#19118

* github.com:scylladb/scylladb:
  raft topology: fix indentation after previous commit
  raft topology: do not add bootstrapping node without IP as pending
  test: add test of bootstrap where the coordinator crashes just before storing IP mapping
  schema_tables: remove unused code
2024-06-06 11:35:13 +02:00
Gleb Natapov
e11827f37e raft topology: fix indentation after previous commit
(cherry picked from commit f91db0c1e4)
2024-06-05 13:55:29 +00:00
Gleb Natapov
0acfc223ab raft topology: do not add bootstrapping node without IP as pending
If there is no mapping from host id to ip while a node is in bootstrap
state there is no point adding it to pending endpoint since write
handler will not be able to map it back to host id anyway. If the
transition sate requires double writes though we still want to fail.
In case the state is write_both_read_old we fail the barrier that will
cause topology operation to rollback and in case of write_both_read_new
we assert but this should not happen since the mapping is persisted by
this point (or we failed in write_both_read_old state).

Fixes: scylladb/scylladb#18676
(cherry picked from commit 6853b02c00)
2024-06-05 13:55:28 +00:00
Gleb Natapov
c53cd98a41 test: add test of bootstrap where the coordinator crashes just before storing IP mapping
On the next boot there is no host ID to IP mapping which causes node to
crash again with "No mapping for :: in the passed effective replication map"
assertion.

(cherry picked from commit 27445f5291)
2024-06-05 13:55:28 +00:00
Botond Dénes
341c29bd74 Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.

Fixes https://github.com/scylladb/scylladb/issues/18085.

(cherry picked from commit abcc68dbe7)

(cherry picked from commit 551bf9dd58)

(cherry picked from commit e7246751b6)

Refs https://github.com/scylladb/scylladb/pull/18287

Closes scylladb/scylladb#19095

* github.com:scylladb/scylladb:
  topology_experimental_raft/test_tablets: restore usage of check_with_down
  test: Fix flakiness in topology_experimental_raft/test_tablets
  service: Use tablet read selector to determine which replica to account table stats
  storage_service: Fix race between tablet split and stats retrieval
2024-06-05 13:06:32 +03:00
Dawid Medrek
82d635b6a7 db/hints: Ensure that draining happens
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.

Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.

Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.

(cherry picked from commit 745a9c6ab8)
2024-06-04 14:42:08 +00:00
Benny Halevy
21f87c9cfa locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair when it will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
2024-06-03 19:50:39 +00:00
Botond Dénes
a38d5463ef Merge '[Backport 6.0] tablets: load balancer: Use random selection of candidates when moving tablets' from ScyllaDB
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and another one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.

Refs #16824

(cherry picked from commit 3be6120e3b)

(cherry picked from commit c9bcb5e400)

(cherry picked from commit 7b1eea794b)

(cherry picked from commit 603abddca9)

 Refs #18885

Closes scylladb/scylladb#19036

* github.com:scylladb/scylladb:
  tablets: load balancer: Use random selection of candidates when moving tablets
  test: perf: Add test for tablet load balancer effectiveness
  load_sketch: Extract get_shard_minmax()
  load_sketch: Allow populating only for a given table
2024-06-03 12:25:05 +03:00
Pavel Emelyanov
62a23fd86a config: Remove experimental TABLETS feature
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)

Closes scylladb/scylladb#19012
2024-06-03 12:16:41 +03:00
Tomasz Grabiec
b9c88fdf4b tablets: load balancer: Use random selection of candidates when moving tablets
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and the one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.

Refs #16824

(cherry picked from commit 603abddca9)
2024-06-02 22:40:46 +00:00
Marcin Maliszkiewicz
cbf47319c1 db: auth: move auth tables to system keyspace
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769

(cherry picked from commit 2ab143fb40)

[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
2024-06-02 21:41:14 +03:00
Wojciech Mitros
3c47ab9851 mv: handle different ERMs for base and view table
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

(cherry-picked from 519317dc58)

This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.

Fixes: https://github.com/scylladb/scylladb/issues/18941

(cherry-picked from 0de3a5f3ff)

Closes scylladb/scylladb#18974
2024-05-30 09:13:31 +02:00
Piotr Smaron
e04964ba17 Return response only when tablets are reallocated
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
2024-05-30 08:33:26 +03:00
Piotr Smaron
544c424e89 Implement ALTER tablets KEYSPACE statement support
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
2024-05-30 08:33:25 +03:00
Piotr Smaron
73b59b244d Parameterize migration_manager::announce by type to allow executing different raft commands
Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instances of
each template exists across the whole code base, to avoid compiling it
in each translation unit.
2024-05-30 08:33:15 +03:00
Piotr Smaron
885c7309ee Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
2024-05-30 08:33:15 +03:00
Piotr Smaron
adfad686b3 Allow query_processor to check if global topo queue is empty
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
2024-05-30 08:33:15 +03:00
Piotr Smaron
1a70db17a6 Introduce new global topo keyspace_rf_change req
It will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).
2024-05-30 08:33:15 +03:00