Commit Graph

58 Commits

Author SHA1 Message Date
Wojciech Mitros
cde14a5788 mv: make the view update backlog unmofidiable
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.

In this patch the the default constructor is deleted and a check
is added to make sure that the max is different than 0 is added
to its constructor - if the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlogs members private, exposing them
only through const getters.
2024-06-19 19:44:57 +02:00
Wojciech Mitros
d31437b589 mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The seconds one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.

This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severly impact other work.

Fixes: scylladb/scylladb#19232

Closes scylladb/scylladb#19268
2024-06-14 11:24:20 +02:00
Wojciech Mitros
f70f774e40 mv: gossip the same backlog if a different backlog was sent in a response
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.
This patch changes this by notifying the gossip that a the backlog changed
since the last gossip round so a different backlog could have been send
through the response piggyback mechanism. With that information, gossip
will send an unchanged backlog to other nodes in the following gossip round.

Fixes: https://github.com/scylladb/scylladb/issues/18461
2024-06-06 10:45:15 +02:00
Wojciech Mitros
5154429713 mv gossip: check errno instead of value returned by strtoull
Currently, when a view update backlog is changed and sent
using gossip, we check whether the strtoll/strtoull
function used for reading the backlog returned
LLONG_MAX/ULLONG_MAX, signaling an error of a value
exceeding the type's limit, and if so, we do not store
it as the new value for the node.

However, the ULLONG_MAX value can also be used as the max
backlog size when sending empty backlogs that were never
updated. In theory, we could avoid sending the default
backlog because each node has its real backlog (based on
the node's memory, different than the ULLONG_MAX used in
the default backlog). In practice, if the node's
backlog changed to 0, the backlog sent by it will be
likely the default backlog, because when selecting
the biggest backlog across node's shards, we use the
operator<=>(), which treats the default backlog as
equal to an empty backlog and we may get the default
backlog during comparison if the backlog of some shard
was never changed (also it's the initial max value
we compare shard's backlogs against).

This patch removes the (U)LLONG_MAX check and replaces
it with the errno check, which is also set to ERANGE during
the strtoll error, and which won't prevent empty backlogs
from being read

Fixes: #18462

Closes scylladb/scylladb#18560
2024-05-15 07:14:36 +02:00
Piotr Dulikowski
21d5d4e15c misc_services: fix data race from bad usage of get_next_version
The function `gms::version_generator::get_next_version()` can only be
called from shard 0 as it uses a global, unsynchronized counter to
issue versions. Notably, the function is used as a default argument for
the constructor of `gms::versioned_value` which is used from shorthand
constructors such as `versioned_value::cache_hitrates`,
`versioned_value::schema` etc.

The `cache_hitrate_calculator` service runs a periodic job which
updates the `CACHE_HITRATES` application state in the local gossiper
state. Each time the job is scheduled, it runs on the next shard (it
goes through shards in a round-robin fashion). The job uses the
`versioned_value::cache_hitrates` shorthand to create a
`versioned_value`, therefore risking a data race if it is not currently
executing on shard 0.

Fix the race by constructing the versioned value on shard 0.
2024-02-23 12:54:32 +01:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Botond Dénes
f4f724921c load_meter: get_load_map(): don't unconditionally dereference _lb
Said method has a check on `_lb` not being null, before accessing it.
However, since 0e5754a, there was an unconditional access, adding an
entry for the local node. Move this inside the if, so it is covered by
the null-check. The only caller is the api (probably nodetool), the
worst that can happend is that they get completely empty load-map if
they call too early during startup.

Fixes: #16617

Closes scylladb/scylladb#16659
2024-01-09 16:02:12 +03:00
Benny Halevy
ad8a9104d8 endpoint_state subscriptions: batch on_change notification
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.

In particular, thise allows storage_service::on_change
to update_peer_info once for all changed states.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
0e5754adc6 misc_services: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 10:01:36 +02:00
Kamil Braun
39ca07c49b Merge 'Gossiper endpoint locking' from Benny Halevy
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.

We make sure that all notifications (expect for `before_change`, that
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.

An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.

Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471

Closes #14845

* github.com:scylladb/scylladb:
  gossiper: replicate: ensure non-null permit
  gossiper: add_saved_endpoint: lock_endpoint
  gossiper: mark_as_shutdown: lock_endpoint
  gossiper: real_mark_alive: lock_endpoint
  gossiper: advertise_token_removed: lock_endpoint
  gossiper: do_status_check: lock_endpoint
  gossiper: remove_endpoint: lock_endpoint if needed
  gossiper: force_remove_endpoint: lock_endpoint if needed
  storage_service: lock_endpoint when removing node
  gossiper: use permit_id to serialize state changes while preventing deadlocks
  gossiper: lock_endpoint: add debug messages
  utils: UUID: make default tagged_uuid ctor constexpr
  gossiper: lock_endpoint must be called on shard 0
  gossiper: replicate: simplify interface
  gossiper: mark_as_shutdown: make private
  gossiper: convict: make private
  gossiper: mark_as_shutdown: do not call convict
2023-08-02 13:50:08 +02:00
Benny Halevy
f74d154fe3 gossiper: use permit_id to serialize state changes while preventing deadlocks
Pass permit_id to subscribers when we acquire one
via lock_endpoint.  The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:41:57 +03:00
Aleksandra Martyniuk
c5cad803b3 replica: add methods to get a filtered copy of tables map 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
6796721c3d replica: add methods to get table or table id 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by reader mode of rwlock so that iterators won't
get invalid.
2023-07-25 17:13:04 +02:00
Aleksandra Martyniuk
52afd9d42d replica: wrap column families related maps into tables_metadata
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
2023-07-25 16:13:00 +02:00
Pavel Emelyanov
07b7e9faf1 load-meter: Remove unused get_load_string
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13873
2023-05-15 09:21:08 +03:00
Benny Halevy
c5d819ce60 gms: versioned_value: make members private
and provide accessor functions to get them.

1. So they can't be modified by mistake, as the versioned value is
   immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Asias He
39db15d2cb misc_services: Fix cache hitrate update
This patch avoids unncessary CACHE_HITRATES updates through gossip.

After this patch:

Publish CACHE_HITRATES in case:

- We haven't published it at all
- The diff is bigger than 1% and we haven't published in the last 5 seconds
- The diff is really big 10%

Note: A peer node can know the cache hitrate through read_data
read_mutation_data and read_digest RPC verbs which have cache_temperature in
the response. So there is no need to update CACHE_HITRATES through gossip in
high frequency.

We do the recalculation faster if the diff is bigger than 0.01. It is useful to
do the calculation even if we do not publish the CACHE_HITRATES though gossip,
since the recalculation will call the table->set_global_cache_hit_rate to set
the hitrate.

Fixes #5971

Closes #11079
2022-07-21 11:31:30 +03:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
5dcfb94d5a gms: i_endpoint_state_change_subscriber: make callbacks to return futures
Coroutinize a few simple callbacks in the process.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
e2a3f974d6 service: misc_services: fix lowres_clock::duration assumption
recalculate_hitrates() is defined to return future<lowres_clock::duration>
but actually returns future<std::chrono::milliseconds>. This fails
if the two types are not the same.

Fix by returning the declared type.
2021-12-28 21:15:40 +02:00
Pavel Emelyanov
770d34796b cache_hitrate_calculator: Keep reference on gossiper
The calculator needs to update its app-state on gossiper. Keeping
a reference is safe -- gossiper starts early, the calculator -- at
the very very end, stop in reverse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:27 +03:00
Avi Kivity
222ef17305 build, treewide: enable -Wredundant-move
Returning a function parameter guarantees copy elision and does not
require a std::move().  Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.

Closes #9004
2021-07-11 12:53:02 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Emelyanov
5ecbc33be5 database.*: Remove unused headers
The database.hh is the central recursive-headers knot -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
2021-04-18 14:03:17 +03:00
Asias He
eedcee7f31 gossip: Reduce unncessary VIEW_BACKLOG updates
The blacklog of current and max in VIEW_BACKLOG is not update but the
nodes are updating VIEW_BACKLOG all the time. For example:

```
INFO  2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO  2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO  2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO  2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO  2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO  2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO  2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO  2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```

The downside of such updates:

 - Introduces more gossip exchange traffic
 - Updates system.peers all the time

The extra unnecessary gossip traffic is fine to a cluster in a good
shape but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busy.

With this patch, VIEW_BACKLOG is updated only when the backlog is really
updated.

Btw, we can even make the update only when the change of the backlog is
great than a threshold, e.g., 5%, which can reduce the traffic even
further.

Fixes #5970
2020-09-29 13:37:37 +03:00
Pavel Emelyanov
67d5fad65f storage_service: Remove some inclusions of its header
GC pass over .cc files. Some really do not need it, some need for features/gossiper

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Kamil Braun
1f7290a0ff versioned_value: remove versioned_value::factory class
If there was a Most Useless Abstraction award, this would be a good
candidate.
2020-04-20 12:57:16 +02:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Pavel Emelyanov
b11cf6e950 cql3/query_processor.hh: Debloat from other headers
This gives ~30% less (251 jobs -> 181 jobs) recompile when touching it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200212225828.3374-1-xemul@scylladb.com>
2020-02-16 11:22:30 +02:00
Rafael Ávila de Espíndola
845116dfaf gossiper: Store subscribers in an atomic_vector
The new guarantees are a bit better IMHO:

Once a subscriber is removed, it is never notified. This was not true
in the old code since it would iterate over a copy that would still
have that subscriber.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
c62a33965d load_broadcaster: Unregister from load_broadcaster::stop_broadcasting
This is in preparation for unregistration returning a future.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Rafael Ávila de Espíndola
9d4cf25c84 load_meter: Use a shared_ptr to store a load_broadcaster
load_broadcaster::stop_broadcasting uses shared_from_this(). Since
that is the only reference that the produced shared_ptr knows of, it
is deleted immediately. Fix that by also using a shared_ptr in
load_meter.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-01-22 08:16:03 -08:00
Pavel Emelyanov
148da64a7e storage_servce: Move load_broadcaster away
This simplifies the storage_service API and fixes the
complain about shared_ptr usage instead of unique_ptr.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-13 13:55:09 +03:00
Pavel Emelyanov
b6e1e6df64 misc_services: Introduce load_meter
There's a lonely get_load_map() call on storage_service that
needs only load broadcaster, always runs on shard 0 and that's it.

Next patch will move this whole stuff into its own helper no-shard
container and this is preparation for this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-01-13 13:53:08 +03:00
Pavel Emelyanov
f2b20e7083 cache_hitrate_calculator: Do not reinvent the peering_sharded_service
The class in question wants to run its own instances on different
shards, for this sake it keeps reference on sharded self to call
invoke_on() on. There's a handy peering_sharded_service<> in seastar
for the same, using it makes the code nicer and shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191226112401.23960-1-xemul@scylladb.com>
2020-01-03 15:48:19 +02:00
Botond Dénes
136fc856c5 treewide: silence discarded future warnings for questionable discards
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
2019-08-26 19:28:43 +03:00
Botond Dénes
fddd9a88dd treewide: silence discarded future warnings for legit discards
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
2019-08-26 18:54:44 +03:00
Gleb Natapov
00c4078af3 cache_hitrate_calculator: do not ignore a future returned from gossiper::add_local_application_state
We should wait for a future returned from add_local_application_state() to
resolve before issuing new calculation, otherwise two
add_local_application_state() may run simultaneously for the same state.

Fixes #4838.

Message-Id: <20190812082158.GE17984@scylladb.com>
2019-08-13 11:48:38 +03:00
Gleb Natapov
31bf4cfb5e cache_hitrate_calculator: make cache hitrate calculation preemptable
The calculation is done in a non preemptable loop over all tables, so if
numbers of tables is very large it may take a while since we also build
a string for gossiper state. Make the loop preemtable and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
2019-05-16 15:32:36 +02:00
Gleb Natapov
4517c56a57 cache_hitrate_calculator: do not copy stats map for each cpu
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
2019-05-16 15:32:24 +02:00
Gleb Natapov
c6b3b9ff13 cache_hitrate_calculator: wait for ongoing calculation to complete during stop
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, that also used by the calculation, is never destroyed. But the
later is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make cache hitrate calculation service
ready for it.

Message-Id: <20190422113538.GR21208@scylladb.com>
2019-04-22 14:44:42 +03:00
Gleb Natapov
306f5b99b5 cache_hitrate_calculator: fix use after free in non_system_filter lambda
non_system_filter lambda is defined static which means it is initialized
only once, so the 'this' that is will capture will belong to a shard
where the function runs first. During service destruction the function
may run on different shard and access already other's shard service that
may be already freed.

Fixed #4425

Message-Id: <20190421152139.GN21208@scylladb.com>
2019-04-21 18:22:31 +03:00
Avi Kivity
f02c64cadf streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
2019-01-05 17:33:25 +02:00