Commit Graph

53 Commits

Author SHA1 Message Date
Gleb Natapov
807e37502a db/consistency_level: do not use result from heat weighted load balancer if it contains duplicates
Because of https://github.com/scylladb/scylladb/issues/9285 the heat weighted
load balancer may sometimes return the same node twice. This may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is rare and not easy to fix, let's introduce a
workaround: check the result for duplicates and fall back to the non-HWLB
one if any are found.

Fixes scylladb/scylladb#20430

Closes scylladb/scylladb#20414
2024-09-05 15:21:35 +03:00
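
The workaround in 807e37502a can be sketched roughly as follows. This is a minimal illustration, assuming endpoints are plain strings; `has_duplicates` and `pick_replicas` are hypothetical names and not the functions in db/consistency_level.cc.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a replica list; the real code uses endpoint types.
using replica_list = std::vector<std::string>;

// Returns true if any endpoint appears more than once.
// Takes the list by value so sorting does not disturb the caller's order.
bool has_duplicates(replica_list v) {
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}

// Use the heat-weighted result unless it contains a duplicate,
// in which case fall back to the non-HWLB selection.
replica_list pick_replicas(const replica_list& hwlb_result, const replica_list& fallback) {
    return has_duplicates(hwlb_result) ? fallback : hwlb_result;
}
```

Sorting a copy keeps the check simple; for the handful of replicas involved in a read, the cost is negligible.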
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map,
optional and variant, using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Kefu Chai
be364d30fd db: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16664
2024-01-09 11:44:19 +02:00
Benny Halevy
4c20b84680 db/consistency_level: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.

Closes #14607
2023-07-10 19:36:06 +02:00
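
As an illustration of the annotation 0cabf4eeb9 refers to, here is a small hypothetical switch (not taken from the ScyllaDB tree). With a warning such as `-Wimplicit-fallthrough` treated as an error (the exact flag is an assumption; the commit message does not name it), the fall from `case 2` into `case 1` is accepted only because it is annotated.

```cpp
// Hypothetical example function, for illustration only.
int score(int level) {
    int total = 0;
    switch (level) {
    case 2:
        total += 10;
        [[fallthrough]]; // intended: level 2 also receives the level 1 bonus
    case 1:
        total += 1;
        break;
    default:
        break;
    }
    return total;
}
```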
Kamil Braun
0e36377f56 db: consistency_level: remove overload of filter_for_query
Not used anymore after the previous commit.
2023-06-14 11:41:36 +02:00
Petr Gusev
052b91fb1f storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
We are going to use remapped_endpoints_for_reading, so we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need: it's used in all read code paths.
From its name, however, this was not obvious.

Also, we add the parameter ks_name as we'll need it
to pass to remapped_endpoints_for_reading.
2023-05-09 18:42:03 +04:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Pavel Emelyanov
64c9359443 storage_proxy: Don't use default-initialized endpoint in get_read_executor()
After calling filter_for_query(), the extra_replica to speculate to may
be left default-initialized, which is the :0 IPv6 address. Later on, this
address is used as-is to check whether it belongs to the same DC, which
is not nice, as :0 is not an address of any existing endpoint.

Recent move of dc/rack data onto topology made this place reveal itself
by emitting the internal error due to :0 not being present on the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to "default_dc" datacenter which may or may
not match with the dc of the local node.

The fix is to explicitly distinguish a set extra_replica from an unset one.

fixes: #11825

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11833
2022-10-25 09:16:50 +03:00
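
One way to express "set versus unset" for the extra replica in 64c9359443 is an optional value. The sketch below is an assumption about the shape of such a fix, not the actual patch; `endpoint` and `extra_replica_is_local` are hypothetical names.

```cpp
#include <optional>
#include <string>

// Hypothetical type; the real code works with endpoint addresses and topology.
struct endpoint {
    std::string dc;
};

// With std::optional, "no extra replica chosen" is distinct from any address,
// so the DC check is only made for a replica that was actually set.
bool extra_replica_is_local(const std::optional<endpoint>& extra, const std::string& local_dc) {
    return extra.has_value() && extra->dc == local_dc;
}
```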
Avi Kivity
46bd0b1e62 consistency_level: accept effective_replication_map as parameter, rather than keyspace
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).

Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.

Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
2022-08-11 17:58:42 +03:00
Avi Kivity
1078d1bfda consistency_level: be more const when using replication_strategy
We don't modify the replication_strategy here, so use const. This
will help when the object we get is const itself, as it will be in
the next patches.
2022-08-11 17:58:42 +03:00
Botond Dénes
fbbe2529c1 Merge "Remove global snitch usage from consistency_level.cc" from Pavel Emelyanov
"
There are several helpers in this .cc file that need to get the datacenter
for endpoints. For that they use the global snitch, because there's no other
place to get that data from.

The whole dc/rack info is now moving to topology, so this set patches
consistency_level.cc to get the topology. This is done in two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.

Two difficult cases are db::is_local() and db.count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"

* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
  consistency_level: Remove is_local() and count_local_endpoints()
  storage_proxy: Use topology::local_endpoints_count()
  storage_proxy: Use proxy's topology for DC checks
  storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
  storage_proxy: Use topology local_dc_filter in its methods
  storage_proxy: Mark some digest_read_resolver methods private
  forwarding_service: Use topology local_dc_filter
  storage_service: Use topology local_dc_filter
  consistency_level: Use topology local_dc_filter
  consitency-level: Call count_local_endpoints from topology
  consistency_level: Get datacenter from topology
  replication_strategy: Remove hold snitch reference
  effective_replication_map: Get datacenter from topology
  topology: Add local-dc detection shugar
2022-08-05 13:31:55 +03:00
Pavel Emelyanov
c3718b7a6e consistency_level: Remove is_local() and count_local_endpoints()
No code uses them now (it switched to use topology), so these two can be
dropped together with their calls to the global snitch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
0da8caba1d consistency_level: Use topology local_dc_filter
The filter_for_query() helper has keyspace at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
de58b33eee consitency-level: Call count_local_endpoints from topology
Similar to previous patch, in those places with keyspace object at
hand the topology can be obtained from ks' replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
f84ee8f0fb consistency_level: Get datacenter from topology
In some of db/consistency_level.cc helpers the topology can be
obtained from keyspace's effective replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Kamil Braun
a9fd156a1b db: consistency_level: filter_for_query: take const gossiper&
2022-08-04 12:19:38 +02:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Pavel Emelyanov
11c99fc41b table: Don't use global gossiper
The table::get_hit_rate needs gossiper to get hit rate state from.
There's no way to carry a gossiper reference on the table itself, so it's
up to the callers of that method to provide it. Fortunately, there's
only one caller -- the proxy -- but the call chain to carry the
reference is not very short ... oh, well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:33:08 +03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Benny Halevy
4d2561ff75 abstract_replication_strategy: precacluate get_replication_factor for effective_replication_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Avi Kivity
222ef17305 build, treewide: enable -Wredundant-move
Returning a function parameter is treated as an implicit move and does not
require a std::move().  Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.

Closes #9004
2021-07-11 12:53:02 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
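
The arithmetic in 4d70f3baee can be checked with a short sketch. The figure of 15 replicas is taken from the message; reading it as 5 datacenters with 3 replicas each is an assumption, and `compares` is a hypothetical helper name.

```cpp
#include <cstddef>

// Search-and-erase over n replicas, repeated for every response,
// costs n * (n - 1) / 2 compares in total.
constexpr std::size_t compares(std::size_t n) {
    return n * (n - 1) / 2;
}

// 15 replicas (assumed: 5 datacenters x 3 replicas each) give 105 compares,
// i.e. 7 compares per response on average.
static_assert(compares(15) == 105);
static_assert(compares(15) / 15 == 7);
```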
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node with
   the perfect hit rate limited to 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
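
The numbers in b6b4df9a47 follow from a simple model, sketched here under a stated assumption: a node's share of requests is proportional to 1 / (miss rate), with the hit rate capped at max_hit_rate. This is a simplification for illustration, not the actual HWLB code; `weight` and `cold_to_hot_ratio` are hypothetical names.

```cpp
#include <algorithm>

// Simplified model (an assumption): weight is the inverse of the miss rate,
// computed from the hit rate after capping it at max_hit_rate.
double weight(double hit_rate, double max_hit_rate) {
    double capped = std::min(hit_rate, max_hit_rate);
    return 1.0 / (1.0 - capped);
}

// Work the first node receives relative to the second one.
double cold_to_hot_ratio(double cold_hit, double hot_hit, double max_hit_rate) {
    return weight(cold_hit, max_hit_rate) / weight(hot_hit, max_hit_rate);
}
```

Under this model, a cold node (hit rate 0.0) next to a fully warm one gets about 5% of the warm node's work with a cap of 0.95 and about 0.1% with a cap of 0.999, while two nodes at 0.98 and 0.99 are weighted equally once the cap is 0.95.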
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day
2021-06-06 19:18:49 +03:00
Avi Kivity
c71d007797 consistency_level: deinline assure_sufficient_live_nodes()
assure_sufficient_live_nodes() is a huge template calling other
huge templates, and requires "network_topology_strategy.hh". It is
inlined in consistency_level.hh. This increases compile time and recompiles.

Move the template out-of-line and use "extern template" to instantiate it.
This is not ideal as new callers would require updates to the
instantiated signatures, but I think our goal should be to de-template
it completely instead. Meanwhile, this reduces some pain.

Ref #1.

Closes #8637
2021-05-19 15:03:51 +03:00
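
The "extern template" technique mentioned in c71d007797 can be sketched in a single file as below. The function name and signature are hypothetical; in a real tree the first part would sit in the header and the second part in the .cc file.

```cpp
#include <cstddef>
#include <vector>

// --- would live in the header (hypothetical names) ---
template <typename Range>
bool has_enough_live(const Range& live, std::size_t need);

// Explicit instantiation declaration: translation units including the
// header do not instantiate the template themselves.
extern template bool has_enough_live<std::vector<int>>(const std::vector<int>&, std::size_t);

// --- would live in the .cc file ---
template <typename Range>
bool has_enough_live(const Range& live, std::size_t need) {
    return live.size() >= need;
}

// Explicit instantiation definition: the one place the code is generated.
template bool has_enough_live<std::vector<int>>(const std::vector<int>&, std::size_t);
```

The drawback noted in the commit message is visible here: a caller with a new range type would need another pair of instantiation lines.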
Avi Kivity
e9802348b5 storage_proxy, treewide: use utils::small_vector inet_address_vector:s
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.

In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
2021-05-05 18:36:54 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Eliran Sinvani
925cdc9ae1 consistency level: fix wrong quorum calculation whe RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0; in this situation quorum should
also be 0.
This commit adds the missing corner case.

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296
2020-09-29 13:25:41 +03:00
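
The corrected rule from 925cdc9ae1 amounts to the following sketch; `quorum_for_rf` is an illustrative name, not the signature used in the source tree.

```cpp
#include <cstddef>

// Quorum is (rf / 2) + 1, except that RF = 0 requires no replicas at all.
constexpr std::size_t quorum_for_rf(std::size_t rf) {
    return rf ? (rf / 2) + 1 : 0;
}
```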
Konstantin Osipov
18b9bb57ac lwt: rename metrics to match accepted terminology
Rename inherited metrics cas_propose and cas_commit
to cas_accept and cas_learn respectively.

A while ago we made a decision to stick to widely accepted
terms for Paxos rounds: prepare, accept, learn. The rest
of the code is using these terms, so rename the metrics
to avoid confusion/technical debt.

While at it, rename a few internal methods and functions.

Fixes #6169

Message-Id: <20200414213537.129547-1-kostja@scylladb.com>
2020-04-15 12:20:30 +02:00
Vladimir Davydov
25aeefd6f3 cql: fix CAS consistency level validation
This patch resurrects Cassandra's code validating a consistency level
for CAS requests. Basically, it makes CAS requests use a special
function instead of validate_for_write to make error messages more
coherent.

Note, we don't need to resurrect requireNetworkTopologyStrategy as
EACH_QUORUM should work just fine for both CAS and non-CAS writes.
Looks like it is just an artefact of a rebase in the Cassandra
repository.
2019-11-14 12:15:39 +01:00
Konstantin Osipov
e555dc502e lwt: implement basic lightweight transactions support
Support single-statement conditional updates as well as batches.

This patch almost fully rewrites column_condition.cc, implementing
is_satisfied_by().

Most of the remaining complications in column_condition implementation
come from the need to properly handle frozen and multi-cell
collection in predicates - up until now it was not possible
to compare entire collection values between each other. This is further
complicated since multi-cell lists and sets are returned as maps.

We can no longer assume that the columns fetched by the prefetch operation
are non-frozen collections. An IF EXISTS/IF NOT EXISTS condition
fetches all columns; besides, a column may be needed to check another
condition.

When fetching the old row for LWT or to apply updates on list/columns,
we now calculate precisely the list of columns to fetch.

The primary key columns are also included in CAS batch result set,
and are thus also prefetched (the user needs them to figure out which
statements failed to apply).

The patch is cross-checked for compatibility with cassandra-3.11.4-1545-g86812fa502
but does deviate from the origin in handling of conditions on static
row cells. This is addressed in future series.
2019-10-27 23:42:49 +03:00
Avi Kivity
4676e07400 consistency_level: simplify validation API
Remove unused parameters, replace refcounted pointers by references.
2018-11-27 13:41:49 +02:00
Avi Kivity
2c08bff8d5 Split consistency_level.hh header
It has two unrelated users: cql for validation, and storage_proxy for
complicated calculations. Split the simple stuff into a new header to reduce
dependencies.
2018-11-27 13:32:10 +02:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Avi Kivity
d77e044cde db: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Botond Dénes
6e59cee244 db::consistency_level::filter_for_query() add preferred_endpoints
To the second overload (the one without read-repair related params) too.
2018-09-03 10:31:44 +03:00
Botond Dénes
aaf67bcbaa Consider preferred replicas when choosing endpoints for query_singular()
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algorithm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
2018-03-13 10:34:34 +02:00
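
The three steps listed in aaf67bcbaa can be sketched as follows. This is a simplified illustration, assuming endpoints are strings and that the candidate list is already in load-balancing order; `select_endpoints` is a hypothetical name, and trimming the intersection to the required count is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using endpoints = std::vector<std::string>; // hypothetical stand-in type

// candidates: endpoint candidates, ordered by the load-balancing strategy.
// preferred:  replicas the caller would like to reuse.
// needed:     number of replicas required by the consistency level.
endpoints select_endpoints(const endpoints& candidates, const endpoints& preferred, std::size_t needed) {
    endpoints selected;
    // Step 1: intersection of the candidates and the preferred endpoints.
    for (const auto& ep : candidates) {
        if (std::find(preferred.begin(), preferred.end(), ep) != preferred.end()) {
            selected.push_back(ep);
        }
    }
    // Step 2: if the intersection already satisfies the CL, use it
    // (trimmed to the required count, an assumption of this sketch).
    if (selected.size() >= needed) {
        selected.resize(needed);
        return selected;
    }
    // Step 3: otherwise pick the remaining endpoints in load-balancing order.
    for (const auto& ep : candidates) {
        if (selected.size() == needed) {
            break;
        }
        if (std::find(selected.begin(), selected.end(), ep) == selected.end()) {
            selected.push_back(ep);
        }
    }
    return selected;
}
```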
Gleb Natapov
357c77a333 consistency_level: constify quorum_for() and local_quorum_for()
2017-12-05 13:01:20 +02:00
Gleb Natapov
87094849fa storage_proxy: load balance read requests according to cache hit rates
This patch makes storage proxy choose replicas to read from based on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see fewer. The local
node has a special bonus and will get more requests even if another node
has a slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guaranteed that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
2017-06-13 09:57:14 +03:00
Gleb Natapov
bc8aa1b4ee choose extra replica for speculation in filter_for_query()
Currently storage proxy has to loop over the remaining replicas to search
for a suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
2017-06-13 09:57:14 +03:00
Gleb Natapov
8437ea3b99 consistency_level: drop filter_for_query_dc_local function
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' sets, but filter_for_query() already
does that for CL=LOCAL, so for such queries we needlessly do it twice.
2017-06-13 09:57:14 +03:00
Gleb Natapov
9cc076c9f3 storage_proxy: preserve endpoint's order while filtering local nodes for query
filter_for_query() gets a list of endpoints sorted by preference and
should preserve that order after filtering out non-local endpoints for
a local query. partition() does not guarantee this while stable_partition()
does, so use it instead.

Fixes #1450.
Message-Id: <20160713100909.GM10767@scylladb.com>
2016-07-13 13:17:28 +03:00
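
The ordering guarantee relied on in 9cc076c9f3 can be seen in a small sketch. The `endpoint` struct and `local_in_order` helper are hypothetical; only `std::stable_partition` itself is the standard library call the commit refers to.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical type: an endpoint name plus whether it is in the local DC.
struct endpoint {
    std::string name;
    bool local;
};

// Moves local endpoints to the front while keeping the preference order
// within each group, then returns the names of the local ones.
// std::partition would give no such ordering guarantee.
std::vector<std::string> local_in_order(std::vector<endpoint> eps) {
    auto mid = std::stable_partition(eps.begin(), eps.end(),
                                     [](const endpoint& e) { return e.local; });
    std::vector<std::string> locals;
    for (auto it = eps.begin(); it != mid; ++it) {
        locals.push_back(it->name);
    }
    return locals;
}
```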
Gleb Natapov
dfdbb1e703 storage_proxy: move hack to make coordinator most preferable node for read into sorting function
This is a kind of sorting, so it belongs there, but it also fixes a bug in
storage_proxy::get_read_executor(), which assumes filter_for_query() does
not change the order of nodes in all_nodes when an extra replica is chosen.
Otherwise, if the coordinator IP happens to be last in all_nodes, it will
be chosen as the extra replica and will be queried twice.
Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>
2016-04-14 14:56:21 +03:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Gleb Natapov
f59415b3c6 Take pending endpoints into account while checking for sufficient live nodes
During bootstrapping, additional copies of data have to be made to ensure
that the CL is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that the bootstrapping node can be
dead, which may cause a request to proceed even though there are not
enough live nodes for it to be completed. In such a case the request neither
completes nor times out, so it appears to be stuck from the CQL layer's POV. The
patch fixes this by taking pending nodes into account while checking
that there are enough live nodes for the operation to proceed.

Fixes #965

Message-Id: <20160303165250.GG2253@scylladb.com>
2016-03-07 13:30:13 +01:00
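
A rough sketch of the check described in f59415b3c6, under an explicit assumption: each pending (bootstrapping) endpoint raises the number of required responses by one, and the live count includes any live pending nodes. The function name and the exact inequality are illustrative and may differ from the actual code.

```cpp
#include <cstddef>

// Assumption: a request may proceed only if the live nodes cover the
// responses required by the CL (block_for) plus one per pending endpoint.
constexpr bool sufficient_live_nodes(std::size_t live, std::size_t pending, std::size_t block_for) {
    return live >= block_for + pending;
}
```

For example, with a requirement of 2 and one pending node that is dead, two live natural replicas are no longer enough under this model, so the request would be rejected up front instead of hanging.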
Avi Kivity
d5cf0fb2b1 Add license notices
2015-09-20 10:43:39 +03:00