Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixesscylladb/scylladb#19152Closesscylladb/scylladb#19488Fixesscylladb/scylladb#20787
(cherry picked from commit 08b29460fc)
Closesscylladb/scylladb#20934
Fixes#20862
With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.
Test included.
(cherry picked from commit ee5e71172f)
Closesscylladb/scylladb#20913
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.
Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylladb#20558Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
This is a backport of the original PR without the tests, done avoid the need of resolving merge conflicts in that area.
Closesscylladb/scylladb#20559
* github.com:scylladb/scylladb:
hints: send hints with CL=ALL if target is leaving
hints: inline do_send_one_mutation
Because of https://github.com/scylladb/scylladb/issues/9285 hit weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.
(cherry picked from commit 807e37502a)
Closesscylladb/scylladb#20470
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only
send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.
Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
(cherry picked from commit 61ac0a336d)
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.
(cherry picked from commit 8abb06ab82)
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.
In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
(cherry picked from commit ea8441dfa3)
(cherry picked from commit edb3068ecf)
A dialect is a different way to interpret the same CQL statement.
Examples:
- how duplicate bind variable names are handled (later in this series)
- whether `column = NULL` in LWT can return true (as is now) or
whether it always returns NULL (as in SQL)
Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.
The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
(cherry picked from commit d69bf4f010)
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
(cherry picked from commit 9eb47b3ef0)
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.
This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 52234214e5)
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.
Fixes#18173
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 95a5fba0ea)
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
(cherry picked from commit 6af7882c59)
(cherry picked from commit 5ec8c06561)
Refs #19765Closesscylladb/scylladb#19896
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.
(cherry picked from commit 6a7fb18b52)
Closesscylladb/scylladb#20094
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes#18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
In 58784cd, aa4b06a and other commits migrating
hinted handoff from IPs to host IDs (scylladb/scylladb#15567),
we started ignoring hint directories of invalid names,
i.e. those that represent neither an IP address, nor a host ID.
They remain on disk and are taken into account while computing
e.g. the total size of hints, but they're not used in any way.
These changes add logs informing the user when Scylla
encounters such a directory.
Closesscylladb/scylladb#17566
(cherry picked from commit a5528a2093)
Closesscylladb/scylladb#19892
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.
As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.
Fixes: scylladb/scylladb#19439
(cherry picked from commit 6af7882c59)
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.
(cherry picked from commit 31c0fa07d8)
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).
(cherry picked from commit 2dd057c96d)
Before these changes, it could happen that Scylla initialized
endpoint managers for hint directories representing
* host IDs before migrating hinted handoff to using host IDs,
* IP addresses after the migration.
One scenario looked like this:
1. Start Scylla and upgrade the cluster to using host IDs.
2. Create, by hand, a hint directory representing an IP address.
3. Trigger changing the host filter in hinted handoff; it could
be achieved by, for example, restricting the set of data
centers Scylla is allowed to save hints for.
When changing the host filter, we browse the hint directories
and create endpoint managers if we can send hints towards
the node corresponding to a given hint directory. We only
accepted hint directories representing IP addresses
and host IDs. However, we didn't check whether the local node
has already been upgraded to host-ID-based hinted handoff
or not. As a result, endpoint managers were created for
both IP addresses and host IDs, no matter whether we were
before or after the migration.
These changes make sure that any time we browse the hint
directories, we take that into account.
Fixesscylladb/scylladb#19172
(cherry picked from commit c9bb0a4da6)
Closesscylladb/scylladb#19426
Allow an extension to describe itself as the CQL property
string that created it (and is serialized to schema tables)
Only paxos extension requires override.
(cherry picked from commit d27620e146)
Some time ago it turned out that if unrecognized feature name is met in scylla.yaml, the whole experimental features list is ignored, but scylla continues to boot. There's UNUSED feature which is the proper way to deprecate a feature, and this PR improves its handling in several ways.
1. The recently removed "tablets" feature is partially brought back, but marked as UNUSED
2. Any UNUSED features met while parsing are printed into logs
3. The enum_option<> helper is enlightened along the way
refs: #18968
(cherry picked from commit f56cdb1cac)
(cherry picked from commit 0c0a7d9b9a)
(cherry picked from commit b85a02a3fe)
(cherry picked from commit b2520b8185)
Refs #19230Closesscylladb/scylladb#19266
* github.com:scylladb/scylladb:
config: Mark tablets feature as unused
main: Warn unused features
enum_option: Carry optional key on board
enum_option: Remove on-board _map member
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to it's type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.
Fixes#19261
(cherry picked from commit 4aa7ada771)
Closesscylladb/scylladb#19262
This features used to be there for a while, but then it was removed by
83d491af02. This patch partially takes it
back, but maps to UNUSED, so that if met in config, it's warned, but
other features are parsed as well.
refs: #18968
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b2520b8185)
In [d0f5873](d0f58736c8), we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.
This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.
We also introduce error injections to be able to test that it indeed happens.
Fixesscylladb/scylladb#18761
(cherry picked from commit [745a9c6](745a9c6ab8))
(cherry picked from commit [e855794](e855794327))
Refs scylladb/scylladb#18764Closesscylladb/scylladb#19114
* github.com:scylladb/scylladb:
db/hints: Introduce an error injection to test draining
db/hints: Ensure that draining happens
We want to verify that a hint directory is drained
when any of the nodes correspodning to it leaves
the cluster. The test scenario should happen before
the whole cluster has been migrated to
the host-ID-based hinted handoff, so when we still
rely on the mappings between hint endpoint managers
and the hint directories managed by them.
To make such a test possible, in these changes we
introduce an error injection rejecting incoming
hints. We want to test a scenario when:
1. hints are saved towards a given node -- node N1,
2. N1 changes its IP to a different one,
3. some other node -- node N2 -- changes its IP
to the original IP of N1,
4. hints are saved towards N2 and they are stored
in the same directory as the hints saved towards
N1 before,
5. we start draining N2.
Because at some point N2 needs to be stopped,
it may happen that some mutations towards
a distributed system table generate a hint
to N2 BEFORE it has finished changing its IP,
effectively creating another hint directory
where ALL of the hints towards the node
will be stored from there on. That would disturb
the test scenario. Hence, this error injection is
necessary to ensure that all of the steps in the
test proceed as expected.
(cherry picked from commit e855794327)
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.
Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.
Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.
(cherry picked from commit 745a9c6ab8)
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closesscylladb/scylladb#19014
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)
Closesscylladb/scylladb#19012
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
Fixes: #17786Fixes: #18709
(cherry-picked from 519317dc58)
This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.
Fixes: https://github.com/scylladb/scylladb/issues/18941
(cherry-picked from 0de3a5f3ff)
Closesscylladb/scylladb#18974
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is declared in .cc file, while
form formatter is in the header and is equipeed with constexprs so
that converting an address to string is done as much as possible
compile-time.
Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18712
User-defined types can depend on each other, creating directed acyclic graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.
Fixes#18539Closesscylladb/scylladb#18302
* github.com:scylladb/scylladb:
test/cql-pytest/test_describe: add test for UDTs ordering
cql3/statements/describe_statement: UDTs topological sorting
cql3/statements/describe_statement: allow to skip alphabetical sorting
types: add a method to get all referenced user types
db/cql_type_parser: use generic topological sorting
db/cql_type_parses: futurize raw_builder::build()
test/boost: add test for topological sorting
utils: introduce generic topological sorting algorithm
This feature corrected how we store the token in secondary indexes. It
was introduced in 7ff72b0ba5 (2020; 4.4) and can now be assumed present
everywhere. Note that we still support indexes created with the old format.
The PER_TABLE_PARTITIONERS feature was added in 90df9a44ce (2020; 4.0)
and can now be assumed to be always present. We also remove the associated
schema_feature.
The CDC feature was made non-experimental in e9072542c1 (2020; 4.4)
and can now be assumed to be always present. We also remove the corresponding
schema_feature.
The VIEW_VIRTUAL_COLUMNS feature was added in a108df09f9 (2019; 3.1)
and can now be assumed to be always present.
The corresponding schema_feature is removed. Note schema_features are not sent
over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.
"me" format sstables were introduced in d370558279 (Jan 2022; 5.1)
and so can be assumed always present. The listener that checks when
the cluster understands ME_SSTABLE was removed and in its place
we default to sstable_version_types::me (and call on_enabled()
immediately).