For various reasons, a view building write may fail. When that
happens, the view building should not finish until these writes
are successfully retried and they should not interfere with any
writes that are performed to the base table while the view is
building.
The test introduced in this patch confirms that this is the case.
Refs scylladb/scylladb#19261Closesscylladb/scylladb#19263
Change the format of sync points to use host ID instead of IPs, to be consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores host IDs instead of IPs.
The decoding supports both formats with host IDs and IPs, so a sync point contains now a variant of either types, and in the case of new type the translation is avoided.
Fixes#18653Closesscylladb/scylladb#19134
* github.com:scylladb/scylladb:
db/hints: migrate sync point to host ID
db/hints: rename sync point structures with _v1 suffix to _v1_v2
Some time ago it turned out that if unrecognized feature name is met in scylla.yaml, the whole experimental features list is ignored, but scylla continues to boot. There's UNUSED feature which is the proper way to deprecate a feature, and this PR improves its handling in several ways.
1. The recently removed "tablets" feature is partially brought back, but marked as UNUSED
2. Any UNUSED features met while parsing are printed into logs
3. The enum_option<> helper is enlightened along the way
refs: #18968Closesscylladb/scylladb#19230
* github.com:scylladb/scylladb:
config: Mark tablets feature as unused
main: Warn unused features
enum_option: Carry optional key on board
enum_option: Remove on-board _map member
Commit 882b2f4e9f (cql3, schema_tables: Generalize function creation)
erroneously says that optional<context> is not suitable for future<>
type, but in fact it is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19204
This features used to be there for a while, but then it was removed by
83d491af02. This patch partially takes it
back, but maps to UNUSED, so that if met in config, it's warned, but
other features are parsed as well.
refs: #18968
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
After wasm udf appeared, code in main, create_function_statement and schema_tables got some involvements into details of wasm engine management. Also, even prior to this, there was duplication in how function context is created by statement code and schema_tables code.
This PR generalizes function context creation and encapsulates the management in sharded<lang::manager> service. Also it removes the wasm::startup_context thing and makes wasm start/stop be "classical" (see #2737)
Closesscylladb/scylladb#19166
* github.com:scylladb/scylladb:
code: Enlighten wasm headers usage
lang: Unfriend wasm context from manager
lang, cql3, schema_tables: Don't mess with db::config
lang: Don't use db::config to create lua context
lang: Don't use db::config to create wasm context
lang: Drop manager::precompile() method
cql3, schema_tables: Generalize function creation
wasm: Replace startup_context with wasm_config
lang: Add manager::start() method
lang: Move manager to lang namespace
lang: Move wasm::manager to its .cc/.hh files
In 58784cd, aa4b06a and other commits migrating
hinted handoff from IPs to host IDs (scylladb/scylladb#15567),
we started ignoring hint directories of invalid names,
i.e. those that represent neither an IP address, nor a host ID.
They remain on disk and are taken into account while computing
e.g. the total size of hints, but they're not used in any way.
These changes add logs informing the user when Scylla
encounters such a directory.
Closesscylladb/scylladb#17566
Now when function context creation is encapsulated in lang::manager,
some .cc files can stop using wasm-specific headers and just go with the
lang/manager.hh one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not function context creation is encapsulated in lang::manager so it's
possible to patch-out few more places that use database as config
provider.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a function is created with the CREATE FUNCTION statement, the
statement handler does all the necessary preparations on its own. The
very same code exists in schema_tables, when the function is loaded on
boot. This patch generalizes both and keeps function language-specific
context creation inside lang/ code.
The creation function returns context via argument reference. It would
have been nicer if it was returned via future<>, but it's not suitable
for future<T> type :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's going to become a facade in front of both -- wasm and lua, so keep
it in files with language independent names.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog - this can be observed in the following scenario:
1. Cluster starts, all nodes gossip their empty view update backlog to one another
2. On node N, `view_update_backlog_broker` (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (backlog has been empty since the start), schedules the next iteration after 1s
3. Within the next 1s, coordinator (different than N) sends a write to N causing a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which is then noted by the coordinator.
4. Still within the 1s, node N finishes the view update in the background, dropping its view update backlog to 0.
5. In the next and following iterations of `view_update_backlog_broker` on N, backlog is empty, as it was in step 2, so no change is seen and no update is sent due to the check
```
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
sleep_abortable(gms::gossiper::INTERVAL, _as).get();
continue;
}
```
After this scenario happens, the coordinator keeps an information about an increased view update backlog on N even though it's actually already empty
This patch fixes the issue this by notifying the gossip that a different backlog
was sent in a response, causing it to send an unchanged backlog to other
nodes in the following gossip round.
Fixes: https://github.com/scylladb/scylladb/issues/18461
Similarly to https://github.com/scylladb/scylladb/pull/18646, without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none
Tests: manual. Currently this patch only affects the length of MV flow control delay, which is not reliable to base a test on. A proper test will be added when MV admission control is added, so we'll be able to base the test on rejected requests
Closesscylladb/scylladb#18663
* github.com:scylladb/scylladb:
mv: gossip the same backlog if a different backlog was sent in a response
node_update_backlog: divide adding and fetching backlogs
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to it's type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.
Closesscylladb/scylladb#19102
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.
This patch changes this by notifying the gossip that a the backlog changed
since the last gossip round so a different backlog could have been send
through the response piggyback mechanism. With that information, gossip
will send an unchanged backlog to other nodes in the following gossip round.
Fixes: https://github.com/scylladb/scylladb/issues/18461
Currently, we only update the backlogs in node_update_backlog at the
same time when we're fetching them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side-effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip which is only on shard 0)
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops due to background work
finish).
This patch divides the node_view_backlog::add_fetch as well the
storage_proxy::get_view_update_backlog both into two methods; one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter, more
situations where we should get/update the backlog should be considered
in a following patch.
Currently it gets the streaming/maintenance one from database, but it
can as well just assume that it's already running in the correct one,
and the main code fulfils this assumption.
This removes one more place that uses database as sched groups provider.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19078
Currently the code wraps simple "if" with std::invoke over a lambda.
Also, the local variable that gets the result, is declared as const one,
which prevents it from being std::move()-d in the very next line.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19106
If a node restart just before it stores bootstrapping node's IP it will
not have ID to IP mapping for bootstrapping node which may cause failure
on a write path. Detect this and fail bootstrapping if it happens.
Closesscylladb/scylladb#18927
* github.com:scylladb/scylladb:
raft topology: fix indentation after previous commit
raft topology: do not add bootstrapping node without IP as pending
test: add test of bootstrap where the coordinator crashes just before storing IP mapping
schema_tables: remove unused code
In d0f5873, we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.
This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.
We also introduce error injections to be able to test that it indeed happens.
Fixesscylladb/scylladb#18761Closesscylladb/scylladb#18764
* github.com:scylladb/scylladb:
db/hints: Introduce an error injection to test draining
db/hints: Ensure that draining happens
Currently, the backlog used for MV flow control is only updated
after we generate view updates as a result of a write request.
However, when the resources are no longer used, we should also
notice that to prevent excessive slowdowns caused by the MV
flow control calulating the delays based of an outdated, large
backlog.
This patch makes it so the backlogs are updated every time
a view update finishes, and not only when the updates start.
Fixes#18783Closesscylladb/scylladb#18804
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18964
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18898
We want to verify that a hint directory is drained
when any of the nodes correspodning to it leaves
the cluster. The test scenario should happen before
the whole cluster has been migrated to
the host-ID-based hinted handoff, so when we still
rely on the mappings between hint endpoint managers
and the hint directories managed by them.
To make such a test possible, in these changes we
introduce an error injection rejecting incoming
hints. We want to test a scenario when:
1. hints are saved towards a given node -- node N1,
2. N1 changes its IP to a different one,
3. some other node -- node N2 -- changes its IP
to the original IP of N1,
4. hints are saved towards N2 and they are stored
in the same directory as the hints saved towards
N1 before,
5. we start draining N2.
Because at some point N2 needs to be stopped,
it may happen that some mutations towards
a distributed system table generate a hint
to N2 BEFORE it has finished changing its IP,
effectively creating another hint directory
where ALL of the hints towards the node
will be stored from there on. That would disturb
the test scenario. Hence, this error injection is
necessary to ensure that all of the steps in the
test proceed as expected.
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.
Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.
Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.
This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablets replicas through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.
Fixes: #16129Closesscylladb/scylladb#16723
* github.com:scylladb/scylladb:
test: Do not check tablets mutations on nodes that don't have them
test: Fix the way tablets RF-change test parses mutation_fragments
test/tablets: Unmark RF-changing test with xfail
docs: document ALTER KEYSPACE with tablets
Return response only when tablets are reallocated
cql-pytest: Verify RF is changes by at most 1 when tablets on
cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
Reject ALTER with 'replication_factor' tag
Implement ALTER tablets KEYSPACE statement support
Parameterize migration_manager::announce by type to allow executing different raft commands
Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Allow query_processor to check if global topo queue is empty
Introduce new global topo `keyspace_rf_change` req
New raft cmd for both schema & topo changes
Add storage service to query processor
tablets: tests for adding/removing replicas
tablet_allocator: make load_balancer_stats_manager configurable by name
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
Fixes: #17786Fixes: #18709Closesscylladb/scylladb#18816
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
The system-distributed-keyspace and view-update-generator often go in pair, because streaming, repair and sstables-loader (via distributed-loader) need them booth to check if sstable is staging and register it if it's such. The check is performed by messing directly with system_distributed.view_build_status table, and the registration happens via view-update-generator.
That's not nice, other services shouldn't know that view status is kept in system table. Also view-update-generator is a service to generae and push view updates, the fact that it keeps staging sstables list is the implementation detail.
This PR replaces dependencies on the mentioned pair of services with the single dependency on view-builder (repair, sstables-loader and stream-manager are enlightened) and hides the view building-vs-staging details inside the view_builder.
Along the way, some simplification of repair_writer_impl class is done.
Closesscylladb/scylladb#18706
* github.com:scylladb/scylladb:
stream_manager: Remove system_distributed_keyspace and view_update_generator
repair: Remove system_distributed_keyspace and view_update_generator
streaming: Remove system_distributed_keyspace and view_update_generator
sstables_loader: Remove system_distributed_keyspace and view_update_generator
distributed_loader: Remove system_distributed_keyspace and view_update_generator
view: Make register_staging_sstable() a method of view_builder
view: Make check_view_build_ongoing() helper a method of view_builder
streaming: Proparage view_builder& down to make_streaming_consumer()
repair: Keep view_builder& on repair_writer_impl
distributed_loader: Propagate view_builder& via process_upload_dir()
stream_manager: Add view builder dependency
repair_service: Add view builder dependency
sstables_loader: Add view_bulder dependency
main: Start sstables loader later
repair: Remove unwanted local references from repair_meta
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
this series disables operator<<:s for vector and unordered_map, and drop operator<< for mutation, because we don't have to keep it to work with these operator:s anymore. this change is a follow up of https://github.com/scylladb/seastar/issues/1544
this change is a cleanup. so no need to backport
Closesscylladb/scylladb#18866
* github.com:scylladb/scylladb:
mutation,db: drop operator<< for mutation and seed_provider_type&
build: disable operator<< for vector and unordered_map
db/heat_load_balance: include used header
test: define a more generic boost_test_print_type
test/boost: define fmt::formatter for service_level_controller_test.cc
test/boost: include test/lib/test_utils.hh
in theory, std::result_of_t should have been removed in C++20. and
std::invoke_result_t is available since C++17. thanks to libstdc++,
the tree is compiling. but we should not rely on this.
so, in this change, we replace all `std::result_of_t` with
`std::invoke_result_t`. actually, clang + libstdc++ is already warning
us like:
```
In file included from /home/runner/work/scylladb/scylladb/multishard_mutation_query.cc:9:
In file included from /home/runner/work/scylladb/scylladb/schema/schema_registry.hh:11:
In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/unordered_map:38:
Warning: /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/type_traits:2624:5: warning: 'result_of<void (noop_compacted_fragments_consumer::*(noop_compacted_fragments_consumer &))()>' is deprecated: use 'std::invoke_result' instead [-Wdeprecated-declarations]
2624 | using result_of_t = typename result_of<_Tp>::type;
| ^
/home/runner/work/scylladb/scylladb/mutation/mutation_compactor.hh:518:43: note: in instantiation of template type alias 'result_of_t' requested here
518 | if constexpr (std::is_same_v<std::result_of_t<decltype(&GCConsumer::consume_end_of_stream)(GCConsumer&)>, void>) {
|
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18835
since we've migrated away from the generic homebrew formatters
for range-alike containers, there is no need to keep there operator<<
around -- they were preserved in order to work with the container
formatters which expect operator<< of the elements.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this header, we use `hr_logger.trace("returned _pp={}", p)` to
print a `vector<float>`, so we we need to include `fmt/ranges.h`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Callers of it had just checked if an sstable still has some views
building, so the should talk to view-builder to register the sstable
that's now considered to be staging.
Effectively. this is to hide the view-update-generator from other
services and make them communicate with the builder only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper checks if there's an ongoing build of a view, and it's in
fact internal to view-builder, who keeps its status in one of its
system tables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is declared in .cc file, while
form formatter is in the header and is equipeed with constexprs so
that converting an address to string is done as much as possible
compile-time.
Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18712
User-defined types can depend on each other, creating directed acyclic graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.
Fixes#18539Closesscylladb/scylladb#18302
* github.com:scylladb/scylladb:
test/cql-pytest/test_describe: add test for UDTs ordering
cql3/statements/describe_statement: UDTs topological sorting
cql3/statements/describe_statement: allow to skip alphabetical sorting
types: add a method to get all referenced user types
db/cql_type_parser: use generic topological sorting
db/cql_type_parses: futurize raw_builder::build()
test/boost: add test for topological sorting
utils: introduce generic topological sorting algorithm
This feature corrected how we store the token in secondary indexes. It
was introduced in 7ff72b0ba5 (2020; 4.4) and can now be assumed present
everywhere. Note that we still support indexes created with the old format.
The PER_TABLE_PARTITIONERS feature was added in 90df9a44ce (2020; 4.0)
and can now be assumed to be always present. We also remove the associated
schema_feature.
The CDC feature was made non-experimental in e9072542c1 (2020; 4.4)
and can now be assumed to be always present. We also remove the corresponding
schema_feature.
The VIEW_VIRTUAL_COLUMNS feature was added in a108df09f9 (2019; 3.1)
and can now be assumed to be always present.
The corresponding schema_feature is removed. Note schema_features are not sent
over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.
"me" format sstables were introduced in d370558279 (Jan 2022; 5.1)
and so can be assumed always present. The listener that checks when
the cluster understands ME_SSTABLE was removed and in its place
we default to sstable_version_types::me (and call on_enabled()
immediately).