`topology` currently contains the `requests` map, which is suitable for
node-specific requests such as "this node wants to join" or "this node
must be removed". But for operations that affect the cluster as a
whole, a separate request type and field are more appropriate.
Introduce one.
The enum currently contains the option `new_cdc_generation` for requests
to create a new CDC generation in the cluster. We will implement the
whole procedure in later commits.
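A minimal sketch of the resulting split, using assumed names and
simplified types (not the actual ScyllaDB definitions):
```
#include <optional>
#include <string>
#include <unordered_map>

// node-specific requests, as already kept in the `requests` map
enum class topology_request { join, leave, remove, replace };

// requests for operations that affect the cluster as a whole
enum class global_topology_request { new_cdc_generation };

struct topology_state {
    // per-node requests, keyed here by a host id string for brevity
    std::unordered_map<std::string, topology_request> requests;
    // at most one pending cluster-global request
    std::optional<global_topology_request> global_request;
};
```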
- make it a static column in `system.topology`
- move it from node-specific `ring_slice` to cluster-global `topology`
We will use it in scenarios where no node is transitioning.
Also make it `std::optional` in topology for consistency with other
fields (previously, the 'no value' state for this field was represented
using default-constructed `utils::UUID`).
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurrences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13765
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR, however, covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a follow-up to said PR, completing the renaming: all remaining symbols, names, comments etc. are updated, so everything is consistent and up-to-date.
Closes #13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
The current S3 client was tested against MinIO, and it takes a few more touches to work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is whether the request _body_ is included in the signature calculation. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body has to be included in the signature calculation), which can be a bit troublesome, so instead the PR uses unsigned payloads (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is make the existing S3 client code sign requests. In order to sign a request, the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature-generating code was previously merged (moved from the alternator code) and updated to suit the S3 client's needs.
In order to properly support HTTPS, the PR adds a special connection factory to be used with the seastar HTTP client. The factory performs DNS resolution of AWS endpoint names and configures the gnutls system trust.
fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
In order to access a real S3 bucket, the client should use signed requests
over HTTPS. Partially this is due to security considerations, and partially
it is unavoidable, because multipart uploads are banned for unsigned
requests on S3. Also, signed requests over plain HTTP require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure HTTPS and keep the payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so as there's an option to use S3
URLs without a region name in them, which we may want to use some time.
To keep the described configuration, the proposed place is the
object_storage.yaml file with the following format:
endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...
When loaded, the map gets into db::config and is later propagated
down to the sstables code (see the next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in this series, we try to use `generation_type` as a proxy that hides its underlying type from consumers. this paves the road to the UUID based generation identifier: by then, we cannot assume the type of `value()` without asking `generation_type` first, so it's better to leave all the formatting and conversions to `generation_type`. also, this series changes the "generation" column of the sstable registry table to "uuid", and converts its value back to the original generation_type when necessary; this paves the road to a world with UUID based generation ids.
Closes #13652
* github.com:scylladb/scylladb:
db: use uuid for the generation column in sstable registry table
db, sstable: add operator data_value() for generation_type
db, sstable: print generation instead of its value
* change the "generation" column of sstable registry table from
bigint to uuid
* add a helper to convert UUID back to the original generation
in the long run, we encourage users to use the uuid based generation
identifier. but in the transition period, both bigint based and uuid
based identifiers are used for the generation. so, to cater to both
needs, we use a hackish way to store the integer in a UUID. to
differentiate a was-integer UUID from a genuine UUID, we
check the UUID's most_significant_bits: because we only support
serializing UUID v1, if the timestamp in the UUID is zero,
we assume the UUID was generated from an integer when converting it
back to a generation identifier.
also, please note that the only use case of generation as a
column is the sstable_registry table. since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column, so the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we will always store the generation as a UUID in this table. if the
generation's identifier is an int64_t, the value of the integer will
be used as the least significant bits of the UUID.
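as a rough illustration of this scheme (assumed names and a simplified
UUID representation, not the actual utils::UUID helpers):
```
#include <cstdint>
#include <optional>

// simplified stand-in for a UUID: most/least significant 64-bit halves
struct uuid_bits {
    int64_t msb = 0;
    int64_t lsb = 0;
};

// store an integer generation in the least significant bits; the most
// significant bits (which carry a v1 UUID's timestamp) stay zero
inline uuid_bits uuid_from_int_generation(int64_t gen) {
    return uuid_bits{0, gen};
}

// a genuine v1 UUID has a non-zero timestamp, so msb == 0 marks a
// was-integer UUID and lets us recover the original bigint generation
inline std::optional<int64_t> int_generation_from_uuid(const uuid_bits& u) {
    if (u.msb == 0) {
        return u.lsb;
    }
    return std::nullopt;
}
```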
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`clustering_key_columns()` returns a range view, and `front()` returns
a reference to its first element, so we cannot assume this reference
remains valid after the full expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences a warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
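a minimal, self-contained sketch of the pattern being fixed (illustrative
types, not the schema_tables.cc code):
```
#include <ranges>
#include <vector>

struct column_definition { int id; };

struct schema_like {
    std::vector<column_definition> _ck_columns{{1}, {2}};
    // returns a range view over the columns, like clustering_key_columns()
    auto clustering_key_columns() const { return std::views::all(_ck_columns); }
};

int main() {
    schema_like v;

    // flagged by GCC 13 as possibly dangling: the range returned by
    // clustering_key_columns() is a temporary, and the reference obtained
    // via front() is taken from it within the same full expression
    // const column_definition& first_view_ck = v.clustering_key_columns().front();

    // fixed: capture the returned range by value, then keep the first
    // element by reference while the range is still alive
    auto ck_columns = v.clustering_key_columns();
    const column_definition& first_view_ck = ck_columns.front();
    return first_view_ck.id;
}
```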
Fixes #13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13721
this series silences warnings from GCC 13. some of these changes are considered critical fixes and are posted separately.
see also #13243
Closes #13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparision
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed to `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.
This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.
Closes #13690
* github.com:scylladb/scylladb:
raft topology: rename `update_replica_state` -> `update_topology_state`
raft topology: remove `transition_state::normal`
raft topology: switch on `transition_state` first
raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
raft topology: parse replaced node in `exec_global_command`
raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
storage_service: extract raft topology coordinator fiber to separate class
raft topology: rename `replication_state` to `transition_state`
raft topology: make `replication_state` a topology-global state
this also silences the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
1489 | auto ts = db_clock::now();
| ^~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the road to
adding a UUID based generation id to `generation_type`. since by then we
will have both UUID based and integer based `generation_type`,
`generation_type::value()` will not be able to represent its value
anymore, and this method will be replaced by `operator data_value()` in
this use case.
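a rough sketch of the idea, with simplified, illustrative types rather
than the real cql3/sstables ones:
```
#include <cstdint>

// stand-in for cql3's data_value, holding just a bigint here
struct data_value {
    int64_t v;
};

class generation_type {
    int64_t _value;   // later: a variant of UUID and int64_t
public:
    explicit generation_type(int64_t v) : _value(v) {}
    // non-explicit on purpose: a generation_type can then be passed
    // directly wherever a data_value is expected (e.g. to execute_cql()),
    // instead of callers extracting the raw value with value()
    operator data_value() const { return data_value{_value}; }
};
```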
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change prepares for using `variant<UUID, int64_t>`
as the value of `generation_type`. after that change, the "value"
of a generation will be a UUID or an integer, and we don't want to
expose the variant in the generation's public interface, so the `value()`
method will be changed or removed by then.
this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse the `generation_type` formatter when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The new name is more generic - it describes the current step of a
'topology saga' (a sequence of steps used to implement a larger topology
operation such as bootstrap).
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.
The `replication_state` column in `system.topology` is now `static`.
This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
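A hedged sketch of the resulting shape, with names taken from the
surrounding text and details simplified (not the exact ScyllaDB code):
```
#include <optional>

// cluster-global step of the current 'topology saga';
// commit_cdc_generation is a step that refers to no specific node
enum class transition_state {
    commit_cdc_generation,
    write_both_read_old,
    write_both_read_new,
};

struct topology {
    // was ring_slice::replication_state (one per node); now a single
    // optional value, backed by a static column in system.topology
    std::optional<transition_state> tstate;
};
```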
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
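A hedged sketch (assumed names and simplified types, not the actual
locator code) of how tablet metadata can pin each tablet replica to a
specific shard of a specific node, per table:
```
#include <map>
#include <vector>

using table_id = unsigned long;   // illustrative; the real id is UUID-based
using host_id = unsigned long;
using shard_id = unsigned;

struct tablet_replica {
    host_id host;     // the node holding this tablet replica
    shard_id shard;   // unlike vnode ranges, the replica lives on one shard
};

struct tablet_info {
    std::vector<tablet_replica> replicas;   // one entry per replica
};

struct tablet_map {
    std::vector<tablet_info> tablets;   // fragments of the table's token space
};

struct tablet_metadata {
    std::map<table_id, tablet_map> per_table;   // replication decided per table
};
```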
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                                         'replication_factor': 3,
                                         'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy; they
just alter the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined; the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`! operator==()`, which matches the behavior of the
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the synthesized operator!=, C++20 also brings us
the defaulted operator==() -- it is able to generate the
operator==() as a member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if an operator==() is implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is implemented using the
compiler-generated one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the defaulted `operator<=>` is used in place of
the defaulted `operator==`.
sometimes we fail to mark the operator== with the `const`
specifier; in this change, to fulfill the requirements of the C++ standard
and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but this is not always the case: in the
`version` class, we use `version` as the parameter type. to
fulfill the requirements of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and it is a more idiomatic
way to pass a non-trivial struct as a function parameter.
please note that, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric forms of another variant. if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
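for illustration, a small self-contained example of the patterns applied
in this change (the types are made up, not taken from the tree):
```
#include <compare>

struct point {
    int x = 0;
    int y = 0;
    // member-wise lexicographical comparison: default it and drop the
    // hand-written operator!=/</<=/>/>=, which C++20 now synthesizes
    friend auto operator<=>(const point&, const point&) = default;
    friend bool operator==(const point&, const point&) = default;
};

struct version {
    int v = 0;
    // equality that is not member-wise stays hand-written, but is
    // const-qualified and takes `const version&` (not `version` by value),
    // so the compiler can still derive operator!= from it
    constexpr bool operator==(const version& other) const { return v / 10 == other.v / 10; }
};

static_assert(point{1, 2} != point{1, 3});   // != generated from ==
static_assert(version{21} == version{25});   // same "major" part (21/10 == 25/10)
static_assert(version{21} != version{30});   // != synthesized from the custom ==
```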
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13687
Fix two issues with the replace operation introduced by recent PRs.
Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).
Fixes: #13651
Closes #13655
* github.com:scylladb/scylladb:
test: new suite for testing raft-based topology
test: remove topology_custom/test_custom.py
raft topology: don't require new CDC generation UUID to always be present
raft topology: include shard_count/ignore_msb during replace
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
Will be used by tablet-based replication strategies, for which the
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
`seastar::current_backtrace()` can be quite heavy.
When we pass it to a log message at a relatively detailed log level
(debug/trace), we pay the price of `current_backtrace` every time,
even though we rarely print the message.
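A minimal sketch of the pattern, assuming a Seastar-style logger (the
actual call sites differ):
```
#include <seastar/util/backtrace.hh>
#include <seastar/util/log.hh>

static seastar::logger tlogger("topology");

void on_topology_update() {
    // only pay for current_backtrace() when the message will be emitted
    if (tlogger.is_enabled(seastar::log_level::debug)) {
        tlogger.debug("topology updated from {}", seastar::current_backtrace());
    }
}
```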
Closes #13527
* github.com:scylladb/scylladb:
locator/topology: call seastar::current_backtrace only when log_level is enabled
schema_tables: call seastar::current_backtrace only when log_level is enabled
This series cleans up the generation and version types used in gms / gossiper.
Currently we use a blend of int, int32_t, and int64_t around messaging.
This change defines gms::generation_type and gms::version_type as int32_t
and adds checks in non-release modes that the respective int64 values passed over messaging do not overflow 32 bits.
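A rough sketch of what such a non-release check could look like (the
helper's name appears in the commit list below; its exact signature is
assumed):
```
#include <cassert>
#include <cstdint>
#include <limits>

// values arrive over messaging as 64-bit integers; the strong types
// (gms::generation_type / gms::version_type) are backed by int32_t
inline void debug_validate_gossip_generation(int64_t raw) {
    assert(raw >= std::numeric_limits<int32_t>::min());
    assert(raw <= std::numeric_limits<int32_t>::max());
}
```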
Closes #12966
* github.com:scylladb/scylladb:
gossiper: version_generator: add {debug_,}validate_gossip_generation
gms: gossip_digest: use generation_type and version_type
gms: heart_beat_state: use generation_type and version_type
gms: versioned_value: use version_type
gms: version_generator: define version_type and generation_type strong types
utils: move generation-number to gms
utils: add tagged_integer
gms: versioned_value: make members private
scylla-gdb: add get_gms_versioned_value
gms: versioned_value: delete unused compare_to function
gms: gossip_digest: delete unused compare_to function
The only reason why it's there (right next to compaction_fwd.hh) is
that the database::table_truncate_state nested class needs the definition
of the compaction_manager::compaction_reenabler nested class.
However, the former is not used outside of database.cc and can be
defined in the .cc file. Keeping it out of the header allows dropping
compaction_manager.hh from database.hh, thus greatly reducing its fanout
over the code (from ~180 indirect inclusions down to ~20).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13622
Derived from utils::tagged_integer, using different tags,
the types are incompatible with each other and require explicit
typecasting to- and from- their value type.
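A simplified sketch of the idea (not the actual utils::tagged_integer
definition): the tag parameter makes otherwise identical integer wrappers
into distinct, mutually incompatible types.
```
#include <compare>
#include <cstdint>

template <typename Tag, typename ValueType>
class tagged_integer {
    ValueType _value{};
public:
    tagged_integer() = default;
    explicit tagged_integer(ValueType v) : _value(v) {}      // explicit from the value type
    explicit operator ValueType() const { return _value; }   // explicit back to the value type
    auto operator<=>(const tagged_integer&) const = default;
};

struct generation_tag {};
struct version_tag {};
using generation_type = tagged_integer<generation_tag, int32_t>;
using version_type = tagged_integer<version_tag, int32_t>;

// generation_type g = version_type(1);   // does not compile: different tags
```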
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although the get_generation_number implementation is
completely generic, it is used exclusively to seed
the gossip generation number.
Following patches will define a strong gms::generation_id
type and this function should return it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`).
The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The
difference is that V2 lives in `system_distributed_keyspace` and writes to it
are distributed using regular `storage_proxy` replication mechanisms based on
the token ring. The V3 table lives in `system_keyspace` and any mutations
written to it will go through group 0.
Extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`; it stores the UUID of a newly introduced CDC
generation, which is used as the partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
When the topology coordinator handles a request for a node to join, calculate a new
CDC generation using the bootstrapping node's tokens, translate it to mutation
format, and insert this mutation into the CDC_GENERATIONS_V3 table through group 0
at the same time we assign tokens to the node in Raft topology. The partition
key for this data is stored in the bootstrapping node's `ring_slice`.
After inserting the new CDC generation data, we need to pick a timestamp for this
generation and commit it, telling all nodes in the cluster to start using the
generation for CDC log writes once their clocks cross that timestamp.
We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping
node's `ring_slice` - which serves as the key to the table where the CDC
generation data is stored - and combines it with a timestamp which it generates
a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by
default 1 minute). This gives us a CDC generation ID which we commit into the
topology state as the `current_cdc_generation_id` while switching the saga to
the next step, `write_both_read_old`.
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description tables so
CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living in the
`system_distributed` keyspace) so it requires `token_metadata` to be nonempty.
We need a hack for the case of bootstrapping the first node in the cluster -
turning the tokens into normal tokens earlier in the procedure in
`token_metadata`, but this is fine for the single-node case since no streaming
is happening.
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data structures
that are used when coordinating writes to CDC log tables.
We include the current CDC generation data in topology snapshot transfers.
Some fixes and refactors included.
Closes #13385
* github.com:scylladb/scylladb:
docs: cdc: describe generation changes using group 0 topology coordinator
cdc: generation_service: add a FIXME
cdc: generation_service: add legacy_ prefix for gossiper-based functions
storage_service: include current CDC generation data in topology snapshots
db: system_keyspace: introduce `query_mutations` with range/slice
storage_service: hold group 0 apply mutex when reading topology snapshot
service: raft_group0_client: introduce `hold_read_apply_mutex`
storage_service: use CDC generations introduced by Raft topology
raft topology: publish new CDC generation to the user description tables
raft topology: commit a new CDC generation on node bootstrap
raft topology: create new CDC generation data during node bootstrap
service: topology_state_machine: make topology::find const
db: system_keyspace: small refactor of `load_topology_state`
cdc: generation: extract pure parts of `make_new_generation` outside
db: system_keyspace: add storage for CDC generations managed by group 0
service: topology_state_machine: better error checking for state name (de)serialization
service: raft: plumbing `cdc::generation_service&`
cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter
cdc: generation: make `topology_description_generator::get_sharding_info` a parameter
sys_dist_ks: make `get_cdc_generation_mutations` public
sys_dist_ks: move find_schema outside `get_cdc_generation_mutations`
sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations`
service/raft: group0_state_machine: signal topology state machine in `load_snapshot`
In order not to copy the rvalue consumer arg -- instantly convert it
into a value. No other tricks.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
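A self-contained illustration of the resolution mismatch (made-up
timestamps; the real state IDs are timeuuids):
```
#include <cassert>
#include <chrono>

int main() {
    using namespace std::chrono;

    // two consecutive state IDs generated within the same millisecond
    microseconds prev_state_id_ts{1'000'250};
    microseconds next_state_id_ts{1'000'900};

    // truncated to milliseconds, the older entry is no longer strictly
    // older, so a millisecond-based range tombstone cannot cover it
    assert(duration_cast<milliseconds>(prev_state_id_ts) ==
           duration_cast<milliseconds>(next_state_id_ts));

    // at microsecond resolution the strict ordering guaranteed by
    // generate_group0_state_id is preserved, and the tombstone works
    assert(prev_state_id_ts < next_state_id_ts);
}
```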
Fixes #13594
Closes #13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
Move gms::arrival_window to api/failure_detector, which is its only user,
and get rid of the rest, which is not used now that we use direct_failure_detector instead.
TODO: integrate direct_failure_detector with the failure_detector api.
Closes #13576
* github.com:scylladb/scylladb:
gms: get rid of unused failure_detector
api: failure_detector: remove false dependency on failure_detector::arrival_window
test: rest_api: add test_failure_detector
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
Fixes #13594
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `function_name` without the help of `operator<<`.
the corresponding `operator<<()` is dropped in this change,
as all its callers are now using fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13608
The legacy failure_detector is now unused and can be removed.
TODO: integrate direct_failure_detector with the failure_detector api.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Database functions currently receive their arguments as an std::vector. This
is inflexible (for example, one cannot use small_vector to reduce allocations).
This series adapts the function signature to accept parameters using std::span.
Some changes in the keys interface are needed to support this. Lastly, one call
site is migrated to small_vector.
This is in support of changing selectors to use expressions.
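A small sketch of the signature change (illustrative function and types,
not the actual cql3 code):
```
#include <cstddef>
#include <span>
#include <vector>

using bytes_opt = std::vector<std::byte>;   // stand-in for the real parameter type

// before: accepting const std::vector<bytes_opt>& forced callers to own a vector;
// a span also works with utils::small_vector, arrays, subranges, etc.
std::size_t execute(std::span<const bytes_opt> parameters) {
    return parameters.size();
}

int main() {
    std::vector<bytes_opt> args(3);
    return static_cast<int>(execute(args));   // the vector converts to a span implicitly
}
```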
Closes #13581
* github.com:scylladb/scylladb:
cql3: abstract_function_selector: use small_vector for argument buffer
db, cql3: functions: pass function parameters as a span instead of a vector
keys: change from_optional_exploded to accept a span instead of a vector
There is a `query_mutations` function which loads the entire contents of
a given table into memory. There was no function for e.g. loading just a
single partition in the form of mutations. Introduce one.
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description
tables so CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living
in the `system_distributed` keyspace) so it requires `token_metadata`
to be nonempty. We need a hack for the case of bootstrapping the
first node in the cluster - turning the tokens into normal tokens
earlier in the procedure in `token_metadata`, but this is fine for the
single-node case since no streaming is happening.
After inserting new CDC generation data (see previous commit), we need
to pick a timestamp for this generation and commit it, telling all nodes
in the cluster to start using the generation for CDC log writes once
their clocks cross that timestamp.
We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a
bootstrapping node's `ring_slice` - which serves as the key to the table
where the CDC generation data is stored - and combines it with a
timestamp which it generates a bit into the future (as in old
gossiper-based code, we use 2 * ring_delay, by default 1 minute). This
gives us a CDC generation ID which we commit into the topology state as
the `current_cdc_generation_id` while switching the saga to the next
step, `write_both_read_old`.
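As a rough sketch of the timestamp choice (assumed names; per the text
above, 2 * ring_delay defaults to one minute):
```
#include <chrono>

using db_clock = std::chrono::system_clock;   // stand-in for ScyllaDB's db_clock

// the generation starts operating only once nodes' clocks cross this
// point, which is deliberately chosen a bit into the future
db_clock::time_point new_cdc_generation_timestamp(std::chrono::milliseconds ring_delay) {
    return db_clock::now() + 2 * ring_delay;
}
```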
`system_keyspace::load_topology_state` is extended to load
`current_cdc_generation_id`.
For now, nodes don't react to `current_cdc_generation_id`. In a later
commit we'll extend `storage_service::topology_state_load` to start
using the current CDC generation for CDC log table writes.
The solution with specifying a timestamp into the future is the same as
it is for gossip-based topology changes and it has the same consistency
problem - if some node is temporarily partitioned away from the quorum,
it might not learn about the new CDC generation before its clock crosses
the generation's timestamp, causing it to temporarily send writes to the
wrong CDC streams (until it learns about the new timestamp). I left a
FIXME which describes an alternative solution which wasn't viable for
gossiper-based topology changes, but it is viable when we have a
fault-tolerant topology coordinator.
Calculate a new CDC generation using the bootstrapping node's tokens,
translate it to mutation format, and insert this mutation to the
CDC_GENERATIONS_V3 table through group 0 at the same time we assign
tokens to the node in Raft topology. The partition key for this data is
stored in the bootstrapping node's `ring_slice`.
The data is inserted, but it's not used for anything yet, we'll do it in
later commits.
Two FIXMEs are left for follow-ups:
- in `get_sharding_info` we shouldn't have to use the token owner's IP,
but get the host ID directly from token metadata (#12279),
- splitting the CDC generation data write into multiple commands. The
comment elaborates.
The variables necessary for constructing a `ring_slice` are now living in
a local block of code. This makes it easier to see which data is part of
the `ring_slice` and will make it easier to add more data to
`ring_slice` in following commits.
Also add some more sanity checking.
The method needs the proxy only to get data_dictionary::database from it, to pass down to select_statement::prepare(). And a legacy bit that can come with data_dictionary::database as well. Fortunately, all the call traces that end up at select_statement() start inside table:: methods that have view_update_generator, or at view_builder::consumer, which has a reference to view_builder. Both services can share the database reference. However, the call traces in question pass through several code layers, so the PR adds data_dictionary::database to those layers one by one.
Closes #13591
* github.com:scylladb/scylladb:
view_info: Drop calls to get_local_storage_proxy()
view_info: Add data_dictionary argument to select_statement()
view_info: Add data_dictionary argument to partition_slice() method
view_filter_checking_visitor: Construct with data_dictionary
view: Carry data_dictionary arg through standalone helpers
view_updates: Carry data_dictionary argument throug methods
view_update_builder: Construct with data dictionary
table: Push view_update_generator arg to affected_views()
view: Add database getters to v._update_generator and v._builder
The `CDC_GENERATIONS_V3` table schema is a copy-paste of the
`CDC_GENERATIONS_V2` schema. The difference is that V2 lives in
`system_distributed_keyspace` and writes to it are distributed using
regular `storage_proxy` replication mechanisms based on the token ring.
The V3 table lives in `system_keyspace` and any mutations written to it
will go through group 0.
Also extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`; it stores the UUID of a newly introduced CDC
generation, which is used as the partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
The function would generate a mutation timestamp for itself; take it as a
parameter instead. We'll use timestamps provided by Group 0 APIs when
creating CDC generations during Group 0-based topology changes.
It was a `static` function inside system_distributed_keyspace. Later it
will be used for another table living in system_keyspace, so move it
outside, to the CDC generations module, and make it accessible from
other places.