`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommission a node and the
current generation has more stream IDs than you need. In that case you can run
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.
It also works when you change the number of shards on some node. We don't
automatically introduce a new generation in that case, but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.
This PR implements the API on top of Raft topology; it was originally
implemented using the gossiper. It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.
Closes #13683
* github.com:scylladb/scylladb:
docs: update topology-over-raft.md
test: topology_experimental_raft: test `check_and_repair_cdc` API
raft topology: implement `check_and_repair_cdc_streams` API
raft topology: implement global request handling
raft topology: introduce `prepare_new_cdc_generation_data`
raft_topology: `get_node_to_work_on_opt`: return guard if no node found
raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
raft topology: separate `publish_cdc_generation` state
raft topology: non-node-specific `exec_global_command`
raft topology: introduce `start_operation()`
raft topology: non-node-specific `topology_mutation_builder`
topology_state_machine: introduce `global_topology_request`
topology_state_machine: use `uint16_t` for `enum_class`es
raft topology: make `new_cdc_generation_data_uuid` topology-global
It was already outdated before this PR.
Describe the version of topology state machine implemented in this PR.
Fix some typos and make it proper markdown so it renders nicely on
GitHub etc.
Update the `Generation switching` section: most of the existing
description landed in `Gossiper-based topology changes` subsection, and
a new subsection was added to describe Raft group 0 based topology
changes. Marked as WIP - we expect further development in this area
soon.
The existing gossiper-based description was also updated a bit.
Greatly expand on the details of how the semaphore works.
Organize the content into thematic chapters to improve navigation.
Improve formatting while at it.
The names of these states have been the source of confusion ever since
they were introduced. Give them names which better reflect their true
meaning and leave less room for misinterpretation. The changes are:
* active/unused -> active
* active/used -> active/need_cpu
* active/blocked -> active/await
Hopefully the new names do a better job at conveying what these states
really mean:
* active - a regular admitted permit, which is active (as opposed to
an inactive permit).
* active/need_cpu - an active permit which was marked as needing CPU for
the read to make progress. This permit prevents admission of new
permits while it is in this state.
* active/await - a former active/need_cpu permit, which has to wait on
I/O or a remote shard. While in this state, it doesn't block the
admission of new permits (pending other criteria such as resource
availability).
The schema is
```
CREATE TABLE system.sstables (
    location text,
    generation bigint,
    format text,
    status text,
    uuid uuid,
    version text,
    PRIMARY KEY (location, generation)
)
```
A sample entry looks like:
```
 location                                                            | generation | format | status | uuid                                 | version
---------------------------------------------------------------------+------------+--------+--------+--------------------------------------+---------
 /data/object_storage_ks/test_table-d096a1e0ad3811ed85b539b6b0998182 |          2 |    big | sealed | d0a743b0-ad38-11ed-85b5-39b6b0998182 |      me
```
The uuid field points to the "folder" on the storage where the sstable
components are. Like this:
```
s3
`- test_bucket
   `- f7548f00-a64d-11ed-865a-0c1fbc116bb3
      `- Data.db
       - Index.db
       - Filter.db
       - ...
```
It's not very nice that the whole /var/lib/... path is in fact used as
the location; PR #12707 is needed to fix this place.
Also, the "status" part is not yet fully functional, it only supports
three options:
- creating -- the same as TemporaryTOC file exists on disk
- sealed -- default state
- deleting -- the analogy for the deletion log on disk
The latter needs support from the distributed_loader, which is not yet
there. In fact, distributed_loader also needs to be patched to actually
select entries from this table on load. It also needs the mentioned
PR #12707 to support staging and quarantine sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology state machine will track all the nodes in a cluster,
their state, properties (topology, tokens, etc) and requested actions.
Node state can be one of these:
none - the node is not yet in the cluster
bootstrapping - the node is currently bootstrapping
decommissioning - the node is being decommissioned
removing - the node is being removed
replacing - the node is replacing another node
normal - the node is working normally
rebuild - the node is being rebuilt
left - the node has left the cluster
Nodes in state left are never removed from the state.
Tokens also can be in one of the states:
write_both_read_old - writes go to both the new and old replicas, but reads still
come from the old replicas
write_both_read_new - writes still go to both the old and new replicas, but reads
come from the new replicas
owner - tokens are owned by the node, and reads and writes go to the new
replica set only
Tokens that need to be moved start in the 'write_both_read_old' state. After the entire
cluster learns about it, streaming starts. After the streaming, tokens move
to the 'write_both_read_new' state, and again the whole cluster needs to learn about it
and make sure no reads started before that point exist in the system.
After that, tokens may move to the 'owner' state.
topology_request is the field through which a topology operation request
can be issued to a node. A request is one of the topology operations
currently supported: join, leave, replace or remove.
Until now, the instructions on generating wasm files and using them
for Scylla UDFs were stored in docs/dev, so they were not visible
on the docs website. Now that the Rust helper library for UDFs
is ready, and we're inviting users to try it out, we should also
make the rest of the Wasm UDF documentation readily available
for the users.
Closes #13139
When the WASM UDFs were first introduced, the LANGUAGE required in
the CQL statements to use them was "xwasm", because the ABI for the
UDFs was still not specified and changes to it could be backwards
incompatible.
Now, the ABI is stabilized, but if backwards incompatible changes
are made in the future, we will add a new ABI version for them, so
the name "xwasm" is no longer needed and we can finally
change it to "wasm".
Closes #13089
The WASM UDF implementation has changed since the last time the docs
were written. In particular, the Rust helper library has been
released, and using it should be the recommended method.
Some decisions that were only experimental at the start were also
"set in stone", so we should refer to them as such.
The docs also contain some code examples. This patch adds tests for
these examples to make sure that they are not wrong and misleading.
Closes #12941
Closes #12071
* github.com:scylladb/scylladb:
docs/dev: building.md: mention node-exporter packages
docs/dev: building.md: replace `dev` with `<mode>` in list of debs
The developer documentation in `building.md` suggested running unit tests with the `./tools/toolchain/dbuild test` command; however, this command only invokes the `test` shell builtin, which immediately returns with status `1`:
```
[piotrs@new-host scylladb]$ ./tools/toolchain/dbuild test
[piotrs@new-host scylladb]$ echo $?
1
```
This was probably an unintended mistake, and what the author really meant was invoking `dbuild ninja test`.
Closes #12890
This extracts information which was there in row_cache.md, but is
relevant to MVCC in general.
It also makes adaptations and reflects the upcoming changes in this
series related to switching to the new mutation_partition_v2 model:
- continuity in evictable snapshots can now overlap. This is needed
to represent range tombstone information, which is linked to
continuity information.
- description of range tombstone representation was added
Rename `system.raft_config` to `system.raft_snapshot_config` to make it clearer
what the table stores.
Remove the `my_server_id` partition key column from
`system.raft_snapshot_config` and a corresponding column from
`system.raft_snapshots` which would store the Raft server ID of the local node.
It's unnecessary: all servers running on a given node in different groups will
use the same ID - the Raft ID of the node, which is equal to its Host ID. There
will never be multiple servers of the same Raft group running on the same node.
Closes #12513
* github.com:scylladb/scylladb:
db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Leave the guide for manual opening in, though; the script might not work
in all cases.
Also update the version example; we changed how development versions
look.
Closes #12511
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).
In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
Currently, we call cargo build every time we build scylla, even
when no rust files have been changed.
This is avoided by adding a depfile to the ninja rule for the rust
library.
The dep info file is generated by default during cargo build,
but it uses the full paths of all dependencies that it includes,
while we use relative paths. This is fixed by specifying
CARGO_BUILD_DEP_INFO_BASEDIR='.', which makes it so the current
path is subtracted from all generated paths.
Instead of using 'always' when specifying when to run the cargo
build, a dependency on Cargo.lock is added additionally to the
depfile. As a result, the rust files are recompiled not only
when the source files included in the depfile are modified,
but also when some rust dependency is updated.
Cargo may put an old cached file as the result of the build even
when Cargo.lock was recently updated. Because of that, the
build result may be older than the Cargo.lock file even
if the build was just performed. This may cause ninja to rebuild
the file every following time. To avoid this, we 'touch' the
build result, so that its last modification time is up to date.
Because the dependency on Cargo.lock was added, the new build
command does not modify it. Instead, the developer must
update it when modifying the dependencies; the docs are updated
to reflect that.
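Putting the pieces together, the resulting ninja rule could look roughly like the sketch below. Rule, target and path names here are hypothetical, not the actual build.ninja contents:

```
# Hypothetical sketch: cargo emits dep info at $out.d (relative paths,
# thanks to CARGO_BUILD_DEP_INFO_BASEDIR='.'); touch keeps the output
# newer than Cargo.lock; the implicit dep on Cargo.lock triggers
# rebuilds when a rust dependency is updated.
rule rust_lib
  command = CARGO_BUILD_DEP_INFO_BASEDIR='.' cargo build && touch $out
  depfile = $out.d

build rust/librust_combined.a: rust_lib | rust/Cargo.lock
```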
Closes #12489
Fixes #12508
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).
As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.
In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.
As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:
The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains
```
if (reverse == consume_in_reverse::yes) {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
} else {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
}
```
So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.
Refs: #12353
Closes #12453
* github.com:scylladb/scylladb:
mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes
mutation: move consume_in_reverse def to mutation_consumer.hh
Currently, the rust build system in Scylla creates a separate
static library for each included rust package. This could cause
duplicate symbol issues when linking against multiple libraries
compiled from rust.
This issue is fixed in this patch by creating a single static library
to link against, which combines all rust packages implemented in
Scylla.
The Cargo.lock for the combined build is now tracked, so that all
users of the same scylla version also use the same versions of
imported rust modules.
Additionally, the rust package implementation and usage
docs are modified to be compatible with the build changes.
This patch also adds a new header file 'rust/cxx.hh' that contains
definitions of additional rust types available in c++.
Add instructions on how to backport a feature to an older version of Scylla.
It contains detailed step-by-step instructions so that people unfamiliar with the intricacies of Scylla's repository organization can easily get the hang of it.
This is the guide I wish I had when I had to do my first backport.
I put it in backport.md because that looks like the file responsible for this sort of information.
For a moment I thought about `CONTRIBUTING.md`, but this is a really short file with general information, so it doesn't really fit there. Maybe in the future there will be some sort of unification (see #12126).
Closes #12138
* github.com:scylladb/scylladb:
dev/docs: add additional git pull to backport docs
docs/dev: add a note about cherry-picking individual commits
docs/dev: use 'is merged into' instead of 'becomes'
docs/dev: mention that new backport instructions are for the contributor
docs/dev: Add backport instructions for contributors
The diagnostics dumped by the reader concurrency semaphore are a pretty
common sight in logs as soon as a node becomes problematic. The reason
is that the reader concurrency semaphore acts as the canary in the coal
mine: it is the first to start screaming when the node or workload is
unhealthy. This patch adds documentation of the content of the
diagnostics and how to diagnose common problems based on it.
Fixes: #10471
Closes #11970
Some people prefer to cherry-pick individual commits
so that they have fewer conflicts to resolve at once.
Add a comment about this possibility.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The backport instructions said that, after passing
the tests, next `becomes` master, but it's more
accurate to say that next `is merged into` master.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Previously the section was called:
"How to backport a patch", which could be interpreted
as instructions for the maintainer.
The new title clearly states that these instructions
are for the contributor in case the maintainer couldn't
backport the patch by themselves.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add instructions on how to backport a feature
to an older version of Scylla.
It contains detailed step-by-step instructions
so that people unfamiliar with the intricacies
of Scylla's repository organization can
easily get the hang of it.
This is the guide I wish I had when I had
to do my first backport.
I put it in backport.md because that
looks like the file responsible
for this sort of information.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes #11449
Closes #11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.
In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).
I chose to call "the portion of memtables that has been written
to disk" "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we derive "unspooled memory".
I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.
The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
And bump the schema version offset since the new schema
should be distinguishable from the previous one.
Refs scylladb/scylladb#11660
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Google Groups recently started rewriting the From: header, garbling
our git log. This script rewrites it back, using the Reply-To header
as a still-working source.
Closes #11416
For generating #include directives in the generated files,
so we don't have to hand-craft the #include dependencies
in the right order.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>