Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#9103
compare overload was declared as "bool" even though it is a tri-cmp.
causes us to never use the speed-up shortcut (lessen search set),
in turn meaning more overhead for collections.
Closes#9104
When a table with compact storage has no regular column (only primary
key columns), an artificial column of type empty is added. Such column
type can't be returned via CQL so CDC Log shouldn't contain a column
that reflects this artificial column.
This patch does two things:
1. Make sure that CDC Log schema does not contain columns that reflect
the artificial column from a base table.
2. When composing mutation to CDC Log, ommit the artificial column.
Fixes#8410
Test: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8988
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
calss parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.
However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assuption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).
Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.
Also add an assert in CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
was false; better to crash on assert than this).
Ref #8889.
Closes#8896
A node with this commit, when creating a new CDC generation (during
bootstrap, upgrade, or when running checkAndRepairCdcStreams command)
will check for the CDC_GENERATIONS_V2 feature and:
- If the feature is enabled create the generation in the v2 format
and insert it into the new internal table. This is safe because
a node joins the feature only if it understands the new format.
- Otherwise create it in the v1 format, limiting its size as before,
and insert it into the old table.
The second case should only happen if we perform bootstrap or run
checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully
upgraded clusters the feature shall be enabled, causing all new
generations to use the new format.
This function given a generation ID retrieves its data from the internal
table in which the data resides. This depends on the version of the ID:
for _v1 we're using system_distributed.cdc_generation_descriptions, for
_v2 we're using the better
system_distributed_v2.cdc_generation_descriptions_v2 (see the previous
commit for detailed explanation of the superiority of the new table).
This is a new type of CDC generation identifiers. Compared to old IDs,
additionally to the timestamp it contains an UUID.
These new identifiers will allow a safer and more efficient algorithm of
introducing new generations into a cluster (introduced in a later commit).
For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes
- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)
tests: unit(dev)
"
* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
storage_service: Remove migration notifier dependency
cdc: Remove db_context::builder
cdc: Provide migration notifier right at once
cdc: Remove db_context::builder::with_migration_notifier
Right now the builder is just an opaque transfer between cdc_service
constructor args and cdc_service's db_context constructor args.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only way db_context's migration notifier reference is set up
is via cdc_service->db_context::builder->.build chain of calls.
Since the builder's notifier optional reference is always
disengaged (the .with_migration_notifier is removed by previous
patch) the only possible notifier reference there is from the
storage service which, in turn, is the same as in main.cc.
Said that -- push the notifier reference onto db_context directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before this change, cdc$deleted_ columns were all NULL in pre-images.
Lack of such information made it hard to correctly interpret the
pre-image rows, for example:
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
For this example, pre-image generated for the second operation
would look like this (in both 'true' and 'full' pre-image mode):
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
v=NULL has two meanings:
1. If pre-image was in 'true' mode, v=NULL describes that v was not
affected (affected columns: pk, ck, v2).
2. If pre-image was in 'full' mode, v=NULL describes that v was equal
to NULL in the pre-image.
Therefore, to properly decode pre-images you would need to know in
which mode pre-image was configured on the CDC-enabled table at the
moment this CDC log row was inserted. There is no way to determine
such information (you can only check a current mode of pre-image).
A solution to this problem is to fill in the cdc$deleted_ columns
for pre-images. After this change, for the INSERT described above, CDC
now generates the following log row:
If in pre-image 'true' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
If in pre-image 'full' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
A client library now can properly decode a pre-image row. If it sees
a NULL value, it can now check the cdc$deleted_ column to determine
if this NULL value was a part of pre-image or it was omitted due to
not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images
are always generated in the 'full' mode.
Additional example of trouble decoding pre-images before this change.
tbl2 - 'true' pre-image mode, tbl3 - 'full' pre-image mode:
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
Both pre-images look the same, but:
1. v=NULL in tbl2 describes v being omitted from the pre-image.
2. v=NULL in tbl3 described v being NULL in the pre-image.
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
Improve the exception message of providing invalid "ttl" value to the
table.
Previously, if you executed a CREATE TABLE query with invalid "ttl"
value, you would get a non-descriptive error message:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
This commit adds more descriptive exception messages:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
Add validation of "enable" and "postimage" CDC options. Both options
are boolean options, but previously they were not validated, meaning
you could issue a query:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
and it would be executed without any errors, silently interpreting
"dsfdsd" as false.
This commit narrows possible values of those boolean CDC options to
false, true, 0, 1. After applying this change, issuing the query above
would result in this error message:
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
CDC log uses `bytes` to deal with cells and their values, and linearizes all
values indiscriminately. This series makes a switch from `bytes` to
`managed_bytes` to avoid that linearization.
Fixes#7506.
Closes#8429
* github.com:scylladb/scylla:
cdc: log: change yet another occurence of `bytes` to `managed_bytes`
cdc: log: switch the remaining usages of `bytes` to `managed_bytes` in collection_visitor
cdc: log: change `deleted_elements` in log_mutation_builder from bytes to managed_bytes
cdc: log: rewrite collection merge to use managed_bytes instead of bytes
cdc: log: don't linearize collections in get_preimage_col_value
cdc: log: change return type of get_preimage_col_value to managed_bytes
cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell
cdc: log: switch cell_map from bytes to managed_bytes
cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view
cdc: log: don't linearize the primary key in log_mutation_builder
atomic_cell: add yet another variant of make_live for managed_bytes_view
compound: add explode_fragmented
The "most important" major changes are:
1. storage_service: simplify CDC generation management during node replace
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.
But that's not necessary:
- if this is the timestamp of the last (current) generation, we would
obtain it from other nodes anyway (every node gossips the last known
timestamp),
- if this is the timestamp of an earlier generation, we would forget
it immediately and start gossiping the last timestamp (obtained from
other nodes).
This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
2. tree-wide: introduce cdc::generation_id type
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
- if a piece of code doesn't care about the timestamp, it just passes
the ID around
- if it does care, it can access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
3. cdc: handle missing generation case in check_and_repair_cdc_streams
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
---
Additionally the PR does some simplifications, stylistic improvements,
removes some dead code, coroutinizes some functions, uncoroutinizes others
(due to miscompiles), adds additional logging, updates some stale comments.
Read commit messages for more details.
Closes#8283
* github.com:scylladb/scylla:
cdc: log a message when creating a new CDC generation
cdc: handle missing generation case in check_and_repair_cdc_streams
tree-wide: introduce cdc::generation_id type
tree-wide: rename "cdc streams timestamp" to "cdc generation id"
cdc: remove some functions from generation.hh
storage_service: make set_gossip_tokens a static free-function
db: system_keyspace: group cdc functions in single place
cdc: get rid of "get_local_streams_timestamp"
sys_dist_ks: update comment at quorum_if_many
storage_service: simplify CDC generation management during node replace
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.
The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.
For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.
* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
min_time_UUID(), max_time_UUID(), create_time_safe() to
std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
This function retrieves the persisted timestamp of the last known CDC
generation (which this node is currently gossiping to other nodes).
It checks that the timestamp is present; if not, it throws an error.
The check is unnecessary. It's used only in a quite esoteric place
(start_gossiping, which implements an almost-never-used API call),
and it's fine if the timestamp is gone - in start_gossiping,
we can start gossiping the tokens without the CDC generation timestamp
(well, if the timestamp is not present in system tables, something
weird must have happened, but that doesn't mean we can't resume
gossiping - fixing CDC generation management in such a case is
a separate problem).
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Refs #7961Fixes#8014
Instead of doing a deep copy of input, we keep assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.
Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.
Otoh, it allows writing code using this that avoids data copying,
depending on destination.
v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase
v3:
* Changed shared_ptr to unique
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.