Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.
This makes `system.local#supported_features` column to store
advertised feature set. Leave a comment in the definition of
`system.local` schema to reflect that.
Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.
Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is a utility function for writing the supported
feature set to the `system.local` table.
Will be used to move the corresponding part from
`system_keyspace::setup_version` to the gossiper
after passing remote feature check, effectively making
`system.local#supported_features` store the advertised
features (which already passed the feature check).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
To ensure consistency of schema and topology changes,
Scylla needs a linearizable storage for this data
available at every member of the database cluster.
The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.
The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0), adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership, ensures the group is started when the
server boots.
The state machine for the group, i.e. the actual storage
for cluster-wide information still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.
Another Raft related service was implemented earlier: Raft Group
Registry. The purpose of the registry is to allow Scylla have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"
* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
raft: (server) improve tracing
raft: (metrics) fix spelling of waiters_awaken
raft: make forwarding optional
raft: (service) manage Raft configuration during topology changes
raft: (service) break a dependency loop
raft: (discovery) introduce leader discovery state machine
system_keyspace: mark scylla_local table as always-sync commitlog
system_keyspace: persistence for Raft Group 0 id and Raft Server Id
raft: add a test case for adding entries on follower
raft: (server) allow adding entries/modify config on a follower
raft: (test) replace virtual with override in derived class
raft: (server) fix a typo in exception message
raft: (server) implement id() helper
raft: (server) remove apply_dummy_entry()
raft: (test) fix missing initialization in generator.hh
"
This set covers simple but diverse cases:
- cache hitrace calculator
- repair
- system keyspace (virtual table)
- dht code
- transport event notifier
All the places just require straightforward arguments passing.
And a reparation in transport -- event notifier needs a backref
to the owning server.
Remaining after this set is the snitch<->gossiper interaction
and the cache hitrate app state update from table code.
tests: unit(dev)
"
* 'br-unglobal-gossiper-cont' of https://github.com/xemul/scylla:
transport: Use server gossiper in event notifier
transport: Keep backreference from event_notifier
transport: Keep gossiper on server
dht: Pass gossiper to range_streamer::add_ranges
dht: Pass gossiper argument to bootstrap
system_keyspace: Keep gossiper on cluster_status_table
code: Carry gossiper down to virtual tables creation
repair: Use local gossiper reference
cache_hitrate_calculator: Keep reference on gossiper
Re-enable previously persisted enabled features on node
startup. The features list to be enabled is read from
`system.local#enabled_features`.
In case an unknown feature is encountered, the node
fails to boot with an exception, because that means
the node is doing a prohibited downgrade procedure.
Features should be enabled before commitlog starts replaying
since some features affect storage (for example, when
determining used sstable format).
This patch implements a part of solution proposed by Tomek
in https://github.com/scylladb/scylla/issues/4458.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
It is infrequently updated (typically once at start) but stores
critical state for this instance survival (Raft Group 0 id, Raft
server id, sstables format), so always write it to commit log
in sync mode.
Implement system_keyspace helpers to persist Raft Group 0 id
and Raft Server id.
Do not use coroutines in a template function to work around
https://bugs.llvm.org/show_bug.cgi?id=50345
One of the tables needs gossiper and uses global one. This patch
prepares the fix by patching the main -> register_virtual_tables
stack with the gossiper reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db::config reference is available on the database, which
can be get from the virtual_table itself. The problem is that
it's a const refernece, while system.config will be updateable
and will need non-const reference.
Adding non-const get_config() on the database looks wrong. The
database shouldn't be used as config provider, even the const
one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_local_host_id() accepts UUID references and starts
to save it in local keyspace and in all shards' local cache.
Before it was coroutinized the UUID was copied on captures
and survived, after it it remains references. The problem is
that callers pass local variables as arguments that go away
"really soon".
Fix it to accept UUID as value, it's short enough for safe
and painless copy.
fixes: #9425
tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211004145421.32137-1-xemul@scylladb.com>
Some places in the code want to have future-less access to the
host id, now they do it all by themselves. Local cache seems to
be a better place (for the record -- some time ago the "better
place" argument justified cached host id relocation from the
storage_service onto the database).
While at it -- add the future-less getter for the host_id to be
used further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will appear the future-less method which better deserves
the get_ prefix, so give the existing method the load_ one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided depdendencies.
Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.
Indentation is a mess, but can be easily fixed later.
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.
The result isn't lovely, but can be adjusted later.
Merge two fragments together, in anticipation of making 'legacy'
s struct instead of a namespace (when system_keyspace is a class,
we can't nest a namespace inside it).
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
documentation in the future.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Rename `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` stucture contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
group_id timeuuid,
server_id uuid,
snapshot_id uuid,
idx int,
term int,
-- no `config` field here, moved to `raft_config` table
PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
group_id timeuuid,
my_server_id uuid,
server_id uuid,
disposition text, -- can be either 'CURRENT` or `PREVIOUS'
can_vote bool,
ip_addr inet,
PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is a new type of CDC generation identifiers. Compared to old IDs,
additionally to the timestamp it contains an UUID.
These new identifiers will allow a safer and more efficient algorithm of
introducing new generations into a cluster (introduced in a later commit).
For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
The code that creates system keyspace open code a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
System raft table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.
The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.
The raft table stores the only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
There are few places that initialize db and system_ks and need the
messaging service. Pass the reference to it from main instead of
using the global helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#6341
Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migtration thereof.
Make sure the above holds whereever this is committed.
Note that this does not remove the "truncated_at" field in
system.local.
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some caller have partition_key_view, but not partition_key, so thy need
to create a temporary and copy just to pass a reference. Change it by
accepting a view.
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully next round on the same
key may complete the learning using this info. But if all nodes learned
the value the entry does not serve useful purpose any longer.
The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round. It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.
Fixes#5779
Message-Id: <20200330154853.GA31074@scylladb.com>
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace shema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
This wrapper just makes sure the system_keyspace::get_saved_tokens
reports non empty result. Move them close together.
As a side effect -- get rid of penultimate global storage_service
reference from size_estimates_virtual_reader (the last one will
be removed soon).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
There's a lot of code around that needs storage service purely to
get the specific feature value (cluster_supports_<something> calls).
This creates several circular dependencies, e.g. storage_service <->
migration_manager one and database <-> storage_servuce. Also features
sit on storage_service, but register themselfs on the feature_service
and the former subscribes on them back which also looks strange.
I propose to keep all the features on feature_service, this keeps the
latter intependent from other components, makes it possible to break
one of the mentioned circle dependencyand heavily relax the other.
Also the set helps us fighting the globals and, after it, the
feature_service can be safely stopped at the very last moment.
Tests: unit(dev), manual debug build start-stop
"
* 'br-features-to-service-5' of https://github.com/xemul/scylla:
gossiper: Avoid string merge-split for nothing
features: Stop on shutdown
storage_service: Remove helpers
storage_service: Prepare to switch from on-board feature helpers
cql3: Check feature in .validate
database: Use feature service
storage_proxy: Use feature service
migration_manager: Use feature service
start: Pass needed feature as argument into migrate_truncation_records
features: Unfriend storage_service
features: Simplify feature registration
features: Introduce known_feature_set
features: Move disabled features set from storage_service
features: Move schema_features helper
features: Move all features from storage_service to feature_service
storage_service: Use feature_config from _feature_service
features: Add feature_config
storage_service: Kill set_disabled_features
gms: Move features stuff into own .cc file
migration_manager: Move some fns into class
This will be used to persist CDC streams generation timestamp
proposed by a joining node in case the node crashes or restarts,
similarly to the way tokens are persisted.
The get_saved_cdc_streams_timestamp method retrieves the generation
timestamp from the system table. It will be used by a restarting
node.
The update_cdc_streams_timestamp method saves CDC stream
generation timestamp of the calling node in the system table.
A joining node will persist the timestamp before it proposes it to other
nodes.
This change introduces system.clients table, which provides
information about CQL clients connected.
PK is the client's IP address, CK consists of outgoing port number
and client_type (which will be extended in future to thrift/alternator/redis).
Table supplies also shard_id and username. Other columns,
like connection_stage, driver_name, driver_version...,
are currently empty but exist for C* compatibility and future use.
This is an ordinary table (i.e. non-virtual) and it's updated upon
accepting connections. This is also why C*'s column request_count
was not introduced. In case of abrupt DB stop, the table should not persist,
so it's being truncated on startup.
Resolves#4820
Addition of cdc column in scylla_tables changes how schema
digests are calculated, and affect the ABI of schema update
messages (adding a column changes other columns' indexes
in frozen_mutation).
To fix this, extend the schema_tables mechanism with support
for the cdc column, and adjust schemas and mutations to remove
that column when sending schemas during upgrade.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>