We remove the flush from set_scylla_local_param_as
since it's now redundant, and add it to
save_local_enabled_features, as features need to
be available before schema commitlog replay.
We skip the flush when save_local_enabled_features
is called from topology_state_load: once the features
are migrated to system.topology, we no longer need
strict durability.
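A minimal plain-C++ sketch of the intended flush placement; the
`durability` knob and the helper names are made up for illustration
(the real code is seastar-based and asynchronous):
```
#include <iostream>
#include <set>
#include <string>

enum class durability { strict, relaxed };

// models the write to system.scylla_local
void write_features_row(const std::set<std::string>& features) {
    std::cout << "persisted " << features.size() << " feature(s)\n";
}

// models a memtable flush, making the row readable without commitlog replay
void flush_scylla_local() {
    std::cout << "flushed system.scylla_local\n";
}

// The flush lives here (not in the generic set_scylla_local_param path):
// features must be readable before schema commitlog replay.
void save_local_enabled_features(std::set<std::string> features, durability d) {
    write_features_row(features);
    if (d == durability::strict) {
        flush_scylla_local();
    }
    // topology_state_load passes durability::relaxed: once the features
    // live in system.topology, strict durability of this copy is not needed
}

int main() {
    save_local_enabled_features({"SCHEMA_COMMITLOG"}, durability::strict);
    save_local_enabled_features({"SCHEMA_COMMITLOG"}, durability::relaxed);
}
```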
We want to switch the system.scylla_local table to the
schema commitlog, but the load phases get in the way:
the schema commitlog is initialized after phase 1,
so a table which uses it should be moved to phase 2,
but system.scylla_local contains the features, and we need
them before schema commitlog initialization for the
SCHEMA_COMMITLOG feature.
In this commit we take a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commitlog. To achieve this we add a _readonly bool
field to the table class, initialized to true in the
table's constructor. In addition, we change the table
constructor to always assign nullptr to the commitlog,
and we trigger an internal error if the table.commitlog()
property is accessed while the table is in readonly
mode. Then, after triggering the on_system_tables_loaded
notifications on feature_service and sstable_format_selector,
we call system_keyspace::mark_writable and eventually
table::mark_ready_for_writes, which selects the
proper commitlog and marks the table as writable.
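A simplified model of that lifecycle, with made-up types standing in
for the real seastar-based classes:
```
#include <cassert>
#include <stdexcept>

struct commitlog {};

class table {
    bool _readonly = true;           // every table starts readonly
    commitlog* _commitlog = nullptr; // no commitlog assigned at construction
public:
    // accessing the commitlog of a readonly table is an internal error
    commitlog* get_commitlog() const {
        if (_readonly) {
            throw std::logic_error("commitlog accessed on a readonly table");
        }
        return _commitlog;
    }
    // called after system tables are loaded and the notifications fired:
    // selects the proper commitlog and opens the table for writes
    void mark_ready_for_writes(commitlog* cl) {
        _commitlog = cl;
        _readonly = false;
    }
    bool writable() const { return !_readonly; }
};

int main() {
    commitlog schema_cl;
    table t;
    assert(!t.writable());               // phase: loading, readonly
    t.mark_ready_for_writes(&schema_cl); // system_keyspace::mark_writable path
    assert(t.get_commitlog() == &schema_cl);
}
```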
In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant:
the table has already been made writable by the
env.make_table_for_tests call.
The table::commitlog function either returns the current
commitlog or triggers an internal error if the table is
readonly. This didn't work for virtual tables, since they
never called mark_ready_for_writes. In this commit we add
this call to initialize_virtual_tables.
Our goal is to switch system.local table to schema
commitlog and stop doing flushes when we write to it.
This means it would be incorrect to read from this
table until schema commitlog is replayed.
On the other hand, we need truncation records
to be loaded before we start replaying schema
commitlog, since commitlog_replayer relies on them.
In this commit we inline the system_keyspace::setup
function and split its content into two parts. In
the first part, before schema commitlog replay,
we load the truncation records. It's safe to load
them before schema commitlog replay since we intend
to keep the flushes on writes to the system.truncated
table. In the second part, after schema commitlog replay,
we do the rest of the job: build_bootstrap_info and
db::schema_tables::save_system_schema.
We decided to inline this function since there is
very low cohesion between the actions it's performing.
It's just simpler to reason about them individually.
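The resulting boot order, sketched with stand-in functions (the real
steps are asynchronous and live in main):
```
#include <iostream>

// stand-ins for the boot steps, in the order this commit establishes
void load_truncation_records() { std::cout << "truncation records loaded\n"; }
void replay_schema_commitlog() { std::cout << "schema commitlog replayed\n"; }
void build_bootstrap_info()    { std::cout << "bootstrap info built\n"; }
void save_system_schema()      { std::cout << "system schema saved\n"; }

int main() {
    // part one: safe before replay, because writes to system.truncated
    // keep flushing, so the records are already in sstables
    load_truncation_records();
    // commitlog_replayer can now rely on the truncation records
    replay_schema_commitlog();
    // part two: may read tables served by the schema commitlog
    build_bootstrap_info();
    save_system_schema();
}
```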
This is a readability refactoring commit without observable changes
in behaviour.
initialize_virtual_tables logically belongs to the virtual_tables
module, and moving it allows making the other functions in
virtual_tables.cc (register_virtual_tables, install_virtual_readers)
local to the module, which simplifies matters a bit.
all_virtual_tables() is not needed anymore: all the references to
registered virtual tables are now local to the virtual_tables module
and can just use the virtual_tables variable directly.
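A sketch of the resulting module shape (names from the commit message,
internals simplified):
```
#include <memory>
#include <vector>

struct virtual_table { virtual ~virtual_table() = default; };

namespace db {

// file-local registry: nothing outside virtual_tables.cc can reach it,
// which is what makes all_virtual_tables() unnecessary
static std::vector<std::unique_ptr<virtual_table>> virtual_tables;

static void register_virtual_tables() { /* fill `virtual_tables` */ }
static void install_virtual_readers() { /* wire a reader per entry */ }

// the single exported entry point of the module
void initialize_virtual_tables() {
    register_virtual_tables();
    install_virtual_readers();
    // (per the earlier commit, the tables are also marked writable here)
}

} // namespace db

int main() { db::initialize_virtual_tables(); }
```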
On boot the system keyspace is kicked to insert local info into the
system.local table. Among other things there's the dc:rack pair, which
sys.ks. gets from its cache, which, in turn, should have been previously
initialized from the snitch on sys.ks. start. This patch makes the local
info updating method get the dc:rack from the caller via an argument.
The callers, i.e. the main and cql_test_env startup routines, call the
snitch directly.
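A shape-only sketch of the change, with toy types in place of the real
interfaces:
```
#include <iostream>
#include <string>

// models the snitch that the callers now query directly
struct snitch {
    std::string get_datacenter() const { return "dc1"; }
    std::string get_rack() const { return "rack1"; }
};

struct system_keyspace {
    // before: dc/rack came from an internal cache primed from the snitch
    // on sys.ks. start; after: the caller passes them in explicitly
    void update_local_info(const std::string& dc, const std::string& rack) {
        std::cout << "system.local <- dc=" << dc << " rack=" << rack << "\n";
    }
};

int main() {
    snitch s; // main / cql_test_env own the snitch anyway
    system_keyspace sys_ks;
    sys_ks.update_local_info(s.get_datacenter(), s.get_rack());
}
```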
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On boot several manipulations with system.local are performed.
1. The host_id value is selected from it with key = local.
   If not found, system_keyspace generates a new host_id, inserts the
   new value into the table and returns it.
2. The cluster_name is selected from it with key = local.
   Then it's system_keyspace that either checks that the name matches
   the one from db::config, or inserts the db::config value into the
   table.
3. The row with key = local is unconditionally updated with various
   info like versions, listen, rpc and bcast addresses, dc, rack, etc.
All three steps are scattered over main: p.1 is called directly, p.2 and
p.3 are executed via system_keyspace::setup(), which happens rather late.
Also the cql_test_env startup code touches this table a bit.
The proposal is to collect this setup into one place and execute it
early -- as soon as the system.local table is populated. This frees the
system_keyspace code from the logic of selecting the host id and cluster
name, leaving it to main, and keeps sys.ks. with only select/insert work.
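A toy sketch of the select-or-insert logic of p.1 and p.2, with made-up
stand-ins for the stored row and the db::config value:
```
#include <iostream>
#include <optional>
#include <string>

// stand-ins for the row with key = local
std::optional<std::string> stored_host_id;
std::optional<std::string> stored_cluster_name;

std::string generate_host_id() { return "generated-host-id"; }

int main() {
    const std::string config_name = "test-cluster"; // from db::config

    // p.1: select host_id; generate and insert it if absent
    if (!stored_host_id) {
        stored_host_id = generate_host_id();
    }
    // p.2: the cluster name must match db::config, or be inserted from it
    if (!stored_cluster_name) {
        stored_cluster_name = config_name;
    } else if (*stored_cluster_name != config_name) {
        std::cerr << "cluster name mismatch\n";
        return 1;
    }
    // p.3 (unconditional update of versions/addresses/dc/rack) follows
    std::cout << "host_id=" << *stored_host_id
              << " cluster=" << *stored_cluster_name << "\n";
}
```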
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15082
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
Those without the `_as` suffix are just marked non-static.
The `..._as` ones are made class methods (now they are local to system_keyspace.cc).
After that the `..._as` ones are patched to use `this->` instead of `qctx`.
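A before/after sketch of the pattern, with a toy query_processor in
place of the real one:
```
#include <iostream>
#include <string>

struct query_processor {
    void execute(const std::string& q) { std::cout << "exec: " << q << "\n"; }
};

class system_keyspace {
    query_processor& _qp;

    // was: a static helper going through the global qctx;
    // now: a non-static method (local to system_keyspace.cc) using this->_qp
    template <typename T>
    T get_scylla_local_param_as(const std::string& key) {
        _qp.execute("SELECT value FROM system.scylla_local WHERE key='" + key + "'");
        return T{}; // value conversion elided
    }
public:
    explicit system_keyspace(query_processor& qp) : _qp(qp) {}
    int version() { return get_scylla_local_param_as<int>("version"); }
};

int main() {
    query_processor qp;
    system_keyspace sys_ks(qp);
    sys_ks.version();
}
```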
Closes #14890
* github.com:scylladb/scylladb:
system_keyspace: Stop using qctx in [gs]et_scylla_local_param_as()
system_keyspace: Reuse container() and _db member for flushing
system_keyspace: Make [gs]et_scylla_local_param_as() class methods
system_keyspace: De-static [gs]et_scylla_local_param()
These are now two .cc-local templatized helpers, but they are only
called by system_keyspace:: non-static methods, so they can become
non-static methods as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All same-class callers are now non-static methods of system_keyspace,
all external callers do it via an object at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not nice.
Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them access the private `query_processor& _qp` member
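A compact sketch of the friend arrangement (toy types; the real
methods are asynchronous and take more arguments):
```
#include <iostream>
#include <string>

struct query_processor {
    void execute(const std::string& q) { std::cout << q << "\n"; }
};

class system_keyspace;

namespace db::schema_tables {
void load_column_mapping(system_keyspace&);
}

class system_keyspace {
    query_processor& _qp; // private: normally only sys.ks. methods use it
    // the column-mapping queries live in schema_tables but target a
    // system keyspace table, so they are befriended to reach _qp
    friend void db::schema_tables::load_column_mapping(system_keyspace&);
public:
    explicit system_keyspace(query_processor& qp) : _qp(qp) {}
};

namespace db::schema_tables {
void load_column_mapping(system_keyspace& sys_ks) {
    // needs LOCAL_ONE CL and bespoke caching, hence direct _qp access
    sys_ks._qp.execute(
        "SELECT * FROM system.scylla_table_schema_history WHERE cf_id = ?");
}
}

int main() {
    query_processor qp;
    system_keyspace sys_ks(qp);
    db::schema_tables::load_column_mapping(sys_ks);
}
```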
Closes #14876
* github.com:scylladb/scylladb:
schema_tables: Extract query_processor from system_keyspace for querying
schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
migration_manager: Add system_keyspace argument to get_schema_mapping()
The schema_tables() column-mapping code runs queries over a system
table, but it needs LOCAL_ONE CL and to cherry-pick caching options, so
the regular system_keyspace::execute_cql() won't work here.
However, since schema_tables is somewhat part of system_keyspace, it's
natural to let the former fetch private query_processor& from the latter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The callers are raft_group0_client, which has a sys.ks. dependency
reference, and group0_state_machine, with raft_group0_client exposing
its sys.ks. This makes it possible to instantly drop one more qctx
reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is raft_group0_client with a sys.ks. dependency reference.
This allows dropping one qctx reference right away
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fortunately, this is pretty simple -- the only caller is storage_service,
which has a sharded<system_keyspace> dependency reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14824
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes #14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Population of the messaging service preferred-IPs cache happens inside
the system keyspace setup() call, and it needs m.s. per se and
additionally the snitch. Moving the preferred-IP cache population to
initial configuration keeps the m.s. start more self-contained and
keeps system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This template call is only used by the system keyspace paxos methods.
All those methods are no longer static and can use the
system_keyspace::_qp reference to the real query processor instead of
the global qctx. The execute_cql_with_timeout() wrapper is moved to
system_keyspace to make this work
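A sketch of the moved wrapper as a member (toy types; the made-up
save_paxos_promise stands in for the real paxos methods):
```
#include <chrono>
#include <iostream>
#include <string>

struct query_processor {
    void execute(const std::string& q, std::chrono::milliseconds timeout) {
        std::cout << "exec in " << timeout.count() << "ms: " << q << "\n";
    }
};

class system_keyspace {
    query_processor& _qp;

    // the wrapper, now a private member: reaches the real query processor
    // through _qp instead of the global qctx
    void execute_cql_with_timeout(const std::string& q,
                                  std::chrono::milliseconds timeout) {
        _qp.execute(q, timeout);
    }
public:
    explicit system_keyspace(query_processor& qp) : _qp(qp) {}

    // a (made-up) paxos-style method: non-static, so it can use the wrapper
    void save_paxos_promise() {
        execute_cql_with_timeout(
            "UPDATE system.paxos SET promise = ? WHERE row_key = ?",
            std::chrono::milliseconds(100));
    }
};

int main() {
    query_processor qp;
    system_keyspace sys_ks(qp);
    sys_ks.save_paxos_promise();
}
```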
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service::paxos_state methods that call those already have system
keyspace reference at hand and can call method on an object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.
This will allow us to move the `make` step right after starting
`system_keyspace`.
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.
This will allow, in a later commit, making `system_keyspace`
initialization independent of services used for distributed
communication, such as `gossiper`. See further commits for details.
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on have also already been started.
Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
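A shape-only sketch of the dependency change (toy types instead of the
real seastar sharded services):
```
#include <iostream>

struct query_processor { void ping() { std::cout << "qp alive\n"; } };
struct database {};

// before (shape only): sharded<query_processor>& _qp, used as _qp.local()
// after: plain references, valid because both services now start earlier
class system_keyspace {
    query_processor& _qp;
    database& _db;
public:
    system_keyspace(query_processor& qp, database& db) : _qp(qp), _db(db) {
        // no runtime assertion needed: a reference cannot be "not started"
    }
    void query() { _qp.ping(); }     // was: _qp.local().ping()
    database& db() { return _db; }   // was: _db.local()
};

int main() {
    query_processor qp;
    database db;
    system_keyspace sys_ks(qp, db);
    sys_ks.query();
}
```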
It's stored outside of the topology table,
since it's updated not through the regular
topology updates, but with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
The initial fence version is loaded in main
during storage_service initialisation.
This implicit link is pretty bad, because the feature service is a
low-level one which lots of other services depend on. The system
keyspace is the opposite -- a high-level one that needs e.g. the query
processor and the database to operate. This inverse dependency is
created by the feature service's need to commit enabled features' names
into the system keyspace on cluster join. And it uses the qctx thing
for that in a best-effort manner (not doing anything if it's null).
The dependency can be cut. The only place where enabled features are
committed is when the gossiper enables features on join or on receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Although the gossiper already has a system keyspace dependency, it's
better not to overload it with the need to mess with enabling and
persisting features. Instead, the feature_enabler instance is equipped
with the needed dependencies and takes care of it. Eventually the
enabler is also moved to feature_service.cc where it naturally belongs.
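A sketch of the enabler carrying its own dependencies (toy synchronous
types in place of the real seastar services):
```
#include <iostream>
#include <set>
#include <string>

struct system_keyspace {
    void save_local_enabled_features(const std::set<std::string>& fs) {
        std::cout << "persisted " << fs.size() << " feature(s)\n";
    }
};

struct feature_service {
    void enable(const std::set<std::string>& fs) {
        std::cout << "enabled " << fs.size() << " feature(s)\n";
    }
};

// lives in feature_service.cc; carries its own dependencies instead of
// making the low-level feature service depend on the system keyspace
class persistent_feature_enabler {
    feature_service& _feat;
    system_keyspace& _sys_ks;
public:
    persistent_feature_enabler(feature_service& f, system_keyspace& s)
        : _feat(f), _sys_ks(s) {}
    // invoked when gossip enables features on join / on state change;
    // by then sharded<system_keyspace> is up, so persisting is safe
    void enable_features(std::set<std::string> fs) {
        _feat.enable(fs);
        _sys_ks.save_local_enabled_features(fs);
    }
};

int main() {
    feature_service feat;
    system_keyspace sys_ks;
    persistent_feature_enabler enabler(feat, sys_ks);
    enabler.enable_features({"TABLETS"});
}
```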
Fixes: #13837
Closes #13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
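A shape-only sketch of the signature change (toy types, not the real
interfaces):
```
#include <iostream>

struct database { void hit() { std::cout << "db accessed\n"; } };
struct storage_proxy {
    database& _db;
    database& local_db() { return _db; }
};

// before: the method took storage_proxy only to immediately fetch the db
void old_style(storage_proxy& sp) { sp.local_db().hit(); }

// after: accept replica::database directly, so system_keyspace.* no
// longer needs the storage_proxy dependency
void new_style(database& db) { db.hit(); }

int main() {
    database db;
    storage_proxy sp{db};
    old_style(sp);
    new_style(db); // callers pass the database they already have
}
```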
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster which is an alternative to the
current vnode-based replication. The vnode-based replication strategy
tries to evenly distribute the global token space shared by all tables
among nodes and shards. With tablets, the aim is to start from the other
side: divide the resources of a replica shard into tablets, with the
goal of having a fixed target tablet size, and then assign those tablets
to serve fragments of tables (also called tablets). This will allow us
to balance the load in a more flexible manner, by moving individual
tablets around. Also, unlike vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the groundwork for them.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy;
they just alter the way in which data ownership is managed. In theory,
we could use them for other strategies as well, like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
- Replicas for tablet-based tables are chosen from tablet metadata
  instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
That's, in fact, an independent change, because the feature enabler
doesn't need this method. So this patch is a "while at it" thing, but
on the other hand it ditches one more qctx usage.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers now have the system keyspace instance at hand.
Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to the feature service; the system keyspace shouldn't
be aware of any peculiarities of startup feature enabling, only of
loading and saving the feature lists.
For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order not to copy the rvalue consumer arg, instantly convert it
into a value. No other tricks.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There is a `query_mutations` function which loads the entire contents of
a given table into memory. There was no function for e.g. loading just a
single partition in the form of mutations. Introduce one.
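A toy model of the two shapes, with a std::map standing in for the
table and strings for mutations (the real functions are asynchronous
and operate on schema-aware types):
```
#include <iostream>
#include <map>
#include <string>
#include <vector>

// toy model: a "table" maps partition key -> mutations
using mutation = std::string;
using table = std::map<std::string, std::vector<mutation>>;

// existing shape: read the whole table into memory
std::vector<mutation> query_mutations(const table& t) {
    std::vector<mutation> out;
    for (const auto& [pk, ms] : t) {
        out.insert(out.end(), ms.begin(), ms.end());
    }
    return out;
}

// the newly introduced shape: just one partition's worth of mutations
std::vector<mutation> query_mutations(const table& t, const std::string& pk) {
    auto it = t.find(pk);
    return it == t.end() ? std::vector<mutation>{} : it->second;
}

int main() {
    table t{{"local", {"m1", "m2"}}, {"peer", {"m3"}}};
    std::cout << query_mutations(t).size() << " total, "
              << query_mutations(t, "local").size() << " in 'local'\n";
}
```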
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description
tables so CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living
in the `system_distributed` keyspace) so it requires `token_metadata`
to be nonempty. We need a hack for the case of bootstrapping the
first node in the cluster: we turn the tokens into normal tokens
earlier in the procedure in `token_metadata`. This is fine for the
single-node case since no streaming is happening.
The `CDC_GENERATIONS_V3` table schema is a copy-paste of the
`CDC_GENERATIONS_V2` schema. The difference is that V2 lives in
`system_distributed_keyspace` and writes to it are distributed using
regular `storage_proxy` replication mechanisms based on the token ring.
The V3 table lives in `system_keyspace` and any mutations written to it
will go through group 0.
Also extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`, it stores UUID of a newly introduced CDC
generation which is used as partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
Add new columns to the `system.topology` table: `shard_count` and
`ignore_msb`. When a node bootstraps or restarts and observes that the
values stored in `topology` are different than the local values, it
updates them. This is done in the `update_topology_with_local_metadata`
function (the 'metadata' here being the two values).
An additional flag persisted in `system.scylla_local` is used to safely
avoid performing read barriers when the values didn't change across a
node restart. A comment in `update_topology_with_local_metadata` explains
why this flag is needed.
An example use case where `shard_count` and `ignore_msb` are needed is
creating CDC generations.
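A toy model of the decision the flag enables; the barrier/flag protocol
is deliberately simplified here (the real reasoning lives in the code
comment the message refers to):
```
#include <iostream>

struct local_metadata { int shard_count; int ignore_msb; };
bool operator==(local_metadata a, local_metadata b) {
    return a.shard_count == b.shard_count && a.ignore_msb == b.ignore_msb;
}

// stand-ins for persisted state
local_metadata stored{8, 12};     // values in system.topology
bool synced_flag = true;          // flag in system.scylla_local

void group0_read_barrier() { std::cout << "read barrier\n"; }

void update_topology_with_local_metadata(local_metadata local) {
    if (synced_flag && local == stored) {
        return; // nothing changed since last sync: skip the read barrier
    }
    group0_read_barrier();   // make sure we see the latest topology state
    stored = local;          // group0 write of shard_count / ignore_msb
    synced_flag = true;      // persist the flag in system.scylla_local
}

int main() {
    update_topology_with_local_metadata({8, 12});  // no-op path
    update_topology_with_local_metadata({16, 12}); // e.g. shards changed
    std::cout << "stored shard_count=" << stored.shard_count << "\n";
}
```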
Fixes: #13508
The schema is
```
CREATE TABLE system.sstables (
    location text,
    generation bigint,
    format text,
    status text,
    uuid uuid,
    version text,
    PRIMARY KEY (location, generation)
)
```
A sample entry looks like:
```
 location                                                            | generation | format | status | uuid                                 | version
---------------------------------------------------------------------+------------+--------+--------+--------------------------------------+---------
 /data/object_storage_ks/test_table-d096a1e0ad3811ed85b539b6b0998182 |          2 | big    | sealed | d0a743b0-ad38-11ed-85b5-39b6b0998182 | me
```
The uuid field points to the "folder" on the storage where the sstable
components are. Like this:
```
s3
`- test_bucket
   `- f7548f00-a64d-11ed-865a-0c1fbc116bb3
      `- Data.db
       - Index.db
       - Filter.db
       - ...
```
It's not very nice that the whole /var/lib/... path is in fact used as
the location; it needs PR #12707 to fix this place.
Also, the "status" part is not yet fully functional, it only supports
three options:
- creating -- the same as a TemporaryTOC file existing on disk
- sealed -- the default state
- deleting -- the analogy of the deletion log on disk
The latter needs support from the distributed_loader, which is not yet
there. In fact, the distributed_loader also needs to be patched to
actually select entries from this table on load. Also it needs the
mentioned PR #12707 to support staging and quarantine sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter is the only user of the class. This keeps system keyspace
code free from unrelated logic and from raft::server_id type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to keep system keyspace free from group0 logic and from the
service::group0_upgrade_state type
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>