* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
main: allow joining raft group0 before waiting for gossiper to settle
service: raft_group0: make `join_group0` re-entrant
service: storage_service: add `join_group0` method
raft_group_registry: update gossiper state only on shard 0
raft: don't update gossiper state if raft is enabled early or not enabled at all
gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
db: system_keyspace: add `bootstrap_needed()` method
db: system_keyspace: mark getter methods for bootstrap state as "const"
The feature represents the ability to store storage options
in keyspace metadata, as a map of options,
e.g. storage type, bucket, authentication details, etc.
Move the listener from feature service to the `raft_group_registry`.
Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Also, change the signature of `support()` method to return
`future<>` since it's now a coroutine. Adjust existing call sites.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
gc_grace_seconds is a fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if a cluster-wide repair is
not performed within gc_grace_seconds. This design pushes the job of
keeping the database consistent onto the user. In practice, it is very
hard to guarantee that repair finishes within gc_grace_seconds all the
time. For example, the repair workload has the lowest priority in the
system and can be slowed down by higher-priority workloads, so there is
no guarantee when a repair will finish. A gc_grace_seconds value that
used to work might stop working after the data volume in a cluster
grows. Users might also want to avoid running repair during a specific
period where latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If the user specifies no tombstone_gc
option, the old gc_grace_seconds-based GC is used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option, 'propagation_delay_in_seconds',
is added. It defines the maximum time a write can be delayed before it
eventually arrives at a node.
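The purge decision implied by the four modes can be sketched as follows. This is a minimal illustration only; the names, signature, and structure are hypothetical, not the actual Scylla implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sketch of the per-mode purge decision (not the Scylla code).
enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

// Returns true if a tombstone written at `deletion_time` may be purged at `now`.
// `range_repaired_at` is the last time the token range covering the tombstone
// was repaired, if any (all times in seconds).
bool can_purge(tombstone_gc_mode mode, int64_t deletion_time, int64_t now,
               int64_t gc_grace_seconds,
               std::optional<int64_t> range_repaired_at) {
    switch (mode) {
    case tombstone_gc_mode::timeout:
        // Classic behavior: wait out gc_grace_seconds.
        return now >= deletion_time + gc_grace_seconds;
    case tombstone_gc_mode::disabled:
        return false;
    case tombstone_gc_mode::immediate:
        return true;
    case tombstone_gc_mode::repair:
        // Purge only once the covering range was repaired after the
        // tombstone was written, so deleted data cannot be resurrected.
        return range_repaired_at && *range_repaired_at > deletion_time;
    }
    return false;
}
```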
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes #3560
[avi: resolve conflicts vs data_dictionary]
The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features.
These features provide a way to organize the automatic
switch to raft-based cluster management.
The scheme is as follows:
1. Every new node declares support for raft-based cluster ops.
2. No node in the cluster can actually use raft for cluster
management until the `SUPPORTS*` feature is enabled
(i.e. understood by every node in the cluster).
3. After the first, `SUPPORTS*`, feature is enabled, the nodes
can declare support for the second, `USES*`, feature, which
means that a node can actually switch to raft-based cluster
ops.
The scheme ensures that even if some nodes are down while
transitioning to the new bootstrap mechanism, they can easily
switch to the new procedure without risking disruption of the
cluster.
The features are not actually wired to anything yet,
providing a framework for the integration with `raft_group0`
code, which is the subject of a follow-up series.
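The two-phase switch described above can be sketched in miniature. This is an illustrative model only; the helper names and the flat "feature set per node" representation are assumptions, not the Scylla code:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using feature_set = std::set<std::string>;

const std::string SUPPORTS = "SUPPORTS_RAFT_CLUSTER_MANAGEMENT";
const std::string USES = "USES_RAFT_CLUSTER_MANAGEMENT";

// A feature is cluster-enabled once every node advertises it.
bool cluster_enabled(const std::vector<feature_set>& nodes, const std::string& f) {
    return std::all_of(nodes.begin(), nodes.end(),
                       [&](const feature_set& s) { return s.count(f) > 0; });
}

// Phase two: a node advertises USES_* only after it has observed that
// SUPPORTS_* is enabled cluster-wide, i.e. every node understands raft ops.
void maybe_advertise_uses(std::vector<feature_set>& nodes) {
    if (!cluster_enabled(nodes, SUPPORTS)) {
        return; // some node (down or old) has not declared support yet
    }
    for (auto& s : nodes) {
        s.insert(USES);
    }
}
```

A single node that has not yet declared `SUPPORTS*` (e.g. one that was down during the upgrade) blocks the switch, which is exactly the safety property described above.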
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>
This new feature will be used to determine whether the whole cluster
is ready to use the additional page_size field in max_result_size.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will be used for re-enabling previously enabled cluster
features, which will be introduced in later patches.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Save each feature enabled through the feature_service
instance in the `system.scylla_local` under the
'enabled_features' key.
The features are persisted only if the underlying
query context used by `db::system_keyspace` is initialized.
Since the `system.scylla_local` table is essentially a
string->string map, use an ad-hoc method for serializing the
enabled features set: the same one the gossiper uses for
translating the supported features set via gossip.
The entry should be saved before we enable the feature so
that crash-after-enable is safe.
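A minimal sketch of such a string encoding, assuming a gossip-style comma-joined list of feature names (the helper names here are hypothetical, not the actual Scylla utilities):

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Join the enabled-features set into one comma-separated string, suitable
// for a string->string table such as system.scylla_local.
std::string serialize_features(const std::set<std::string>& features) {
    std::ostringstream out;
    bool first = true;
    for (const auto& f : features) {
        if (!first) {
            out << ',';
        }
        out << f;
        first = false;
    }
    return out.str();
}

// Inverse operation, used when re-enabling persisted features on startup.
std::set<std::string> deserialize_features(const std::string& s) {
    std::set<std::string> out;
    std::istringstream in(s);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (!item.empty()) {
            out.insert(item);
        }
    }
    return out;
}
```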
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This utility will also be used for de-serialization of
persisted enabled features, which will be introduced in a
later patch.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or
inquire about the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.
Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.
Because this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.
The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
These features have been around for over 2 years and every reasonable
deployment should have them enabled.
The only case where those features might not be enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When a node joins this feature (which it does immediately when upgrading
to a version that has this commit), it says: "I understand the new
generation storage format and the new identifier format". Thus, when the
feature becomes enabled - after all nodes have joined it - it means that
it's safe to create new generations using these new storage/ID formats.
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate page
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes. The legacy
computation was incorrect: it produced values whose unsigned comparison
(CQL bytes-type comparison) yields a different ordering than the signed
comparison of tokens. See issue:
https://github.com/scylladb/scylla/issues/7443
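The ordering mismatch can be demonstrated on a signed 64-bit token. The sketch below is illustrative only: whether the actual token_column_computation uses a sign-bit flip is an assumption, but the technique shows how a byte encoding can be made to sort consistently under unsigned byte-wise comparison:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Plain big-endian encoding of a signed token. Comparing these bytes
// unsigned (as CQL bytes comparison does) orders negative tokens AFTER
// positive ones -- the bug described above.
std::array<uint8_t, 8> encode_raw(int64_t token) {
    std::array<uint8_t, 8> b{};
    uint64_t u = static_cast<uint64_t>(token);
    for (int i = 7; i >= 0; --i) {
        b[i] = static_cast<uint8_t>(u & 0xff);
        u >>= 8;
    }
    return b;
}

// Flipping the sign bit before encoding makes byte-wise unsigned
// comparison agree with signed integer comparison of the tokens.
std::array<uint8_t, 8> encode_order_preserving(int64_t token) {
    return encode_raw(token ^ INT64_MIN);
}
```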
Checks for features introduced over 2 years ago were removed
in previous commits, so all that is left is removing the feature
bits themselves. Note that the feature strings are still sent
to other nodes just to be doubly sure, but the new code assumes
that all these features are implicitly enabled.
Nowadays the knowledge about known/supported features is
scattered between feature_service and storage_service. The
latter uses knowledge about the selected _sstables_format
to alter the "supported" set.
Encapsulate this knowledge inside the feature_service with
the help of "masked_features" -- those that shouldn't be
advertised to other nodes. The only maskable feature
today is UNBOUNDED_RANGE_TOMBSTONES. Nowadays it's
reported as supported only if the sstables format is MC.
With this patch it starts masked and gets unmasked once
the sstables format is selected to be MC, so behavior is
unchanged.
This will make it possible to move sstables_format from
storage service to anywhere else.
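The masked-set idea can be sketched as "advertised = known minus masked". The names below are hypothetical, not the actual feature_service API:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch: features start masked and can be unmasked later,
// e.g. once the sstables format is known.
class feature_service_sketch {
    std::set<std::string> _known;   // all features this build understands
    std::set<std::string> _masked;  // features not to advertise (yet)
public:
    feature_service_sketch(std::set<std::string> known, std::set<std::string> masked)
        : _known(std::move(known)), _masked(std::move(masked)) {}

    void unmask(const std::string& f) {
        _masked.erase(f);
    }

    // The set advertised to other nodes: known minus masked.
    std::set<std::string> supported() const {
        std::set<std::string> out;
        std::set_difference(_known.begin(), _known.end(),
                            _masked.begin(), _masked.end(),
                            std::inserter(out, out.begin()));
        return out;
    }
};
```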
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set of bool enable_something-s on feature_config duplicates
the disabled_features set on it, so remove the former and make
full use of the latter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This feature will ensure that caching can be switched
off per table only after the whole cluster supports it.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.
Remove the check for LWT from Alternator.
Note that in order for the cluster to work with LWT, all nodes need
to support it.
Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.
Changes in v2:
* remove enable_lwt feature flag, it's always there
Closes #6102
test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
The previous patch made the LA format the default. We no longer need to
choose between writing the older KA format or LA, so the LA_SSTABLE
cluster feature has become unnecessary.
Unfortunately, we cannot completely remove this feature: since commit
4f3ce42163 we cannot remove cluster features,
because a node will refuse to join a cluster which already agreed on
features that it lacks - thinking it is an old node trying to join a
new cluster.
So the LA_SSTABLE feature flag remains, and we continue to advertise
that our node supports it. We just no longer care about what other
nodes advertised for it, so we can remove a bit of code that cared.
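The join check referenced above can be sketched roughly as follows. This is an illustrative model only; the function name and error text are assumptions, not the code from commit 4f3ce42163:

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Refuse to start if the cluster has already enabled a feature this node
// does not know -- removing a feature bit from a new build would therefore
// make it look like an old node and lock it out of upgraded clusters.
void check_can_join(const std::set<std::string>& cluster_enabled,
                    const std::set<std::string>& locally_known) {
    for (const auto& f : cluster_enabled) {
        if (!locally_known.count(f)) {
            throw std::runtime_error(
                "Feature '" + f + "' was enabled by the cluster "
                "but is unknown to this node");
        }
    }
}
```

This is why LA_SSTABLE must stay advertised forever even though no code checks it anymore.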
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200324232607.4215-3-nyh@scylladb.com>
"
This PR makes it possible to enable the usage of different partitioner for each table. If no table-specific partitioner is set for a given table then a default partitioner is used.
The PR is composed of the following parts:
- Introduction of schema::get_partitioner that still returns dht::global_partitioner
- Replacement of all the usage of dht::global_partitioner with schema::get_partitioner
- Making it possible to set table-specific partitioner in a schema_builder
- Remove all the places that were setting default partitioner except for main.cc (mostly tests)
- Move default partitioner from i_partitioner to schema.cc and hide it from the rest of the codebase
- Remove dht::global_partitioner
After this PR there's no such thing as global partitioner at all. There is only a default partitioner but it still has to be accessed through schema::get_partitioner.
There are some intermediate states in which i_partitioner is stored as shared_ptr in the schema but the final version keeps it by const&.
The PR does not enable the per-table partitioner end-to-end. Only the internals of a single node are covered. I still have to deal with:
- Making sure a table has the same partitioner on each node
- Allowing the user to set up a table-specific partitioner on a table
- Signaling the driver about which partitioner is used by a given table
- Persisting partitioner info for each table that does not use the default partitioner.
Fixes#5493
Tests: unit(dev, release, debug), dtest(byo)
"
* 'per_table_partitioner' of https://github.com/haaawk/scylla:
schema: drop optional from _partitioner field
make_multishard_combining_reader: stop taking partitioner
split_range_to_single_shard: stop taking partitioner as argument
tests: remove unused murmur3 includes
partitioner: move default_partitioner to schema.cc
partitioner: hide dht::default_partitioner
schema: include partitioner name in scylla tables mutation
schema: make it possible to set custom partitioner
scylla_tables: add partitioner column
schema_features: add PER_TABLE_PARTITIONERS feature
features: add PER_TABLE_PARTITIONERS feature
This new feature is required because we now allow
setting a partitioner per table. This influences
the digest of a table's schema, so we must not include
the partitioner name in the digest unless we know that
the whole cluster already supports per-table partitioners.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
In test_services.cc there is
gms::feature_service test_feature_service;
And the feature_service constructor has
, _lwt_feature(*this, features::LWT)
But features::LWT is a global sstring constructed in another file,
so the test's feature_service may be constructed before it (the
static initialization order problem).
Solve the problem by making the feature strings constexpr
std::string_view.
I found the issue while trying to benchmark the std::string switch.
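A minimal sketch of the fix: a constexpr std::string_view is initialized at compile time, so reading it from a constructor in another translation unit is safe regardless of dynamic-initialization order, unlike a dynamically initialized global sstring. The surrounding types here are illustrative, not the Scylla classes:

```cpp
#include <cassert>
#include <string_view>

namespace features {
    // Constant-initialized: usable from any translation unit, even during
    // dynamic initialization of other globals.
    constexpr std::string_view LWT = "LWT";
}

struct feature {
    std::string_view name;
    explicit feature(std::string_view n) : name(n) {}
};

// Safe even if this object lived in a different translation unit than
// features::LWT; with a global sstring the read could precede construction.
feature lwt_feature{features::LWT};
```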
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Acked-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200309225749.36661-1-espindola@scylladb.com>
The storage service no longer needs to mess with feature
config. It only needs two features to register itself on,
but this can be solved by the respective cluster_supports_foo
helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now features are registered in a map of vectors, but
the vector is always one item long and is used to keep a
pointer to the feature, instead of the feature
itself.
Switch it to a map of reference_wrapper-s.
Before this patch we could register more than one
feature under the same name, now we can't. But this
seems to be OK, as we don't actually do this. To catch
violations of this restriction there's an assert() in the
feature_service::register_feature.
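A sketch of the resulting registry shape: one reference_wrapper per name, with an assert() rejecting duplicate registration. Names are hypothetical, not the actual feature_service internals:

```cpp
#include <cassert>
#include <functional>
#include <string_view>
#include <unordered_map>

struct feature {
    std::string_view name;
};

class feature_registry {
    // One feature per name; reference_wrapper avoids owning or copying.
    std::unordered_map<std::string_view, std::reference_wrapper<feature>> _map;
public:
    void register_feature(feature& f) {
        auto [it, inserted] = _map.emplace(f.name, std::ref(f));
        (void)it;
        // Registering two features under the same name is now a bug.
        assert(inserted && "feature registered twice under the same name");
    }

    feature& get(std::string_view name) {
        return _map.at(name);
    }
};
```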
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two masks -- supported and known. They differ only in the
unbounded_range_tombstones bit, which is set depending on the
sstables format in use.
Since the feature_service doesn't know anything about the sstables
format, the logic is inverted -- the feature service reports
back the known mask (all features), and storage_service clears
unbounded_range_tombstones if the sstables format is old -- otherwise
the mask is (hopefully) left intact.
And leave some temporary storage_service->feature links. The plan
is to make every subsystem that needs storage_service for features
stop doing so and switch to the feature_service.
The feature_service is a service without any dependencies; it will be
freed last, thus making the service dependency tree an actual tree,
not a graph with loops.
While at it -- make all const-s drop the _FEATURE suffix (currently
both naming styles exist) and const-qualify cluster_supports_lwt().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some features take db::config to find out whether to be enabled
or disabled. This creates unwanted dependency between database and
features, so split the features configuration explicitly. Also
this will make the "this is for testing env only" logic cleaner
and simpler to understand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untying features from gossip.