Commit Graph

750 Commits

Author SHA1 Message Date
Pavel Emelyanov
271ceb57b9 gossiper: Keep immutable options on gossip_config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Benny Halevy
132c9d5933 main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
2022-02-27 16:26:48 +02:00
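The error-classification policy described in this commit can be sketched roughly as follows. This is an illustrative toy, not Scylla's actual code: the function names and the specific errno list are assumptions; the commit itself does not enumerate which errors it treats as environmental.

```cpp
#include <cerrno>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <system_error>

// Hedged sketch of the policy above: "environmental" system errors exit
// quietly via _Exit(255) with no core dump; anything else calls abort()
// explicitly instead of letting ~deferred_action reach terminate().
bool is_environmental_error(const std::exception& e) {
    if (auto* se = dynamic_cast<const std::system_error*>(&e)) {
        switch (se->code().value()) {
        case ENOSPC:  // no space left on device
        case EIO:     // low-level I/O error
        case EMFILE:  // too many open files
            return true;
        }
    }
    return false;
}

[[noreturn]] void on_shutdown_error(const std::exception& e) noexcept {
    if (is_environmental_error(e)) {
        std::_Exit(255);  // early exit, no coredump left behind
    }
    std::abort();         // explicit abort rather than implicit terminate()
}
```

The point of `_Exit` (rather than `exit`) is that it terminates immediately without running any further cleanup that could itself fail.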
Pavel Emelyanov
de6c60c1c9 client_data: Sanitize connection_notifier
Now that the connection_notifier is all gone, only the client_data bits are left.
To keep things consistent -- rename the files.

Also, while at it, brush up the header dependencies and remove the not
really used constexprs for client states.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
971c431a23 code: Remove old on-disk version of system.clients table
This includes most of the connection_notifier stuff, as well as
the auxiliary code from system_keyspace.cc and a bunch of
update calls from the client-state-change paths.

Besides less code and fewer disk updates on the client connection
paths, this removes one usage of the nasty global qctx thing.

Since system.clients goes away, rename system.clients_v here
too so the table is always present out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Michael Livshin
3fef604075 sstables_manager: add get_local_host_id() method and support
Since the ME sstable format includes the originating host id in stats
metadata, the local host id needs to be made available for writing and
validation.

Both the Scylla server (where the local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accommodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
7d2af177eb system_keyspace, main: load (or create) local host id earlier
We want it to be cached before any sstable is written, so do it right
after system_keyspace::minimal_setup().

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Botond Dénes
ef34c10a94 main: run scylla main when there are no arguments
main() has some logic to select the main function it will delegate to
based on argv[1]. The intent is that when the value of argv[1] suggests
that the user did not specify a specific app to run, we default to
"server" (scylla proper).
This logic currently breaks down when there are no arguments at all: in
this case the following error is printed and scylla refuses to start:

    error: unrecognized first argument: expected it to be "server", a regular command-line argument or a valid tool name (see `scylla --list-tools`), but got

Fix this by checking for empty argv[1] and defaulting to "server" in
that case.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220210092125.293682-1-bdenes@scylladb.com>
2022-02-10 11:47:20 +02:00
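Taken together with the earlier argv[1]-handling commits below, the dispatch rule can be sketched like this. The function name and the exact option-handling details are illustrative assumptions, not Scylla's actual code:

```cpp
#include <cstring>
#include <string>

// Illustrative sketch of the argv[1] dispatch described above: an absent,
// empty, option-like ("-..."), or explicit "server" first argument selects
// the scylla server main; any other value is treated as a tool name.
std::string select_app(int argc, const char* const* argv) {
    if (argc < 2 || argv[1][0] == '\0' || argv[1][0] == '-'
            || std::strcmp(argv[1], "server") == 0) {
        return "server";
    }
    return argv[1];  // candidate tool name; unknown names are an error
}
```

The fix in this commit corresponds to the `argc < 2` and empty-string branches: with no arguments at all, the result must still be "server".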
Michał Sala
66a93d3000 cql3: query_processor: add forward_service reference to query_processor 2022-02-01 21:14:41 +01:00
Michał Sala
a6cf3f52bd service: introduce forward_service
The new service is responsible for:
* spreading forward_request execution across multiple nodes in the cluster
* collecting forward_request execution results and merging them

The `forward_service::dispatch` method takes a forward_request as an
argument and forwards its execution to a group of other nodes (using the rpc
verb added in previous commits). Each node in the group chosen by the
dispatch method is provided with a forward_request that is no different
from the original argument except for changed partition ranges. They are
changed so that the vnodes contained in them are owned by the recipient node.

Executing a forward_request is realized in the `forward_service::execute`
method, which is registered to be called on FORWARD_REQUEST verb receipt.
The process of executing a forward_request consists of mocking a few
non-serializable objects (such as `cql3::selection`) in order to create a
`service::pager::query_pagers::pager` and a `cql3::selection::result_set_builder`.
After the pager and result_set_builder are created, the execution process
resembles what might be seen in select_statement's execution path.
2022-02-01 21:14:41 +01:00
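The range-splitting step of dispatch can be pictured with a very rough sketch. All types and names here are hypothetical stand-ins for the real Scylla machinery (token metadata, vnode ownership, etc.), used only to show the grouping idea:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model of the dispatch step: group a request's partition ranges by
// the node owning them, so each recipient gets a request restricted to
// its own vnodes. `owner_fn` stands in for real ownership lookup.
struct token_range { long begin; long end; };
using owner_fn = std::string (*)(const token_range&);

std::map<std::string, std::vector<token_range>>
split_ranges_by_owner(const std::vector<token_range>& ranges, owner_fn owner_of) {
    std::map<std::string, std::vector<token_range>> per_node;
    for (const auto& r : ranges) {
        per_node[owner_of(r)].push_back(r);  // each node sees only its ranges
    }
    return per_node;
}
```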
Nadav Har'El
4aa9e86924 Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity
Alternator is a coordinator-side service and so should not access
the replica module. In this series, all but one of the uses of the replica
module are replaced with data_dictionary.

One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.

The data_dictionary module is expanded with missing accessors.

Closes #9945

* github.com:scylladb/scylla:
  alternator: switch to data_dictionary for table listing purposes
  data_dictionary: add get_tables()
  data_dictionary: introduce keyspace::is_internal()
2022-01-19 11:34:25 +02:00
Avi Kivity
7399f3fae7 alternator: switch to data_dictionary for table listing purposes
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.

One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
2022-01-19 11:03:36 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
4392c20bd3 replica: move distributed_loader into replica module
distributed_loader is a replica-side thing, so it belongs in the
replica module ("distributed" refers to its ability to load
sstables in their correct shards). So move it to the replica
module.
2022-01-10 15:25:28 +02:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Botond Dénes
a37b4bbbaf main: improve handling of non-matching argv[1]
Be silent when argv[1] starts with "-": it is probably an option to
scylla (and "server" is missing from the command line).
Print an error and stop when argv[1] doesn't start with "-", as the
user presumably meant to start either the server or a tool and mistyped
it. Instead of trying to guess what they meant, stop with a clear error
message.
2022-01-06 06:59:59 +02:00
Botond Dénes
fe0bfa1d7b main: move tool listing to --list-tools
And make it the central place listing available tools (to minimize the
places to update when adding a new one). The description is edited to
point to this command instead of listing the tools themselves.
2022-01-06 06:58:44 +02:00
Botond Dénes
ab0e39503b main: rephrase app description
Remove "compatible with Apache Cassandra"; scylla is much more than that
already.

Rephrase the part describing the included tools such that it is clear
that the scylla server is the main thing and the tools are the "extra"
additions. Also use the term "tool" instead of the term "app".
2022-01-06 06:37:32 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistent to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that
used to work might not work after the data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds-based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly be delayed before
it eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
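The four tombstone_gc modes above can be captured in a small decision function. This is a hedged toy, not Scylla's actual logic: the real implementation also consults repair history and propagation_delay_in_seconds, and the `repaired_until` input here is a hypothetical simplification:

```cpp
#include <cstdint>

// Toy model of how the four tombstone_gc modes could gate tombstone purging.
enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

// All times are seconds since epoch. `gc_before` plays the role of
// now - gc_grace_seconds; `repaired_until` is the (hypothetical) time up
// to which the tombstone's range is known to be repaired.
bool can_purge_tombstone(tombstone_gc_mode mode, int64_t deletion_time,
                         int64_t gc_before, int64_t repaired_until) {
    switch (mode) {
    case tombstone_gc_mode::timeout:   return deletion_time < gc_before;
    case tombstone_gc_mode::disabled:  return false;   // never GC
    case tombstone_gc_mode::immediate: return true;    // GC right away
    case tombstone_gc_mode::repair:    return deletion_time < repaired_until;
    }
    return false;
}
```

The key property of 'repair' mode is visible here: purging depends on repair progress over the tombstone's range, not on wall-clock grace time.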
Avi Kivity
5eccb42846 Merge "Host tool executables in the scylla main executable" from Botond
"
A big problem with scylla tool executables is that they include the
entire scylla codebase and thus they are just as big as the scylla
executable itself, making them impractical to deploy on production
machines. We could try to combat this by selectively including only the
actually needed dependencies but even ignoring the huge churn of
sorting out our dependency hell (which we should do at one point anyway),
some tools may genuinely depend on most of the scylla codebase.

A better solution is to host the tool executables in the scylla
executable itself, with some way of switching to the actual main function
to run. The tools themselves don't contain a lot of code so
this won't cause any considerable bloat in the size of the scylla
executable itself.
This series does exactly this, folds all the tool executables into the
scylla one, with main() switching between the actual main it will
delegate to based on an argv[1] command line argument. If this is a known
tool name, the respective tool's main will be invoked.
If it is "server", missing or unrecognized, the scylla main is invoked.

Originally this series used argv[0] as the means to switch between the
mains to run. This approach was abandoned for the approach mentioned above
for the following reasons:
* No launcher script, hard link, soft link or similar games are needed to
  launch a specific tool.
* No packaging needed, all tools are automatically deployed.
* Explicit tool selection, no surprises after renaming scylla to
  something else.
* Tools are discoverable via scylla's description.
* Follows the trend set by modern command line multi-command or multi-app
  programs, like git.

Fixes: #7801

Tests: unit(dev)
"

* 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla:
  main,tools,configure.py: fold tools into scylla exec
  tools: prepare for inclusion in scylla's main
  main: add skeleton switching code on argv[1]
  main: extract scylla specific code into scylla_main()
2022-01-04 17:55:07 +02:00
Pavel Solodovnikov
83862d9871 db: save supported features after passing gossip feature check
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.

This makes the `system.local#supported_features` column store the
advertised feature set. Leave a comment in the definition of
`system.local` schema to reflect that.

Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.

Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-23 12:48:37 +03:00
Botond Dénes
bb0874b28b main,tools,configure.py: fold tools into scylla exec
The infrastructure is now in place. Remove the proxy main of the tools,
and add appropriate `else if` statements to the executable switch in
main.cc. Also remove the tool applications from the `apps` list and add
their respective sources as dependencies to the main scylla executable.
With this, we now have all tool executables living inside the scylla
main one.
2021-12-20 18:27:25 +02:00
Botond Dénes
972d789a27 main: add skeleton switching code on argv[1]
To prepare for the scylla executable hosting more than one app,
switch between them using argv[1]. This is consistent with how most
modern multi-app/multi-command programs work, one prominent example
being git.
For now only one app is present: scylla itself, called "server". If
argv[1] is missing or unrecognized, this is what is used as the default
for backward-compatibility.

The scylla app also gets a description, which explains that scylla hosts
multiple apps and lists all the available ones.
2021-12-20 18:26:38 +02:00
Botond Dénes
1a4ca831a4 main: extract scylla specific code into scylla_main()
main() now contains only generic setup and teardown code and it
delegates to scylla_main().
In the next patches we want to wire in tool executables into the scylla
one. This will require selecting the main to run at runtime.
scylla_main() will be just one of those (the default).
2021-12-20 17:31:46 +02:00
Avi Kivity
a97731a7e5 migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference
A static helper also gained a storage_proxy parameter.
2021-12-16 21:05:47 +02:00
Avi Kivity
d768e9fac5 cql3, related: switch to data_dictionary
Stop using database (and including database.hh) for schema related
purposes and use data_dictionary instead.

data_dictionary::database::real_database() is called from several
places, for these reasons:

 - calling yet-to-be-converted code
 - callers with a legitimate need to access data (e.g. system_keyspace)
   but with the ::database accessor removed from query_processor.
   We'll need to find another way to supply system_keyspace with
   data access.
 - to gain access to the wasm engine for testing whether user-defined
   functions compile. We'll have to find another way to
   do this as well.

The change is a straightforward replacement. One case in
modification_statement had to change a capture, but everything else
was just a search-and-replace.

Some files that lost "database.hh" gained "mutation.hh", which they
previously had access to through "database.hh".
2021-12-15 13:54:23 +02:00
Gleb Natapov
e9fafea5c1 migration_manager: pass raft_gr to the migration manager
The migration manager will use raft group zero to distribute schema
changes.
2021-12-11 12:31:07 +02:00
Avi Kivity
f28552016f Update seastar submodule
* seastar f8a038a0a2...8d15e8e67a (21):
  > core/program_options: preserve defaultness of CLI arguments
  > log: Silence logger when logging
  > Include the core/loop.hh header inside when_all.hh header
  > http: Fix deprecated wrappers
  > foreign_ptr: Add concept
  > util: file: add read_entire_file
  > short_streams: move to util
  > Revert "Merge: file: util: add read_entire_file utilities"
  > foreign_ptr: declare destroy as a static method
  > Merge: file: util: add read_entire_file utilities
  > Merge "output_stream: handle close failure" from Benny
  > net: bring local_address() to seastar::connected_socket.
  > Merge "Allow programatically configuring seastar" from Botond
  > Merge 'core: clean up memory metric definitions' from John Spray
  > Add PopOS to debian list in install-dependencies.sh
  > Merge "make shared_mutex functions exception safe and noexcept" from Benny
  > on_internal_error: set_abort_on_internal_error: return current state
  > Implementation of iterator-range version of when_any
  > net: mark functions returning ethernet_address noexcept
  > net: ethernet_address: mark functions noexcept
  > shared_mutex: mark wake and unlock methods noexcept

Contains patch from Botond Dénes <bdenes@scylladb.com>:

db/config: configure logging based on app_template::seastar_options

Scylla has its own config file which supports configuring aspects of
logging, in addition to the built-in CLI logging options. When applying
this configuration, the CLI provided option values have priority over
the ones coming from the option file. To implement this scylla currently
reads CLI options belonging to seastar from the boost program options
variable map. The internal representation of CLI options, however, does not
constitute an API of seastar and is thus subject to change (even if
unlikely). This patch moves away from this practice and uses the shiny
new C++ API, `app_template::seastar_options`, to obtain the current
logging options.
2021-12-08 14:21:11 +02:00
Pavel Emelyanov
2d8272dc03 thrift: Keep sharded proxy reference on thrift_handler
Carried via main -> controller -> server -> factory -> handler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-03 17:48:19 +03:00
Pavel Emelyanov
e4f35e2139 migration_manager: Eliminate storage service from passive announcing
Currently storage service acts as a glue between database schema value
and the migration manager "passive_announce" call. This interposing is
not required; the migration manager can do all the management itself, and
the linkage can be done in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
d4d0bd147e migration_manager: Subscribe on gossiper events
This is to start schema pulls upon on_join, on_alive and on_change events
in the next patch. The migration manager already has a gossiper reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Nadav Har'El
605a2de398 config: change default prometheus_address handling, again
In the very recent commit 3c0e703 fixing issue #8757, we changed the
default prometheus_address setting in scylla.yaml to "localhost", to
match the default listen_address in the same file. We explained in that
commit how this helped developers who use an unchanged scylla.yaml, and
how it didn't hurt pre-existing users who already had their own scylla.yaml.

However, it was quickly noted by Tzach and Amnon that there is one use case
that was hurt by that fix:

Our existing documentation, such as the installation guide
https://www.scylladb.com/download/?platform=centos asks the user to take
our initial scylla.yaml, and modify listen_address, rpc_address, seeds,
and cluster_name - and that's it. That document - and others - don't
tell the user to also override prometheus_address, so users will likely
forget to do so - and monitoring will not work for them.

So this patch includes a different solution to #8757.
What it does is:
1. The setting of prometheus_address in scylla.yaml is commented out.
2. In config.cc, prometheus_address defaults to empty.
3. In main.cc, if prometheus_address is empty (i.e., was not explicitly
   set by the user), the value of listen_address is used instead.

In other words, the idea is that prometheus_address, if not explicitly set
by the user, should default to listen_address - which is the address used
to listen to the internal Scylla inter-node protocol.

Because the documentation already tells the user to set listen_address
and to not leave it set to localhost, setting it will also open up
prometheus, thereby solving #9701. Meanwhile, developers who leave the
default listen_address=localhost will also get prometheus_address=localhost,
so the original #8757 is solved as well. Finally, for users who had an old
scylla.yaml where prometheus_address was explicitly set to something,
this setting will continue to be used. This was also a requirement of
issue #8757.

Fixes #9701.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211129155201.1000893-1-nyh@scylladb.com>
2021-12-02 19:43:30 +02:00
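The fallback rule in steps 2 and 3 of this commit amounts to a one-line resolution function. A minimal sketch, with a hypothetical function name:

```cpp
#include <string>

// Sketch of the rule above: an empty (i.e. not explicitly set)
// prometheus_address resolves to listen_address; an explicit value,
// including one from an old scylla.yaml, wins.
std::string effective_prometheus_address(const std::string& prometheus_address,
                                         const std::string& listen_address) {
    return prometheus_address.empty() ? listen_address : prometheus_address;
}
```

This single rule satisfies both issues: developers with listen_address=localhost get prometheus on localhost (#8757), and users who only set listen_address per the docs get prometheus opened on it too (#9701).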
Avi Kivity
078f69c133 Merge "raft: (service) implement group 0 as a service" from Kostja
"
To ensure consistency of schema and topology changes,
Scylla needs linearizable storage for this data,
available at every member of the database cluster.

The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.

The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0), adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership, and ensures the group is started when the
server boots.

The state machine for the group, i.e. the actual storage
for cluster-wide information, still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.

Another Raft related service was implemented earlier: Raft Group
Registry. The purpose of the registry is to allow Scylla to have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"

* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
  raft: (server) improve tracing
  raft: (metrics) fix spelling of waiters_awaken
  raft: make forwarding optional
  raft: (service) manage Raft configuration during topology changes
  raft: (service) break a dependency loop
  raft: (discovery) introduce leader discovery state machine
  system_keyspace: mark scylla_local table as always-sync commitlog
  system_keyspace: persistence for Raft Group 0 id and Raft Server Id
  raft: add a test case for adding entries on follower
  raft: (server) allow adding entries/modify config on a follower
  raft: (test) replace virtual with override in derived class
  raft: (server) fix a typo in exception message
  raft: (server) implement id() helper
  raft: (server) remove apply_dummy_entry()
  raft: (test) fix missing initialization in generator.hh
2021-11-30 16:24:51 +02:00
Piotr Sarna
ecd122a1b0 Merge 'alternator: rudimentary implementation of TTL expiration service' from Nadav Har'El
In this patch series we add an implementation of an
expiration service to Alternator, which periodically scans the data in
the table, looking for expired items and deleting them.

We also continue to improve the TTL test suite to cover additional
corner cases discovered during the development of the code.

This implementation is good enough to make all existing tests but one,
plus a few new ones, pass, but is still a very partial and inefficient
implementation littered with FIXMEs throughout the code. Among other
things, this initial implementation doesn't do anything reasonable about
pacing of the scan or about multiple tables; it scans entire items instead
of only the needed parts; and because each shard "owns" a different subset
of the token ranges, if a node goes down, partitions which it "owns" will
not get expired.

The current tests cannot expose these problems, so we will need to develop
additional tests for them.

Because this implementation is very partial, the Alternator TTL continues
to remain "experimental", cannot be used without explicitly enabling this
experimental feature, and must not be used for any important deployment.

Refs #5060 but doesn't close the issue (let's not close it until we have a
reasonably complete implementation - not this partial one).

Closes #9624

* github.com:scylladb/scylla:
  alternator: fix TTL expiration scanner's handling of floating point
  test/alternator: add TTL test for more data
  test/alternator: remove "xfail" tag from passing tests in test_ttl.py
  test/alternator: make test_ttl.py tests fast on Alternator
  alternator: initial implementation of TTL expiration service
  alternator: add another unwrap_number() variant
  alternator: add find_tag() function
  test/alternator: test another corner case of TTL setting
  test/alternator: test TTL expiration for table with sort key
  test/alternator: improve basic test for TTL expiration
  test/alternator: extract is_aws() function
2021-11-28 22:12:52 +02:00
Avi Kivity
ec775ba292 Merge "Remove more gms::get(_local)?_gossiper() calls" from Pavel E
"
This set covers simple but diverse cases:
- cache hitrate calculator
- repair
- system keyspace (virtual table)
- dht code
- transport event notifier

All the places just require straightforward arguments passing.
And a fix in transport -- the event notifier needs a backref
to the owning server.

Remaining after this set is the snitch<->gossiper interaction
and the cache hitrate app state update from table code.

tests: unit(dev)
"

* 'br-unglobal-gossiper-cont' of https://github.com/xemul/scylla:
  transport: Use server gossiper in event notifier
  transport: Keep backreference from event_notifier
  transport: Keep gossiper on server
  dht: Pass gossiper to range_streamer::add_ranges
  dht: Pass gossiper argument to bootstrap
  system_keyspace: Keep gossiper on cluster_status_table
  code: Carry gossiper down to virtual tables creation
  repair: Use local gossiper reference
  cache_hitrate_calculator: Keep reference on gossiper
2021-11-28 14:18:28 +02:00
Pavel Solodovnikov
1365e2f13e gms: feature_service: re-enable features on node startup
Re-enable previously persisted enabled features on node
startup. The features list to be enabled is read from
`system.local#enabled_features`.

In case an unknown feature is encountered, the node
fails to boot with an exception, because that means
the node is doing a prohibited downgrade procedure.

Features should be enabled before commitlog starts replaying
since some features affect storage (for example, when
determining used sstable format).

This patch implements a part of solution proposed by Tomek
in https://github.com/scylladb/scylla/issues/4458.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:24 +02:00
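The downgrade guard described above is essentially a set-membership check on startup. An illustrative sketch with hypothetical names; the real code reads the persisted set from `system.local#enabled_features`:

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Toy version of the startup check: every previously persisted enabled
// feature must be known to this binary, otherwise the node is being
// downgraded and must refuse to boot.
void check_persisted_features(const std::set<std::string>& persisted,
                              const std::set<std::string>& supported) {
    for (const auto& feature : persisted) {
        if (!supported.count(feature)) {
            throw std::runtime_error(
                "unknown feature '" + feature + "' in enabled_features: "
                "downgrade is not supported");
        }
    }
}
```

Running this before commitlog replay matters because some features change on-disk behavior (e.g. the sstable format chosen for new writes).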
Nadav Har'El
13a3aca460 alternator: initial implementation of TTL expiration service
In this patch we add an incomplete implementation of an expiration
service to Alternator, which periodically scans the data in the table,
looking for expired items and deleting them.

This implementation involves a new "expiration service" which runs a
background scan in each shard. Each shard "owns" a subset of the token
ranges - the intersection of the node's primary ranges with this shard's
token ranges - and scans those ranges over and over, deleting any items
which are found expired.

This implementation is good enough to make all existing tests but one
pass, but is still a partial and inefficient implementation littered with
FIXMEs throughout the code. Among other things, this implementation
doesn't do anything reasonable about pacing of the scan or about multiple
tables, it scans entire items instead of only the needed parts, and
if a node goes down, the part of the token range which it "owns" will not
be scanned for expiration (we need living nodes to take over the
background expiration work for dead nodes).

The current tests cannot expose these problems, so we will need to develop
additional tests for them.

Because this implementation is very partial, the Alternator TTL continues
to remain "experimental", cannot be used without explicitly enabling this
experimental feature, and must not be used for any important deployment.
The new TTL expiration service will only run (at the moment) in the
background if the Alternator TTL experimental feature is enabled
and if Alternator is enabled as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding a node to or removing a node from the Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully choose the timing of when we
join and leave Raft group 0:

Join Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node and it's visible in nodetool status,
but before the node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends can be re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
Pavel Emelyanov
43951318c8 transport: Keep gossiper on server
The gossiper is needed by the transport::event_notifier. There's
already a gossiper reference on the transport controller, but it's
a local reference, because the controller doesn't need more. This
patch upgrades the controller's reference to sharded<> and propagates
it further up to the server.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:54:45 +03:00
Pavel Emelyanov
ef1960d034 code: Carry gossiper down to virtual tables creation
One of the tables needs the gossiper and uses the global one. This patch
prepares the fix by patching the main -> register_virtual_tables
stack with the gossiper reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:55 +03:00
Pavel Emelyanov
770d34796b cache_hitrate_calculator: Keep reference on gossiper
The calculator needs to update its app-state on the gossiper. Keeping
a reference is safe -- the gossiper starts early, the calculator at
the very end, and they stop in reverse order.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:27 +03:00
Pavel Emelyanov
4a34226aa6 streaming, main: Remove global stream_manager
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
fd920e2420 streaming, api: Standardize the API start/stop
Today's idea of API reg/unreg is to carry the target service via
lambda captures down to the route handlers and unregister those
handlers before the target is about to stop.

This patch makes it so for the streaming API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
aaa58b7b89 storage_service: Keep streaming_manager reference
The manager is drained() on drain/decommission/isolate. Since it is now
storage_service that orchestrates all of the above, it needs
an explicit reference to the target.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:35 +03:00
Pavel Emelyanov
c2c676784a streaming: Fix interaction with gossiper
The streaming manager registers itself in the gossiper, so it needs an
explicit dependency reference. It also forgot to unregister itself, so fix
that too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:59 +03:00
Pavel Emelyanov
73e10c7aed streaming: Move start/stop onto common rails
In the case of streaming this mostly means dropping the global
init/uninit calls and replacing them with a sharded<stream_manager>
instance. It's still global, but that is being fixed at the moment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Benny Halevy
d344765ec6 get rid of the global batchlog_manager
Now that it's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
9cde52c58f storage_service: keep a reference to the batchlog_manager
Rather than accessing the global batchlog_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
03039e8f8a main: allow setting the global batchlog_manager
As a prerequisite to de-globalizing the batchlog_manager,
allow setting a global pointer to it and instantiate
the sharded<db::batchlog_manager> on the main/cql_test_env
stack.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
1d7556d099 main: pass erm_factory to storage_service
To be used for creating the effective_replication_map
when token_metadata changes, and updating all
keyspaces with it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00