Commit Graph

3141 Commits

Author SHA1 Message Date
Avi Kivity
f86dd857ca Merge 'Certificate based authorization' from Calle Wilund
Fixes #10099

Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex.

Example:

scylla.yaml:

```
    authenticator: com.scylladb.auth.CertificateAuthenticator
    auth_superuser_name: <name>
    auth_certificate_role_query: CN=([^,\s]+)

    client_encryption_options:
      enabled: True
      certificate: <server cert>
      keyfile: <server key>
      truststore: <shared trust>
      require_client_auth: True
```
In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for  qlsh set "usercert" and "userkey" to these certificate files.

No user/password needs to be sent, but role will be picked up from auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recongnized role name (from config or set in tables) the authenticator mechanism will reject it.

Otherwise, connection becomes the role described.

To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup.

Closes #12214

* github.com:scylladb/scylladb:
  docs: Add documentation of certificate auth + auth_superuser_name
  auth: Add TLS certificate authenticator
  transport: Try to do early, transport based auth if possible
  auth: Allow for early (certificate/transport) authentication
  auth: Allow specifying initial superuser name + passwd (salted) in config
  roles-metadata: Coroutinuze some helpers
2023-06-27 12:52:14 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00
Raphael S. Carvalho
1ff8645eaa view_update_generator: Dump throughput and duration for view update from staging
Very helpful for user to understand how fast view update generation
is processing the staging sstables. Today, logs are completely
silent on that. It's not uncommon for operators to peek into
staging dir and deduce the throughput based on removal of files,
which is terrible.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:23 -03:00
Calle Wilund
a3db540142 auth: Add TLS certificate authenticator
Fixes #10099

Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator,
will extract roles from TLS authentication certificate (not wire cert - those are
server side) subject, based on configurable regex.

Example:

scylla.yaml:

authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_queries:
	- source: SUBJECT
	  query: CN=([^,\s]+)

client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True

In a client, then use a certificate signed with the <shared trust>
store as auth cert, with the common name <name>. I.e. for cqlsh
set "usercert" and "userkey" to these certificate files.

No user/password needs to be sent, but role will be picked up
from auth certificate. If none is present, the transport will
reject the connection. If the certificate subject does not
contain a recongnized role name (from config or set in tables)
the authenticator mechanism will reject it.

Otherwise, connection becomes the role described.
2023-06-26 15:00:21 +00:00
Calle Wilund
69217662bd auth: Allow specifying initial superuser name + passwd (salted) in config
Instead of locking this to "cassandra:cassandra", allow setting in scylla.yaml
or commandline. Note that config values become redundant as soon as auth tables
are initialized.
2023-06-26 15:00:20 +00:00
Alexey Novikov
ca4e7f91c6 compact and remove expired rows from cache on read
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs #2252, #6033

Closes #12917
2023-06-26 15:29:01 +02:00
Avi Kivity
b858a4669d cql3: expr: break up expression.hh header
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:

 - moving some restrictions-related definitions to
   the existing expr/restrictions.hh
 - moving evaluation related names to a new header
   expr/evaluate.hh
 - move utilities to a new header
   expr/expr-utilities.hh

expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
2023-06-22 14:21:03 +03:00
Avi Kivity
32b27d6a08 cql3: expr: change evaluation_input vector components to take spans
Spans are slightly cleaner, slightly faster (as they avoid an indirection),
and allow for replacing some of the arguments with small_vector:s.

Closes #14313
2023-06-22 11:28:01 +02:00
Avi Kivity
8576502c48 Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.

This works not only for nodes that restart, but also for nodes that were running behind a network partition and we removed them. Even when the partition resolves, the existing nodes will effectively put a firewall from that node.

Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.

Closes #13850

* github.com:scylladb/scylladb:
  test: pylib: increase checking period for `get_alive_endpoints`
  test: add node banning test
  test: pylib: manager_client: `get_cql()` helper
  test: pylib: ScyllaCluster: server pause/unpause API
  raft topology: ban left nodes
  raft topology: skip `left_token_ring` state during `removenode`
  raft topology: prepare decommission path for node banning
  raft topology: introduce `left_token_ring` state
  raft topology: `raft_topology_cmd` implicit constructor
  messaging_service: implement host banning
  messaging_service: exchange host IDs and map them to connections
  messaging_service: store the node's host ID
  messaging_service: don't use parameter defaults in constructor
  main: move messaging_service init after system_keyspace init
2023-06-21 20:16:45 +03:00
Kefu Chai
f014ccf369 Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai""
This reverts commit 562087beff.

The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.

Fixes #10459
2023-06-21 13:02:40 +03:00
Tomasz Grabiec
29cbdb812b dht: Rename dht::shard_of() to dht::static_shard_of()
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
16797c2d1a db: token_ring_table: Filter out tablet-based keyspaces
Querying from virtual table system.token_ring fails if there is a
tablet-based table due to attempt to obtain a per-keyspace erm.

Fix by not showing such keyspaces.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
2303466375 db: schema: Attach table pointer to schema
This will make it easier to access table proprties in places which
only have schema_ptr. This is in particular useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.

Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.

It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
2023-06-21 00:58:24 +02:00
Kamil Braun
643e69af89 Merge 'Cluster features on raft: add storage for supported and enabled features' from Piotr Dulikowski
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.

Two new columns are added to `system.topology`:

- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.

These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.

During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.

Closes #14232

* github.com:scylladb/scylladb:
  storage_service: update supported cluster features in group0 on start
  storage_service: add methods for features to topology mutation builder
  storage_service: use explicit ::set overload instead of a template
  storage_service: reimplement mutation builder setters
  storage_service: introduce topology_mutation_builder_base
  topology_state_machine: include information about features
  system_keyspace: introduce deserialize_set_column
  db/system_keyspace: add storage for cluster features managed in group 0
2023-06-20 18:32:00 +02:00
Piotr Dulikowski
bc84d59665 topology_state_machine: include information about features
Now, the newly added `supported_features` and `enabled_features` columns
are reflected in the `topology_state_machine` structure.
2023-06-20 16:41:05 +02:00
Piotr Dulikowski
e527e63abc system_keyspace: introduce deserialize_set_column
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, therefore they can be
abstracted away.

This commit adds `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type` which can be
then passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
2023-06-20 16:37:09 +02:00
Kamil Braun
b8ddfd9ef9 raft topology: introduce left_token_ring state
We want for the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that nodes after it already shut down,
leading to unnecessary failures on the data path(e.g. for CL=ALL writes).

Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.

After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds we
allow the node to shut down.
2023-06-20 13:03:46 +02:00
Botond Dénes
562087beff Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.

Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.

Refs: #14283
2023-06-19 08:49:27 +03:00
Kamil Braun
028183c793 main, cql_test_env: simplify system_keyspace initialization
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
2023-06-18 13:39:27 +02:00
Kamil Braun
33c19baabc db: system_keyspace: take simpler service references in make
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.

This will allow us to move the `make` step right after starting
`system_keyspace`.
2023-06-18 13:39:27 +02:00
Kamil Braun
b34605d161 db: system_keyspace: call initialize_virtual_tables from main
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).

Call it from `main`/`cql_test_env` instead.

Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.

None of these steps apply to virtual tables.

There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.

Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
2023-06-18 13:39:27 +02:00
Kamil Braun
c931d9327d db: system_keyspace: refactor virtual tables creation
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.

This will allow, in later commit, to make `system_keyspace`
initialization independent of services used for distributed
communication such as `gossiper`. See further commits for details.
2023-06-18 13:39:27 +02:00
Kamil Braun
035045c288 db: system_keyspace: remove system_keyspace_make
The code can now be inlined in `system_keyspace::make` as we no longer
access private members of `database`.
2023-06-18 13:39:27 +02:00
Kamil Braun
cf120e46b8 db: system_keyspace: refactor local system table creation code
`system_keyspace_make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directory for `system` and
`system_schema`).

Extract this part into `database::create_local_system_table`.

Make `database::add_column_family` private.
2023-06-18 13:39:27 +02:00
Kamil Braun
3f04a5956c replica: database: remove is_bootstrap argument from create_keyspace
Unused.
2023-06-18 13:39:27 +02:00
Kamil Braun
53cf646103 db: system_keyspace: don't take sharded<> references
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on were also already started.

Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependant services being initialized are gone.
2023-06-18 13:39:26 +02:00
Piotr Dulikowski
dcd520f6cf db/system_keyspace: add storage for cluster features managed in group 0
The `system.topology` table is extended with two new columns that will
be used to manage cluster features:

- `supported_features set<text>` is a new clustering column which holds
  the features that given node advertises as supported. It will be first
  initialized when the node joins the cluster, and then updated every
  time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the
  features that are considered enabled by the cluster. Unlike in the
  current gossip-based implementation the features will not be enabled
  implicitly when all nodes support a feature, but rather when via an
  explicit action of the topology coordinator.
2023-06-16 13:19:53 +02:00
Tomasz Grabiec
e41ff4604d Merge 'raft_topology: fencing and global_token_metadata_barrier' from Gusev Petr
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).

* the topology version (int64) was introduced, it's stored in topology table and updated through RAFT at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster, if some node is unavailable we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and  then waits until all requests using previous versions are finished; if this round of RPCs is finished the topology change coordinator can be sure that there are no requests inflight using previous versions and such requests can't appear in the future.
* after `barrier_and_drain` the topology change coordinator issues the `fence` command, it stores the current version in local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with older version was started before the fence, its reply will also be fenced.
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from the nodes which was unavailable during topology change and which was not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for `mutation` and `read` RPCs, other RPCs will be handled in the follow-ups; since currently all nodes are supposed to be alive the missing parts of the fencing doesn't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will be also added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present.

Closes #13884

* github.com:scylladb/scylladb:
  storage_service: warn if can't find ip for server
  storage_proxy.cc: add and use global_token_metadata_barrier
  storage_service: exec_global_command: bool result -> exceptions
  raft_topology: add cmd_index to raft commands
  storage_proxy.cc: add fencing to read RPCs
  storage_proxy.cc: extract handle_read
  storage_proxy.cc: refactor encode_replica_exception_for_rpc
  storage_proxy: fix indentation
  storage_proxy: add fencing for mutation
  storage_servie: fix indentation
  storage_proxy: add fencing_token and related infrastructure
  raft topology: add fence_version
  raft_topology: add barrier_and_drain cmd
  token_metadata: add topology version
2023-06-16 12:07:31 +02:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
253d8a8c65 token_metadata: add topology version
It's stored in as a static column in topology table,
will be updated at various steps of the topology
change state machine.

The initial value is 1, zero means that topology
versions are not yet supported, will be
used in RPC handling.
2023-06-15 15:48:00 +04:00
Kefu Chai
4c2df04449 db: config: add uuid_sstable_identifiers_enabled option
unlike Cassandra 4.1, this option is true by default, will be used
for enabling cluster feature of "UUID_SSTABLE_IDENTIFIERS". not wired yet.

please note, because we are still using sstableloader and
sstabledump based on 3.x branch, while the Cassandra upstream
introduced the uuid sstable identifier in its 4.x branch, these tool
fail to work with the sstables with uuid identifier, so this option
is disabled when performing these tests. we will enable it once
these tools are updated to support the uuid-basd sstable identifiers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
15543464ce sstables, replica: support UUID in generation_type
this change generalize the value of generation_type so it also
supports UUID based identifier.

* sstables/generation_type.h:
  - add formatter and parse for UUID. please note, Cassandra uses
    a different format for formatting the SSTable identifier. and
    this formatter suits our needs as it uses underscore "_" as the
    delimiter, as the file name of components uses dash "-" as the
    delimiter. instead of reinventing the formatting or just use
    another delimiter in the stringified UUID, we choose to use the
    Cassandra's formatting.
  - add accessors for accessing the type and value of generation_type
  - add constructor for constructing generation_type with UUID and
    string.
  - use hash for placing sstables with uuid identifiers into shards
    for more uniformed distrbution of tables in shards.
* replica/table.cc:
  - only update the generator if the given generation contains an
    integer
* test/boost:
  - add a simple test to verify the generation_type is able to
    parse and format

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kamil Braun
0e36377f56 db: consistency_level: remove overload of filter_for_query
Not used anymore after the previous commit.
2023-06-14 11:41:36 +02:00
Pavel Emelyanov
c68c154fb6 code: Reduce tracing/*hh fanout
There are some headers that include tracing/*.hh ones despite all they
need is forward-declared trace_state_ptr

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14155
2023-06-07 19:19:22 +03:00
Nadav Har'El
5984db047d Merge 'mv: forbid IS NOT NULL on columns outside the primary key' from Jan Ciołek
statement_restrictions: forbid IS NOT NULL on columns outside the primary key

IS NOT NULL is currently allowed only when creating materialized views.
It's used to convey that the view will not include any rows that would make the view's primary key columns NULL.

Generally materialized views allow to place restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected, it's just silently ignored (see #10365).

Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns.
It would be a big effort, and there are some reasons why we don't support them.

For now let's forbid such restrictions, it's better to fail than be wrong silently.

Throwing a hard error would be a breaking change.
To avoid breaking existing code the reaction to an invalid IS NOT NULL restrictions is controlled by the `strict_is_not_null_in_views` flag.

This flag can have the following values:
* `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown.
* `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored.
* `false` - allow invalid `IS NOT NULL` restricitons, without any warnings or errors. The invalid restrictions are silently ignored.

The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml.

This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view.

New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`.
New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning.
This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters.

Fixes: #10365

Closes #13013

* github.com:scylladb/scylladb:
  boost/restriction_test: test the strict_is_not_null_in_views flag
  docs/cql/mv: columns outside of view's primary key can't be restricted
  cql-pytest: enable test_is_not_null_forbidden_in_filter
  statement_restrictions: forbid IS NOT NULL on columns outside the primary key
  schema_altering_statement: return warnings from prepare_schema_mutations()
  db/config: add strict_is_not_null_in_views config option
  statement_restrictions: add get_not_null_columns()
  test: remove invalid IS NOT NULL restrictions from tests
2023-06-07 12:12:19 +03:00
Jan Ciolek
c67d65987e db/config: add strict_is_not_null_in_views config option
IS NOT NULL shouldn't be allowed on columns
which are outside of the materialized view's primary key.
It's currently allowed to create views with such restrictions,
but they're silently ignored, it's a bug.

In the following commits restricting regular columns
with IS NOT NULL will be forbidden.
This is a breaking change.

Some users might have existing code that creates
views with such restrictions, we don't want to break it.

To deal with this a new feature flag is introduced:
strict_is_not_null_in_views.

By default it's set to `warn`. If a user tries to create
a view with such invalid restrictions they will get a warning
saying that this is invalid, but the query will still go through,
it's just a warning.

The default value in scylla.yaml will be `true`. This way new clusters
will have strict enforcement enabled and they'll throw errors when the
user tries to create such an invalid view,
Old clusters without the flag present in scylla.yaml will
have the flag set to warn, so they won't break on an update.

There's also the option to set the flag to `false`. It's dangerous,
as it silences information about a bug, but someone might want it
to silence the warnings for a moment.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 01:48:39 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Aarav Arora
a12d2d5f16 fix: keyspace spell
Closes #14121
2023-06-04 13:48:43 +03:00
Konstantin Osipov
b39ca97919 consistent_cluster_management: make the default
As per our roll out plan, make consistent_cluster_management (aka Raft
for schema changes) the default going forward. It means all
clusters which upgrade from the previous version and don't have
`consistent_cluster_management` explicitly set in scylla.yaml will begin
upgrading to Raft once all nodes in the cluster have moved to the new
version.

Fixes #13980

Closes #13984
2023-06-02 09:05:09 +02:00
Kamil Braun
8be69fc3a0 Merge 'Initialize group0 server on boot before allowing incoming requests' from Gleb
The series includes mostly cleanups and one bug fix.

The fix is for the race where messages that need to access group0 server are arriving
before the server is initialized.

* 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev:
  service: raft: fix typo
  service: raft: split off setup_group0_if_exist from setup_group0
  storage_service: do not allow override_decommission flag if consistent cluster management is enabled
  storage_service: fix indentation after the previous patch
  storage_service: co-routinize storage_service::join_cluster() function
  storage_service: do not reload topology from peers table if topology over raft is enabled
  storage_service: optimize debug logging code in case debug log is not enabled
2023-06-01 17:37:58 +02:00
Gleb Natapov
acc035b504 storage_service: do not allow override_decommission flag if consistent cluster management is enabled
If consistent cluster management is enabled it is not possible to
restart decommissioned node since it will not be part of the grouup0.
2023-05-31 10:40:42 +03:00
Avi Kivity
ffce6d94fc Merge 'service: storage_proxy: make hint write handlers cancellable' from Kamil Braun
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose:
to make it possible to cancel a handler for a view update write,
which means we stop waiting for a response to the write, timing out
the handler immediately. This was done to solve issue with node
shutdown hanging because it was waiting for a view update to finish;
view updates were configured with 5 minute timeout. See #3966, #4028.

Now we're having a similar problem with hint updates causing shutdown
to hang in tests (#8079).

`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy`
that a node is down.

To make it possible to reuse this algorithm for other handlers, move
the functionality into `abstract_write_response_handler`. We inherit
from `bi::list_base_hook` so it introduces small memory overhead to
each write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large, the
overhead is small compared to their size.

Use this new functionality to also cancel hint write handlers when we
shutdown. This fixes #8079.

Closes #14047

* github.com:scylladb/scylladb:
  test: reproducer for hints manager shutdown hang
  test: pylib: ScyllaCluster: generalize config type for `server_add`
  test: pylib: scylla_cluster: add explicit timeout for graceful server stop
  service: storage_proxy: make hint write handlers cancellable
  service: storage_proxy: rename `view_update_handlers_list`
  service: storage_proxy: make it possible to cancel all write handler types
2023-05-30 01:36:50 +03:00
Kamil Braun
beabb61566 test: reproducer for hints manager shutdown hang 2023-05-29 11:03:39 +02:00
Kamil Braun
0ef35ceed4 service: storage_proxy: make hint write handlers cancellable
Whether a write handler should be cancellable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint` which is called by hints manager.

This will cause hint write handlers to immediately timeout when we
shutdown or when a destination node is marked as dead.

Fixes #8079
2023-05-29 11:03:18 +02:00
Pavel Emelyanov
5861d15912 Merge 'Small gossiper and migration_manager cleanups' from Gleb
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/

Reviewed-by: Konstantin Osipov <kostja@scylladb.com>

* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
  storage_service: avoid unneeded copies in on_change
  storage_service: remove check that is always true
  storage_service: rename handle_state_removing to handle_state_removed
  storage_service: avoid string copy
  storage_service: delete code that handled REMOVING_TOKENS state
  gossiper: remove code related to advertising REMOVING_TOKEN state
  migration_manager: add wait_for_schema_agreement() function
2023-05-27 10:49:54 +03:00
Gleb Natapov
a429018a8a migration_manager: add wait_for_schema_agreement() function
Several subsystems re-implement the same logic for waiting for schema
agreement. Provide the function in the migration_manager and use it
instead.
2023-05-25 14:44:53 +03:00
Tomasz Grabiec
51e3b9321b Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is a one of the last vestiges of the times when partition were treated atomically, and it is a well known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade.

The series:
1. Does some code cleanup in the mutation_partition area.
2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
3. Adds upgrading variants of constructors and apply() for `row` and its wrappers.
4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.

Fixes #2577

Closes #13761

* github.com:scylladb/scylladb:
  test: mvcc_test: add a test for gentle schema upgrades
  partition_version: make partition_entry::upgrade() gentle
  partition_version: handle multi-schema snapshots in merge_partition_versions
  mutation_partition_v2: handle schema upgrades in apply_monotonically()
  partition_version: remove the unused "from" argument in partition_entry::upgrade()
  row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
  partition_version: handle multi-schema entries in partition_entry::squashed
  partition_snapshot_row_cursor: handle multi-schema snapshots
  partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots
  partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
  partition_version: add a logalloc::region argument to partition_entry::upgrade()
  memtable: propagate the region to memtable_entry::upgrade_schema()
  mutation_partition: add an upgrading variant of lazy_row::apply()
  mutation_partition: add an upgrading variant of rows_entry::rows_entry
  mutation_partition: switch an apply() call to apply_monotonically()
  mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
  mutation_fragment: add an upgrading variant of clustering_row::apply()
  mutation_partition: add an upgrading variant of row::row
  partition_version: remove _schema from partition_entry::operator<<
  partition_version: remove the schema argument from partition_entry::read()
  memtable: remove _schema from memtable_entry
  row_cache: remove _schema from cache_entry
  partition_version: remove the _schema field from partition_snapshot
  partition_version: add a _schema field to partition_version
  mutation_partition: change schema_ptr to schema& in mutation_partition::difference
  mutation_partition: change schema_ptr to schema& in mutation_partition constructor
  mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
  mutation_partition: add upgrading variants of row::apply()
  partition_version: update the comment to apply_to_incomplete()
  mutation_partition_v2: clean up variants of apply()
  mutation_partition: remove apply_weak()
  mutation_partition_v2: remove a misleading comment in apply_monotonically()
  row_cache_test: add schema changes to test_concurrent_reads_and_eviction
  mutation_partition: fix mixed-schema apply()
2023-05-24 22:58:43 +02:00
Kefu Chai
b0c40a2a03 db: config: s/ingore/ignore/
this string is used in as the option description in the command line
help message. so it is a part of user facing interface.

in this change, the typo is fixed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14013
2023-05-24 13:35:24 +03:00
Pavel Emelyanov
5aea6938ae commitlog: Introduce and use comitlog sched group
Nowadays all commitlog code runs in whatever sched group it's kicked
from. Since IO prio classes are going to be inherited from the current
sched group the commitlog IO loops should be moved into commitlog sched
group, not inherit a "random" one.

There are currently two places that need correct context for IO -- the
.cycle() method and segments replenisher.

`$ perf-simple-query --write -c2` results

--- Before the patch ---
194898.36 tps ( 56.3 allocs/op,  12.7 tasks/op,   54307 insns/op,        0 errors)
199286.23 tps ( 56.2 allocs/op,  12.7 tasks/op,   54375 insns/op,        0 errors)
199815.84 tps ( 56.2 allocs/op,  12.7 tasks/op,   54377 insns/op,        0 errors)
198260.98 tps ( 56.3 allocs/op,  12.7 tasks/op,   54380 insns/op,        0 errors)
198572.86 tps ( 56.2 allocs/op,  12.7 tasks/op,   54371 insns/op,        0 errors)

median 198572.86 tps ( 56.2 allocs/op,  12.7 tasks/op,   54371 insns/op,        0 errors)
median absolute deviation: 713.36
maximum: 199815.84
minimum: 194898.36

--- After the patch ---
194751.80 tps ( 56.3 allocs/op,  12.7 tasks/op,   54331 insns/op,        0 errors)
199084.70 tps ( 56.2 allocs/op,  12.7 tasks/op,   54389 insns/op,        0 errors)
195551.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54385 insns/op,        0 errors)
197953.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54386 insns/op,        0 errors)
198710.00 tps ( 56.3 allocs/op,  12.7 tasks/op,   54387 insns/op,        0 errors)

median 197953.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54386 insns/op,        0 errors)
median absolute deviation: 1131.24
maximum: 199084.70
minimum: 194751.80

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14005
2023-05-23 21:25:57 +03:00