scylladb

Author	SHA1	Message	Date
Avi Kivity	f86dd857ca	Merge 'Certificate based authorization' from Calle Wilund Fixes #10099 Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex. Example: scylla.yaml: ``` authenticator: com.scylladb.auth.CertificateAuthenticator auth_superuser_name: <name> auth_certificate_role_query: CN=([^,\s]+) client_encryption_options: enabled: True certificate: <server cert> keyfile: <server key> truststore: <shared trust> require_client_auth: True ``` In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for cqlsh set "usercert" and "userkey" to these certificate files. No user/password needs to be sent, but role will be picked up from auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables) the authenticator mechanism will reject it. Otherwise, connection becomes the role described. To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup. Closes #12214 * github.com:scylladb/scylladb: docs: Add documentation of certificate auth + auth_superuser_name auth: Add TLS certificate authenticator transport: Try to do early, transport based auth if possible auth: Allow for early (certificate/transport) authentication auth: Allow specifying initial superuser name + passwd (salted) in config roles-metadata: Coroutinuze some helpers	2023-06-27 12:52:14 +03:00
Botond Dénes	f5e3b8df6d	Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho View building from staging creates a reader from scratch (memtable \+ sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: ``` + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert ``` That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as each one has to be replicated into almost every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a fresh sstable set when recreating the reader, but rather supplying a predicate to the sstable set that filters out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach, which may incorrectly consider a staging sstable as non-staging if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster, going from `INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s` to `INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s` Refs https://github.com/scylladb/scylladb/issues/14089. Fixes scylladb/scylladb#14244. Closes #14364 * github.com:scylladb/scylladb: table: Optimize creation of reader excluding staging for view building view_update_generator: Dump throughput and duration for view update from staging utils: Extract pretty printers into a header	2023-06-27 07:25:30 +03:00
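As a quick sanity check on the "3x faster" figure in the merge message above, the two logged durations can be compared directly. This is only an arithmetic sketch; the function name `speedup` is illustrative and not part of ScyllaDB.

```python
def speedup(before_ms: int, after_ms: int) -> float:
    """Return how many times faster the second run was than the first."""
    return before_ms / after_ms


# Durations copied from the two view_update_generator log lines quoted above.
ratio = speedup(963957, 319899)
print(f"{ratio:.2f}x")
```

The ratio of the durations lines up with the logged throughput going from 50kB/s to 150kB/s for the same five sstables.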
Raphael S. Carvalho	1ff8645eaa	view_update_generator: Dump throughput and duration for view update from staging Very helpful for user to understand how fast view update generation is processing the staging sstables. Today, logs are completely silent on that. It's not uncommon for operators to peek into staging dir and deduce the throughput based on removal of files, which is terrible. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 21:58:23 -03:00
Calle Wilund	a3db540142	auth: Add TLS certificate authenticator Fixes #10099 Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex. Example: scylla.yaml: authenticator: com.scylladb.auth.CertificateAuthenticator auth_superuser_name: <name> auth_certificate_role_queries: - source: SUBJECT query: CN=([^,\s]+) client_encryption_options: enabled: True certificate: <server cert> keyfile: <server key> truststore: <shared trust> require_client_auth: True In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for cqlsh set "usercert" and "userkey" to these certificate files. No user/password needs to be sent, but role will be picked up from auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables) the authenticator mechanism will reject it. Otherwise, connection becomes the role described.	2023-06-26 15:00:21 +00:00
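To illustrate how the example query `CN=([^,\s]+)` picks a role name out of a certificate subject, here is a minimal Python sketch. The helper `role_from_subject` and the sample subject strings are assumptions made for illustration; the real authenticator is implemented in C++ inside ScyllaDB and may differ in regex dialect and matching details.

```python
import re

# Pattern copied from the example configuration above; the rest is illustrative.
ROLE_QUERY = re.compile(r"CN=([^,\s]+)")


def role_from_subject(subject: str):
    """Return the first captured group of the query, or None if nothing matches."""
    match = ROLE_QUERY.search(subject)
    return match.group(1) if match else None


# Hypothetical subjects: one with a common name, one without.
print(role_from_subject("C=SE,O=Example,CN=alice"))
print(role_from_subject("C=SE,O=Example"))
```

A subject without a matching common name yields no role in this sketch, which corresponds to the case where the commit says the authenticator rejects the connection.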
Calle Wilund	69217662bd	auth: Allow specifying initial superuser name + passwd (salted) in config Instead of locking this to "cassandra:cassandra", allow setting in scylla.yaml or commandline. Note that config values become redundant as soon as auth tables are initialized.	2023-06-26 15:00:20 +00:00
Alexey Novikov	ca4e7f91c6	compact and remove expired rows from cache on read When reading from cache: compact and expire row tombstones; remove expired empty rows from cache; do not expire range tombstones in this patch. Refs #2252, #6033 Closes #12917	2023-06-26 15:29:01 +02:00
Avi Kivity	b858a4669d	cql3: expr: break up expression.hh header Adding a function declaration to expression.hh causes many recompilations. Reduce that by: - moving some restrictions-related definitions to the existing expr/restrictions.hh - moving evaluation related names to a new header expr/evaluate.hh - move utilities to a new header expr/expr-utilities.hh expression.hh contains only expression definitions and the most basic and common helpers, like printing.	2023-06-22 14:21:03 +03:00
Avi Kivity	32b27d6a08	cql3: expr: change evaluation_input vector components to take spans Spans are slightly cleaner, slightly faster (as they avoid an indirection), and allow for replacing some of the arguments with small_vector:s. Closes #14313	2023-06-22 11:28:01 +02:00
Avi Kivity	8576502c48	Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected. This works not only for nodes that restart, but also for nodes that were running behind a network partition and we removed them. Even when the partition resolves, the existing nodes will effectively put a firewall from that node. Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details. Closes #13850 * github.com:scylladb/scylladb: test: pylib: increase checking period for `get_alive_endpoints` test: add node banning test test: pylib: manager_client: `get_cql()` helper test: pylib: ScyllaCluster: server pause/unpause API raft topology: ban left nodes raft topology: skip `left_token_ring` state during `removenode` raft topology: prepare decommission path for node banning raft topology: introduce `left_token_ring` state raft topology: `raft_topology_cmd` implicit constructor messaging_service: implement host banning messaging_service: exchange host IDs and map them to connections messaging_service: store the node's host ID messaging_service: don't use parameter defaults in constructor main: move messaging_service init after system_keyspace init	2023-06-21 20:16:45 +03:00
Kefu Chai	f014ccf369	Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"" This reverts commit `562087beff`. The regressions introduced by the reverted change have been fixed. So let's revert this revert to resurrect the uuid_sstable_identifier_enabled support. Fixes #10459	2023-06-21 13:02:40 +03:00
Tomasz Grabiec	29cbdb812b	dht: Rename dht::shard_of() to dht::static_shard_of() This is in order to prevent new incorrect uses of dht::shard_of() from being accidentally added. It also makes sure that all current uses are caught by the compiler and require an explicit rename.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	e48ec6fed3	db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	16797c2d1a	db: token_ring_table: Filter out tablet-based keyspaces Querying from virtual table system.token_ring fails if there is a tablet-based table due to attempt to obtain a per-keyspace erm. Fix by not showing such keyspaces.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	2303466375	db: schema: Attach table pointer to schema This will make it easier to access table properties in places which only have schema_ptr. This is in particular useful when replacing dht::shard_of() uses with s->table().shard_of(), now that sharding is no longer static, but table-specific. Also, it allows us to install a guard which catches invalid uses of schema::get_sharder() on tablet-based tables. It will be helpful for other uses as well. For example, we can now get rid of the static_props hack.	2023-06-21 00:58:24 +02:00
Kamil Braun	643e69af89	Merge 'Cluster features on raft: add storage for supported and enabled features' from Piotr Dulikowski This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature. Two new columns are added to `system.topology`: - `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes. - `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator. These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns. During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features. 
Closes #14232 * github.com:scylladb/scylladb: storage_service: update supported cluster features in group0 on start storage_service: add methods for features to topology mutation builder storage_service: use explicit ::set overload instead of a template storage_service: reimplement mutation builder setters storage_service: introduce topology_mutation_builder_base topology_state_machine: include information about features system_keyspace: introduce deserialize_set_column db/system_keyspace: add storage for cluster features managed in group 0	2023-06-20 18:32:00 +02:00
Piotr Dulikowski	bc84d59665	topology_state_machine: include information about features Now, the newly added `supported_features` and `enabled_features` columns are reflected in the `topology_state_machine` structure.	2023-06-20 16:41:05 +02:00
Piotr Dulikowski	e527e63abc	system_keyspace: introduce deserialize_set_column There are three places in system_keyspace.cc which deserialize a column holding a set of tokens and convert it to an unordered set of dht::token. The deserialization process involves a small number of steps that are the same in all of those places, therefore they can be abstracted away. This commit adds `deserialize_set_column` function which takes care of deserializing the column to `set_type_impl::native_type` which can be then passed to `decode_tokens`. The new function will also be useful for decoding set columns with cluster features, which will be handled in the next commit.	2023-06-20 16:37:09 +02:00
Kamil Braun	b8ddfd9ef9	raft topology: introduce `left_token_ring` state We want the decommissioning node to wait before shutting down until every node learns that it left the token ring. Otherwise some nodes may still try coordinating writes to that node after it has already shut down, leading to unnecessary failures on the data path (e.g. for CL=ALL writes). Before this change, a node would shut down immediately after observing that it was in `left` state; some other nodes may still see it in `decommissioning` state and the topology transition state as `write_both_read_new`, so they'd try to write to that node. After this change, the node first enters the `left_token_ring` state before entering `left`, while the topology transition state is removed (so we've finished the token ring change - the node no longer has tokens in the ring, but it's still part of the topology). There we perform a read barrier, allowing all nodes to observe that the decommissioning node has indeed left the token ring. Only after that barrier succeeds do we allow the node to shut down.	2023-06-20 13:03:46 +02:00
Botond Dénes	562087beff	Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai" This reverts commit `d1dc579062`, reversing changes made to `3a73048bc9`. Said commit caused regressions in dtests. We need to investigate and fix those, but in the meanwhile let's revert this to reduce the disruption to our workflows. Refs: #14283	2023-06-19 08:49:27 +03:00
Kamil Braun	028183c793	main, cql_test_env: simplify `system_keyspace` initialization Initialization of `system_keyspace` is now all done at once instead of being spread out through the entire procedure. This is doable because `query_processor` is now available early. A couple of FIXMEs have been resolved.	2023-06-18 13:39:27 +02:00
Kamil Braun	33c19baabc	db: system_keyspace: take simpler service references in `make` Take references to services which are initialized earlier. The references to `gossiper`, `storage_service` and `raft_group0_registry` are no longer needed. This will allow us to move the `make` step right after starting `system_keyspace`.	2023-06-18 13:39:27 +02:00
Kamil Braun	b34605d161	db: system_keyspace: call `initialize_virtual_tables` from `main` `initialize_virtual_tables` was called from `system_keyspace::make`, which caused this `make` function to take a bunch of references to late-initialized services (`gossiper`, `storage_service`). Call it from `main`/`cql_test_env` instead. Note: `system_keyspace::make` is called from `distributed_loader::init_system_keyspace`. The latter function contains additional steps: populate the system keyspaces (with data from sstables) and mark their tables ready for writes. None of these steps apply to virtual tables. There exists at least one writable virtual table, but writes into virtual tables are special and the implementation of writes is virtual-table specific. The existing writable virtual table (`db_config_table`) only updates in-memory state when written to. If a virtual table would like to create sstables, or populate itself with sstable data on startup, it will have to handle this in its own initialization function. Separating `initialize_virtual_tables` like this will allow us to simplify `system_keyspace` initialization, making it independent of services used for distributed communication.	2023-06-18 13:39:27 +02:00
Kamil Braun	c931d9327d	db: system_keyspace: refactor virtual tables creation Split `system_keyspace::make` into two steps: creating regular `system` and `system_schema` tables, then creating virtual tables. This will allow, in later commit, to make `system_keyspace` initialization independent of services used for distributed communication such as `gossiper`. See further commits for details.	2023-06-18 13:39:27 +02:00
Kamil Braun	035045c288	db: system_keyspace: remove `system_keyspace_make` The code can now be inlined in `system_keyspace::make` as we no longer access private members of `database`.	2023-06-18 13:39:27 +02:00
Kamil Braun	cf120e46b8	db: system_keyspace: refactor local system table creation code `system_keyspace_make` would access private fields of `database` in order to create local system tables (creating the `keyspace` and `table` in-memory structures, creating directory for `system` and `system_schema`). Extract this part into `database::create_local_system_table`. Make `database::add_column_family` private.	2023-06-18 13:39:27 +02:00
Kamil Braun	3f04a5956c	replica: database: remove `is_bootstrap` argument from create_keyspace Unused.	2023-06-18 13:39:27 +02:00
Kamil Braun	53cf646103	db: system_keyspace: don't take `sharded<>` references Take `query_processor` and `database` references directly, not through `sharded<...>&`. This is now possible because we moved `query_processor` and `database` construction early, so by the time `system_keyspace` is started, the services it depends on were also already started. Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member functions can now be replaced with direct uses of `_qp` and `_db`. Runtime assertions for dependant services being initialized are gone.	2023-06-18 13:39:26 +02:00
Piotr Dulikowski	dcd520f6cf	db/system_keyspace: add storage for cluster features managed in group 0 The `system.topology` table is extended with two new columns that will be used to manage cluster features: - `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes. - `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.	2023-06-16 13:19:53 +02:00
Tomasz Grabiec	e41ff4604d	Merge 'raft_topology: fencing and global_token_metadata_barrier' from Gusev Petr This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit). * the topology version (int64) was introduced, it's stored in topology table and updated through RAFT at the relevant stages of the topology change algorithm; * when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster, if some node is unavailable we fail and retry indefinitely; * the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; if this round of RPCs is finished the topology change coordinator can be sure that there are no requests inflight using previous versions and such requests can't appear in the future. * after `barrier_and_drain` the topology change coordinator issues the `fence` command, it stores the current version in local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with older version was started before the fence, its reply will also be fenced. 
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from the nodes which were unavailable during topology change and which were not reached by the `barrier_and_drain()` command; * currently, fencing is implemented for `mutation` and `read` RPCs, other RPCs will be handled in the follow-ups; since currently all nodes are supposed to be alive the missing parts of the fencing don't break correctness; * along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will also be added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done; * we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present. Closes #13884 * github.com:scylladb/scylladb: storage_service: warn if can't find ip for server storage_proxy.cc: add and use global_token_metadata_barrier storage_service: exec_global_command: bool result -> exceptions raft_topology: add cmd_index to raft commands storage_proxy.cc: add fencing to read RPCs storage_proxy.cc: extract handle_read storage_proxy.cc: refactor encode_replica_exception_for_rpc storage_proxy: fix indentation storage_proxy: add fencing for mutation storage_servie: fix indentation storage_proxy: add fencing_token and related infrastructure raft topology: add fence_version raft_topology: add barrier_and_drain cmd token_metadata: add topology version	2023-06-16 12:07:31 +02:00
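The fencing rule described in the merge message above (reject requests carrying an older topology version, skip the check when no token is supplied) can be sketched in Python as follows. The names `StaleTopologyError` and `check_fence` are hypothetical stand-ins; the actual code is C++ in `storage_proxy` and is not reproduced here.

```python
from typing import Optional


class StaleTopologyError(Exception):
    """Hypothetical stand-in for the stale_topology_exception named above."""


def check_fence(request_version: Optional[int], fence_version: int) -> None:
    """Raise if the request was issued under a topology older than the fence."""
    # A missing token models an absent rpc::optional fencing_token:
    # the sender predates fencing, so the check is skipped.
    if request_version is None:
        return
    if request_version < fence_version:
        raise StaleTopologyError(
            f"request version {request_version} is older than fence version {fence_version}"
        )


check_fence(None, 5)  # no token supplied: accepted
check_fence(5, 5)     # current version: accepted
try:
    check_fence(4, 5)  # older version: rejected
except StaleTopologyError as exc:
    print("fenced:", exc)
```

Treating a missing token as "accept" is what keeps the affected RPCs compatible with senders that do not yet pass `fencing_token`.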
Petr Gusev	f6b019c229	raft topology: add fence_version It's stored outside of topology table, since it's updated not through RAFT, but with a new 'fence' raft command. The current value is cached in shared_token_metadata. An initial fence version is loaded in main during storage_service initialisation.	2023-06-15 15:48:00 +04:00
Petr Gusev	253d8a8c65	token_metadata: add topology version It's stored as a static column in the topology table and will be updated at various steps of the topology change state machine. The initial value is 1; zero means that topology versions are not yet supported, which will be used in RPC handling.	2023-06-15 15:48:00 +04:00
Kefu Chai	4c2df04449	db: config: add uuid_sstable_identifiers_enabled option Unlike Cassandra 4.1, this option is true by default; it will be used for enabling the cluster feature "UUID_SSTABLE_IDENTIFIERS". Not wired yet. Please note: because we are still using sstableloader and sstabledump based on the 3.x branch, while Cassandra upstream introduced the uuid sstable identifier in its 4.x branch, these tools fail to work with sstables that have a uuid identifier, so this option is disabled when performing these tests. We will enable it once these tools are updated to support the uuid-based sstable identifiers. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kefu Chai	15543464ce	sstables, replica: support UUID in generation_type This change generalizes the value of generation_type so it also supports UUID-based identifiers. * sstables/generation_type.h: - add formatter and parser for UUID. Please note, Cassandra uses a different format for formatting the SSTable identifier, and this formatter suits our needs as it uses underscore "_" as the delimiter, while the file name of components uses dash "-" as the delimiter. Instead of reinventing the formatting or just using another delimiter in the stringified UUID, we choose to use Cassandra's formatting. - add accessors for accessing the type and value of generation_type - add constructor for constructing generation_type with UUID and string. - use hash for placing sstables with uuid identifiers into shards for a more uniform distribution of tables in shards. * replica/table.cc: - only update the generator if the given generation contains an integer * test/boost: - add a simple test to verify the generation_type is able to parse and format Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kamil Braun	0e36377f56	db: consistency_level: remove overload of `filter_for_query` Not used anymore after the previous commit.	2023-06-14 11:41:36 +02:00
Pavel Emelyanov	c68c154fb6	code: Reduce tracing/*.hh fanout There are some headers that include tracing/*.hh ones even though all they need is a forward-declared trace_state_ptr Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14155	2023-06-07 19:19:22 +03:00
Nadav Har'El	5984db047d	Merge 'mv: forbid IS NOT NULL on columns outside the primary key' from Jan Ciołek statement_restrictions: forbid IS NOT NULL on columns outside the primary key IS NOT NULL is currently allowed only when creating materialized views. It's used to convey that the view will not include any rows that would make the view's primary key columns NULL. Generally materialized views allow to place restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected, it's just silently ignored (see #10365). Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns. It would be a big effort, and there are some reasons why we don't support them. For now let's forbid such restrictions, it's better to fail than be wrong silently. Throwing a hard error would be a breaking change. To avoid breaking existing code the reaction to an invalid IS NOT NULL restriction is controlled by the `strict_is_not_null_in_views` flag. This flag can have the following values: * `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown. * `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored. * `false` - allow invalid `IS NOT NULL` restrictions, without any warnings or errors. The invalid restrictions are silently ignored. The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml. This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view. New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`. 
New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning. This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters. Fixes: #10365 Closes #13013 * github.com:scylladb/scylladb: boost/restriction_test: test the strict_is_not_null_in_views flag docs/cql/mv: columns outside of view's primary key can't be restricted cql-pytest: enable test_is_not_null_forbidden_in_filter statement_restrictions: forbid IS NOT NULL on columns outside the primary key schema_altering_statement: return warnings from prepare_schema_mutations() db/config: add strict_is_not_null_in_views config option statement_restrictions: add get_not_null_columns() test: remove invalid IS NOT NULL restrictions from tests	2023-06-07 12:12:19 +03:00
Jan Ciolek	c67d65987e	db/config: add strict_is_not_null_in_views config option IS NOT NULL shouldn't be allowed on columns which are outside of the materialized view's primary key. It's currently allowed to create views with such restrictions, but they're silently ignored, which is a bug. In the following commits restricting regular columns with IS NOT NULL will be forbidden. This is a breaking change. Some users might have existing code that creates views with such restrictions, and we don't want to break it. To deal with this a new feature flag is introduced: strict_is_not_null_in_views. By default it's set to `warn`. If a user tries to create a view with such invalid restrictions they will get a warning saying that this is invalid, but the query will still go through, it's just a warning. The default value in scylla.yaml will be `true`. This way new clusters will have strict enforcement enabled and they'll throw errors when the user tries to create such an invalid view. Old clusters without the flag present in scylla.yaml will have the flag set to `warn`, so they won't break on an update. There's also the option to set the flag to `false`. It's dangerous, as it silences information about a bug, but someone might want it to silence the warnings for a moment. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2023-06-07 01:48:39 +02:00
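The three values of `strict_is_not_null_in_views` described above can be modelled with a small Python sketch. The function `check_is_not_null`, the exception `InvalidRequest`, and the column names are hypothetical and only mirror the behaviour stated in the commit message; they are not ScyllaDB's implementation.

```python
import warnings


class InvalidRequest(Exception):
    """Hypothetical error raised when strict checking rejects a view."""


def check_is_not_null(column: str, view_primary_key: set, mode: str) -> None:
    """Apply the flag semantics to one `column IS NOT NULL` restriction."""
    if column in view_primary_key:
        return  # restriction on a primary key column is always valid
    if mode == "true":
        raise InvalidRequest(f"IS NOT NULL on non-primary-key column {column!r}")
    if mode == "warn":
        warnings.warn(f"IS NOT NULL on {column!r} is ignored")
    # mode == "false": the invalid restriction is silently ignored


pk = {"id", "ts"}
check_is_not_null("id", pk, "true")      # fine: part of the view's primary key
check_is_not_null("value", pk, "false")  # ignored without any message
```

In this model only the `true` mode stops view creation, which matches the split the commit describes between fresh clusters (`true` from scylla.yaml) and upgraded ones (`warn` from `db::config`).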
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's subclasses to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy: it needs to use the task queues list for IO class names and shares, but to detect whether it should, it checks that the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Aarav Arora	a12d2d5f16	fix: keyspace spell Closes #14121	2023-06-04 13:48:43 +03:00
Konstantin Osipov	b39ca97919	consistent_cluster_management: make the default As per our roll out plan, make consistent_cluster_management (aka Raft for schema changes) the default going forward. It means all clusters which upgrade from the previous version and don't have `consistent_cluster_management` explicitly set in scylla.yaml will begin upgrading to Raft once all nodes in the cluster have moved to the new version. Fixes #13980 Closes #13984	2023-06-02 09:05:09 +02:00
Kamil Braun	8be69fc3a0	Merge 'Initialize group0 server on boot before allowing incoming requests' from Gleb The series includes mostly cleanups and one bug fix. The fix is for the race where messages that need to access group0 server are arriving before the server is initialized. * 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev: service: raft: fix typo service: raft: split off setup_group0_if_exist from setup_group0 storage_service: do not allow override_decommission flag if consistent cluster management is enabled storage_service: fix indentation after the previous patch storage_service: co-routinize storage_service::join_cluster() function storage_service: do not reload topology from peers table if topology over raft is enabled storage_service: optimize debug logging code in case debug log is not enabled	2023-06-01 17:37:58 +02:00
Gleb Natapov	acc035b504	storage_service: do not allow override_decommission flag if consistent cluster management is enabled If consistent cluster management is enabled, it is not possible to restart a decommissioned node, since it will not be part of group0.	2023-05-31 10:40:42 +03:00
Avi Kivity	ffce6d94fc	Merge 'service: storage_proxy: make hint write handlers cancellable' from Kamil Braun The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve an issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with a 5 minute timeout. See #3966, #4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to time out each handler when we shut down or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook`, which introduces a small memory overhead to each write handler (2 pointers) that was previously present only for view update handlers. But those handlers are already quite large, so the overhead is small compared to their size. Use this new functionality to also cancel hint write handlers when we shut down. This fixes #8079. Closes #14047 * github.com:scylladb/scylladb: test: reproducer for hints manager shutdown hang test: pylib: ScyllaCluster: generalize config type for `server_add` test: pylib: scylla_cluster: add explicit timeout for graceful server stop service: storage_proxy: make hint write handlers cancellable service: storage_proxy: rename `view_update_handlers_list` service: storage_proxy: make it possible to cancel all write handler types	2023-05-30 01:36:50 +03:00
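The cancellation scheme described in the merge above (cancellable handlers join a list; shutdown or a node-down notification walks the list and times each handler out) can be sketched roughly as below. The class names `WriteHandler` and `HandlerRegistry` and their methods are hypothetical and exist only for this example; the real code uses an intrusive list inside `abstract_write_response_handler`, which a plain Python list only approximates.

```python
class WriteHandler:
    """Hypothetical stand-in for a write response handler."""

    def __init__(self, name, target, cancellable=False):
        self.name = name
        self.target = target            # destination node of the write
        self.cancellable = cancellable  # decided when the handler is created
        self.timed_out = False

    def timeout(self):
        # Stop waiting for a response and fail the write immediately.
        self.timed_out = True


class HandlerRegistry:
    """Tracks cancellable handlers so they can be timed out in bulk."""

    def __init__(self):
        self._cancellable = []

    def register(self, handler):
        # Only handlers created as cancellable join the list.
        if handler.cancellable:
            self._cancellable.append(handler)
        return handler

    def cancel_all(self, predicate=lambda handler: True):
        """Time out every tracked handler matching `predicate`."""
        kept, cancelled = [], []
        for handler in self._cancellable:
            if predicate(handler):
                handler.timeout()
                cancelled.append(handler)
            else:
                kept.append(handler)
        self._cancellable = kept
        return cancelled
```

Calling `cancel_all()` with no argument models shutdown, while `cancel_all(lambda h: h.target == node)` models the case where one destination node is reported down.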
Kamil Braun	beabb61566	test: reproducer for hints manager shutdown hang	2023-05-29 11:03:39 +02:00
Kamil Braun	0ef35ceed4	service: storage_proxy: make hint write handlers cancellable Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint`, which is called by the hints manager. This will cause hint write handlers to time out immediately when we shut down or when a destination node is marked as dead. Fixes #8079	2023-05-29 11:03:18 +02:00
Pavel Emelyanov	5861d15912	Merge 'Small gossiper and migration_manager cleanups' from Gleb Some assorted cleanups here: consolidation of schema agreement waiting into a single place and removing unused code from the gossiper. CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/ Reviewed-by: Konstantin Osipov <kostja@scylladb.com> * gleb/gossiper-cleanups of github.com:scylladb/scylla-dev: storage_service: avoid unneeded copies in on_change storage_service: remove check that is always true storage_service: rename handle_state_removing to handle_state_removed storage_service: avoid string copy storage_service: delete code that handled REMOVING_TOKENS state gossiper: remove code related to advertising REMOVING_TOKEN state migration_manager: add wait_for_schema_agreement() function	2023-05-27 10:49:54 +03:00
Gleb Natapov	a429018a8a	migration_manager: add wait_for_schema_agreement() function Several subsystems re-implement the same logic for waiting for schema agreement. Provide the function in the migration_manager and use it instead.	2023-05-25 14:44:53 +03:00
Tomasz Grabiec	51e3b9321b	Merge 'mvcc: make schema upgrades gentle' from Michał Chojnowski After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls. This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery. Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version. After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade. The series: 1. Does some code cleanup in the mutation_partition area. 2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry). 3. Adds upgrading variants of constructors and apply() for `row` and its wrappers. 4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains. 5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version. 
Fixes #2577 Closes #13761 * github.com:scylladb/scylladb: test: mvcc_test: add a test for gentle schema upgrades partition_version: make partition_entry::upgrade() gentle partition_version: handle multi-schema snapshots in merge_partition_versions mutation_partition_v2: handle schema upgrades in apply_monotonically() partition_version: remove the unused "from" argument in partition_entry::upgrade() row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades partition_version: handle multi-schema entries in partition_entry::squashed partition_snapshot_row_cursor: handle multi-schema snapshots partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots partition_version: add a logalloc::region argument to partition_entry::upgrade() memtable: propagate the region to memtable_entry::upgrade_schema() mutation_partition: add an upgrading variant of lazy_row::apply() mutation_partition: add an upgrading variant of rows_entry::rows_entry mutation_partition: switch an apply() call to apply_monotonically() mutation_partition: add an upgrading variant of rows_entry::apply_monotonically() mutation_fragment: add an upgrading variant of clustering_row::apply() mutation_partition: add an upgrading variant of row::row partition_version: remove _schema from partition_entry::operator<< partition_version: remove the schema argument from partition_entry::read() memtable: remove _schema from memtable_entry row_cache: remove _schema from cache_entry partition_version: remove the _schema field from partition_snapshot partition_version: add a _schema field to partition_version mutation_partition: change schema_ptr to schema& in mutation_partition::difference mutation_partition: change schema_ptr to schema& in mutation_partition constructor mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor mutation_partition: add 
upgrading variants of row::apply() partition_version: update the comment to apply_to_incomplete() mutation_partition_v2: clean up variants of apply() mutation_partition: remove apply_weak() mutation_partition_v2: remove a misleading comment in apply_monotonically() row_cache_test: add schema changes to test_concurrent_reads_and_eviction mutation_partition: fix mixed-schema apply()	2023-05-24 22:58:43 +02:00
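The idea behind the gentle upgrade in the series above (each version carries its own schema, an upgrade only prepends an empty version, and rows are converted when read) can be illustrated with a toy model. Everything here is an assumption made for the example: the names `Schema`, `PartitionVersion`, `PartitionEntry` and `upgrade_row` are hypothetical, rows are plain dictionaries, and a read returns the newest version's copy of a row rather than merging versions as real MVCC does. The background version merge is not modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Schema:
    version: int
    columns: tuple


@dataclass
class PartitionVersion:
    schema: Schema
    rows: dict = field(default_factory=dict)


def upgrade_row(row, to_schema):
    """Convert one row: keep known columns, fill new ones with None."""
    return {column: row.get(column) for column in to_schema.columns}


class PartitionEntry:
    def __init__(self, schema):
        # Version chain, newest first; each version has its own schema.
        self.versions = [PartitionVersion(schema)]

    @property
    def schema(self):
        return self.versions[0].schema

    def write(self, key, row):
        self.versions[0].rows[key] = upgrade_row(row, self.schema)

    def upgrade(self, new_schema):
        # Constant-time: no existing row is touched, only an empty
        # version with the new schema is added at the head of the chain.
        self.versions.insert(0, PartitionVersion(new_schema))

    def read(self, key):
        # Rows stored under an older schema are upgraded on the fly.
        for version in self.versions:
            if key in version.rows:
                row = version.rows[key]
                if version.schema == self.schema:
                    return row
                return upgrade_row(row, self.schema)
        return None
```

The point of the sketch is the cost profile: `upgrade()` does no per-row work, so the conversion cost is spread over later reads instead of being paid in one non-preemptible call.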
Kefu Chai	b0c40a2a03	db: config: s/ingore/ignore/ This string is used as the option description in the command-line help message, so it is part of the user-facing interface. This change fixes the typo. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14013	2023-05-24 13:35:24 +03:00
Pavel Emelyanov	5aea6938ae	commitlog: Introduce and use commitlog sched group Nowadays all commitlog code runs in whatever sched group it's kicked from. Since IO prio classes are going to be inherited from the current sched group, the commitlog IO loops should be moved into the commitlog sched group, not inherit a "random" one. There are currently two places that need correct context for IO -- the .cycle() method and segments replenisher. `$ perf-simple-query --write -c2` results --- Before the patch --- 194898.36 tps ( 56.3 allocs/op, 12.7 tasks/op, 54307 insns/op, 0 errors) 199286.23 tps ( 56.2 allocs/op, 12.7 tasks/op, 54375 insns/op, 0 errors) 199815.84 tps ( 56.2 allocs/op, 12.7 tasks/op, 54377 insns/op, 0 errors) 198260.98 tps ( 56.3 allocs/op, 12.7 tasks/op, 54380 insns/op, 0 errors) 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median absolute deviation: 713.36 maximum: 199815.84 minimum: 194898.36 --- After the patch --- 194751.80 tps ( 56.3 allocs/op, 12.7 tasks/op, 54331 insns/op, 0 errors) 199084.70 tps ( 56.2 allocs/op, 12.7 tasks/op, 54389 insns/op, 0 errors) 195551.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54385 insns/op, 0 errors) 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) 198710.00 tps ( 56.3 allocs/op, 12.7 tasks/op, 54387 insns/op, 0 errors) median 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) median absolute deviation: 1131.24 maximum: 199084.70 minimum: 194751.80 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14005	2023-05-23 21:25:57 +03:00
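The summary statistics quoted in the commit above can be recomputed from the five throughput samples of each run. The sketch below uses only the tps figures printed in the message; the helper name `median_abs_deviation` is an assumption for this example and is not claimed to be how perf-simple-query computes its summary. Because the printed samples are rounded to two decimals, the recomputed deviation may differ from the reported one in the last digit.

```python
from statistics import median


def median_abs_deviation(samples):
    """Median of the absolute distances of each sample from the median."""
    centre = median(samples)
    return median(abs(x - centre) for x in samples)


# Throughput samples (tps) copied from the commit message.
before = [194898.36, 199286.23, 199815.84, 198260.98, 198572.86]
after = [194751.80, 199084.70, 195551.47, 197953.47, 198710.00]

for label, samples in (("before", before), ("after", after)):
    print(label,
          "median:", median(samples),
          "mad:", round(median_abs_deviation(samples), 2),
          "max:", max(samples),
          "min:", min(samples))

# Shift of the median between the two runs, for comparison with the spread.
shift = median(before) - median(after)
print("median shift:", round(shift, 2))
```

Comparing the median shift with the median absolute deviation of either run gives a rough sense of whether the difference between the two runs stands out from run-to-run noise.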

3141 Commits