scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Botond Dénes	3a51053e66	Merge 'De-static system_keyspace::_group0_ methods' from Pavel Emelyanov These are users of global `qctx` variable or call `(get\|set)_scylla_local_param(_as)?` which, in turn, also reference the `qctx`. Unfortunately, the latter(s) are still in use by other code and cannot be marked non-static in this PR Closes #14869 * github.com:scylladb/scylladb: system_keyspace: De-static set_raft_group0_id() system_keyspace: De-static get_raft_group0_id() system_keyspace: De-static get_last_group0_state_id() system_keyspace: De-static group0_history_contains() raft: Add system_keyspace argument to raft_group0::join_group0()	2023-07-28 14:53:22 +03:00
Pavel Emelyanov	d311784721	system_keyspace: De-static set_raft_group0_id() The caller is group0 code with sys_ks local variable Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-28 13:13:59 +03:00
Pavel Emelyanov	7837bc7d5a	system_keyspace: De-static get_raft_group0_id() The callers are in group0 code that have sys_ks local variable/argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-28 13:13:11 +03:00
Pavel Emelyanov	26dd7985a8	system_keyspace: De-static get_last_group0_state_id() The caller is raft_group0_client with sys.ks. dependency reference and group0_state_machine with raft_group0_client exposing its sys.ks. This makes it possible to instantly drop one more qctx reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-28 13:12:04 +03:00
Pavel Emelyanov	3de0efd32c	system_keyspace: De-static group0_history_contains() The caller is raft_group0_client with sys.ks. dependency reference. This allows dropping one qctx reference right away Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-28 13:11:08 +03:00
Avi Kivity	cf81eef370	Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after the schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs the schema_ptr of the exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d` (5.2.0), it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in c2ba94dc39e4add9db213751295fb17b95e6b962. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485. Manually tested using ccm on cluster upgrade scenarios and node restarts. Closes #14441 * github.com:scylladb/scylladb: test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled schema_mutations, migration_manager: Ignore empty partitions in per-table digest migration_manager, schema_tables: Implement migration_manager::reload_schema() schema_tables: Avoid crashing when table selector has only one kind of tables	2023-07-28 00:01:33 +03:00
Pavel Emelyanov	e9218e6873	system_keyspace: Don't update schema version in .setup() The db.get_version() called that early returns the value the database got at construction time, i.e. the empty_version thing. It makes little sense committing it into the system k.s., all the more so since the "real" version is calculated and updated a few steps after .setup(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14833	2023-07-27 09:38:57 +03:00
Pavel Emelyanov	c017117340	system_keyspace: Remove qctx usage from load_topology_state() Fortunately, this is pretty simple -- the only caller is storage_service that has sharded<system_keyspace> dependency reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14824	2023-07-27 08:56:40 +03:00
Avi Kivity	615544a09a	Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov This is to make m.s. initialization more solid and simplify sys.ks.::setup() Closes #14832 * github.com:scylladb/scylladb: system_keyspace: Remove unused snitch arg from setup() messaging_service: Setup preferred IPs from config	2023-07-26 22:12:28 +03:00
Nadav Har'El	056d04954c	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of them can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage()	2023-07-26 20:04:28 +03:00
Pavel Emelyanov	6b82071064	system_keyspace: Remove unused snitch arg from setup() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-26 16:05:26 +03:00
Pavel Emelyanov	0fba57a3e8	messaging_service: Setup preferred IPs from config Population of the messaging service preferred IPs cache happens inside the system keyspace setup() call and it needs m.s. per se and additionally snitch. Moving preferred ip cache to initial configuration keeps m.s. start more self-contained and keeps system_keyspace::setup() simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-26 16:03:23 +03:00
Botond Dénes	d66b07823b	db/view/view_updating_consumer: account for the size of mutations All partitions will have a corresponding mutation object in the buffer. These objects have non-negligible sizes, yet the consumer did not bump the _buffer_size when a new partition was consumed. This resulted in empty partitions not moving the _buffer_size at all, and thus they could accumulate without bounds in the buffer, never triggering a flush just by themselves. We have recently seen this causing OOM. This patch fixes that by bumping the _buffer_size with the size of the freshly created mutation object.	2023-07-26 03:07:25 -04:00
Botond Dénes	ad2ddffb22	Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from - proxy::remote::handle_truncate() - schema_tables::merge_schema() - legacy_schema_migrator - tests All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of global qctx Closes #14778 * github.com:scylladb/scylladb: system_keyspace: Make save_truncation_record() non-static code: Pass sharded<db::system_keyspace>& to database::truncate() db: Add sharded<system_keyspace>& to legacy_schema_migrator	2023-07-26 08:48:49 +03:00
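The accounting fix described in this commit can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_buffering_consumer`, its `partition_overhead` parameter and its flush policy are hypothetical names invented for illustration, not the actual `view_updating_consumer` code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model of a buffering consumer (not Scylla's class).
// Every new partition adds a fixed overhead to the accounted buffer size,
// so even empty partitions eventually trigger a flush.
class toy_buffering_consumer {
    std::size_t _buffer_size = 0;
    std::size_t _limit;
    std::size_t _partition_overhead;
    std::size_t _flushes = 0;

    void maybe_flush() {
        if (_buffer_size >= _limit) {
            ++_flushes;        // stands in for handing the buffer to view building
            _buffer_size = 0;
        }
    }
public:
    toy_buffering_consumer(std::size_t limit, std::size_t partition_overhead)
        : _limit(limit), _partition_overhead(partition_overhead) {
        assert(limit > 0);
    }
    // Partition start: account for the freshly created mutation object.
    void consume_new_partition() {
        _buffer_size += _partition_overhead;
        maybe_flush();
    }
    // Any other fragment: account for its size.
    void consume_fragment(std::size_t bytes) {
        _buffer_size += bytes;
        maybe_flush();
    }
    std::size_t flushes() const { return _flushes; }
    std::size_t buffer_size() const { return _buffer_size; }
};
```

With `partition_overhead` set to 0 the model reproduces the old behavior: any number of empty partitions can be consumed without a flush ever happening.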
Kamil Braun	e6099c4685	Merge 'config: set schema_commitlog_segment_size_in_mb to 128' from Patryk Jędrzejczak Fixes #14668 In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`. Additionally, we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668. Closes #14704 * github.com:scylladb/scylladb: replica: do not derive the commitlog sync period for schema commitlog config: set schema_commitlog_segment_size_in_mb to 128 config: add schema_commitlog_segment_size_in_mb variable	2023-07-24 10:23:34 +02:00
Pavel Emelyanov	db1c6e2255	system_keyspace: Make save_truncation_record() non-static ... and stop using qctx Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-21 13:12:50 +03:00
Pavel Emelyanov	eaeffcdb81	code: Pass sharded<db::system_keyspace>& to database::truncate() The argument goes via the db::(drop\|truncate)_table_on_all_shards() pair of calls that start from - storage_proxy::remote: has its sys.ks reference already - schema_tables::merge_schema: has sys.ks argument already - legacy_schema_migrator: the reference was added by previous patch - tests: run in cql_test_env with sys.ks on board Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-21 13:11:59 +03:00
Pavel Emelyanov	1ef34a5ada	db: Add sharded<system_keyspace>& to legacy_schema_migrator One of the class' methods calls db::drop_table_on_all_shards() that will need sys.ks. in the next patch. The reference in question is provided from the only caller -- main.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-21 12:38:46 +03:00
Botond Dénes	53da97416a	Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov The "fix" is straightforward -- callers of system_keyspace::paxos methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference. Closes #14758 * github.com:scylladb/scylladb: db/system_keyspace: Move and use qctx::execute_cql_with_timeout() db/system_keyspace: Make paxos methods non-static service/paxos: Add db::system_keyspace& argument to some methods test: Optionally initialize proxy remote for cql_test_env proxy/remote: Keep sharded<db::system_keyspace>& dependency	2023-07-20 16:53:25 +03:00
Pavel Emelyanov	8a87c87824	db/system_keyspace: Move and use qctx::execute_cql_with_timeout() This template call is only used by system keyspace paxos methods. All those methods are no longer static and can use system_keyspace::_qp reference to real query processor instead of global qctx. The execute_cql_with_timeout() wrapper is moved to system_keyspace to make it work Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-19 19:32:10 +03:00
Pavel Emelyanov	b9ef16c06f	db/system_keyspace: Make paxos methods non-static The service::paxos_state methods that call those already have system keyspace reference at hand and can call method on an object Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-19 19:32:10 +03:00
Patryk Jędrzejczak	b3be9617dc	config: set schema_commitlog_segment_size_in_mb to 128 We increase the default schema commitlog segment size so that the large mutations do not fail. We have agreed that 128 MB is sufficient.	2023-07-19 14:16:49 +02:00
Patryk Jędrzejczak	5b167a4ad7	config: add schema_commitlog_segment_size_in_mb variable In #14668, we have decided to introduce a new scylla.yaml variable for the schema commitlog segment size. The segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. Therefore, increasing the schema commitlog segment size is sometimes necessary.	2023-07-19 14:16:41 +02:00
Kefu Chai	8f390997cb	db: do not use std::cmp_not_equal() when appropriate this change is a follow-up of `3129ae3c8c`. since in both cases in this change, the `num_ranges` should always be greater than zero, there is no need to use `int` for its type, and "num_ranges" returned by the CQL query should always be greater or equal to zero, so there is no need to check if it is positive. in this change, we * change the type of `num_ranges` to `size_t` * change std::cmp_not_equal() to != to avoid using the verbose `std::cmp_not_equal()` helper, for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14754	2023-07-19 13:25:21 +03:00
Asias He	c29e7e4644	Revert "Revert "view_update_generator: Increase the registration_queue_size"" This reverts commit `4cee8206f8`. The test is fixed. Closes #14750	2023-07-19 11:46:28 +03:00
Kamil Braun	eb6202ef9c	Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak Fixes #9405 `sync_point` API provided with incorrect sync point id might allocate a crazy amount of memory and fail with `std::bad_alloc`. To fix this, we can check if the encoded sync point has been modified before decoding. We can achieve this by calculating a checksum before encoding, appending it to the encoded sync point, and comparing it with a checksum calculated in `db::hints::decode` before decoding. Closes #14534 * github.com:scylladb/scylladb: db: hints: add checksum to sync point encoding db: hints: add the version_size constant	2023-07-18 13:05:10 +02:00
Botond Dénes	21ff6efd74	test/boost/view_build_test: improve test_view_update_generator_register_semaphore_unit_leak By making it independent of the number of units the view update generator's registration semaphore is created with. We want to increase this number significantly and that would destabilize this test significantly. To prevent this, detach the test from the number of units completely, while still preserving the original intent behind it, as best as it could be determined. Closes #14727	2023-07-18 09:18:28 +03:00
Kefu Chai	fa3129fa29	treewide: use unsigned variable to compare with unsigned sometimes we initialize a loop variable like auto i = 0; or int i = 0; but since the type of `0` is `int`, what we get is a variable of `int` type, but later we compare it with an unsigned number. if we compile the source code with the `-Werror=sign-compare` option, the compiler would warn at seeing this. in general, this is a false alarm, as we are not likely to have a wrong comparison result here. but in order to prevent issues due to the integer promotion for comparison in other places, and to prepare for enabling `-Werror=sign-compare`, let's use unsigned to silence this warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-18 10:27:18 +08:00
Kefu Chai	3129ae3c8c	treewide: compare signed and unsigned using std::cmp_*() when comparing signed and unsigned numbers, the compiler promotes the signed number to a common type -- in this case, the unsigned type, so they can be compared. but sometimes, it matters, and after the promotion, the comparison yields the wrong result. this can be manifested using a short sample like: ``` int main(int argc, char** argv) { int x = -1; unsigned y = 2; fmt::print("{}\n", x < y); return 0; } ``` this error can be identified by `-Werror=sign-compare`, but before enabling this compile option, let's use `std::cmp_*()` to compare them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-18 10:27:18 +08:00
Patryk Jędrzejczak	02618831ef	db: hints: add checksum to sync point encoding sync point API provided with incorrect sync point id might allocate a crazy amount of memory and fail with std::bad_alloc. To fix this, we can check if the encoded sync point has been modified before decoding. We can achieve this by calculating a checksum before encoding, appending it to the encoded sync point, and comparing it with a checksum calculated in db::hints::decode before decoding.	2023-07-17 16:05:07 +02:00
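The pitfall in the sample above, and what the C++20 `std::cmp_less()` helper does about it, can be sketched as follows. The function names `builtin_less` and `safe_less` are made up for this illustration; the fallback branch is only there so the sketch also compiles where the C++20 helpers are unavailable.

```cpp
#include <cassert>
#include <utility>

// What `x < y` effectively does for int vs unsigned: x is converted to
// unsigned first, so -1 turns into a very large value.
inline bool builtin_less(int x, unsigned y) {
    return static_cast<unsigned>(x) < y;
}

// Compares the mathematical values instead.
inline bool safe_less(int x, unsigned y) {
#if defined(__cpp_lib_integer_comparison_functions)
    return std::cmp_less(x, y);                     // C++20 <utility>
#else
    return x < 0 || static_cast<unsigned>(x) < y;   // hand-written equivalent
#endif
}

inline void example_cmp() {
    assert(!builtin_less(-1, 2u)); // surprising: -1 is "not less" than 2
    assert(safe_less(-1, 2u));     // expected result
}
```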
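The approach described here (checksum the payload, append it, verify before decoding) might look roughly like the sketch below. It is not the actual `db::hints` code: the commit does not say which checksum is used, so FNV-1a is swapped in purely as a stand-in, and `encode_with_checksum`, `decode_checked` and the 4-byte little-endian layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Stand-in checksum (32-bit FNV-1a); the real implementation may differ.
inline std::uint32_t fnv1a(const std::string& data) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

constexpr std::size_t checksum_size = 4;

// Append the checksum of the payload, little-endian, to the payload.
inline std::string encode_with_checksum(const std::string& payload) {
    std::string out = payload;
    const std::uint32_t sum = fnv1a(payload);
    for (std::size_t i = 0; i < checksum_size; ++i) {
        out.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
    }
    return out;
}

// Verify the trailing checksum before touching the payload; a mismatch
// is reported instead of decoding possibly corrupted data.
inline std::optional<std::string> decode_checked(const std::string& encoded) {
    if (encoded.size() < checksum_size) {
        return std::nullopt;
    }
    const std::string payload = encoded.substr(0, encoded.size() - checksum_size);
    std::uint32_t stored = 0;
    for (std::size_t i = 0; i < checksum_size; ++i) {
        const auto byte = static_cast<unsigned char>(encoded[payload.size() + i]);
        stored |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    if (stored != fnv1a(payload)) {
        return std::nullopt;
    }
    return payload;
}

inline void example_checksum() {
    std::string enc = encode_with_checksum("sync-point");
    assert(decode_checked(enc).has_value());
    enc[0] ^= 1; // simulate a modified sync point id
    assert(!decode_checked(enc).has_value());
}
```

Rejecting the input up front means a modified id never reaches the code that sizes allocations from decoded fields.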
Patryk Jędrzejczak	0a424e1760	db: hints: add the version_size constant The next commit changes the format of encoding sync points to V2. The new format appends the checksum to the encoded sync points and its implementation uses the checksum_size constant - the number of bytes required to store the checksum. To increase consistency and readability, we can additionally add and use the version_size constant. Definitions of sync_point::decode and sync_point::encode are slightly changed so that they don't depend on the version_size value and make implementation of the V2 format easier.	2023-07-17 16:02:18 +02:00
Kefu Chai	3ed982df87	query_context: do not include unused header in this header, none of the exceptions defined by `exceptions/exceptions.hh` is used. so let's drop the `#include`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14718	2023-07-17 12:00:49 +03:00
Raphael S. Carvalho	d6029a195e	Remove DateTieredCompactionStrategy This is the last step of deprecation dance of DTCS. In Scylla 5.1, users were warned that DTCS was deprecated. In 5.2, altering or creation of tables with DTCS was forbidden. 5.3 branch was already created, so this is targeting 5.4. Users that refused to move away from DTCS will have Scylla falling back to the default strategy, either STCS or ICS. See: WARN 2023-07-14 09:49:11,857 [shard 0] schema_tables - Falling back to size-tiered compaction strategy after the problem: Unable to find compaction strategy class 'DateTieredCompactionStrategy Then user can later switch to a supported strategy with alter table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14559	2023-07-14 16:20:48 +03:00
Asias He	dad5caf141	streaming: Add stream_plan_ranges_percentage This option allows the user to change the number of ranges to stream in batch per stream plan. Currently, each stream plan streams 10% of the total ranges. With more ranges per stream plan, it reduces the waiting time between two stream plans. For example, stream_plan1: shard0 (t0), shard1 (t1) stream_plan2: shard0 (t2), shard1 (t3) We start stream_plan2 after all shards finish streaming in stream_plan1. If shard0 and shard1 in stream_plan1 finish at different times, one of the shards will be idle. If we stream more ranges in a single stream plan, the waiting time will be reduced. Previously, we retried the stream plan if one of the stream plans failed. That's one of the reasons we wanted more stream plans. With RBNO and `1f8b529e08` (range_streamer: Disable restream logic), the restream factor is not important anymore. Also, more ranges in a single stream plan will create bigger but fewer sstables on the receiver side. The default value is the same as before: 10% of total ranges. Fixes #14191 Closes #14402	2023-07-14 09:03:01 +03:00
Botond Dénes	4cee8206f8	Revert "view_update_generator: Increase the registration_queue_size" This reverts commit `d3034e0fab`. The test modified by this commit (view_build_test.test_view_update_generator_register_semaphore_unit_leak) often fails, breaking build jobs.	2023-07-13 16:48:50 +03:00
Asias He	d3034e0fab	view_update_generator: Increase the registration_queue_size When repair writes a sstable to disk, we check if the sstable needs view update processing. If yes, the sstable will be placed into the staging dir for processing, with the _registration_sem semaphore to prevent too many pending unprocessed sstables. We have seen multiple cases in the field where view update processing is inefficient and way too slow, which prevents the base table repair from finishing on time. This patch increases the registration_queue_size to a bigger number to mitigate the problem that slow view update processing blocks repair. It is better to have a consistent base table + inconsistent view table than inconsistent base table + inconsistent view table. Currently, sstables in staging dir are not compacted. So we cannot increase the _registration_sem by too big a number, to avoid accumulating too many sstables. The view_build_test.cc is updated to make the test pass. Closes #14241	2023-07-12 15:51:35 +03:00
Botond Dénes	296837120d	db: move virtual tables into virtual_tables.cc The definitions of virtual tables make up approximately a quarter of the huge system_keyspace.cc file (almost 4K lines), pulling in a lot of headers only used by them. Move them to a separate source file to make system_keyspace.cc easier for humans and compilers to digest. This patch also moves the `register_virtual_tables()`, `install_virtual_readers()` as well as the `virtual_tables` global. Closes #14308	2023-07-12 15:26:54 +03:00
Avi Kivity	0cabf4eeb9	build: disable implicit fallthrough Prevent switch case statements from falling through without annotation ([[fallthrough]]) proving that this was intended. Existing intended cases were annotated. Closes #14607	2023-07-10 19:36:06 +02:00
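For illustration, an intended fallthrough annotated the way this commit describes could look like the sketch below. The function and its flag values are invented for the example. GCC and Clang both provide a `-Wimplicit-fallthrough` warning that flags unannotated cases; which exact flag the build enables is not stated in the commit message.

```cpp
#include <cassert>

// Hypothetical example: higher levels accumulate the flags of lower ones.
// Each intended fallthrough carries the [[fallthrough]] attribute (C++17),
// so a compiler warning about implicit fallthrough stays quiet here.
inline int flags_for_level(int level) {
    int flags = 0;
    switch (level) {
    case 2:
        flags |= 4;
        [[fallthrough]];
    case 1:
        flags |= 2;
        [[fallthrough]];
    case 0:
        flags |= 1;
        break;
    default:
        flags = -1; // unknown level
        break;
    }
    return flags;
}

inline void example_fallthrough() {
    assert(flags_for_level(2) == 7);
    assert(flags_for_level(0) == 1);
}
```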
Gleb Natapov	4f23eec44f	Rename experimental raft feature to consistent-topology-changes Make the name more descriptive Fixes #14145 Message-Id: <ZKQ2wR3qiVqJpZOW@scylladb.com>	2023-07-07 11:08:10 +02:00
Nadav Har'El	d6aba8232b	alternator: configurable override for DescribeEndpoints The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554) where even if a user specifies a specific endpoint URL, the SDK uses DescribeEndpoints to try to "refresh" the endpoint. The problem is that DescribeEndpoints can't return a scheme (http or https) and the SDK arbitrarily picks https - making it unable to communicate with Alternator over http. As an example, the new "dynamodb shell" (written in C++) cannot communicate with Alternator running over http. This patch adds a configuration option, "alternator_describe_endpoints", which can be used to override what DescribeEndpoints does: 1. Empty string (the default) leaves the current behavior - DescribeEndpoints echoes the request's "Host" header. 2. The string "disabled" disables the DescribeEndpoints (it will return an UnknownOperationException). This is how DynamoDB Local behaves, and the AWS C++ SDK and the Dynamodb Shell work well in this mode. 3. Any other string is a fixed string to be returned by DescribeEndpoints. It can be useful in setups that should return a known address. Note that this patch does not, by default, change the current behavior of DescribeEndpoints. But it allows us in the future to override its behavior if a user experiences problems in the field - without code changes. Fixes #14410. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14432	2023-07-07 11:08:10 +02:00
Tomasz Grabiec	c25201c1a3	Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes https://github.com/scylladb/scylladb/issues/14503 Closes #14502 * github.com:scylladb/scylladb: test: view_build_test: add range tombstones to test_view_update_generator_buffering test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations view_updating_consumer: make buffer limit a variable view: fix range tombstone handling on flushes in view_updating_consumer	2023-07-05 21:21:43 +02:00
Michał Chojnowski	ac29b6f198	view_updating_consumer: make buffer limit a variable The limit doesn't change at runtime, but this patch makes it a variable for unit testing purposes.	2023-07-05 17:33:47 +02:00
Michał Chojnowski	5ad0846bff	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-04 20:33:21 +02:00
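The three modes of "alternator_describe_endpoints" listed in this commit can be summarized as a small decision function. This is a sketch of the described behavior only; `describe_endpoints_address` is a hypothetical name, not a function from Alternator, and an empty result here stands for the UnknownOperationException case.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper mirroring the three documented modes:
//   ""         -> echo the request's "Host" header (default behavior)
//   "disabled" -> operation is unknown (modeled as std::nullopt)
//   otherwise  -> return the configured string as-is
inline std::optional<std::string> describe_endpoints_address(
        const std::string& config_value, const std::string& host_header) {
    if (config_value.empty()) {
        return host_header;
    }
    if (config_value == "disabled") {
        return std::nullopt;
    }
    return config_value;
}

inline void example_describe_endpoints() {
    assert(describe_endpoints_address("", "node1:8000").value() == "node1:8000");
    assert(!describe_endpoints_address("disabled", "node1:8000").has_value());
}
```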
Tomasz Grabiec	f2ed9fcd7e	schema_mutations, migration_manager: Ignore empty partitions in per-table digest Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after the schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs the schema_ptr of the exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d`, it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in 18f484cc753d17d1e3658bcb5c73ed8f319d32e8. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485.	2023-07-03 23:06:55 +02:00
Tomasz Grabiec	0c86abab4d	migration_manager, schema_tables: Implement migration_manager::reload_schema() Will recreate schema_ptr's from schema tables like during table alter. Will be needed when digest calculation changes in reaction to cluster feature at run time.	2023-07-03 20:32:59 +02:00
Tomasz Grabiec	9bfe9f0b2f	schema_tables: Avoid crashing when table selector has only one kind of tables Currently not reachable, because selectors are always constructed with both kinds initialized. Will be triggered by the next patch.	2023-07-03 20:32:59 +02:00
Pavel Emelyanov	0d4c981423	database: Remove unused proxy arg from update_keyspace_on_all_shards() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-03 14:19:54 +03:00
Tomasz Grabiec	a9282103ba	Merge 'Call storage_service notifications only after keyspace schema changes are applied on all shards' from Benny Halevy This series aims at hardening schema merges and preventing inconsistencies across shards by updating the database shards before calling the notification callback. As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux. In addition, any error to update the keyspace will cause abort so as not to leave the database shards in an inconsistent state. Other changes optimize this path by: - updating shard 0 first, to seed the effective_replication_map. - executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change. Fixes #13137 Closes #14158 * github.com:scylladb/scylladb: migration_manager: propagate listener notification exceptions storage_service: keyspace_changed: execute only on shard 0 database: modify_keyspace_on_all_shards: execute func first on shard 0 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards database: add modify_keyspace_on_all_shards schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace database: create_keyspace_on_all_shards database: update_keyspace_on_all_shards database: drop_keyspace_on_all_shards	2023-06-29 12:17:53 +02:00
Avi Kivity	f86dd857ca	Merge 'Certificate based authorization' from Calle Wilund Fixes #10099 Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex. Example: scylla.yaml: ``` authenticator: com.scylladb.auth.CertificateAuthenticator auth_superuser_name: <name> auth_certificate_role_query: CN=([^,\s]+) client_encryption_options: enabled: True certificate: <server cert> keyfile: <server key> truststore: <shared trust> require_client_auth: True ``` In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for cqlsh set "usercert" and "userkey" to these certificate files. No user/password needs to be sent, but role will be picked up from auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables) the authenticator mechanism will reject it. Otherwise, connection becomes the role described. To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup. Closes #12214 * github.com:scylladb/scylladb: docs: Add documentation of certificate auth + auth_superuser_name auth: Add TLS certificate authenticator transport: Try to do early, transport based auth if possible auth: Allow for early (certificate/transport) authentication auth: Allow specifying initial superuser name + passwd (salted) in config roles-metadata: Coroutinuze some helpers	2023-06-27 12:52:14 +03:00
Botond Dénes	f5e3b8df6d	Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: ``` + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert ``` That shows a significant amount of work spent inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as they have to be replicated to almost every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a fresh sstable set when recreating the reader, but rather supplying a predicate to the sstable set that filters out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach, which may incorrectly consider a staging sstable as non-staging if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster, from `INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s` to `INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s` Refs https://github.com/scylladb/scylladb/issues/14089. Fixes scylladb/scylladb#14244. Closes #14364 * github.com:scylladb/scylladb: table: Optimize creation of reader excluding staging for view building view_update_generator: Dump throughput and duration for view update from staging utils: Extract pretty printers into a header	2023-06-27 07:25:30 +03:00
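The idea of filtering staging sstables with a predicate, rather than rebuilding a separate set for each reader, can be illustrated with a small sketch. All names here (`sstable_stub`, `sstable_predicate`, `select_for_reader`) are hypothetical and greatly simplified; the real sstable set is an interval map over token ranges, not a flat vector.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable: only the fields the sketch needs.
struct sstable_stub {
    std::string name;
    bool in_staging;
};

using sstable_predicate = std::function<bool(const sstable_stub&)>;

// Selects the sstables a reader should consume by applying `pred` to the
// existing set in place, instead of copying the non-staging sstables into
// a freshly built set for every partition.
std::vector<const sstable_stub*> select_for_reader(
        const std::vector<sstable_stub>& set,
        const sstable_predicate& pred) {
    std::vector<const sstable_stub*> selected;
    for (const auto& sst : set) {
        if (pred(sst)) {
            selected.push_back(&sst);
        }
    }
    return selected;
}

// Predicate that excludes staging sstables.
inline bool not_staging(const sstable_stub& sst) {
    return !sst.in_staging;
}
```

Calling `select_for_reader(set, not_staging)` leaves the shared set untouched, so the cost of populating it is paid once rather than once per partition.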

3193 Commits