scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 22:25:48 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	41e69836fd	db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	b754433ac1	migration_notifier: Introduce before_drop_keyspace() Tablet allocator will need to inject mutations on keyspace drop.	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	5b046043ea	migration_manager: Make prepare_keyspace_drop_announcement() return a future<> It will be extended with listener notification firing, which is an async operation.	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	4b4238b069	test: perf: Introduce perf-tablets Example output: $ build/release/scylla perf-tablets --tables 10 --tablets-per-table $((8*1024)) --rf 3 testlog - Total tablet count: 81920 testlog - Size of tablet_metadata in memory: 7683 KiB testlog - Copied in 2.163421 [ms] testlog - Cleared in 0.767507 [ms] testlog - Saved in 774.813232 [ms] testlog - Read in 246.666885 [ms] testlog - Read mutations in 211.677292 [ms] testlog - Size of canonical mutations: 20.633621 [MiB] testlog - Disk space used by system.tablets: 0.902344 [MiB]	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	70a35f70a6	test: Introduce tablets_test	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	b4ac329367	test: lib: Do not override table id in create_table() It is already set by schema_maker. In tablets_test we will depend on the id being the same as that set in the schema_builder, so don't change it to something else.	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	5a24984147	utils, tablets: Introduce external_memory_usage()	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	f3fbfdaa37	db: tablets: Add printers Example: TRACE 2023-03-30 12:06:33,918 [shard 0] tablets - Read tablet metadata: { 8cd5b560-cee2-11ed-9cd5-7f37187f2167: { [0]: last_token=-6917529027641081857, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0}, [1]: last_token=-4611686018427387905, replicas={3160b965-1925-4677-884b-c761e2bf4272:0}, [2]: last_token=-2305843009213693953, replicas={3160b965-1925-4677-884b-c761e2bf4272:0}, [3]: last_token=-1, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0}, [4]: last_token=2305843009213693951, replicas={3160b965-1925-4677-884b-c761e2bf4272:0}, [5]: last_token=4611686018427387903, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0}, [6]: last_token=6917529027641081855, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0}, [7]: last_token=9223372036854775807, replicas={3160b965-1925-4677-884b-c761e2bf4272:0} } }	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	9d786c1ebc	db: tablets: Add persistence layer	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	fa8ad9a585	dht: Use last_token_of_compaction_group() in split_token_range_msb()	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	fceb5f8cf6	locator: Introduce tablet_metadata token_metadata now stores tablet metadata with information about tablets in the system.	2023-04-24 10:49:37 +02:00
Tomasz Grabiec	241f7febec	dht: Introduce first_token()	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	462e3ffd36	dht: Introduce next_token()	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	27acf3b129	storage_proxy: Improve trace-level logging	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	34a9c62ae5	locator: token_metadata: Fix confusing comment on ring_range() It could be interpreted to mean that the search token is excluded.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	e4865bd4d1	dht, storage_proxy: Abstract token space splitting Currently, scans are splitting partition ranges around tokens. This will have to change with tablets, where we should split at tablet boundaries. This patch introduces token_range_splitter which abstracts this task. It is provided by effective_replication_map implementation.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	b769c4ee55	Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" This reverts commit `95bf8eebe0`. Later patches will adapt this code to work with token_range_splitter, and the unit test added by the reverted commit will start to fail. The unit test asks the query_ranges_to_vnodes_generator to split the range: [t:end, t+1:start) around token t, and expects the generator to produce an empty range [t:end, t:end] After adapting this code to token_range_splitter, the input range will not be split because it is recognized as adjacent to t:end, and the optimization logic will not kick in. Rather than adding more logic to handle this case, I think it's better to drop the optimization, as it is not very useful (rarely happens) and not required for correctness.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	94e1c7b859	db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() This allows update_pending_ranges(), invoked on keyspace creation, to succeed in the presence of keyspaces with per-table replication strategy. It will update only vnode-based erms, which is intended behavior, since only those need pending ranges updated. This change will also make node operations like bootstrap, repair, etc. to work (not fail) in the presence of keyspaces with per-table erms, they will just not be replicated using those algorithms. Before, these would fail inside get_effective_replication_map(), which is forbidden for keyspaces with per-table replication.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	dc04da15ec	db: Introduce get_non_local_vnode_based_strategy_keyspaces() It's meant to be used in places where currently get_non_local_strategy_keyspaces() is used, but work only with keyspaces which use vnode-based replication strategy.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	8fcb320e71	service: storage_proxy: Avoid copying keyspace name in write handler	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	9b17ad3771	locator: Introduce per-table replication strategy Will be used by tablet-based replication strategies, for which effective replication map is different per table. Also, this patch adapts existing users of effective replication map to use the per-table effective replication map. For simplicity, every table has an effective replication map, even if the erm is per keyspace. This way the client code can be uniform and doesn't have to check whether replication strategy is per table. Not all users of per-keyspace get_effective_replication_map() are adapted yet to work per-table. Those algorithms will throw an exception when invoked on a keyspace which uses per-table replication strategy.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	5d9bcb45de	treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	bb297d86a0	locator: Introduce effective_replication_map With tablet-based replication strategies it will represent replication of a single table. Current vnode_effective_replication_map can be adapted to this interface. This will allow algorithms like those in storage_proxy to work with both kinds of replication strategies over a single abstraction.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	d3c9ad4ed6	locator: Rename effective_replication_map to vnode_effective_replication_map In preparation for introducing a more abstract effective_replication_map which can describe replication maps which are not based on vnodes.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	1343bfa708	locator: effective_replication_map: Abstract get_pending_endpoints()	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	7b01fe8742	db: Propagate feature_service to abstract_replication_strategy::validate_options() Some replication strategy options may be feature-dependent.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	9781d3ffc5	db: config: Introduce experimental "TABLETS" feature	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	a892e144cc	db: Log replication strategy for debugging purposes	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	7543c75b62	db: Log full exception on error in do_parse_schema_tables()	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	c923bdd222	db: keyspace: Remove non-const replication strategy getter Keyspace will store replication_ptr, which is a const pointer. No user needs a mutable reference.	2023-04-24 10:49:36 +02:00
Tomasz Grabiec	bf2ce8ff75	config: Reformat	2023-04-24 10:49:36 +02:00
Chang Chen Chien	c25a718008	docs: fix typo in using-scylla/local-secondary-indexes.rst Closes #13607	2023-04-24 06:56:19 +03:00
Pavel Emelyanov	5e201b9120	database: Remove compaction_manager.hh inclusion into database.hh The only reason why it's there (right next to compaction_fwd.hh) is because the database::table_truncate_state subclass needs the definition of compaction_manager::compaction_reenabler subclass. However, the former sub is not used outside of database.cc and can be defined in .cc. Keeping it outside of the header allows dropping the compaction_manager.hh from database.hh thus greatly reducing its fanout over the code (from ~180 indirect inclusions down to ~20). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13622	2023-04-23 16:27:11 +03:00
Tomasz Grabiec	bd0b299322	Merge 'Manage CDC generations when bootstrapping nodes using Raft Group 0 topology coordinator' from Kamil Braun Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`). The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The difference is that V2 lives in `system_distributed_keyspace` and writes to it are distributed using regular `storage_proxy` replication mechanisms based on the token ring. The V3 table lives in `system_keyspace` and any mutations written to it will go through group 0. Extend the `TOPOLOGY` schema with new columns: - `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping node's `ring_slice`, it stores UUID of a newly introduced CDC generation which is used as partition key for the `CDC_GENERATIONS_V3` table to access this new generation's data. It's a regular column, meaning that every row (corresponding to a node) will have its own. - `current_cdc_generation_uuid` and `current_cdc_generation_timestamp` together form the ID of the newest CDC generation in the cluster. (the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is when the CDC generation starts operating). Those are static columns since there's a single newest CDC generation. When topology coordinator handles a request for a node to join, calculate a new CDC generation using the bootstrapping node's tokens, translate it to mutation format, and insert this mutation to the CDC_GENERATIONS_V3 table through group 0 at the same time we assign tokens to the node in Raft topology. The partition key for this data is stored in the bootstrapping node's `ring_slice`. After inserting new CDC generation data, we need to pick a timestamp for this generation and commit it, telling all nodes in the cluster to start using the generation for CDC log writes once their clocks cross that timestamp. We introduce a separate step to the bootstrap saga, before `write_both_read_old`, called `commit_cdc_generation`.
In this step, the coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping node's `ring_slice` - which serves as the key to the table where the CDC generation data is stored - and combines it with a timestamp which it generates a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by default 1 minute). This gives us a CDC generation ID which we commit into the topology state as the `current_cdc_generation_id` while switching the saga to the next step, `write_both_read_old`. Once a new CDC generation is committed to the cluster by the topology coordinator, we also need to publish it to the user-facing description tables so CDC applications know which streams to read from. This uses regular distributed table writes underneath (tables living in the `system_distributed` keyspace) so it requires `token_metadata` to be nonempty. We need a hack for the case of bootstrapping the first node in the cluster - turning the tokens into normal tokens earlier in the procedure in `token_metadata`, but this is fine for the single-node case since no streaming is happening. When a node notices that a new CDC generation was introduced in `storage_service::topology_state_load`, it updates its internal data structures that are used when coordinating writes to CDC log tables. We include the current CDC generation data in topology snapshot transfers. Some fixes and refactors included. 
Closes #13385 * github.com:scylladb/scylladb: docs: cdc: describe generation changes using group 0 topology coordinator cdc: generation_service: add a FIXME cdc: generation_service: add legacy_ prefix for gossiper-based functions storage_service: include current CDC generation data in topology snapshots db: system_keyspace: introduce `query_mutations` with range/slice storage_service: hold group 0 apply mutex when reading topology snapshot service: raft_group0_client: introduce `hold_read_apply_mutex` storage_service: use CDC generations introduced by Raft topology raft topology: publish new CDC generation to the user description tables raft topology: commit a new CDC generation on node bootstrap raft topology: create new CDC generation data during node bootstrap service: topology_state_machine: make topology::find const db: system_keyspace: small refactor of `load_topology_state` cdc: generation: extract pure parts of `make_new_generation` outside db: system_keyspace: add storage for CDC generations managed by group 0 service: topology_state_machine: better error checking for state name (de)serialization service: raft: plumbing `cdc::generation_service&` cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter cdc: generation: make `topology_description_generator::get_sharding_info` a parameter sys_dist_ks: make `get_cdc_generation_mutations` public sys_dist_ks: move find_schema outside `get_cdc_generation_mutations` sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations` service/raft: group0_state_machine: signal topology state machine in `load_snapshot`	2023-04-21 18:11:27 +02:00
Anna Stuchlik	a68b976c91	doc: document `tombstone_gc` as not experimental The tombstone_gc was documented as experimental in version 5.0. It is no longer experimental in version 5.2. This commit updates the information about the option. Closes #13469	2023-04-21 14:43:25 +02:00
Botond Dénes	fcd7f6ac5f	Update tools/java submodule * tools/java c9be8583...eb3c43f8 (1): > Use EstimatedHistogram in metricPercentilesAsArray	2023-04-21 14:31:38 +03:00
Kefu Chai	a2aa133822	treewide: use std::lexicographical_compare_three_way The standard library offers `std::lexicographical_compare_three_way()`, and we never use the last two additional parameters, which `std::lexicographical_compare_three_way()` does not provide. There is no need to keep the homebrew version of the trichotomic compare function. In this change, * all occurrences of `lexicographical_tri_compare()` are replaced with `std::lexicographical_compare_three_way()`. * `lexicographical_tri_compare()` is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13615	2023-04-21 14:28:18 +03:00
Kefu Chai	51fc0bc698	sstables: use default generated operator== A C++20 compiler is able to generate defaulted operator== and operator!=, and the default generated operators behave exactly the same as the ones crafted by us, so let it do its job. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13614	2023-04-21 14:25:39 +03:00
Botond Dénes	10c1f1dc80	Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun In `make_group0_history_state_id_mutation`, when adding a new entry to the group 0 history table, if the parameter `gc_older_than` is engaged, we create a range tombstone in the mutation which deletes entries older than the new one by `gc_older_than`. In particular if `gc_older_than = 0`, we want to delete all older entries. There was a subtle bug there: we were using millisecond resolution when generating the tombstone, while the provided state IDs used microsecond resolution. On a super fast machine it could happen that we managed to perform two schema changes in a single millisecond; this happened sometimes in `group0_test.test_group0_history_clearing_old_entries` on our new CI/promotion machines, causing the test to fail because the tombstone didn't clear the entry corresponding to the previous schema change when performing the next schema change (since they happened in the same millisecond). Use microsecond resolution to fix that. The consecutive state IDs used in group 0 mutations are guaranteed to be strictly monotonic at microsecond resolution (see `generate_group0_state_id` in service/raft/raft_group0_client.cc). Fixes #13594 Closes #13604 * github.com:scylladb/scylladb: db: system_keyspace: use microsecond resolution for group0_history range tombstone utils: UUID_gen: accept decimicroseconds in min_time_UUID	2023-04-21 14:08:56 +03:00
Kamil Braun	55f43e532c	Merge 'get rid of gms/failure_detector' from Benny Halevy Move gms::arrival_window to api/failure_detector, which is its only user, and get rid of the rest, which is not used now that we use direct_failure_detector instead. TODO: integrate direct_failure_detector with failure_detector api. Closes #13576 * github.com:scylladb/scylladb: gms: get rid of unused failure_detector api: failure_detector: remove false dependency on failure_detector::arrival_window test: rest_api: add test_failure_detector	2023-04-21 11:47:44 +02:00
Kamil Braun	f7408130c9	Merge 'Fix topology management when raft-based topology is enabled' from Tomasz Grabiec Fixes a problem when raft-based topology is enabled, which loads topology from storage. It starts by clearing topology and then adding nodes one by one. Before this patch, this violated an internal invariant of the topology object, which puts the local node as the first node. This would manifest by triggering an assert in topology::pop_node() which throws if popping the node at index 0 in order to keep the information about local node around. This is normally prevented by a check in topology::remove_node() which avoids calling pop_node() if removing the local node. But since there is no node which is marked as local, this check allows the first node to be popped. To fix the problem I lift the invariant that local node is always in _nodes. We still have information about local node in config. Instead of keeping it in _nodes, we recognize it as part of indexing. We also allow removing the local node like a regular node. The path which reloads topology works correctly after this, the local node will be recognized when (if) it is added to the topology. Fixes #13495 Closes #13498 * github.com:scylladb/scylladb: locator: topology: Fix move assignment locator: topology: Add printer tests: topology: Test that topology clearing preserves information about local node locator: topology: Recognize local node as part of indexing it locator: topology: Fix get_location(ep) for local node locator: topology: Fix typo locator: topology: Preserve config when cloning	2023-04-21 11:45:08 +02:00
Alejo Sanchez	ce87aedd30	test: topology smp test with custom cluster Instead of decommission of initial cluster, use custom cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13589	2023-04-21 10:43:54 +02:00
Kamil Braun	f9d8118c8d	db: system_keyspace: use microsecond resolution for group0_history range tombstone In `make_group0_history_state_id_mutation`, when adding a new entry to the group 0 history table, if the parameter `gc_older_than` is engaged, we create a range tombstone in the mutation which deletes entries older than the new one by `gc_older_than`. In particular if `gc_older_than = 0`, we want to delete all older entries. There was a subtle bug there: we were using millisecond resolution when generating the tombstone, while the provided state IDs used microsecond resolution. On a super fast machine it could happen that we managed to perform two schema changes in a single millisecond; this happened sometimes in `group0_test.test_group0_history_clearing_old_entries` on our new CI/promotion machines, causing the test to fail because the tombstone didn't clear the entry corresponding to the previous schema change when performing the next schema change (since they happened in the same millisecond). Use microsecond resolution to fix that. The consecutive state IDs used in group 0 mutations are guaranteed to be strictly monotonic at microsecond resolution (see `generate_group0_state_id` in service/raft/raft_group0_client.cc). Fixes #13594	2023-04-21 10:33:05 +02:00
Kamil Braun	218a056825	utils: UUID_gen: accept decimicroseconds in min_time_UUID The function now accepts higher-resolution duration types, such as microsecond resolution timestamps. Will be used by the next commit.	2023-04-21 10:33:02 +02:00
Kefu Chai	ca6ebbd1f0	cql3, db: sstable: specialize fmt::formatter<function_name> This is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. The goal here is to enable fmtlib to print `function_name` without the help of `operator<<`. The corresponding `operator<<()` is dropped in this change, as all its callers now use fmtlib for formatting. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13608	2023-04-21 10:07:28 +03:00
Botond Dénes	d74f3598f4	Merge 'dht: specialize fmt::formatter<dht::token>' from Kefu Chai This is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. The goal here is to enable fmtlib to print `dht::token` without the help of `operator<<`. The corresponding `operator<<()` is preserved in this change, as it has lots of users in this project; we will tackle them case-by-case in follow-up changes. Also, the forward declaration of `operator<<(ostream&, const dht::token&)` in `dht/i_partitioner.hh` is removed, as it is not necessary. Refs https://github.com/scylladb/scylladb/issues/13245 Closes #13610 * github.com:scylladb/scylladb: dht: remove unnecessarily forward declaration dht: specialize fmt::formatter<dht::token>	2023-04-21 09:51:25 +03:00
Kefu Chai	c5fa1ac9f7	sstable: specialize fmt::formatter<component_type> This is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. The goal here is to enable fmtlib to print `component_type` without the help of `operator<<`. The corresponding `operator<<()` is dropped in this change, as all its callers now use fmtlib for formatting. Also, please note that to enable fmtlib to format `std::set<component_type>` in `test/boost/sstable_3_x_test.cc`, we need to include `<fmt/ranges.h>` in that source file. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13598	2023-04-21 09:49:24 +03:00
Kefu Chai	9215adee46	streaming: specialize fmt::formatter<stream_reason> This is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. The goal here is to enable fmtlib to print `stream_reason` without the help of `operator<<`. Please note that because we still cannot use the generic formatter for std::unordered_map provided by fmtlib, `fmt::join()` is used as a temporary solution in order to drop `operator<<` for `stream_reason` and to print `unordered_map<stream_reason>`. We will audit all `fmt::join()` calls after removing the homebrew formatter of `std::unordered_map`. The corresponding `operator<<()` is dropped in this change, as all its callers now use fmtlib for formatting. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13609	2023-04-21 09:44:23 +03:00
Kefu Chai	ecb5380638	treewide: s/boost::lexical_cast<std::string>/fmt::to_string()/ This change replaces all occurrences of `boost::lexical_cast<std::string>` in the source tree with `fmt::to_string()`, for a couple of reasons: * `boost::lexical_cast<std::string>` is longer than `fmt::to_string()`, so the latter is easier to parse and read. * `boost::lexical_cast<std::string>` creates a stringstream under the hood, so it can use the `operator<<` to stringify the given object, but stringstream is known to be less performant than fmtlib. * we are migrating to fmtlib based formatting, see #13245, so using `fmt::to_string()` helps us to remove yet another dependency on `operator<<`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13611	2023-04-21 09:43:53 +03:00
Benny Halevy	3f1ac846d8	gms: get rid of unused failure_detector The legacy failure_detector is now unused and can be removed. TODO: integrate direct_failure_detector with failure_detector api. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-21 09:08:27 +03:00
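
The resolution mismatch described in commit f9d8118c8d (a range tombstone computed at millisecond resolution while state IDs use microsecond resolution) can be illustrated with a minimal sketch. This is not the ScyllaDB code: `state_id`, `tombstone_bound_ms`, `tombstone_bound_us`, and `is_deleted` are hypothetical names introduced here, and state IDs are modeled as plain `std::chrono::microseconds` values rather than timeuuids.

```cpp
#include <chrono>

// Assumption: a state ID is modeled as a microsecond timestamp.
using state_id = std::chrono::microseconds;

// Buggy variant: the tombstone bound is truncated to millisecond resolution,
// so any entry written earlier in the same millisecond falls above the bound.
inline std::chrono::microseconds tombstone_bound_ms(state_id new_id,
                                                    std::chrono::microseconds gc_older_than) {
    auto truncated = std::chrono::duration_cast<std::chrono::milliseconds>(new_id - gc_older_than);
    return std::chrono::duration_cast<std::chrono::microseconds>(truncated);
}

// Fixed variant: the bound keeps the full microsecond resolution of the state ID.
inline std::chrono::microseconds tombstone_bound_us(state_id new_id,
                                                    std::chrono::microseconds gc_older_than) {
    return new_id - gc_older_than;
}

// Hypothetical deletion rule: the tombstone covers entries strictly older than the bound.
inline bool is_deleted(state_id entry, std::chrono::microseconds bound) {
    return entry < bound;
}
```

Take a previous entry at 1000200 us, a new entry at 1000700 us (the same millisecond), and `gc_older_than = 0`. The millisecond variant truncates the bound to 1000000 us and leaves the previous entry in place, while the microsecond variant puts the bound at 1000700 us and covers it.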


36346 Commits