scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Asias He	a19917eb91	gossiper: Drop replacement_quarantine It is not used any more after "gossiper: Drop unused replaced_endpoint". Refs #5482	2020-07-06 11:27:55 +03:00
Asias He	2bc73ad290	gossiper: Drop unused replaced_endpoint It is not used any more after `75cf1d18b5` (storage_service: Unify handling of replaced node removal from gossip) in the "Make replacing node take writes" series. Refs #5482	2020-07-06 11:27:55 +03:00
Rafael Ávila de Espíndola	64c8164e6c	everywhere: Update to seastar api v4 (when_all_succeed returning a tuple) We now just need to replace a few calls to then with then_unpack. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200618172100.111147-1-espindola@scylladb.com>	2020-06-23 19:40:18 +03:00
Asias He	dddde33512	gossip: Do not send shutdown message when a node is in unknown status When a replacing node is in early boot up and is not in HIBERNATE state yet, if the node is killed by a user, the node will wrongly send a shutdown message to other nodes. This is because UNKNOWN is not in SILENT_SHUTDOWN_STATES, so in gossiper::do_stop_gossiping, the node will send shutdown message. Other nodes in the cluster will call storage_service::handle_state_normal for this node, since NORMAL and SHUTDOWN status share the same status handler. As a result, other nodes will incorrectly think the node is part of the cluster and the replace operation is finished. Such a problem was seen in replace_node_no_hibernate_state_test dtest: n1, n2 are in the cluster n2 is dead n3 is started to replace n2, but n3 is killed in the middle n3 announces SHUTDOWN status wrongly n1 runs storage_service::handle_state_normal for n3 n1 gets tokens for n3 which is empty, because n3 hasn't gossiped tokens yet n1 skips update normal tokens for n3, but thinks n3 has replaced n2 n4 starts to replace n2 n4 checks the tokens for n2 in storage_service::join_token_ring (Cannot replace token {} which does not exist!) or storage_service::prepare_replacement_info (Cannot replace_address {} because it doesn't exist in gossip) To fix, we add UNKNOWN into SILENT_SHUTDOWN_STATES and avoid sending shutdown message. Tests: replace_address_test.py:TestReplaceAddress.replace_node_no_hibernate_state_test Fixes: #6436	2020-06-08 11:32:23 +02:00
Pavel Emelyanov	ee31191e21	storage_service: Move get_generation_number to util/ This is purely utility helper routine. As a nice side effect the inclusion of storage_service.hh is removed from several unrelated places. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-01 09:08:40 +03:00
Pavel Emelyanov	ccdee822e1	storage_service: Get rid of one-line helpers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 14:17:31 +03:00
Pavel Emelyanov	d53d2bb664	features: Introduce and use masked features Nowadays the knowledge about known/supported features is scattered between feature_service and storage_service. The latter uses knowledge about the selected _sstables_format to alter the "supported" set. Encapsulate this knowledge inside the feature_service with the help of "masked_features" -- those, that shouldn't be advertised to other nodes. The only maskable feature today is UNBOUNDED_RANGE_TOMBSTONES. Nowadays it's reported as supported only if the sstables format is MC. With this patch it starts as masked and gets unmasked when the sstables format is selected to be MC, so the change is correct. This will make it possible to move sstables_format from storage service to anywhere else. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 13:21:07 +03:00
Pavel Emelyanov	bb3a71529a	features: Get rid of per-features booleans The set of bool enable_something-s on feature_config duplicates the disabled_features set on it, so remove the former and make full use of the latter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 13:09:12 +03:00
Piotr Jastrzebski	0475dab359	feature: add PER_TABLE_CACHING feature This feature will ensure that caching can be switched off per table only after the whole cluster supports it. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-05-05 08:14:49 +02:00
Avi Kivity	f3bcd4d205	Merge 'Support SSL Certificate Hot Reloading' from Calle " Fixes #6067 Makes the scylla endpoint initializations that support TLS use reloadable certificate stores, watching used cert + key files for changes, and reload iff modified. Tests in separate dtest set. " * elcallio-calle/reloadable-tls: transport: Use reloadable tls certificates redis: Use reloadable tls certificates alternator: Use reloadable tls certificates messaging_service: Use reloadable TLS certificates	2020-05-04 15:11:16 +03:00
Piotr Sarna	bec95a0605	treewide: use thread-safe variant of localtime In order to ensure thread-safety, all usages of localtime() are replaced with localtime_r(), which may accept a local buffer. Tests: unit(dev) Fixes #6364 Message-Id: <ad4a0c0e1707f0318325718715a3a647e3ebfdfe.1588592156.git.sarna@scylladb.com>	2020-05-04 14:46:08 +03:00
Calle Wilund	08d069f78d	messaging_service: Use reloadable TLS certificates Changes messaging service rpc to use reloadable tls certificates iff tls is enabled. Note that this means that the service cannot start listening at construction time if TLS is active, and the user needs to call start_listen_ex to initialize and actually start the service. Since "normal" messaging service is actually started from gms, this route too is made a continuation.	2020-05-04 11:32:21 +00:00
Piotr Sarna	be5d3f4733	Merge 'A bunch of refactors in versioned_value and gossiper' from Kamil 1. Remove the `versioned_value::factory` class, it didn't add any value. It just forced us to create an object for making `versioned_value`s, for no sensible reason. 2. Move some `versioned_value` deserialization code (string -> internal data structures) into the versioned_value module. Previously, it was scattered all around the place. 3. Make `gossiper::get_seeds` const and return a const reference. I needed these refactors for a PR I was preparing to fix an issue with CDC. The attempt of fixing the issue failed (I'm trying something different now), but the refactors might be useful anyway. * kbr--vv-refactor: gossiper: make `get_seeds` method const and return a const ref versioned_value: remove versioned_value::factory class gms: move TOKENS string deserialization code into versioned_value	2020-04-28 10:27:45 +02:00
Rafael Ávila de Espíndola	d8555513a9	gms: Don't keep references to reallocated vector entries These callbacks can block a seastar thread and the underlying vector can be reallocated concurrently. This is no different than if it was a plain std::vector and the solution is similar: use values instead of references. Fixes #6230 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200422182304.120906-1-espindola@scylladb.com>	2020-04-23 16:06:36 +03:00
Nadav Har'El	08c39bde1a	gossiper: add convenience function for getting number of nodes The gossiper has the convenience functions get_up_endpoint_count() and get_down_endpoint_count(), but strangely no function to get the total number. Even though it's easy to calculate the total by summing up their results, it is inefficient and also inconvenient because each of these functions returns a future. So let's add another function, get_all_endpoint_count(), to get the total number of nodes. We will use this function in the next patch. Signed-off-by: Nadav Har'El <n...@scylladb.com> Message-Id: <20200422182035.15106-1-nyh@scylladb.com>	2020-04-23 08:23:05 +02:00
Kamil Braun	d73a21057a	gossiper: make `get_seeds` method const and return a const ref	2020-04-20 12:57:16 +02:00
Kamil Braun	1f7290a0ff	versioned_value: remove versioned_value::factory class If there was a Most Useless Abstraction award, this would be a good candidate.	2020-04-20 12:57:16 +02:00
Kamil Braun	113384b6f8	gms: move TOKENS string deserialization code into versioned_value And do the same with CDC_STREAMS_TIMESTAMP. The code that took a list of tokens represented as a string inside versioned_value (for gossiping) and deserialized it into an `unordered_set<dht::token>` lived in the storage_service module, while the code that did the serializing (set -> string) lived in versioned_value. There was a similar situation with the CDC generation timestamp. To increase maintainability and reusability, the deserialization code is now placed next to the serialization code in versioned_value. Furthermore, the `make_full_token_string`, `make_token_string`, and `make_cdc_streams_timestamp_string` (serialization functions) are moved out of versioned_value::factory and made static methods of versioned_value instead.	2020-04-20 12:57:13 +02:00
Calle Wilund	a14a28cdf4	gms::inet_address: Fix sign extension error in custom address formatting Fixes #5808 Seems some gcc:s will generate the code as sign extending. Mine does not, but this should be more correct anyhow. Added small stringify test to serialization_test for inet_address	2020-04-12 17:48:44 +03:00
Nadav Har'El	c1a7a071ea	merge: Remove most inclusions of reactor.hh Merged patch series from Avi Kivity: This patchset removes most inclusions of reactor.hh, by switching to new namespace-scoped API:s instead of those using engine() as a way to get the reactor. With this, we are down to 12 translation units depending on reactor.hh, mostly for deprecated API:s like reactor::at_exit(). Avi Kivity (3): logalloc: use namespace-scope seastar::idle_cpu_handler and related rather than reactor scope test: sstable-utils: deinline do_make_keys() treewide: replace calls to engine().some_api() with some_api() configure.py \| 14 +++----- auth/common.hh \| 3 +- checked-file-impl.hh \| 4 +-- db/system_keyspace_view_types.hh \| 2 +- flat_mutation_reader.hh \| 1 + lister.hh \| 2 +- message/messaging_service.hh \| 2 +- redis/server.hh \| 2 +- sstables/compress.hh \| 2 +- sstables/integrity_checked_file_impl.hh \| 2 +- test/lib/sstable_utils.hh \| 35 ++++--------------- test/lib/test_services.hh \| 2 +- thrift/server.hh \| 2 +- transport/server.hh \| 2 +- utils/error_injection.hh \| 3 +- utils/joinpoint.hh \| 2 +- utils/loading_cache.hh \| 2 +- utils/logalloc.hh \| 6 ++-- utils/rate_limiter.hh \| 2 +- api/system.cc \| 1 + auth/default_authorizer.cc \| 2 +- auth/password_authenticator.cc \| 2 +- database.cc \| 1 + db/commitlog/commitlog.cc \| 4 +-- db/hints/resource_manager.cc \| 3 +- db/system_distributed_keyspace.cc \| 2 +- dht/i_partitioner.cc \| 2 +- gms/feature_service.cc \| 3 +- lister.cc \| 4 +-- locator/ec2_snitch.cc \| 3 +- locator/gce_snitch.cc \| 1 + main.cc \| 1 + reader_concurrency_semaphore.cc \| 2 +- redis/server.cc \| 4 +-- sstables/sstables.cc \| 11 +++--- table.cc \| 3 +- test/boost/commitlog_test.cc \| 2 +- test/boost/database_test.cc \| 2 +- test/boost/flush_queue_test.cc \| 2 +- test/boost/gossip_test.cc \| 2 +- .../gossiping_property_file_snitch_test.cc \| 1 + test/boost/loading_cache_test.cc \| 2 +- test/boost/sstable_3_x_test.cc \| 1 + 
test/boost/sstable_datafile_test.cc \| 1 + test/boost/sstable_test.cc \| 1 + test/lib/sstable_utils.cc \| 26 ++++++++++++++ test/manual/gossip.cc \| 2 +- test/manual/hint_test.cc \| 2 +- test/manual/sstable_scan_footprint_test.cc \| 2 +- test/perf/perf_mutation.cc \| 1 + test/perf/perf_row_cache_update.cc \| 1 + test/perf/perf_sstable.cc \| 1 + test/tools/cql_repl.cc \| 2 +- thrift/server.cc \| 2 +- transport/server.cc \| 4 +-- utils/config_file.cc \| 3 +- utils/file_lock.cc \| 2 +- utils/logalloc.cc \| 14 ++++---- utils/updateable_value.cc \| 2 +- 59 files changed, 119 insertions(+), 98 deletions(-)	2020-04-05 13:47:39 +03:00
Avi Kivity	88ade3110f	treewide: replace calls to engine().some_api() with some_api() This removes the need to include reactor.hh, a source of compile time bloat. In some places, the call is qualified with seastar:: in order to resolve ambiguities with a local name. Includes are adjusted to make everything compile. We end up having 14 translation units including reactor.hh, primarily for deprecated things like reactor::at_exit(). Ref #1	2020-04-05 12:46:04 +03:00
Tomasz Grabiec	df48b5ec9d	gossip: Fix a confusing parameter name Message-Id: <1585940635-1194-1-git-send-email-tgrabiec@scylladb.com>	2020-04-05 08:24:51 +02:00
Konstantin Osipov	9948f548a5	lwt: remove Paxos from experimental list Always enable lightweight transactions. Remove the check for the command line switch from the feature service, assuming LWT is always enabled. Remove the check for LWT from Alternator. Note that in order for the cluster to work with LWT, all nodes need to support it. Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in --experimental-features command line option, but do nothing with it. Changes in v2: * remove enable_lwt feature flag, it's always there Closes #6102 test: unit (dev, debug) Message-Id: <20200401071149.41921-1-kostja@scylladb.com>	2020-04-01 09:12:21 +02:00
Asias He	743b529c2b	gossip: Add an option to force gossip generation Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation number g1, g2, g3. n1, n2, n3 running scylla version with commit `0a52ecb6df` (gossip: Fix max generation drift measure) One year later, the user wants to upgrade n1, n2, n3 to a new version. When n3 does a rolling restart with the new version, n3 will use a generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's gossip update and mark n3 as down. Such unnecessary marking of node down can cause availability issues. For example: DC1: n1, n2 DC2: n3, n4 When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which causes the whole DC2 to be unavailable. To fix, we can start the node with a gossip generation within MAX_GENERATION_DIFFERENCE difference for the new node. Once all the nodes run the version with commit `0a52ecb6df`, the option is no longer needed. Fixes #5164	2020-03-27 12:15:21 +01:00
Rafael Ávila de Espíndola	c5795e8199	everywhere: Replace engine().cpu_id() with this_shard_id() This is a bit simpler and might allow removing a few includes of reactor.hh. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200326194656.74041-1-espindola@scylladb.com>	2020-03-27 11:40:03 +03:00
Nadav Har'El	a0f025f4ce	sstable: LA format is the default, so ignore "LA_SSTABLE" feature flag The previous patch made the LA format the default. We no longer need to choose between writing the older KA format or LA, so the LA_SSTABLE cluster feature has become unnecessary. Unfortunately, we cannot completely remove this feature: Since commit `4f3ce42163` we cannot remove cluster features because this node will refuse to join a cluster which already agreed on features that it lacks - thinking it is an old node trying to join a new cluster. So the LA_SSTABLE feature flag remains, and we continue to advertise that our node supports it. We just no longer care about what other nodes advertised for it, so we can remove a bit of code that cared. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200324232607.4215-3-nyh@scylladb.com>	2020-03-25 13:00:28 +01:00
Rafael Ávila de Espíndola	eca0ac5772	everywhere: Update for deprecated apply functions Now apply is only for tuples, for varargs use invoke. This depends on the seastar changes adding invoke. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200324163809.93648-1-espindola@scylladb.com>	2020-03-25 08:49:53 +02:00
Botond Dénes	e0284bb9ee	treewide: add missing headers and/or forward declarations	2020-03-23 09:29:45 +02:00
Rafael Ávila de Espíndola	9445608df6	gms: Add a default constructor to feature_config Also move it out of line while at it. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200316180321.45914-1-espindola@scylladb.com>	2020-03-20 13:34:26 +01:00
Asias He	cdcedf5eb9	gossip: Make is_safe_for_bootstrap more strict Consider 1. Start n1, n2 in the cluster 2. Stop n2 and delete all data for n2 3. Start n2 to replace itself with replace_address_first_boot: n2 4. Kill n2 before n2 finishes the replace operation 5. Remove replace_address_first_boot: n2 from scylla.yaml of n2 6. Delete all data for n2 7. Start n2 At step 7, n2 will be allowed to bootstrap as a new node, because the application state of n2 in the cluster is HIBERNATE which is not rejected in the check of is_safe_for_bootstrap. As a result, n2 will replace n2 with different tokens and a different host_id, as if the old n2 node was removed from the cluster silently. Fixes #5172	2020-03-17 17:37:16 +01:00
Avi Kivity	ee9df91a76	Merge "Allow setting partitioner per table" from Piotr " This PR makes it possible to enable the usage of different partitioner for each table. If no table-specific partitioner is set for a given table then a default partitioner is used. The PR is composed of the following parts: - Introduction of schema::get_partitioner that still returns dht::global_partitioner - Replacement of all the usage of dht::global_partitioner with schema::get_partitioner - Making it possible to set table-specific partitioner in a schema_builder - Remove all the places that were setting default partitioner except for main.cc (mostly tests) - Move default partitioner from i_partitioner to schema.cc and hide it from the rest of the codebase - Remove dht::global_partitioner After this PR there's no such thing as global partitioner at all. There is only a default partitioner but it still has to be accessed through schema::get_partitioner. There are some intermediate states in which i_partitioner is stored as shared_ptr in the schema but the final version keeps it by const&. The PR does not enable per table partitioner end-to-end. Just the internals of the single node are covered. I still have to deal with: - Making sure a table has the same partitioner on each node - Allowing user to set up a table-specific partitioner on table - Signal driver about what partitioner is used by a given table - Persist partitioner info for each table that does not use default partitioner. 
Fixes #5493 Tests: unit(dev, release, debug), dtest(byo) " * 'per_table_partitioner' of https://github.com/haaawk/scylla: schema: drop optional from _partitioner field make_multishard_combining_reader: stop taking partitioner split_range_to_single_shard: stop taking partitioner as argument tests: remove unused murmur3 includes partitioner: move default_partitioner to schema.cc partitioner: hide dht::default_partitioner schema: include partitioner name in scylla tables mutation schema: make it possible to set custom partitioner scylla_tables: add partitioner column schema_features: add PER_TABLE_PARTITIONERS feature features: add PER_TABLE_PARTITIONERS feature	2020-03-16 11:13:47 +02:00
Rafael Ávila de Espíndola	69874f4330	feature_service: Remove default constructor This makes sure that feature_config_from_db_config is used for both tests and main.cc. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200312153453.37282-2-espindola@scylladb.com>	2020-03-16 11:01:15 +02:00
Rafael Ávila de Espíndola	7c26eb61a3	feature_service: Initialize local variable The use of an uninitialized variable was not being noticed because this is only used by main.cc. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200312153453.37282-1-espindola@scylladb.com>	2020-03-16 11:01:15 +02:00
Asias He	7ac9e0f2a1	gossip: Print CDC_STREAMS_TIMESTAMP correctly I saw UNKNOWN application state in the logs: INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=CACHE_HITRATES, versioned_value=Value(,14) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=SCHEMA_TABLES_VERSION, versioned_value=Value(3,15) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=RPC_READY, versioned_value=Value(0,16) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(,17) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=SHARD_COUNT, versioned_value=Value(1,30) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=IGNOR_MSB_BITS, versioned_value=Value(12,31) INFO 2020-03-06 11:09:48,931 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=UNKNOWN, versioned_value=Value(1583371936128,20) It turned out it was CDC_STREAMS_TIMESTAMP. $ nodetool gossipinfo\|grep 1583371936128 X8:1583371936128 X8:1583371936128 Fixes #5992	2020-03-15 11:51:35 +01:00
Piotr Jastrzebski	782f2caf41	schema_features: add PER_TABLE_PARTITIONERS feature With per table partitioners, partitioner name will be a part of table schema. To allow rolling upgrade we need to perform special logic that hides new partitioner name schema column during the upgrade. This commit adds new schema feature that controls this logic. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-15 10:25:20 +01:00
Piotr Jastrzebski	90df9a44ce	features: add PER_TABLE_PARTITIONERS feature This new feature is required because we now allow setting partitioner per table. This will influence the digest of table schema so we must not include partitioner name into the digest unless we know that the whole cluster already supports per table partitioners. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-15 10:25:20 +01:00
Rafael Ávila de Espíndola	a1ca83b067	gms: Fix static initialization order problem In test_services.cc there is gms::feature_service test_feature_service; And the feature_service constructor has , _lwt_feature(*this, features::LWT) But features::LWT is a global sstring constructed in another file. Solve the problem by making the feature strings constexpr std::string_view. I found the issue while trying to benchmark the std::string switch. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Acked-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200309225749.36661-1-espindola@scylladb.com>	2020-03-13 12:37:22 +02:00
Pavel Emelyanov	0a10e9787e	features: Remove future-based when_enabled() This API is considered to be error-prone, all users of it are reworked, so let's drop it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-03-02 19:55:52 +03:00
Asias He	62774ff882	gossiper: Always use the new generation number User reported an issue that after a node restart, the restarted node is marked as DOWN by other nodes in the cluster while the node is up and running normally. Consider the following: - n1, n2, n3 in the cluster - n3 shuts itself down - n3 sends shutdown verb to n1 and n2 - n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to INT_MAX - n3 restarts - n3 sends gossip shadow rounds to n1 and n2, in storage_service::prepare_to_join, - n3 receives response from n1, in gossiper::handle_ack_msg, since _enabled = false and _in_shadow_round == false, n3 will apply the application state in fiber1, fiber 1 finishes faster than fiber 2, it sets _in_shadow_round = false - n3 receives response from n2, in gossiper::handle_ack_msg, since _enabled = false and _in_shadow_round == false, n3 will apply the application state in fiber2, fiber 2 yields - n3 finishes the shadow round and continues - n3 resets gossip endpoint_state_map with gossiper.reset_endpoint_state_map() - n3 resumes fiber 2, apply application state about n3 into endpoint_state_map, at this point endpoint_state_map contains information including n3 itself from n2. - n3 calls gossiper.start_gossiping(generation_number, app_states, ...) with new generation number generated correctly in storage_service::prepare_to_join, but in maybe_initialize_local_state(generation_nbr), it will not set new generation and heartbeat if the endpoint_state_map contains itself - n3 continues with the old generation and heartbeat learned in fiber 2 - n3 continues the gossip loop, in gossiper::run, hbs.update_heart_beat() the heartbeat is set to the number starting from 0. - n1 and n2 will not get update from n3 because they use the same generation number but n1 and n2 have a larger heartbeat version - n1 and n2 will mark n3 as down even if n3 is alive. To fix, always use the new generation number. Fixes: #5800 Backports: 3.0 3.1 3.2	2020-02-20 11:20:20 +01:00
Pavel Emelyanov	eb827c9f5d	gossiper: Keep needed for failure_detection values on board And drop the gossiper -> storage_service link Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-10 20:54:32 +03:00
Pavel Emelyanov	2f3490dc8d	gossiper: Use own token_metadata Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-10 20:54:32 +03:00
Avi Kivity	bed61b96a2	Merge "Move features from storage- into feature-service" from Pavel " There's a lot of code around that needs storage service purely to get the specific feature value (cluster_supports_<something> calls). This creates several circular dependencies, e.g. storage_service <-> migration_manager one and database <-> storage_service. Also features sit on storage_service, but register themselves on the feature_service and the former subscribes on them back which also looks strange. I propose to keep all the features on feature_service, this keeps the latter independent from other components, makes it possible to break one of the mentioned circular dependencies and heavily relax the other. Also the set helps us fighting the globals and, after it, the feature_service can be safely stopped at the very last moment. Tests: unit(dev), manual debug build start-stop " * 'br-features-to-service-5' of https://github.com/xemul/scylla: gossiper: Avoid string merge-split for nothing features: Stop on shutdown storage_service: Remove helpers storage_service: Prepare to switch from on-board feature helpers cql3: Check feature in .validate database: Use feature service storage_proxy: Use feature service migration_manager: Use feature service start: Pass needed feature as argument into migrate_truncation_records features: Unfriend storage_service features: Simplify feature registration features: Introduce known_feature_set features: Move disabled features set from storage_service features: Move schema_features helper features: Move all features from storage_service to feature_service storage_service: Use feature_config from _feature_service features: Add feature_config storage_service: Kill set_disabled_features gms: Move features stuff into own .cc file migration_manager: Move some fns into class	2020-02-09 19:22:07 +02:00
Piotr Jastrzebski	61d8308848	gossiper: stop calling global_partitioner() Obtain name of the default partitioner from config instead of a global. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-06 07:59:07 +01:00
Piotr Jastrzebski	03bdce2d68	partitioner: move to_sstring to token Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-05 09:31:32 +01:00
Nadav Har'El	3de09042bb	CDC topology change support Merged pull request https://github.com/scylladb/scylla/pull/5485 by Kamil Braun: This series introduces the notion of CDC generations: sets of CDC streams used by the cluster to choose partition keys for CDC log writes. Each CDC generation begins operating at a specific time point, called the generation's timestamp (cdc_streams_timestamp in the code). It continues being used by all nodes in the cluster to generate log writes until superseded by a new generation. Generations are chosen so that CDC log writes are colocated with their corresponding base table writes, i.e. their partition keys (which are CDC stream identifiers picked from the generation operating at time of making the write) fall into the same vnode and shard as the corresponding base table write partition keys. Currently this is probabilistic and not 100% of log writes will be colocated - this will change in future commits, after per-table partitioners are implemented. CDC generations are a global property of the cluster -- they don't depend on any particular table's configuration. Therefore the old "CDC stream description tables", which were specific to each CDC-enabled table, were removed and replaced by a new, global description table inside the system_distributed keyspace. A new generation is introduced and supersedes the previous one whenever we insert new tokens into the token ring, which breaks the colocation property of the previous generation. The new generation is chosen to account for the new tokens and restore colocation. This happens when a new node joins the cluster. The joining node is responsible for creating and informing other nodes about the new CDC generation. It does that by serializing it and inserting into an internal distributed table ("CDC topology description table"). If it fails the insert, it fails the joining process. 
It then announces the generation to other nodes through gossip using the generation's timestamp, which is the partition key of the inserted distributed table entry. Nodes that learn about the new generation through gossip attempt to retrieve it from the distributed table. This might fail - for example, if the node is partitioned away from all replicas that hold this generation's table entry. In that case the node might stop accepting writes, since it knows that it should send log entries to a new generation of streams, but it doesn't know what the generation is. The node will keep trying to retrieve the data in the background until it succeeds or sees that it is no longer necessary (e.g., because yet another generation superseded this one). So we give up some availability to achieve safety. However, this solution is not completely safe (might break consistency properties): if a node learns about a new generation too late (if gossip doesn't reach this node in time), the node might send writes to the wrong (old) generation. In the future we will introduce a transaction-based approach where we will always make sure that all nodes receive the new generation before any of them starts using it (and if it's impossible e.g. due to a network partition, we will fail the bootstrap attempt). In practice, if the admin makes sure that the cluster works correctly before bootstrapping a new node, and a network partition doesn't start in the few seconds window where a new generation is announced, everything will work as it should. After the learning node retrieves the generation, it inserts it into an in-memory data structure called "CDC metadata". This structure is then used when performing writes to the CDC log -- given the timestamp of the written mutation, the data structure will return the CDC generation operating at this time point. 
CDC metadata might reject the query for two reasons: if the timestamp belongs to an earlier generation, which most probably doesn't have the colocation property anymore, or if it is picked too far away into the future, where we don't know if the current generation won't be superseded by a different one (so we don't yet know the set of streams that this log write should be sent to). If the client uses server-generated timestamps, the query will never be rejected. Clients can also use client-generated timestamps, but they must make sure that their clocks are not too desynchronized with the database -- otherwise some or all of their writes to CDC-enabled tables will be rejected. In the case of rolling upgrade, where we restart nodes that were previously running without CDC, we act a bit differently - there is no naturally selected joining node which must propose a new generation. We have to select such a node using other means. For this we use a bully approach: every node compares its host id with host ids of other nodes and if it finds that it has the greatest host id, it becomes responsible for creating the first generation. This change also fixes the way of choosing values of the "time" column of CDC log writes: the timeuuid is chosen in a way which preserves ordering of corresponding base table mutations (the timestamp of this timeuuid is equal to the base table mutation timestamp). Warning: if you were running a previous CDC version (without topology change support), make sure to disable CDC on all tables before performing the upgrade. This will drop the log data -- backup it if needed. TODO in future patchset: expire CDC generations. Currently, each inserted CDC generation will stay in the distributed tables forever (until manually removed by the administrator). When a generation is superseded, it should become "expired", and 24 hours after expiration, it should be removed. 
The distributed tables (cdc_topology_description and cdc_description) both have an "expired" column which can be used for this purpose. Unit tests: dev, debug, release dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/	2020-02-04 10:20:29 +02:00
Benny Halevy	f45fabab73	gossiper: do_stop_gossiping: copy live endpoints vector It can be resized asynchronously by mark_dead. Fixes #5701 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200203091344.229518-1-bhalevy@scylladb.com>	2020-02-04 10:20:28 +02:00
Pavel Emelyanov	8a7f13420f	gossiper: Avoid string merge-split for nothing The caller of check_knows_remote_features merges a set of features into a string, but the method in question ... splits them back into the set. Avoid this unneeded step and clean the respective storage service helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Pavel Emelyanov	aa6b1efc35	features: Unfriend storage_service The storage service no longer needs to mess with feature config. It only needs two features to register itself in, but this can be solved by respective cluster_supports_foo helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Pavel Emelyanov	9b67226715	features: Simplify feature registration Now features are registered into a map of vectors, but it looks like the vector is always 1-item long and is used to keep pointer on feature, instead of the feature itself. Switch it into map of reference_wrapper-s. Before this patch we could register more than one feature under the same name, now we can't. But this seems to be OK, as we don't actually do this. To catch violations of this restriction there's an assert() in the feature_service::register_feature. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Pavel Emelyanov	da6af8bde7	features: Introduce known_feature_set There are two masks -- supported and known. They differ in unbounded_range_tombstones one which is set depending on the sstables format in use. Since the feature_service doesn't know anything about sstables format, the logic is reverted -- the feature service reports back the known mask (all features) and storage_service clears the unbounded_range_tombstones if the sst format is low -- but is (hopefully) left intact. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
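
The generation check described in commit 743b529c2b (gossip: Add an option to force gossip generation) can be sketched roughly as follows. This is a minimal illustration of the rule as the commit message states it, not code from the repository: the helper names `accepts_generation` and `forced_generation`, the use of seconds as the unit, and the one-year value of the constant are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed value for illustration only: one year, expressed in seconds.
constexpr int64_t max_generation_difference = 86400LL * 365;

// Hypothetical helper: would a receiver whose own generation is
// `local_generation` accept a gossip update carrying `remote_generation`?
// Mirrors the condition in the commit message: the update is rejected
// when remote - local > MAX_GENERATION_DIFFERENCE.
inline bool accepts_generation(int64_t local_generation, int64_t remote_generation) {
    return remote_generation - local_generation <= max_generation_difference;
}

// Hypothetical helper: pick a generation for a restarting node that stays
// within the allowed difference of the oldest peer generation, so that no
// peer rejects it. `now` is the generation the node would otherwise use.
inline int64_t forced_generation(int64_t oldest_peer_generation, int64_t now) {
    return std::min(now, oldest_peer_generation + max_generation_difference);
}
```

With these definitions, a node restarting two "years" after its peers started would be rejected with its natural generation, but accepted with the value returned by `forced_generation`.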


571 Commits
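
The fix in commit dddde33512 (gossip: Do not send shutdown message when a node is in unknown status) comes down to a set-membership test before announcing shutdown. The sketch below shows that idea under stated assumptions: the function names are hypothetical, statuses are modelled as plain strings, and the set lists only UNKNOWN (added by the commit) and HIBERNATE (implied by the commit message); the real SILENT_SHUTDOWN_STATES may contain other entries.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Statuses in which a stopping node should stay silent.
// Assumption: only the two states mentioned in the commit message are listed.
inline const std::unordered_set<std::string>& silent_shutdown_states() {
    static const std::unordered_set<std::string> states = {"HIBERNATE", "UNKNOWN"};
    return states;
}

// Hypothetical helper: returns true if a node stopping gossip while in
// `status` should send a shutdown message to its peers.
inline bool should_announce_shutdown(const std::string& status) {
    return silent_shutdown_states().count(status) == 0;
}
```

Under this model, a replacing node killed before it reaches HIBERNATE (status still UNKNOWN) no longer announces SHUTDOWN, so peers never run the NORMAL/SHUTDOWN handler for it.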