scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Jesse Haber-Kucharsky	a139afc30c	auth: Reject logins from disallowed roles When the `LOGIN` option for a role is set to `false`, Scylla should not permit the role to log in. Fixes #4284 Tests: unit (debug)	2019-02-28 15:02:53 -05:00
Avi Kivity	88322086cb	Merge "Add fuzzer-type unit test for range scans" from Botond " This series adds a fuzzer-type unit test for range scans, which generates a semi-random dataset and executes semi-random range scans against it, validating the result. This test aims to cover a wide range of corner cases with the help of randomness. Data and queries against it are generated in such a way that various corner cases and their combinations are likely to be covered. The infrastructure under range-scans have gone under massive changes in the last year, growing in complexity and scope. The correctness of range scans is critical for the correct functioning of any Scylla cluster, and while the current unit tests served well in detecting any major problems (mostly while developing), they are too simplistic and can only be relied on to check the correctness of the basic functionality. This test aims to extend coverage drastically, testing cases that the author of the range-scan code or that of the existing unit tests didn't even think exists, by relying on some randomness. Fixes: #3954 (deprecates really) " * 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla: tests/multishard_mutation_query_test: add fuzzy test tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan() tests/test_table: add advanced `create_test_table()` overload tests/test_table: make `create_test_table()` customizable query: add trim_clustering_row_ranges_to() tests/test_table: add keyspace and table name params tests/test_table: s/create_test_cf/create_test_table/ tests: move create_test_cf() to tests/test_table.{hh,cc} tests/multishard_mutation_query_test: drop many partition test tests/multishard_mutation_query_test: drop range tombstone test	2019-02-27 17:26:53 +02:00
Nadav Har'El	da54d0fc7d	Materialized views: fix accidental zeroing of flow-control delay The materialized-views flow control carefully calculates an amount of microseconds to delay a client to slow it down to the desired rate - but then a typo (std::min instead of std::max) causes this delay to be zeroed, which in effect completely nullifies the flow control algorithm. Before this fix, experiments suggested that view flow control was not having any effect and view backlog not bounded at all. After this fix, we can see the flow control having its desired effect, and the view backlog converging. Fixes #4143. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190226161452.498-1-nyh@scylladb.com>	2019-02-26 18:22:18 +02:00
Tomasz Grabiec	b06aac4fdb	Merge "Fix temporary spurious schema version mismatch when nodes are restarted" from Asias Fixes: #4148 Fixes: #4258 Tests: resharding_test.py:reshardingtest_nodes4_with_sizetieredcompactionstrategy.resharding_by_smp_increase_test * seastar-dev.git asias/fix_schema_mismatch_when_nodes_restarts/v1: database: Add update_schema_version and announce_schema_version storage_service: Add application_state::SCHEMA when gossip starts	2019-02-26 12:55:52 +01:00
Avi Kivity	5f94bc902a	transport: add option to disable shard-aware drivers The shard-aware drivers can cause a huge amount of connections to be created when there are tens of thousands of clients. While normally the shard-aware drivers are beneficial, in those cases they can consume too much memory. Provide an option to disable shard awareness from the server (it is likely to be easier to do this on the server than to reprovision those thousands of clients). Tests: manual test with wireshark. Message-Id: <20190223173331.24424-1-avi@scylladb.com>	2019-02-26 12:44:11 +01:00
Asias He	459836079c	storage_service: Add application_state::SCHEMA when gossip starts In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw: 4 nodes in the tests n1, n2, n3, n4 are started n1 is stopped n1 is changed to use different shard config n1 is restarted ( 2019-01-27 04:56:00,377 ) The backtrace happened on n2 right after n1 restarts: 0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled 1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled 2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled 3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed) 4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status = 5 Segmentation fault on shard 0. 6 Backtrace: 7 0x00000000041c0782 8 0x00000000040d9a8c 9 0x00000000040d9d35 10 0x00000000040d9d83 11 /lib64/libpthread.so.0+0x00000000000121af 12 0x0000000001a8ac0e 13 0x00000000040ba39e 14 0x00000000040ba561 15 0x000000000418c247 16 0x0000000004265437 17 0x000000000054766e 18 /lib64/libc.so.6+0x0000000000020f29 19 0x00000000005b17d9 The theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time n1 has SCHEMA application_state, when n1 restarts, n2 gets new application state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule wakes up from the 60-second sleep, n1 has non-empty endpoint_state but empty application_state for SCHEMA. We dereference the nullptr application_state and abort. In commit `da80f27f44`, we fixed the problem by checking the pointer before dereferencing it. To prevent this from happening in the first place, we had better add application_state::SCHEMA when gossip starts. This way, peer nodes always see the application_state::SCHEMA when a node restarts. 
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test Fixes #4148 Fixes #4258	2019-02-26 19:30:22 +08:00
Piotr Sarna	c743617236	cql3: unify max value for row limit and per-partition limit Limits are stored as uint32_t everywhere, but in some places int32_t was used, which created inconsistencies when comparing the value to std::numeric_limits<Type>::max(). In order to solve inconsistencies, the types are unified to uint32_t, and instead of explicitly calling numeric limit max, an already existing constant value query::max_rows is utilized. Fixes #4253 Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>	2019-02-21 13:56:02 +02:00
Duarte Nunes	6e83457b1b	Merge 'Add PER PARTITION LIMIT' from Piotr " This series introduces PER PARTITION LIMIT to CQL. Protocol and storage is already capable of applying per-partition limits, so for nonpaged queries the changes are superficial - a variable is parsed and passed down. For paged queries and filtering the situation is a little bit more complicated due to corner cases: results for one partition can be split over 2 or more pages, filtering may drop rows, etc. To solve these, another variable is added to paging state - the number of rows already returned from last served partition. Note that "last" partition may be stretched over any number of pages, not just the last one, which is a case especially when considering filtering. As a result, per-partition-limiting queries are not eligible for page generator optimization, because they may need to have their results locally filtered for extraneous rows (e.g. when the next page asks for per-partition limit 5, but we already received 4 rows from the last partition, so need just 1 more from last partition key, but 5 from all next ones). Tests: unit (dev) Fixes #2202 " * 'add_per_partition_limit_3' of https://github.com/psarna/scylla: tests: remove superficial ignore_order from filtering tests tests: add filtering with per partition key limit test tests: publish extract_paging_state and count_rows_fetched tests: fix order of parameters in with_rows_ignore_order cql3,grammar: add PER PARTITION LIMIT idl,service: add persistent last partition row count cql3: prevent page generator usage for per-partition limit cql3: add checking for previous partition count to filtering pager: add adjusting per-partition row limit cql3: obey per partition limit for filtering cql3: clean up unneeded limit variables cql3: obey per partition limit for select statement cql3: add get_per_partition_limit cql3: add per_partition_limit to CQL statement	2019-02-18 14:47:11 +00:00
Piotr Sarna	acf7bedad4	idl,service: add persistent last partition row count In order to process paged queries with per-partition limits properly, paging state needs to keep additional information: what was the row count of last partition returned in previous run. That's necessary because the end of previous page and the beginning of current one might consist of rows with the same partition key and we need to be able to trim the results to the number indicated by per-partition limit.	2019-02-18 11:06:44 +01:00
Piotr Sarna	1dadae212a	cql3: add checking for previous partition count to filtering Filtering now needs to take into account per partition limits as well, and for that it's essential to be able to compare partition keys and decide which rows should be dropped - if previous page(s) contained rows with the same partition key, these need to be taken into consideration too.	2019-02-18 11:06:43 +01:00
Piotr Sarna	82a3883575	pager: add adjusting per-partition row limit For filtering pagers, per partition limit should be set to page size every time a query is executed, because some rows may potentially get dropped from results.	2019-02-18 10:55:52 +01:00
Piotr Sarna	b965c3778f	cql3: obey per partition limit for filtering Filtering queries now take into account the limit of rows per single partition provided by the user.	2019-02-18 10:29:34 +01:00
Gleb Natapov	b01a659014	storage_proxy: remove old Cassandra code Part of the code is already implemented (counters and hinted-handoff). Part of the code will probably never be (triggers). And the rest is the code that estimates number of rows per range to determine query parallelism, but we implemented exponential growth algorithms instead. Message-Id: <20190214112226.GE19055@scylladb.com>	2019-02-18 10:34:55 +02:00
Avi Kivity	a1567b0997	Merge "replace get_restricted_ranges() function with generator interface" from Gleb " get_restricted_ranges() is inefficient since it calculates all vnodes that cover a requested key ranges in advance, but callers often use only the first one. Replace the function with generator interface that generates requested number of vnodes on demand. " * 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev: storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator storage_proxy: remove old get_restricted_ranges() interface cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface storage_proxy: introduce new query_ranges_to_vnode_generator interface	2019-02-18 10:33:54 +02:00
Botond Dénes	2125e99531	service/storage_service: fix pre-bootstrap wait for schema agreement When bootstrapping, a node should wait to reach schema agreement with its peers before it can join the ring. This is to ensure it can immediately accept writes. Failing to reach schema agreement before joining is not fatal, as the node can pull unknown schemas on writes on-demand. However, if such a schema contains references to UDFs, the node will reject writes using it, due to #3760. To ensure that schema agreement is reached before joining the ring, `storage_service::join_token_ring()` has two checks. First it checks that at least one peer was connected previously. For this it compares `database::get_version()` with `database::empty_version`. The (implied) assumption is that this will become something other than `database::empty_version` only after having connected (and pulled schemas from) at least one peer. This assumption doesn't hold anymore, as we now set the version earlier in the boot process. The second check verifies that we have the same schema version as all known, live peers. This check assumes (since `3e415e2`) that we have already "met" all (or at least some) of our peers and if there is just one known node (us) it concludes that this is a single-node cluster, which automatically has schema agreement. It's easy to see how these two checks will fail. The first fails to ensure that we have met our peers, and the second wrongfully concludes that we are a one-node cluster, and hence have schema agreement. To fix this, modify the first check. Instead of relying on the presence of a non-empty database version, supposedly implying that we already talked to our peers, explicitly make sure that we have really talked to at least one other node, before proceeding to the second check, which will now do the correct thing, actually checking the schema versions. 
Fixes: #4196 Branches: 3.0, 2.3 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>	2019-02-15 15:56:46 +01:00
Calle Wilund	64e8c6f31d	storage_service: Add features disabling for tests	2019-02-13 09:08:12 +00:00
Calle Wilund	ff5e541335	storage_service: Add "truncation_table" feature	2019-02-13 09:08:12 +00:00
Gleb Natapov	26e5700819	storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator Do not precalculate too many ranges in advance; it requires a large allocation and usually means that a consumer of the interface is going to do too much work in parallel. Fixes: #3767	2019-02-12 10:45:25 +02:00
Gleb Natapov	ecc5230de5	storage_proxy: remove old get_restricted_ranges() interface It is not used any more.	2019-02-11 14:45:43 +02:00
Gleb Natapov	2735a85c8e	storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface	2019-02-11 14:45:43 +02:00
Gleb Natapov	692a0bd000	storage_proxy: introduce new query_ranges_to_vnode_generator interface The get_restricted_ranges() function takes the key ranges provided by a query and divides them on vnode boundaries. It iterates over all ranges and calculates all vnodes, but its users are usually interested in only one vnode, since that will most likely be enough to populate a page. If it is not enough, they will ask for more. This patch replaces the function with a new interface that generates vnode ranges on demand instead of precalculating all of them.	2019-02-11 14:45:43 +02:00
Botond Dénes	181bf64858	query: add trim_clustering_row_ranges_to() This algorithm was already duplicated in two places (service/pager/query_pagers.cc and mutation_reader.cc). Soon it will be used in a third place. Instead of triplicating, move it into a function that everybody can use.	2019-02-08 16:30:17 +02:00
Calle Wilund	ba6a8ef35b	tls: Use a default prio string disabling TLS1.0 forcing min 128bits Fixes #4010 Unless the user sets this explicitly, we should try to explicitly avoid deprecated protocol versions. While gnutls should do this for connections initiated thusly, clients such as drivers etc. might use obsolete versions. Message-Id: <20190107131513.30197-1-calle@scylladb.com>	2019-02-05 15:34:18 +02:00
Asias He	28d6d117d2	migration_manager: Fix nullptr dereference in maybe_schedule_schema_pull Commit `976324bbb8` changed to use get_application_state_ptr to get a pointer to the application_state. It may return nullptr that is dereferenced unconditionally. In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw: 4 nodes in the tests n1, n2, n3, n4 are started n1 is stopped n1 is changed to use different shard config n1 is restarted ( 2019-01-27 04:56:00,377 ) The backtrace happened on n2 right after n1 restarts: 0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled 1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled 2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled 3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed) 4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status = 5 Segmentation fault on shard 0. 6 Backtrace: 7 0x00000000041c0782 8 0x00000000040d9a8c 9 0x00000000040d9d35 10 0x00000000040d9d83 11 /lib64/libpthread.so.0+0x00000000000121af 12 0x0000000001a8ac0e 13 0x00000000040ba39e 14 0x00000000040ba561 15 0x000000000418c247 16 0x0000000004265437 17 0x000000000054766e 18 /lib64/libc.so.6+0x0000000000020f29 19 0x00000000005b17d9 We do not know when this backtrace happened, but according to the logs from n3 and n4: INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the delay before gossip notices a peer is down), so the abort time is around 04:56:0X. 
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace must be scheduled before n1 is restarted, because it dereferences the application_state pointer after it sleeps 60 seconds, so the time maybe_schedule_schema_pull is called is around 04:55:0X which is before n1 is restarted. So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time n1 has SCHEMA application_state, when n1 restarts, n2 gets new application state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule wakes up from the 60-second sleep, n1 has non-empty endpoint_state but empty application_state for SCHEMA. We dereference the nullptr application_state and abort. Fixes: #4148 Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>	2019-02-01 09:01:08 +02:00
Duarte Nunes	ea34e242de	Merge 'Do not use hints for view building' from Piotr " This series prevents view building to fall back to storing hints. Instead, it will try to send hints to an endpoint as if it has consistency level ONE, and in case of failure retry the whole building step. Then, view building will never be marked as finished prematurely (because of pending hints), which will help avoid creating inconsistencies when decommissioning a node from the cluster. Tests: unit (release) dtest (materialized_views_test.py.) Fixes #3857 Fixes #4039 " 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla: db,view: add updating view_building_paused statistics database: add view_building_paused metrics table: make populate_views not allow hints db,view: add allow_hints parameter to mutate_MV storage_proxy: add allow_hints parameter to send_to_endpoint	2019-01-28 10:31:14 +00:00
Piotr Sarna	e0fe9ce2c0	storage_proxy: add allow_hints parameter to send_to_endpoint With hints allowed, send_to_endpoint will leverage consistency level ANY to send data. Otherwise, it will use the default - cl::ONE.	2019-01-28 09:38:41 +01:00
Piotr Jastrzebski	7666e81b51	Decouple database.hh from types/user.hh This commit declares shared_ptr<user_types_metadata> in database.hh where user_types_metadata is an incomplete type, so it requires "Allow to use shared_ptr with incomplete type other than sstable" to compile correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-01-24 09:55:04 +01:00
Gleb Natapov	85cb09294e	storage_service: do not start thrift and cql servers if a node is isolated due to errors Scylla starts doing IO much earlier than it starts the cql/thrift servers. The IO may cause an error that will try to stop all servers, but since they are not running yet this does nothing, and the servers will still be started later. Fix it by checking that the node is not isolated before starting servers. Message-Id: <20190110152830.GE3172@scylladb.com>	2019-01-21 13:04:23 +00:00
Piotr Sarna	87c23372fb	cql3: fix filtering with LIMIT with regard to paging Previously the limit was erroneously applied per page instead of being accumulated, which might have caused returning too many rows. As of now, LIMIT is handled properly inside restrictions filter. Fixes #4100	2019-01-17 13:25:09 +01:00
Piotr Sarna	0eb703dc80	all: rename view_update_from_staging_generator The new name, view_update_generator, is both more concise and correct, since we now generate from directories other than "/staging".	2019-01-15 17:31:47 +01:00
Piotr Sarna	13c8c84045	service: add generating view updates from uploaded sstables SSTables loaded to the system via /upload dir may sometimes be needed to generate view updates from them (if their table has accompanying views). Fixes #4047	2019-01-15 17:31:37 +01:00
Piotr Sarna	46305861c3	init: pass view update generator to storage service Storage service needs to access view update generator in order to register staging sstables from /upload directory.	2019-01-15 17:31:36 +01:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Nadav Har'El	da090a5458	materialized views: move hints to top-level directory While we keep ordinary hints in a directory parallel to the data directory, we decided to keep the materialized view hints in a subdirectory of the data directory, named "view_pending_updates". But during boot, we expect all subdirectories of data/ to be keyspace names, and when we notice this one, we print a warning: WARN: database - Skipping undefined keyspace: view_pending_updates This spurious warning annoyed users. But moreover, we could have bigger problems if the user actually tries to create a keyspace with that name. So in this patch, we move the view hints to a separate top-level directory, which defaults to /var/lib/scylla/view_hints, but as usual can be configured. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190107142257.16342-1-nyh@scylladb.com>	2019-01-07 16:43:43 +02:00
Avi Kivity	f02c64cadf	streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh This header, which is easily replaced with a forward declaration, introduces a dependency on database.hh everywhere. Remove it and scatter includes of database.hh in source files that really need it.	2019-01-05 17:33:25 +02:00
Piotr Sarna	a73d9ccf31	service: mark existing views as built before bootstrap When a node is bootstrapping, it will receive data from other nodes via streaming, including materialized views. Regardless of whether these views are built on other nodes or not, building them on newly bootstrapped nodes has no effect - updates were either already streamed completely (if view building has finished) or will be propagated via view building, if the process is still ongoing. So, marking all views as 'built' for the bootstrapped node prevents it from spawning superfluous view building processes. Fixes #4001 Message-Id: <fd53692c38d944122d1b1013fdb0aedf517fa409.1546498861.git.sarna@scylladb.com>	2019-01-03 09:39:33 +00:00
Avi Kivity	c180a18dbb	Distribute distributed_loader into its own header and source files distributed_loader is a sizeable fraction of database.cc, so moving it out reduces compile time and improves readability. Message-Id: <20181230200926.15074-1-avi@scylladb.com>	2018-12-31 14:27:27 +02:00
Avi Kivity	7830086317	client_state: change set_keyspace() to accept a single database shard set_keyspace() only needs one shard (it is checking replicated state, not sharded data) so arrange for it to receive only that one shard.	2018-12-29 10:58:39 +02:00
Asias He	4d3c463536	storage_service: Stop cql server before gossip We saw failure in dtest concurrent_schema_changes_test.py: TestConcurrentSchemaChanges.changes_while_node_down_test test. ====================================================================== ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test self.make_schema_changes(session, namespace='ns2') File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes session.execute('USE ks_%s' % namespace) File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result() File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result raise self._final_exception ConnectionShutdown: Connection to 127.0.0.1 is closed The test: session = self.patient_cql_connection(node2) self.prepare_for_changes(session, namespace='ns2') node1.stop() self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception throws The problem is that, after receiving the DOWN event, the python Cassandra driver will call Cluster:on_down which checks if this client has any connections to the node being shutdown. If there is any connections, the Cluster:on_down handler will exit early, so the session to the node being shutdown will not be removed. If we shutdown the cql server first, the connection count will be zero and the session will be removed. Fixes: #4013 Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>	2018-12-27 14:13:43 +02:00
Duarte Nunes	2f69ba2844	lwt: Remove Paxos-related Cassandra code Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181227112526.4180-1-duarte@scylladb.com>	2018-12-27 13:30:10 +02:00
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges into sub-ranges, each containing around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatched rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. The problem is called the set reconciliation problem, which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3. 
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2 Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === How the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - reqeust_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We created a keyspace with a replication factor of 3 and inserted 1 billion rows to each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows. 
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx = 162 GiB, rx = 90 GiB new: tx = 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the result, we saw the rows on wire were as expected. 
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
Duarte Nunes	e6a8883228	service/storage_proxy: Protect against empty mutation when storing hint mutation_holder::get_mutation_for() can return nullptr's, so protect against those when storing a hint. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181221194853.98775-2-duarte@scylladb.com>	2018-12-23 11:14:44 +02:00
Duarte Nunes	6c4a34f378	service/storage_proxy: Protect against empty mutation in mutation_holder The per_destination_mutation holder can contain empty mutations, so make sure release_mutation() skips over those. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181221194853.98775-1-duarte@scylladb.com>	2018-12-23 11:14:43 +02:00
Duarte Nunes	2d7c026d6e	service/storage_proxy: Release mutation as early as possible When delaying a base write, there is no need to hold on to the mutation if all replicas have already replied. We introduce mutation_holder::release_mutation(), which frees the mutations that are no longer needed during the rest of the delay. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	756b601560	service/storage_proxy: Delay replica writes based on view update backlog As the amount of pending view updates increases we know that there’s a mismatch between the rate at which the base receives writes and the rate at which the view retires them. We react by applying backpressure to decrease the rate of incoming base writes, allowing the slow view replicas to catch up. We want to delay the client’s next writes to a base replica. We use the base’s backlog of view updates to derive this delay. If we achieve CL and the backlogs of all replicas involved were last seen to be empty, then we wouldn't delay the client's reply. However, it could be that one of the replicas is actually overloaded, and won't reply for many new such requests. We'll eventually start applying backpressure to the client via the background's write queue, but in the meanwhile we may be dropping view updates. To mitigate this we rely on the backlog being gossiped periodically. Fixes #2538 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	997bdf5d98	service/storage_proxy: Get the backlog of a particular base replica Add a function that returns the view update backlog for a particular replica. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	819b6f3406	service/storage_proxy: Add counters for delayed base writes Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	37dfd22619	service: Distribute a node's view update backlog This patch introduces the view_update_backlog_broker class, which is responsible for periodically updating the local gossip state with the current node's view update backlog. It also registers to updates from other nodes, and updates the local coordinator's view of their view update backlogs. We consider the view update backlog received from a peer through the mutation_done verb to be always fresh, but we consider the one received through gossip to be fresh only if it has a higher timestamp than what we currently have recorded. This is because a node only updates its gossip state periodically, and also because a node can transitively receive gossip state about a third node with outdated information. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
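The freshness rule described in the message above (a backlog carried by a direct write reply is always accepted, while one learned through gossip is accepted only if its timestamp is newer) might be sketched as follows. The types and the function `maybe_update` are hypothetical and chosen for illustration; they do not reproduce the actual `view_update_backlog_broker` code.

```cpp
#include <cstdint>

// Hypothetical types for illustration only; not Scylla's real classes.
struct backlog_entry {
    std::uint64_t backlog;    // pending view update backlog reported by the node
    std::int64_t timestamp;   // when the value was produced (assumed unit: ms)
};

enum class backlog_source { mutation_reply, gossip };

// Applies `incoming` to `current` according to the freshness rule and returns
// true if `current` was replaced.
inline bool maybe_update(backlog_entry& current, const backlog_entry& incoming,
                         backlog_source src) {
    // Gossip state is published only periodically and can arrive transitively
    // through a third node, so it is used only when strictly newer.
    if (src == backlog_source::gossip && incoming.timestamp <= current.timestamp) {
        return false;
    }
    // A value carried by a write reply comes straight from the replica and is
    // treated as fresh.
    current = incoming;
    return true;
}
```

In this sketch the caller is assumed to stamp values received through write replies with the current time, so that later gossip updates compare against it.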
Duarte Nunes	8da6a31e75	service: Advertise view update backlog over gossip This lays the groundwork for brokering a node's view update backlog across the whole cluster. This is needed for when a coordinator does not contact a given replica for a long time, and uses a backlog view that is outdated and causes requests to be unnecessarily delayed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	ede5742f9b	service/storage_proxy: Send view update backlog from replicas Change the inter-node protocol so we can propagate the view update backlog from a base replica to the coordinator through the mutation_done and mutation_failed verbs. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00

1376 Commits