We saw a failure in the dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception is thrown
The problem is that, after receiving the DOWN event, the Python
Cassandra driver calls Cluster.on_down, which checks whether this client
has any connections to the node being shut down. If there are any
connections, the Cluster.on_down handler exits early, so the session
to the node being shut down is not removed.
If we shut down the CQL server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.
Refs #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
On at least one system, using the container's /tmp as provided by docker
results in spurious EINVALs during aio:
INFO 2018-12-27 09:54:08,997 [shard 0] gossip - Feature ROW_LEVEL_REPAIR is enabled
unknown location(0): fatal error: in "test_write_many_range_tombstones": storage_io_error: Storage I/O error: 22: Invalid argument
seastar/tests/test-utils.cc(40): last checkpoint
The setup is overlayfs over xfs.
To avoid this problem, pass through the host's /tmp to the container.
Using --tmpfs would be better, but it's not possible to guess a good size
as the amount of temporary space needed depends on build concurrency.
Message-Id: <20181227101345.11794-1-avi@scylladb.com>
Image fedora-29-20181219 was broken due to the following chain of events:
- we install gnutls, which currently is at version 3.6.5
- gnutls 3.6.5 introduced a dependency on nettle 3.4.1
- the gnutls rpm does not include a version requirement on nettle,
so an already-installed nettle will not be upgraded when gnutls is
installed
- the fedora:29 image which we use as a baseline has nettle installed
- docker does not pull the latest tag in FROM statements during
"docker build"
- my build machine already had a fedora:29 image, with nettle 3.4
installed (the repository's image has 3.4.1, but docker doesn't
automatically pull if an image with the required tag exists)
As a result, the image ended up having gnutls 3.6.5 and nettle 3.4, which
are incompatible.
To fix, update all packages after installation to attempt to have a
self-consistent package set even if dependencies are not perfect, and regenerate
the image.
Message-Id: <20181226135711.24074-1-avi@scylladb.com>
The '-t' flag to 'docker run' passes the tty from the caller environment
to the container, which is nice for interactive jobs, but fails if there
is no tty, such as in a continuous integration environment.
Given that, the '-i' flag doesn't make sense either as there isn't any
input to pass.
Remove both, and replace with --sig-proxy=true which allows SIGTERM to
terminate the container instead of leaving it alive. This reduces the
chances of the build stopping but leaving random containers around.
Message-Id: <20181222105837.22547-1-avi@scylladb.com>
The function pending_collection is only called when
cdef->is_multi_cell() is true, so the throw is dead.
This patch converts it to an assert.
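For illustration, a minimal sketch of the kind of change described, with invented surrounding types (not the actual Scylla code):
    #include <cassert>

    struct column_definition {
        bool multi_cell = true;
        bool is_multi_cell() const { return multi_cell; }
    };

    struct collection_accumulator {
        // Every caller checks cdef->is_multi_cell() first, so the old throw was
        // unreachable; an assert documents and enforces the invariant instead.
        void pending_collection(const column_definition* cdef) {
            assert(cdef->is_multi_cell());
            // ... accumulate the pending collection mutation ...
        }
    };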
Message-Id: <20181207022119.38387-1-espindola@scylladb.com>
"
=== How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges into sub-ranges which contain around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice:
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few megabytes of rows.
We read all the rows within the small range into memory, find the
mismatches and send the mismatched rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data:
We need to find the mismatched rows between nodes and only send the delta.
This is known as the set reconciliation problem, which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
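For illustration only, the same reconciliation expressed with std::set_difference on toy string sets (real repair works on row hashes, across nodes):
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <string>

    int main() {
        std::set<std::string> set1{"row1", "row2", "row3"};   // Node1
        std::set<std::string> set2{"row2", "row3"};           // Node2
        std::set<std::string> set3{"row1", "row2", "row4"};   // Node3

        // Rows Node1 is missing from each peer (setX - set1).
        std::set<std::string> from2, from3;
        std::set_difference(set2.begin(), set2.end(), set1.begin(), set1.end(),
                            std::inserter(from2, from2.end()));   // {}
        std::set_difference(set3.begin(), set3.end(), set1.begin(), set1.end(),
                            std::inserter(from3, from3.end()));   // {row4}

        // After fetching, Node1 holds the union; each peer then gets union - its own set.
        std::set<std::string> all = set1;
        all.insert(from2.begin(), from2.end());
        all.insert(from3.begin(), from3.end());

        std::set<std::string> to2, to3;
        std::set_difference(all.begin(), all.end(), set2.begin(), set2.end(),
                            std::inserter(to2, to2.end()));       // {row1, row4}
        std::set_difference(all.begin(), all.end(), set3.begin(), set3.end(),
                            std::inserter(to3, to3.end()));       // {row3}

        std::cout << "Node1 sends " << to2.size() << " rows to Node2 and "
                  << to3.size() << " rows to Node3\n";
    }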
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position;
};
Each node reads rows from disk into row buffers until the size is larger than N
bytes, and returns the repair_sync_boundary of the last mutation_fragment it
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
the data is synced; go to Step A. If not, request the full hashes from the peers.
At this point, the repair master knows exactly which rows are missing. Request the
missing rows from the peer nodes.
Now the local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since the local node also knows what the peer nodes own, it sends the missing rows to
the peer nodes.
=== What the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows into each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx = 162 GiB, rx = 90 GiB
new: tx = 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, the repair master needs to receive 2 million rows and
send 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
The repair master sends its own 1 million rows plus the 1 million rows
received from repair slave 1 to repair slave 2, and sends its own 1
million rows plus the 1 million rows received from repair slave 2 to
repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
We had an issue building the offline installer on RHEL because of repository
differences.
This fix enables building the offline installer on both CentOS and RHEL.
It also introduces --releasever <ver>, to build the offline installer for a
specific minor version of CentOS and RHEL.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181212032129.29515-1-syuu@scylladb.com>
Sometimes one wants to just compile all the source files in the
project, for example because one just moved code or files around and
there is no need to link and run anything, just to check that everything
still compiles.
Since linking takes up a considerable amount of time it is worthwhile to
have a specific target that caters for such needs.
This patch introduces a ${mode}-objects target for each mode (e.g.
release-objects) that only runs the compilation step for each source
file but does not link anything.
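For example, assuming the project's usual ninja-driven build, compiling everything in release mode without linking becomes:
    $ ninja release-objects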
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <eaad329bf22dfaa3deff43344f3e65916e2c8aaf.1545045775.git.bdenes@scylladb.com>
In one test the types in the schema don't match the types in the
statistics file. In another a column is missing.
The patch also updates the exceptions to have more human readable
messages.
Tests: unit (release)
Part of issue #3960.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181219233046.74229-1-espindola@scylladb.com>
Validate an ASCII string by ORing all bytes together and checking that bit 7
(the most significant bit) of the result is 0.
Compared with the original std::any_of(), which checks the string byte
by byte, this new approach validates the input 8 bytes at a time in two
independent streams. Performance is much higher for normal cases,
though slightly slower when the string is very short. See the table below.
Speed(MB/s) of ascii string validation
+---------------+-------------+---------+
| String length | std::any_of | u64 x 2 |
+---------------+-------------+---------+
| 9 bytes | 1691 | 1635 |
+---------------+-------------+---------+
| 31 bytes | 2923 | 3181 |
+---------------+-------------+---------+
| 129 bytes | 3377 | 15110 |
+---------------+-------------+---------+
| 1039 bytes | 3357 | 31815 |
+---------------+-------------+---------+
| 16385 bytes | 3448 | 47983 |
+---------------+-------------+---------+
| 1048576 bytes | 3394 | 31391 |
+---------------+-------------+---------+
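A minimal sketch of the technique, assuming 16 bytes per iteration split across two 64-bit accumulators (the submitted implementation may differ in details):
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    bool is_ascii(const char* data, size_t len) {
        uint64_t acc1 = 0, acc2 = 0;
        size_t i = 0;
        // Two independent accumulators let the CPU overlap the 8-byte loads.
        for (; i + 16 <= len; i += 16) {
            uint64_t a, b;
            std::memcpy(&a, data + i, 8);
            std::memcpy(&b, data + i + 8, 8);
            acc1 |= a;
            acc2 |= b;
        }
        uint8_t tail = 0;
        for (; i < len; ++i) {
            tail |= static_cast<uint8_t>(data[i]);
        }
        // The string is ASCII iff no byte has its most significant bit (bit 7) set.
        return (((acc1 | acc2) & 0x8080808080808080ull) | (tail & 0x80)) == 0;
    }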
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544669646-31881-1-git-send-email-yibo.cai@arm.com>
"
Some files use db/config.hh just to access extensions. Reduce dependencies
on this global and volatile file by providing another path to access extensions.
Tests: unit(release)
"
* tag 'unconfig-2/v1' of https://github.com/avikivity/scylla:
hints: reduce dependencies on db/config.hh
commitlog: reduce dependencies on db/config.hh
cql3: reduce dependencies on db/config.hh
database: provide accessor to db::extensions
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
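As a hypothetical sketch of what such an accessor can look like (the member and its wiring are assumed here, not taken from the patch):
    namespace db { class extensions; }

    class database {
        const db::extensions& _extensions;  // assumed to be supplied at construction
    public:
        explicit database(const db::extensions& ext) : _extensions(ext) {}
        // Callers that only need extensions no longer have to go through get_config().
        const db::extensions& extensions() const { return _extensions; }
    };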
When the next pending fragments are after the start of the new range,
we know there is no need to skip.
Caught by perf_fast_forward --datasets large-part-ds3 \
--run-tests=large-partition-slicing
Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>
Currently, if something throws in the mutation sending loop while streaming,
the sink is not closed. Also, while close() is running, the code does not hold
onto the sink object. close() is asynchronous, so the sink should be kept alive
until it completes. The patch uses do_with() to hold onto the sink while close()
is running, and runs close() on the error path too.
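A simplified, self-contained sketch of the pattern using Seastar primitives (names, types and helpers here are stand-ins, not the actual streaming code):
    #include <utility>
    #include <seastar/core/do_with.hh>
    #include <seastar/core/future-util.hh>

    // do_with() keeps the sink (and the data) alive until the returned future
    // resolves, and close() now runs on the error path as well as on success.
    template <typename Sink, typename Mutations>
    seastar::future<> send_mutations(Sink sink, Mutations muts) {
        return seastar::do_with(std::move(sink), std::move(muts),
                [] (Sink& sink, Mutations& muts) {
            return seastar::do_for_each(muts, [&sink] (auto& m) {
                return sink(m);          // sending may fail and propagate an exception
            }).finally([&sink] {
                return sink.close();     // runs on both the success and the error path
            });
        });
    }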
Fixes #4004.
Message-Id: <20181220155931.GL3075@scylladb.com>
Mutation readers allow fast-forwarding the ranges from which the data is
being read. The main user of this feature is cache which, when reading
from the underlying reader, may want to skip some data it already has.
Unsurprisingly, this adds more complexity to the implementation of the
readers and more edge cases the developers need to take care of.
While most of the readers were at least to some extent tested in this
area, those tests usually were quite isolated (e.g. one test doing
inter-partition fast-forwarding, another doing intra-partition
fast-forwarding) and as a consequence didn't cover many corner cases.
This patch adds a generic test for fast-forwarding and slicing that
covers more complicated scenarios when those operations are combined.
Needless to say, this did uncover some problems, but fortunately none of
them is user-visible.
Fixes #3963.
Fixes #3997.
Tests: unit(release, debug)
* https://github.com/pdziepak/scylla.git test-fast-forwarding/v4.1:
tests/flat_mutation_reader_assertions: accumulate received tombstones
tests/flat_mutation_reader_assertions: add more test messages
tests/flat_mutation_reader_assertions: relax has_monotonic_positions() check
tests/mutation_readers: do not ignore streamed_mutation::forwarding
Revert "mutation_source_test: add option to skip intra-partition fast-forwarding tests"
memtable: it is not a single partition read if partition fast-forwaring is enabled
sstables: add more tracing in mp_row_consumer_m
row_cache: use make_forwardable() to implement streamed_mutation::forwarding
row_cache: read is not single-partition if inter-partition forwarding is enabled
row_cache: drop support for streamed_mutation::forwarding::yes entirely
sstables/mp_row_consumer: position_range end bound is exclusive
mutation_fragment_filter: handle streamed_mutation::forwarding::yes properly
tests/mutation_reader: reduce sleeping time
tests/memtable: fix partition_range use-after-free
tests/mutation: fix partition range use-after-free
flat_mutation_reader_from_mutations: add overload that accepts a slice and partition range
flat_mutation_reader_from_mutations: fix empty range case
flat_mutation_reader_from_mutations: destroy all remaining mutations
tests/mutation_source: drop dropped column handling test
tests/mutation_source: add test for complex fast_forwarding and slicing
While we already had tests that verified inter- and intra-partition
fast-forwarding as well as slicing, they had quite limited scope and
didn't combine those operations. The new test is meant to extensively
test these cases.
Schema changes are now covered by the for_each_schema_change() function.
Having some additional tests in run_mutation_source_tests() is
problematic when it is used to test intermediate mutation readers,
because schema changes may be irrelevant for them, which makes the test
a waste of time (might be a problem in debug mode) and requires those
intermediate readers to use a more complex underlying reader that supports
schema changes (again, a problem in a very slow debug mode).
If the reader is fast-forwarded to another partition range, mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.
It is very bad taste to sleep anywhere in the code. The test should be
fixed to explicitly test various orderings between concurrent
operations, but before that happens let's at least reduce how much
those sleeps slow it down by changing them from milliseconds to
microseconds.
Implementing intra-partition fast-forwarding adds more complexity to
already very-much-not-trivial cache readers and isn't really critical in
any way since it is not used outside of the tests. Let's use the generic
adapter instead of natively implementing it.
A single-partition reader is less expensive than one that accepts any
range of partitions, but it doesn't support fast-forwarding to another
partition range properly and therefore cannot be used if that option is
enabled.
This reverts commit b36733971b. That commit made
run_mutation_reader_tests() support mutation_sources that do not implement
streamed_mutation::forwarding::yes. This is wrong since mutation_sources
are not allowed to ignore or otherwise not support that mode. Moreover,
there is absolutely no reason for them to do so since there is a
make_forwardable() adapter that can make any mutation_reader a
forwardable one (at the cost of performance, but that's not always
important).
It is wrong to silently ignore streamed_mutation::forwarding option
which completely changes how the reader is supposed to operate. The best
solution is to use the make_forwardable() adapter which changes a
non-forwardable reader into a forwardable one.
Since 41ede08a1d "mutation_reader: Allow
range tombstones with same position in the fragment stream" mutation
readers emit fragments in non-decreasing order (as opposed to strictly
increasing), so has_monotonic_positions() needs to be updated to allow
that.
The current data model employed by mutation readers doesn't have a unique
representation of range tombstones. This complicates testing by making
multiple ways of emitting range tombstones and rows equally valid.
This patch adds an option to verify mutation readers by checking whether
tombstones they emit properly affect the clustered rows regardless of how
exactly the tombstones are emitted. The interface of
flat_mutation_reader_assertions is extended by adding
may_produce_tombstones() that accepts any number of tombstones and
accumulates them. Then, produces_row_with_key() accepts an additional
argument which is the expected timestamp of the range tombstone that
affects that row.
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
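To make the mechanism concrete, here is a purely illustrative sketch of one possible delay function d(x); the actual function and its constants are defined in the series itself, not here:
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>

    // Map the view-update backlog fraction at a base replica to an added write delay.
    std::chrono::microseconds view_write_delay(size_t backlog, size_t max_backlog) {
        double x = std::min(1.0, double(backlog) / double(max_backlog));
        // Small backlogs add almost no delay; a nearly full backlog approaches 1s.
        return std::chrono::microseconds(static_cast<int64_t>(x * x * x * 1'000'000));
    }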
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMe disks. A loader running on an
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes #2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...