scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	2c561ecaed	db: Allow writes to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	b1ae6ad2ad	db: Introduce counters for failed reads and writes	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	31645e2c4a	commitlog: Allow allocations to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	61d81617e1	utils/flush_queue: Add ability to wait with a timeout	2016-11-29 16:40:58 +01:00
Glauber Costa	353a4cd2d4	commitlog: sync segments before acquiring semaphore on shutdown. Sync all segments before acquiring the semaphore, otherwise waiting may have to wait for the timer to kick in and push them down. Note that we can't guarantee that no other requests were executed in the mean time, so we have to sync again. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>	2016-11-29 11:07:28 +02:00
Tomasz Grabiec	96c7764458	Revert "prevent commitlog replay position reordering during reserve refill" This reverts commit `0e9b75d406`. commitlog_test fails with this: Running 14 test cases... ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory) ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! 
Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required) *** 1 failure detected in test suite "tests/commitlog_test.cc" WARN 2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)	2016-11-28 20:52:13 +01:00
Raphael S. Carvalho	f141b0cdae	database: atomically add new sstables to cf when refreshing New sstables are loaded and added in parallel, meaning that scylla can potentially return stale data if a new sstable containing a tombstone wasn't loaded yet. Compaction should also not run until all new sstables are added for similar reasons. Fix is about separating blocking and non-blocking steps to allow atomic add of multiple new sstables. Fixes #1368. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>	2016-11-28 20:30:48 +02:00
Glauber Costa	0e9b75d406	prevent commitlog replay position reordering during reserve refill When requests hit the commitlog, each of them will be assigned a replay position, which we expect to be ordered. If reorders happen, the request will be discarded and re-applied. Although this is supposed to be rare, it does increase our latencies, especially when big requests are involved. Processing big requests is expensive and if we have to do it twice that adds to the cost. The commitlog is supposed to issue replay positions in order, and it could be that the code that adds them to the memtables will reorder them. However, there is one instance in which the commitlog will not keep its side of the bargain. That happens when the reserve is exhausted, and we are allocating a segment directly at the same time the reserve is being replenished. The following sequence of events with its deferring points will illustrate it: on_timer: return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) { At this point, the segment id is already allocated. new_segment(): if (_reserve_segments.empty()) { [ ... ] return allocate_segment(true).then ... At this point, we have a new segment that has an id that is higher than the previous id allocated. Then we resume the execution from the deferring point in on_timer(): i = _reserve_segments.emplace(i, std::move(s)); The next time we need to allocate a segment, we'll pick it from the reserve. But the segment in the reserve has an id that is lower than the id that we have already used. Reorders are bad, but this one is particularly bad: because the reorder happens with the segment id side of the replay position, that means that every request that falls into that segment will have to be reinserted. This bug can be a bit tricky to reproduce. To make it more common, we can artificially add a sleep() fiber after the allocate_segment(false) in on_timer(). 
If we do that, we'll see a sea of reinsertions going on in the logs (if dblog is set to debug). Applying this patch (keeping the sleep) will make them all disappear. We do this by rewriting the reserve logic, so that the segments always come from the reserve. If we draw from a single pool all the time, there is no chance of reordering happening. To make that more amenable, we'll have the reserve filler always running in the background and take it out of the timer code. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>	2016-11-28 19:26:26 +01:00
Takuya ASADA	1042e40188	dist/common/scripts/scylla_kernel_check: fix incorrect document URL Fixes #1871 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>	2016-11-28 13:51:19 +02:00
Avi Kivity	18df2d9e9e	partition_version: fix const correctness in rows_entry_compare Using a non-const-correct comparator results in build failures with boost 1.55. Fixes #1892. Message-Id: <20161128104335.28789-1-avi@scylladb.com>	2016-11-28 10:55:12 +00:00
Avi Kivity	5358984982	Merge seastar upstream * seastar 93c3b12...7790e68 (7): > core/reactor: Introduce reactor-*/dervie-busy_ns metric > Collectd: Hold a reference to the metrics implementation in registration > future: Improve comments > fstream: actually use dynamically adjusted buffer > debug: add latency detector script > reactor: add static probes for latency detector > semaphore: Fix with_semaphore() in case wait() throws	2016-11-28 11:05:59 +02:00
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Glauber Costa	c32803f2f0	database: move reversion of virtual dirty state closer to update_cache. When we finish writing a memtable, we revert the dirty memory charges immediately. When we do that, dirty memory will grow back to what it was, and soon (we hope) will go down again when we release the requests for real. During that time, we may not accept new requests. Sealing can take a long time, especially in the face of Linux issues like the ones we have seen in the past. It also will take proportionally more time if the SSTables end up being small, which is a possibility in some scenarios. This patch changes the dirty_memory_manager so that the charges won't be reverted right after we finish the flush. Rather, we will hold on to it, and revert it right before we update the cache. We don't need to do it for all classes of memtable writes, because after we finish flushing, flush_one() will destroy the hashed element anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>	2016-11-24 18:18:15 +01:00
Raphael S. Carvalho	4781b6eb71	sstables: use nonwrapping_range::make to avoid compilation issues GCC 5.3.1 was unable to convert bound to optional<bound>. sstables/sstables.cc:2494:123: error: no matching function for call to ‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position, dht::ring_position)’ (dtr.right.exclusive ? dht::ring_position::starting_at : dht::ring_position::ending_at)(std::move(t2))); In file included from ./dht/i_partitioner.hh:52:0, from ./query-request.hh:28, from ./clustering_key_filter.hh:27, from sstables/sstables.hh:35, from sstables/sstables.cc:38: ./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range( const wrapping_range<U>&) [with T = dht::ring_position] explicit nonwrapping_range(const wrapping_range<T>& r) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>	2016-11-24 11:26:16 +02:00
Duarte Nunes	cc3f26c993	lz4: Conditionally use LZ4_compress_default() Since not all distributions have a version of LZ4 with LZ4_compress_default(), we use it conditionally. This is especially important beginning with version 1.7.3 of LZ4, which deprecates the LZ4_compress() function in favour of LZ4_compress_default() and thus prevents Scylla from compiling due to the deprecation warning. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161124092339.23017-1-duarte@scylladb.com>	2016-11-24 11:25:03 +02:00
Avi Kivity	1be95b1227	Merge seastar upstream * seastar d6f26d8...93c3b12 (3): > rpc: Conditionally use LZ4_compress_default() > queue: allow queue to change its maximum size > util/defer: add missing return to move assignment	2016-11-24 11:00:53 +02:00
Duarte Nunes	a527ba285f	thrift: Don't apply cell limit across rows In Thrift, SliceRange defines a count that limits the number of cells to return from that row (in CQL3 terms, it limits the number of rows in that partition). While this limit is honored in the engine, the Thrift layer also applies the same limit, which, while redundant in most cases, is used to support the get_paged_slice verb. Currently, the limit is not being reset per Thrift row (CQL3 partition), so in practice, instead of limiting the cells in a row, we're limiting the rows we return as well. This patch fixes that by ensuring the limit applies only within a row/partition. Fixes #1882 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161123220001.15496-1-duarte@scylladb.com>	2016-11-24 10:38:31 +02:00
Takuya ASADA	ce80fb3a39	dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart) Follow the change of NOFILE for non-systemd environment. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>	2016-11-24 10:13:41 +02:00
Avi Kivity	d58c8aaa32	db: remove unused belongs_to_{current,other}_shard(s) functions Obsoleted by new sharding mechanism, but break the build for some.	2016-11-23 21:39:29 +02:00
Avi Kivity	b81a57e8eb	config, dht: reduce default msb ignore bits to 4 With the default value of 12, a node's range is partitioned into 4096 * smp::count sub-ranges which are queried sequentially for a range scan. If the number of rows in the table is smaller than the required result size, we will query all of them. This can take so long that we time out. A better fix is to query multiple sub-ranges in parallel and merge them, but for that we need to resurrect the non-sequential merger.	2016-11-23 21:25:37 +02:00
Pekka Enberg	c526a9f0be	Update seastar submodule * seastar 7473945...d6f26d8 (2): > semaphore_units: add missing return statement > metrics: Do not detroy the metrics layer if it is been used	2016-11-23 20:27:09 +02:00
Paweł Dziepak	919825a2c7	Merge "Improve sharding in large clusters" from Avi "Clusters with a large number of nodes, or a low number of vnodes, and a high number of shards, or a combination, suffer from an aliasing problem: both vnodes and intra-node sharding consider the most significant bits to select the owning node and owning shard respectively. Since the same bits are used for both, a low number of vnodes leads to some shards being overcommitted relative to others. This series fixes the problem by sharding on bits 0:47 of the token (murmur3 partitioner only), leaving the most significant 12 bits for vnodes. Simulation shows that this value provides reasonable sharding for 100-node, 30-shard clusters. In order to prevent re-sharding sstables on each boot, token ranges for the range are stored in a new sub-component of the sstable Statistics component. With the default 12 ignored bits we have 4096 token ranges for non-Level-compacted SSTables, which takes some space but is still reasonable. Fixes #1277."	2016-11-23 11:25:53 +00:00
Glauber Costa	18b9fa3d43	dist: increase number of open files This limit was found to be too low for production environments. It would be hit at boot, when we're touching a lot of files from multiple shards before deciding that we don't need them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>	2016-11-23 13:10:25 +02:00
Avi Kivity	07d5a20bae	Wire up sharding ignore msb parameter to configuration We might have used a fancy map<sstring, any> to pass the parameters, but that's overkill for now.	2016-11-22 22:40:47 +02:00
Avi Kivity	8b1d689de8	partitioner: add ignore_msb parameters to byte ordered and random partitioners Ignored; doesn't make sense on byte ordered, and random is deprecated.	2016-11-22 21:56:42 +02:00
Avi Kivity	af16c0fac4	murmur3_partitioner: shard on the middle token bits, not most significant bits Sharding on the most significant token bits aliases with the vnode mechanism, which also uses the most significant bits; this requires a huge number of vnodes to achieve good sharding. This patch teaches the murmur3 partitioner to ignore the most significant N bits when calculating a token's shard, so we use token bits which still have some entropy. In effect, this changes the token range layout from shard 0 shard 1 ... shard S-1 to shard 0 shard 1 ... shard S-1 shard 0 shard 1 ... shard S-1 ... shard 0 shard 1 ... shard S-1 Where the number of repetitions of the block is 2^(ignored msb bits). For compatibility, the default is zero ignored bits, matching the pre-patch state, until we wire things up.	2016-11-22 21:56:42 +02:00
Avi Kivity	024c8ef8a1	db: adjust sstable load to use sstable self-reporting of shard ownership Instead of calculating the owning shard from the sstable's partition key range, delegate to the new sstable method for getting owning shard information. This insulates us from changes in the sharding algorithm.	2016-11-22 21:56:40 +02:00
Avi Kivity	98a4544e1c	sstables: add method to get sstable owning shards from an unloaded sstable When we load an sstable, we don't know beforehand which shards it belongs to; we don't want to open it until we do. Add a method that allows us to read just the sharding data, without opening anything else.	2016-11-22 21:52:23 +02:00
Avi Kivity	bdd11648ac	sstables: add intra-node sharding metadata Add a metadata component that describes token ranges that are spanned by this sstable. With the current sharding algorithm, where each shard owns a single token range, the first/last partition key is sufficient to describe sharding information, but for multi-range algorithms, this is not sufficient.	2016-11-22 21:44:25 +02:00
Avi Kivity	316ef1d70a	sstables: automate writing statistics components Add a virtual function to metadata_base so we can loop over statistics components when writing them.	2016-11-22 21:05:06 +02:00
Glauber Costa	13973e7f3b	keep background work semaphore alive during sstable flush We have a semaphore controlling the amount of background work generated by the memtable flush process. However, because we are not moving it inside the memtable post-flush continuation, the units are being released when we start the flush and not when we finish it. That's not the intended behavior and that can cause flushes to accumulate. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>	2016-11-22 19:54:08 +01:00
Avi Kivity	d05b22e502	sstables: automatically calculate offsets in statistics Instead of calculating the offset for each statistic component manually, use a loop to iterate over all components, accumulating the offset as we go along.	2016-11-22 20:35:24 +02:00
Avi Kivity	7c5e6525ef	sstables: switch statistics components to generic serialized_size() implementation	2016-11-22 20:20:38 +02:00
Avi Kivity	096ae59a5b	sstables: introduce generic serialized_size() Introduce a new function that reuses the file_writer code to compute the serialized size of an sstable object, by serializing it into memory and discarding the result.	2016-11-22 20:06:23 +02:00
Avi Kivity	3c06ffac9d	sstables: const correctness for the write(file_writer&, T&) functions write() doesn't need to change its input; so change it to const. The only snag is that describe_type() isn't and can't be made const-correct, so cheat when it is called and const_cast the input. This helps in writing a generic serialized_size() that is const correct, in the next patch.	2016-11-22 20:04:27 +02:00
Tomasz Grabiec	eefc538225	Update seastar submodule * seastar 7504026...7473945 (1): > Merge "Improve support for timeouts in primitives"	2016-11-22 17:51:29 +01:00
Glauber Costa	0b8b5abf16	commitlog: acquire semaphore earlier Recently we have changed our shutdown strategy to wait for the _request_controller semaphore to make sure no other allocations are in-flight. That was done to fix an actual issue. The problem is that this wasn't done early enough. We acquire the semaphore after we have already marked ourselves as _shutdown and released the timer. That means that if there is an allocation in flight that needs to use a new segment, it will never finish - and we'll therefore never acquire the semaphore. Fix it by acquiring it first. At this point the allocations will all be done and gone, and then we can shutdown everything else. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>	2016-11-21 22:19:32 +00:00
Avi Kivity	6bdb8ba31d	storage_proxy: don't query concurrently needlessly during range queries storage_proxy has an optimization where it tries to query multiple token ranges concurrently to satisfy very large requests (an optimization which is likely meaningless when paging is enabled, as it always should be). However, the rows-per-range code severely underestimates the number of rows per range, resulting in a large number of "read-ahead" internal queries being performed, the results of most of which are discarded. Fix by disabling this code. We should likely remove it completely, but let's start with a band-aid that can be backported. Fixes #1863. Message-Id: <20161120165741.2488-1-avi@scylladb.com>	2016-11-21 18:19:46 +02:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. 
Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Duarte Nunes	def2bc72b0	size_estimates_virtual_reader: Add unit test Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	6a37d87c76	db: Delete size_estimates_recorder Now that access to the size_estimates system is virtualized, we no longer need the recorder. Fixes #1616 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	225648780d	size_estimates: Add virtual reader This patch adds a virtual mutation_reader so that queries to the size_estimates system table are handled by the engine without needing to perform any IO. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	cd7e2fd602	column_family: Add support for virtual readers Virtual readers allow queries to selected tables, usually system tables, to be answered by the engine. This is useful for tables which aren't written by users and whose contents can be calculated on demand. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	c0d450c57d	storage_service: get_local_tokens() returns a future This patch changes the get_local_tokens() function in storage_service to return a future instead of requiring running under a seastar::thread. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	9b384d375f	nonwrapping_range: Add slice() function This patch adds the slice() function to nonwrapping_range, which uses its bounds to slice an input sequence. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	bdba8d99c3	range: Find a sequence's lower and upper bounds This patch extracts a pair of functions from mutation_partition to calculate the lower and upper bounds of a sequence from a nonwrapping_range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	636287fdf2	system_keyspace: Build mutations for size estimates This patch adds a function to system_keyspace responsible for creating a mutation to a partition of the size_estimates system table from a set of range_estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	18ddec245e	size_estimates: Store the token range as bytes This patch changes the range_estimates struct so that the tokens are represented as utf8 encoded bytes. This will make future patches require fewer conversions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:14:21 +00:00
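
The token range layout described in commit af16c0fac4 (the block "shard 0 ... shard S-1" repeating 2^(ignored msb bits) times) can be illustrated with a small sketch. This is not Scylla's code: the name `shard_of`, the premise that the token has already been mapped to an unsigned 64-bit value, and the multiply-high scaling are assumptions made for illustration, and `unsigned __int128` is a GCC/Clang extension.

```cpp
#include <cstdint>

// Hypothetical sketch: choose a shard for an unsigned 64-bit token while
// ignoring its top `ignore_msb_bits` bits (assumed to be less than 64).
constexpr unsigned shard_of(std::uint64_t token, unsigned shard_count, unsigned ignore_msb_bits) {
    // Shift the ignored bits out, then scale [0, 2^64) onto [0, shard_count)
    // by taking the high 64 bits of a 128-bit product.
    return static_cast<unsigned>(
        (static_cast<unsigned __int128>(token << ignore_msb_bits) * shard_count) >> 64);
}

// With 0 ignored bits the token space is one block: shard 0 ... shard S-1.
static_assert(shard_of(0x0000000000000000ULL, 4, 0) == 0, "");
static_assert(shard_of(0x4000000000000000ULL, 4, 0) == 1, "");
static_assert(shard_of(0xFFFFFFFFFFFFFFFFULL, 4, 0) == 3, "");

// With 1 ignored bit the block repeats 2^1 times across the token space,
// so the second half of the space starts again at shard 0.
static_assert(shard_of(0x4000000000000000ULL, 4, 1) == 2, "");
static_assert(shard_of(0x8000000000000000ULL, 4, 1) == 0, "");
```

Under these assumptions, raising the number of ignored bits makes each shard own many small interleaved token ranges rather than one contiguous range, which is why the sstable sharding metadata in bdd11648ac records multiple ranges.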


10786 Commits