scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 17:10:35 +00:00

Author	SHA1	Message	Date
Calle Wilund	1626c9ce5e	commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too The previous fix removed the additional insertion of "min rp" per source shard based on whether we had processed existing CF:s or not (i.e. if a CF does not exist as sstable at all, we must tag it as zero-rp, and make whole shard for it start at same zero. This is bad in itself, because it can cause data loss. It does not cause crashing however. But it did uncover another, old old lingering bug, namely the commitlog reader initiating its stream wrongly when reading from an actual offset (i.e. not processing the whole file). We opened the file stream from the file offset, then tried to read the file header and magic number from there -> boom, error. Also, rp-to-file mapping was potentially suboptimal due to using bucket iterator instead of actual range. I.e. three fixes: * Reinstate min position guarding for unencoutered CF:s * Fix stream creating in CL reader * Fix segment map iterator use. v2: * Fix typo Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com> (cherry picked from commit `b12b65db92`)	2017-03-28 11:16:17 +02:00
Calle Wilund	bc6def4a0f	commitlog_replayer: Do proper const-loopup of min positions for shards Fixes #2173 Per-shard min positions can be unset if we never collected any sstable/truncation info for it, yet replay segments of that id. Wrap the lookups to handle "missing data -> default", which should have been there in the first place. Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com> (cherry picked from commit `c3a510a08d`)	2017-03-22 18:46:17 +02:00
Calle Wilund	c814cb3433	commitlog_replayer: Make replay parallel per shard Fixes #2098 Replay previously did all segments in parallel on shard 0, which caused heavy memory load. To reduce this and spread footprint across shards, instead do X segments per shard, sequential per shard. v2: * Fixed whitespace errors Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com> (cherry picked from commit `078589c508`)	2017-03-20 12:50:43 +02:00
Paweł Dziepak	a096a173bf	Merge "Avoid loosing changes to keyspace parameters of system_auth and tracing keyspaces" form Tomek "If a node is bootstrapped with auto_boostrap disabled, it will not wait for schema sync before creating global keyspaces for auth and tracing. When such schema changes are then reconciled with schema on other nodes, they may overwrite changes made by the user before the node was started, because they will have higher timestamp. To prevent that, let's use minimum timestamp so that default schema always looses with manual modifications. This is what Cassandra does. Fixes #2129." * tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev: db: Create default auth and tracing keyspaces using lowest timestamp migration_manager: Append actual keyspace mutations with schema notifications (cherry picked from commit `6db6d25f66`)	2017-03-08 20:37:10 +01:00
Avi Kivity	cf27d44412	config: disable new sharding algorithm It still has problems: - while resharding a very large leveled compaction strategy table, a huge amount of tiny sstables are generated, overwhelming the file descriptor limits - there is a large impact on read latency while resharding is going on	2017-01-12 10:15:51 +02:00
Amnon Heiman	231cf22c0e	Set the prometheus prefix to scylla This patch make the prometheus prefix configurable and set the default value to scylla. Fixes #1964 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com> (cherry picked from commit `70b2a1bfd4`)	2016-12-27 17:57:08 +02:00
Avi Kivity	5c96b04f4d	Revert "config, dht: reduce default msb ignore bits to 4" This reverts commit `b81a57e8eb`. With exponential range scanning, we should now be able to survive msb ignore bits of 12, which allows better sharding on large clusters. (cherry picked from commit `3989e4ed15`)	2016-12-27 17:54:09 +02:00
Avi Kivity	a61ff53150	Merge "rework flush criteria" from Glauber "The current criteria for memtable flush is not being respected. The problem is demonstrated to happen when the dirty memory group is over limit, and so is the system table extra allowance. In that situation, both the normal region and the system table region will be under pressure and try to flush. More specifically, because the normal region inherits from the system region, if the normal region is under pressure (over the soft limit threshold), the system region will certainly be as well, even though it has an extra allowance. This is because after virtual dirty, we start blocking when we reach half the region, but memory itself can grow up to 100 % of the region. So the total amount of memory used will be certainly bigger than the system pressure threshold, which is now 50 % plus the allowance. To fix that, this patch reworks the flush logic so that the regions are not dependent on each other. Fixes #1918" * 'flush-criteria-v6' of github.com:glommer/scylla: config: get rid of memtable_total_space database: rework dirty memory hierarchy system keyspace: write batchlog mutation in user memory database: remove flush_token database: abstract pressure condition notification database: encapsulate semaphore_units into a flush_permit database: remove friendship declaration database: simplify flush_one database: make memtable_list aware in cases it can't flush	2016-12-14 11:24:10 +02:00
Glauber Costa	2aa6514667	config: get rid of memtable_total_space Those values are now statically set. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 17:05:12 -05:00
Glauber Costa	db7cc3cba8	system keyspace: write batchlog mutation in user memory Batchlog is a potentially memory-intensive table whose workload is driven by user needs, not system's. Move it to the user dirty memory manager. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:35 -05:00
Tomasz Grabiec	059a1a4f22	db: Fix commitlog replay to not drop cell mutations with older schema column_mapping is not safe to access across shards, because data_type is not safe to access. One of the manifestation of this is that abstract_type::is_value_compatible_with() always fails if the two types belong to different shards. During replay, column_mapping lives on the replaying shard, and is used by converting_mutation_partition_applier against the schema on the target shard. Since types in the mapping will be considered incompatible with types in the schema, all cells will be dropped. Fix by using column_mapping in a safe way, by copying it to the target shard if necessary. Each shard maintains its own cache of column mappings. Fixes #1924. Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>	2016-12-13 12:19:32 +02:00
Glauber Costa	9b5e6d6bd8	commitlog: correctly report requests blocked The semaphore future may be unavailable for many reasons. Specifically, if the task quota is depleted right between sem.wait() and the .then() clause in get_units() the resulting future won't be available. That is particularly visible if we decrease the task quota, since those events will be more frequent: we can in those cases clearly see this counter going up, even though there aren't more requests pending than usual. This patch improves the situation by replacing that check. We now verify whether or not there are waiters in the semaphore. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>	2016-12-09 15:02:26 +02:00
Tomasz Grabiec	f7197dabf8	commitlog: Fix replay to not delete dirty segments The problem is that replay will unlink any segments which were on disk at the time the replay starts. However, some of those segments may have been created by current node since the boot. If a segment is part of reserve for example, it will be unlinked by replay, but we will still use that segment to log mutations. Those mutations will not be visible to replay after a crash though. The fix is to record preexisting segents before any new segments will have a chance to be created and use that as the replay list. Introduced in `abe7358767`. dtest failure: commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>	2016-12-07 15:54:47 +02:00
Asias He	00d7a35949	utils: Put crc32 under utils namespace It conflicts with crc in zlib Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>	2016-12-05 11:48:29 +02:00
Glauber Costa	99a5a77234	prevent commitlog replay position reordering during reserve refill When requests hit the commitlog, each of them will be assigned a replay position, which we expect to be ordered. If reorders happen, the request will be discarded and re-applied. Although this is supposed to be rare, it does increase our latencies, specially when big requests are involved. Processing big requests is expensive and if we have to do it twice that adds to the cost. The commitlog is supposed to issue replay positions in order, and it coudl be that the code that adds them to the memtables will reorder them. However, there is one instance in which the commitlog will not keep its side of the bargain. That happens when the reserve is exhausted, and we are allocating a segment directly at the same time the reserve is being replenished. The following sequence of events with its deferring points will ilustrate it: on_timer: return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) { At this point, the segment id is already allocated. new_segment(): if (_reserve_segments.empty()) { [ ... ] return allocate_segment(true).then ... At this point, we have a new segment that has an id that is higher than the previous id allocated. Then we resume the execution from the deferring point in on_timer(): i = _reserve_segments.emplace(i, std::move(s)); The next time we need to allocate a segment, we'll pick it from the reserve. But the segment in the reserve has an id that is lower than the id that we have already used. Reorders are bad, but this one is particularly bad: because the reorder happens with the segment id side of the replay position, that means that every request that falls into that segment will have to be reinserted. This bug can be a bit tricky to reproduce. To make it more common, we can artificially add a sleep() fiber after the allocate_segment(false) in on_timer(). If we do that, we'll see a sea of reinsertions going on in the logs (if dblog is set to debug). Applying this patch (keeping the sleep) will make them all disappear. We do this by rewriting the reserve logic, so that the segments always come from the reserve. If we draw from a single pool all the time, there is no chance of reordering happening. To make that more amenable, we'll have the reserve filler always running in the background and take it out of the timer code. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>	2016-12-01 13:20:46 +01:00
Tomasz Grabiec	31645e2c4a	commitlog: Allow allocations to be timed out	2016-11-29 16:40:58 +01:00
Glauber Costa	353a4cd2d4	commitlog: sync segments before acquiring semaphore on shutdown. Sync all segments before acquiring the semaphore, otherwise waiting may have to wait for the timer to kick in and push them down. Note that we can't guarantee that no other requests were executed in the mean time, so we have to sync again. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>	2016-11-29 11:07:28 +02:00
Tomasz Grabiec	96c7764458	Revert "prevent commitlog replay position reordering during reserve refill" This reverts commit `0e9b75d406`. commitlog_test fails with this: Running 14 test cases... ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory) ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required) *** 1 failure detected in test suite "tests/commitlog_test.cc" WARN 2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)	2016-11-28 20:52:13 +01:00
Glauber Costa	0e9b75d406	prevent commitlog replay position reordering during reserve refill When requests hit the commitlog, each of them will be assigned a replay position, which we expect to be ordered. If reorders happen, the request will be discarded and re-applied. Although this is supposed to be rare, it does increase our latencies, specially when big requests are involved. Processing big requests is expensive and if we have to do it twice that adds to the cost. The commitlog is supposed to issue replay positions in order, and it coudl be that the code that adds them to the memtables will reorder them. However, there is one instance in which the commitlog will not keep its side of the bargain. That happens when the reserve is exhausted, and we are allocating a segment directly at the same time the reserve is being replenished. The following sequence of events with its deferring points will ilustrate it: on_timer: return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) { At this point, the segment id is already allocated. new_segment(): if (_reserve_segments.empty()) { [ ... ] return allocate_segment(true).then ... At this point, we have a new segment that has an id that is higher than the previous id allocated. Then we resume the execution from the deferring point in on_timer(): i = _reserve_segments.emplace(i, std::move(s)); The next time we need to allocate a segment, we'll pick it from the reserve. But the segment in the reserve has an id that is lower than the id that we have already used. Reorders are bad, but this one is particularly bad: because the reorder happens with the segment id side of the replay position, that means that every request that falls into that segment will have to be reinserted. This bug can be a bit tricky to reproduce. To make it more common, we can artificially add a sleep() fiber after the allocate_segment(false) in on_timer(). If we do that, we'll see a sea of reinsertions going on in the logs (if dblog is set to debug). Applying this patch (keeping the sleep) will make them all disappear. We do this by rewriting the reserve logic, so that the segments always come from the reserve. If we draw from a single pool all the time, there is no chance of reordering happening. To make that more amenable, we'll have the reserve filler always running in the background and take it out of the timer code. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>	2016-11-28 19:26:26 +01:00
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Avi Kivity	b81a57e8eb	config, dht: reduce default msb ignore bits to 4 With the default value of 12, a node's range is partitioned into 4096 * smp::count sub-ranges which are queried sequentually for a range scan. If the number of rows in the table is smaller than the required result size, we will query all of them. This can take so long that we time out. A better fix is to query multiple sub-ranges in parallel and merge them, but for that we need to resurrect the non-sequential merger.	2016-11-23 21:25:37 +02:00
Paweł Dziepak	919825a2c7	Merge "Improve sharding in large clusters" from Avi "Clusters with a large number of nodes, or a low number of vnodes, and a high number of shards, or a combination, suffer from an aliasing problem: both vnodes and intra-node sharding consider the most significant bits to select the owning node and owning shard respectively. Since the same bits are used for both, a low number of vnodes leads to some shards being overcommitted relative to others. This series fixes the problem by sharding on bits 0:47 of the token (murmur3 partitioner only), leaving the most significant 12 bits for vnodes. Simulation shows that this value provides reasonable sharding for 100-node, 30-shard clusters. In order to prevent re-sharding sstables on each boot, token ranges for the range are stored in a new sub-component of the sstable Statistics component. With the default 12 ignored bits we have 4096 token ranges for non-Level-compacted SSTables, which takes some space but is still reasonable. Fixes #1277."	2016-11-23 11:25:53 +00:00
Avi Kivity	07d5a20bae	Wire up sharding ignore msb parameter to configuration We might have used a fancy map<sstring, any> to pass the parameters, but that's overkill for now.	2016-11-22 22:40:47 +02:00
Glauber Costa	0b8b5abf16	commitlog: acquire semaphore earlier Recently we have changed our shutdown strategy to wait for the _request_controller semaphore to make sure no other allocations are in-flight. That was done to fix an actual issue. The problem is that this wasn't done early enough. We acquire the semaphore after we have already marked ourselves as _shutdown and released the timer. That means that if there is an allocation in flight that needs to use a new segment, it will never finish - and we'll therefore neve acquire the semaphore. Fix it by acquiring it first. At this point the allocations will all be done and gone, and then we can shutdown everything else. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>	2016-11-21 22:19:32 +00:00
Duarte Nunes	6a37d87c76	db: Delete size_estimates_recorder Now that access to the size_estimates system is virtualized, we no longer need the recorder. Fixes #1616 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	225648780d	size_estimates: Add virtual reader This patch add a virtual mutation_reader so that queries to the size_estimates system table are handled by the engine without needing to perform any IO. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	636287fdf2	system_keyspace: Build mutations for size estimates This patch adds a function to system_keyspace responsible for creating a mutation to a partition of the size_estimates system table from a set of range_estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	18ddec245e	size_estimates: Store the token range as bytes This patch changes the range_estimates struct so that the tokens are represented as utf8 encoded bytes. This will make future patches require less conversions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:14:21 +00:00
Duarte Nunes	e7a5162c1d	range_estimates: Add schema This will be used in future patches, when virtualizing the size_estimates system table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 10:56:32 +00:00
Glauber Costa	21c1e2b48c	commitlog: wait for pending allocations to finish before closing gate. allocations may enter the gate, so it would be wise for us to wait for them. Fixes #1860 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>	2016-11-20 19:45:33 +02:00
Glauber Costa	60b7d35f15	commitlog: close file after read, and not at stop There are other code paths that may interrupt the read in the middle and bypass stop. It's safer this way. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>	2016-11-18 14:09:33 +00:00
Glauber Costa	59a41cf7f1	commitlog: use read ahead for replay requests Aside from putting the requests in the commitlog class, read ahead will help us going through the file faster. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 14:09:54 -05:00
Glauber Costa	aa375cd33d	commitlog: use commitlog priority for replay Right now replay is being issued with the standard seastar priority. The rationale for that at the time is that it is an early event that doesn't really share the disk with anybody. That is largely untrue now that we start compactions on boot. Compactions may fight for bandwidth with the commitlog, and with such low priority the commitlog is guaranteed to lose. Fixes #1856 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 14:09:02 -05:00
Glauber Costa	4d3d774757	commitlog: close replay file Replay file is opened, so it should be closed. We're not seeing any problems arising from this, but they may happen. Enabling read ahead in this stream makes them happen immediately. Fix it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 12:35:24 -05:00
Glauber Costa	895e838ac0	get rid of max_memtable_size After recent changes to the memtable code, there is no reason for us to uphold a maximum memtable size. Now that we only flush one memtable at a time anyway, and also have soft limit notifications from the region_group_reclaimer, we can just set the soft limit to the target size and let all of that be handled by the dirty_memory_manager. It does have the added property that we'll be flushing when we globally reach the soft limit threshold. In conditions in which we have multiple CF writes fighting for memory, that guarantees that we will start flushing much earlier than the hard limit. The threshold is set to 1/4 of dirty memory. While in theory we would prefer the memtables to go as big as 1/2 of dirty memory, in my experiments I have found 1/4 to be a better fit, at least for the moment. The reason for such behavior is that in situations where we have slow disks, setting the soft limit to 1/2 of dirty will put us in a situation in which we may not have finished writing down the memtable when we hit the limit, and then throttle. When set the threshold to 1/4 of dirty, we don't throttle at all. This behavior could potentially be fixed by not doing the full memtable-based throttling after we do the commitlog throttling, but that is not something realistic for the moment. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Calle Wilund	11baf37ab5	commitlog: Prevent exceptions in stream::produce from being set twice Fixes #1775 stream lacks a check "is_open", which is a bummer. We have to both prevent exception propagation and add a flag of our own to make sure exceptions in producer code reaches consumer, and does not simply get lost in the reactor. Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>	2016-11-07 11:41:33 +01:00
Tomasz Grabiec	c1a7e2090e	Revert "database: change find_column_families signature so it returns a lw_shared_ptr" This reverts commit `f3528ede65`.	2016-11-04 10:48:21 +01:00
Glauber Costa	f3528ede65	database: change find_column_families signature so it returns a lw_shared_ptr There are places in which we need to use the column family object many times, with deferring points in between. Because the column family may have been destroyed in the deferring point, we need to go and find it again. If we use lw_shared_ptr, however, we'll be able to at least guarantee that the object will be alive. Some users will still need to check, if they want to guarantee that the column family wasn't removed. But others that only need to make sure we don't access an invalid object will be able to avoid the cost of re-finding it just fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Avi Kivity	75706c0a26	size_estimates_recorder: sort token range before rewrapping it Since size estimates are stored as wrapped ranges, we call compat::wrap() to convert from the now-standard unwrapped ranges back to wrapped ranges. However, compat::wrap() relies on the ranges being in sorted order, but our input is not. This leads to a crash as we find an unexpected empty token in the middle of the vector. Sort it so compat::wrap() works as expected. Fixes #1804. Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>	2016-11-03 09:43:41 +01:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Raphael S. Carvalho	a3e065da9b	db: make it possible to use custom error handler with io checker By default, io checker will cause Scylla to shutdown if it finds specific system errors. Right now, io checker isn't flexible enough to allow a specialized handler. For example, we don't want to Scylla to shutdown if there's an permission problem when uploading new files from upload dir. This desired flexibility is made possible here by allowing a handler parameter to io check functions and also changing existing code to take advantage of it. That's a step towards fixing #1709. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-27 15:54:21 -02:00
Glauber Costa	a13c410749	commitlog: cycle based on total size, not on mutation size We calculate two sizes during the allocation: "size", which is the in-segment size of this mutation, and "s", which is that plus the overhead. cycle() must be called with the latter, not the former, as doing otherwise may lead to buffer overflows. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>	2016-10-21 18:57:41 +03:00
Glauber Costa	d9875784a1	commitlog: do not wait on pending operations for batch mode This was explicitly mentioned in my set as gone in one of the versions. Somehow it came back in the final version - sorry about that. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>	2016-10-21 17:27:16 +03:00
Glauber Costa	d5618c6ace	commitlog: add total_operations type for requests_blocked_memory Current tracker for pending allocations is a queue_size GAUGE. Add a total_operations version so we have more insight on what's going on. It will be called requests_blocked_memory for consistency with other subsystems that track similar things. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-20 09:25:38 -04:00
Glauber Costa	1578d7363a	commitlog: rework blocking logic The current incarnation of commitlog establishes a maximum amount of writes that can be in-flight, and blocks new requests after that limit is reached. That is obviously something we must do, but the current approach to it is problematic for two main reasons: 1) It forces the requests that trigger a write to wait on the current write to finish. That is excessive; ideally we would wait for one particular write to finish, not necessarily the current one. That is made worse by the fact that when a write is followed by a flush (happens when we move to a new segment), then we must wait for all writes in that segment to finish. 1) it casts concurrency in terms of writes instead of memory, which makes the aforementioned problem a lot worse: if we have very big buffers in flight and we must wait for them to finish, that can take a long time, often in the order of seconds, causing timeouts. The approach taken by this patch is to replace the _write_semaphore with a request_controller. This data structure will account the amount of memory used by the buffers and set a limit on it. New allocations will be held until we go below that limit, and will be released as soon as this happens. This guarantees that the latencies introduced by this mechanism are spread out a lot better among requests and will keep higher percentile latencies in check. To test this, I have ran a workload that times out frequently. That workload use 10 threads to write 100 partitions (to isolate from the effects of the memtable introduced latencies) in a loop and each partition is 2MB in size. After 10 minutes running this load, we are left with the following percentiles: latency mean : 51.9 [WRITE:51.9] latency median : 9.8 [WRITE:9.8] latency 95th percentile : 125.6 [WRITE:125.6] latency 99th percentile : 1184.0 [WRITE:1184.0] latency 99.9th percentile : 1991.2 [WRITE:1991.2] latency max : 2338.2 [WRITE:2338.2] After this patch: latency mean : 54.9 [WRITE:54.9] latency median : 43.5 [WRITE:43.5] latency 95th percentile : 126.9 [WRITE:126.9] latency 99th percentile : 253.9 [WRITE:253.9] latency 99.9th percentile : 364.6 [WRITE:364.6] latency max : 471.4 [WRITE:471.4] Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:56:36 -04:00
Glauber Costa	aec724bbda	commitlog: factor out code for checking mutation size In a subsequent patch, I'll use this code in a different place. To prepare for that, we move it out as a method. It also fits a lot better inside the segment manager, so move it there. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	a50996f376	commitlog: calculate segment-independent size of mutations Goal is to calculate a size that is lesser or equal than the segment-dependent size. This was originally written by Tomasz, and featured in his submission "commitlog: Handle overload more gracefully" Extracted here so it sits clearly in a different patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	0b7c9fa17f	commitlog: remove _needed_size It is mostly an optimization, and while it makes sense in this context, it won't soon as we'll stop waiting for the current cycle specifically to finish. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	6214bdeb66	commitlog: move segment_manager constructor outside the class definition We'll do that so we can, in following patches, use static members from the segment. Those are not defined at this point. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	299877f432	commitlog: add a counter for pending allocations We track the amount of pending allocations but we don't really export it. It will be crucial when we stop tracking pending writes. This patch exports it through a method instead of the totals structure, so we can easily change it. Current code probing pending_allocations (the api code) is also converted to use the public method instead of the totals struct. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00

1 2 3 4 5 ...

744 Commits