scylladb

Author	SHA1	Message	Date
Glauber Costa	d2438059a7	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. 
Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com> (cherry picked from commit `0ca8c3f162`)	2016-11-21 18:18:56 +02:00
Glauber Costa	4539b8403a	database: fix direct flushes of non-durable column families. If a Column Family is non-durable, then its flushes will never create a memtable flush reader. Our current flush logic depends on that being created and destroyed to release the semaphore permits on the flush. We will remove the permits ourselves if there is an exception, but not under normal circumstances. Given this issue, however, it would be more adequate to always try to remove the permits after we flush. If the permits were already removed by the flush reader, then this test will just see that the permit is not in the map and return. But if it is still there, then it is removed. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com> (cherry picked from commit `1933349654`)	2016-11-18 21:33:22 +01:00
Raphael S. Carvalho	558f535fcb	db: do not leak deleted sstable when deletion triggers an exception The leakage results in deleted sstables being opened until shutdown, and disk space isn't released. That's because column_family::rebuild_sstable_list() will not remove reference to deleted sstables if an exception was triggered in sstables::delete_atomically(). A sstable only has its files closed when its object is destructed. The exception happens when a major compaction is issued in parallel to a regular one, and one of them will be unable to delete a sstable already deleted by the other. That results in remove_by_toc_name() triggering boost::filesystem::filesystem_error because TOC and temporary TOC don't exist. We wouldn't have seen this problem if major compaction were going through compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should be made resilient. Fixes #1840. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com> (cherry picked from commit `3dc9294023`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <760f96d81de0bab7507bb4f52c06b30f21e82577.1479420770.git.raphaelsc@scylladb.com>	2016-11-18 13:10:46 +02:00
Glauber Costa	3d45d0d339	fix shutdown and exception conditions for flush logic This patch addresses post-merge follow up comments by Tomek. Basically, what we do is: - we don't need to signal() from remove_from_flush_manager(), because the explicit flushes no longer wait on the condition variable. So we don't. - We now wait on the stop() flushes (regardless of their return status) so we can make sure that the _flush_queue will indeed be done with. - we acquire the semaphore before shutting down the dirty_memory_manager to make sure that there are no pending flushes - the flush manager that holds the semaphore has to match in the exception handler Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com> (cherry picked from commit `461778918b`)	2016-11-18 11:53:21 +02:00
Avi Kivity	affc0d9138	Merge "get rid of memtable size parameter and rework flush logic" from Glauber "This patchset allows Scylla to determine the size of a memtable instead of relying on the user-provided memtable_cleanup_threshold. It does that by allowing the region_group to specify a soft limit which will trigger the allocation as early as it is reached. Given that, we'll keep the memtables in memory for as long as it takes to reach that limit, regardless of the individual size of any single one of them. That limit is set to 1/4 of dirty memory. That's the same as last submission, except this time I have run some experiments to gauge behavior of that versus 1/2 of dirty memory, which was a preferred theoretical value. After that is done, the flush logic is reworked to guarantee that flushes are not initiated if we already have one memtable under flush. That allows us to better take advantage of coalescing opportunities with new requests and prevents the pending memtable explosion that is ultimately responsible for Issue 1817. I have run mainly two workloads with this. The first one is a local RF=1 workload with large partitions, sized 128kB and 100 threads. 
The results are: Before: op rate : 632 [WRITE:632] partition rate : 632 [WRITE:632] row rate : 632 [WRITE:632] latency mean : 157.8 [WRITE:157.8] latency median : 115.5 [WRITE:115.5] latency 95th percentile : 486.7 [WRITE:486.7] latency 99th percentile : 534.8 [WRITE:534.8] latency 99.9th percentile : 599.0 [WRITE:599.0] latency max : 722.6 [WRITE:722.6] Total partitions : 189667 [WRITE:189667] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END After: op rate : 951 [WRITE:951] partition rate : 951 [WRITE:951] row rate : 951 [WRITE:951] latency mean : 104.8 [WRITE:104.8] latency median : 102.5 [WRITE:102.5] latency 95th percentile : 155.8 [WRITE:155.8] latency 99th percentile : 177.8 [WRITE:177.8] latency 99.9th percentile : 686.4 [WRITE:686.4] latency max : 1081.4 [WRITE:1081.4] Total partitions : 285324 [WRITE:285324] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END The other workload was the workload described in #1817. And the result is that we now have a load that is very stable around 100k ops/s and hardly any timeouts, instead of the 1.4 baseline of wild variations around 100k ops/s and lots of timeouts, or the deep reduction of 1.5-rc1." * 'issue-1817-v4' of github.com:glommer/scylla: database: rework memtable flush logic get rid of max_memtable_size pass a region to dirty_memory_manager accounting API memtable: add a method to expose the region_group logalloc: allow region group reclaimer to specify a soft limit database: remove outdated comment database: uphold virtual dirty for system tables. (cherry picked from commit `5d067eebf2`)	2016-11-17 14:41:23 +02:00
Raphael S. Carvalho	fa308c079c	database: fix collectd metrics for clustering key filter Same instance name was used for exported metrics, which is definitely wrong. Checked it works properly now via collectd exporter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>	2016-10-22 09:51:18 +03:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	7bebfb851f	database: enable fast forwarding of range_sstable_reader When fast forwarding a reader that combines sstable readers we must also remember that the set of sstables for the new range may be different than for the previous one. The reader introduced in this patch makes sure that we read from the correct sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will make it possible to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Glauber Costa	33e9c2bbdd	memtable: reduce sstable flush concurrency to one Limiting the concurrency of memtable flushes to 4 was a temporary workaround for the fact that we lacked good write behind support. Now that write behind is properly merged we can reduce the concurrency to what it should be, one. This means that memtable flushes will now be serialized, and only when one of them ends will the next one begin. Disk parallelism is obtained through the write-behind mechanism. Fixes #1373 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>	2016-10-09 10:48:57 +03:00
Tomasz Grabiec	2a5a90f391	db: Do not timeout streaming readers There is a limit to concurrency of sstable readers on each shard. When this limit is exhausted (currently 100 readers) readers queue. There is a timeout after which queued readers are failed, equal to read_request_timeout_in_ms (5s by default). The reason we have the timeout here is primarily because the readers created for the purpose of serving a CQL request no longer need to execute after waiting longer than read_request_timeout_in_ms. The coordinator no longer waits for the result so there is no point in proceeding with the read. This timeout should not apply for readers created for streaming. The streaming client currently times out after 10 minutes, so we could wait at least that long. Timing out sooner makes streaming unreliable, which under high load may prevent streaming from completing. The change sets no timeout for streaming readers at replica level, similarly as we do for system tables readers. Fixes #1741. Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>	2016-10-07 15:41:04 +03:00
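The admission rule described in commit 2a5a90f391 can be sketched roughly as follows. This is a minimal illustration in Python with asyncio, not Scylla's C++ implementation: the class `ReaderLimiter`, its method names, and the `streaming` flag are hypothetical. Queued CQL readers give up after a timeout, while streaming readers wait without one.

```python
import asyncio


class ReaderLimiter:
    """Hypothetical per-shard limiter for concurrent sstable readers."""

    def __init__(self, max_readers, read_timeout_s):
        # The commit message mentions a limit of 100 readers and a 5 s default
        # timeout; both values are passed in here rather than hard-coded.
        self._sem = asyncio.Semaphore(max_readers)
        self._read_timeout_s = read_timeout_s

    async def admit(self, streaming=False):
        # Streaming readers queue with no timeout; regular readers fail with
        # asyncio.TimeoutError once they have waited longer than the timeout.
        timeout = None if streaming else self._read_timeout_s
        await asyncio.wait_for(self._sem.acquire(), timeout)

    def release(self):
        self._sem.release()
```

A caller would pair each successful `admit()` with a `release()` once the reader is destroyed.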
Raphael S. Carvalho	7ea4513595	database: trigger compaction after loading new sstables Scylla wasn't trying to compact new sstables uploaded via 'nodetool refresh'. Thus, all new sstables were left uncompacted until user issued 'nodetool flush' or a new sstable was written which would trigger compaction too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>	2016-10-06 18:26:49 +03:00
Avi Kivity	f8118d9fc2	Merge "Virtual dirty memory management" from Glauber "Description: ============ Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that, is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Results ======= With this patchset running a load big enough to easily saturate the disk, (commitlog disabled to highlight the effects of the memtable writer), I am able to run scylla for many minutes, with timeouts occurring only when I run out of disk space, whereas without this patch a swarm of timeouts would start merely 2 seconds after the load started - and would never get stable. In V2, I have sent a set of graphs illustrating the performance of this solution. This version does not have any significant differences in that front. 
For details, please refer to https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ Accuracy of the accounting: --------------------------- It is important for us to be as accurate as possible when accounting freed memory, since every byte we mark as freed may allow one or more requests to be executed. I have measured the accuracy of this approach (ignoring padding, object size for the mutation fragments) to be 99.83 % of used memory in the test workload I have run (large, 65k mutations). Memtables under this circumstance tend to have a very high occupancy ratio because throttle breeds idle, and idle breeds compact-on-idle. Known Issues: ------------- A lot of time can elapse between destroying the flush_reader and actually releasing memory. The release of memory only happens when the SSTable is fully sealed, and we have to flush the files, as well as finish writing all SSTable components at this point. This happened in practice with a buggy kernel that would result in flushes taking a long time. After that is fixed, this is just a theoretical problem and in practice it shouldn't matter given the time we expect those operations to take." * 'virtual-dirty-v6' of github.com:glommer/scylla: database: allow virtual dirty memory management streamed_mutation: make _buffer private add accounting of memory read to partition_snapshot_reader move partition_snapshot_reader code to header file LSA: allow a group to query its own region group memtables: split scanning reader in two sstables: use special reader for writing a memtable LSA: export information about object memory footprint LSA: export information about size of the throttle queue database: export virtual dirty bytes region group	2016-10-04 20:57:52 +03:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
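The accounting idea in commit f89a67c75c can be illustrated with a small sketch. It assumes a single counter pair and invented names (`VirtualDirty`, `can_admit`, `flushed_some`, `sstable_sealed`); the real code works on LSA region groups in C++. Requests are throttled at 50 % of the dirty limit, and bytes already written to disk count as virtually freed before the SSTable is sealed.

```python
class VirtualDirty:
    """Hypothetical model of real vs. virtual dirty memory."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.real_dirty = 0        # bytes actually held by memtables
        self.virtually_freed = 0   # bytes already written by an ongoing flush

    @property
    def virtual_dirty(self):
        return self.real_dirty - self.virtually_freed

    def can_admit(self):
        # Throttle at half of the limit instead of at 100 %.
        return self.virtual_dirty < self.limit // 2

    def write(self, nbytes):
        self.real_dirty += nbytes

    def flushed_some(self, nbytes):
        # Part of a memtable reached disk: release requests at disk speed.
        self.virtually_freed += nbytes

    def sstable_sealed(self, nbytes):
        # The memory is really released only once the SSTable is sealed.
        self.real_dirty -= nbytes
        self.virtually_freed -= nbytes
```

With a limit of 100 bytes, writing 50 bytes blocks admission; marking 10 of them as flushed reopens it even though the real dirty figure is still 50.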
Raphael S. Carvalho	747b42299c	database: remove unused code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>	2016-10-04 09:26:43 +03:00
Raphael S. Carvalho	a3bf7558f2	lcs: fix broken token range distribution at higher levels Uniform token range distribution across sstables in a level > 1 was broken, because we were only choosing sstable with lowest first key, when compacting a level > 0. This resulted in performance problem because L1->L2 may have a huge overlap over time, for example. Last compacted key will now be stored for each level to ensure sort of "round robin" selection of sstables for compactions at level >= 1. That's also done by C*, and they were once affected by it as described in https://issues.apache.org/jira/browse/CASSANDRA-6284. Fixes #1719. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-30 14:09:16 -03:00
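A rough sketch of the "round robin" selection mentioned in commit a3bf7558f2, assuming each sstable in a level is represented only by its first key and that the level keeps the last compacted key. The function name `pick_next_candidate` and the integer keys are illustrative, not taken from the Scylla source.

```python
import bisect


def pick_next_candidate(first_keys, last_compacted_key):
    """Pick the sstable whose first key follows the last compacted key.

    first_keys must be sorted. Wraps around to the lowest key when the
    end of the level is reached. Returns None for an empty level.
    """
    if not first_keys:
        return None
    if last_compacted_key is None:
        return first_keys[0]
    i = bisect.bisect_right(first_keys, last_compacted_key)
    return first_keys[i % len(first_keys)]
```

Always choosing the lowest first key would correspond to ignoring `last_compacted_key`, which is the behavior the commit describes as broken.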
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about how long the oldest request has been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named for consistency after similar counters from transport/. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Glauber Costa	aa6a96d09b	database: export virtual dirty bytes region group Currently, we export the region group where memtables are placed as dirty bytes. Upcoming patches will optimistically mark some bytes in this region as free, a scheme we know as "virtual dirty". We are still interested in knowing the real state of the dirty region, so we will keep track of the bytes virtually freed and split the counters in two. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Asias He	b505e34062	database: Introduce make_streaming_reader The make_streaming_reader returns a combined mutation reader that reads mutations from sstables and memtable. The memtable reader handles memtable flushing automatically so no special handling is needed here. It will be used by streaming soon.	2016-09-26 16:02:48 +08:00
Raphael S. Carvalho	67343798cf	api: implement api to return sstable count per level 'nodetool cfstats' wasn't showing per-level sstable count because the API wasn't implemented. Fixes #1119. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>	2016-09-21 09:13:40 +03:00
Raphael S. Carvalho	dffb41f9d8	sstables: remove schema parameter from some sstable methods schema can now be found in the sstable object itself. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>	2016-09-19 13:25:58 +02:00
Calle Wilund	f126cf769a	column_family: Ensure flush() waits for all previous flushes + self Fixes #1577 Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>	2016-09-14 11:00:41 +01:00
Tomasz Grabiec	a498da1987	database: Ignore spaces in initial_token list Currently we get a boost::lexical_cast error on startup if initial_token has a list which contains spaces after commas, e.g.: initial_token: -1100081313741479381, -1104041856484663086, ... Fixes #1664. Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>	2016-09-14 11:58:13 +03:00
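The fix in commit a498da1987 amounts to trimming whitespace around each comma-separated token before converting it. A minimal sketch in Python, where the helper name `parse_initial_tokens` is hypothetical and the actual change is in Scylla's C++ startup code:

```python
def parse_initial_tokens(value):
    """Split a comma-separated token list, ignoring surrounding spaces."""
    return [int(tok.strip()) for tok in value.split(",") if tok.strip()]
```

Applied to the two tokens quoted in the commit message, this returns both as integers whether or not a space follows the comma.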
Raphael S. Carvalho	b9f67351da	db: expose clustering filter info via collectd That's needed to observe behavior of clustering filter, and to check if it's worthwhile for a specific workload. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:32:23 -03:00
Raphael S. Carvalho	a2dc88889d	db: enable clustering optimization only on dtcs Leveled strategy will not benefit from this strategy because there's only a few sstables that will contain a given partition key, which means that a clustering key that belongs to a specific partition key can only be in a few sstables as well. Date tiered strategy is the one that will actually benefit the most from this optimization. Size tiered may benefit from it too if clustering key isn't overwritten, but it will not use the clustering optimization. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:31:07 -03:00
Raphael S. Carvalho	8d03ccd604	sstables: optimize reads with clustering filter If user specifies a clustering filter, it's possible to filter out sstable based on its metadata that tracks min/max clustering value. For example, if sstable stores clustering key from 'a' through 'c', it's possible to filter out that sstable if user asks for data with clustering key greater than 'c'. That's done by comparing each component separately because clustering key may be composite. Further information can be found here: https://issues.apache.org/jira/browse/CASSANDRA-5514 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:51:50 -03:00
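The filtering idea in commit 8d03ccd604 can be sketched as a range overlap check against the sstable's min/max clustering metadata. The sketch below is a simplification: it compares whole composite keys as Python tuples (lexicographic order) rather than reproducing the per-component comparison the commit describes, and the function name and parameters are hypothetical.

```python
def sstable_may_match(sst_min, sst_max, lower, lower_inclusive=True, upper=None):
    """Return False when the sstable cannot hold keys in the queried range.

    sst_min / sst_max: smallest and largest clustering key in the sstable.
    lower / upper: bounds of the query (upper=None means unbounded above).
    All keys are tuples so composite clustering keys compare by component.
    """
    if lower is not None:
        if lower > sst_max or (lower == sst_max and not lower_inclusive):
            return False  # every key in the sstable is below the range
    if upper is not None and upper < sst_min:
        return False      # every key in the sstable is above the range
    return True
```

For an sstable holding clustering keys from `('a',)` through `('c',)`, a query for keys greater than `('c',)` yields `False`, so the sstable can be skipped, which matches the example in the commit message.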
Raphael S. Carvalho	004617839d	database: check bloom filter of all sstables earlier All sstables will now have bloom filter checked in a single pass before the reader iterates through all candidates. It's possible that we will need to futurize the procedure if it holds cpu for too long. This change is also a step towards the optimization that will rule out sstables based on clustering filter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:50:08 -03:00
Raphael S. Carvalho	1f31223f32	sstables: store schema in sstable object That will be needed for optimization that will store decorated keys in the sstable object, and also for a subsequent work that will detect wrong metadata (min/max column names) by looking at columns in the schema. As schema is stored in sstable, there's no longer a need to store ks and cf names in it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:17 -03:00
Glauber Costa	dc5d8e33af	Revert "row_cache: update sstable histograms on cache hits" This reverts commit `1726b1d0cc`. Reverting this patch turns our SSTable access counter into a miss counter only. The estimated histogram always starts its first bucket at 1, so by marking cache accesses we will be wrongly feeding "1" into the buckets. Notice that this is not yet ideal: nodetool is supposed to show a histogram of all reads, and by doing this we are changing its meaning slightly. Workloads that serve mostly from cache will be distorted towards their misses. The real solution is to use a different histogram, but we will need to enforce a newer version of nodetool for that: the current issue is that nodetool expects an EstimatedHistogram in a specific format in the other side. Conflicts: row_cache.hh Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com> Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-01 18:07:31 +03:00
Duarte Nunes	ba374da043	database: Trace sstable accesses This patch traces when we read from an sstable, be it a key range or a single one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	030db65c62	database: Accept a trace_state_ptr This patch changes the database and column_family types so a trace_state_ptr can be passed in when querying. This enables tracing of the inner components. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:28 +02:00
Glauber Costa	1726b1d0cc	row_cache: update sstable histograms on cache hits If we have a cache hit, we still need to update our sstable histogram - noting that we have touched 0 SSTables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:22 -04:00
Glauber Costa	ce24fd05fe	database: keep statistics on SSTables touched per read That is done for single partition queries only - mimicking what Cassandra does on that matter. For this to be correct, we also need to update this histogram on cache hits - in which case we update the read as having touched 0 SSTables. That will be done on a separate patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:21 -04:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Raphael S. Carvalho	108fd1fade	database: close file in lister After listing is done, let's close file. This fixes no bug. It's only an improvement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>	2016-08-17 11:01:44 +03:00
Glauber Costa	b361dee488	database: memtables pending flushes tell us nothing We have two counters that track how many memtable flushes are in progress, and how much memory they are pinning. The problem is, after we have revamped the code to limit the amount of flushes in progress, those counters became useless: as they live inside the semaphore side, they will only be incremented once we have passed the semaphore. One wouldn't notice if working with CPU-bound problems, where memtables don't pile up. But as soon as they do, those counters will always show the same numbers: the depth of the semaphore, which doesn't mean much. The problem is poised to become much worse: once we enable write behind in full and set the semaphore's depth to one, that's the number we'll see here all the time. The fix is to move the counters outside the semaphore, which will bring back their old semantics. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>	2016-08-17 10:54:15 +03:00
Paweł Dziepak	8a386a51bd	Merge "Don't cache wide partitions" from Piotr "When reading a partition try to read it all but once more bytes are read than a given limit we decide that partition is wide and we don't cache it. Instead we retry the read with clustering key filtering applied."	2016-07-21 10:24:25 +01:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Tomasz Grabiec	0d26294fac	database: Add table name to log message about sealing Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:31 +03:00
Tomasz Grabiec	a0832f08d2	schema_tables: Add more logging Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:00 +03:00
Duarte Nunes	3518db531e	database: Get non-system column_families This patch adds a utility function that allows fetching the set of column_families that do not belong to the system keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	4bc00c2055	database: Expose selection of sstables by a range This patch allows a set of a column_family's sstables to be selected according to a range of ring_positions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Avi Kivity	1048e1071b	db: do not create column family directories belonging to foreign keyspaces Currently, for any column family, we create a directory for it in all keyspace directories. This is incredibly awkward. Fix by iterating over just the keyspace's column families, not all column families in existence. Fixes #1457. Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>	2016-07-14 14:31:05 +03:00
Avi Kivity	23edc1861a	db: estimate queued read size more conservatively There are plenty of continuations involved, so don't assume it fits in 1k. Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>	2016-07-14 11:42:24 +02:00
Avi Kivity	d3c87975b0	db: don't over-allocate memory for mutation_reader column_family::make_reader() doesn't deal with sstables directly, so it doesn't need to reserve memory for them. Fixes #1453. Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>	2016-07-14 10:01:42 +02:00
Avi Kivity	24e3026e32	Merge "compaction manager refactoring" from Raphael	2016-07-10 17:16:23 +03:00
Tomasz Grabiec	6a1f9a9b97	db: Improve logging Message-Id: <1467997671-16570-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 16:15:03 +03:00
Tomasz Grabiec	c0233c877d	db: Avoid out-of-memory when flushing cannot keep up memtable_list::seal_on_overflow() is called on each mutation to check if current memtable should be flushed. It will call memtable_list::seal_active_memtable() when that is the case. The number of concurrent seals is guarded by a semaphore, starting from commit `0f64eb7e7d`, and allows at most 4 of them. If there are 4 flushes already pending, every incoming mutation will enqueue a new flush task on the semaphore's wait list, without waiting for it. The wait queue can grow without bounds, eventually leading to out-of-memory. The fix is to seal the memtable immediately to satisfy should_flush() condition, but limit concurrency of actual flushes. This way the wait queue size on the semaphore is limited by memtables pending a flush, which is fairly limited. Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 10:53:51 +03:00
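The effect of the fix in commit c0233c877d can be modeled with a toy sketch: the active memtable is sealed as soon as it overflows, so later mutations land in a fresh memtable and do not each enqueue another flush task. The class `MemtableListModel` and its fields are invented for illustration, and the sketch ignores the semaphore that bounds how many sealed memtables are flushed concurrently.

```python
class MemtableListModel:
    """Toy model: one queued flush per sealed memtable, not per mutation."""

    def __init__(self, seal_threshold):
        self.seal_threshold = seal_threshold
        self.active_size = 0
        self.pending_flush = []  # sizes of sealed memtables awaiting a flush slot

    def should_flush(self):
        return self.active_size >= self.seal_threshold

    def apply(self, mutation_size):
        self.active_size += mutation_size
        if self.should_flush():
            # Seal immediately: should_flush() turns false again right away,
            # so the next mutation does not enqueue another flush task.
            self.pending_flush.append(self.active_size)
            self.active_size = 0
```

Feeding 100 mutations of size 1 with a threshold of 10 leaves 10 entries in `pending_flush`, one per sealed memtable, rather than one per mutation that arrived after the overflow.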
Raphael S. Carvalho	e38f66c6fe	database: make certain column family functions const qualified Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:22 -03:00
Vlad Zolotarov	f2bf453be2	database: revive mutation retry in case of replay_position_reordered_exception The logic that would retry applying a mutation in case of a replay_position_reordered_exception error was broken by a commit `0c31f3e626` Author: Glauber Costa <glauber@scylladb.com> Date: Wed Apr 20 19:09:21 2016 -0400 database: move memtable throttler to the LSA throttler This patch makes it work again. Fixes #1439 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1467893342-30559-1-git-send-email-vladz@cloudius-systems.com>	2016-07-07 15:00:35 +02:00


632 Commits