scylladb

Author	SHA1	Message	Date
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	7bebfb851f	database: enable fast forwarding of range_sstable_reader When fast forwarding a reader that combines sstable reader we must also remember that the set of sstables for the new range may be different than for the previous one. The reader introduced in this patch makes sure that we read from correct sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Glauber Costa	33e9c2bbdd	memtable: reduce sstable flush concurrency to one Limiting the concurrency of memtable flushes to 4 was a temporary workaround for the fact that we lacked good write behind support. Now that write behind is properly merged we can reduce the concurrency to what it should be, one. This means that memtable flushes will now be serialized, and only when one of them ends will the next one begin. Disk parallelism is obtained through the write-behind mechanism. Fixes #1373 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>	2016-10-09 10:48:57 +03:00
Tomasz Grabiec	2a5a90f391	db: Do not timeout streaming readers There is a limit to concurrency of sstable readers on each shard. When this limit is exhausted (currently 100 readers) readers queue. There is a timeout after which queued readers are failed, equal to read_request_timeout_in_ms (5s by default). The reason we have the timeout here is primarily because the readers created for the purpose of serving a CQL request no longer need to execute after waiting longer than read_request_timeout_in_ms. The coordinator no longer waits for the result so there is no point in proceeding with the read. This timeout should not apply for readers created for streaming. The streaming client currently times out after 10 minutes, so we could wait at least that long. Timing out sooner makes streaming unreliable, which under high load may prevent streaming from completing. The change sets no timeout for streaming readers at replica level, similarly as we do for system tables readers. Fixes #1741. Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>	2016-10-07 15:41:04 +03:00
Raphael S. Carvalho	7ea4513595	database: trigger compaction after loading new sstables Scylla wasn't trying to compact new sstables uploaded via 'nodetool refresh'. Thus, all new sstables were left uncompacted until user issued 'nodetool flush' or a new sstable was written which would trigger compaction too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>	2016-10-06 18:26:49 +03:00
Avi Kivity	f8118d9fc2	Merge "Virtual dirty memory management" from Glauber "Description: ============ Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that, is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Results ======= With this patchset running a load big enough to easily saturate the disk, (commitlog disabled to highlight the effects of the memtable writer), I am able to run scylla for many minutes, with timeouts occurring only when I run out of disk space, whereas without this patch a swarm of timeouts would start merely 2 seconds after the load started - and would never get stable. In V2, I have sent a set of graphs illustrating the performance of this solution. This version does not have any significant differences in that front. 
For details, please refer to https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ Accuracy of the accounting: --------------------------- It is important for us to be as accurate as possible when accounting freed memory, since every byte we mark as freed may allow one or more requests to be executed. I have measured the accuracy of this approach (ignoring padding, object size for the mutation fragments) to be 99.83 % of used memory in the test workload I have run (large, 65k mutations). Memtables under this circumstance tend to have a very high occupancy ratio because throttle breeds idle, and idle breeds compact-on-idle. Known Issues: ------------- A lot of time can elapse between destroying the flush_reader and actually releasing memory. The release of memory only happens when the SSTable is fully sealed, and we have to flush the files, as well as finish writing all SSTable components at this point. This happened in practice with a buggy kernel that would result in flushes taking a long time. After that is fixed, this is just a theoretical problem and in practice it shouldn't matter given the time we expect those operations to take." * 'virtual-dirty-v6' of github.com:glommer/scylla: database: allow virtual dirty memory management streamed_mutation: make _buffer private add accounting of memory read to partition_snapshot_reader move partition_snapshot_reader code to header file LSA: allow a group to query its own region group memtables: split scanning reader in two sstables: use special reader for writing a memtable LSA: export information about object memory footprint LSA: export information about size of the throttle queue database: export virtual dirty bytes region group	2016-10-04 20:57:52 +03:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Raphael S. Carvalho	747b42299c	database: remove unused code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>	2016-10-04 09:26:43 +03:00
Raphael S. Carvalho	a3bf7558f2	lcs: fix broken token range distribution at higher levels Uniform token range distribution across sstables in a level > 1 was broken, because we were only choosing sstable with lowest first key, when compacting a level > 0. This resulted in performance problem because L1->L2 may have a huge overlap over time, for example. Last compacted key will now be stored for each level to ensure sort of "round robin" selection of sstables for compactions at level >= 1. That's also done by C*, and they were once affected by it as described in https://issues.apache.org/jira/browse/CASSANDRA-6284. Fixes #1719. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-30 14:09:16 -03:00
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about for how long has the oldest been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named for consistency after similar counters from transport/. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Glauber Costa	aa6a96d09b	database: export virtual dirty bytes region group Currently, we export the region group where memtables are placed as dirty bytes. Upcoming patches will optimistically mark some bytes in this region as free, a scheme we know as "virtual dirty". We are still interested in knowing the real state of the dirty region, so we will keep track of the bytes virtually freed and split the counters in two. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Asias He	b505e34062	database: Introduce make_streaming_reader The make_streaming_reader returns a combined mutation reader that reads mutations from sstables and memtable. The memtable reader handles memtable flushing automatically so no special handling is needed here. It will be used by streaming soon.	2016-09-26 16:02:48 +08:00
Raphael S. Carvalho	67343798cf	api: implement api to return sstable count per level 'nodetool cfstats' wasn't showing per-level sstable count because the API wasn't implemented. Fixes #1119. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>	2016-09-21 09:13:40 +03:00
Raphael S. Carvalho	dffb41f9d8	sstables: remove schema parameter from some sstable methods schema can now be found in the sstable object itself. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>	2016-09-19 13:25:58 +02:00
Calle Wilund	f126cf769a	column_family: Ensure flush() waits for all previous flushes + self Fixes #1577 Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>	2016-09-14 11:00:41 +01:00
Tomasz Grabiec	a498da1987	database: Ignore spaces in initial_token list Currently we get boost::lexical_cast on startup if initial_token has a list which contains spaces after commas, e.g.: initial_token: -1100081313741479381, -1104041856484663086, ... Fixes #1664. Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>	2016-09-14 11:58:13 +03:00
Raphael S. Carvalho	b9f67351da	db: expose clustering filter info via collectd That's needed to observe behavior of clustering filter, and to check if it's worthwhile for a specific workload. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:32:23 -03:00
Raphael S. Carvalho	a2dc88889d	db: enable clustering optimization only on dtcs Leveled strategy will not benefit from this strategy because there's only a few sstables that will contain a given partition key, which means that a clustering key that belongs to a specific partition key can only be in a few sstables as well. Date tiered strategy is the one that will actually benefit the most from this optimization. Size tiered may benefit from it too if clustering key isn't overwritten, but it will not use the clustering optimization. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:31:07 -03:00
Raphael S. Carvalho	8d03ccd604	sstables: optimize reads with clustering filter If user specifies a clustering filter, it's possible to filter out sstable based on its metadata that tracks min/max clustering value. For example, if sstable stores clustering key from 'a' through 'c', it's possible to filter out that sstable if user asks for data with clustering key greater than 'c'. That's done by comparing each component separately because clustering key may be composite. Further information can be found here: https://issues.apache.org/jira/browse/CASSANDRA-5514 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:51:50 -03:00
Raphael S. Carvalho	004617839d	database: check bloom filter of all sstables earlier All sstables will now have bloom filter checked in a single pass before reader iterate through all candidates. It's possible that we will need to futurize the procedure if it holds cpu for too long. This change is also a step towards the optimization that will rule out sstables based on clustering filter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:50:08 -03:00
Raphael S. Carvalho	1f31223f32	sstables: store schema in sstable object That will be needed for optimization that will store decorated keys in the sstable object, and also for a subsequent work that will detect wrong metadata (min/max column names) by looking at columns in the schema. As schema is stored in sstable, there's no longer a need to store ks and cf names in it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:17 -03:00
Glauber Costa	dc5d8e33af	Revert "row_cache: update sstable histograms on cache hits" This reverts commit `1726b1d0cc`. Reverting this patch turns our SSTable access counter into a miss counter only. The estimated histogram always starts its first bucket at 1, so by marking cache accesses we will be wrongly feeding "1" into the buckets. Notice that this is not yet ideal: nodetool is supposed to show a histogram of all reads, and by doing this we are changing its meaning slightly. Workloads that serve mostly from cache will be distorted towards their misses. The real solution is to use a different histogram, but we will need to enforce a newer version of nodetool for that: the current issue is that nodetool expects an EstimatedHistogram in a specific format in the other side. Conflicts: row_cache.hh Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com> Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-01 18:07:31 +03:00
Duarte Nunes	ba374da043	database: Trace sstable accesses This patch traces when we read from an sstable, be it a key range or a single one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	030db65c62	database: Accept a trace_state_ptr This patch changes the database and column_family types so a trace_state_ptr can be passed in when querying. This enables tracing of the inner components. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:28 +02:00
Glauber Costa	1726b1d0cc	row_cache: update sstable histograms on cache hits If we have a cache hit, we still need to update our sstable histogram - noting that we have touched 0 SSTables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:22 -04:00
Glauber Costa	ce24fd05fe	database: keep statistics on SSTables touched per read That is done for single partition queries only - mimicking what Cassandra does on that matter. For this to be correct, we also need to update this histogram on cache hits - in which case we update the read as having touched 0 SSTables. That will be done on a separate patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:21 -04:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Raphael S. Carvalho	108fd1fade	database: close file in lister After listing is done, let's close file. This fixes no bug. It's only an improvement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>	2016-08-17 11:01:44 +03:00
Glauber Costa	b361dee488	database: memtables pending flushes tell us nothing We have two counters that track how many memtable flushes are in progress, and how much memory they are pinning. The problem is, after we have revamped the code to limit the amount of flushes in progress, those counters became useless: as they live inside the semaphore side, they will only be incremented once we have passed the semaphore. One wouldn't notice if working with CPU-bound problems, where memtables don't pile. But as soon as they do, those counters will always show the same numbers: the depth of the semaphore, which doesn't mean much. The problem is poised to become much worse: once we enable write behind in full and set the semaphore's depth to one, that's the number we'll see here all the time. The fix is to move the counters outside the semaphore, which will bring back its old semantics. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>	2016-08-17 10:54:15 +03:00
Paweł Dziepak	8a386a51bd	Merge "Don't cache wide partitions" from Piotr "When reading a partition try to read it all but once more bytes are read than a given limit we decide that partition is wide and we don't cache it. Instead we retry the read with clustering key filtering applied."	2016-07-21 10:24:25 +01:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Tomasz Grabiec	0d26294fac	database: Add table name to log message about sealing Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:31 +03:00
Tomasz Grabiec	a0832f08d2	schema_tables: Add more logging Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:00 +03:00
Duarte Nunes	3518db531e	database: Get non-system column_families This patch adds an utility function that allows fetching the set of column_families that do not belong to the system keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	4bc00c2055	database: Expose selection of sstables by a range This patch allows a set of a column_family's sstables to be selected according to a range of ring_positions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Avi Kivity	1048e1071b	db: do not create column family directories belonging to foreign keyspaces Currently, for any column family, we create a directory for it in all keyspace directories. This is incredibly awkward. Fix by iterating over just the keyspace's column families, not all column families in existence. Fixes #1457. Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>	2016-07-14 14:31:05 +03:00
Avi Kivity	23edc1861a	db: estimate queued read size more conservatively There are plenty of continuations involved, so don't assume it fits in 1k. Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>	2016-07-14 11:42:24 +02:00
Avi Kivity	d3c87975b0	db: don't over-allocate memory for mutation_reader column_family::make_reader() doesn't deal with sstables directly, so it doesn't need to reserve memory for them. Fixes #1453. Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>	2016-07-14 10:01:42 +02:00
Avi Kivity	24e3026e32	Merge "compaction manager refactoring" from Raphael	2016-07-10 17:16:23 +03:00
Tomasz Grabiec	6a1f9a9b97	db: Improve logging Message-Id: <1467997671-16570-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 16:15:03 +03:00
Tomasz Grabiec	c0233c877d	db: Avoid out-of-memory when flushing cannot keep up memtable_list::seal_on_overlflow() is called on each mutation to check if current memtable should be flushed. It will call memtable_list::seal_active_memtable() when that is the case. The number of concurrent seals is guarded by a semaphore, starting from commit `0f64eb7e7d`, and allows at most 4 of them. If there are 4 flushes already pending, every incoming mutation will enqueue a new flush task on the semaphore's wait list, without waiting for it. The wait queue can grow without bounds, eventually leading to out-of-memory. The fix is to seal the memtable immediately to satisfy should_flush() condition, but limit concurrency of actual flushes. This way the wait queue size on the semaphore is limited by memtables pending a flush, which is fairly limited. Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 10:53:51 +03:00
Raphael S. Carvalho	e38f66c6fe	database: make certain column family functions const qualified Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:22 -03:00
Vlad Zolotarov	f2bf453be2	database: revive mutation retry in case of replay_position_reordered_exception The logic that would retry applying a mutation in case of a replay_position_reordered_exception error was broken by a commit `0c31f3e626` Author: Glauber Costa <glauber@scylladb.com> Date: Wed Apr 20 19:09:21 2016 -0400 database: move memtable throttler to the LSA throttler This patch makes it work again. Fixes #1439 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1467893342-30559-1-git-send-email-vladz@cloudius-systems.com>	2016-07-07 15:00:35 +02:00
Paweł Dziepak	32a5de7a1f	db: handle receiving fragmented mutations If mutations are fragmented during streaming, special care must be taken so that isolation guarantees are not broken. Mutations received with flag "fragmented" set are applied to a memtable that is used only by that particular streaming task and the sstables created by flushing such memtables are not made visible until the task is complete. Also, in case the streaming fails all data is dropped. This means that fragmented mutations cannot benefit from coalescing of writes from multiple streaming plans, hence separate way of handling them so that there is no loss of performance for small partitions. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	4031c0ed8f	streaming: pass plan_id to column family for apply and flush plan_id is needed to keep track of the origin of mutations so that if they are fragmented all fragments are made visible at the same time, when that particular streaming plan_id completes. Basically, each streaming plan that sends big (fragmented) mutations is going to have its own memtables and a list of sstables which will get flushed and made visible when that plan completes (or dropped if it fails). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	51ec7a7285	db: wait for ongoing flushes at end of streaming When flush_streaming_mutations() is called at the end of streaming it is supposed to flush all data and then invalidate cache ranges. However, if there are already some memtable flushes in progress it won't wait for them. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Glauber Costa	54ce6221a7	allow the dirty memory manager to be used without a database object Some of our tests don't provide a database object to a CF. Create a default dirty memory manager object that can be used without a database for them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <872f8c9232ff87d788e271b1db86c814d7a75d9f.1467832713.git.glauber@scylladb.com>	2016-07-07 10:00:43 +01:00
Glauber Costa	b0932ceb04	database: act on LSA pressure notification Issue 1195 describes a scenario with a fairly easy reproducer in which we can freeze the database. That involves writing simultaneously to multiple CFs, such that the sum of all the memory they are using is larger than the dirty memory limit, without any of them individually being larger than the memtable size. Because we will never reach the individual memtable seal size for any of them, none of them will initiate a flush leading the database to a halt. The LSA has now gained infrastructure that allows us to be notified when pressure conditions mount. What we will do in this case is initiate a flush ourselves. Fixes #1195 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Glauber Costa	7169b727ea	move system tables to its own region In the spirit of what we are doing for the read semaphore, this patch moves system writes to its own dirty memory manager. Not only will it make sure that system tables will not be serialized by its own semaphore, but it will also put system tables in its own region group. Moving system tables to its own region group has the advantage that system requests won't be waiting during throttle behind a potentially big queue of user requests, since requests are tended to in FIFO order within the same region group. However, system tables being more controlled and predictable, we can actually go a step further and give them some extra reservation so they may not necessarily block even if under pressure (up to 10 MB more). Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
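
Note on the "virtual dirty" commits above (f89a67c75c, aa6a96d09b): the messages describe throttling writes once dirty memory reaches 50 % of the limit, and releasing requests as memtable bytes are written back rather than only when an SSTable is sealed. The following is a minimal sketch of that accounting idea, not Scylla's actual code. The class name virtual_dirty_accounting and all of its methods are hypothetical and chosen only for illustration; the split into a "real" and a "virtual" counter follows the description in aa6a96d09b.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical illustration of virtual dirty accounting; these names do not
// exist in Scylla and the real implementation is built on LSA region groups.
class virtual_dirty_accounting {
    std::size_t _limit;              // total dirty memory budget, in bytes
    std::size_t _real_dirty = 0;     // bytes held by memtables until their sstable is sealed
    std::size_t _virtual_dirty = 0;  // real bytes minus bytes already written back by a flush
public:
    explicit virtual_dirty_accounting(std::size_t limit) : _limit(limit) {}

    // A write lands in a memtable: both counters grow.
    void on_write(std::size_t bytes) {
        _real_dirty += bytes;
        _virtual_dirty += bytes;
    }

    // A flush has written some bytes to disk: mark them as virtually free,
    // even though the memory itself is still pinned.
    void on_flush_progress(std::size_t bytes) {
        _virtual_dirty -= std::min(bytes, _virtual_dirty);
    }

    // The sstable is sealed: the memory is really released now.
    // The virtual counter was already reduced while the flush progressed.
    void on_sstable_sealed(std::size_t bytes) {
        assert(bytes <= _real_dirty);
        _real_dirty -= bytes;
    }

    // Throttle at 50 % occupancy, judged on the virtual counter.
    bool should_throttle() const { return _virtual_dirty >= _limit / 2; }

    std::size_t real_dirty() const { return _real_dirty; }
    std::size_t virtual_dirty() const { return _virtual_dirty; }
};
```

With a limit of 1000 bytes, 600 bytes of writes put this sketch into the throttled state; reporting 150 bytes of flush progress lowers the virtual counter to 450 and lifts the throttle, while the real counter stays at 600 until the sstable is sealed. That is the sense in which requests are admitted "at the rate we are able to write memory back".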

626 Commits