scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	d61002cc33	lsa: Reduce reclamation latency Currently, eviction is performed until the occupancy of the whole region drops below the 85% threshold. This may take a while if the region has high occupancy and is large. We can improve the situation by evicting only until the occupancy of the sparsest segment drops below the threshold, which is what this change does. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144 ms pause occurs when a large eviction is necessary. The change improves worst-case latency. Reclamation time statistics over a 30-second period after the cache fills up, in microseconds: Before: avg = 1524.283148 stdev = 11021.021118 min = 12.934000 max = 144356.000000 sum = 257603.852000 samples = 169 After: avg = 1317.362414 stdev = 1913.542802 min = 263.935000 max = 19244.600000 sum = 175209.201000 samples = 133 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-01-19 17:35:36 +02:00
Amnon Heiman	e19fa02a17	remove scollectd from headers As the metrics migration progressed, some includes of scollectd.hh were left behind. Because of the nature of the scollectd implementation, those includes bring a lot of code into the header files and eventually into many source files. This patch removes those includes and adds a missing include to storage_proxy.cc. The fact that the compiler didn't complain is an indication of the problematic nature of those includes in the first place. Before this patch, a change in metrics.hh would cause 169 files to be recompiled; after this change, 17. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>	2017-01-17 17:39:47 +02:00
Tomasz Grabiec	ddfee57c97	Replace iostream include with iosfwd in headers Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:44 +02:00
Vlad Zolotarov	022bca16bf	utils::logalloc: move collectd counters registration to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:55 -05:00
Pekka Enberg	f83503c09e	date.h: 64-bit year and days representation We need a 64-bit year and days representation to support the boundary values of the CQL date type, which is implemented using the Joda-Time library's DateTime type.	2017-01-09 10:42:20 +02:00
Pekka Enberg	7f2fc6470c	utils/date.h: Import date and time library sources This patch imports the "date.h" date and time library based on the C++11 <chrono> header, which is proposed for standardization: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0355r1.html We need it to implement support for the CQL date type. Import repository https://github.com/HowardHinnant/date Import commit: commit 2935f80109b8cfc15eb1243afe35f7ec3530f971 Author: Howard Hinnant <howard.hinnant@gmail.com> Date: Sun Jan 1 15:02:08 2017 -0500 Have get_version check for the file named version first	2017-01-09 10:39:54 +02:00
Tomasz Grabiec	ab5c77fcf1	bloom_filter: Allow checking presence using pre-hashed key This will allow us to calculate the hash once and use it on many filters instead of calculating the hash for each filter separately. Another change is to avoid precomputing all indexes during filter operations and instead have a for_each_index() template which invokes a functor for each index.	2016-12-19 14:20:58 +01:00
Asias He	00d7a35949	utils: Put crc32 under utils namespace It conflicts with crc in zlib Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>	2016-12-05 11:48:29 +02:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	61d81617e1	utils/flush_queue: Add ability to wait with a timeout	2016-11-29 16:40:58 +01:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Glauber Costa	f86c9e36f4	logalloc: allow region group reclaimer to specify a soft limit The region_group_reclaimer will let us know every time we are over the limit we have specified for memory usage. However, for some applications, we would be interested in knowing about memory buildup earlier, so we can start doing something about it before we reach that condition. This patch introduces soft limit notifications for the region_group_reclaimer. After this patch is applied, start_reclaim() is called earlier, and stop_reclaim() later, after the soft condition is abated. There are methods that allow one to easily test whether the pressure condition is a soft limit condition or a hard, threshold condition and act accordingly. Whether to act on both conditions or just one of them is up to the application. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Glauber Costa	93386bcec7	histograms: do not use latency_in_nano Now that the histogram has its own unit expressed in its template parameter, there is no reason to convert it to nano just so we may need to convert it back if the histogram needs another unit. This patch keeps everything as a duration until the last moment, and then converts when needed. This was suggested by Amnon. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>	2016-11-14 18:01:43 +02:00
Glauber Costa	608d825790	histogram: fix reporting units We are tracking latencies in nanoseconds, but almost everywhere else they are reported in microseconds. Instead of just converting, this patch tries to be a bit more future-proof and embeds the unit into the type, and we then default to microseconds. I have verified that the JMX measures now report sane values for both the storage proxy and the column family. nodetool cfhistograms still works fine. That one is reported in nanoseconds, but through the estimated_histogram, not ihistogram. Fixes #1836 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-11 11:36:56 -05:00
Glauber Costa	1342d044eb	moving averages: change metrics calculation We recently fixed a bug that caused the constructor parameters for the moving average to be inverted, leading to the numbers being just plain wrong. However, the calculation of alpha was also inverted, so it was right by accident, and now it is wrong. With the wrong alpha, the values we see are still correct, but they move very quickly. The intention of this code is obviously to smooth things out. This was found by Nadav. I have tested and confirmed that the smoothing factor now works as expected. Fixes #1837 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-10 22:33:34 -05:00
Amnon Heiman	a977ea85e1	histogram: moving_average and total rate should be calculated in seconds The moving average and the total average should be calculated in seconds and not nanoseconds. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-11-10 22:32:53 -05:00
Glauber Costa	d3f11fbabf	histogram: moving averages: fix inverted parameters The moving_average constructor is defined like this: moving_average(latency_counter::duration interval, latency_counter::duration tick_interval) But when it is time to initialize them, we do this: ... {tick_interval(), std::chrono::minutes(1)} ... As can be seen, the interval and tick interval are inverted. This leads to the metrics being assigned bogus values. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>	2016-11-10 11:28:51 -08:00
Tomasz Grabiec	6548132423	lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions is_compactible() will pass on very small regions. full_compaction() is only used in tests to force objects to be moved due to compaction, so we want all reclaimable regions to be compacted.	2016-10-18 11:16:08 +02:00
Paweł Dziepak	d08cffd3c7	lsa: avoid exceptions during segment_zone creation LSA tries to allocate zones as large as possible (while still leaving enough free space for the standard allocator). It uses the amount of free memory in order to guess how much it can get, but that obviously doesn't account for fragmentation, and the allocation attempt may fail. This patch changes the LSA code so that it doesn't throw when a zone couldn't be created but just returns a null pointer, which should be more performant if the LSA memory cannot grow any more. Fixes #1394. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>	2016-10-14 11:08:24 +02:00
Tomasz Grabiec	e617bcd8a7	logalloc: disable abort on allocation failure in places in which it is benign Some places start with a large request, expecting that the allocation may fail, and then reduce the requested size. Let's not abort in such cases. Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>	2016-10-13 10:53:32 +03:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge, so it does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow us to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Avi Kivity	c94fb1bf12	build: reduce inclusions of messaging_service.hh Remove inclusions from header files (primary offender is fb_utilities.hh) and introduce new messaging_service_fwd.hh to reduce rebuilds when the messaging service changes. Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>	2016-10-05 11:46:49 +03:00
Avi Kivity	f8118d9fc2	Merge "Virtual dirty memory management" from Glauber "Description: ============ Scylla currently suffers from a brick-wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them once some memory has been written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Results ======= With this patchset running a load big enough to easily saturate the disk (commitlog disabled to highlight the effects of the memtable writer), I am able to run scylla for many minutes, with timeouts occurring only when I run out of disk space, whereas without this patch a swarm of timeouts would start merely 2 seconds after the load started, and would never stabilize. In V2, I have sent a set of graphs illustrating the performance of this solution. This version does not have any significant differences on that front.
For details, please refer to https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ Accuracy of the accounting: --------------------------- It is important for us to be as accurate as possible when accounting freed memory, since every byte we mark as freed may allow one or more requests to be executed. I have measured the accuracy of this approach (ignoring padding, object size for the mutation fragments) to be 99.83 % of used memory in the test workload I ran (large, 65k mutations). Memtables under this circumstance tend to have a very high occupancy ratio because throttle breeds idle, and idle breeds compact-on-idle. Known Issues: ------------- A lot of time can elapse between destroying the flush_reader and actually releasing memory. The release of memory only happens when the SSTable is fully sealed, and we have to flush the files, as well as finish writing all SSTable components at this point. This happened in practice with a buggy kernel that would result in flushes taking a long time. Once that is fixed, this is just a theoretical problem, and in practice it shouldn't matter given the time we expect those operations to take." * 'virtual-dirty-v6' of github.com:glommer/scylla: database: allow virtual dirty memory management streamed_mutation: make _buffer private add accounting of memory read to partition_snapshot_reader move partition_snapshot_reader code to header file LSA: allow a group to query its own region group memtables: split scanning reader in two sstables: use special reader for writing a memtable LSA: export information about object memory footprint LSA: export information about size of the throttle queue database: export virtual dirty bytes region group	2016-10-04 20:57:52 +03:00
Glauber Costa	86aa0b830d	LSA: allow a group to query its own region group Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	28e3f2f6ee	LSA: export information about object memory footprint We allocate objects of a certain size, but we use a bit more memory to hold them. To get a clearer picture of how much memory an object will cost us, we need help from the allocator. This patch exports an interface that allows users to query a specific allocator to get that information. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Gleb Natapov	32989d1e66	Merge seastar upstream * seastar 2b55789...5b7252d (3): > Merge "rpc: serialize large messages into fragmented memory" from Gleb > Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz > test_runner: avoid nested optionals Includes patch from Gleb to adapt to seastar changes.	2016-09-28 17:34:16 +03:00
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about how long the oldest request has been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named after similar counters in transport/ for consistency. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Nadav Har'El	fe1ba753ce	Avoid semaphore's default initial value The fact that Seastar's semaphore has a default initial value of 1 if not explicitly initialized is confusing and unexpected, and recently led to two bugs. So ScyllaDB should not rely on this default behavior, and should specify the initial value of each semaphore explicitly. In several cases in the ScyllaDB code, the explicit initialization was missing, and this patch adds it. In one case (rate_limiter) I even think the default of 1 was a bit strange, and 0 makes more sense. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1474530745-23951-1-git-send-email-nyh@scylladb.com>	2016-09-24 19:25:02 +03:00
Tomasz Grabiec	b0b28696b5	scylla-gdb: Add 'scylla lsa-segment' command Allows one to examine contents of LSA segment. Example: (gdb) scylla lsa-segment 0x601000480000 0x601000480e70: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480f10: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480fb0: free size=192 0x60100048107e: free size=42 0x6010004814e0: free size=192 0x6010004815ae: free size=40 0x6010004815e8: free size=192 0x6010004816b8: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000481758: free size=192 ...	2016-09-20 16:53:21 +02:00
Gleb Natapov	2e8b255741	Merge seastar upstream * seastar 0303e0c...e534401 (6): > Merge "enable rpc to work on non contiguous memory for receive" from Gleb > install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py > fix tcp stuck when output_stream write more than 212992 bytes once. > scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message > scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask > native_network_stack: Fix use after free/missing wait in dhcp Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.	2016-09-15 12:12:16 +03:00
Glauber Costa	4310635bae	move estimated histogram to utils Nothing sstable-specific in it, really. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Paweł Dziepak	e76203c927	utils: add input_stream input_stream performs a type erasure on seastar::simple_input_stream and fragmented_input_stream. The main goal is to keep the overhead minimal for the cases when simple_input_stream is used. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	29827a9726	utils: add fragmented_input_stream fragmented_input_stream is an input stream usable by IDL-generated deserializers which can read from fragmented buffers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Amnon Heiman	4c14b2a527	histogram: Add an estimated sum method The histogram implementation uses sampling to estimate the mean and sum. This patch adds a method that returns an estimated sum based on the mean and the total number of events measured. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1467547341-30438-2-git-send-email-amnon@scylladb.com>	2016-08-16 11:06:50 +03:00
Avi Kivity	98226a14ac	Merge "Exception propagation writers in commitlog batch" " While periodic mode is an all-bets-off crap-shoot as far as knowing whether data actually reached disk or not, batch mode is supposed to be somewhat more reliable/deterministic. Thus, if we get an exception writing/flushing the current buffer, we should propagate exceptions to all execution paths involved in this buffer. Flush queue can now (optionally) propagate exceptions to all clients, and commit log uses this to ensure that commit log writers in batch mode all generate exceptions on disk errors. Also includes some rudimentary tests for flush queue mechanisms. Note: the other main user, sstable flushing, is not affected, as the default mode is still to keep exceptions to individual worker continuations, not waiters."	2016-08-08 15:33:26 +03:00
Glauber Costa	fe6a0d97d1	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free, which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure (which includes having exhausted all reserved space), but at least we won't recurse back into the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>	2016-08-04 11:16:53 +02:00
Glauber Costa	ad58691afb	logalloc: make sure blocked requests memory allocations are served from the standard allocator Issue 1510 describes a scenario in which, under load, we allocate memory within release_requests(), leading to a reentry into an invalid state in our blocked requests' shared_promise. This is not easy to trigger since not all allocations will actually get to the point in which they need a new segment, let alone have that happen during another allocator call. Reentry of this kind is something we have always sought to avoid with release_requests(): this is the reason why most of the actual routine is deferred after a call to later(). However, that is a trick we cannot use for updating the state of the blocked requests' shared_promise: we can't guarantee when that is going to run, and we always need a valid shared_promise, in a valid state, waiting for new requests to hook into. The solution employed by this patch is to make sure that no allocation operations whatsoever happen during the initial part of release_requests on behalf of the shared promise. Allocation is now deferred to first use, which relieves release_requests() of all allocation duties. All it needs to do is free the old object and signal to its user that an allocation is needed (by storing {} into the shared_promise). Fixes #1510 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>	2016-08-03 20:40:30 +02:00
Calle Wilund	620e54cae4	flush_queue: Allow exception propagation to waiters Re-worked to use shared_promise<> as signal mechanism (because we have that now), which also makes it less painful to implement exceptions propagating not only from "func" to "post", but also from given func->post chain entry to any waiters. v2: * Remove leading "_" in template types	2016-08-03 14:49:38 +00:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering a core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what the cause of the allocation failure was. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Calle Wilund	a277975fd4	flush_queue: ensure gate is always closed (even with exceptions)	2016-08-01 08:23:42 +00:00
Avi Kivity	de62285591	bloom_filter: use correct types 'long' does not have a fixed size. It happens to match Java's long on Linux x86_64, but may not on other platforms (e.g. Windows x64). Message-Id: <1469352705-1079-1-git-send-email-avi@scylladb.com>	2016-07-24 14:00:37 +02:00
Avi Kivity	900639915d	bloom_filter: fix overflow for large filters We use ::abs(), which has an int parameter, on long arguments, resulting in incorrect results. Switch to std::abs() instead, which has the correct overloads. Fixes #1494. Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>	2016-07-24 11:31:26 +03:00
Vlad Zolotarov	f64f27beb9	utils: add get_time_UUID(system_clock::time_point) Creates a type 1 UUID (time-based UUID) with the given system_clock::time_point Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Avi Kivity	d261927fa3	logalloc: change sprint() of a pointer to use void* explicitly Otherwise, fmtlib dislikes it.	2016-07-18 19:37:16 +03:00
Paweł Dziepak	cfa581b426	utils/managed_vector: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	703509a1c7	utils/managed_bytes: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Avi Kivity	f96e5d7c1b	managed_bytes: fix build with gcc 6 gcc 6 complains that deleting a managed_bytes::external isn't defined because the size isn't known. I'm not sure it's correct, but there's no way to tell because flexible arrays aren't standardized. Fix by using an array of zero size. Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>	2016-06-27 10:54:10 +02:00
Glauber Costa	4e81f19ab5	LSA: fix typo in region merge There are many potentially tricky things about referring to different regions from the LSA perspective. Madness, however, is not one of them. I can only assume we meant made? Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>	2016-06-21 22:58:44 +03:00
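The bloom_filter commits above (ab5c77fcf1, de62285591 and 900639915d) describe hashing a key once and reusing it across filters, visiting bit positions through a for_each_index() functor instead of a precomputed index array, and using fixed-width types with std::abs() instead of ::abs(int). The toy filter below is a minimal sketch of those ideas under stated assumptions. `hashed_key` and `toy_bloom_filter` are hypothetical names, not Scylla's API. The key carries two precomputed 64-bit hash halves (a real filter would derive them from a 128-bit hash). Indexes use double hashing in the style of Cassandra's filter, `abs((h1 + i*h2) % num_bits)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical pre-hashed key: two 64-bit halves of a 128-bit hash, computed
// once by the caller and then checked against any number of filters.
struct hashed_key {
    std::uint64_t h1;
    std::uint64_t h2;
};

// Toy bloom filter (a sketch, not Scylla's bloom_filter class).
class toy_bloom_filter {
    std::vector<bool> _bits;
    int _hash_count;

    // Invokes f(index) for each bit position of the key instead of
    // materializing an array of precomputed indexes.
    template <typename Func>
    void for_each_index(const hashed_key& k, Func&& f) const {
        const std::int64_t max = static_cast<std::int64_t>(_bits.size());
        std::uint64_t base = k.h1; // unsigned, so wraparound is well defined
        for (int i = 0; i < _hash_count; ++i) {
            // Fixed-width signed value (modular conversion, guaranteed since
            // C++20) and std::abs, whose overloads cover 64-bit integers,
            // rather than ::abs(int), which would truncate. Since
            // |x % max| < max, std::abs cannot overflow here.
            std::int64_t idx = std::abs(static_cast<std::int64_t>(base) % max);
            f(static_cast<std::size_t>(idx));
            base += k.h2;
        }
    }

public:
    toy_bloom_filter(std::size_t num_bits, int hash_count)
        : _bits(num_bits, false), _hash_count(hash_count) {
        assert(num_bits > 0 && hash_count > 0);
    }

    void add(const hashed_key& k) {
        for_each_index(k, [this](std::size_t i) { _bits[i] = true; });
    }

    // No early exit in this sketch: every index is visited.
    bool is_present(const hashed_key& k) const {
        bool present = true;
        for_each_index(k, [&](std::size_t i) { present = present && _bits[i]; });
        return present;
    }
};
```

A caller computes one `hashed_key` per lookup and passes it to `is_present()` on every filter it needs to consult. The hash cost is therefore paid once, however many filters are checked.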
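The moving-average commits above (d3f11fbabf and 1342d044eb) depend on how the smoothing factor alpha is derived from the averaging interval and the tick interval. The sketch below assumes the exponentially weighted formulation used by the Dropwizard (Codahale) Metrics library, alpha = 1 - exp(-tick_interval / interval). The class `toy_moving_average` and its members are hypothetical and are not Scylla's moving_average implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical EWMA rate meter (a sketch, not Scylla's moving_average).
class toy_moving_average {
    double _alpha;                       // smoothing factor in (0, 1)
    std::chrono::duration<double> _tick; // tick length, in seconds
    double _rate = 0.0;                  // smoothed events per second
    bool _initialized = false;

public:
    // Same parameter order as the constructor quoted in d3f11fbabf:
    // (averaging interval, tick interval).
    toy_moving_average(std::chrono::duration<double> interval,
                       std::chrono::duration<double> tick_interval)
        : _alpha(1.0 - std::exp(-tick_interval.count() / interval.count()))
        , _tick(tick_interval) {
        assert(interval.count() > 0 && tick_interval.count() > 0);
    }

    double alpha() const { return _alpha; }
    double rate() const { return _rate; }

    // Folds in the number of events counted during one tick.
    void tick(double events) {
        const double instant = events / _tick.count(); // events per second
        if (!_initialized) {
            _rate = instant; // seed with the first observation
            _initialized = true;
        } else {
            _rate += _alpha * (instant - _rate);
        }
    }
};
```

With a one-minute interval and a five-second tick, alpha is about 0.08, so each tick moves the average roughly 8% toward the latest rate. Swapping the two arguments gives an alpha very close to 1, so the average simply tracks the last tick. This matches the "move very quickly" symptom described in 1342d044eb.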
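608d825790 and 93386bcec7 make the reporting unit part of the histogram's type (defaulting to microseconds) and keep values as durations until they are reported. The following is a minimal sketch of that idea. `toy_histogram` is hypothetical, and unlike the real histogram it keeps an exact running total rather than sampling.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical histogram whose reporting unit is a template parameter.
// Samples are kept as std::chrono durations and converted to the reporting
// unit only when a value is read out.
template <typename Unit = std::chrono::microseconds>
class toy_histogram {
    std::uint64_t _count = 0;
    std::chrono::nanoseconds _total{0}; // exact running total, no sampling

public:
    // Accepts any std::chrono duration; callers never convert by hand.
    template <typename Rep, typename Period>
    void mark(std::chrono::duration<Rep, Period> latency) {
        assert(latency.count() >= 0);
        _total += std::chrono::duration_cast<std::chrono::nanoseconds>(latency);
        ++_count;
    }

    std::uint64_t count() const { return _count; }

    // Mean latency expressed in Unit; the conversion happens only here.
    double mean() const {
        if (_count == 0) {
            return 0.0;
        }
        const std::chrono::duration<double, typename Unit::period> total = _total;
        return total.count() / static_cast<double>(_count);
    }
};
```

Because the unit is a template argument, `toy_histogram<>` reports microseconds while `toy_histogram<std::chrono::nanoseconds>` reports nanoseconds. Neither requires any change at the call sites that record samples.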
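f64f27beb9 adds get_time_UUID(system_clock::time_point). Per RFC 4122, a version 1 UUID timestamp counts 100-nanosecond intervals since 1582-10-15 00:00:00 UTC and is packed into the most significant 64 bits together with the version number. The helpers below are a hypothetical sketch of just that timestamp part, not Scylla's implementation. The clock sequence and node fields in the least significant 64 bits are omitted.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ratio>

// 100-nanosecond ticks: the unit of an RFC 4122 version 1 UUID timestamp.
using uuid_ticks = std::chrono::duration<std::int64_t, std::ratio<1, 10000000>>;

// Ticks between the UUID epoch (1582-10-15 00:00:00 UTC) and the Unix
// epoch (1970-01-01 00:00:00 UTC), i.e. 122192928000000000.
constexpr std::int64_t uuid_epoch_offset = 0x01B21DD213814000LL;

// Hypothetical helper: 60-bit version 1 timestamp for a time point.
// Assumes system_clock counts from the Unix epoch (guaranteed since C++20).
inline std::uint64_t uuid_v1_timestamp(std::chrono::system_clock::time_point tp) {
    const auto since_unix = std::chrono::duration_cast<uuid_ticks>(tp.time_since_epoch());
    const std::int64_t ts = since_unix.count() + uuid_epoch_offset;
    assert(ts >= 0);
    return static_cast<std::uint64_t>(ts) & 0x0FFFFFFFFFFFFFFFULL; // keep 60 bits
}

// Hypothetical helper: packs the timestamp into the most significant 64 bits
// as time_low | time_mid | time_hi_and_version, with version 1.
inline std::uint64_t uuid_v1_msb(std::uint64_t ts) {
    const std::uint64_t time_low = ts & 0xFFFFFFFFULL;
    const std::uint64_t time_mid = (ts >> 32) & 0xFFFFULL;
    const std::uint64_t time_hi = (ts >> 48) & 0x0FFFULL;
    return (time_low << 32) | (time_mid << 16) | 0x1000ULL | time_hi;
}
```

The version nibble sits in bits 12 to 15 of the result, so `(msb >> 12) & 0xF` yields 1 for any timestamp produced this way.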


317 Commits