scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Glauber Costa	2bffa8af74	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free - which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure - which includes having exhausted all reserved space - but at least we won't recurse back to the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com> (cherry picked from commit `fe6a0d97d1`)	2016-08-04 11:17:54 +02:00
Glauber Costa	4a6d0d503f	logalloc: make sure blocked requests memory allocations are served from the standard allocator Issue 1510 describes a scenario in which, under load, we allocate memory within release_requests() leading to a reentry into an invalid state in our blocked requests' shared_promise. This is not easy to trigger since not all allocations will actually get to the point in which they need a new segment, let alone have that happening during another allocator call. Having those kinds of reentry is something we have always sought to avoid with release_requests(): this is the reason why most of the actual routine is deferred after a call to later(). However, that is a trick we cannot use for updating the state of the blocked requests' shared_promise: we can't guarantee when that is going to run, and we always need a valid shared_promise, in a valid state, waiting for new requests to hook into. The solution employed by this patch is to make sure that no allocation operations whatsoever happen during the initial part of release_requests on behalf of the shared promise. Allocation is now deferred to first use, which relieves release_requests() from all allocation duties. All it needs to do is free the old object and signal to its user that an allocation is needed (by storing {} into the shared_promise). Fixes #1510 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com> (cherry picked from commit `ad58691afb`)	2016-08-04 11:17:49 +02:00
Avi Kivity	75a36ae453	bloom_filter: fix overflow for large filters We use ::abs(), which has an int parameter, on long arguments, resulting in incorrect results. Switch to std::abs() instead, which has the correct overloads. Fixes #1494. Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com> (cherry picked from commit `900639915d`)	2016-07-24 11:32:28 +03:00
Vlad Zolotarov	f64f27beb9	utils: add get_time_UUID(system_clock::time_point) Creates a type 1 UUID (time-based UUID) with the given system_clock::time_point Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Avi Kivity	d261927fa3	logalloc: change sprint() of a pointer to use void* explicitly Otherwise, fmtlib dislikes it.	2016-07-18 19:37:16 +03:00
Paweł Dziepak	cfa581b426	utils/managed_vector: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	703509a1c7	utils/managed_bytes: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Avi Kivity	f96e5d7c1b	managed_bytes: fix build with gcc 6 gcc 6 complains that deleting a managed_bytes::external isn't defined because the size isn't known. I'm not sure it's correct, but there's no way to tell because flexible arrays aren't standardized. Fix by using an array of zero size. Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>	2016-06-27 10:54:10 +02:00
Glauber Costa	4e81f19ab5	LSA: fix typo in region merge There are many potentially tricky things about referring to different regions from the LSA perspective. Madness, however, is not one of them. I can only assume we meant made? Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>	2016-06-21 22:58:44 +03:00
Tomasz Grabiec	597cbbdedc	Merge branch 'pdziepak/streamed-mutations/v5' from seastar-dev.git From Paweł: This series introduces streaming_mutations which allow mutations to be streamed between the producers and the consumers as a series of mutation_fragments. Because of that the mutation streaming interface works well with partitions larger than available memory provided that actual producer and consumer implementations can support this as well. mutation_fragments are the basic objects that are emitted by streamed_mutations; they can represent a static row, a clustering row, the beginning and the end of a range tombstone. They are ordered by their clustering keys (with static rows being always the first emitted mutation fragment). The beginning of range tombstone is emitted before any clustering row affected by that tombstone and the end of range tombstone is emitted after the last clustering row affected by it. Range tombstones are disjoint. In this series all producers are converted to fully support the new interface, that includes cache, memtables and sstables. Mutation queries and data queries are the only consumers converted so far. To minimize the per-mutation_fragment overhead streamed_mutations use batching. The actual producer implementation fills a buffer until it is full (currently, buffer size is 16, the limit should, however, be changed to depend on the actual size in memory of the stored elements) or end of stream is reached. In order to guarantee isolation of writes, reads from cache and memtable use MVCC. When a reader is created it takes a snapshot of the particular cache or memtable entry. The snapshot is immutable and if there happen to be any incoming writes while the read is active a new version of partition is created. When the snapshot is destroyed partition versions are merged together as much as possible. 
Performance results with perf_simple_query (median of results with duration 15): write: 618652.70 before, 618047.58 after, diff -0.10%; read: 661712.44 before, 608070.49 after, diff -8.11%	2016-06-21 12:15:21 +02:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one doesn't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	35f8a2ce2c	LSA: add a backpointer to the region from its private data Region is implemented using the pimpl pattern (region_impl), and all its relevant data is present in a private structure instead of the region itself. That private structure is the one that the other parts of the LSA will refer to, the region_group being the prime example. To allow classes such as the region_group to externally export a particular region, we will introduce a backpointer region_impl -> region. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	6404028c6a	LSA: move subgroups to a heap as well When we decide to evict from a specific region_group due to excessive memory usage, we must also consider looking at each of their children (subgroups). It could very well be that most of memory is used by one of the subgroups, and we'll have to evict from there. We also want to make sure we are evicting from the biggest region of all, and not the biggest region in the biggest region_group. To understand why this is important, consider the case in which the regions are memtables associated with dirty region groups. It could be that a very big memtable was recently flushed, and a fairly small one took its place. That region group is still quite large because the memtable hasn't finished flushing yet, but that doesn't mean we should evict from it. To allow us to efficiently pick which region is the largest, each root of each subtree will keep track of its maximal score, defined as the maximum between our largest region total_space and the maximum maximal score of subtrees. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	e1eab5c845	LSA: store regions in a heap for regions_group Currently, the regions in a region group are organized in a simple vector. We can do better by using a binomial heap, as we do for segments, and then updating when there is change. Internally to the LSA, we are in good position to always know when change happens, so that's really the best way to do it. The end game here is to easily call for the reclaim of the largest offending region (potentially asynchronously). Because of that, we aren't really interested in the region occupancy, but in the region reclaimable occupancy instead: that's simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing that effectively lists all non reclaimable regions in the end of the heap, in no particular order. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	54d4d46cf7	LSA: move throttling code to LSA. The database code uses a throttling function to make sure that memory used for the dirty region never is over the limit. We track that with a region group, so it makes sense to move this as generic functionality into LSA. This patch implements the LSA-side functionality and a later patch will convert the current memtable throttler to use it. Unlike the current throttling mechanism, we'll not use a timer-based mechanism here. Aside from being more generic and friendlier towards other users, this is a good change for current memtable by itself. The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we would be better off without them. Let's discuss the merits of each separately: 1) 10ms timer: If we are throttling, we expect somebody to flush the memtables for memory to be released. Since we are in position to know exactly when a memtable was written, thus releasing memory, we can just call unthrottle at that point, instead of using a timer. 2) 1MB release threshold: we do that because we have no idea how much memory a request will use, so we put the cut somehow. However, because of 1) we don't call unthrottle through a timer anymore, and do it directly instead. This means that we can just execute the request and see how much memory it has used, with no need to guess. So we'll call unthrottle at the end of every request that was previously throttled. Writing the code this way also has the advantage that we need one less continuation in the common case of the database not being throttled. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:34:19 -04:00
Paweł Dziepak	dfa827161d	utils: add anchorless list The main user of this list is MVCC implementation in partition_version.cc. The reason why boost::intrusive::list<> cannot be used is that there is no single owner of the list who could keep boost::intrusive::list<> object alive. In the MVCC case there is at least one partition_entry object and possibly multiple partition_snapshot objects whose lifetimes are independent and the list must remain in a valid state as long as at least one of them is alive. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	84713d2236	utils: extract optimized_optional<> from mutation_opt Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Glauber Costa	01a658f51d	LSA: helper function for region_group current hierarchy walk converted, but more users will come. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	741aa16748	LSA: allow a region_group to have a threshold for throttling specified Allocations will still be allowed if made directly, but callers will have the choice (in an upcoming patch) to proceed only if memory is below this threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	7cd0c0731e	region_group: delete move constructor Tomek correctly points out that since we are now using "this" in lambda captures, we should make the region_group not movable. We currently define a move constructor, but there are no users. So we should just remove them. copy constructor is already deleted, and so are the copy and move assignment operators. So by removing the move constructor, we should be fine. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Tomasz Grabiec	cd9955d2ce	lsa: Reclaim 1 segment by default Reclaiming many segments was observed to cause up to multi-ms latency. With the new setting, the latency of reclamation cycle with full segments (worst case mode) is below 1ms. I saw no decrease in throughput compared to the step of 16 segments in either of these modes: - full segments, reclaim by random eviction - sparse segments (3% occupancy), reclaim by compaction and no eviction Fixes #1274.	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	86b76171a8	lsa: Use the same step in both internal and external reclamations	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	d74d902a01	lsa: Make reclamation step configurable	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	93bb95bd0d	lsa: Log reclamation rate	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	cb18418022	lsa: Print more details before aborting	2016-06-14 15:13:14 +02:00
Pekka Enberg	8df5aa7b0c	utils/exceptions: Whitelist EEXIST and ENOENT in should_stop_on_system_error() There are various call-sites that explicitly check for EEXIST and ENOENT: $ git grep "std::error_code(E" database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) { sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) { Commit `961e80a` ("Be more conservative when deciding when to shut down due to disk errors") turned these errors into a storage_io_exception that is not expected by the callers, which causes 'nodetool snapshot' functionality to break, for example. Whitelist the two error codes to revert back to the old behavior of io_check(). Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>	2016-06-09 10:03:04 +02:00
Pekka Enberg	02d033667a	utils: Improve storage_io_exception error message Make storage_io_exception exception error message less cryptic by actually including the human-readable error message from std::system_error... Before: nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2 After: nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory We can improve this further by including the name of the file that the I/O error happened on. Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>	2016-06-09 09:58:00 +02:00
Amnon Heiman	2cf882c365	rate_moving_average: mean_rate is not initialized The rate_moving_average is used by timed_rate_moving_average to return its internal values. If there are no timed events, the mean_rate is not properly initialized. To solve that the mean_rate is now initialized to 0 in the structure definition. Refs #1306 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>	2016-06-07 09:38:58 +03:00
Avi Kivity	961e80ab74	Be more conservative when deciding when to shut down due to disk errors Currently we only shut down on EIO. Expand this to shut down on any system_error. This may cause us to shut down prematurely due to a transient error, but this is better than not shutting down due to a permanent error (such as ENOSPC or EPERM). We may whitelist certain errors in the future to improve the behavior. Fixes #1311. Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>	2016-06-06 10:56:34 +02:00
Amnon Heiman	5f84e55bf6	histogram: total needs to be incremented on plus operator The total counter (the one that counts the actual number of sample points) should be incremented when adding histograms. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1464172277-4251-1-git-send-email-amnon@scylladb.com>	2016-06-05 12:09:36 +03:00
Piotr Jastrzebski	136b8148d2	Use idle CPU to compact LSA memory Register an idle CPU handler that compacts a single segment every time there's nothing better to execute on CPU. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>	2016-05-26 12:43:53 +03:00
Amnon Heiman	8ef25ceb05	Add weighted average rate related objects This patch adds a few data structures for derived and accumulative statistics that are similar to the yammer implementation used by the JMX. It also adds a plus operator to histogram which cleans the histogram usage. moving_average - An exponentially-weighted moving average. Calculates an event rate on a given interval. rate_moving_average and timed_rate_moving_average - Calculate 1m, 5m and 15m EWMA, an all-time average and a counter. rate_moving_average_and_histogram and timed_rate_moving_average_and_histogram - Combines a histogram with a rate_moving_average. It also exposes a histogram API so it will be an easy task to replace a histogram with a timed_rate_moving_average_and_histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:47:49 +03:00
Pekka Enberg	4ed702f0da	Merge "Authorizer support" from Calle "Conversion/implementation of "authorizer" code from origin, handling permissions management for users/resources. Default implementation keeps mapping of <user.resource>->{permissions} in a table, contents of which is cached for slightly quicker checks. Adds access control to all (existing) cql statements. Adds access management support to the CQL impl. (GRANT/REVOKE/LIST) Verified manually and with dtest auth_test.py. Note that several of these still fail due to (unrelated) unimplemented features, like index, types etc. Fixes #1138"	2016-04-19 15:00:38 +03:00
Calle Wilund	ead1c882f8	utils::loading_cache: Version of the LoadingCache type used in origin Simple, expiring, cache of potentially limited number of entries.	2016-04-19 11:49:05 +00:00
Takuya ASADA	f6252be0c1	utils: fix compilation error on utils/exceptions.hh It isn't able to find std::system_error due to a missing header. Fixes #1202 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461006884-28316-1-git-send-email-syuu@scylladb.com>	2016-04-19 09:37:31 +03:00
Calle Wilund	c446fe50e6	tuple_hash: Add convenience operator for two arguments (non-pair)	2016-04-18 13:51:15 +00:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Tomasz Grabiec	a0cba3c86f	logalloc: Introduce tracker::occupancy() Returns occupancy information for all memory allocated by LSA, including segment pools / zones.	2016-03-22 16:28:10 +01:00
Tomasz Grabiec	529c8b8858	logalloc: Rename tracker::occupancy() to region_occupancy()	2016-03-22 14:56:44 +01:00
Tomasz Grabiec	ca08db504b	managed_bytes: Make operator[] work for large blobs as well Fixes assertion in mutation_test: mutation_test: ./utils/managed_bytes.hh:349: blob_storage::char_type* managed_bytes::data(): Assertion `!_u.ptr->next' Introduced in `ea7c2dd085` Message-Id: <1458648786-9127-1-git-send-email-tgrabiec@scylladb.com>	2016-03-22 14:43:52 +02:00
Tomasz Grabiec	184e2831e7	managed_bytes: Mark move-assignment noexcept	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	92d4cfc3ab	managed_bytes: Make copy assignment exception-safe	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	22d193ba9f	managed_bytes: Make linearization_context::forget() noexcept It is needed for noexcept destruction, which we need for exception safety in higher layers. According to [1], erase() only throws if key comparison throws, and in our case it doesn't. [1] http://en.cppreference.com/w/cpp/container/unordered_map/erase	2016-03-21 18:41:27 +01:00
Benoît Canet	1fb9a48ac5	exception: Optionally shutdown communication on I/O errors. I/O errors cannot be fixed by Scylla; the only solution is to shut down the database communications. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:52 +02:00
Paweł Dziepak	338fd34770	lsa: update _closed_occupancy after freeing all segments _closed_occupancy will be used when a region is removed from its region group, make sure that it is accurate. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-17 11:12:05 +00:00
Paweł Dziepak	99b61d3944	lsa: set _active to nullptr in region destructor In region destructor, after active segments is freed pointer to it is left unchanged. This confuses the remaining parts of the destructor logic (namely, removal from region group) which may rely on the information in region_impl::_active. In this particular case the problem was that code removing from the region group called region_impl::occupancy() which was dereferencing _active if not null. Fixes #993. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>	2016-03-07 10:15:28 +01:00
Calle Wilund	e79ca557ed	managed_bytes: Change init of small object to silence error on gcc5 Fixes #865 (Some) gcc 5 (5.3.0 for me) on ubuntu will generate errors on compilation of this code (compiling logalloc_test). The memcpy to inline storage seems to confuse the compiler. Simply change to std::copy, which shuts the compiler up. Any decent stl should convert primitive std::copy to memcpy anyway, but since it is also the inline (small storage), it should not matter which way. Message-Id: <1456931988-5876-4-git-send-email-calle@scylladb.com>	2016-03-02 18:21:51 +02:00
Calle Wilund	43ea1f5945	utils::jointpoint: Helper type to generate a singular value for all shards Lets operations working on all shards "join" and acquire the same value of something, with that value being based on whenever all shards reach the join. Obvious use case: time stamp after one set of per-shard ops, but before final ones. The generation of the value is guaranteed to happen on the shards that created the join point. Based on the join-ops in CF::snapshot, but abstracted and made caller responsibility. Primary use case is to help deal with the join-problem of truncation. Message-Id: <1456332856-23395-1-git-send-email-calle@scylladb.com>	2016-02-24 18:59:25 +02:00
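
The bloom_filter fix listed above (75a36ae453) replaces ::abs(), which the commit message describes as taking an int parameter, with std::abs(), which has overloads for wider integer types. A minimal C++ sketch of that failure mode follows. It is not the ScyllaDB code: the helper names abs_via_int and abs_via_overload are hypothetical, and the explicit static_cast<int> stands in for the implicit narrowing the commit message describes.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper reproducing the bug pattern: the 64-bit value is
// narrowed to int before the absolute value is taken, as happens when only
// an int-taking abs() is selected for a long argument.
inline std::int64_t abs_via_int(std::int64_t v) {
    return std::abs(static_cast<int>(v));
}

// Hypothetical helper showing the fix: std::abs() from <cstdlib> has
// overloads for long and long long, so no narrowing takes place.
inline std::int64_t abs_via_overload(std::int64_t v) {
    return std::abs(v);
}
```

For inputs that fit in an int the two helpers agree. For a value such as -(2^32 + 5), abs_via_overload returns 4294967301, while abs_via_int returns a value bounded by the int range and therefore unrelated to the input's magnitude.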


276 Commits