scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Glauber Costa	e08fa7dafa	fix potential stale data in cache update We currently have a problem in update_cache, that can be triggered by ordering issues related to memtable flush termination (not initiation) and/or update_cache() call duration. That issue is described in #1364, and in short, happens if a call to update_cache starts before an ongoing call finishes. There is now a new SSTable that should be consulted by the presence checker that is not. The partition checker operates on a stale list because we need to make sure the SSTable we just wrote is excluded from it. This patch changes the partition checker so that all SSTables currently in use are consulted, except for the one we have just flushed. That provides both the guarantee that we won't check our own SSTable and access to the most up-to-date SSTable list. Fixes #1364 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>	2016-06-23 10:54:44 +02:00
Pekka Enberg	bcba45f546	Merge "Prevent old node to join new cluster" from Asias Fixes #1253	2016-06-23 10:25:38 +03:00
Piotr Jastrzebski	9b011bff18	row_cache: add contiguity flag to cache entry to reduce disk IO during scans Add contiguity flag to cache entry and set it in scanning reader. Partitions fetched during scanning are continuous and we know there's nothing between them. Clear contiguity flag on cache entries when the succeeding entry is removed. Use continuous flag in range queries. Don't go to disk if we know that there's nothing between two entries we have in cache. We know that when continuous flag of the first one is set to true. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>	2016-06-23 09:43:15 +03:00
Avi Kivity	5af22f6cb1	main: handle exceptions during startup If we don't, std::terminate() causes a core dump, even though an exception is sort-of-expected here and can be handled. Add an exception handler to fix. Fixes #1379. Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>	2016-06-23 09:25:33 +03:00
Avi Kivity	a192c80377	gdb: fully-qualify type names gdb gets confused if a non-fully-qualified class name is used when we are in some namespace context. Help it out by adding a :: prefix. Message-Id: <1466587895-8690-1-git-send-email-avi@scylladb.com>	2016-06-22 12:04:17 +02:00
Avi Kivity	9dacd4fb80	Merge "query: Add new limits" from Duarte This patchset adds two new types of query limits: - Per partition row limit, which limits how many rows a given partition may return; needed both for thrift and for future CQL features; - Limit on the number of partitions returned, needed by thrift.	2016-06-22 11:03:13 +03:00
Duarte Nunes	82dbf5bff3	storage_proxy: Trace when retrying a query Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:48:15 +02:00
Duarte Nunes	69798df95e	query: Limit number of partitions returned This is required to implement a thrift verb. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:48:13 +02:00
Duarte Nunes	594e43a60a	compact_query: Rename partition_limit This patch renames compact_query::_partition_limit to _current_partition_limit for clarity, as the next patch adds a partition limit that limits the number of partitions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:47:29 +02:00
Duarte Nunes	e9ebd87991	compact_query: Rename limit to row_limit This patch renames compact_query::_limit to _row_limit for clarity, as a subsequent patch introduces yet another limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:47:28 +02:00
Duarte Nunes	01b18063ea	query: Add per-partition row limit This patch adds a per-partition row limit. It ensures both local queries and the reconciliation logic abide by this limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:46:51 +02:00
Duarte Nunes	20d9813a89	storage_proxy: Fetch last replica row just in time This patch changes the way we fetch each replica's last row to determine if we got incomplete information from any of them. Instead of fetching the last rows up front, we fetch them on demand only if we actually trigger the code that needs them. We now get the last row from the versions vector of vectors. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 00:15:38 +02:00
Duarte Nunes	4ce9fc24cb	storage_proxy: Extract finding last row This patch extracts to a function the code that actually determines the last row of a partition based on the direction of the query. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 00:15:38 +02:00
Takuya ASADA	73ba4ac337	dist: drop sudoers.d from .rpm, since systemd moved to PermissionsStartOnly Since systemd moved to PermissionsStartOnly, only upstart uses sudoers. So move common/sudoers.d to dist/ubuntu, drop them from .rpm. Also, Ubuntu 15.10/16.04 do not require sudoers since these use systemd. So copy sudoers only for 14.04. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1466536491-9860-1-git-send-email-syuu@scylladb.com>	2016-06-21 22:59:18 +03:00
Glauber Costa	4e81f19ab5	LSA: fix typo in region merge There are many potentially tricky things about referring to different regions from the LSA perspective. Madness, however, is not one of them. I can only assume we meant made? Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>	2016-06-21 22:58:44 +03:00
Benoît Canet	8e4dee0bd1	scylla_setup: Hide /dev/loop* The user probably doesn't want to use /dev/loop* as RAID devices. Fixes: #1259 Signed-of-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466520602-7888-1-git-send-email-benoit@scylladb.com>	2016-06-21 19:27:40 +03:00
Tzach Livyatan	27b99f47e8	scylla_setup: improve the wording of disk setup phase. Fix #1197 by adding XFS related info to the interactive prompt Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1466504625-28926-1-git-send-email-tzach@scylladb.com>	2016-06-21 19:26:31 +03:00
Avi Kivity	96ebc4e7b5	Merge seastar upstream * seastar 401c333...3029ebe (3): > util: add a seastar::value_of() helper function > rpc: force closing listen fd on server stop > reactor: fix I/O priority class id assignment	2016-06-21 15:11:26 +03:00
Tomasz Grabiec	597cbbdedc	Merge branch 'pdziepak/streamed-mutations/v5' from seastar-dev.git From Paweł: This series introduces streaming_mutations which allow mutations to be streamed between the producers and the consumers as a series of mutation_fragments. Because of that the mutation streaming interface works well with partitions larger than available memory provided that actual producer and consumer implementations can support this as well. mutation_fragments are the basic objects that are emitted by streamed_mutations; they can represent a static row, a clustering row, or the beginning or the end of a range tombstone. They are ordered by their clustering keys (with static rows being always the first emitted mutation fragment). The beginning of range tombstone is emitted before any clustering row affected by that tombstone and the end of range tombstone is emitted after the last clustering row affected by it. Range tombstones are disjoint. In this series all producers are converted to fully support the new interface, that includes cache, memtables and sstables. Mutation queries and data queries are the only consumers converted so far. To minimize the per-mutation_fragment overhead streamed_mutations use batching. The actual producer implementation fills a buffer until it is full (currently, buffer size is 16, the limit should, however, be changed to depend on the actual size in memory of the stored elements) or end of stream is reached. In order to guarantee isolation of writes, reads from cache and memtable use MVCC. When a reader is created it takes a snapshot of the particular cache or memtable entry. The snapshot is immutable and if there happen to be any incoming writes while the read is active a new version of partition is created. When the snapshot is destroyed partition versions are merged together as much as possible.
Performance results with perf_simple_query (median of results with duration 15): write: before 618652.70, after 618047.58, diff -0.10%; read: before 661712.44, after 608070.49, diff -8.11%	2016-06-21 12:15:21 +02:00
Pekka Enberg	11dd20d640	Revert "ami: Change type from EBS to Instance" This reverts commit `2d7f8f4a47`. Avi sayeth: "Isn't this the other way round? EBS is persistent." and "The patch is wrong too. Instance store takes 5 minutes to boot compared to 1 minute for EBS."	2016-06-21 12:41:30 +03:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one don't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Calle Wilund	2b812a392a	commitlog_replayer: Fix calculation of global min pos per shard If a CF does not have any sstables at all, we should treat it as having a replay position of zero. However, since we also must deal with potential re-sharding, we cannot just set shard->uuid->zero initially, because we don't know what shards existed. Go through all CF:s post map-reduce, and for every shard where a CF does not have an RP-mapping (no sstables found), set the global min pos (for shard) to zero. Fixes #1372 Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>	2016-06-21 10:05:05 +03:00
Benoît Canet	2d7f8f4a47	ami: Change type from EBS to Instance Instance types does not have ephemeral drive that disappear on reboot. Fixes #1229 Signed-of-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466443232-5898-1-git-send-email-benoit@scylladb.com>	2016-06-21 09:56:26 +03:00
Calle Wilund	88ffe60138	batchlog_manager: Change replay mutation CL to ALL Try to emulate the origin behaviour for batch replay. They use an explicit write handler, combining 1.) Hinting to all known dead endpoints 2.) Sending to all presumed live, requiring ack from all 3.) Hinting to endpoint to which send failed. We don't have hints, so try to work around by doing send with cl=ALL, and if send fails (wholly or partially), retain the batch in the log. This is still a slight behavioural difference, and we also risk filling up the batch log in extreme cases. (Though probably not in any real environment). Refs #1222 Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>	2016-06-21 09:41:09 +03:00
Glauber Costa	7f29cb8aba	tests: add logalloc tests for pressure notification tests to make sure various scenarios of pressure notification for active asynchronous reclaim work. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:58:39 -04:00
Glauber Costa	8f5047fc5f	tests: add tests to new region_group throttle interface Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	35f8a2ce2c	LSA: add a backpointer to the region from its private data Region is implemented using the pimpl pattern (region_impl), and all its relevant data is present in a private structure instead of the region itself. That private structure is the one that the other parts of the LSA will refer to, the region_group being the prime example. To allow classes such as the region_group to externally export a particular region, we will introduce a backpointer region_impl -> region. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	6404028c6a	LSA: move subgroups to a heap as well When we decide to evict from a specific region_group due to excessive memory usage, we must also consider looking at each of their children (subgroups). It could very well be that most of memory is used by one of the subgroups, and we'll have to evict from there. We also want to make sure we are evicting from the biggest region of all, and not the biggest region in the biggest region_group. To understand why this is important, consider the case in which the regions are memtables associated with dirty region groups. It could be that a very big memtable was recently flushed, and a fairly small one took its place. That region group is still quite large because the memtable hasn't finished flushing yet, but that doesn't mean we should evict from it. To allow us to efficiently pick which region is the largest, each root of each subtree will keep track of its maximal score, defined as the maximum between our largest region total_space and the maximum maximal score of subtrees. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	e1eab5c845	LSA: store regions in a heap for regions_group Currently, the regions in a region group are organized in a simple vector. We can do better by using a binomial heap, as we do for segments, and then updating when there is change. Internally to the LSA, we are in good position to always know when change happens, so that's really the best way to do it. The end game here is to easily call for the reclaim of the largest offending region (potentially asynchronously). Because of that, we aren't really interested in the region occupancy, but in the region reclaimable occupancy instead: that's simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing that effectively lists all non reclaimable regions in the end of the heap, in no particular order. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	54d4d46cf7	LSA: move throttling code to LSA. The database code uses a throttling function to make sure that memory used for the dirty region never is over the limit. We track that with a region group, so it makes sense to move this as generic functionality into LSA. This patch implements the LSA-side functionality and a later patch will convert the current memtable throttler to use it. Unlike the current throttling mechanism, we'll not use a timer-based mechanism here. Aside from being more generic and friendlier towards other users, this is a good change for current memtable by itself. The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we would be better off without them. Let's discuss the merits of each separately: 1) 10ms timer: If we are throttling, we expect somebody to flush the memtables for memory to be released. Since we are in position to know exactly when a memtable was written, thus releasing memory, we can just call unthrottle at that point, instead of using a timer. 2) 1MB release threshold: we do that because we have no idea how much memory a request will use, so we put the cut somehow. However, because of 1) we don't call unthrottle through a timer anymore, and do it directly instead. This means that we can just execute the request and see how much memory it has used, with no need to guess. So we'll call unthrottle at the end of every request that was previously throttled. Writing the code this way also has the advantage that we need one less continuation in the common case of the database not being throttled. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:34:19 -04:00
Paweł Dziepak	6f25533f4e	mutation_query: drop querying_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:52 +01:00
Paweł Dziepak	ed12c164f8	mutation_query: make mutation queries streaming-friendly Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:28 +01:00
Paweł Dziepak	0828c88b25	mutation_partition: implement streaming-friendly data_query() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:19 +01:00
Paweł Dziepak	67ae9457e3	mutation_partition: introduce mutation_querier mutation_querier is a streamed_mutation consumer that adds the mutation content to query::result. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:53 +01:00
Paweł Dziepak	f54e604a16	mutation_partition: introduce compact_for_query compact_for_query is an intermediate stage used to compact data in a flattened stream of mutations before they are consumed by query building consumers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:53 +01:00
Paweł Dziepak	2b7e62599d	mutation_reader: add consume_flattened() Mutation reader produces a stream of streamed_mutations. Each streamed_mutation itself is a stream so basically we are dealing here with a stream of streams. consume_flattened() flattens such stream of streams making all its elements consumable by a single consumer. It also allows reversing the mutations before consumption using reverse_streamed_mutation().	2016-06-20 21:29:52 +01:00
Paweł Dziepak	5566d23180	streamed_mutation: add reverse_streamed_mutation() reverse_streamed_mutation() is an inefficient way of reversing streamed_mutations. First, it collects all mutation_fragments and then it emits them in reversed order (except the static row, which is always the first element; it also flips the bounds of range tombstones). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	f676d1779b	range_tombstone: add flip_bound_kind() flip_bound_kind() changes start bound to end bound and vice versa while preserving the inclusiveness. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	a3423bac38	tests/streamed_mutation: test freezing streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	6e68f0931e	frozen_mutation: freeze streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	349905d0fd	range_tombstone_list: add clear() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	494c6fa9c1	tests/mutation_query_test: make sure mutations are sliced properly Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	8dfabf2790	mutation_reader: support slicing in make_reader_returning_many() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	6871bd5fa0	memtable: fully support streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	983321f194	tests/mutation: do not create memtable on stack Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	4a5a9148e3	tests/row_cache: test slicing mutation reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	e1a8d94542	tests/row_cache: test mvcc Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	b2c37429e7	row_cache: drop slicing_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00

9663 Commits