scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Calle Wilund	49d3d79dfe	sstables: Fix compilation error on boost 1.55 Message-Id: <1461067254-526-2-git-send-email-calle@scylladb.com>	2016-04-25 12:54:44 +03:00
Piotr Jastrzebski	8231385e0c	sstables: Remove unused code from mp_row_consumer _mutation_to_subscription is not used anywhere so it should probably be removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <90ef62daee0c183b29dcb86d08843145d657ea38.1461179970.git.piotr@scylladb.com>	2016-04-20 23:10:43 +03:00
Raphael S. Carvalho	bf03cd1ea6	sstables: kill unused code from size tiered strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <485b1e49419cb052218ab4558f27270ce3bd03b4.1460761821.git.raphaelsc@scylladb.com>	2016-04-19 08:46:06 +03:00
Raphael S. Carvalho	29db5f5e1f	sstables: move compaction strategy code to a new source file Moving compaction strategy code from sstables/compaction.cc to sstables/compaction_strategy.cc That improves readability. Strategy code should be separated from the generic compaction code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <5af6fc8f7321351a071fc0ce03c80ffea21f8396.1460761821.git.raphaelsc@scylladb.com>	2016-04-19 08:45:43 +03:00
Pekka Enberg	3f2286d02e	Merge "Delete compacted sstables atomically" from Avi "If we compact sstables A, B into a new sstable C we must either delete both A and B, or none of them. This is because a tombstone in B may delete data in A, and during compaction, both the tombstone and the data are removed. If only B is deleted, then the data gets resurrected. Non-atomic deletion occurs because the filesystem does not support atomic deletion of multiple files; but the window for that is small and is not addressed in this patchset. Another case is when A is shared across multiple shards (as is the case when changing shard count, or migrating from existing Cassandra sstables). This case is covered by this patchset. Fixes #1181."	2016-04-14 22:04:15 +03:00
Avi Kivity	a843aea547	db: delete compacted sstables atomically If sstables A, B are compacted, A and B must be deleted atomically. Otherwise, if A has data that is covered by a tombstone in B, and that tombstone is deleted, and if B is deleted while A is not, then the data in A is resurrected. Fixes #1181.	2016-04-14 17:14:26 +03:00
Avi Kivity	3798d04ae8	sstables: convert sstable::mark_for_deletion() to atomic deletion infrastructure All deletions must go through the same data structure, or some atomic deletions will never be satisfied.	2016-04-14 17:14:26 +03:00
Avi Kivity	2ba584db8d	sstables: add delete_atomically(), for atomically deleting multiple sstables When we compact a set of sstables, we have to remove the set atomically, otherwise we can resurrect data if the following happens: insert data to sstable A; insert tombstone to sstable B; compact A+B -> C (removing both data and tombstone); delete B only; read data from A. Since an sstable may be shared by multiple shards, and each shard performs compaction at a different time, we need to defer deletion of an sstable set until all shards agree that the set can be deleted. An additional atomicity issue exists because POSIX does not provide a way to atomically delete multiple files. This issue is not addressed by this patch.	2016-04-14 17:14:26 +03:00
Pekka Enberg	60352f810a	Merge "Fixes for the reading of missing Summary" from Glauber "This patchset contains some fixes spotted during post-merged review by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0, but since we haven't yet even backported the main series, might as well backport everything. It also includes some unit tests to make sure that they will be kept working in the future."	2016-04-13 11:32:05 +03:00
Raphael S. Carvalho	c7b728e716	sstables: Fix leveled compaction strategy There is a problem in the implementation of leveled compaction strategy that prevents level 1 from being compacted into level 2, and so forth. As a result, all sstables will only belong to either level 0 or 1. One of the consequences is level 1 being overwhelmed by a huge amount of sstables. The root of the problem is a conditional statement in the code that prevents a single sstable, with level > 0, from being compacted into a subsequent level that is empty or has no overlapping sstables. Fixes #1180. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>	2016-04-13 11:14:14 +03:00
Raphael S. Carvalho	c28d168619	sstables: allow user to specify max sstable size with leveled strategy This change will allow user to specify the maximum size of a new sstable created as a result of leveled compaction. Example of using this setting: ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000', 'class': 'LeveledCompactionStrategy'} Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>	2016-04-13 09:13:33 +03:00
Raphael S. Carvalho	15246f31f7	sstables: fix incorrect sstable size when compression is enabled Size of uncompressed sstable was being unconditionally used to determine when to stop writing a table. When compression is enabled, compressed size should be used instead. Problem affected Scylla when compression and leveled strategy were used. Fixes #1177. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d9bf26def41fb33ca297f4127ce042b7f67adf96.1460484529.git.raphaelsc@scylladb.com>	2016-04-13 09:01:01 +03:00
Glauber Costa	114ba5e3a8	be robust against broken summary files Now that we can boot without a Summary file, we can just as easily boot with a broken one. Suggested by Nadav, and it is actually very easy to do, so do it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	72dc45999d	review fixes for generate_summary Spotted by Avi post-merge 1) Need to close the file 2) Should be using the parameter pc instead of the default_class Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	f78f43850d	clear components if reading toc fail This shouldn't be a problem in practice, because if read_toc() fails, the users will just tend to discard the sstable object altogether, and not insist on using it. However, if somebody does try to keep using it, a subsequent read_toc() could theoretically have some components filled up leading the new reader to believe the toc was populated successfully. It is easier to just clear the _components set and never worry about it, than trying to reason about whether or not that could happen. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	0f41ef1b84	index_reader: avoid misleading parent name Also add comments about the expected signature of IndexConsumer Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:15:11 -04:00
Avi Kivity	715794cce6	sstables: filter sstables single-row read using first_key/last_key Using leveled compaction strategy, only a few sstables will contain a given key, so we need to filter out the rest. Using the summary entries to filter keys works if the key is before the first summary entry, but does not work if it is after the last summary entry, because the last summary entry does not represent the last key; so sstables that are towards the beginning of the ring are read even if they do not contain the key, greatly reducing read performance. Fix by consulting the summary's first_key/last_key entries before consulting the summary entry array.	2016-04-12 10:33:17 +03:00
Raphael S. Carvalho	8fe7524e46	sstables: enable leveled strategy feature to prevent L0 from falling behind If level 0 falls behind, size tiered strategy is used on it to reduce overhead until we can catch up on the higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <17bf15b7d12cd5dc652cc92939c0c68f921662a2.1459976469.git.raphaelsc@scylladb.com>	2016-04-11 11:52:00 +03:00
Nadav Har'El	818f14f444	sstable: overhaul (again) range tombstone merging In commit `99ecda3c96`, we overhauled the way we read Cassandra's disjoint range tombstones, and convert them to the overlapping whole-prefix tombstones which we support. Unfortunately, while this algorithm worked correctly for a couple of test cases, it did not for additional test cases. While the previous algorithm could not generate "wrong" tombstones (it didn't generate things it didn't see), it could generate redundant overlapping tombstones, and missed some sanity checks about the correctness of the merge process. In this patch, a new algorithm makes sure to not generate redundant tombstones, and includes additional tests to ensure that we do not mistakenly merge range tombstones which cannot actually be merged. The following patches will include tests which failed with the previous algorithm, and succeed with this one. I described the new algorithm on the ScyllaDB mailing list this way: 1. Have a stack of open ranges, start & timestamp for each (no end for each), and just one "end of last contiguous deletion". Processing each range tombstone: 2. If the start of a range tombstone is not adjacent to the "end of last deletion", assert we have no open range on the stack (because we can never close those). In any case, set the "end of last deletion" to the end of this tombstone. 3. If the current tombstone's timestamp is STRICTLY HIGHER than that on the top of the stack, push the new tombstone's start+timestamp to the stack. Note: If it was STRICTLY LOWER, throw error (it means the open range will never be closed). 4. If the current tombstone's end matches (i.e., closes row) of the start on the top of the stack, emit this tombstone and pop the stack. When the row ends: 5. Assert the stack is empty. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459778074-10759-1-git-send-email-nyh@scylladb.com>	2016-04-11 11:35:23 +03:00
Glauber Costa	8a50b027aa	summary: generate one if it is not present There are cases in which a Summary file will not be present, and imported SSTables will have just the Index and Data files. In earlier versions of Cassandra, a Summary didn't exist, so one may not be generated when migrating. In Issue #1170, we can see an example of tables generated by CQLSSTableWriter, and they lack a Summary. Cassandra is robust against this and can cope perfectly with the Summary not existing. I will argue that we should do the same. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	4de26fdec8	sstables: allow read_toc to be called more than once We do that by bailing immediately if we detect that the components map is already populated. This allows us to call read_toc() earlier if we need to - for instance, to inquire about the existence of the Summary - without the need to re-read the components again later. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	736e21222e	sstables: avoid passing schema unnecessarily for prepare_summary we can just pass the min interval as a parameter and avoid having the schema do yet another hop. For sealing the summary, it is completely unused and we can do away with it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	0de3a32147	index reader: make index_consumer a template parameter This is done so we can use other consumers. An example of that, is regeneration of the Summary from an existing Index. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	8453ff7788	make get_sstable_key_range an instance method Because just creating an SSTable object does not generate any I/O, get_sstable_key_range should be an instance method. The main advantage of doing that is that we won't have to read the summary twice. The way we're doing it currently, if it happens to be a shard-relevant table we'll call load() - which reads the summary again. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	6ae601a025	do not re-read the summary There are times in which we read the Summary file twice. That actually happens every time during normal boot (it doesn't during refresh). First during get_sstable_key_range and then again during load(). Every summary will have at least one entry, so we can easily test for whether or not this is properly initialized. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Raphael S. Carvalho	e15ce5eb4d	api: Add support to get column family compression ratio After this change, user can query compression ratio on a per column family basis with 'nodetool cfstats'. look at 'nodetool cfstats' output: ./bin/nodetool cfstats ks.test5 Keyspace: ks Read Count: 0 Read Latency: NaN ms. Write Count: 0 Write Latency: NaN ms. Pending Flushes: 0 Table: test5 SSTable count: 1 Space used (live): 4774 Space used (total): 4774 Space used by snapshots (total): 0 Off heap memory used (total): 131384 SSTable Compression Ratio: 0.833333 ... Fixes #636. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a1bee5a23fe63787df3e387a88f2d216ba4a4134.1459802771.git.raphaelsc@scylladb.com>	2016-04-05 12:46:40 +03:00
Gleb Natapov	70575699e4	commitlog, sstables: enlarge XFS extent allocation for large files With big rows I see contention in XFS allocations which cause reactor thread to sleep. Commitlog is a main offender, so enlarge extent to commitlog segment size for big files (commitlog and sstable Data files). Message-Id: <20160404110952.GP20957@scylladb.com>	2016-04-04 14:15:00 +03:00
Paweł Dziepak	3c107c4b05	sstables: remove HyperLogLog throw() specifier HyperLogLog constructor promises that it only throws instances of std::invalid_argument. That's a lie since it also adds elements to a vector (and doesn't catch potential bad_allocs). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-31 16:36:53 +01:00
Nadav Har'El	78c9f49585	sstables: Move check_marker() to source file The check_marker() function is used as a sanity-check of data we read from sstable, so instead of the header file key.hh, let's move it to the sstable-parsing source file partition.cc. In addition to having less code in header files, another benefit is that the function can now throw a more specific exception (malformed sstable exception). Also fixed the exception's message (which had a second "%d" but only one parameter). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459420430-5968-1-git-send-email-nyh@scylladb.com>	2016-03-31 14:22:51 +03:00
Nadav Har'El	99ecda3c96	sstables: overhaul range tombstone reading Until recently, we believed that range tombstones we read from sstables will always be for entire rows (or more generalized clustering-key prefixes), not for arbitrary ranges. But as we found out, because Cassandra insists that range tombstones do not overlap, it may take two overlapping row tombstones and convert them into three range tombstones which look like general ranges (see the patch for a more detailed example). Not only do we need to accept such "split" range tombstones, we also need to convert them back to our internal representation which, in the above example, involves two overlapping tombstones. This is what this patch does. This patch also contains a test for this case: We created in Cassandra an sstable with two overlapping deletions, and verify that when we read it to Scylla, we get these two overlapping deletions - despite the sstable file actually having contained three non-overlapping tombstones. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>	2016-03-31 12:49:50 +03:00
Nadav Har'El	0fc9a5ee4d	sstables: merge range tombstones if possible This is a rewrite of Glauber's earlier patch to do the same thing, taking into account Avi's comments (do not use a class, do not throw from the constructor, etc.). I also verified that the actual use case which was broken in #1136 was fixed by this patch. Currently, we have no support for range tombstones because CQL will not generate them as of version 2.x. Thrift will, but we can safely leave this for the future. However, we have seen cases during a real migration in which a pure-CQL Cassandra would generate range tombstones in its SSTables. Although we are not sure how and why, those range tombstones were of a special kind: their end and next's start range were adjacent, which means that in reality, they could very well have been written as a single range tombstone for an entire clustering key - which we support just fine. This code will attempt to fix this problem temporarily by merging such ranges if possible. Care must be taken so that we don't end up accepting a true generic range tombstone by accident. Fixes #1136 Signed-off-by: Glauber Costa <glauber@scylladb.com> Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459333972-20345-1-git-send-email-nyh@scylladb.com>	2016-03-30 13:40:10 +03:00
Glauber Costa	23808ba184	sstables: fix exception printouts in check_marker As Nadav noticed in his bug report, check_marker is creating its error messages using characters instead of numbers - which is what we intended here in the first place. That happens because sprint(), when faced with an 8-bit type, interprets this as a character. To avoid that we'll use uint16_t types, taking care not to sign-extend them. The bug also noted that one of the error messages is missing a parameter, and that is also fixed. Fixes #1122 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>	2016-03-29 19:23:28 +03:00
Glauber Costa	d5c1366e85	compaction: be verbose about which table is causing an exception When we, for some reason, fail to compact an SSTable, we do not log the file name leaving us with cryptic messages that tell us what happened, but not where it happened. This patch adds logging in compaction so that we'll know what's going on. Please note that readers are more of a concern, because the SSTable being written technically does not exist yet. Still, better safe than sorry: if open_data fails, or we leave an unfinished SSTable, it is still good to know which one was the culprit. Some argument can be made about whether we should log this at the lower SSTable level, or at the compaction level. The reason I am logging this at the compaction level, is that we don't really know which exception will trigger, and where: it may be the case that we're seeing exceptions that are not SSTable specific, and may not have the chance to log it properly. In particular, if the exception happens inside the reader: read_rows() and friends only return a mutation reader, which doesn't really do anything until we call read(). But at that time, we don't hold any pointers to the SSTable anymore. In summary, logging at the compaction level guarantees that we always do it no matter what. Exceptions that are part of the main SSTable path can log the file name as well if they want: if that's the case, we'll be left with the name appearing twice. That's totally harmless, and better than none. Fixes #1123 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>	2016-03-29 18:15:56 +03:00
Raphael Carvalho	d515a7fd85	sstables: fix deletion of sstable with temporary TOC After `4e52b41a4`, remove_by_toc_name() became aware of temporary TOC files, however, it doesn't consider that some components may be missing if temporary TOC is present. When creating a new sstable, the first thing we do is to write all components into temporary TOC, so content of a temporary TOC isn't reliable until it is renamed. Solution is about implementing the following flow (described by Avi): "Flow should be: - remove all components in parallel - forgive ENOENT, since the component may not have been written; otherwise deletion error should be raised - fsync the directory - delete the temporary TOC " This problem can be reproduced by running compaction without disk space, so compaction would fail and leave a partial sstable that would be marked for deletion. Afterwards, remove_by_toc_name() would try to delete a component that doesn't exist because it looked at the content of temporary TOC. Fixes #1095. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>	2016-03-29 10:38:01 +03:00
Nadav Har'El	a05577ca41	sstable: fix read failure of certain sstables We had a problem reading certain existing Cassandra sstables into Scylla. Our consume_range_tombstone() function assumes that the start and end columns have certain "end of component" markers, and wants to verify that assumption. But because of bugs in older versions of Cassandra, see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the "end of component" was missing (set to 0). CASSANDRA-7593 suggested this problem might exist on the start column, so we allowed for that, but now we discovered a case where also the end column is set to 0 - causing the test in consume_range_tombstone() to fail and the sstable read to fail - causing Scylla to not be able to import that sstable from Cassandra. Allowing for a 0 also on the end column made it possible to read that sstable, compact it, and so on. Fixes #1125. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>	2016-03-28 17:09:37 +03:00
Calle Wilund	4e52b41a46	sstables: Add delete func to rename TOC ensuring table is marked dead Note: "normal" remove_by_toc_name must now be prepared for and check if the TOC of the sstable is already moved to temp file when we get to the juicy delete parts. Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>	2016-03-24 12:01:53 +02:00
Nadav Har'El	2eb0627665	sstable: fix use-after-free of temporary ioclass copy Commit `6a3872b355` fixed some use-after-free bugs but introduced a new one because of a typo: Instead of capturing a reference to the long-living io-class object, as all the code does, one place in the code accidentally captured a copy of this object. This copy had a very temporary life, and when a reference to that copy was passed to sstable reading code which assumed that it lives at least as long as the read call, a use-after-free resulted. Fixes #1072 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458595629-9314-1-git-send-email-nyh@scylladb.com>	2016-03-21 22:28:05 +01:00
Benoît Canet	3b1d3d977d	exceptions: Shutdown communications on non file I/O errors Apply the same treatment to non file filesystem I/O errors. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-2-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:54 +02:00
Benoît Canet	1fb9a48ac5	exception: Optionally shutdown communication on I/O errors. I/O errors cannot be fixed by Scylla; the only solution is to shut down the database communications. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:52 +02:00
Glauber Costa	6a3872b355	sstables: do not assume mutation_reader will be kept alive Our sstables::mutation_reader has a specialization in which start and end ranges are passed as futures. That is needed because we may have to read the index file for those. This works well under the assumption that every time a mutation_reader will be created it will be used, since whoever is using it will surely keep the state of the reader alive. However, that assumption is no longer true - for a while. We use a reader interface for reading everything from mutations and sstables to cache entries, and when we create an sstable mutation_reader, that does not mean we'll use it. In fact we won't, if the read can be serviced first by a higher level entity. If that happens to be the case, the reader will be destructed. However, since it may take more time than that for the start and end futures to resolve, by the time they are resolved the state of the mutation reader will no longer be valid. The proposed fix for that is to only resolve the future inside mutation_reader's read() function. If that function is called, we can have a reasonable expectation that the caller object is being kept alive. A second way to fix this would be to force the mutation reader to be kept alive by transforming it into a shared pointer and acquiring a reference to itself. However, because the reader may turn out not to be used, the delayed read actually has the advantage of not even reading anything from the disk if there is no need for it. Also, because sstables can be compacted, we can't guarantee that the sst object itself, used in the resolution of start and end can be alive and that has the same problem. If we delay the calling of those, we will also solve a similar problem. We assume here that the outer reader is keeping the SSTable object alive. I must note that I have not reproduced this problem. What goes above is the result of the analysis we have made in #1036. That being the case, a thorough review is appreciated. Fixes #1036 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>	2016-03-16 17:51:02 +02:00
Nadav Har'El	02ba8ffbe8	Allow uncompression at end of file Asking to read from byte 100 when a file has 50 bytes is an obvious error. But what if we ask to read from byte 50? What if we ask to read 0 bytes at byte 50? :-) Before this patch, code which asked to read from the EOF position would get an exception. After this patch, it would simply read nothing, without error. This allows, for example, reading 0 bytes from position 0 on a file with 0 bytes, which apparently happened in issue #1039... A read which starts at a position higher than the EOF position still generates an exception. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>	2016-03-16 17:50:23 +02:00
Nadav Har'El	73297c7872	Fix out-of-range exception when uncompressing 0 bytes The uncompression code reads the compressed chunks containing the bytes pos through pos + len - 1. This, however, is not correct when len==0, and pos + len - 1 may even be -1, causing an out-of-range exception when calling locate() to find the chunks containing this byte position. So we need to treat len==0 specially, and in this case we don't read anything, and don't need to locate() the chunks to read. Refs #1039. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>	2016-03-16 15:54:48 +02:00
Vlad Zolotarov	ce47fcb1ba	sstables: properly account removal requests The same shard may create an sstables::sstable object for the same SStable that doesn't belong to it more than once and mark it for deletion (e.g. in a 'nodetool refresh' flow). In that case the destructor of sstables::sstable accounted the deletion requests from the same shard more than once since it was a simple counter incremented each time there was a deletion request while it should account request from the same shard as a single request. This is because the removal logic waited for all shards to agree on a removal of a specific SStable by comparing the counter mentioned above to the total number of shards and once they were equal the SStable files were actually removed. This patch fixes this by replacing the counter by an std::unordered_set<unsigned> that will store a shard ids of the shards requesting the deletion of the sstable object and will compare the size() of this set to smp::count in order to decide whether to actually delete the corresponding SStable files. Fixes #1004 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>	2016-03-14 11:45:08 +02:00
Raphael S. Carvalho	1ff7d32272	sstables: make write_simple() safer by using exclusive flag We should guarantee that write_simple() will not try to overwrite an existing file. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <194bd055f1f2dc1bb9766a67225ec38c88e7b005.1457818073.git.raphaelsc@scylladb.com>	2016-03-14 11:45:00 +02:00
Raphael S. Carvalho	0af786f3ea	sstables: fix race condition when writing to the same sstable in parallel When we are about to write a new sstable, we check if the sstable exists by checking if respective TOC exists. That check was added to handle a possible attempt to write a new sstable with a generation being used. Gleb was worried that a TOC could appear after the check, and that's indeed possible if there is an ongoing sstable write that uses the same generation (running in parallel). If TOC appears after the check, we would again crap an existing sstable with a temporary, and user wouldn't be able to boot scylla anymore without manual intervention. Then Nadav proposed the following solution: "We could do this by the following variant of Raphael's idea: 1. create .txt.tmp unconditionally, as before the commit `031bf57c1` (if we can't create it, fail). 2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp we just created and fail. 3. continue as usual 4. and at the end, as before, rename .txt.tmp to .txt. The key to solving the race is step 1: Since we created .txt.tmp in step 1 and know this creation succeeded, we know that we cannot be running in parallel with another writer - because such a writer too would have tried to create the same file, and kept it existing until the very last step of its work (step 4)." This patch implements the solution described above. Let me also say that the race is theoretical and scylla wasn't affected by it so far. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>	2016-03-14 11:44:51 +02:00
Raphael S. Carvalho	031bf57c19	sstables: bail out if toc exists for generation used by write_components Currently, if sstable::write_components() is called to write a new sstable using the same generation of a sstable that exists, a temporary TOC will be unconditionally created. Afterwards, the same sstable::write_components() will fail when it reaches sstable::create_data(). The reason is obvious because data component exists for that generation (in this scenario). After that, user will not be able to boot scylla anymore because there is a generation with both a TOC and a temporary TOC. We cannot simply remove a generation with TOC and temporary TOC because user data will be lost (again, in this scenario). After all, the temporary TOC was only created because sstable::write_components() was wrongly called with the generation of a sstable that exists. Solution proposed by this patch is to trigger exception if a TOC file exists for the generation used. Some SSTable unit tests were also changed to guarantee that we don't try to overwrite components of an existing sstable. Refs #1014. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>	2016-03-11 09:22:51 +02:00
Nadav Har'El	1b4f8842ee	sstable: fix compressed data file overread Since commit `2f56577` ("sstables: more efficient read of compressed data file"), the compressed_file_input_stream uses a file_input_stream to efficiently read the compressed data in chunks of some desired size (128 KB is our default) instead of at smaller compressed chunks. However, I had a bug where I mis-calculated the desired length of the read (giving the end byte instead of the length!) and as a result file_input_stream did not know where the read was supposed to stop, and always read 128 KB buffers. The results were not incorrect, because the sstable reader stops when it needs to, even if given too much data. But it was inefficient because too much data was read in the last buffer. With this patch, the length is correctly given to the input stream, and it can read a much smaller buffer at the end of the read, not the full 128 KB. I tested that this actually happens. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>	2016-03-11 09:17:50 +02:00
Glauber Costa	f2a8bcabc2	sstables: improve error messages The standard C++ exception messages that will be thrown if there is anything wrong writing the file, are suboptimal: they barely tell us the name of the failing file. Use a specialized create function so that we can capture that better. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Gleb Natapov	51ca3122cf	cleanup forward declaration for key types Message-Id: <20160310075138.GC6117@scylladb.com>	2016-03-10 10:52:19 +01:00
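
Note on commit ce47fcb1ba ("sstables: properly account removal requests"): the message describes replacing a plain counter with a set of shard ids, so that repeated deletion requests from one shard count once and the files are removed only when the set size equals the shard count. A minimal sketch of that idea follows. The class name `deletion_tracker` and its methods are hypothetical illustrations, not ScyllaDB's actual code, and `shard_count` merely stands in for `smp::count`.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical helper (not ScyllaDB's real class): tracks which shards have
// asked for a shared sstable to be deleted.
class deletion_tracker {
    unsigned _shard_count;                    // stands in for smp::count
    std::unordered_set<unsigned> _requested;  // ids of shards that requested deletion
public:
    explicit deletion_tracker(unsigned shard_count) : _shard_count(shard_count) {
        assert(shard_count > 0);
    }

    // Records a deletion request from `shard`. Returns true once every shard
    // has requested deletion, i.e. when the files may actually be removed.
    // Repeated requests from the same shard are counted only once.
    bool request_deletion(unsigned shard) {
        assert(shard < _shard_count);
        _requested.insert(shard);
        return _requested.size() == _shard_count;
    }

    // Number of shards that have not yet requested deletion.
    std::size_t pending() const { return _shard_count - _requested.size(); }
};
```

With a plain counter, two requests from shard 0 on a two-shard node would reach the threshold before shard 1 had agreed; with the set, the second request from shard 0 leaves the size unchanged.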


595 Commits