scylladb

Author	SHA1	Message	Date
Calle Wilund	ff5df306e3	database: Use disk-marking delete function in discard_sstables Fixes #797 To make sure an inopportune crash after truncate does not leave sstables on disk to be considered live, and thus resurrect data, after a truncate, use delete function that renames the TOC file to make sure we've marked sstables as dead on disk when we finish this discard call. Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>	2016-03-24 12:02:08 +02:00
Glauber Costa	34a9fc106f	database: keep streaming memtables in their own region group Theoretically, because we can have a lot of pending streaming memtables, we can have the database start throttling and incoming connections slowing down during streaming. Turns out this is actually a very easy condition to trigger. That is basically because the other side of the wire in this case is quite efficient in sending us work. This situation is alleviated a bit by reducing parallelism, but not only does it not go away completely, once we have the tools to start increasing parallelism again it will become commonplace. The solution for this is to limit the streaming memtables to a fraction of the total allowed dirty memory. Using the nesting capability built into the LSA regions, we will make the streaming region group a child of the main region group. With that, we can throttle streaming requests separately, while at the same time being able to control the total amount of dirty memory as well. Because of this nesting property, it can still be the case that incoming requests will throttle earlier due to streaming - unless we allow for more dirty memory to be used during repairs - but at least that effect will be limited. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:47 -04:00
Glauber Costa	455d5a57d2	streaming memtables: coalesce incoming writes The repair process will potentially send ranges containing few mutations, definitely not enough to fill a memtable. It wants to know whether or not each of those ranges individually succeeded or failed, so we need a future for each. Small memtables being flushed are bad, and we would like to write bigger memtables so we can better utilize our disks. One of the ways to fix that is to change the repair itself to send more mutations in a single batch. But relying on that is a bad idea for two reasons: First, the goals of the SSTable writer and the repair sender are at odds. The SSTable writer wants to write as few SSTables as possible, while the repair sender wants to break down the range in pieces as small as it can and checksum them individually, so it doesn't have to send a lot of mutations for no reason. Second, even if the repair process wants to process larger ranges at once, some ranges themselves may be small. So while most ranges would be large, we would still have potentially some fairly small SSTables lying around. The best course of action in this case is to coalesce the incoming streams write-side. Repair can now choose whatever strategy - small or big ranges - it wants, rest assured that the incoming memtables will be coalesced together. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:38:22 -04:00
Glauber Costa	5fa866223d	streaming: add incoming streaming mutations to a different sstable Keeping the mutations coming from the streaming process as mutations like any other has a number of advantages - and that's why we do it. However, this makes it impossible for Seastar's I/O scheduler to differentiate between incoming requests from clients and those arriving from peers in the streaming process. As a result, if the streaming mutations consume a significant fraction of the total mutations, and we happen to be using the disk at its limits, we are in no position to provide any guarantees - defeating the whole purpose of the scheduler. To implement that, we'll keep a separate set of memtables that will contain only streaming mutations. We don't have to do it this way, but doing so makes life a lot easier. In particular, to write an SSTable, our API requires (because the filter requires) that a good estimate of the number of partitions is provided in advance. The partitions also need to be sorted. We could write mutations directly to disk, but the above conditions couldn't be met without significant effort. In particular, because mutations can be arriving from multiple peer nodes, we can't really sort them without keeping a staging area anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:13:00 -04:00
Glauber Costa	78189de57f	database: make seal_on_overflow a method of the memtable_list Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	635bb942b2	database: move add_memtable as a method of the memtable_list The column family still has to teach the memtable list how to allocate a new memtable, since it uses CF parameters to do so. After that, the memtable_list's constructor takes a seal and a create function and is complete. The copy constructor can now go, since there are no users left. The behavior of keeping a reference to the underlying memtables can also go, since we can now guarantee that nobody is keeping references to it (it is not even a shared pointer anymore). Individual memtables are, and users may be keeping references to them individually. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	6ba95d450f	database: move active_memtable to memtable_list Each list can have a different active memtable. The column family method keeps existing, since the two separate sets of memtable are just an implementation detail to deal with the problem of streaming QoS: the active memtable keeps being the one from the main list. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	af6c7a5192	database: create a class for memtable_list memtable_list is currently just an alias for a vector of memtables. Let's move them to a class of its own, exporting the relevant methods to keep user code unchanged as much as possible. This will help us keep separate lists of memtables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Raphael Carvalho	370b1336fe	service: fix refresh Vlad and I were working on finding the root of the problems with refresh. We found that refresh was deleting existing sstable files because of a bug in a function that was supposed to return the maximum generation of a column family. The intention of this function is to get generation from last element of column_family::_sstables, which is of type std::map. However, we were incorrectly using std::map::end() to get last element, so garbage was being read instead of maximum generation. If the garbage value is lower than the minimum generation of a column family, then reshuffle_sstables() would set generation of all existing sstables to a lower value. That would confuse our mechanism used to delete sstables because sstables loaded at boot stage were touched. Solution to this problem is about using rbegin() instead of end() to get last element from column_family::_sstables. The other problem is that refresh will only load generations that are larger than or equal to X, so new sstables with lower generation will not be loaded. Solution is about creating a set with generation of live SSTables from all shards, and using this set to determine whether a generation is new or not. The last change was about providing an unused generation to reshuffle procedure by adding one to the maximum generation. That's important to prevent reshuffle from touching an existing SSTable. Tested 'refresh' under the following scenarios: 1) Existing generations: 1, 2, 3, 4. New ones: 5, 6. 2) Existing generations: 3, 4, 5, 6. New ones: 1, 2. 3) Existing generations: 1, 2, 3, 4. New ones: 7, 8. 4) No existing generation. No new generation. 5) No existing generation. New ones: 1, 2. I also had to adapt existing testcase for reshuffle procedure. Fixes #1073. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>	2016-03-23 10:21:58 +02:00
Raphael Carvalho	de4b4e593d	db: better handling of failure in column_family::populate Improve handling of failure by saving first exception and ignoring the remaining futures. At the moment, code only throws first exception and doesn't care about any possible remaining future. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <383dc4445db09dd2fbce093d4609a0a0bc38a405.1458240398.git.raphaelsc@scylladb.com>	2016-03-20 17:33:20 +02:00
Benoît Canet	3b1d3d977d	exceptions: Shutdown communications on non file I/O errors Apply the same treatment to non file filesystem I/O errors. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-2-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:54 +02:00
Benoît Canet	1fb9a48ac5	exception: Optionally shutdown communication on I/O errors. I/O errors cannot be fixed by Scylla; the only solution is to shut down the database communications. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:52 +02:00
Pekka Enberg	d4b4baad98	Merge "Add more information to query result digest" from Paweł "This series adds more information (i.e. keys and tombstones) to the query result digest in order to ensure correctness and increase the chances of early detection of disagreement between replicas. The digest is no longer computed by hashing query::result but built using the query result builder. That is necessary since the query result itself doesn't contain all information required to compute the digest. Another consequence of this is that now replicas asked for a result need to send both the result and the digest to the coordinator as it won't be able to compute the digest itself. Unfortunately, these patches change our on-wire communication: 1) hash computation is different 2) format of query::result is changed (and it is made non-final) Fixes #182."	2016-03-14 08:22:05 +02:00
Paweł Dziepak	82d2a2dccb	specify whether query::result, result_digest or both are needed Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Glauber Costa	a339296385	database: turn sstable generation number into an optional This patch makes sure that every time we need to create a new generation number - the very first step in the creation of a new SSTable - the respective CF is already initialized and populated. Failure to do so can lead to data being overwritten. Extensive details about why this is important can be found in Scylla's Github Issue #1014. Nothing should be writing to SSTables before we have the chance to populate the existing SSTables and calculate what the next generation number should be. However, if that happens, we want to protect against it in a way that does not involve overwriting existing tables. This is one of the ways to do it: every column family starts in an unwriteable state, and when it can finally be written to, we mark it as writeable. Note that this cannot be a part of add_column_family. That adds a column family to a db in memory only, and if anybody is about to write to a CF, that was most likely already called. We need to call this explicitly when we are sure we're ready to issue disk operations safely. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Glauber Costa	94e90d4a17	column_family: do not open code generation calculation We already have a function that wraps this, re-use it. This FIXME is still relevant, so just move it there. Let's not lose it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:05:47 -05:00
Glauber Costa	46fdeec60a	column_family: remove mutation_count We use memory usage as a threshold these days, and nowhere is _mutation_count checked. Get rid of it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:05:47 -05:00
Asias He	4abaacfc61	db: Introduce column_family_exists It is cheaper than throwing a no_such_column_family exception to test if a cf is gone, e.g., deleted.	2016-03-09 16:50:38 +08:00
Glauber Costa	8260b8fc6f	touch CF directories during startup We try to be robust against files disappearing (due to any kind of corruption) inside the data directory. But if the data directory itself goes missing, that's a situation that we don't handle correctly. We will keep accepting writes normally, but when we try to flush the memtable to disk, we'll fail with a system error. Having the CF directory disappearing is not a common thing. But it is also one that we can easily protect against, by touching all CF directories we know about on startup. Fixes #999 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <ed66373dccca11742150a6d08e21ece3980227d3.1457379853.git.glauber@scylladb.com>	2016-03-09 09:06:51 +02:00
Vlad Zolotarov	a45ecaf336	database: store "incremental backup" configuration value in per-shard instance Store the "incremental_backups" configuration value in the database class (and use it when creating a keyspace::config) in order to be able to modify it in runtime. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-03-06 17:22:48 +02:00
Paweł Dziepak	bdc23ae5b5	remove db/serializer.hh includes Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-02 09:07:09 +00:00
Raphael S. Carvalho	34ed930aa4	sstables: fix lack of accuracy in disk usage report To report disk usage, scylla was only taking into account the size of the sstable data component. Other components such as index and filter may be relatively big too. Therefore, 'nodetool status' would report an inaccurate disk usage. That can be fixed by taking into account the size of all sstable components. Fixes #943. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <08453585223570006ac4d25fe5fb909ad6c140a5.1456762244.git.raphaelsc@scylladb.com>	2016-03-01 08:58:42 +02:00
Tomasz Grabiec	6cec131432	query: Switch to IDL-generated views and writers The query result footprint for cassandra-stress mutation as reported by tests/memory-footprint increased by 18% from 285 B to 337 B. perf_simple_query shows slight regression in throughput (-8%): build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000 Before: ~433k tps After: ~400k tps	2016-02-26 12:26:13 +01:00
Avi Kivity	a74f68eeb2	Merge "Properly tag readers" from Glauber "Gleb has recently noted that our query reads are not even being registered with the I/O queue. Investigating what is happening, I found out that the priority that make_reader receives was not being properly passed downwards to the SSTable reader. The reader code is also used by the compaction class, and that one is fine. But the CQL reads are not. On top of that, there are also some other places where the tag was not properly propagated, and those are patched."	2016-02-25 18:35:58 +02:00
Raphael S. Carvalho	fc4cbcde72	Revert "Revert "database: Fix use and assumptions about pending compations"" This reverts commit `a4d92750eb`. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <8a405e7c1daf94c4d70d8084f59ce7205d56fe52.1456415398.git.raphaelsc@scylladb.com>	2016-02-25 18:02:01 +02:00
Pekka Enberg	a4d92750eb	Revert "database: Fix use and assumptions about pending compations" This reverts commit `9586793c70`. It breaks sstable_test as follows: [penberg@nero scylla]$ build/release/tests/sstable_test --smp 1 Running 81 test cases... INFO [shard 0] compaction_manager - Asked to stop INFO [shard 0] compaction_manager - Stopped sstable_test: database.cc:878: future<> column_family::run_compaction(sstables::compaction_descriptor): Assertion `_stats.pending_compactions > 0' failed. unknown location(0): fatal error in "compaction_manager_test": signal: SIGABRT (application abort requested) tests/sstable_datafile_test.cc(1023): last checkpoint	2016-02-25 15:28:06 +02:00
Calle Wilund	9586793c70	database: Fix use and assumptions about pending compations Fixes #934 - faulty assert in discard_sstables run_with_compaction_disabled clears out a CF from compaction manager queue. discard_sstables wants to assert on this, but looks at the wrong counters. pending_compactions is an indicator of how much interested parties want a CF compacted (again and again). It should not be considered an indicator of compactions actually being done. This modifies the usage slightly so that: 1.) The counter is always incremented, even if compaction is disallowed. The counter's value at the end of run_with_compaction_disabled is then instead used as an indicator as to whether a compaction should be re-triggered. (If compactions finished, it will be zero) 2.) Document the use and purpose of the pending counter, and add method to re-add CF to compaction for r_w_c_d above. 3.) discard_sstables now asserts on the right things. Message-Id: <1456332824-23349-1-git-send-email-calle@scylladb.com>	2016-02-25 08:57:04 +02:00
Glauber Costa	336babfcb8	database: add a priority class to a few SSTable readers Not all SSTable readers will end up getting the right tag for a priority class. In particular, the range reader, also used for the memtables, completely ignores any priority class. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-02-24 18:00:34 -05:00
Glauber Costa	2816bc6fed	database: use a reference instead of a pointer to store the priority classes We will always initialize it, so don't use a pointer. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-02-24 18:00:34 -05:00
Glauber Costa	80ab41a715	memtable reader: also include a priority class There are situations when a memtable is already flushed but the memtable reader will continue to be in place, relaying reads to the underlying table. For that reason, the "memtables don't need a priority class" argument gets obviously broken. We need to pass a priority class for its reader as well. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-02-24 18:00:34 -05:00
Calle Wilund	590ec1674b	truncate: Require timestamp join-function to ensure equal values Fixes #937 In fixing #884, truncation not truncating memtables properly, time stamping in truncate was made shard-local. This however breaks the snapshot logic, since for all shards in a truncate, the sstables should snapshot to the same location. This patch adds a required function argument to truncate (and by extension drop_column_family) that produces a time stamp in a "join" fashion (i.e. same on all shards), and utilizes the joinpoint type in caller to do so. Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>	2016-02-24 18:59:31 +02:00
Tomasz Grabiec	d3b7e143dc	db: Fix error handling in populate_keyspace() When find_uuid() fails Scylla would terminate with: Exiting on unhandled exception of type 'std::out_of_range': _Map_base::at But we are supposed to ignore directories for unknown column families. The try {} catch block is doing just that when no_such_column_family is thrown from the find_column_family() call which follows find_uuid(). Fix by converting std::out_of_range to no_such_column_family. Message-Id: <1456056280-3933-1-git-send-email-tgrabiec@scylladb.com>	2016-02-21 14:19:31 +02:00
Raphael S. Carvalho	55be1830ff	database: make column_family::rebuild_sstable_list safer If any of the allocation in rebuild_sstable_list fail, the system may be left with an incorrect set of sstables. It's probably safer to assign the new set of sstables as a last step. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <52b188262dcc06730dc9220b54ff6810d7dca1ae.1455835030.git.raphaelsc@scylladb.com>	2016-02-21 11:55:15 +02:00
Tomasz Grabiec	a921479e71	Merge tag '807-v3' from https://github.com/avikivity/scylla From Avi: This patchset introduces a linearization context for managed_bytes objects. Within this context, any scattered managed_bytes (found only in lsa regions, so limited to memtable and cache) are auto-linearized for the lifetime of the context. This ensures that key and value lookups can use fast contiguous iterators instead of using slow discontiguous iterators (or crashing, as is the case now).	2016-02-16 14:29:48 +01:00
Avi Kivity	3c60310e38	key: relax some APIs to accept partition_key_view instead of const partition_key& Using a partition_key_view can save an allocation in some cases. We will make use of it when we linearize a partition_key; during the process we are given a simple byte pointer, and constructing a partition_key from that requires an allocation.	2016-02-09 19:55:13 +02:00
Calle Wilund	18203a4244	database::truncate/drop: Move time stamp generation to shard Fixes #884 Time stamps for truncation must be generated after flush, either by splitting the truncate into two (or more) for-each-shard operations, or simply by doing time stamping per shard (this solution). We generate TS on each shard after flushing, and then rely on the actual stored value to be the highest time point generated. This should however, from batch replay point of view, be functionally equivalent. And not a problem.	2016-02-09 15:45:37 +00:00
Calle Wilund	873f87430d	database: Check sstable dir name UUID part when populating CF Fixes #870 Only load sstables from CF directories that match the current CF uuid. Message-Id: <1454938450-4338-1-git-send-email-calle@scylladb.com>	2016-02-08 14:48:19 +01:00
Avi Kivity	f3ca597a01	Merge "Sstable cleanup fixes" from Tomasz " - Added waiting for async cleanup on clean shutdown - Crash in the middle of sstable removal doesn't leave system in a non-bootable state"	2016-02-04 12:36:13 +02:00
Tomasz Grabiec	136c9d9247	sstables: Improve error message in case of generation duplication Refs #870.	2016-02-03 17:35:50 +01:00
Raphael S. Carvalho	a46aa47ab1	make sstables::compact_sstables return list of created sstables Now, sstables::compact_sstables() receives as input a list of sstables to be compacted, and outputs a list of sstables generated by compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0d8397f0395ce560a7c83cccf6e897a7f464d030.1454110234.git.raphaelsc@scylladb.com>	2016-01-31 12:39:20 +02:00
Raphael S. Carvalho	ee84f310d9	move deletion of sstables generated by interrupted compaction This deletion should be handled by sstables::compact_sstables, which is the responsible for creation of new sstables. It also simplifies the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <541206be2e910ab4edb1500b098eb5ebf29c6509.1454110234.git.raphaelsc@scylladb.com>	2016-01-31 12:39:20 +02:00
Raphael S. Carvalho	3b7970baff	compaction: delete generated sstables in event of an interrupt Generated sstables may be either fully or partially written. Compaction is interrupted if it was deliberately asked to stop (stop API) or it was forced to do so in the event of a failure, e.g. out of disk space. There is a need to explicitly delete sstables generated by a compaction that was interrupted. Otherwise, such sstables will waste disk space and even worsen read performance, which degrades as the number of generations to look at increases. Fixes #852. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <49212dbf485598ae839c8e174e28299f7127f63e.1453912119.git.raphaelsc@scylladb.com>	2016-01-28 14:05:57 +02:00
Tomasz Grabiec	9fa62af96b	database: Move implementation to .cc Message-Id: <1453980679-27226-1-git-send-email-tgrabiec@scylladb.com>	2016-01-28 13:35:33 +02:00
Glauber Costa	3f94070d4e	use auto&& instead of auto& for priority classes. By Avi's request, who reminds us that auto& is more suited for situations in which we are assigning to the variable in question. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <87c76520f4df8b8c152e60cac3b5fba5034f0b50.1453820373.git.glauber@scylladb.com>	2016-01-26 17:00:20 +02:00
Glauber Costa	b63611e148	mark I/O operations with priority classes After this patch, our I/O operations will be tagged into a specific priority class. The available classes are 5, and were defined in the previous patch: 1) memtable flush 2) commitlog writes 3) streaming mutation 4) SSTable compaction 5) CQL query Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Glauber Costa	f6cfb04d61	add a priority class to mutation readers SSTables already have a priority argument wired to their read path. However, most of our reads do not call that interface directly, but employ the services of a mutation reader instead. Some of those readers will be used to read through a mutation_source, and those have to be patched as well. Right now, whenever we need to pass a class, we pass Seastar's default priority class. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Glauber Costa	15336e7eb7	key_source: turn it into a class Its definition as a lambda function is inconvenient, because it does not allow us to use default values for parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Glauber Costa	58fdae33bd	mutation_source: turn it into a class Its definition as a lambda function is inconvenient, because it does not allow us to use default values for parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Vlad Zolotarov	c2ab54e9c7	sstables flushing: enable incremental backup (if requested) Enable incremental backup when sstables are flushed if incremental backup has been requested. It has been enabled in the regular flushing flow before but wasn't in the compaction flow. This patch enables it in both places and does it using a backup capability of sstable::write_components() method(s). Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-01-21 12:13:20 +02:00
Tomasz Grabiec	06d1f4b584	database: Print table name when printing mutation	2016-01-19 13:46:28 +01:00
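
The `service: fix refresh` entry (370b1336fe) describes reading the last key of a `std::map` through `end()` where `rbegin()` was intended. The difference can be sketched as follows. This is an illustration only, not code from the Scylla tree: `sstable_map`, `max_generation` and `next_unused_generation` are hypothetical names, the map's value type is a placeholder, and returning 1 for an empty map is an assumption. It needs C++17 for `std::optional`.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for column_family::_sstables: generation -> sstable name.
using sstable_map = std::map<long, std::string>;

// Returns the largest generation, or nothing when the map is empty.
// rbegin() refers to the last element of the ordered map; end() is the
// past-the-end iterator and dereferencing it is undefined behaviour,
// which is how a garbage value can end up being read.
std::optional<long> max_generation(const sstable_map& sstables) {
    if (sstables.empty()) {
        return std::nullopt;
    }
    return sstables.rbegin()->first;
}

// An unused generation obtained by adding one to the maximum, as the commit
// message describes. Starting at 1 for an empty map is an assumption.
long next_unused_generation(const sstable_map& sstables) {
    return max_generation(sstables).value_or(0) + 1;
}
```

With existing generations 3, 4, 5, 6 (scenario 2 in the commit message), `max_generation` yields 6 and `next_unused_generation` yields 7, so a reshuffle starting there would not touch an existing SSTable.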

511 Commits