scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 11:36:54 +00:00

Author	SHA1	Message	Date
Avi Kivity	d58c8aaa32	db: remove unused belongs_to_{current,other}_shard(s) functions Obsoleted by new sharding mechanism, but they break the build for some.	2016-11-23 21:39:29 +02:00
Avi Kivity	b81a57e8eb	config, dht: reduce default msb ignore bits to 4 With the default value of 12, a node's range is partitioned into 4096 * smp::count sub-ranges which are queried sequentially for a range scan. If the number of rows in the table is smaller than the required result size, we will query all of them. This can take so long that we time out. A better fix is to query multiple sub-ranges in parallel and merge them, but for that we need to resurrect the non-sequential merger.	2016-11-23 21:25:37 +02:00
Pekka Enberg	c526a9f0be	Update seastar submodule * seastar 7473945...d6f26d8 (2): > semaphore_units: add missing return statement > metrics: Do not destroy the metrics layer if it is being used	2016-11-23 20:27:09 +02:00
Paweł Dziepak	919825a2c7	Merge "Improve sharding in large clusters" from Avi "Clusters with a large number of nodes, or a low number of vnodes, and a high number of shards, or a combination, suffer from an aliasing problem: both vnodes and intra-node sharding consider the most significant bits to select the owning node and owning shard respectively. Since the same bits are used for both, a low number of vnodes leads to some shards being overcommitted relative to others. This series fixes the problem by sharding on bits 0:47 of the token (murmur3 partitioner only), leaving the most significant 12 bits for vnodes. Simulation shows that this value provides reasonable sharding for 100-node, 30-shard clusters. In order to prevent re-sharding sstables on each boot, token ranges for the range are stored in a new sub-component of the sstable Statistics component. With the default 12 ignored bits we have 4096 token ranges for non-Level-compacted SSTables, which takes some space but is still reasonable. Fixes #1277."	2016-11-23 11:25:53 +00:00
Glauber Costa	18b9fa3d43	dist: increase number of open files This limit was found to be too low for production environments. It would be hit at boot, when we're touching a lot of files from multiple shards before deciding that we don't need them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>	2016-11-23 13:10:25 +02:00
Avi Kivity	07d5a20bae	Wire up sharding ignore msb parameter to configuration We might have used a fancy map<sstring, any> to pass the parameters, but that's overkill for now.	2016-11-22 22:40:47 +02:00
Avi Kivity	8b1d689de8	partitioner: add ignore_msb parameters to byte ordered and random partitioners Ignored; doesn't make sense on byte ordered, and random is deprecated.	2016-11-22 21:56:42 +02:00
Avi Kivity	af16c0fac4	murmur3_partitioner: shard on the middle token bits, not most significant bits Sharding on the most significant token bits aliases with the vnode mechanism, which also uses the most significant bits; this requires a huge number of vnodes to achieve good sharding. This patch teaches the murmur3 partitioner to ignore the most significant N bits when calculating a token's shard, so we use token bits which still have some entropy. In effect, this changes the token range layout from shard 0 shard 1 ... shard S-1 to shard 0 shard 1 ... shard S-1 shard 0 shard 1 ... shard S-1 ... shard 0 shard 1 ... shard S-1 Where the number of repetitions of the block is 2^(ignored msb bits). For compatibility, the default is zero ignored bits, matching the pre-patch state, until we wire things up.	2016-11-22 21:56:42 +02:00
Avi Kivity	024c8ef8a1	db: adjust sstable load to use sstable self-reporting of shard ownership Instead of calculating the owning shard from the sstable's partition key range, delegate to the new sstable method for getting owning shard information. This insulates us from changes in the sharding algorithm.	2016-11-22 21:56:40 +02:00
Avi Kivity	98a4544e1c	sstables: add method to get sstable owning shards from an unloaded sstable When we load an sstable, we don't know beforehand which shards it belongs to; we don't want to open it until we do. Add a method that allows us to read just the sharding data, without opening anything else.	2016-11-22 21:52:23 +02:00
Avi Kivity	bdd11648ac	sstables: add intra-node sharding metadata Add a metadata component that describes token ranges that are spanned by this sstable. With the current sharding algorithm, where each shard owns a single token range, the first/last partition key is sufficient to describe sharding information, but for multi-range algorithms, this is not sufficient.	2016-11-22 21:44:25 +02:00
Avi Kivity	316ef1d70a	sstables: automate writing statistics components Add a virtual function to metadata_base so we can loop over statistics components when writing them.	2016-11-22 21:05:06 +02:00
Glauber Costa	13973e7f3b	keep background work semaphore alive during sstable flush We have a semaphore controlling the amount of background work generated by the memtable flush process. However, because we are not moving it inside the memtable post-flush continuation, the units are being released when we start the flush and not when we finish it. That's not the intended behavior and that can cause flushes to accumulate. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>	2016-11-22 19:54:08 +01:00
Avi Kivity	d05b22e502	sstables: automatically calculate offsets in statistics Instead of calculating the offset for each statistic component manually, use a loop to iterate over all components, accumulating the offset as we go along.	2016-11-22 20:35:24 +02:00
Avi Kivity	7c5e6525ef	sstables: switch statistics components to generic serialized_size() implementation	2016-11-22 20:20:38 +02:00
Avi Kivity	096ae59a5b	sstables: introduce generic serialized_size() Introduce a new function that reuses the file_writer code to compute the serialized size of an sstable object, by serializing it into memory and discarding the result.	2016-11-22 20:06:23 +02:00
Avi Kivity	3c06ffac9d	sstables: const correctness for the write(file_writer&, T&) functions write() doesn't need to change its input; so change it to const. The only snag is that describe_type() isn't and can't be made const-correct, so cheat when it is called and const_cast the input. This helps in writing a generic serialized_size() that is const correct, in the next patch.	2016-11-22 20:04:27 +02:00
Tomasz Grabiec	eefc538225	Update seastar submodule * seastar 7504026...7473945 (1): > Merge "Improve support for timeouts in primitives"	2016-11-22 17:51:29 +01:00
Glauber Costa	0b8b5abf16	commitlog: acquire semaphore earlier Recently we have changed our shutdown strategy to wait for the _request_controller semaphore to make sure no other allocations are in-flight. That was done to fix an actual issue. The problem is that this wasn't done early enough. We acquire the semaphore after we have already marked ourselves as _shutdown and released the timer. That means that if there is an allocation in flight that needs to use a new segment, it will never finish - and we'll therefore never acquire the semaphore. Fix it by acquiring it first. At this point the allocations will all be done and gone, and then we can shutdown everything else. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>	2016-11-21 22:19:32 +00:00
Avi Kivity	6bdb8ba31d	storage_proxy: don't query concurrently needlessly during range queries storage_proxy has an optimization where it tries to query multiple token ranges concurrently to satisfy very large requests (an optimization which is likely meaningless when paging is enabled, as it always should be). However, the rows-per-range code severely underestimates the number of rows per range, resulting in a large number of "read-ahead" internal queries being performed, the results of most of which are discarded. Fix by disabling this code. We should likely remove it completely, but let's start with a band-aid that can be backported. Fixes #1863. Message-Id: <20161120165741.2488-1-avi@scylladb.com>	2016-11-21 18:19:46 +02:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. 
Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Duarte Nunes	def2bc72b0	size_estimates_virtual_reader: Add unit test Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	6a37d87c76	db: Delete size_estimates_recorder Now that access to the size_estimates system is virtualized, we no longer need the recorder. Fixes #1616 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	225648780d	size_estimates: Add virtual reader This patch adds a virtual mutation_reader so that queries to the size_estimates system table are handled by the engine without needing to perform any IO. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	cd7e2fd602	column_family: Add support for virtual readers Virtual readers allow queries to selected tables, usually system tables, to be answered by the engine. This is useful for tables which aren't written by users and whose contents can be calculated on demand. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	c0d450c57d	storage_service: get_local_tokens() returns a future This patch changes the get_local_tokens() function in storage_service to return a future instead of requiring running under a seastar::thread. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	9b384d375f	nonwrapping_range: Add slice() function This patch adds the slice() function to nonwrapping_range, which uses its bounds to slice an input sequence. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	bdba8d99c3	range: Find a sequence's lower and upper bounds This patch extracts a pair of functions from mutation_partition to calculate the lower and upper bounds of a sequence from a nonwrapping_range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	636287fdf2	system_keyspace: Build mutations for size estimates This patch adds a function to system_keyspace responsible for creating a mutation to a partition of the size_estimates system table from a set of range_estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	18ddec245e	size_estimates: Store the token range as bytes This patch changes the range_estimates struct so that the tokens are represented as utf8 encoded bytes. This will make future patches require less conversions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:14:21 +00:00
Duarte Nunes	e7a5162c1d	range_estimates: Add schema This will be used in future patches, when virtualizing the size_estimates system table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 10:56:32 +00:00
Duarte Nunes	01815ecd24	murmur3_partitioner: Convert maximum_token to sstring This patch ensures we can convert the maximum_token to an sstring. For Cassandra, the minimum and maximum tokens have the same representation. So, we use the string representation of the minimum_token for the maximum_token. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 10:56:32 +00:00
Takuya ASADA	eee63027e5	dist/ami/build_ami.sh: update base AMI to CentOS7-Base5 To drop unnecessary .ssh/authorized_keys, we need to update base AMI. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>	2016-11-21 10:12:47 +02:00
Avi Kivity	783729c540	Merge "Clean up T::memory_usage() function" from Paweł "This series is just a cleanup whose intention is to deal with all confusion related to the way T::memory_usage() functions work. * T::memory_usage() functions which returned external memory usage are renamed to T::external_memory_usage() * T::memory_usage() is introduced where needed to avoid repeating sizeof(T) + T::external_memory_usage()" Paweł Dziepak (6): rename memory_usage() to external_memory_usage() where applicable streamed_mutation: add memory_usage() to mutation fragment types keys: add memory_usage() partition_snapshot_accounter: use range_tombstone::memory_usage() mutation_rebuilder: use memory_usage() frozen_mutation: use memory_usage()	2016-11-21 10:11:39 +02:00
Avi Kivity	498887ca0d	Merge seastar upstream * seastar 31c5fd7...7504026 (2): > circular_buffer: add move assignment operator > scollectd: Fix serialization of GAUGE-typed values	2016-11-20 20:16:56 +02:00
Gleb Natapov	9222a47fed	sstable test: add test for generated summary data Message-Id: <20161117155051.GV6765@scylladb.com>	2016-11-20 19:50:45 +02:00
Glauber Costa	21c1e2b48c	commitlog: wait for pending allocations to finish before closing gate. allocations may enter the gate, so it would be wise for us to wait for them. Fixes #1860 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>	2016-11-20 19:45:33 +02:00
Avi Kivity	a39b92a40a	build: fix tests-with-symbols generation Bad indentation caused the libs variable for tests-with-symbols to be overwritten, resulting in link failure.	2016-11-20 17:23:26 +02:00
Glauber Costa	504b5ac30f	database: don't check for waiters in the condition variable predicate. In the last iterations of this patchset, we have moved explicit flushes to acquire the semaphore directly and the coalescing inside the memtable_list. As a result, we are no longer keeping any kind of action for them inside the condition variable. Checking for them no longer has a purpose. This is a cleanup patch that removes those checks. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>	2016-11-18 21:34:48 +01:00
Glauber Costa	1933349654	database: fix direct flushes of non-durable column families. If a Column Family is non-durable, then its flushes will never create a memtable flush reader. Our current flush logic depends on that being created and destroyed to release the semaphore permits on the flush. We will remove the permits ourselves if there is an exception, but not under normal circumstances. Given this issue, however, it would be more adequate to always try to remove the permits after we flush. If the permits were already removed by the flush reader, then this test will just see that the permit is not in the map and return. But if it is still there, then it is removed. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>	2016-11-18 21:32:29 +01:00
Avi Kivity	6eecbc80dc	CONTRIBUTING.md: add sections for help and issues Don't scare away users reporting an issue with the CLA.	2016-11-18 22:21:10 +02:00
Glauber Costa	60b7d35f15	commitlog: close file after read, and not at stop There are other code paths that may interrupt the read in the middle and bypass stop. It's safer this way. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>	2016-11-18 14:09:33 +00:00
Paweł Dziepak	249e0ab087	frozen_mutation: use memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	948c062e64	mutation_rebuilder: use memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	e04664e851	partition_snapshot_accounter: use range_tombstone::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	711bd19f16	keys: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	6b8bf030c0	streamed_mutation: add memory_usage() to mutation fragment types This patch introduces memory_usage() to static_row, clustering_row and range_tombstone so that we can avoid repeating sizeof(T) + x.external_memory_usage(). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Avi Kivity	fec4ef3390	Merge "Make sure commitlog replay is able to make progress" from Glauber "Fixes #1856 Commitlog replay reads are being issued without a priority. That means they will lose to compaction every time." * 'issue-1856-v2' of github.com:glommer/scylla: commitlog: use read ahead for replay requests commitlog: use commitlog priority for replay commitlog: close replay file	2016-11-18 12:04:18 +02:00
Takuya ASADA	55e5123313	dist/redhat: Support RHEL7 We supported installing the CentOS7 .rpm on RHEL7, but we haven't supported building on RHEL7, since there is a little difference from CentOS, and that causes a build error. This patch fixes the error; now we can produce the .rpm for RHEL7 without using CentOS. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>	2016-11-18 11:56:05 +02:00
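
The sharding change described in commits af16c0fac4 and b81a57e8eb above can be illustrated with a small sketch. This is not ScyllaDB's actual code: the names shard_of and sub_ranges_per_node are hypothetical, and scaling by the top 32 bits is a simplification chosen to keep the example portable. It only assumes what the commit messages state: tokens are signed 64-bit murmur3 values, the most significant N bits are ignored when choosing a shard, and the per-shard block then repeats 2^N times across the token range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the real murmur3_partitioner code): pick the
// owning shard for a token while ignoring the top `ignore_msb_bits` bits.
inline unsigned shard_of(int64_t token, unsigned shard_count, unsigned ignore_msb_bits) {
    assert(shard_count > 0 && ignore_msb_bits < 64);
    // Bias the signed token into unsigned order:
    // INT64_MIN maps to 0 and INT64_MAX maps to 2^64 - 1.
    uint64_t biased = static_cast<uint64_t>(token) + (uint64_t(1) << 63);
    // Shift out the ignored most significant bits, so the bits that
    // select the vnode no longer select the shard as well.
    uint64_t shifted = biased << ignore_msb_bits;
    // Treat the top 32 remaining bits as a fraction in [0, 1) and
    // scale it by the shard count.
    uint64_t frac = shifted >> 32;
    return static_cast<unsigned>((frac * shard_count) >> 32);
}

// Hypothetical helper: number of contiguous sub-ranges a node's token
// range is split into, i.e. 2^(ignored msb bits) * shard count.
inline uint64_t sub_ranges_per_node(unsigned shard_count, unsigned ignore_msb_bits) {
    assert(ignore_msb_bits < 64);
    return (uint64_t(1) << ignore_msb_bits) * shard_count;
}
```

With zero ignored bits the whole token range is cut into one block of shard_count slices, matching the pre-patch layout. With one ignored bit the block appears twice, so a token of 0 (the midpoint of the range) lands on shard 0 again. sub_ranges_per_node(30, 12) gives the 4096 * smp::count figure from b81a57e8eb for a 30-shard node, which drops to 16 * smp::count with the reduced default of 4 bits.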


11716 Commits