scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	1857ba0abc	db: fix bad resource usage distribution when resharding due to refresh That's because a single shard is used to calculate generation for new sstables in upload directory, and that will result in that single shard sharing all the resources with other shards. For refresh without upload dir, it currently works fine because we reshuffle column family dir instead. flush_upload_dir() is now a free function, takes a distributed database object, and uses calculate_shard_from_sstable_generation() to decide which shard will move sstable using its own generation namespace. Fixes #2008. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>	2017-01-19 18:55:21 +02:00
Duarte Nunes	d53f96e0da	column_family: Only update stats once for a shared sstable This patch ensures that when adding a shared sstable, we select only one cpu to update that column family's stats. This is important so we don't overestimate the on-disk size of sstables when resharding. This fixes only a temporary miscount of the current load, since shared sstables are eventually re-written, but fixes a permanent miscount of the total load. Refs #1592 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170119144823.31041-1-duarte@scylladb.com>	2017-01-19 17:40:35 +02:00
Tomasz Grabiec	ea9ab36ad5	db: Move operator<<() definition to .cc Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:43 +02:00
Vlad Zolotarov	cda382e8d6	database: move collectd registrations to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Raphael S. Carvalho	68dfcf5256	db: avoid excessive memory usage during resharding After resharding, sstables may be owned by all shards, which means that file descriptors and memory usage for metadata will increase by a factor equal to the number of shards. That can easily lead to OOM. SSTable components are immutable, so they can be stored in one shard and shared with others that need them. We use the following formula to decide which shard will open the sstable and share it with the others: (generation % smp::count), which is the inverse of how we calculate generation for new sstables. So if no resharding is performed, everything is shard-local. With this approach, resource usage due to loaded sstables will be evenly distributed among shards. For this approach to work, we now only populate keyspaces from shard 0, which is now solely responsible for iterating through column family dirs. In addition, most of the population functions are now free functions and take a distributed database object as a parameter. Fixes #1951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-09 15:24:36 -02:00
Avi Kivity	be11b054e1	Merge "Reduce the size of mutation_partition" from Piotr "Reduce the size of mutation_partition by implementing intrusive set using bi::rbtree_algorithms directly and using tree nodes optimized for size. This will reduce the size of mutation_partition by: 24 bytes + <number of cql rows> * 8 bytes This should have a positive impact on performance because mutation_partitions are stored both in memtable and cache. Fixes #742." * 'haaawk/742' of github.com:cloudius-systems/seastar-dev: intrusive_set: rename size() to calculate_size() Make intrusive_set_external_comparator::_value_traits static Implement intrusive set using rbtree_algorithms mutation_partition: make apply_reversibly_intrusive_set nongeneric mutation_partition: take schema in find_row and clustered_row mutation_partition: Extract intrusive set logic to a class. mutation_partition: Replace value_comp with key_comp calls	2017-01-05 17:34:10 +02:00
Tomasz Grabiec	cd630fece6	db: Make system tables use the commitlog Before this patch, system table writes were not written to the commit log because database::add_column_family() disables commit log writes for the table being added if _commitlog is not set at that time. Fix by initializing the commit log before system tables are created. Fixes #1986. Fixes recent regression in batch_test.py:TestBatch.replay_after_schema_change_test after scylla-jmx was updated to not flush system tables on nodetool flush. Could cause system keyspace writes to be delayed longer than before under heavy write workload. Refs #1926. Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>	2017-01-05 14:53:51 +02:00
Piotr Jastrzebski	4bbe05dd47	mutation_partition: take schema in find_row and clustered_row This will allow intrusive set implementation that does not store schema. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Paweł Dziepak	1a52569f7d	storage_proxy: pass maximum result size to replicas We may want to change the default individual result size limit in the future. If it is provided by the coordinator and not hardcoded in the replicas this can be done without causing data query digest mismatches or wasteful mutation query results.	2016-12-22 17:16:23 +01:00
Paweł Dziepak	a0523df8d6	result_memory_limiter: add accounter for digest reads Digest reads differ from data reads in that they do not really consume any memory. We still want them to stop in the same place that data reads would, but the per-shard semaphore shouldn't be updated by them.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	aa083d3d85	result_memory_limiter: split new_read() to new_{data, mutation}_read() For data queries it is very important that all replicas get limited in the same place (this includes replicas returning only digest). That's why they shouldn't be affected by the per-shard result memory limit. Moreover, we should make sure that individual memory limits are the same; having the coordinator provide the limit to replicas allows it to be changed safely in the future. Mutation queries are not as sensitive but it is still beneficial to make sure that all replicas use the same individual limit.	2016-12-22 13:35:04 +01:00
Raphael S. Carvalho	27fb8ec512	db: avoid excessive disk usage during sstable resharding Shared sstables will now be resharded in the same order to guarantee that all shards owning an sstable will agree on its deletion at nearly the same time, thereby reducing the disk space requirement. That's done by picking which column family to reshard in UUID order, and each individual column family will reshard its shared sstables in generation order. Fixes #1952. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>	2016-12-21 23:18:06 +02:00
Avi Kivity	875635554d	Merge "Reduce overhead of partition presence checker during cache update" from Tomasz Refs #1943. * 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev: db: Compute key hash once in partition_presence_checker bloom_filter: Allow checking presence using pre-hashed key db: Use incremental selector in partition_presence_checker	2016-12-21 14:24:54 +02:00
Duarte Nunes	3fd79bb6d6	schema_tables: Merge views for schema merging Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	06ab61a570	schema_tables: Extract update_column_family This patch extracts update_column_family from schema_tables into database so it can be used when adding materialized views, in future patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	ecc4290bc6	database: Remove view from base table upon drop This patch changes the drop_column_family() function to remove a view schema from the list of views of its base table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	4f166cfa6a	database: Parse views schema table upon init This patch adds code for parsing the views schema table upon init and also ensures that when adding a view column family, we add it to its base table's list of views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	40c684b5f5	database: Extract common create cf code This patch moves some duplicate code into the add_column_family_and_create_directory() function. It also saves some superfluous keyspace lookups and readies the code to be used by materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	2b231f22b8	keyspace_metadata: Add tables() and views() functions This patch adds utility functions to keyspace_metadata to select only the tables or only the views out of all the schemas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	7818339791	materialized views: Add view class This patch adds the view class, which will contain functions related to populating a view, either from the base table's write path or from the view building mechanism which copies over already existing data in the base table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Tomasz Grabiec	0e487b3499	db: Compute key hash once in partition_presence_checker I measured reduction of cache update time by 20% for 6 sstables and by 40% for 16. Refs #1943.	2016-12-19 14:20:58 +01:00
Tomasz Grabiec	78844fa2e5	db: Use incremental selector in partition_presence_checker This reduces the number of sstables we need to check to only those whose token range overlaps with the key. Reduces cache update time. Especially effective with leveled compaction strategy. Refs #1943. Incremental selector works with an immutable sstable set, so cache updates need to be serialized. Otherwise we could mispopulate due to stale presence information. Presence checker interface was changed to accept decorated key in order to gain easy access to the token, which is required by the incremental selector.	2016-12-19 14:20:58 +01:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	85034c1b57	Convert to use dht::partition_range	2016-12-19 08:04:30 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Avi Kivity	6bb875bdb7	Merge "storage_proxy: Enforce partition limit" from Duarte "This patchset ensures the partition limit is enforced at the storage_proxy level. To achieve this, we add the partition count to query::result, and allow the result_merger to trim excess partitions." * 'enforce-partition-limit/v3' of https://github.com/duarten/scylla: storage_proxy: Decrease limits when retrying command storage_proxy: Don't fetch superfluous partitions query::result: Add partition count column_family: Use counters in query::result::builder query_result_builder: Use the underlying counters mutation_partition: Count partitions in query_compacted mutation_partition: Remove tabs in query_compacted query::result::builder: Add partition count query_result_merger: Limit partitions	2016-12-16 13:57:37 +02:00
Glauber Costa	7133583797	track streaming and system virtual dirty memory A case could be made that we should have counters for them no matter what, since it can help us reason about the distribution of memory among the groups. But with the hierarchy being broken in 1.5 it becomes even more important. Now by looking solely at dirty, we will have no idea about how much memory we are using in those groups. After this patch, the dirty_memory_manager will register its metrics for the 3 groups that we have, and the legacy names will be used to show totals. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>	2016-12-16 10:59:40 +02:00
Paweł Dziepak	cf679a413c	db: use multi range reader for streaming readers A naive approach was to create a set of readers for each range and pass them all to combining reader. This however performed badly if the number of ranges was high. The solution is to use multi range reader which uses only a single set of readers and fast forwards from range to range when necessary. This adds another requirement that the ranges passed to make_streaming_reader() are sorted and disjoint.	2016-12-15 13:54:43 +00:00
Duarte Nunes	781cd82cb8	column_family: Use counters in query::result::builder This patch changes column_family::query() to use the counters in the builder to determine how many partitions and rows to ask for and also to implement the stop condition. This saves a continuation to do the bookkeeping, and allows us to remove data_query_result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Paweł Dziepak	cfd4d0f680	db: add metrics for short reads and memory used for results Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	15de8de9e5	reconcilable_result: keep result_memory_tracker object Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Avi Kivity	a61ff53150	Merge "rework flush criteria" from Glauber "The current criteria for memtable flush is not being respected. The problem is demonstrated to happen when the dirty memory group is over limit, and so is the system table extra allowance. In that situation, both the normal region and the system table region will be under pressure and try to flush. More specifically, because the normal region inherits from the system region, if the normal region is under pressure (over the soft limit threshold), the system region will certainly be as well, even though it has an extra allowance. This is because after virtual dirty, we start blocking when we reach half the region, but memory itself can grow up to 100 % of the region. So the total amount of memory used will be certainly bigger than the system pressure threshold, which is now 50 % plus the allowance. To fix that, this patch reworks the flush logic so that the regions are not dependent on each other. Fixes #1918" * 'flush-criteria-v6' of github.com:glommer/scylla: config: get rid of memtable_total_space database: rework dirty memory hierarchy system keyspace: write batchlog mutation in user memory database: remove flush_token database: abstract pressure condition notification database: encapsulate semaphore_units into a flush_permit database: remove friendship declaration database: simplify flush_one database: make memtable_list aware in cases it can't flush	2016-12-14 11:24:10 +02:00
Glauber Costa	2aa6514667	config: get rid of memtable_total_space Those values are now statically set. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 17:05:12 -05:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to come under soft pressure when regular was because writes to regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions and to guarantee that it won't happen we would have to acquire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems on its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors. To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happens we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Glauber Costa	98030ad66c	database: abstract pressure condition notification Done in a separate patch to reduce clutter in the main patch. Soon we'll be testing for one more condition. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	bb1509c21e	database: simplify flush_one flush_one has to make sure that we're using the correct dirty_memory_manager object, because we could be flushing from a region group different than the one the flush request originated. It's simpler to just assume flush_one will be dealing with the right object, and use a different object instead of "this" when calling it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	8ab7c04caa	database: make memtable_list aware in cases it can't flush Some of our CFs can't be flushed. Those are the ones that are not marked as having durable writes. We treat them just the same from the point of view of the flush logic, but they provide a function that doesn't do anything and just returns right away. We already had trouble with that in the past, and that also poses a problem for an upcoming patch reworking the flush memtable pick criteria. It's easier, simpler, and cleaner to just make the memtable_list aware it can't flush. Achieving that is also not very complicated: we just need a special constructor that doesn't take a seal function and then we make sure that it is initialized to an empty std::function Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Duarte Nunes	1e75a4950e	database: Complete query when hitting partition limit Currently, we don't complete a query as early as possible if it reaches the partition limit; instead we wait until reaching the end of the specified partition ranges. This patch fixes that by including a check of the partition limit in the termination condition. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161213114559.26438-1-duarte@scylladb.com>	2016-12-13 14:53:46 +02:00
Asias He	cd2105b8bd	database: make_streaming_reader for ranges Allow to make a streaming reader with a vector of ranges in addition to a single range. This will be used soon in following streaming patch. We can make the reader more efficient later.	2016-12-12 09:04:21 +08:00
Raphael S. Carvalho	fcfc84e836	compaction: reduce bloom filter overhead with incremental selector The procedure to calculate max purgeable timestamp is optimized by only visiting sstables that overlap with key being currently compacted. That's done using incremental sstable selector. Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. Fixes #1322. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Glauber Costa	733d87fcc6	database: try to acquire semaphore before we start flush As Tomek pointed out, as we are starting the flush before we acquire the semaphore, we are not really limiting parallelism, but only delaying the end of the flush instead. Fixes #1919 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>	2016-12-08 12:18:32 +01:00
Tomasz Grabiec	527ff6aa40	db: Clear memtable after flush when cache is disabled So that memory is released gradually (impacting latency less) and sooner than when memtable is destroyed. Active readers may keep the memtable alive for unbounded amount of time. Refs #1879	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	b5d5612f98	database: Add counter for timed out writes	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	2c561ecaed	db: Allow writes to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	b1ae6ad2ad	db: Introduce counters for failed reads and writes	2016-11-29 16:40:58 +01:00
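
Commit 68dfcf5256 above selects the shard that opens a shared sstable with (generation % smp::count) and describes this as the inverse of how generations are assigned to new sstables. The following is a minimal Python sketch of that relationship, not Scylla code. The forward formula (base * shard_count + shard) and every function name here are assumptions made for illustration; shard_count stands in for smp::count.

```python
from typing import Iterable, List


def generation_for_new_sstable(base: int, shard: int, shard_count: int) -> int:
    """Assumed scheme: the creating shard is encoded as generation mod shard_count."""
    if not 0 <= shard < shard_count:
        raise ValueError("shard out of range")
    return base * shard_count + shard


def owner_shard(generation: int, shard_count: int) -> int:
    """Formula quoted in the commit message: generation % smp::count."""
    return generation % shard_count


def sstables_per_shard(generations: Iterable[int], shard_count: int) -> List[int]:
    """Count how many sstables each shard would open under owner_shard()."""
    loads = [0] * shard_count
    for generation in generations:
        loads[owner_shard(generation, shard_count)] += 1
    return loads
```

Under the assumed forward formula, an sstable created on a shard maps back to that same shard, so nothing changes hands unless the shard count changes. With consecutive generations the opened sstables spread evenly over the shards.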

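Commit cf679a413c above adds the requirement that the ranges passed to make_streaming_reader() are sorted and disjoint, so that a single set of readers can fast forward from one range to the next. A small Python sketch of such a precondition check follows. It models ranges as half-open integer intervals, which is an assumption for illustration only; real token ranges have inclusive/exclusive bounds and wrap-around rules that this sketch ignores, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical model of a range: half-open interval [start, end) over integers.
Range = Tuple[int, int]


def is_sorted_and_disjoint(ranges: Sequence[Range]) -> bool:
    """Return True if every range is non-empty and each one ends
    at or before the start of the next."""
    for start, end in ranges:
        if start >= end:
            return False  # empty or inverted range
    return all(prev[1] <= cur[0] for prev, cur in zip(ranges, ranges[1:]))
```

A reader that only moves forward can serve [(0, 5), (5, 9)] in one pass, while [(0, 5), (4, 9)] would revisit keys and [(5, 9), (0, 5)] would require seeking backwards.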

707 Commits