The type of the given entry may not be available, which manifests as an
ENOENT or ENOTDIR value set in errno by the fstat() call for this entry.
In this case engine().file_type() will return a disengaged
optional<directory_entry_type> value.
Return a future carrying a std::runtime_error exception in this case.
This prevents any further use of the disengaged optional value by the code
in the normal flow.
The exception is propagated to the caller, and it is the caller's
responsibility to handle it.
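A minimal sketch of the resulting check, assuming the reactor API of the
time, where engine().file_type() yields a future<optional<directory_entry_type>>
(the wrapper name is illustrative):

    #include "core/reactor.hh"

    future<directory_entry_type> checked_file_type(sstring name) {
        return engine().file_type(name).then([] (auto type_opt) {
            if (!type_opt) {
                // fstat() reported ENOENT/ENOTDIR: fail the future instead
                // of letting the normal flow touch a disengaged optional.
                return make_exception_future<directory_entry_type>(
                        std::runtime_error("file type is unavailable"));
            }
            return make_ready_future<directory_entry_type>(*type_opt);
        });
    }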
Fixes #2071
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.
This would add latency and histogram and the missing metrics from column family."
* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
  database: add metrics registration for the column family
storage_proxy: add read and write latency histogram
estimated_histogram: returns a metrics histogram
This patch adds metrics registration to the column_family.
Using labels, each table's metrics are labeled with its keyspace and column
family name.
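A minimal sketch of per-table registration, assuming the seastar::metrics
API; the metric names and the stats wired up here are illustrative, and
histograms register the same way:

    namespace sm = seastar::metrics;

    void column_family::set_metrics() {
        // _metrics is an sm::metric_groups member; the label instances
        // carry the keyspace and column family names on every series.
        auto ks = sm::label("ks")(_schema->ks_name());
        auto cf = sm::label("cf")(_schema->cf_name());
        _metrics.add_group("column_family", {
            sm::make_counter("total_writes", _stats.writes,
                    sm::description("Total number of writes"), {ks, cf}),
            sm::make_counter("total_reads", _stats.reads,
                    sm::description("Total number of reads"), {ks, cf}),
        });
    }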
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the database apply path so that it also
generates the mutations for the column family's views and
sends them to the paired view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This adds the "VIEW" write type to the "write_type" enum.
To be honest, I don't understand why the "write_type" distinction
is important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
By removing the default case in the switch statement over a write_type
variable, we ensure the compiler warns us about lack of exhaustiveness
in case we add a value to the enum but forget to change the
corresponding operator<<().
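A sketch of the resulting operator; the enumerator list mirrors the write
types named elsewhere in the series plus VIEW, though the exact set may
differ:

    #include <ostream>
    #include <cstdlib>

    namespace db {
    enum class write_type { SIMPLE, BATCH, UNLOGGED_BATCH, COUNTER, BATCH_LOG, CAS, VIEW };
    }

    std::ostream& operator<<(std::ostream& os, db::write_type type) {
        // No default case: with -Wswitch, adding an enumerator without
        // updating this switch produces a compiler warning.
        switch (type) {
        case db::write_type::SIMPLE: return os << "SIMPLE";
        case db::write_type::BATCH: return os << "BATCH";
        case db::write_type::UNLOGGED_BATCH: return os << "UNLOGGED_BATCH";
        case db::write_type::COUNTER: return os << "COUNTER";
        case db::write_type::BATCH_LOG: return os << "BATCH_LOG";
        case db::write_type::CAS: return os << "CAS";
        case db::write_type::VIEW: return os << "VIEW";
        }
        abort(); // unreachable
    }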
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the generate_view_updates() function to the
column_family class, which will use the view_update_builder to
generate updates to the column_family's materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
To help calculate the view mutations from a base update, we store in
the view class the column that's part of the view's primary key but
not part of the base's, if such a column exists.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of storing the view in the column_family's map of materialized
views, store a lw_shared_ptr so that the view can be removed while it
is being updated.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
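A sketch of the single releasing fiber (member names hypothetical; _relief
is a seastar::condition_variable): one loop per region group, woken when
pressure is lifted, draining blocked requests until pressure builds again.

    future<> region_group::release_blocked_requests() {
        return do_until([this] { return _shutdown_requested; }, [this] {
            // Sleep until pressure is lifted and there is work to do; the
            // condition variable replaces the old per-size-change tasks.
            return _relief.wait([this] {
                return _shutdown_requested
                        || (!under_pressure() && !_blocked_requests.empty());
            }).then([this] {
                while (!under_pressure() && !_blocked_requests.empty()) {
                    auto req = std::move(_blocked_requests.front());
                    _blocked_requests.pop_front();
                    req(); // execute the previously blocked allocation
                }
            });
        });
    }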
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
That's because a single shard is used to calculate the generation for new
sstables in the upload directory, and that results in that single shard
sharing all the resources with the other shards.
For refresh without the upload dir, this currently works fine because we
reshuffle the column family dir instead.
flush_upload_dir() is now a free function; it takes a distributed database
object and uses calculate_shard_from_sstable_generation() to decide which
shard will move each sstable, using its own generation namespace.
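Sketched shape of the change (scan details elided): the function is free,
takes the distributed database, and routes each sstable to the shard that
owns its generation namespace.

    static unsigned calculate_shard_from_sstable_generation(int64_t generation) {
        // Inverse of how new generations are allocated per shard.
        return unsigned(generation % smp::count);
    }

    future<> flush_upload_dir(distributed<database>& db, sstring ks, sstring cf) {
        // For each generation found in the upload dir (scan elided):
        //   auto shard = calculate_shard_from_sstable_generation(gen);
        //   db.invoke_on(shard, ...) moves the sstable on the shard that
        //   owns that generation namespace.
        return make_ready_future<>();
    }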
Fixes #2008.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimate the on-disk size of sstables when resharding.
This fixes not only a temporary miscount of the current load, since shared
sstables are eventually re-written, but also a permanent miscount
of the total load.
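Roughly (function placement and stat names illustrative): only the shard
picked from the generation accounts the shared sstable.

    void column_family::add_sstable(lw_shared_ptr<sstables::sstable> sst) {
        if (engine().cpu_id() == sst->generation() % smp::count) {
            // Sole shard responsible for this sstable's stats update, so
            // the on-disk size isn't multiplied by smp::count.
            _stats.live_disk_space_used += sst->bytes_on_disk();
            _stats.total_disk_space_used += sst->bytes_on_disk();
        }
        // ... proceed to register the sstable for reads on this shard ...
    }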
Refs #1592
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
For this approach to work, we now only populate keyspaces from
shard 0, which is now solely responsible for iterating through
column family dirs. In addition, most of the population functions
are now free and take the distributed database object as a parameter.
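A sketch of the sharing this enables (the loader helper is hypothetical):
the owner shard loads the immutable components once, and other shards hold
a foreign_ptr to that copy.

    future<foreign_ptr<lw_shared_ptr<sstables::sstable>>>
    open_shared_sstable(int64_t generation) {
        unsigned owner = generation % smp::count; // inverse of generation allocation
        return smp::submit_to(owner, [generation] {
            // load_sstable() is a stand-in for the owner-side open path.
            return make_foreign(load_sstable(generation));
        });
    }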
Fixes #1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.
This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes
This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.
Fixes #742."
* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
intrusive_set: rename size() to calculate_size()
Make intrusive_set_external_comparator::_value_traits static
Implement intrusive set using rbtree_algorithms
mutation_partition: make apply_reversibly_intrusive_set nongeneric
mutation_partition: take schema in find_row and clustered_row
mutation_partition: Extract intrusive set logic to a class.
mutation_partition: Replace value_comp with key_comp calls
Before this patch, system table writes were not going to the commit log,
because database::add_column_family() disables commit log writes for the
table being added if _commitlog is not set at that time. Fix by
initializing the commit log before system tables are created.
Fixes #1986.
Fixes a recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.
This could cause system keyspace writes to be delayed for longer than
before under a heavy write workload. Refs #1926.
Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator rather than hardcoded in the
replicas, this can be done without causing data query digest mismatches
or wasteful mutation query results.
Digest reads differ from data reads in that they do not really
consume any memory. We still want them to stop in the same place that
data reads would, but the per-shard semaphore shouldn't be updated by
them.
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only a digest). That's
why they shouldn't be affected by the per-shard result memory limit.
Moreover, we should make sure that the individual memory limits are the
same by making the coordinator provide the limit to the replicas, which
allows safely changing it in the future.
Mutation queries are not as sensitive, but it is still beneficial to make
sure that all replicas use the same individual limit.
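A sketch of the two points above (type and field names illustrative):

    // The limit travels with the read command, filled in once by the
    // coordinator, so every replica stops at exactly the same point.
    struct read_command {
        uint64_t max_result_size; // bytes; set by the coordinator
        // ... partition ranges, slice, row/partition limits ...
    };

    void account_result_memory(seastar::semaphore& shard_result_memory,
                               size_t bytes, bool digest_read) {
        // Digest reads stop where data reads would, but hold no result
        // memory, so they leave the per-shard semaphore untouched.
        if (!digest_read) {
            shard_result_memory.consume(bytes);
        }
    }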
Shared sstables will now be resharded in the same order to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, therefore reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
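Roughly (container and pointer types illustrative): a deterministic order
shared by all shards.

    // Pick tables in UUID order, and within each table reshard shared
    // sstables in generation order, so all shards finish with a given
    // sstable at about the same time.
    std::sort(tables.begin(), tables.end(), [] (auto& a, auto& b) {
        return a->schema()->id() < b->schema()->id();
    });
    std::sort(shared_sstables.begin(), shared_sstables.end(), [] (auto& a, auto& b) {
        return a->generation() < b->generation();
    });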
Fixes #1952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
Refs #1943.
* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
db: Compute key hash once in partition_presence_checker
bloom_filter: Allow checking presence using pre-hashed key
db: Use incremental selector in partition_presence_checker
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the drop_column_family() function to remove
a view schema from the list of views of its base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds code for parsing the views schema table upon init and
also ensures that when adding a view column family, we add it to
its base table's list of views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.
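Sketch of the selectors (signatures illustrative, assuming _cf_meta_data
maps names to schema_ptr and schema::is_view() distinguishes views):

    std::vector<schema_ptr> keyspace_metadata::tables() const {
        std::vector<schema_ptr> ret;
        for (auto&& e : _cf_meta_data) {
            if (!e.second->is_view()) {
                ret.push_back(e.second);
            }
        }
        return ret;
    }

    std::vector<view_ptr> keyspace_metadata::views() const {
        std::vector<view_ptr> ret;
        for (auto&& e : _cf_meta_data) {
            if (e.second->is_view()) {
                ret.push_back(view_ptr(e.second));
            }
        }
        return ret;
    }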
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view class, which will contain functions related
to populating a view, either from the base table's write path or from
the view building mechanism, which copies over already existing data in
the base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.
Refs #1943.
Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.
The presence checker interface was changed to accept a decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
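Roughly (API names approximate): the checker receives the decorated key,
narrows the candidates to sstables covering its token, and hashes the key
once for all of them.

    partition_presence_checker
    make_partition_presence_checker(schema_ptr s, lw_shared_ptr<sstables::sstable_set> ssts) {
        return [s, sel = make_lw_shared(ssts->make_incremental_selector())]
                (const dht::decorated_key& dk) {
            // Hash once, up front, instead of once per sstable.
            auto hk = sstables::sstable::make_hashed_key(
                    key::from_partition_key(*s, dk.key()));
            for (auto& sst : sel->select(dk.token()).sstables) {
                if (sst->filter_has_key(hk)) {
                    return partition_presence_checker_result::maybe_exists;
                }
            }
            return partition_presence_checker_result::definitely_doesnt_exist;
        };
    }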
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5, it becomes even
more important. Now, by looking solely at dirty, we would have no idea
how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
A naive approach was to create a set of readers for each range and pass
them all to a combining reader. This, however, performed badly when the
number of ranges was high.
The solution is to use a multi range reader, which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds the requirement that the ranges passed to
make_streaming_reader() be sorted and disjoint.
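A sketch, assuming a reader that exposes fast_forward_to(range) and a
hypothetical consume_all() drain: one reader stack serves every range in
order, and the ranges must be sorted and disjoint.

    future<> stream_ranges(mutation_reader rd, std::vector<dht::partition_range> ranges) {
        return do_with(std::move(rd), std::move(ranges),
                [] (mutation_reader& rd, std::vector<dht::partition_range>& ranges) {
            return do_for_each(ranges, [&rd] (const dht::partition_range& pr) {
                return rd.fast_forward_to(pr).then([&rd] {
                    return consume_all(rd); // hypothetical: drain this range
                });
            });
        });
    }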
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures that when we start executing a query, a minimum result
size is reserved from the result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
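A sketch with illustrative names: a per-shard semaphore backs the limiter,
each query reserves a minimum before building its result, and range
queries merge per-shard usage afterwards.

    #include "core/semaphore.hh"

    class result_memory_limiter {
        semaphore _memory;
    public:
        explicit result_memory_limiter(size_t max_result_memory)
                : _memory(max_result_memory) {}
        future<> reserve_minimum(size_t minimum_result_size) {
            return _memory.wait(minimum_result_size);
        }
        void release(size_t bytes) { _memory.signal(bytes); }
    };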
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because, with virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100% of the region. So the total amount of memory used will
certainly be bigger than the system pressure threshold, which is now 50%
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush