scylladb

Author	SHA1	Message	Date
Pekka Enberg	930fa79aff	database: Add get_available_index_name() to database class	2017-05-04 14:59:11 +03:00
Pekka Enberg	c6e7d4484a	database: Make existing_index_names() per-keyspace operation	2017-05-04 14:59:11 +03:00
Pekka Enberg	8c729f0f5f	database: Rewrite existing_index_names() to use new index metadata	2017-05-04 14:59:11 +03:00
Avi Kivity	27c42359bc	Merge seastar upstream * seastar 6b21197...2ebe842 (6): > Merge "Various improvements to execution stages" from Paweł > app-template: allow apps to specify a name for help message > bool_class: avoid initializing object of incomplete type > app-template: make sure we can still get help with required options > prometheus: Http handler that returns prometheus 0.4 protobuf or text format > Update DPDK to 17.02 Includes patch from Pawel to adjust to updated execution_stage interface.	2017-03-26 10:50:21 +03:00
Raphael S. Carvalho	7deeffc953	database: serialize sstable cleanup We're cleaning up sstables in parallel. That means cleanup may need almost twice the disk space used by all sstables being cleaned up, if almost all sstables need cleanup and every one will discard an insignificant portion of its whole data. Given that cleanup is frequently issued when node is running out of disk space, we should serialize cleanups in every shard to decrease the disk space requirement. Fixes #192. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>	2017-03-19 12:33:03 +02:00
Duarte Nunes	876a514743	database: Upgrade mutation to current schema to push view updates This patch ensures we upgrade the mutation to the current schema when generating and pushing view updates, so that it matches the most up-to-date views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 18:15:27 +01:00
Duarte Nunes	bfb8a3c172	materialized views: Replace db::view::view class The write path uses a base schema at a particular version, and we want it to use the materialized views at the corresponding version. To achieve this, we need to map the state currently in db::view::view to a particular schema version, which this patch does by introducing the view_info class to hold the state previously in db::view::view, and by having a view schema directly point to it. The changes in the patch are thus: 1) Introduce view_info to hold the extra view state; 2) Point to the view_info from the schema; 3) Make the functions in the now stateless db::view::view non-member; 4) Remove the db::view::view class. All changes are structural and don't affect current behavior. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 15:50:05 +01:00
Amnon Heiman	0a2eba1b94	database: requests_blocked_memory metric should be unique Metric names should be unique across metric types. requests_blocked_memory was registered twice, once as a gauge and once as a derive metric. This is not allowed. Fixes #2165 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170314162826.25521-1-amnon@scylladb.com>	2017-03-14 19:36:45 +02:00
Paweł Dziepak	b5f0e590be	db: make database::query() an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	38c1501f4d	db: make apply an execution stage	2017-03-09 09:27:43 +00:00
Avi Kivity	439b38f5ab	Merge "Improvements to counter implementation" from Paweł "This series adds various optimisations to the counter implementation (nothing extreme, mostly just avoiding unnecessary operations) as well as some missing features such as tracing and dropping timed-out queries. Performance was tested using: perf-simple-query -c4 --counters --duration 60. The following results are medians: write 18640.41 before, 33156.81 after (+77.9%); read 58002.32 before, 62733.93 after (+8.2%)." * tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits) cell_locker: add metrics for lock acquisition storage_proxy: count counter updates for which the node was a leader storage_proxy: use counter-specific timeout for writes storage_proxy: transform counter timeouts to mutation_write_timeout_exception db: avoid allocations in do_apply_counter_update() tests/counters: add test for apply reversability counters: attempt to apply in place atomic_cell: add COUNTER_IN_PLACE_REVERT flag counters: add equality operators counters: implement decrement operators for shard_iterator counters: allow using both views and mutable_views atomic_cell: introduce atomic_cell_mutable_view managed_bytes: add cast to mutable_view bytes: add bytes_mutable_view utils: introduce mutable_view db: add more tracing events for counter writes db: propagate tracing state for counter writes tests/cell_locker: add test for timing out lock acquisition counter_cell_locker: allow setting timeouts db: propagate timeout for counter writes ...	2017-03-07 11:48:13 +02:00
Avi Kivity	1af9e3a5cb	Merge "database: fix the 'nodetool clearsnapshot'" from Vlad "Work on this series started with fixing the 'nodetool clearsnapshot'. The current master code ignores the snapshots in deleted keyspaces (issue #2045). I noticed that in many places where our code has to build the path to some directory/file, it simply uses sstring(<path1>) + "/" + sstring(<path2>) constructs, which may cause us issues if somebody decides to compile/run scylla on a non-Unix OS, like Microsoft Windows. I understand that this is a long shot, but if we can make it right now, why not. The answer is the boost::filesystem::path class (its synchronous parts, of course). I decided to take the initiative, fix the issues above and then use the fixed code to fix issue #2045: - Fix some minor issues in the existing code. - Extend the lister class and move it into separate files outside database.cc. On the way I've found an issue in the existing code (issue #2071). This series fixes this one too (PATCH2)."	2017-03-06 16:45:31 +02:00
Paweł Dziepak	04b80272f2	cell_locker: add metrics for lock acquisition	2017-03-02 09:05:12 +00:00
Paweł Dziepak	f93a766db4	db: avoid allocations in do_apply_counter_update()	2017-03-02 09:05:12 +00:00
Paweł Dziepak	774241648d	db: add more tracing events for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	277501f42f	db: propagate tracing state for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	25173f8095	db: propagate timeout for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	f25fa6566f	db: avoid deserialization when applying counter mutation In the later stages of the counter write path a mutation is produced that already has all cells transformed to counter shards and can be applied to the memtable and written to the commitlog. The current interface expects a frozen mutation, which is suboptimal for counters. The freeze itself is unavoidable (it is required by the commitlog), but we can avoid the later deserialization of the frozen_mutation when it is applied to the memtable if we pass the unfrozen mutation along.	2017-03-01 16:33:37 +00:00
Paweł Dziepak	582d397c41	introduce counter_write_query() The counter write path involves a read-modify-write. That read is guaranteed to query only a single partition, does not care about dead cells and expects to receive an unserialized mutation as a result. Standard mutation queries are able to produce results fit for counter updates, but the logic involved is much more general (and therefore slower), hence the addition of a new, counter-specific kind of query.	2017-03-01 16:33:36 +00:00
Paweł Dziepak	426345e1d4	storage_proxy: avoid excessive mutation freezes	2017-03-01 16:33:36 +00:00
Duarte Nunes	c0e5964462	database: Explicitly use discard_result() Values returned from the lambda passed to finally() are immediately destroyed, so make that explicit by using discard_result(). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235541.28330-1-duarte@scylladb.com>	2017-02-28 18:41:19 +02:00
Paweł Dziepak	0198d8e470	Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz "This introduces an API which allows forward navigation in a stream of mutation fragments. It allows one to consume only a subset of the stream by iteratively specifying sub-ranges from which fragments should be returned. API outline: When in forwarding mode, the stream does not return all fragments right away, but only those belonging to the current range. Initially, the current range only covers the static row. The stream can be forwarded, even before reaching end-of-stream for the current range, to a later range with fast_forward_to(). Forwarding doesn't change the initial restrictions of the stream; it can only be used to skip over data. Monotonicity of positions is preserved by forwarding, that is, fragments emitted after forwarding will have greater positions than any fragments emitted before forwarding. For any range, all range tombstones relevant for that range which are present in the original stream will be emitted. Range tombstones emitted before forwarding which overlap with the new range are not necessarily re-emitted. When not in forwarding mode, the stream acts as if the current range was equal to the full range. This implies that fast_forward_to() cannot be used. Whether the stream is in forwarding mode or not is specified when the stream is created, typically via the mutation_source interface. What's left for later series: Optimization by providing specialized implementations. This series implements forwarding support in all mutation sources via a generic wrapper which simply drops fragments." * tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev: tests: mutation_source_tests: Verify monotonicty of positions tests: random_mutation_generator: Spread the keys more tests: mutation_source_test: Make blobs more easily distinguishable tests: streamed_mutation: Test that merged stream passes mutation source tests tests: mutation_source_test: Add tests for forwarding of streamed_mutation tests: streamed_mutation_assertions: Add methods for navigating the stream tests: Add range generators to random_mutation_generator partition_slice_builder: Add with_ranges() query: Introduce full_clustering_range streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation() db: Enable creating forwardable readers via mutation_source mutation_source: Document liveness requirements mutation_source: Cleanup db: Replace virtual_reader_type with mutation_source_opt partition_version: Refactor make_partition_snapshot_reader() overloads database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr memtable: Accept all mutation_source parameters streamed_mutation: Implement fast_forward_to() in stream merger streamed_mutation: Add generic implementation of forwardable streamed_mutation streamed_mutation: Add fast_forward_to() API position_in_partition: Introduce position_range position_in_partition: Introduce position constructor for right after the static row streamed_mutation: Make cast to view non-explicit streamed_mutation: Make schema() getter non-copying	2017-02-24 10:37:51 +00:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	586dbaa8d3	db: Replace virtual_reader_type with mutation_source_opt Virtual reader is a mutation_source.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	f46ae8128d	database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr It was using the state passed via as_mutation_source() instead. Let's respect the mutation_source contract and use the state passed via the mutation_source invocation. Technically just a cleanup, and also a prerequisite for more cleanup.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	2cc27f72ca	memtable: Accept all mutation_source parameters	2017-02-23 18:23:52 +01:00
Calle Wilund	e20b804a65	commitlog/database: Add "release" method to ensure we free segments On database stop, we do flush memtables and clean up commit log segment usage. However, since we never actually destroy the distributed<database>, we don't free the commitlog either, and so never clear out the remaining (clean) segments, leaving perfectly clean segments on disk. This just adds a "release" method to commitlog, and calls it from database::stop, after flushing the column families. Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>	2017-02-21 18:17:47 +01:00
Paweł Dziepak	359c617821	db: restore call to check_valid_rp() `5a0955e89d` "db: add operations for applying counter updates" merged two column_family::apply() overloads into do_apply() in order to reduce code duplication. Unfortunately, a call to check_valid_rp() didn't survive that change. Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>	2017-02-21 15:26:04 +01:00
Vlad Zolotarov	978241d473	database: move lister class into separate files Move lister class away from database.cc. This is a preparation for moving it to the seastar library. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	34cafa71c3	database: make 'clearsnapshot' to delete the snapshots of deleted keyspaces if requested The current implementation of the 'nodetool clearsnapshot' command only deletes the snapshots of the keyspaces that are alive at the time the command is issued (issue #2045). This, besides not implementing the spec, prevents users from clearing the disk space occupied by no-longer-needed snapshots of deleted keyspaces (e.g. snapshots created when a KS is deleted). This patch fixes this issue by making database::clear_snapshot() scan the data directories for the snapshots to be deleted instead of relying on in-memory data structures. As a result, column_family::clear_snapshot() is no longer needed. Fixes #2045 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	e1ee669aff	database: lister: add the rmdir() static method Removes the directory with all its contents (like 'rm -rf <dir name>' shell command). Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	53532ba5ff	database: lister: pass the parent path object to callbacks Pass a parent directory boost::filesystem::path object to the walker and filter callbacks. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:37 -05:00
Vlad Zolotarov	b4c970dfc6	database: lister: make the "filter" callback receive directory_entry instead of sstring Filter should get all information that the caller has in hand that may be used for filtering. directory_entry has the following information: - Type of the entry - Its name For the code that used lister filters so far this would be enough, however it's not hard to imagine a filter that may need the parent directory as well. We will add the parent directory path in the follow up patches to make the interface complete. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:59 -05:00
Vlad Zolotarov	6f9f0e1b3f	database: lister: add "show_hidden" parameter If show_hidden parameter is set to show_hidden::yes - list hidden entries, otherwise skip them. By default set to show_hidden::no. This patch also completely removes default parameters in lister::scan_dir() and replaces them with a few lister::scan_dir() overloads that ensure that lambdas are always going to be the last parameter in the parameters list. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:58 -05:00
Vlad Zolotarov	9aedb191f6	database: lister: if entries' types set is empty - list everything Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:58 -05:00
Vlad Zolotarov	cb614f9be4	database: lister::guarantee_type: handle the case when entry type may not be read The type of the given entry may not be available, which manifests as ENOENT or ENOTDIR being set in errno by the fstat() call for this entry. In this case engine().file_type() will return a disengaged optional<directory_entry_type> value. Return a future holding a std::runtime_error exception in this case. This prevents any further use of the disengaged optional value by the code in the normal flow. The exception is propagated to the caller, and it's the caller's responsibility to handle it. Fixes #2071 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:55 -05:00
Vlad Zolotarov	25502149cf	database: lister::scan_dir(): std::move() all that needs to be moved Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-16 11:56:44 -05:00
Avi Kivity	9530bac2d6	Merge "Adding metrics using histogram and labels" from Amnon "This series uses the newly added histogram and label support to add metrics to the storage_proxy and to the column_family. This adds latency histograms as well as the metrics that were missing from the column family." * 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev: database: add metrics registration for the coloumn family storage_proxy: add read and write latency histogram estimated_histogram: returns a metrics histogram	2017-02-09 11:40:57 +02:00
Amnon Heiman	292c08f598	database: add metrics registration for the coloumn family This patch adds metrics registration to the column_family. Using labels, each column family's metrics are labeled with its keyspace and column family name. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-02-06 18:27:01 +02:00
Duarte Nunes	0eca6301d3	database: Apply mutation to views This patch changes the database apply path so that it also generates the mutations for the column family's views and sends them to the paired view replicas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:37:33 +01:00
Duarte Nunes	4777172348	column_family: Push view replica update This patch adds a function to push updates to the view replicas of a particular base table.	2017-02-06 13:36:45 +01:00
Nadav Har'El	92fc7386f6	materialized views: add VIEW write type This adds the "VIEW" write type to the "write_type" enum. To be honest, I don't understand why the "write_type" distinction is important. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	11bd3bd29f	database: Ensure new write_type is correctly printed By removing the default case in the switch statement over a write_type variable, we ensure the compiler warns us about lack of exhaustiveness in case we add a value to the enum but forget to change the corresponding operator<<(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	16206e9f15	column_family: Generate view updates This patch adds the generate_view_updates() function to the column_family class, which will use the view_update_builder to generate updates to the column_family's materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	90cb35db04	column_family: Adds affected_views() function This patch adds the affected_views() function, which determines which of the column family's views a given update affects. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	082ef56df1	view: Store pk view column that's non-pk in the base To help calculate the view mutations from a base update, we store in the view class the column that's part of the view's primary key but not part of the base's, if such column exists. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	c35d14e285	column_family: Store a pointer to view Instead of storing the view in the column_family's map of materialized views, store a lw_shared_ptr so that the view can be removed while it is being updated. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Avi Kivity	7a00dd6985	Merge "Avoid avalanche of tasks after memtable flush" from Tomasz "Before, the logic for releasing writes blocked on dirty worked like this: 1) When the region group size changes, the group is not under pressure, and some requests are blocked, schedule a request releasing task; 2) the request releasing task, if there is no pressure, runs one request and, if there are still blocked requests, schedules the next request releasing task. If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The number of scheduled tasks is at most 1; there is a single releasing thread. However, if requests themselves change the size of the group, then each such change schedules yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when it contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental to performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed-out requests stay in the queue. This is less likely on 1.6, where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the number of scheduled tasks reaches 1000, polling stops and the server becomes unresponsive until all of the released requests are done, which happens either when they start to block on dirty memory again or when they run out of blocked requests. It may take a while to reach the pressure condition after a memtable flush if the flush brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021, where the number of request releasing threads grew into the thousands. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group, which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce()	2017-02-02 17:49:31 +02:00
Paweł Dziepak	5a0955e89d	db: add operations for applying counter updates	2017-02-02 10:35:14 +00:00
Tomasz Grabiec	ed9ff19467	lsa: Document and annotate reclaimer notification callbacks They are called from region_group::update(), so they must be alloc-free and noexcept.	2017-01-30 19:18:07 +01:00


757 Commits
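The disk-space argument in 7deeffc953 ("database: serialize sstable cleanup") can be checked with a little arithmetic. The sketch below is a simplified model, not ScyllaDB code: it assumes each cleanup writes a new sstable holding kept_bytes of the original size_bytes and deletes the input only after the output is complete. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: cleanup of one sstable writes a new sstable holding
// kept_bytes of its size_bytes; the input is deleted only once the output
// is complete.
struct cleanup_job {
    std::uint64_t size_bytes;
    std::uint64_t kept_bytes;
};

std::uint64_t total_size(const std::vector<cleanup_job>& jobs) {
    std::uint64_t total = 0;
    for (const auto& j : jobs) {
        assert(j.kept_bytes <= j.size_bytes);  // cleanup only discards data
        total += j.size_bytes;
    }
    return total;
}

// All cleanups in flight at once: every output coexists with every input.
std::uint64_t peak_disk_parallel(const std::vector<cleanup_job>& jobs) {
    std::uint64_t peak = total_size(jobs);
    for (const auto& j : jobs) {
        peak += j.kept_bytes;
    }
    return peak;
}

// One cleanup at a time: each input is replaced by its (smaller) output
// before the next cleanup starts.
std::uint64_t peak_disk_serial(const std::vector<cleanup_job>& jobs) {
    std::uint64_t live = total_size(jobs);
    std::uint64_t peak = live;
    for (const auto& j : jobs) {
        peak = std::max(peak, live + j.kept_bytes);  // input and output both on disk
        live = live - j.size_bytes + j.kept_bytes;   // input deleted afterwards
    }
    return peak;
}
```

With four 100-unit sstables that each keep 95 units, the parallel model peaks at 780 units (almost twice the 400 units of live data), while the serial model peaks at 495 units, which is why serializing cleanups per shard lowers the space requirement.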
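The forwarding contract described in the merge of 0198d8e470 (fragments are returned only for the current range, forwarding may happen before end-of-stream, and only to a later range) can be illustrated with a toy stream. This is a sketch under simplifying assumptions: fragments are plain ints standing in for positions, ranges are half-open [start, end), and range tombstones and the static row are ignored. Like the generic wrapper mentioned in that message, it simply drops fragments outside the current range. The types are hypothetical, not the streamed_mutation API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical half-open range of positions [start, end).
struct position_range {
    int start;
    int end;
};

// Toy forwardable stream over positions sorted in ascending order.
class forwardable_stream {
public:
    forwardable_stream(std::vector<int> fragments, position_range initial)
        : _fragments(std::move(fragments)), _range(initial) {}

    // Next fragment inside the current range, or std::nullopt at
    // end-of-stream for that range.
    std::optional<int> next() {
        while (_pos < _fragments.size() && _fragments[_pos] < _range.start) {
            ++_pos;  // skipped over by forwarding: never emitted
        }
        if (_pos < _fragments.size() && _fragments[_pos] < _range.end) {
            return _fragments[_pos++];
        }
        return std::nullopt;
    }

    // Only a later range is accepted, so every position emitted after
    // forwarding is greater than any position emitted before it.
    void fast_forward_to(position_range r) {
        if (r.start > r.end || r.start < _range.end) {
            throw std::invalid_argument("can only forward to a later range");
        }
        _range = r;
    }

private:
    std::vector<int> _fragments;
    position_range _range;
    std::size_t _pos = 0;
};
```

For example, a stream over {1, 3, 5, 7} starting with range [0, 4) yields 1; forwarding to [4, 8) at that point skips 3 and yields 5 and 7, while forwarding back to [0, 2) is rejected.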
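The clearsnapshot series (1af9e3a5cb) replaces hand-built sstring(<path1>) + "/" + sstring(<path2>) strings with boost::filesystem::path. The sketch below swaps in C++17 std::filesystem::path, whose operator/ behaves the same way for this purpose: it inserts a separator only where one is needed. The snapshot_dir() helper and the <data_dir>/<keyspace>/<table>/snapshots/<tag> layout are simplified assumptions for illustration, not ScyllaDB's actual code or directory naming.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds <data_dir>/<keyspace>/<table>/snapshots/<tag>.
// operator/ adds a separator only when the left side does not already end
// with one, so a trailing '/' on data_dir does not produce "//".
fs::path snapshot_dir(const fs::path& data_dir, const std::string& keyspace,
                      const std::string& table, const std::string& tag) {
    return data_dir / keyspace / table / "snapshots" / tag;
}
```

On a POSIX system, snapshot_dir("/data/", "ks", "cf", "t1") and snapshot_dir("/data", "ks", "cf", "t1") both produce /data/ks/cf/snapshots/t1, which naive string concatenation would not guarantee.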
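Commit 11bd3bd29f relies on a compiler check: a switch over an enum with no default label lets the compiler warn when an enumerator has no case. In GCC this is -Wswitch, which -Wall enables; Clang enables it by default. The enumerators below are an illustrative assumption modeled on Cassandra-style write types, not copied from the ScyllaDB source.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Illustrative enumerators (assumed, modeled on Cassandra-style write types).
enum class write_type { simple, batch, unlogged_batch, counter, batch_log, cas, view };

std::ostream& operator<<(std::ostream& os, write_type t) {
    // Deliberately no default label: if a new enumerator is added without a
    // case here, -Wswitch reports it at compile time.
    switch (t) {
    case write_type::simple:         return os << "SIMPLE";
    case write_type::batch:          return os << "BATCH";
    case write_type::unlogged_batch: return os << "UNLOGGED_BATCH";
    case write_type::counter:        return os << "COUNTER";
    case write_type::batch_log:      return os << "BATCH_LOG";
    case write_type::cas:            return os << "CAS";
    case write_type::view:           return os << "VIEW";
    }
    // Reached only for a value outside the enumerators (e.g. after a bad cast).
    return os << "UNKNOWN";
}

std::string write_type_name(write_type t) {
    std::ostringstream os;
    os << t;
    return os.str();
}
```

The fallback return after the switch keeps the function well-defined for out-of-range values without reintroducing a default label, so the exhaustiveness warning stays intact.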
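The fix in 7a00dd6985 ("Avoid avalanche of tasks after memtable flush") boils down to one rule: a size change never starts a second releaser while one is already running. The sketch below models that rule synchronously, with a flag and a loop standing in for a Seastar fiber; it is a simplified illustration, and the class and member names are hypothetical rather than the actual lsa region_group interface.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical, single-threaded model of a region group that queues
// requests while under memory pressure and releases them afterwards.
class region_group_model {
public:
    explicit region_group_model(std::size_t limit) : _limit(limit) {}

    bool under_pressure() const { return _size >= _limit; }

    // Called on every size change (allocation, reclaim, ...).
    void update(std::ptrdiff_t delta) {
        _size = static_cast<std::size_t>(static_cast<std::ptrdiff_t>(_size) + delta);
        maybe_wake_releaser();
    }

    void run_when_memory_available(std::function<void()> req) {
        _blocked.push_back(std::move(req));
        maybe_wake_releaser();
    }

    std::size_t releasers_started() const { return _releasers_started; }
    std::size_t blocked() const { return _blocked.size(); }

private:
    void maybe_wake_releaser() {
        // The key rule: re-entrant size changes never start another releaser.
        if (_releaser_running || under_pressure() || _blocked.empty()) {
            return;
        }
        _releaser_running = true;
        ++_releasers_started;
        // The single releaser runs blocked requests until pressure returns.
        while (!_blocked.empty() && !under_pressure()) {
            auto req = std::move(_blocked.front());
            _blocked.pop_front();
            req();  // may call update(), which returns early above
        }
        _releaser_running = false;
    }

    std::size_t _limit;
    std::size_t _size = 0;
    bool _releaser_running = false;
    std::size_t _releasers_started = 0;
    std::deque<std::function<void()>> _blocked;
};
```

If 100 requests are queued under pressure and each one grows the group when it runs, lifting the pressure starts exactly one releaser that drains all of them; under the old scheme described in the message, each of those size changes would have scheduled another releasing task.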