scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	a6f8f4fe24	compaction: do not write expired cell as dead cell if it can be purged right away When compacting a fully expired sstable, we're not allowing that sstable to be purged because the expired cell is unconditionally converted into a dead cell. Why not check if the expired cell can be purged instead, using gc before and max purgeable timestamp? Currently, we need two compactions to get rid of a fully expired sstable whose cells could always have been purged. Look at this sstable with an expired cell: { "partition" : { "key" : [ "2" ], "position" : 0 }, "rows" : [ { "type" : "row", "position" : 120, "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z", "ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true }, "cells" : [ { "name" : "country", "value" : "1" }, ] Now this sstable's data after the first compaction: [shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79 (~65% of original) in 229ms = 0.000328997MB/s. { ... "rows" : [ { "type" : "row", "position" : 79, "cells" : [ { "name" : "country", "deletion_info" : { "local_delete_time" : "2017-04-09T17:07:12Z" }, "tstamp" : "2017-04-09T17:07:12.702597Z" }, ] Now another compaction will actually get rid of the data: compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original) in 1ms = 0MB/s. ~2 total partitions merged to 0 NOTE: It's a waste of time to wait for a second compaction because the expired cell could have been purged at the first compaction, since it satisfied gc_before and max purgeable timestamp. Fixes #2249, #2253 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>	2017-04-13 10:59:19 +03:00
Avi Kivity	27c42359bc	Merge seastar upstream * seastar 6b21197...2ebe842 (6): > Merge "Various improvements to execution stages" from Paweł > app-template: allow apps to specify a name for help message > bool_class: avoid initializing object of incomplete type > app-template: make sure we can still get help with required options > prometheus: Http handler that returns prometheus 0.4 protobuf or text format > Update DPDK to 17.02 Includes patch from Pawel to adjust to updated execution_stage interface.	2017-03-26 10:50:21 +03:00
Paweł Dziepak	a78501c206	mutation_query: add an execution stage	2017-03-09 09:27:43 +00:00
Avi Kivity	439b38f5ab	Merge "Improvements to counter implementation" from Paweł "This series adds various optimisations to counter implementation (nothing extreme, mostly just avoiding unnecessary operations) as well as some missing features such as tracing and dropping timed out queries. Performance was tested using: perf-simple-query -c4 --counters --duration 60 The following results are medians. before after diff write 18640.41 33156.81 +77.9% read 58002.32 62733.93 +8.2%" * tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits) cell_locker: add metrics for lock acquisition storage_proxy: count counter updates for which the node was a leader storage_proxy: use counter-specific timeout for writes storage_proxy: transform counter timeouts to mutation_write_timeout_exception db: avoid allocations in do_apply_counter_update() tests/counters: add test for apply reversability counters: attempt to apply in place atomic_cell: add COUNTER_IN_PLACE_REVERT flag counters: add equality operators counters: implement decrement operators for shard_iterator counters: allow using both views and mutable_views atomic_cell: introduce atomic_cell_mutable_view managed_bytes: add cast to mutable_view bytes: add bytes_mutable_view utils: introduce mutable_view db: add more tracing events for counter writes db: propagate tracing state for counter writes tests/cell_locker: add test for timing out lock acquisition counter_cell_locker: allow setting timeouts db: propagate timeout for counter writes ...	2017-03-07 11:48:13 +02:00
Tomasz Grabiec	4b6e77e97e	db: Fix overflow of gc_clock time point If query_time is time_point::min(), which is used by to_data_query_result(), the result of subtraction of gc_grace_seconds() from query_time will overflow. I don't think this bug would currently have user-perceivable effects. This affects which tombstones are dropped, but in case of to_data_query_result() uses, tombstones are not present in the final data query result, and mutation_partition::do_compact() takes tombstones into consideration while compacting before expiring them. Fixes the following UBSAN report: /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int' Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>	2017-03-01 18:49:56 +02:00
Paweł Dziepak	582d397c41	introduce counter_write_query() Counter write path involves read-modify-write. That read is guaranteed to query only a single partition, does not care about dead cells and expects to receive an unserialized mutation as a result. Standard mutation queries are able to produce results fit for counter updates, but the logic involved is much more general (i.e. slower), hence the addition of a new, counter-specific kind of query.	2017-03-01 16:33:36 +00:00
Tomasz Grabiec	f46ae8128d	database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr It was using the state passed via as_mutation_source() instead. Let's respect the mutation_source contract instead, and use the state passed via mutation_source invocation. Technically just a cleanup. Also a prerequisite for more cleanup.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	2489a0f82e	mutation_partition: Drop unneeded range tombstones Fixes #1254.	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	884858078a	mutation_partition: Simplify row removal	2017-02-13 16:12:15 +01:00
Duarte Nunes	7e150a18eb	mutation_partition: Introduce shadowable tombstone This patch introduces shadowable row tombstones. A shadowable row tombstone is valid only if the row has no live marker. In other words, the row tombstone is only valid as long as no newer insert is done (thus setting a live row marker; note that if the row timestamp set is lower than the tombstone's, then the tombstone remains in effect as usual). If a row has a shadowable tombstone with timestamp Ti and that row is updated with a timestamp Tj, such that Tj > Ti (and that update sets the row marker), then the shadowable tombstone is shadowed by that update. A concrete consequence is that if the update has cells with timestamp lower than Ti, then those cells are preserved (since the deletion is removed), and this is contrary to a regular, non-shadowable row tombstone where the tombstone is preserved and such cells are removed. Currently, only Materialized Views require shadowable row tombstones, which solve a problem with view row deletions. Consider a base row with columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations: 1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0; 2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0; 3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0; Without shadowable tombstones, the view contains: At 1), pk = (0, 0), row_marker@T0, v2=1@T0 At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0 pk = (0, 1), row_marker@T1, v2=1@T0 At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0 pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0 Notice how, if we read row (0, 0), the value of v2 will be shadowed by the row tombstone we previously inserted. 
With a view's row tombstone becoming shadowable, at 3) the row (0, 0) will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0, which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0. Since the shadowable tombstone is shadowed by the new row marker (T1 < T2), now v2 would be taken into account. Finally, note that this patch doesn't generalize the idea of shadowable tombstone, instead taking advantage of the fact that they are only needed by Materialized Views. This saves changing the tombstone representation to account for an extra flag, the bits such representation would require, and also avoids changes to the storage format. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Paweł Dziepak	b6564651e4	mutation_partition: make for_each_cell() accessible outside source file for_each_cell() const already can be used from any place in the code, allow the same with non-const version.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	0c93d01232	atomic_cell: make sure upper level tombstones cover counters Support for deletion of counters is limited in a way that once deleted they cannot be used again (i.e. tombstone always wins, regardless of the timestamp). Logic responsible for merging two counter cells already makes sure that tombstones are handled properly, but it is also necessary to ensure that higher level tombstones always cover counters.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	47d14906e6	mutation_partition: support querying counter cells	2017-02-02 10:35:14 +00:00
Paweł Dziepak	63f25eb12c	mutation_hasher: handle counter cells properly	2017-02-02 10:35:14 +00:00
Paweł Dziepak	a57e86cc37	mutation_partition: compute counter difference	2017-02-02 10:35:13 +00:00
Paweł Dziepak	2725a4945d	mutation_partition: apply counter cells properly	2017-02-02 10:35:13 +00:00
Piotr Jastrzebski	041b0a65ac	Implement intrusive set using rbtree_algorithms This new implementation takes less memory because it does not store comparator. It also uses tree nodes optimized for size. This means that instead of storing an enum field |color| they embed this information inside pointer to parent. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:46:58 +01:00
Piotr Jastrzebski	a0c20f5c49	mutation_partition: make apply_reversibly_intrusive_set nongeneric apply_reversibly_intrusive_set is used only in one place and always with rows_type. There's no need for it to be generic. This will allow changing intrusive set implementation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	4bbe05dd47	mutation_partition: take schema in find_row and clustered_row This will allow intrusive set implementation that does not store schema. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	fe3c91db90	mutation_partition: Extract intrusive set logic to a class. It will make it easier to change the implementation of the intrusive set. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	da67ac7ae4	mutation_partition: Replace value_comp with key_comp calls This will reduce the size of bi::set API being used. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Paweł Dziepak	40176ca2f8	mutation_partition: use result limiter for digest reads Even if we are performing a digest query we should do proper result memory accounting so that the result ends exactly in the same place that it would if it was a data query. This is to avoid digest mismatches between replicas.	2016-12-22 17:16:23 +01:00
Paweł Dziepak	38ee69dee0	idl: allow writers to use any output stream Original IDL generated code was hardcoded to always use bytes_ostream. This patch makes the output stream a template parameter so that any valid output stream can be used. Unfortunately, making IDL writers generic requires updates in the code that uses them, this is fixed in C++17 which would be able to deduce the parameter in most cases.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	1c7cade559	mutation_partition: honour allowed_short_read for static rows	2016-12-22 13:35:04 +01:00
Piotr Jastrzebski	3e502de153	mutation_partition: don't use unique_ptr to manage LSA objects Unique_ptr won't destruct them correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>	2016-12-21 09:40:15 +01:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Duarte Nunes	781cd82cb8	column_family: Use counters in query::result::builder This patch changes column_family::query() to use the counters in the builder to determine how many partitions and rows to ask for and also to implement the stop condition. This saves a continuation to do the bookkeeping, and allows us to remove data_query_result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	05b2ef4fa2	query_result_builder: Use the underlying counters This patch changes the query_result_builder to use the counters provided by the query::result::builder. It also ensures they are kept current. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f5cf7f7921	mutation_partition: Count partitions in query_compacted This patch changes mutation_partition::query_compacted() to count the number of partitions written to the underlying writer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f21dfb8217	mutation_partition: Remove tabs in query_compacted Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	f1b9f49f2b	mutation_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	34f9eb4cbd	mutation_compactor: honour stop_iteration from consumers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	43fe3439ca	reconcilable_result: properly propagate short_read flag reconcilable_result can be merged with another or transformed into query::result. Make sure that short_read information is never lost. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Duarte Nunes	bdba8d99c3	range: Find a sequence's lower and upper bounds This patch extracts a pair of functions from mutation_partition to calculate the lower and upper bounds of a sequence from a nonwrapping_range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	e981101fa9	Merge "Remove clustering_key_filtering_context" from Piotr "clustering_key_filtering_context is no longer needed. partition_slice can be used instead so this series removes clustering_key_filtering_context and passes partition_slice down where it's needed. Then a static get_ranges method is used to obtain clustering key ranges for a given partition. Fixes #1614."	2016-08-30 22:30:15 +01:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Piotr Jastrzebski	b05b90b3a5	Introduce clustering_key_filter_ranges. This fixes the problem of multiple concurrent get_ranges calls. Previously each call was invalidating the result of the previous call. Now they don't step on each other's toes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 19:46:38 +02:00
Paweł Dziepak	6012a7e733	mutation_partition: fix iterator invalidation in trim_rows Reversed iterators are adaptors for 'normal' iterators. These underlying iterators point to different objects than the reversed iterators themselves. The consequence of this is that removing an element pointed to by a reversed iterator may invalidate a reversed iterator which points to a completely different object. This is what happens in trim_rows for reversed queries. Erasing a row can invalidate the end iterator and the loop would fail to stop. The solution is to introduce the reversal_traits::erase_dispose_and_update_end() function, which erases and disposes the object pointed to by a given iterator but also takes a reference to an end iterator and updates it if necessary to make sure that it stays valid. Fixes #1609. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>	2016-08-25 16:52:35 +03:00
Duarte Nunes	5161ea283f	query: query::clustering_range can't wrap around This patch changes the type of query::clustering_range to express that ranges that wrap around are not allowed, and ranges that have the start bound after the end bound are considered empty. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:50:20 +00:00
Duarte Nunes	ec490ffaba	query_result_builder: Don't count dead partitions With this patch we stop counting dead partitions (i.e., partitions containing only tombstones) towards the partition limit, which should apply only to partitions with live rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 21:17:06 +00:00
Duarte Nunes	21d0a2c764	query: Optionally send cell ttl This patch adds support to send a cell's ttl as part of a query's result. This is needed for thrift support. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Paweł Dziepak	93cc4454a6	streamed_mutation: emit range_tombstones directly Originally, streamed_mutations guaranteed that emitted tombstones are disjoint. In order to achieve that two separate objects were produced for each range tombstone: range_tombstone_begin and range_tombstone_end. Unfortunately, this forced sstable writer to accumulate all clustering rows between range_tombstone_begin and range_tombstone_end. However, since there is no need to write disjoint tombstones to sstables (see #1153 "Write range tombstones to sstables like Cassandra does") it is also not necessary for streamed_mutations to produce disjoint range tombstones. This patch changes that by making streamed_mutation produce range_tombstone objects directly. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:18 +01:00
Tomasz Grabiec	8c4b5e4283	db: Avoiding checking bloom filters during compaction Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322.	2016-07-10 09:54:20 +02:00
Paweł Dziepak	23d0bfd065	mutation_partition: add row::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	7a95847014	mutation_compactor: prepare for sstable compaction compact_mutation code is going to be shared among queries and sstable compaction. There are some differences though. Queries don't provide _max_purgeable and sstable compaction doesn't need any limits. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	4133cc7a53	mutation_reader: make consume_flattened() produce decorated keys Since decorated keys are already computed it is better to pass more information than less. Consumers interested just in partition key can just drop token and the ones requiring full decorated key don't need to recompute it. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:00 +01:00
Paweł Dziepak	3e86f9ab73	mutation_partition: extract compact_for_query to a separate header The compacting logic inside compact_for_query is going to be shared with sstable compaction. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
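
Note on 4b6e77e97e (overflow of gc_clock time point): the log does not show how the fix was made, so the following is only a minimal sketch of one way to avoid that kind of overflow. It clamps the subtraction at time_point::min(). The names toy_gc_clock and saturating_gc_before are hypothetical and do not come from the ScyllaDB sources; the 32-bit representation is assumed from the UBSAN report quoted in the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a GC clock whose time points count
// seconds in a signed 32-bit integer (assumed from the UBSAN report).
struct toy_gc_clock {
    using rep = std::int32_t;
    using duration = std::chrono::duration<rep>;
    using time_point = std::chrono::time_point<toy_gc_clock, duration>;
};

// Computes query_time - grace, but clamps the result to time_point::min()
// instead of letting the 32-bit subtraction overflow.
inline toy_gc_clock::time_point
saturating_gc_before(toy_gc_clock::time_point query_time,
                     toy_gc_clock::duration grace) {
    assert(grace.count() >= 0);  // a grace period is never negative here
    // Do the arithmetic in 64 bits, where it cannot overflow.
    const std::int64_t t = query_time.time_since_epoch().count();
    const std::int64_t g = grace.count();
    const std::int64_t lowest = std::numeric_limits<toy_gc_clock::rep>::min();
    if (t - g < lowest) {
        return toy_gc_clock::time_point::min();
    }
    // Safe narrowing: lowest <= t - g <= t, and t fits in 32 bits.
    return toy_gc_clock::time_point(
        toy_gc_clock::duration(static_cast<toy_gc_clock::rep>(t - g)));
}
```

With query_time equal to time_point::min() and a grace period of 604800 seconds (the values in the UBSAN report), the helper returns time_point::min() rather than evaluating -2147483648 - 604800 in a 32-bit int.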


168 Commits
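
Note on 6012a7e733 (iterator invalidation in trim_rows): the hazard described in that commit message can be reproduced with a plain std::set. This is a hedged sketch, not the ScyllaDB code. The helper erase_and_update_end is hypothetical and only loosely modelled on the reversal_traits::erase_dispose_and_update_end() name from the message, and it uses std::set instead of the intrusive set that mutation_partition uses.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

using int_set = std::set<int>;
using rev_it = int_set::reverse_iterator;

// Erases the element that `it` refers to. A reverse iterator wraps a forward
// iterator to the *following* element, so `end` (e.g. rend(), whose base is
// begin()) may wrap the very node being erased. In that case `end` is
// rebuilt from the iterator returned by erase() so it stays valid.
// Returns a reverse iterator to the next element in reverse order.
inline rev_it erase_and_update_end(int_set& s, rev_it it, rev_it& end) {
    assert(it != end);
    auto victim = std::prev(it.base());      // forward iterator to *it
    const bool end_hit = (end.base() == victim);
    auto next = s.erase(victim);             // element after the erased one
    if (end_hit) {
        end = rev_it(next);
    }
    return rev_it(next);
}

// Walks the set in reverse order, keeps the first `keep` elements seen
// (the largest ones) and erases the rest.
inline void trim_reversed(int_set& s, std::size_t keep) {
    auto it = s.rbegin();
    auto end = s.rend();
    std::size_t seen = 0;
    while (it != end) {
        if (seen < keep) {
            ++seen;
            ++it;
        } else {
            it = erase_and_update_end(s, it, end);
        }
    }
}
```

Calling trim_reversed on {1, 2, 3, 4, 5} with keep = 2 leaves {4, 5}. The last erase removes the element that rend() wraps, which is the case where a stale end iterator would keep the loop from terminating.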