scylladb

Author	SHA1	Message	Date
Botond Dénes	17509e9664	mutation_compactor: remove only-live related logic We removed the template parameter in the previous patch, now we can remove the logic related to it.	2022-07-12 08:44:32 +03:00
Botond Dénes	4d2ce5c304	mutation_compactor: remove emit_only_live_rows template parameter Now that we use emit_only_live_rows::no everywhere we can remove this template parameter. Only the template parameter is removed; the internal logic around it is left in place (will be removed in a next patch), by hard-wiring `only_live()`.	2022-07-12 08:43:49 +03:00
Botond Dénes	9ee8ef5930	mutation_compactor: remove unused compact_mutation_state::parameters	2022-07-12 08:41:51 +03:00
Botond Dénes	9beef08a1b	mutation_compactor: add current_full_position() convenience accessor	2022-06-23 13:36:24 +03:00
Botond Dénes	a3cd235de2	mutation_compactor: s/_last_clustering_pos/_last_pos/ Generalize position tracking to track non-clustering positions too. Also add an accessor for it.	2022-06-23 13:36:24 +03:00
Botond Dénes	5a6e807a1c	mutation_compactor: add state accessor to compact_mutation	2022-06-23 13:36:24 +03:00
Tomasz Grabiec	570b76bc5b	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-15 11:30:01 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Tomasz Grabiec	604e720706	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-06 19:23:37 +02:00
Botond Dénes	279682056d	mutation_compactor: drop v1 related code-paths	2022-03-11 09:24:05 +02:00
Botond Dénes	924ff6a503	mutation_compactor: drop v1 support altogether from the API Fully mechanical change. Drop all v1 types, template types. Internal code is left unchanged, will be made v2 only in the next patch.	2022-03-11 09:24:05 +02:00
Botond Dénes	4e97477281	mutation_compactor: remove now unused compact_for_compaction	2022-03-10 09:16:33 +02:00
Botond Dénes	ad435dcf57	mutation_compactor: add v2 output The output version is selected via compactor_output_format, which is a template parameter of `compact_mutation_state` and all downstream types. This is to ensure a compaction state created to emit a v2 stream will not be accidentally used with a v1 consumer. When using a v2 output, the current active tombstone has to be tracked separately for the regular and for the gc consumer (if any), so that each can be closed properly on EOS. The current effective tombstone is tracked separately from these two. The reason is that purged tombstones are still applied to data, but are not emitted to the regular consumer.	2022-03-10 06:46:46 +02:00
Botond Dénes	1ccaeb2a1a	mutation_compactor: make _last_clustering_pos track last input Instead of updating _last_clustering_pos whenever a clustering fragment is pushed to the consumers, we now update it whenever a clustering fragment enters the compactor. Not only is this much more robust, but it also makes more sense. Just because a range tombstone is purged (and therefore the consumer doesn't see it), it still moves the logical clustering position in the stream. Also, tracking the input side avoids any ambiguity related to cases where we have two consumers (regular + gc consumer).	2022-03-10 06:46:46 +02:00
Botond Dénes	f1e9e3b3b7	compact_mutation: drop support for v1 input	2022-02-21 12:29:24 +02:00
Mikołaj Sielużycki	93d6eb6d51	compacting_reader: Support fast_forward_to position range. Fast forwarding is delegated to the underlying reader and assumes it's supported. The only corner case requiring special handling that has shown up in the tests is producing partition start mutation in the forwarding case if there are no other fragments. Compacting state keeps track of uncompacted partition start, but doesn't emit it by default. If end of stream is reached without producing a mutation fragment, partition start is not emitted. This is invalid behaviour in the forwarding case, so I've added a public method to compacting state to force marking partition as non-empty. I don't like this solution, as it feels like breaking an abstraction, but I didn't come across a better idea. Tests: unit(dev, debug, release) Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>	2022-01-31 13:37:36 +02:00
Botond Dénes	eb42213db4	compact_mutation: close active range tombstone on page end The compactor recently acquired the ability to consume a v2 stream. The v2 spec requires that all streams end with a null tombstone. `range_tombstone_assembler`, the component the compactor uses for converting the v2 input into its v1 output enforces this with a check on `consume_end_of_partition()`. Normally the producer of the stream the compactor is consuming takes care of closing the active tombstone before the stream ends. The compactor however (or its consumer) can decide to end the consume early, e.g. to cut the current page. When this happens the compactor must take care of closing the tombstone itself. Furthermore it has to keep this tombstone around to re-open it on the next page. This patch implements this mechanism which was left out of `134601a15e`. It also adds a unit test which reproduces the problems caused by the missing mechanism. The compactor now tracks the last clustering position emitted. When the page ends, this position will be used as the position of the closing range tombstone change. This ensures the range tombstone only covers the actually emitted range. Fixes: #9907 Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>	2022-01-25 09:52:30 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Botond Dénes	e8a918b25c	compact_mutation: make start_new_page() independent of mutation_fragment version By using partition_region instead of mutation_fragment::kind. This will make incremental migration of users to v2 easier.	2022-01-07 13:47:39 +02:00
Botond Dénes	790e73141f	compact_mutation: add support for consuming a v2 stream Consuming either a v1 or v2 stream is supported now, but compacted fragments are still emitted in the v1 format, thus the compactor acts as an online downgrader when consuming a v2 stream. This allows pushing out downgrade to v1 on the input side all the way into the compactor. This means that reads for example can now use an all v2 reader pipeline, the still mandatory downgrade to v1 happening at the last possible place: just before creating the result-set. Mandatory because our intra-node ABI is still v1. There are consumers who are ready for v2 in principle (e.g. compaction), they have to wait a little bit more.	2022-01-07 13:42:31 +02:00
Botond Dénes	1d842e980a	compact_mutation: extract range tombstone consumption into own method Next patch wants to reuse the same code.	2022-01-07 13:42:17 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent onto the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode: if no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly be delayed before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Botond Dénes	f0ead81250	mutation_compactor: collect stats about compacted data Stats contain the number of partitions, static rows, clustering rows and range tombstones. For rows dead/live are counted separately.	2021-09-22 13:59:19 +03:00
Botond Dénes	f02632aeb0	range_tombstone_accumulator: drop _reversed flag	2021-09-09 15:42:15 +03:00
Botond Dénes	502a45ad58	treewide: switch to native reversed format for reverse reads We define the native reverse format as a reversed mutation fragment stream that is identical to one that would be emitted by a table with the same schema but with reversed clustering order. The main difference to the current format is how range tombstones are handled: instead of looking at their start or end bound depending on the order, we always use them as-usual and the reversing reader swaps their bounds to facilitate this. This allows us to treat reversed streams completely transparently: just pass a reversed schema along with them and all the reader, compacting and result building code is happily ignorant about the fact that it is a reversed stream.	2021-09-09 15:42:15 +03:00
Asias He	4c1f8c2f83	compaction: Move compaction_garbage_collector.hh to compaction dir The top dir is a mess. Move compaction_garbage_collector.hh to the new home.	2021-08-07 08:07:09 +08:00
Asias He	47aae83185	mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state They are not used.	2021-08-07 07:21:48 +08:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Botond Dénes	73808c12eb	mutation compactor: query compaction: ignore purgeable tombstones This behaviour makes query result building sensitive to whether the data was recently compacted or not, in particular different digests will be produced depending on whether purgeable tombstones happened to be compacted (and thus purged) or not. This means that two replicas can produce different digests for the same data if one has compacted some purgeable tombstones and the other has not. To avoid this, drop purgeable tombstones during query compaction as well.	2021-01-22 15:27:48 +02:00
Wojciech Mitros	45215746fe	increase the maximum size of query results to 2^64 Currently, we cannot select more than 2^32 rows from a table because we are limited by types of variables containing the numbers of rows. This patch changes these types and sets new limits. The new limits take effect while selecting all rows from a table - custom limits of rows in a result stay the same (2^32-1). In classes which are being serialized and used in messaging, in order to be able to process queries originating from older nodes, the top 32 bits of new integers are optional and stay at the end of the class - if they're absent we assume they equal 0. The backward compatibility was tested by querying an older node for a paged selection, using the received paging_state with the same select statement on an upgraded node, and comparing the returned rows with the result generated for the same query by the older node, additionally checking if the paging_state returned by the upgraded node contained new fields with correct values. Also verified that the older node simply ignores the top 32 bits of the remaining rows number when handling a query with a paging_state originating from an upgraded node by generating and sending such a query to an older node and checking the paging_state in the reply (using the Python driver). Fixes #5101.	2020-08-03 17:32:49 +02:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Vladimir Davydov	e0b31dd273	query: add flag to return static row on partition with no rows A SELECT statement that has clustering key restrictions isn't supposed to return static content if no regular rows match the restrictions, see #589. However, for the CAS statement we do need to return static content on failure so this patch adds a flag that allows the caller to override this behavior.	2019-10-28 21:50:44 +03:00
Kamil Braun	bbdb438d89	collection_mutation: easier (de)serialization of collection_mutation(s). `collection_type_impl::serialize_mutation_form` became `collection_mutation(_view)_description::serialize`. Previously callers had to cast their data_type down to collection_type to use serialize_mutation_form. Now it's done inside `serialize`. In the future `serialize` will be generalized to handle UDTs. `collection_type_impl::deserialize_mutation_form` became a free standing function `deserialize_collection_mutation` with similar benefits. Actually, no one needs to call this function manually because of the next paragraph. A common pattern consisting of linearizing data inside a `collection_mutation_view` followed by calling `deserialize_mutation_form` has been abstracted out as a `with_deserialized` method inside collection_mutation_view. serialize_mutation_form_only_live was removed, because it hadn't been used anywhere.	2019-10-25 10:42:58 +02:00
Kamil Braun	b1d16c1601	types: move collection_type_impl::mutation(_view) out of collection_type_impl. collection_type_impl::mutation became collection_mutation_description. collection_type_impl::mutation_view became collection_mutation_view_description. These classes now reside inside collection_mutation.hh. Additional documentation has been written for these classes. Related function implementations were moved to collection_mutation.cc. This makes it easier to generalize these classes to non-frozen UDTs in future commits. The new names (together with documentation) better describe their purpose.	2019-10-25 10:19:45 +02:00
Botond Dénes	7a4a609e88	Introduce Garbage Collected Consumer to Mutation Compactor Introduce consumer in mutation compactor that will only consume data that is purged away from regular consumer. The goal is to allow compaction implementation to do whatever it wants with the garbage collected data, like saving it for preventing data resurrection from ever happening, like described in issue #4531. noop_compacted_fragments_consumer is made available for users that don't need this capability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2019-07-15 17:38:00 +03:00
Botond Dénes	33d72efa49	mutation_compactor: add detach_state() Allow the state of the compaction to be detached. The detached state is a set of mutation fragments, which if replayed through a new compactor object will result in the latter being in the same state as the previous one was. This allows for storing the compaction state in the compacted reader by using `unpop_mutation_fragment()` to push back the fragments that comprise the detached state into the reader. This way, if a new compaction object is created it can just consume the reader and continue where the previous compaction left off.	2018-09-03 10:31:44 +03:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Duarte Nunes	67dac67c46	mutation_partition: Regular base column in view determines row liveness When views contain a primary key column that is not part of the base table primary key, that column determines whether the row is live or not. We need to ensure that when that cell is dead, and thus the derived row marker, either by normal deletion or by TTL, so is the rest of the row. This patch introduces the idea of a shadowing row marker. We map the status of the regular base column in the view's PK to the view row's marker. If this marker is dead, so is that cell in the base table, and so should the view row become. To enforce that, a view row's dead marker shadows the whole row if that view includes a base regular column in its PK. Fixes #3360 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-23 09:32:02 +01:00
Botond Dénes	7a5143a670	Add querier The querier encapsulates all objects needed to serve queries, except result builders. It is designed to be suspendable, savable and resumable. It contains all logic needed to suspend, resume and determine whether the querier can be resumed or not. It is the foundation upon which the "reader-reuse" mechanism is built.	2018-03-13 10:34:34 +02:00
Botond Dénes	84d872babf	Add are_limits_reached() compact_mutation_state are_limits_reached() allows querying whether the compactor reached the page's limits. This is needed to determine whether there will be more pages and thus whether the compact_mutation_state has to be kept around.	2018-03-13 10:34:34 +02:00
Botond Dénes	2c1081b0e9	Add start_new_page() to compact_mutation_state start_new_page() resets the limits to the current page's ones and sets the _empty_partition flag so that the partition header (if the last page finished inside a partition) will be reemitted.	2018-03-13 10:34:34 +02:00
Botond Dénes	3fca8aaefb	Save last key of the page and method to query it Make a copy of the current decorated-key in consume_end_of_stream() so that it persists while the compaction state is suspended. Also add current_partition() to allow client code to query the partition the compaction is positioned in. This is needed to determine whether the start position of the next page matches that of the compact_mutation_state.	2018-03-13 10:34:34 +02:00
Botond Dénes	2fcc99fe43	Make compact_mutation reusable Currently compact_mutation is used as a use-once-then-throw-away object. After it satisfies its consumer it's destroyed together with the consumer. This conflicts with the effort to save and reuse readers and associated infrastructure between pages of a query. To resolve this conflict compact_mutation is split into two classes: (1) compact_mutation_state (2) compact_mutation compact_mutation_state encapsulates all the compaction logic and state, while compact_mutation continues to provide the same API using compact_mutation_state behind the scenes. compact_mutation_state doesn't store the consumer, instead its consume_* methods are templated on the consumer and take it as an argument. This allows compact_mutation_state to be independent of the consumer's type. Additionally compact_mutation can now be constructed from a shared pointer to compact_mutation_state. This allows client code to pre-construct a compaction state and retain it after the compact_mutation object is destroyed. These changes allow the state of a compaction to be saved and restored later while code that is only interested in storing the saved state can stay independent of the consumer's type. This patch only contains the splitting of compact_mutation into compact_mutation and compact_mutation_state. The next patches will add the missing functionality that is needed to make compact_mutation_state truly reusable across pages.	2018-03-13 10:34:34 +02:00
Botond Dénes	7bd500049d	Add the CompactedFragmentsConsumer Dust off the commented-out CompactMutationConsumer concept, make it usable and rename it to CompactedFragmentsConsumer (as we now have flat readers).	2018-03-13 10:34:34 +02:00
Piotr Jastrzebski	96c97ad1db	Rename streamed_mutation* files to mutation_fragment* Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:56:49 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Duarte Nunes	c7aa3ea069	mutation_partition: Remove obsolete short read detection When compacting a partition for querying we would read an extra row, to include any tombstones between that one and the previous row. This is no longer needed since we have a general mechanism to detect short reads in the storage_proxy. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811103031.22866-1-duarte@scylladb.com>	2017-08-15 12:01:55 +01:00
Duarte Nunes	4e693383f7	mutation_partion: Use row_tombstone This patch replaces the current row tombstone representation by a row_tombstone. The intent of the patch is thus to reify the idea of shadowable tombstones, that up until now we considered all materialized view row tombstones to be. We need to distinguish shadowable from non-shadowable row tombstones to support scenarios such as, when inserting to a table with a materialized view: 1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1 2. delete from base using timestamp 2 where p = 3 3. insert into base (p, v1) values (3, 1) using timestamp 3 These should yield a view row where v2 is definitely null, but with the current implementation, v2 will pop back with its value v2=3@TS=1, even though it's dead in the base row. This is because the row tombstone inserted at 2) is a shadowable one. This patch only addresses the memory representation of such row_tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Tomasz Grabiec	4b6e77e97e	db: Fix overflow of gc_clock time point If query_time is time_point::min(), which is used by to_data_query_result(), the result of subtraction of gc_grace_seconds() from query_time will overflow. I don't think this bug would currently have user-perceivable effects. This affects which tombstones are dropped, but in case of to_data_query_result() uses, tombstones are not present in the final data query result, and mutation_partition::do_compact() takes tombstones into consideration while compacting before expiring them. Fixes the following UBSAN report: /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int' Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>	2017-03-01 18:49:56 +02:00
Paweł Dziepak	34f9eb4cbd	mutation_compactor: honour stop_iteration from consumers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
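
The overflow described in commit `4b6e77e97e` arises when a grace period is subtracted from the minimum representable time point. The following is a minimal sketch of one way to avoid that class of bug, assuming a 32-bit signed seconds counter as in the quoted UBSAN report. The names `gc_seconds` and `saturating_sub` are hypothetical, and this is not necessarily how the commit itself fixes the problem.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a clock duration backed by a 32-bit signed
// count of seconds (the width shown in the UBSAN report).
using gc_seconds = std::chrono::duration<std::int32_t>;

// Computes query_time - grace without signed overflow: the subtraction is
// done in 64 bits and the result is clamped to the 32-bit range.
constexpr gc_seconds saturating_sub(gc_seconds query_time, gc_seconds grace) {
    const std::int64_t wide =
        std::int64_t{query_time.count()} - std::int64_t{grace.count()};
    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    const std::int64_t clamped = wide < lo ? lo : (wide > hi ? hi : wide);
    return gc_seconds(static_cast<std::int32_t>(clamped));
}
```

With this helper, subtracting a 604800-second grace period from the minimum time point yields the minimum time point again instead of undefined behaviour.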

59 Commits
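
The four `tombstone_gc` modes listed in commit `a8ad385ecd` can be sketched as a single decision function. This is an illustrative sketch only: the type `gc_inputs`, the function `can_gc` and the exact comparisons are assumptions made for the example, not the code from the commit, and the `propagation_delay_in_seconds` option is deliberately left out.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// The four modes named in the commit message.
enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

using seconds = std::chrono::seconds;

// Hypothetical bundle of the values the decision depends on.
struct gc_inputs {
    seconds deletion_time;                    // when the tombstone was written
    seconds now;                              // current time
    seconds gc_grace;                         // gc_grace_seconds of the table
    std::optional<seconds> last_repair_time;  // last repair covering the range, if any
};

// Returns true if the tombstone may be garbage collected under `mode`.
inline bool can_gc(tombstone_gc_mode mode, const gc_inputs& in) {
    switch (mode) {
    case tombstone_gc_mode::timeout:
        // Default mode: collect once gc_grace_seconds have elapsed.
        return in.deletion_time + in.gc_grace < in.now;
    case tombstone_gc_mode::disabled:
        return false;  // never collect
    case tombstone_gc_mode::immediate:
        return true;   // collect right away
    case tombstone_gc_mode::repair:
        // Collect only if a repair covering the range ran after the deletion.
        return in.last_repair_time.has_value()
            && in.deletion_time < *in.last_repair_time;
    }
    return false;
}
```

In this sketch, a tombstone in `repair` mode stays in place until a repair time later than its deletion time is recorded, regardless of how much wall-clock time has passed.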