scylladb

Author	SHA1	Message	Date
Botond Dénes	302917f63d	mutation_compactor: add validator The mutation compactor is used on most read-paths we have, so adding a validator to it gives us good coverage, in particular it gives us full coverage of queries and compaction. The validator validates mutation token (and mutation fragment kind) monotonicity as that is quite cheap, while it is enough to catch the most common problems. As we already have a validator on the compaction path (in the sstable writer), the validator is disabled when the mutation compactor is instantiated for compaction. We should probably make this configurable at some point. The addition of this validator should prevent the worst of the fragment reordering bugs from affecting reads.	2022-11-11 10:26:05 +02:00
Botond Dénes	0bcfc9d522	treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} We just added a convenience static factory method for partition end, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Botond Dénes	f1a039fc2b	treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} We just added a convenience static factory method for partition start, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Benny Halevy	8b38893895	mutation_compactor: pass tombstone_gc_state to compact_mutation_state Used in get_gc_before. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	d86810d22c	mutation_partition: compact_for_compaction_v2: get tombstone_gc_state To be passed down to compact_mutation_state in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	5dd15aa3c8	tombstone_gc: introduce tombstone_gc_state and use it to access the repair history maps. At this introductory patch, we use default-constructed tombstone_gc_state to access the thread-local maps temporarily and those use sites will be replaced in following patches that will gradually pass the tombstone_gc_state down from the compaction_manager to where it's used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Botond Dénes	70b4158ce0	mutation_compactor: detach_state(): make it no-op if partition was exhausted detach_state() allows the user to resume a compaction process later, without having to keep the compactor object alive. This happens by generating and returning the mutation fragments the user has to re-feed to a newly constructed compactor to bring it into the exact same state the current compactor was at the point of stopping the compaction. This state includes the partition-header (partition-start and static-row if any) and the currently active range tombstone. Detaching the state is pointless however when the compaction was stopped such that the currently compacted partition was completely exhausted. Allowing the state to be detached in this case seems benign but it caused a subtle bug in the main user of this feature: the partition range scan algorithm, where the fragments included in the detached state were pushed back into the reader which produced them. If the partition happened to be exhausted -- meaning the next fragment in the reader was a partition-start or EOS -- this resulted in the partition being re-emitted later without a partition-end, resulting in corrupt query-result being generated, in turn resulting in an obscure "IDL frame truncated" error. This patch solves this seemingly benign but sinister bug by making the return value of `detach_state()` an std::optional and returning a disengaged optional when the partition was exhausted.	2022-08-02 06:43:24 +03:00
Avi Kivity	00cec159d6	Revert "Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes" This reverts commit `c3bad157e5`, reversing changes made to `e66809d051`. The checks it adds are triggered by some dtests. While it's possible the check is triggered due to an existing problem, better to investigate it out-of-tree. Fixes #11169.	2022-07-31 15:24:33 +03:00
Botond Dénes	f119554106	mutation_compactor: detach_state(): make it no-op if partition was exhausted detach_state() allows the user to resume a compaction process later, without having to keep the compactor object alive. This happens by generating and returning the mutation fragments the user has to re-feed to a newly constructed compactor to bring it into the exact same state the current compactor was at the point of stopping the compaction. This state includes the partition-header (partition-start and static-row if any) and the currently active range tombstone. Detaching the state is pointless however when the compaction was stopped such that the currently compacted partition was completely exhausted. Allowing the state to be detached in this case seems benign but it caused a subtle bug in the main user of this feature: the partition range scan algorithm, where the fragments included in the detached state were pushed back into the reader which produced them. If the partition happened to be exhausted -- meaning the next fragment in the reader was a partition-start or EOS -- this resulted in the partition being re-emitted later without a partition-end, resulting in corrupt query-result being generated, in turn resulting in an obscure "IDL frame truncated" error. This patch solves this seemingly benign but sinister bug by making the return value of `detach_state()` an std::optional and returning a disengaged optional when the partition was exhausted.	2022-07-28 09:02:26 +03:00
Botond Dénes	c54d19427d	mutation_compactor: don't ignore consumer's stop request on range tombstone Broken since the v2 output support was introduced (`ad435dc`). No known adverse effects, besides mutation reads stopping a little later than desired (on the next non-range-tombstone-change fragment) and hence consuming more memory than the limit set for them. Fixes: #11138 Closes #11139	2022-07-27 22:24:29 +03:00
Botond Dénes	17509e9664	mutation_compactor: remove only-live related logic We removed the template parameter in the previous patch, now we can remove the logic related to it.	2022-07-12 08:44:32 +03:00
Botond Dénes	4d2ce5c304	mutation_compactor: remove emit_only_live_rows template parameter Now that we use emit_only_live_rows::no everywhere we can remove this template parameter. Only the template parameter is removed, the internal logic around it is left in place (will be removed in a next patch), by hard-wiring `only_live()`.	2022-07-12 08:43:49 +03:00
Botond Dénes	9ee8ef5930	mutation_compactor: remove unused compact_mutation_state::parameters	2022-07-12 08:41:51 +03:00
Botond Dénes	9beef08a1b	mutation_compactor: add current_full_position() convenience accessor	2022-06-23 13:36:24 +03:00
Botond Dénes	a3cd235de2	mutation_compactor: s/_last_clustering_pos/_last_pos/ Generalize position tracking to track non-clustering positions too. Also add an accessor for it.	2022-06-23 13:36:24 +03:00
Botond Dénes	5a6e807a1c	mutation_compactor: add state accessor to compact_mutation	2022-06-23 13:36:24 +03:00
Tomasz Grabiec	570b76bc5b	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-15 11:30:01 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Tomasz Grabiec	604e720706	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-06 19:23:37 +02:00
Botond Dénes	279682056d	mutation_compactor: drop v1 related code-paths	2022-03-11 09:24:05 +02:00
Botond Dénes	924ff6a503	mutation_compactor: drop v1 support altogether from the API Fully mechanical change. Drop all v1 types, template types. Internal code is left unchanged, will be made v2 only in the next patch.	2022-03-11 09:24:05 +02:00
Botond Dénes	4e97477281	mutation_compactor: remove now unused compact_for_compaction	2022-03-10 09:16:33 +02:00
Botond Dénes	ad435dcf57	mutation_compactor: add v2 output The output version is selected via compactor_output_format, which is a template parameter of `compact_mutation_state` and all downstream types. This is to ensure a compaction state created to emit a v2 stream will not be accidentally used with a v1 consumer. When using a v2 output, the current active tombstone has to be tracked separately for the regular and for the gc consumer (if any), so that each can be closed properly on EOS. The current effective tombstone is tracked separately from these two. The reason is that purged tombstones are still applied to data, but are not emitted to the regular consumer.	2022-03-10 06:46:46 +02:00
Botond Dénes	1ccaeb2a1a	mutation_compactor: make _last_clustering_pos track last input Instead of updating _last_clustering_pos whenever a clustering fragment is pushed to the consumers, we now update it whenever a clustering fragment enters the compactor. Not only is this much more robust, but it also makes more sense. Just because a range tombstone is purged (and therefore the consumer doesn't see it), it still moves the logical clustering position in the stream. Also, tracking the input side avoids any ambiguity related to cases where we have two consumers (regular + gc consumer).	2022-03-10 06:46:46 +02:00
Botond Dénes	f1e9e3b3b7	compact_mutation: drop support for v1 input	2022-02-21 12:29:24 +02:00
Mikołaj Sielużycki	93d6eb6d51	compacting_reader: Support fast_forward_to position range. Fast forwarding is delegated to the underlying reader and assumes it's supported. The only corner case requiring special handling that has shown up in the tests is producing partition start mutation in the forwarding case if there are no other fragments. compacting state keeps track of uncompacted partition start, but doesn't emit it by default. If end of stream is reached without producing a mutation fragment, partition start is not emitted. This is invalid behaviour in the forwarding case, so I've added a public method to compacting state to force marking partition as non-empty. I don't like this solution, as it feels like breaking an abstraction, but I didn't come across a better idea. Tests: unit(dev, debug, release) Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>	2022-01-31 13:37:36 +02:00
Botond Dénes	eb42213db4	compact_mutation: close active range tombstone on page end The compactor recently acquired the ability to consume a v2 stream. The v2 spec requires that all streams end with a null tombstone. `range_tombstone_assembler`, the component the compactor uses for converting the v2 input into its v1 output enforces this with a check on `consume_end_of_partition()`. Normally the producer of the stream the compactor is consuming takes care of closing the active tombstone before the stream ends. The compactor however (or its consumer) can decide to end the consume early, e.g. to cut the current page. When this happens the compactor must take care of closing the tombstone itself. Furthermore it has to keep this tombstone around to re-open it on the next page. This patch implements this mechanism which was left out of `134601a15e`. It also adds a unit test which reproduces the problems caused by the missing mechanism. The compactor now tracks the last clustering position emitted. When the page ends, this position will be used as the position of the closing range tombstone change. This ensures the range tombstone only covers the actually emitted range. Fixes: #9907 Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>	2022-01-25 09:52:30 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Botond Dénes	e8a918b25c	compact_mutation: make start_new_page() independent of mutation_fragment version By using partition_region instead of mutation_fragment::kind. This will make incremental migration of users to v2 easier.	2022-01-07 13:47:39 +02:00
Botond Dénes	790e73141f	compact_mutation: add support for consuming a v2 stream Consuming either a v1 or v2 stream is supported now, but compacted fragments are still emitted in the v1 format, thus the compactor acts as an online downgrader when consuming a v2 stream. This allows pushing out downgrade to v1 on the input side all the way into the compactor. This means that reads for example can now use an all v2 reader pipeline, the still mandatory downgrade to v1 happening at the last possible place: just before creating the result-set. Mandatory because our intra-node ABI is still v1. There are consumers who are ready for v2 in principle (e.g. compaction), they have to wait a little bit more.	2022-01-07 13:42:31 +02:00
Botond Dénes	1d842e980a	compact_mutation: extract range tombstone consumption into own method Next patch wants to reuse the same code.	2022-01-07 13:42:17 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Botond Dénes	f0ead81250	mutation_compactor: collect stats about compacted data Stats contain the number of partitions, static rows, clustering rows and range tombstones. For rows dead/live are counted separately.	2021-09-22 13:59:19 +03:00
Botond Dénes	f02632aeb0	range_tombstone_accumulator: drop _reversed flag	2021-09-09 15:42:15 +03:00
Botond Dénes	502a45ad58	treewide: switch to native reversed format for reverse reads We define the native reverse format as a reversed mutation fragment stream that is identical to one that would be emitted by a table with the same schema but with reversed clustering order. The main difference to the current format is how range tombstones are handled: instead of looking at their start or end bound depending on the order, we always use them as-usual and the reversing reader swaps their bounds to facilitate this. This allows us to treat reversed streams completely transparently: just pass along them a reversed schema and all the reader, compacting and result building code is happily ignorant about the fact that it is a reversed stream.	2021-09-09 15:42:15 +03:00
Asias He	4c1f8c2f83	compaction: Move compaction_garbage_collector.hh to compaction dir The top dir is a mess. Move compaction_garbage_collector.hh to the new home.	2021-08-07 08:07:09 +08:00
Asias He	47aae83185	mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state They are not used.	2021-08-07 07:21:48 +08:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Botond Dénes	73808c12eb	mutation compactor: query compaction: ignore purgeable tombstones This behaviour makes query result building sensitive to whether the data was recently compacted or not, in particular different digests will be produced depending on whether purgeable tombstones happened to be compacted (and thus purged) or not. This means that two replicas can produce different digests for the same data if one has compacted some purgeable tombstones and the other has not. To avoid this, drop purgeable tombstones during query compaction as well.	2021-01-22 15:27:48 +02:00
Wojciech Mitros	45215746fe	increase the maximum size of query results to 2^64 Currently, we cannot select more than 2^32 rows from a table because we are limited by types of variables containing the numbers of rows. This patch changes these types and sets new limits. The new limits take effect while selecting all rows from a table - custom limits of rows in a result stay the same (2^32-1). In classes which are being serialized and used in messaging, in order to be able to process queries originating from older nodes, the top 32 bits of new integers are optional and stay at the end of the class - if they're absent we assume they equal 0. The backward compatibility was tested by querying an older node for a paged selection, using the received paging_state with the same select statement on an upgraded node, and comparing the returned rows with the result generated for the same query by the older node, additionally checking if the paging_state returned by the upgraded node contained new fields with correct values. Also verified if the older node simply ignores the top 32 bits of the remaining rows number when handling a query with a paging_state originating from an upgraded node by generating and sending such a query to an older node and checking the paging_state in the reply(using python driver). Fixes #5101.	2020-08-03 17:32:49 +02:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Vladimir Davydov	e0b31dd273	query: add flag to return static row on partition with no rows A SELECT statement that has clustering key restrictions isn't supposed to return static content if no regular rows match the restrictions, see #589. However, for the CAS statement we do need to return static content on failure so this patch adds a flag that allows the caller to override this behavior.	2019-10-28 21:50:44 +03:00
Kamil Braun	bbdb438d89	collection_mutation: easier (de)serialization of collection_mutation(s). `collection_type_impl::serialize_mutation_form` became `collection_mutation(_view)_description::serialize`. Previously callers had to cast their data_type down to collection_type to use serialize_mutation_form. Now it's done inside `serialize`. In the future `serialize` will be generalized to handle UDTs. `collection_type_impl::deserialize_mutation_form` became a free standing function `deserialize_collection_mutation` with similar benefits. Actually, no one needs to call this function manually because of the next paragraph. A common pattern consisting of linearizing data inside a `collection_mutation_view` followed by calling `deserialize_mutation_form` has been abstracted out as a `with_deserialized` method inside collection_mutation_view. serialize_mutation_form_only_live was removed, because it hadn't been used anywhere.	2019-10-25 10:42:58 +02:00
Kamil Braun	b1d16c1601	types: move collection_type_impl::mutation(_view) out of collection_type_impl. collection_type_impl::mutation became collection_mutation_description. collection_type_impl::mutation_view became collection_mutation_view_description. These classes now reside inside collection_mutation.hh. Additional documentation has been written for these classes. Related function implementations were moved to collection_mutation.cc. This makes it easier to generalize these classes to non-frozen UDTs in future commits. The new names (together with documentation) better describe their purpose.	2019-10-25 10:19:45 +02:00
Botond Dénes	7a4a609e88	Introduce Garbage Collected Consumer to Mutation Compactor Introduce consumer in mutation compactor that will only consume data that is purged away from regular consumer. The goal is to allow compaction implementation to do whatever it wants with the garbage collected data, like saving it for preventing data resurrection from ever happening, like described in issue #4531. noop_compacted_fragments_consumer is made available for users that don't need this capability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2019-07-15 17:38:00 +03:00
Botond Dénes	33d72efa49	mutation_compactor: add detach_state() Allow the state of the compaction to be detached. The detached state is a set of mutation fragments, which if replayed through a new compactor object will result in the latter being in the same state as the previous one was. This allows for storing the compaction state in the compacted reader by using `unpop_mutation_fragment()` to push back the fragments that comprise the detached state into the reader. This way, if a new compaction object is created it can just consume the reader and continue where the previous compaction left off.	2018-09-03 10:31:44 +03:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Duarte Nunes	67dac67c46	mutation_partition: Regular base column in view determines row liveness When views contain a primary key column that is not part of the base table primary key, that column determines whether the row is live or not. We need to ensure that when that cell is dead, and thus the derived row marker, either by normal deletion or by TTL, so is the rest of the row. This patch introduces the idea of a shadowing row marker. We map the status of the regular base column in the view's PK to the view row's marker. If this marker is dead, so is that cell in the base table, and so should the view row become. To enforce that, a view row's dead marker shadows the whole row if that view includes a base regular column in its PK. Fixes #3360 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-23 09:32:02 +01:00
Botond Dénes	7a5143a670	Add querier The querier encapsulates all objects needed to serve queries, except result builders. It is designed to be suspendable, savable and resumable. It contains all logic needed to suspend, resume and determine whether the querier can be resumed or not. It is the foundation upon which the "reader-reuse" mechanism is built.	2018-03-13 10:34:34 +02:00
Botond Dénes	84d872babf	Add are_limits_reached() compact_mutation_state are_limits_reached() allows querying whether the compactor reached the page's limits. This is needed to determine whether there will be more pages and thus whether the compact_mutation_state has to be kept around.	2018-03-13 10:34:34 +02:00
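
Note on commit 302917f63d (validator): the check it describes, token and fragment-kind monotonicity, can be illustrated with a small standalone sketch. This is not the code from the commit. The names `fragment_kind`, `monotonicity_validator`, `on_partition_start()` and `on_fragment()` are hypothetical, tokens are modelled as plain 64-bit integers, and equal tokens on consecutive partitions are assumed to be acceptable.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified fragment kinds (not the real ScyllaDB types).
enum class fragment_kind {
    partition_start,
    static_row,
    clustering_row,
    range_tombstone_change,
    partition_end,
};

// Sketch of a cheap stream validator: partition tokens must not decrease,
// and fragment kinds must not go backwards inside a partition.
class monotonicity_validator {
    std::optional<std::int64_t> _last_token;
    // partition_end doubles as "no partition is currently open".
    fragment_kind _last_kind = fragment_kind::partition_end;

    // Both clustering kinds share a region and may interleave and repeat.
    static int region(fragment_kind k) {
        switch (k) {
        case fragment_kind::partition_start: return 0;
        case fragment_kind::static_row: return 1;
        case fragment_kind::clustering_row:
        case fragment_kind::range_tombstone_change: return 2;
        case fragment_kind::partition_end: return 3;
        }
        return 3;
    }

public:
    // Returns false if the previous partition was not closed or the token decreased.
    bool on_partition_start(std::int64_t token) {
        if (_last_kind != fragment_kind::partition_end) {
            return false;
        }
        if (_last_token && token < *_last_token) {
            return false;
        }
        _last_token = token;
        _last_kind = fragment_kind::partition_start;
        return true;
    }

    // Validates every fragment other than partition start.
    // On failure the validator state is left unchanged.
    bool on_fragment(fragment_kind kind) {
        if (kind == fragment_kind::partition_start) {
            return false; // must go through on_partition_start()
        }
        if (_last_kind == fragment_kind::partition_end) {
            return false; // no partition is open
        }
        const int prev = region(_last_kind);
        const int next = region(kind);
        const bool repeatable = (next == 2);
        if (next < prev || (next == prev && !repeatable)) {
            return false;
        }
        _last_kind = kind;
        return true;
    }
};
```

A caller would invoke `on_partition_start()` with the partition's token and `on_fragment()` for each subsequent fragment, treating a `false` return as a corrupt stream. Checking only tokens and kinds keeps the per-fragment cost to a couple of integer comparisons, which matches the commit's stated reason for choosing this level of validation.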


69 Commits