scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 02:20:37 +00:00

Author	SHA1	Message	Date
Piotr Jastrzebski	147cc031db	Move map_type_impl out of types.hh to types/map.hh Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-01-24 09:56:38 +01:00
Benny Halevy	93270dd8e0	gc_clock: make 64 bit Fixes: #3353 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-01-22 15:34:32 +02:00
Benny Halevy	16dda033a5	sstables: row_marker: initialize _expiry compare_row_marker_for_merge compares deletion_time also for row markers that have missing timestamps. This happened to succeed due to implicit initialization to 0. However, we prefer the initialization to be explicit and allow calling row_marker::deletion_time() in all states. Fixes #4068 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20190110102949.17896-1-bhalevy@scylladb.com>	2019-01-10 12:45:07 +01:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Tomasz Grabiec	ac49b1def0	mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup partition_snapshots created in the memtable will keep a reference to the memtable (as region*) and to memtable::_cleaner. As long as the reader is alive the memtable will be kept alive by partition_snapshot_flat_reader::_container_guard. But after that, nothing prevents it from being destroyed. The snapshot can outlive the read if mutation_cleaner::merge_and_destroy() defers its destruction for later. When the read ends after memtable was flushed, the snapshot will be queued in the cache's cleaner, but internally will reference memtable's region and cleaner. This will result in a use-after-free when the snapshot resumes destruction. The fix is to update the snapshot's region and cleaner references at the time of queueing to point to the cache's region and cleaner. When memtable is destroyed without being moved to cache there is no problem, because the snapshot would be queued into memtable's cleaner, which will be drained on destruction from all snapshots. Introduced in `f3da043`. Fixes #4030.	2018-12-27 18:08:50 +01:00
Paweł Dziepak	9024187222	partition_slice: use small_vector for column_ids	2018-12-06 14:21:04 +00:00
Piotr Jastrzebski	411437f320	Fix format string in mutation_partition::operator<< fmt does not allow bool values for :d and previous format string was resulting in: fmt::v5::format_error: invalid type specifier Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <3980a3cdb903263e29689b1c6cd24e3592826fe0.1542284205.git.piotr@scylladb.com>	2018-11-15 12:22:10 +00:00
Avi Kivity	a71ab365e3	toplevel: convert sprint() to format() sprint() recently became more strict, throwing on sprint("%s", 5). Replace with the more modern format(). Mechanically converted with https://github.com/avikivity/unsprint.	2018-11-01 13:16:17 +00:00
Paweł Dziepak	637b9a7b3b	atomic_cell_or_collection: make operator<< show cell content After the new in-memory representation of cells was introduced there was a regression in atomic_cell_or_collection::operator<< which stopped printing the content of the cell. This makes debugging more inconvenient and time-consuming. This patch fixes the problem. Schema is propagated to the atomic_cell_or_collection printer and the full content of the cell is printed. Fixes #3571. Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>	2018-10-24 13:29:51 +03:00
Botond Dénes	eb357a385d	flat_mutation_reader: make timeout opt-out rather than opt-in Currently timeout is opt-in, that is, all methods that even have it default it to `db::no_timeout`. This means that ensuring timeout is used where it should be is completely up to the author and the reviewers of the code. As humans are notoriously prone to mistakes this has resulted in a very inconsistent usage of timeout, many clients of `flat_mutation_reader` passing the timeout only to some members and only on certain call sites. This is small wonder considering that some core operations like `operator()()` only recently received a timeout parameter and others like `peek()` didn't even have one until this patch. Both of these methods call `fill_buffer()` which potentially talks to the lower layers and is supposed to propagate the timeout. All this makes the `flat_mutation_reader`'s timeout effectively useless. To make order in this chaos make the timeout parameter a mandatory one on all `flat_mutation_reader` methods that need it. This ensures that humans now get a reminder from the compiler when they forget to pass the timeout. Clients can still opt-out from passing a timeout by passing `db::no_timeout` (the previous default value) but this will be now explicit and developers should think before typing it. There were surprisingly few core call sites to fix up. Where a timeout was available nearby I propagated it to be able to pass it to the reader, where I couldn't I passed `db::no_timeout`. Authors of the latter kind of code (view, streaming and repair are some of the notable examples) should maybe consider propagating down a timeout if needed. In the test code (the vast majority of the changes) I just used `db::no_timeout` everywhere. Tests: unit(release, debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>	2018-09-20 11:31:24 +02:00
Botond Dénes	d9a2ffad84	mutation_partition: don't move tracing_state early Currently the `trace_state` is moved into the `querier` object's constructor when one has to be created. Since the trace_state is used below this line, this had the effect that on the first page of the query, when a querier object has to be created, tracing would not work inside the `querier_cache` which received a moved-from `trace_state` (a nullptr effectively). Change the move to a copy so the other half of the function doesn't use a moved-from `trace_state`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4987419781aa287141aa9dc8ce99c5068b564c84.1536739052.git.bdenes@scylladb.com>	2018-09-12 11:32:08 +02:00
Gleb Natapov	d7674288a9	mutation_partition: accurately account for result size in digest only queries When measuring_output_stream is used to calculate result's element size it incorrectly takes into account not only serialized element size, but a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells constructors put in the beginning. Fix it by taking starting point in a stream before element serialization and subtracting it afterwards. Fixes #3755 Message-Id: <20180906153609.GJ2326@scylladb.com>	2018-09-06 20:52:44 +03:00
Gleb Natapov	98092353df	mutation_partition: correctly measure static row size when doing digest calculation The code uses incorrect output stream in case only digest is requested and thus getting incorrect data size. Failing to correctly account for static row size while calculating digest may cause digest mismatch between digest and data query. Fixes #3753. Message-Id: <20180905131219.GD2326@scylladb.com>	2018-09-06 13:09:41 +03:00
Tomasz Grabiec	4fb3f7e8eb	managed_vector: Make external_memory_usage() ignore reserved space This ensures that row::external_memory_usage() is invariant to insertion order of cells. It should be so, so that accounting of a clustering_row, merged from multiple MVCC versions by the partition_snapshot_flat_reader on behalf of a memtable flush, doesn't give a greater result than what is used by the memtable region. Overaccounting leads to assertion failure in ~flush_memory_accounter. Fixes #3625 (hopefully). Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>	2018-09-03 17:09:54 +03:00
Botond Dénes	3bcd577907	Move reconcilable_result_builder declaration to mutation_query.hh It will be used by code outside of mutation_partition.cc so it needs to be public. The definition remains in mutation_partition.cc.	2018-09-03 10:31:44 +03:00
Botond Dénes	c12008b8cb	querier: split querier into separate data and mutation querier types Instead of hiding what compaction method the querier uses (and only expose it via rejecting `can_be_used_for_page()`) make it very explicit that these are really two different queriers. This allows using different indexes for the two queriers in `querier_cache` and eliminating the possibility of picking up a querier with the wrong compaction method (read kind). This also makes it possible to add new querier type(s) that suit the multishard-query's needs without making a confusing mess of `querier` by making it a union of all querying logic. Splitting the queriers this way changes what happens when a lookup finds a querier of the wrong kind (e.g. emit_only_live::yes for an emit_only_live::no command). As opposed to dropping the found (but wrong) querier the querier will now simply not be found by the lookup. This is a result of using separate search indexes for the different mutation kinds. This change should have no practical implications. Splitting is done by making querier templated on `emit_only_live_rows`. It doesn't make sense to duplicate the entire querier as the two share 99% of the code.	2018-09-03 10:31:44 +03:00
Botond Dénes	cded477b94	querier: return std::optional<querier> instead of using create_fun() Requiring the caller of lookup() to pass in a `create_fun()` was not such a good idea in hindsight. It leads to awkward call sites and even more awkward code when trying to find out whether the lookup was successful or not. Returning an optional gives calling code much more flexibility and makes the code cleaner.	2018-09-03 10:31:44 +03:00
Botond Dénes	5f726e9a89	querier: move all to query namespace To avoid name clashes.	2018-09-03 10:31:44 +03:00
Avi Kivity	ebff1cfc37	database: make database::_mutation_query_stage inherit the scheduling group Like the preceding patch and for the same reasons, adjust database::_mutation_query_stage to inherit the scheduling group from its caller.	2018-08-24 19:04:49 +03:00
Tomasz Grabiec	024b3c9fd9	mutation_partition: Fix exception safety of row::apply_monotonically() When emplace_back() fails, value is already moved-from into a temporary, which breaks monotonicity expected from apply_monotonically(). As a result, writes to that cell will be lost. The fix is to avoid the temporary by in-place construction of cell_and_hash. To do that, appropriate cell_and_hash constructor was added. Found by mutation_test.cc::test_apply_monotonically_is_monotonic with some modifications to the random mutation generator. Introduced in `99a3e3a`. Fixes #3678. Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>	2018-08-09 15:29:10 +03:00
Tomasz Grabiec	6b1fe6cbe5	mutation_partition: Introduce set_continuity()	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	c6c54021a8	mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc When clustering keys are larger than 12.8 KiB they may get fragmented and key comparator will need to linearize them on comparison. This may cause lookups in the rows tree to fail with bad_alloc. Partition version merging (mutation_partition::apply_monotonically()) was not taking this into account. If we fail on lookup, the partition which is being applied may be incorrectly left with the clustering range since the beginning up to the current row marked as continuous, if the current row has the continuity flag set, because we've moved all of the preceding rows into the target, and the correct lower bound row is no longer there in the source. This may mark some discontinuous ranges as continuous. Merging is retried by allocating_section, and there will be no problem if it eventually succeeds, original continuity will be reflected in the sum. The problem will persist if it doesn't eventually succeed, when we're really out of memory. To protect against this, we could reset the continuity flag of the current row in the source when exiting on exception. Fixes #3583	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	de5c52f422	mutation_partition: Preserve continuity in case row merging with no tracker throws Example: p: row{key=A, cont=0} row{key=C, cont=1} this: row{key=C, cont=0} When we get to processing key=C, key=A was already moved to this, so p has stale continuity on key=C, which marks (-inf,C) as continuous, whereas it should mark only (A, C). That's not a problem if merging succeeds, but if exception happens at this point, we will violate the invariant which says that the sum of p and this should yield the same logical partition. It wouldn't because continuity of the sum is calculated as a set union, and (-inf, A) would be incorrectly turned into a continuous range. This is not a problem currently because continuity is always full when there is no tracker (memtables), so won't change anyway, and when there is a tracker (cache) we never merge but overwrite instead, so there is no memory allocation and thus no possibility for failure. But better be safe.	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	567da3e063	memtable, cache: Fix exception safety of partition entry insertions boost::intrusive::set::insert() may throw if keys require linearization and that fails, in which case we will leak the entry. When this happens in cache, we will also violate the invariant for entry eviction, which assumes all tracked entries are linked, and cause a SEGFAULT. Use the non-throwing and faster insert_before() instead. Where we can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure that entry is deallocated on insert failure. Fixes #3585.	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	3f509ee3a2	mutation_partition: Fix exception-safety of row copy constructor In case population of the vector throws, the vector object would not be destroyed. It's a managed object, so in addition to causing a leak, it would corrupt memory if later moved by the LSA, because it would try to fixup forward references to itself. Caused sporadic failures and crashes of row_cache_test, especially with allocation failure injector enabled. Introduced in `27014a23d7`. Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>	2018-07-17 13:21:21 +01:00
Tomasz Grabiec	074be4d4e8	memtable, cache: Run mutation_cleaner worker in its own scheduling group The worker is responsible for merging MVCC snapshots, which is similar to merging sstables, but in memory. The new scheduling group will be therefore called "memory compaction". We should run it in a separate scheduling group instead of main/memtables, so that it doesn't disrupt writes and other system activities. It's also nice for monitoring how much CPU time we spend on this.	2018-06-27 21:51:04 +02:00
Tomasz Grabiec	6c6ffaee71	mutation_cleaner: Make merge() redirect old instance to the new one If memtable snapshot goes away after memtable started merging to cache, it would enqueue the snapshots for cleaning on the memtable's cleaner, which will have to clean without deferring when the memtable is destroyed. That may stall the reactor. To avoid this, make merge() cause the old instance of the cleaner to redirect to the new instance (owned by cache), like we do for regions. This way the snapshots mentioned earlier can be cleaned after memtable is destroyed, gracefully.	2018-06-27 21:51:04 +02:00
Tomasz Grabiec	c26a304fbb	mvcc: Merge partition versions gradually in the background When snapshots go away, typically when the last reader is destroyed, we used to merge adjacent versions atomically. This could induce reactor stalls if partitions were large. This is especially true for versions created on cache update from memtables. The solution is to allow this process to be preempted and move to the background. mutation_cleaner keeps a linked list of such unmerged snapshots and has a worker fiber which merges them incrementally and asynchronously with regards to reads. This reduces scheduling latency spikes in tests/perf_row_cache_update for the case of large partition with many rows. For -c1 -m1G I saw them dropping from 23ms to 2ms.	2018-06-27 12:48:30 +02:00
Tomasz Grabiec	4d3cc2867a	mutation_partition: Make merging preemptable	2018-06-27 12:48:30 +02:00
Tomasz Grabiec	cd7c7ac40f	mutation_partition: Make do_compact() respect range tombstone merging rules It compares only timestamps, but it should use intrinsic ordering of the tombstone, which takes deletion time into consideration as well. If we have two range tombstones with the same timestamp but different deletion time (odd case, but still), then the one with the higher deletion time should win. That's what all other parts of the system use to resolve merges, in particular range_tombstone_list and compact_mutation_state (the fragment stream compactor). Not respecting this ordering violates the following equality: do_compact(do_compact(m1) + m2) == do_compact(m1 + m2) which may result in some clustered rows being missing in the right-hand side, but not in the left-hand side, due to differences in range tombstones. This impacts only tests currently. Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>	2018-06-11 10:05:52 +01:00
Paweł Dziepak	a040d37cd5	atomic_cell: switch to new IMR-based cell representation This patch changes the implementation of atomic_cell and atomic_cell_or_collection to use the data::cell implementation which is based on the new in-memory representation infrastructure.	2018-05-31 15:51:11 +01:00
Paweł Dziepak	9bb1f10bb6	treewide: require type for comparing cells	2018-05-31 15:51:11 +01:00
Paweł Dziepak	aa25f0844f	atomic_cell: introduce fragmented buffer value interface As a preparation for the switch to the new cell representation this patch changes the type returned by atomic_cell_view::value() to one that requires explicit linearisation of the cell value. Even though the value is still implicitly linearised (and only when managed by the LSA) the new interface is the same as the target one so that no more changes to its users will be needed.	2018-05-31 15:51:11 +01:00
Paweł Dziepak	ec9d166a4f	treewide: require type to compute cell memory usage	2018-05-31 15:51:11 +01:00
Paweł Dziepak	418c159057	treewide: require type to copy atomic_cell	2018-05-31 15:51:11 +01:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Paweł Dziepak	93130e80fb	atomic_cell: require column_definition for creating atomic_cell views	2018-05-31 15:51:11 +01:00
Tomasz Grabiec	82e8217ba0	mutation_partition: Reduce row lookups in apply_monotonically() This change speeds up merging of partition versions with many rows in case the merged version has many rows which fall between existing rows in the target version. This is often the case for time-series workloads, which insert rows at the front. Lookup can be avoided for all but the first row in the stride because we already have a reference to the successor in the target tree, we only need to check that the current entry in the target tree is still the successor. This change greatly reduces amount of lookups per row during version merging of large partitions in time-series workloads.	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	81d231f35b	mvcc: Remove rows from tracker gently Some partitions may have a lot of rows. Better to iterate over them incrementally as part of clear_gently() to avoid stalls.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	e0803ff71e	Introduce mutation_cleaner Used for collecting unused partition_version objects and freeing them incrementally. Will be used for both cache and memtables.	2018-05-30 14:41:39 +02:00
Tomasz Grabiec	40cc766cf2	database: Add API for incremental clearing of partition entries Partitions can get very large. Destroying them all at once can stall the reactor for significant amount of time. We want to avoid that by doing destruction incrementally, deferring in between. A new API is added for that at various levels: stop_iteration clear_gently() noexcept; It returns stop_iteration::yes when the object is fully cleared and can be now destroyed quickly. So a deferring destruction can look like this: return repeat([this] { return clear_gently(); }); The reason why clear_gently() doesn't return a future<> itself is that some contexts cannot defer, like memory reclamation.	2018-05-30 12:18:56 +02:00
Duarte Nunes	eed09dfdf9	mutation_partition: Throw std::out_of_range with backtrace on cell_at Makes it easier to investigate bugs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180521133753.16375-1-duarte@scylladb.com>	2018-05-23 13:51:54 +03:00
Paweł Dziepak	05c94bc98d	mutation_partition: do not dereference null in find_cell() row::find_cell() may be called for cells that do not exist in that row. In such case nullptr shall be returned, this patch makes sure that it is not dereferenced. Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>	2018-05-22 10:31:09 +01:00
Paweł Dziepak	33dffd5fb6	row: add clear_hash() Needed to measure the performance of hashing a cell.	2018-05-09 16:52:26 +01:00
Paweł Dziepak	00509913fc	mutation_partition: enable ADL for cell swap Calling fully qualified std::swap() prohibits the cell objects from using their own swap implementations. This patch invokes std::swap in the usual ADL-friendly way.	2018-05-09 16:52:26 +01:00
Paweł Dziepak	a2b5779714	counters: drop revertability of apply() Since `4cfcd8055e` 'Merge "Drop reversible apply() from mutation_partition" from Tomasz' it is no longer required for apply() to be revertable.	2018-05-09 16:52:26 +01:00
Vladimir Krivopalov	ed62b9a667	Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-26 13:27:42 -07:00
Duarte Nunes	c8baba4e3a	mutation_partition: Clarify comment about emptiness empty() doesn't distinguish between live and dead data, so clarify that in its comment. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-23 09:32:03 +01:00
Duarte Nunes	67dac67c46	mutation_partition: Regular base column in view determines row liveness When views contain a primary key column that is not part of the base table primary key, that column determines whether the row is live or not. We need to ensure that when that cell is dead, and thus the derived row marker, either by normal deletion or by TTL, so is the rest of the row. This patch introduces the idea of shadowing row marker. We map the status of the regular base column in the view's PK to the view row's marker. If this marker is dead, so is that cell in the base table, and so should the view row become. To enforce that, a view row's dead marker shadows the whole row if that view includes a base regular column in its PK. Fixes #3360 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-23 09:32:02 +01:00
Duarte Nunes	b0cb5480d5	mutation_fragment: Allow querying if row is live For clustering_row and static_row, allow querying whether they are live or not. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-23 09:32:02 +01:00
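
The ADL point made in commit 00509913fc can be illustrated with a minimal, standalone sketch. The `demo::cell` type, its counter, and both helper functions are hypothetical names invented for this illustration; they are not taken from the Scylla source.

```cpp
#include <cassert>
#include <utility>

namespace demo {
// Hypothetical cell type with its own swap, found only through ADL.
struct cell {
    int value = 0;
    static inline int custom_swaps = 0;  // counts calls to the custom swap

    friend void swap(cell& a, cell& b) noexcept {
        ++cell::custom_swaps;
        std::swap(a.value, b.value);
    }
};
}  // namespace demo

// Fully qualified call: always picks the generic std::swap,
// so the type's own swap is never used.
template <typename T>
void swap_qualified(T& a, T& b) {
    std::swap(a, b);
}

// ADL-friendly call: std::swap is only a fallback; a swap declared
// alongside T is preferred when one exists.
template <typename T>
void swap_adl(T& a, T& b) {
    using std::swap;
    swap(a, b);
}
```

Both helpers exchange the values, but only `swap_adl` reaches `demo::swap`, which is the behaviour the commit message describes as "the usual ADL-friendly way".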

280 Commits
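
The incremental clearing pattern described in commit 40cc766cf2 can be sketched without Seastar. This is a rough illustration only: the `stop_iteration` enum and `repeat_until_done()` are plain stand-ins for `seastar::stop_iteration` and `repeat()`, and `partition_entries` with its batch size is a hypothetical container, not Scylla's actual type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for seastar::stop_iteration (assumption: a plain enum suffices here).
enum class stop_iteration { no, yes };

// Hypothetical container that releases its elements a few at a time.
class partition_entries {
    std::vector<int> _rows;
    static constexpr std::size_t batch_size = 4;  // illustrative value
public:
    explicit partition_entries(std::size_t n) : _rows(n, 0) {}

    // Frees at most batch_size rows per call. Returns yes once the object
    // is empty and can be destroyed quickly.
    stop_iteration clear_gently() noexcept {
        std::size_t n = std::min(batch_size, _rows.size());
        for (std::size_t i = 0; i < n; ++i) {
            _rows.pop_back();
        }
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }

    std::size_t size() const noexcept { return _rows.size(); }
};

// Synchronous stand-in for repeat(): calls step until it reports completion
// and returns how many calls were made. A real reactor would be free to run
// other tasks between the calls.
template <typename Func>
std::size_t repeat_until_done(Func&& step) {
    std::size_t calls = 0;
    do {
        ++calls;
    } while (step() == stop_iteration::no);
    return calls;
}
```

Because `clear_gently()` returns a plain value rather than a future, the same method can also be called from contexts that cannot defer, which is the reason the commit message gives for the signature.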