scylladb

Author	SHA1	Message	Date
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Tomasz Grabiec	d0c367f44f	mvcc: partition_snapshot: Support slicing range tombstones in reverse	2021-12-19 22:41:35 +01:00
Pavel Emelyanov	5515f7187d	range_tombstone, code: Add range_tombstone& getters Currently all the code operates on the range_tombstone class, and many of those places get the range tombstone in question from the range_tombstone_list. Next patches will make that list carry (and return) some new object called range_tombstone_entry, so all the code that expects to see the former one there will need to be patched to get the range_tombstone from the _entry one. This patch prepares the ground for that by introducing the range_tombstone& tombstone() { return *this; } getter on the range_tombstone itself and patching all future users of the _entry to call .tombstone() right now. Next patch will remove those getters together with adding the new range_tombstone_entry object thus automatically converting all the patched places into using the entry in a proper way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Benny Halevy	4439e5c132	everywhere: cleanup defer.hh includes Get rid of unused includes of seastar/util/{defer,closeable}.hh and add a few that are missing from source files. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:39 +03:00
Pavel Emelyanov	b3c89787be	mutation_partition: Return immutable collection for range tombstones Patch the .row_tombstones() to return the range_tombstone_list wrapped into the immutable_collection<> so that callers are guaranteed not to touch the collection itself, but still can modify the tombstones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	1bf643d4fd	mutation_partition: Pin mutable access to range tombstones Some callers of mutation_partition::row_tombstones() don't want to (and shouldn't) modify the list itself, while they may want to modify the tombstones. This patch explicitly locates those that need to modify the collection, because the next patch will return immutable collection for the others. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Tomasz Grabiec	2d18360157	row_cache: Consume range tombstones incrementally Before the patch, all range tombstones up to the next row were copied into a vector, and then put into the buffer until it's full. This would get quadratic if there are many more range tombstones than fit in a buffer. The fix is to avoid the accumulation of all tombstones in the vector and invoke the callback instead, which stops the iteration as soon as the buffer is full. Fixes #2581.	2021-07-26 17:48:05 +02:00
Piotr Sarna	e9d26dd7ed	utils/coroutine: wrap a helper in utils namespace The class name `coroutine` became problematic since seastar introduced it as a namespace for coroutine helpers. To avoid a clash, the class from scylla is wrapped in a separate namespace. Without this patch, Seastar submodule update fails to compile. Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>	2021-07-22 13:28:43 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	8709844566	misc: fix indentation The patch fixes indentation issues introduced in previous patches related to removing `with_linearized_managed_bytes` uses from the code tree. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	e04eb68a9c	treewide: remove remaining `with_linearized_managed_bytes` uses There is no point in calling the wrapper since linearization code is private in `managed_bytes` class and there is no one to call `managed_bytes::data` because it was deleted recently. This patch is a prerequisite for removing `with_linearized_managed_bytes` function completely, alongside with the corresponding parts of implementation in `managed_bytes`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Calle Wilund	4b65d67a1a	partition_version: Change range_tombstones() to return chunked_vector Refs #7364 The number of tombstones can be large. As a stopgap measure to just returning a source range (with keepalive), we can at least alleviate the problem by using a chunked vector. Closes #7433	2020-10-26 11:54:42 +02:00
Pavel Emelyanov	86897aa040	partition_version: Remove dead code The rows_iterator is no longer in use since `70c72773` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200831191208.18418-1-xemul@scylladb.com>	2020-09-01 10:19:47 +03:00
Botond Dénes	5e9a7d2608	row_cache: remove unnecessary includes of partition_snapshot_reader.hh Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200820124447.2561477-1-bdenes@scylladb.com>	2020-08-20 15:19:42 +02:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Avi Kivity	adb64dc72f	treewide: tighten concepts syntax gcc 10 requires a semicolon after every compound requirement, as per the standard. Add missing semicolons where necessary. Message-Id: <20200129205805.20928-1-avi@scylladb.com>	2020-01-30 14:10:18 +02:00
Piotr Dulikowski	59fbbb993f	memtables: add partition/row hit/miss counters Adds per-table metrics for counting partition and row reuse in memtables. New metrics are as follows: - memtable_partition_writes - number of write operations performed on partitions in memtables, - memtable_partition_hits - number of write operations performed on partitions that previously existed in a memtable, - memtable_row_writes - number of row write operations performed in memtables, - memtable_row_hits - number of row write operations that overwrote rows previously present in a memtable. Tests: unit(release)	2019-11-12 13:35:41 +01:00
Nadav Har'El	51fc6c7a8e	make static_row optional to reduce memory footprint Merged patch series from Avi Kivity: The static row can be rare: many tables don't have them, and tables that do will often have mutations without them (if the static row is rarely updated, it may be present in the cache and in readers, but absent in memtable mutations). However, it always consumes ~100 bytes of memory, even if it is not present, due to row's overhead. Change it to be optional by allocating it as an external object rather than inlined into mutation_partition. This adds overhead when the static row is present (17 bytes for the reference, back reference, and lsa allocator overhead). perf_simple_query appears to be marginally (2%) faster. Footprint is reduced by ~9% for a cache entry, 12% in memtables. More details are provided in the patch commitlog. Tests: unit (debug) Avi Kivity (4): managed_ref: add get() accessor managed_ref: add external_memory_usage() mutation_partition: introduce lazy_row mutation_partition: make static_row optional to reduce memory footprint cell_locking.hh \| 2 +- converting_mutation_partition_applier.hh \| 4 +- mutation_partition.hh \| 284 ++++++++++++++++++++++- partition_builder.hh \| 4 +- utils/managed_ref.hh \| 12 + flat_mutation_reader.cc \| 2 +- memtable.cc \| 2 +- mutation_partition.cc \| 45 +++- mutation_partition_serializer.cc \| 2 +- partition_version.cc \| 4 +- tests/multishard_mutation_query_test.cc \| 2 +- tests/mutation_source_test.cc \| 2 +- tests/mutation_test.cc \| 12 +- tests/sstable_mutation_test.cc \| 10 +- 14 files changed, 355 insertions(+), 32 deletions(-)	2019-10-22 12:25:15 +03:00
Avi Kivity	acc433b286	mutation_partition: make static_row optional to reduce memory footprint The static row can be rare: many tables don't have them, and tables that do will often have mutations without them (if the static row is rarely updated, it may be present in the cache and in readers, but absent in memtable mutations). However, it always consumes ~100 bytes of memory, even if it is not present, due to row's overhead. Change it to be optional by using lazy_row instead of row. Some call sites treewide were adjusted to deal with the extra indirection. perf_simple_query appears to improve by 2%, from 163krps to 165 krps, though it's hard to be sure due to noisy measurements. memory_footprint comparisons (before/after): mutation footprint: mutation footprint: - in cache: 1096 - in cache: 992 - in memtable: 854 - in memtable: 750 - in sstable: 351 - in sstable: 351 - frozen: 540 - frozen: 540 - canonical: 827 - canonical: 827 - query result: 342 - query result: 342 sizeof(cache_entry) = 112 sizeof(cache_entry) = 112 -- sizeof(decorated_key) = 36 -- sizeof(decorated_key) = 36 -- sizeof(cache_link_type) = 32 -- sizeof(cache_link_type) = 32 -- sizeof(mutation_partition) = 200 -- sizeof(mutation_partition) = 96 -- -- sizeof(_static_row) = 112 -- -- sizeof(_static_row) = 8 -- -- sizeof(_rows) = 24 -- -- sizeof(_rows) = 24 -- -- sizeof(_row_tombstones) = 40 -- -- sizeof(_row_tombstones) = 40 sizeof(rows_entry) = 232 sizeof(rows_entry) = 232 sizeof(lru_link_type) = 16 sizeof(lru_link_type) = 16 sizeof(deletable_row) = 168 sizeof(deletable_row) = 168 sizeof(row) = 112 sizeof(row) = 112 sizeof(atomic_cell_or_collection) = 8 sizeof(atomic_cell_or_collection) = 8 Tests: unit (dev)	2019-10-15 15:42:05 +03:00
Tomasz Grabiec	e6afc89735	row_cache: Record upgraded schema in memtable entries during update Cache update may defer in the middle of moving of partition entry from a flushed memtable to the cache. If the schema was changed since the entry was written, it upgrades the schema of the partition_entry first but doesn't update the schema_ptr in memtable_entry. The entry is removed from the memtable afterward. If a memtable reader encounters such an entry, it will try to upgrade it assuming it's still at the old schema. That is undefined behavior in general, which may include: - read failures due to bad_alloc, if fixed-size cells are interpreted as variable-sized cells, and we misinterpret a value for a huge size - wrong read results - node crash This doesn't result in a permanent corruption, restarting the node should help. It's the more likely to happen the more rows there are in a partition. It's unlikely to happen with single-row partitions. Introduced in `70c7277`. Fixes #5128.	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	90d6c0b9a2	row_cache, mvcc: Prevent locked snapshots from being evicted If the whole partition entry is evicted while being updated from the memtable, a subsequent read may populate the partition using the old version of data if it attempts to do it before cache update advances past that partition. Partial eviction is not affected because populating reads will notice that there is a newer snapshot corresponding to the updater. This can happen only in OOM situations where the whole cache gets evicted. Affects only tables with multi-row partitions, which are the only ones that can experience the update of partition entry being preempted. Introduced in `70c7277`. Fixes #5134.	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	c88a4e8f47	mvcc: Introduce partition_snapshot::touch()	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	25e2f87a37	row_cache, mvcc: Do not upgrade schema of entries which are being updated When a read enters a partition entry in the cache, it first upgrades it to the current schema of the cache. The same happens when an entry is updated after a memtable flush. Upgrading the entry is currently performed by squashing all versions and replacing them with a single upgraded version. That has a side effect of detaching all snapshots from the partition entry. Partition entry update on memtable flush is writing into a snapshot. If that snapshot is detached by a schema upgrade, the entry will be missing writes from the memtable which fall into continuous ranges in that entry which have not yet been updated. This can happen only if the update of the entry is preempted and the schema was altered during that, and a read hit that partition before the update went past it. Affects only tables with multi-row partitions, which are the only ones that can experience the update of partition entry being preempted. The problem is fixed by locking updated entries and not upgrading schema of locked entries. cache_entry::read() is prepared for this, and will upgrade on-the-fly to the cache's schema. Fixes #5135	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	11440ff792	mvcc: Fix incorrect schema version being used to copy the mutation when applying Currently affects only counter tables. Introduced in `27014a2`. mutation_partition(s, mp) is incorrect, because it uses s to interpret mp, while it should use mp_schema. We may hit this if the current node has a newer schema than the incoming mutation. This can happen during alter when we receive the mutation from a node which hasn't processed the schema change yet. This is undefined behavior in general. If the alter was adding or removing columns, this may result in corruption of the write where values of one column are inserted into a different column. Fixes #5095.	2019-09-25 11:28:07 +02:00
Tomasz Grabiec	20f5d5d1a1	mvcc: partition_snapshot: Introduce migrate() Snapshots which outlive the memtable will need to have their _region and _cleaner references updated. The snapshot can be destroyed after the memtable when it is queued in the mutation_cleaner.	2018-12-27 18:08:50 +01:00
Paweł Dziepak	637b9a7b3b	atomic_cell_or_collection: make operator<< show cell content After the new in-memory representation of cells was introduced there was a regression in atomic_cell_or_collection::operator<< which stopped printing the content of the cell. This makes debugging more inconvenient and time-consuming. This patch fixes the problem. Schema is propagated to the atomic_cell_or_collection printer and the full content of the cell is printed. Fixes #3571. Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>	2018-10-24 13:29:51 +03:00
Tomasz Grabiec	b464b66e90	row_cache: Fix memtable reads concurrent with cache update missing writes Introduced in `5b59df3761`. It is incorrect to erase entries from the memtable being moved to cache if partition update can be preempted because a later memtable read may create a snapshot in the memtable before memtable writes for that partition are made visible through cache. As a result the read may miss some of the writes which were in the memtable. The code was checking for presence of snapshots when entering the partition, but this condition may change if update is preempted. The fix is to not allow erasing if update is preemptible. This also caused SIGSEGVs because we were assuming that no such snapshots will be created and hence were not invalidating iterators on removal of the entries, which results in undefined behavior when such snapshots are actually created. Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test Fixes #3532 Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>	2018-07-01 15:36:05 +03:00
Tomasz Grabiec	450985dfee	mvcc: Use RAII to ensure that partition versions are merged Before this patch, maybe_merge_versions() had to be manually called before partition snapshot goes away. That is error prone and makes client code more complicated. Delegate that task to a new partition_snapshot_ptr object, through which all snapshots are published now.	2018-06-27 21:51:04 +02:00
Tomasz Grabiec	c26a304fbb	mvcc: Merge partition versions gradually in the background When snapshots go away, typically when the last reader is destroyed, we used to merge adjacent versions atomically. This could induce reactor stalls if partitions were large. This is especially true for versions created on cache update from memtables. The solution is to allow this process to be preempted and move to the background. mutation_cleaner keeps a linked list of such unmerged snapshots and has a worker fiber which merges them incrementally and asynchronously with regards to reads. This reduces scheduling latency spikes in tests/perf_row_cache_update for the case of large partition with many rows. For -c1 -m1G I saw them dropping from 23ms to 2ms.	2018-06-27 12:48:30 +02:00
Tomasz Grabiec	78274276f5	row_cache: Use the memtable cleaner to create memtable snapshot during update Memtable entries should be cleaned using memtable cleaner, which unlike the cache's cleaner is not associated with the cache tracker. It's an error to clean a snapshot using tracker which doesn't own the entries. This will corrupt cache tracker's row counter. Fixes failure of test_exception_safety_of_update_from_memtable from row_cache.cc in debug mode and with allocation failure injection enabled. Introduced in "cache: Defer during partition merging" (`70c72773be`). Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>	2018-06-14 18:03:02 +03:00
Paweł Dziepak	ec9d166a4f	treewide: require type to compute cell memory usage	2018-05-31 15:51:11 +01:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Tomasz Grabiec	5b59df3761	mvcc: Erase rows gradually in apply_to_incomplete() So that we avoid double-buffering partitions.	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	b7fdf4309f	mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	5bc201df10	cache: Release dirty memory with row granularity	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	70c72773be	cache: Defer during partition merging	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	c653137b2b	mvcc: Make apply_to_incomplete() work with attached versions Needed before making it preemptible. We cannot steal the entry since we may need to resume merging later.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	1792be3697	cache: Propagate phase to apply_to_incomplete() It will be needed to create snapshots with appropriate phase markers.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	494cb3f3da	cache: Prepare for incremental apply_to_incomplete() Incremental merging will be implemented by the means of resumable functions, which return stop_iteration::no when not yet finished. We're not using futures, so that the caller can do work around preemption points as well.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	3f19f76c67	mvcc: Destroy memtable partition versions gently Now all snapshots will have a mutation_cleaner which they will use to gently destroy freed partition_version objects. Destruction of memtable entries during cache update is also using the gentle cleaner now. We need to have a separate cleaner for memtable objects even though they're owned by cache's region, because memtable versions must be cleared without a cache_tracker. Each memtable will have its own cleaner, which will be merged with the cache's cleaner when memtable is merged into cache. Fixes some sources of reactor stalls on cache update when there are large partition entries in memtables.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	81d231f35b	mvcc: Remove rows from tracker gently Some partitions may have a lot of rows. Better to iterate over them incrementally as part of clear_gently() to avoid stalls.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	f0c1edd672	cache: Destroy partition versions incrementally Instead of destroying whole partition_versions at once, we will do that gently using mutation_cleaner to avoid reactor stalls. Large deletions could happen when large partition gets invalidated, upgraded to a new schema, or when it's abandoned by a detached snapshot. Refs #3289.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	40cc766cf2	database: Add API for incremental clearing of partition entries Partitions can get very large. Destroying them all at once can stall the reactor for significant amount of time. We want to avoid that by doing destruction incrementally, deferring in between. A new API is added for that at various levels: stop_iteration clear_gently() noexcept; It returns stop_iteration::yes when the object is fully cleared and can be now destroyed quickly. So a deferring destruction can look like this: return repeat([this] { return clear_gently(); }); The reason why clear_gently() doesn't return a future<> itself is that some contexts cannot defer, like memory reclamation.	2018-05-30 12:18:56 +02:00
Tomasz Grabiec	aa1458377c	mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged Fixes a bug in partition_snapshot::merge_partition_versions(), which would not attempt merging if the snapshot is attached to the latest version (in which case _version is nullptr and _entry is != nullptr). This would cause partition_version objects to accumulate if there was an older snapshot and it went away before the latest snapshot. Versions will be removed when the whole entry goes away (flush or eviction). May have caused performance problems. Fixes #3402.	2018-04-30 18:45:32 +02:00
Vladimir Krivopalov	e1ee833861	Always pass mutation_partitions to partition_entry::apply() Previously it was also possible to pass a frozen_mutation to it. Now we de-serialize frozen mutations at the calling side. This is a pre-requisite for collecting memtable statistics needed for writing into the SSTables 3.0 format. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-25 14:58:47 -07:00
Tomasz Grabiec	b9d22584bb	cache: Add row-level stats about cache update from memtable	2018-03-07 16:52:58 +01:00
Tomasz Grabiec	7c34cd04e2	mvcc: Propagate information if insertion happened from ensure_entry_if_complete() It's needed by users to update statistics, different ones depending on if the row already existed or not.	2018-03-07 16:50:55 +01:00
Tomasz Grabiec	da901b93fc	cache: Track number of rows and row invalidations	2018-03-06 11:50:29 +01:00
Tomasz Grabiec	381bf02f55	cache: Evict with row granularity Instead of evicting whole partitions, evicts whole rows. As part of this, invalidation of partition entries was changed to not evict from snapshots right away, but unlink them and let them be evicted by the reclaimer.	2018-03-06 11:50:29 +01:00
Tomasz Grabiec	bee875fa7d	cache: Ensure all evictable partition_versions have a dummy after all rows Every evictable version will have a dummy entry at the end so that it can be tracked in the LRU. It is also needed to allow old versions to stay around (with tombstones and static rows) after all rows are evicted. Such versions must be fully discontinuous, and we need some entry to mark that.	2018-03-06 11:50:27 +01:00
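
Commit 40cc766cf2 above describes an incremental-clearing API, `stop_iteration clear_gently() noexcept`, that is called repeatedly until it reports completion. The pattern might be sketched roughly as follows. This is a minimal standalone illustration, not Scylla code: `big_partition`, its batch size, and `clear_in_steps` are hypothetical names, `stop_iteration` is redefined locally as a stand-in for the Seastar type, and a plain loop stands in for `repeat(...)`, which in the real code would defer to the reactor between calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for seastar::stop_iteration so the sketch compiles on its own.
enum class stop_iteration { no, yes };

// Hypothetical object that would be expensive to destroy in a single step.
class big_partition {
    std::vector<int> _rows;
    std::size_t _batch; // illustrative knob: max rows released per call, must be > 0
public:
    big_partition(std::size_t rows, std::size_t batch) : _rows(rows), _batch(batch) {}

    // Releases at most _batch rows, then reports whether the object is now empty.
    // Returns a plain value rather than a future, matching the commit's rationale
    // that some callers (such as memory reclamation) cannot defer.
    stop_iteration clear_gently() noexcept {
        const std::size_t n = std::min(_batch, _rows.size());
        _rows.resize(_rows.size() - n);
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }

    std::size_t size() const noexcept { return _rows.size(); }
};

// Synchronous stand-in for repeat(): calls clear_gently() until it returns
// stop_iteration::yes and reports how many calls that took.
inline std::size_t clear_in_steps(big_partition& p) {
    std::size_t steps = 0;
    do {
        ++steps;
    } while (p.clear_gently() == stop_iteration::no);
    return steps;
}
```

With 10 rows and a batch of 4, `clear_in_steps` makes three calls (leaving 6, 2, then 0 rows), and each call does a bounded amount of work, which is the property the commit relies on to avoid reactor stalls.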


99 Commits