scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Michael Litvak	43c76aaf2b	logstor: split log record to header and data Split the `log_record` to `log_record_header` type that has the record metadata fields and the mutation as a separate field which is the actual record data: struct log_record { log_record_header header; canonical_mutation mut; }; Both the header and mutation have variable serialized size. When a record is serialized in a write_buffer, we first put a small `record_header` that has the header size and data size, then the serialized header and data follow. The `log_location` of a record points to the beginning of the `record_header`, and the size includes the `record_header`. This allows us to read a record header without reading the data when it's not needed and avoid deserializing it: * on recovery, when scanning all segments, we read only the record headers. * on compaction, we read the record header first to determine if the record is alive, if yes then we read the data. Closes scylladb/scylladb#29457	2026-04-16 10:00:35 +03:00
Raphael S. Carvalho	1529605b32	logstor: Fix dangling reference captures and shadowed loc variable Three bugs fixed in segment_manager.cc: 1. write_to_separator(): captured [&index] where index was a local coroutine-frame reference. The future is stored in buf.pending_updates and resolved later in flush_separator_buffer(), by which time the enclosing coroutine frame is destroyed, making &index a dangling pointer. This is a use-after-free that manifests as a segfault. Fix: capture index_ptr (raw pointer by value) instead. 2. add_segment_to_compaction_group(): same dangling [&index] pattern inside the for_each_live_record lambda during recovery. Same fix applied. 3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer 'log_location loc', causing the function to always return a zero-initialized log_location{}. Fix: remove 'auto' so the assignment targets the outer variable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29451	2026-04-15 14:40:15 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Michael Litvak	35547bfb6e	test: logstor: additional logstor tests	2026-03-31 18:45:08 +02:00
Michael Litvak	39baa573d2	logstor: add version and crc to buffer header add basic crc and validation to the buffer header. add also a version field that indicates the version of the on-disk format.	2026-03-31 18:45:08 +02:00
Michael Litvak	78426ae31b	logstor: add take_logstor_snapshot add the function table::take_logstor_snapshot that is similar to take_storage_snapshot for sstables. given a token range, for each storage group in the range, it flushes the separator buffers and then makes a snapshot of all segments in the sg's compaction groups while disabling compaction. the segment snapshot holds a reference to the segment so that it won't be freed by compaction, and it provides an input stream for reading the segment. this will be used for tablet migration to stream the segments.	2026-03-31 18:45:08 +02:00
Michael Litvak	754c1b83bd	logstor: segment input/output stream add functions for creating segment input and output streams, that will be used for segment streaming. the segment input stream creates a file input stream that reads a given segment. the segment output stream allocates a new local segment and creates an output stream that writes to the segment, and when closed it loads the segment and adds it to the compaction group.	2026-03-31 18:45:08 +02:00
Michael Litvak	17cab4181b	logstor: implement compaction_group::cleanup implement compaction group cleanup by clearing the range in the index and discarding the segments of the compaction group. segments are discarded by overwriting the segment header to indicate the segment is empty while preserving the segment generation number in order to not resurrect old data in the segment.	2026-03-31 18:45:08 +02:00
Michael Litvak	9fd6dace72	logstor: tablet split implement tablet split for logstor. flush the separator and then perform split as a new type of compaction: take a batch of segments from the source compaction group, read them and write all live records into left/right write buffers according to the split classifier, flush them to the compaction group, and free the old segments. segments that fit in a single target compaction group are removed from the source and added to the correct target group.	2026-03-31 18:45:08 +02:00
Michael Litvak	5de39afc24	logstor: tablet merge implement tablet merge with logstor. disable compaction for the new compaction group, then merge the merging compaction groups by merging their logstor segments set into the new cg - simply merging the segment histogram.	2026-03-31 18:40:57 +02:00
Michael Litvak	684ce8de71	logstor: add compaction reenabler add a function that stops and disables compaction for a compaction group and returns a compaction reenabler object, similarly to the normal compaction manager. this will be useful for disabling compaction while doing operations on the compaction group's logstor segment set.	2026-03-31 18:40:56 +02:00
Michael Litvak	1d7c2e4f52	logstor: add segment header we have two types of segments. the active segment is "mixed" because we can write to it multiple write_buffers, each write buffer having records from different tables and tablets. in contrast, the separator and compaction write "full" segments - they write a single write_buffer that has records from a single tablet and storage group. for "full" segments, we add a segment header that contains additional useful metadata such as the table and token range in the segment. the write buffer header contains the type of the buffer, mixed or full. if it's full then it has a segment header placed after the write buffer header.	2026-03-31 18:40:56 +02:00
Michael Litvak	8615f68657	logstor: serialize writes to active segment previously when writing to the active segment, the allocation was serialized but multiple writes could proceed concurrently to different offsets. change it instead to serialize the entire write. we prefer to write larger buffers sequentially instead of multiple buffers concurrently. it is also better that we don't have "holes" in the segment. we also change the buffered_writer to send a single flushing buffer at a time. it has a ring of buffers, new writes are written to the head buffer, and a single consumer flushes the tail buffer.	2026-03-31 18:40:56 +02:00
Michael Litvak	e791823874	replica: extend compaction_group functions for logstor extend compaction_group functions such as disk size calculation and empty() to account also for the logstor segments that the compaction group owns. reuse the sstable_add_gate when there is a write in process to a compaction group, in order for the compaction group to be considered not empty.	2026-03-31 18:40:56 +02:00
Michael Litvak	d3db967802	replica: add compaction_group_for_logstor_segment add the function table::compaction_group_for_logstor_segment that we use when recovering a segment to find the compaction group for a segment based on its token range, similarly to compaction_group_for_sstable for sstables. extract the common logic from compaction_group_for_sstable to a common function compaction_group_for_token_range that finds a compaction group for a token range.	2026-03-31 18:40:56 +02:00
Michael Litvak	bf7bc5b410	logstor: code cleanup misc code cleanup and small changes	2026-03-31 18:40:56 +02:00
Piotr Dulikowski	60fb5270a9	logstor: fix fmt::format use with std::filesystem::path The version of fmt installed on my machine refuses to work with `std::filesystem::path` directly. Add `.string()` calls in places that attempt to print paths directly in order to make them work. Closes scylladb/scylladb#29148	2026-03-23 15:15:52 +01:00
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However it's possible for example that one compaction group is filled slower than others and holds many segments. To fix this we trigger a separator flush periodically for separator buffers that hold old segments. We track the active segment sequence number and for each separator buffer the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
Michael Litvak	a0da07e5b7	logstor: recover segments into compaction groups Fix the logstor recovery to work with compaction groups. When recovering a segment find its token range and add it to the appropriate compaction groups. if it doesn't fit in a single compaction group then write each record to its compaction group's separator buffer.	2026-03-18 19:24:28 +01:00
Michael Litvak	24379acc76	logstor: range read extend the logstor mutation reader to support range read	2026-03-18 19:24:28 +01:00
Michael Litvak	a9d0211a64	logstor: change index to btree by token per table Change the primary index to be a btree that is ordered by token, similarly to a memtable, and create an index per table instead of a single global index.	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Michael Litvak	65cd0b5639	logstor: track memory usage add logstor::get_memory_usage() that returns an estimate of the memory usage by logstor. add tracking to how many unique keys are held in the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	b7bdb1010a	logstor: logstor stats api add api to get logstor statistics about segments for a table	2026-03-18 19:24:27 +01:00
Michael Litvak	8bd3bd7e2a	logstor: compaction buffer pool pre-allocate write buffers for compaction	2026-03-18 19:24:27 +01:00
Michael Litvak	caf5aa47c2	logstor: separator: flush buffer when full flush separator buffers when they become full and switched instead of aggregating all the buffers and flushing them when the separator is switched.	2026-03-18 19:24:27 +01:00
Michael Litvak	6ddb7a4d13	logstor: hold segment until index updates add a write gate to write_buffer. when writing a record to the write buffer, the gate is held and passed back to the caller, and the caller holds the gate until the write operation is complete, including follow-up operations such as updating the index after the write. in particular, when writing a mutation in logstor::write, the write buffer is held open until the write is completed and updated in the index. when writing the write buffer to the active segment, we write the buffer and then wait for the write buffer gate to close, i.e. we wait for all index updates to complete before proceeding. the segment is held open until all the write operations and index updates are complete. this property is useful for correctness: when a segment is closed we know that all the writes to it are updated in the index. this is needed in compaction for example, where we take closed segments and check which records in them are alive by looking them up in the index. if the index is not updated yet then it will be wrong.	2026-03-18 19:24:27 +01:00
Michael Litvak	bd66edee5c	logstor: truncate table implement freeing all segments of a table for table truncate. first do barrier to flush all active and mixed segments and put all the table's data in compaction groups, then stop compaction for the table, then free the table's segments and remove the live entries from the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	489efca47c	logstor: enable/disable compaction per table add functions to enable or disable compaction for a specific compaction group or for all compaction groups of a table.	2026-03-18 19:24:27 +01:00
Michael Litvak	21db4f3ed8	logstor: separator buffer pool pre-allocate write buffers for the separator	2026-03-18 19:24:27 +01:00
Michael Litvak	31aefdc07d	logstor: segment and separator barrier add barrier operation that forces switch of the active segment and separator, and waits for all existing segments to close and all separators to flush.	2026-03-18 19:24:27 +01:00
Michael Litvak	1231fafb46	logstor: separator debt controller add tracking of the total separator debt - writes that were written to a separator and waiting to be flushed, and add flow control to keep the debt in control by delaying normal writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	17cb173e18	logstor: compaction controller adjust compaction shares by the compaction overhead: how many segments compaction writes to generate a single free segment for new writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	1da1bb9d99	logstor: recovery: recover mixed segments using separator on recovery we may find mixed segments. recover them by adding them to a separator, reading all their records and writing them to the separator, and flush the separator.	2026-03-18 19:24:27 +01:00
Michael Litvak	b78cc787a6	logstor: wait for pending reads in compaction we free a segment from compaction after updating all live records in the segment to point to new locations in the index. we need to ensure there are no running operations that use the old locations before we free the segment.	2026-03-18 19:24:27 +01:00
Michael Litvak	600ec82bec	logstor: separator initial implementation of the separator. it replaces "mixed" segments - segments that have records from different groups - with segments by group. every write is written to the active segment and to a buffer in the active separator. the active separator has in-memory buffers by group. at some threshold number of segments we switch the active segment and separator atomically, and start flushing the separator. the separator is flushed by writing the buffers into new non-mixed segments, adding them to a compaction group, and freeing the mixed segments.	2026-03-18 19:24:27 +01:00
Michael Litvak	009fc3757a	logstor: compaction groups divide the segments in the compaction manager into compaction groups. compaction will compact only segments from a single compaction group at a time.	2026-03-18 19:24:27 +01:00
Michael Litvak	b3293f8579	logstor: cache files for read keep all files for all segments open for read to improve reads.	2026-03-18 19:24:26 +01:00
Michael Litvak	5a16980845	logstor: recovery: initial initial and basic recovery implementation. * find all files, read their segments and populate the index with the newest record for each key. * find which segments are used and build the usage histogram	2026-03-18 19:24:26 +01:00
Michael Litvak	bc9fc96579	logstor: add segment generation add segment generation number that is incremented when the segment is reused, and it's written to every buffer that is written to the segment. this is useful for recovery.	2026-03-18 19:24:26 +01:00
Michael Litvak	719f7cca57	logstor: reserve segments for compaction reserve segments for compaction so it always has enough segments to run and doesn't get stuck. do the compaction writes into full new segments instead of the active segment.	2026-03-18 19:24:26 +01:00
Michael Litvak	521fca5c92	logstor: index: buckets divide the primary index to buckets, each bucket containing a btree. the bucket is determined by using bits from the key hash.	2026-03-18 19:24:26 +01:00
Michael Litvak	99c3b1998a	logstor: add buffer header add a buffer header in each write buffer we write that contains some information that can be useful for recovery and reading.	2026-03-18 19:24:26 +01:00
Michael Litvak	ddd72a16b0	logstor: add group_id add group_id value to each log record that is passed with the mutation when writing it. the group_id will be used to group log records in segments, such that a segment will contain records only from a single group. this will be useful for tablet migration. we want for each tablet to have their own segments with all their records, so we can migrate them efficiently by copying these segments. the group_id value is set to a value equivalent to the tablet id.	2026-03-18 19:24:26 +01:00
Michael Litvak	08bea860ef	logstor: record generation add a record generation number for each record so we can compare records and find which one is newer.	2026-03-18 19:24:26 +01:00
Michael Litvak	28f820eb1c	logstor: generation utility basic utility for generation numbers that will be useful next. a generation number is an unsigned integer that can be incremented and compared even if it wraps around, assuming the values we compare were written around the same time.	2026-03-18 19:24:26 +01:00
Michael Litvak	5f649dd39f	logstor: use RIPEMD-160 for index key use a 20-byte hash function for the index key to make hash collisions very unlikely. we assume there are no hash collisions.	2026-03-18 19:24:26 +01:00
Michael Litvak	2128b1b15c	replica: add logstor to db Add a single logstor instance in the database that is used for writing and reading to tables with kv storage	2026-03-18 19:24:26 +01:00
Michael Litvak	0b1343747f	logstor: initial commit initial implementation of the logstor storage engine for key-value tables that supports writes, reads and basic compaction. main components: * logstor: this is the main interface to users that supports writing and reading back mutations, and manages the internal components. * index: the primary index in-memory that maps a key to a location on disk. * write buffer: writes go initially to a write buffer. it accumulates multiple records in a buffer and writes them to the segment manager in 4k sized blocks. * segment manager: manages the storage - files, segments, compaction. it manages file and segment allocation, and writes 4k aligned buffers to the active segment sequentially. it tracks the used space in each segment. the compaction finds segments with low space usage and writes them to new segments, and frees the old segments.	2026-03-18 19:24:26 +01:00
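
The "generation utility" commit (28f820eb1c) describes an unsigned counter that can still be compared after it wraps. A minimal sketch of one common way to do that, serial-number style comparison, is shown here. The type name `generation`, the 16-bit width, and the helper `is_newer` are assumptions made for illustration and are not taken from the logstor sources.

```cpp
#include <cstdint>

// Hypothetical generation type: the name, the 16-bit width and the helpers
// below are illustrative assumptions, not logstor's actual definitions.
struct generation {
    std::uint16_t value = 0;

    // Unsigned arithmetic wraps, so 65535 is followed by 0.
    generation next() const {
        return generation{static_cast<std::uint16_t>(value + 1)};
    }
};

// Returns true if `a` is newer than `b`, assuming the two values were
// produced fewer than 32768 increments apart ("around the same time").
inline bool is_newer(generation a, generation b) {
    // Take the difference modulo 2^16, then reinterpret it as signed:
    // a small positive result means `a` is ahead of `b`.
    const auto diff = static_cast<std::uint16_t>(a.value - b.value);
    return static_cast<std::int16_t>(diff) > 0;
}
```

Under these assumptions, `is_newer(generation{1}, generation{65535})` is true because the counter wrapped between the two writes, while values more than half the range apart cannot be ordered reliably.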

49 Commits
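
The newest commit (43c76aaf2b) describes a record layout in which a small `record_header` carrying the header size and data size precedes the serialized header and data, so a reader can fetch the header without touching the data. The sketch below illustrates that idea only. The field names, the 32-bit widths, the use of `std::string` as a stand-in for the serialized `log_record_header` and mutation, and all function names are assumptions, not the actual logstor code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-size prefix; field names and widths are assumptions.
struct record_header {
    std::uint32_t header_size;
    std::uint32_t data_size;
};

// Appends one record and returns the offset of its record_header,
// which plays the role of the record's location.
std::size_t append_record(std::vector<char>& buf,
                          const std::string& header,
                          const std::string& data) {
    const std::size_t loc = buf.size();
    const record_header rh{static_cast<std::uint32_t>(header.size()),
                           static_cast<std::uint32_t>(data.size())};
    const char* p = reinterpret_cast<const char*>(&rh);
    buf.insert(buf.end(), p, p + sizeof(rh));
    buf.insert(buf.end(), header.begin(), header.end());
    buf.insert(buf.end(), data.begin(), data.end());
    return loc;
}

// Reads the fixed-size prefix stored at `loc`.
record_header read_prefix(const std::vector<char>& buf, std::size_t loc) {
    record_header rh;
    std::memcpy(&rh, buf.data() + loc, sizeof(rh));
    return rh;
}

// Reads only the serialized header; the data bytes are never touched.
std::string read_header_only(const std::vector<char>& buf, std::size_t loc) {
    const record_header rh = read_prefix(buf, loc);
    return std::string(buf.data() + loc + sizeof(rh), rh.header_size);
}

// Total record size, including the record_header prefix; adding it to
// `loc` gives the offset of the next record when scanning.
std::size_t record_size(const std::vector<char>& buf, std::size_t loc) {
    const record_header rh = read_prefix(buf, loc);
    return sizeof(rh) + rh.header_size + rh.data_size;
}
```

A recovery-style scan in this sketch would call `read_header_only` at each location and advance by `record_size`, and a compaction-style pass would read the data only after the header shows the record is still alive.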