Commit Graph

49 Commits

Author SHA1 Message Date
Michael Litvak
43c76aaf2b logstor: split log record to header and data
Split the `log_record` to `log_record_header` type that has the record
metadata fields and the mutation as a separate field which is the actual
record data:

struct log_record {
    log_record_header header;
    canonical_mutation mut;
};

Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.

This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
  headers.
* on compaction, we read the record header first to determine if the
  record is alive, if yes then we read the data.

Closes scylladb/scylladb#29457
2026-04-16 10:00:35 +03:00
Raphael S. Carvalho
1529605b32 logstor: Fix dangling reference captures and shadowed loc variable
Three bugs fixed in segment_manager.cc:

1. write_to_separator(): captured [&index] where index was a local
   coroutine-frame reference. The future is stored in
   buf.pending_updates and resolved later in flush_separator_buffer(),
   by which time the enclosing coroutine frame is destroyed, making
   &index a dangling pointer. This is a use-after-free that manifests
   as a segfault. Fix: capture index_ptr (raw pointer by value) instead.

2. add_segment_to_compaction_group(): same dangling [&index] pattern
   inside the for_each_live_record lambda during recovery. Same fix
   applied.

3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
   'log_location loc', causing the function to always return a
   zero-initialized log_location{}. Fix: remove 'auto' so the
   assignment targets the outer variable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29451
2026-04-15 14:40:15 +03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Michael Litvak
35547bfb6e test: logstor: additional logstor tests 2026-03-31 18:45:08 +02:00
Michael Litvak
39baa573d2 logstor: add version and crc to buffer header
add basic crc and validation to the buffer header. add also a version
field that indicates the version of the on-disk format.
2026-03-31 18:45:08 +02:00
Michael Litvak
78426ae31b logstor: add take_logstor_snapshot
add the function table::take_logstor_snapshot that is similar to
take_storage_snapshot for sstables.

given a token range, for each storage group in the range, it flushes the
separator buffers and then makes a snapshot of all segments in the sg's
compaction groups while disabling compaction.

the segment snapshot holds a reference to the segment so that it won't
be freed by compaction, and it provides an input stream for reading the
segment.

this will be used for tablet migration to stream the segments.
2026-03-31 18:45:08 +02:00
Michael Litvak
754c1b83bd logstor: segment input/output stream
add functions for creating segment input and output streams, that will
be used for segment streaming.

the segment input stream creates a file input stream that reads a given
segment.

the segment output stream allocates a new local segment and creates an
output stream that writes to the segment, and when closed it loads the
segment and adds it to the compaction group.
2026-03-31 18:45:08 +02:00
Michael Litvak
17cab4181b logstor: implement compaction_group::cleanup
implement compaction group cleanup by clearing the range in the index
and discarding the segments of the compaction group.

segments are discarded by overwriting the segment header to indicate the
segment is empty while preserving the segment generation number in order
to not resurrect old data in the segment.
2026-03-31 18:45:08 +02:00
Michael Litvak
9fd6dace72 logstor: tablet split
implement tablet split for logstor.

flush the separator and then perform split as a new type of compaction:
take a batch of segments from the source compaction group, read them and
write all live records into left/right write buffers according to the
split classifier, flush them to the compaction group, and free the old
segments. segments that fit in a single target compaction group are
removed from the source and added to the correct target group.
2026-03-31 18:45:08 +02:00
Michael Litvak
5de39afc24 logstor: tablet merge
implement tablet merge with logstor.

disable compaction for the new compaction group, then merge the merging
compaction groups by merging their logstor segments set into the new cg
- simply merging the segment histogram.
2026-03-31 18:40:57 +02:00
Michael Litvak
684ce8de71 logstor: add compaction reenabler
add a function that stops and disabled compaction for a compaction group
and returns a compaction reenabler object, similarly to the normal
compaction manager.

this will be useful for disabling compaction while doing operations on
the compaction group's logstor segment set.
2026-03-31 18:40:56 +02:00
Michael Litvak
1d7c2e4f52 logstor: add segment header
we have two types of segments. the active segment is "mixed" because we
can write to it multiple write_buffers, each write buffer having records
from different tables and tablets. in constrast, the separator and
compaction write "full" segments - they write a single write_buffer that
has records from a single tablet and storage group.

for "full" segments, we add a segment header the contains additional
useful metadata such as the table and token range in the segment.

the write buffer header contains the type of the buffer, mixed or full.
if it's full then it has a segment header placed after the write buffer
header.
2026-03-31 18:40:56 +02:00
Michael Litvak
8615f68657 logstor: serialize writes to active segment
previously when writing to the active segment, the allocation was
serialized but multiple writes could proceed concurrently to different
offsets. change it instead to serialize the entire write.

we prefer to write larger buffers sequentially instead of multiple
buffers concurrently. it is also better that we don't have "holes" in
the segment.

we also change the buffered_writer to send a single flushing buffer at a
time. it has a ring of buffers, new writes are written to the head
buffer, and a single consumer flushes the tail buffer.
2026-03-31 18:40:56 +02:00
Michael Litvak
e791823874 replica: extend compaction_group functions for logstor
extend compaction_group functions such as disk size calculation and
empty() to account also for the logstor segments that the compaction
group owns.

reuse the sstable_add_gate when there is a write in process to a
compaction group, in order for the compaction group to be considered not
empty.
2026-03-31 18:40:56 +02:00
Michael Litvak
d3db967802 replica: add compaction_group_for_logstor_segment
add the function table::compaction_group_for_logstor_segment that we use
when recovering a segment to find the compaction group for a segment
based on its token range, similarly to compaction_group_for_sstable for
sstables.

extract the common logic from compaction_group_for_sstable to a common
function compaction_group_for_token_range that finds a compaction group
for a token range.
2026-03-31 18:40:56 +02:00
Michael Litvak
bf7bc5b410 logstor: code cleanup
misc code cleanup and small changes
2026-03-31 18:40:56 +02:00
Piotr Dulikowski
60fb5270a9 logstor: fix fmt::format use with std::filesystem::path
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.

Closes scylladb/scylladb#29148
2026-03-23 15:15:52 +01:00
Michael Litvak
31d339e54a logstor: trigger separator flush for buffers that hold old segments
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.

Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.

To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
2026-03-18 19:24:28 +01:00
Michael Litvak
a0da07e5b7 logstor: recover segments into compaction groups
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
2026-03-18 19:24:28 +01:00
Michael Litvak
24379acc76 logstor: range read
extend the logstor mutation reader to support range read
2026-03-18 19:24:28 +01:00
Michael Litvak
a9d0211a64 logstor: change index to btree by token per table
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
2026-03-18 19:24:28 +01:00
Michael Litvak
e7c3942d43 logstor: move segments to replica::compaction_group
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.

When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
2026-03-18 19:24:28 +01:00
Michael Litvak
65cd0b5639 logstor: track memory usage
add logstor::get_memory_usage() that returns an estimate of the memory
usage by logstor. add tracking to how many unique keys are held in the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
b7bdb1010a logstor: logstor stats api
add api to get logstor statistics about segments for a table
2026-03-18 19:24:27 +01:00
Michael Litvak
8bd3bd7e2a logstor: compaction buffer pool
pre-allocate write buffers for compaction
2026-03-18 19:24:27 +01:00
Michael Litvak
caf5aa47c2 logstor: separator: flush buffer when full
flush separator buffers when they become full and switched instead of
aggregating all the buffers and flushing them when the separator is
switched.
2026-03-18 19:24:27 +01:00
Michael Litvak
6ddb7a4d13 logstor: hold segment until index updates
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.

in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.

when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.

this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
2026-03-18 19:24:27 +01:00
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
489efca47c logstor: enable/disable compaction per table
add functions to enable or disable compaction for a specific compaction
group or for all compaction groups of a table.
2026-03-18 19:24:27 +01:00
Michael Litvak
21db4f3ed8 logstor: separator buffer pool
pre-allocate write buffers for the separator
2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
1231fafb46 logstor: separator debt controller
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
17cb173e18 logstor: compaction controller
adjust compaction shares by the compaction overhead: how many segments
compaction writes to generate a single free segment for new writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
1da1bb9d99 logstor: recovery: recover mixed segments using separator
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
2026-03-18 19:24:27 +01:00
Michael Litvak
b78cc787a6 logstor: wait for pending reads in compaction
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure they
are no running operations that use the old locations before we free the
segment.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
2026-03-18 19:24:27 +01:00
Michael Litvak
009fc3757a logstor: compaction groups
divide the segments in the compaction manager to compaction group.
compaction will compact only segments from a single compaction group at
a time.
2026-03-18 19:24:27 +01:00
Michael Litvak
b3293f8579 logstor: cache files for read
keep all files for all segments open for read to improve reads.
2026-03-18 19:24:26 +01:00
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
Michael Litvak
bc9fc96579 logstor: add segment generation
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
2026-03-18 19:24:26 +01:00
Michael Litvak
719f7cca57 logstor: reserve segments for compaction
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.

do the compaction writes into full new segments instead of the active
segment.
2026-03-18 19:24:26 +01:00
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index to buckets, each bucket containing a btree. the
bucket is determined by using bits from the key hash.
2026-03-18 19:24:26 +01:00
Michael Litvak
99c3b1998a logstor: add buffer header
add a buffer header in each write buffer we write that contains some
information that can be useful for recovery and reading.
2026-03-18 19:24:26 +01:00
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
08bea860ef logstor: record generation
add a record generation number for each record so we can compare
records and find which one is newer.
2026-03-18 19:24:26 +01:00
Michael Litvak
28f820eb1c logstor: generation utility
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraparounds, assuming the values we compare were
written around the same time.
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
and reading to tables with kv storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segment with low space usage and writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00