ScyllaDB commitlog segment file format
Note: Commitlog file formats are subject to change between ScyllaDB versions. Users should neither make assumptions about them nor rely on them. Commitlog files should never be used across ScyllaDB updates. This information is provided mainly for ScyllaDB contributors.
File descriptor structure
ScyllaDB commitlog segment files are named with a versioned, time-indexed scheme:
<Prefix><version>-<id>.log
Here <Prefix> is application-specific, but typically "Commitlog-" or, for files
being recycled, "Recycled-Commitlog-"; <version> is the file format version;
and <id> is the id part of a replay position (timestamp + shard).
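As an illustration of the naming scheme, the following minimal sketch splits a segment file name into its three parts. It assumes that <version> and <id> are written as decimal integers, which the text above does not state; `Descriptor` and `parse_segment_name` are hypothetical names, not ScyllaDB identifiers.

```python
import re
from typing import NamedTuple, Optional

# Assumption: <version> and <id> are decimal integers; the prefix is
# whatever precedes them (e.g. "Commitlog-" or "Recycled-Commitlog-").
_DESCRIPTOR_RE = re.compile(r"^(?P<prefix>.*?)(?P<version>\d+)-(?P<id>\d+)\.log$")


class Descriptor(NamedTuple):
    prefix: str
    version: int
    id: int


def parse_segment_name(name: str) -> Optional[Descriptor]:
    """Return the parsed descriptor, or None if the name does not fit the scheme."""
    m = _DESCRIPTOR_RE.match(name)
    if m is None:
        return None
    return Descriptor(m["prefix"], int(m["version"]), int(m["id"]))
```

For example, `parse_segment_name("Recycled-Commitlog-3-77.log")` yields the prefix "Recycled-Commitlog-", version 3 and id 77.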
Segment file data structure
All control data is written in network byte order.
The file consists of a file header followed by any number of chunks. Each chunk has its own header, which includes a marker pointing to the start of the next chunk, so that a chunk can be skipped more easily if its data is corrupt.
Chunks contain data entries, each consisting of a small header, the stored data, and checksums to verify its integrity.
An entry can be a "multi-entry", i.e. several entries written as one.
Version 2
(Format used from ScyllaDB 1.0 onward, as of this writing. It is named '2' because it deviates slightly from the format used in Cassandra.)
Segment file header
magic : uint32_t - ('S' << 24) | ('C' << 16) | ('L' << 8) | 'C'
version : uint32_t - same as descriptor
id : uint64_t - same as descriptor
crc : uint32_t - CRC32 of version, low 32 of id, high 32 of id.
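The header layout above can be sketched as a pack/parse pair. This is only an illustration under stated assumptions: the text does not say which CRC32 variant is used or how the words are fed into it, so the sketch assumes the common zlib CRC-32 computed over the three 32-bit words in network byte order. The function names are hypothetical.

```python
import struct
import zlib

MAGIC = (ord("S") << 24) | (ord("C") << 16) | (ord("L") << 8) | ord("C")
# magic, version, id, crc; ">" selects network byte order without padding.
HEADER_V2 = struct.Struct(">IIQI")


def header_crc(version: int, seg_id: int) -> int:
    # Assumption: zlib CRC-32 over version, low 32 bits of id, high 32 bits
    # of id, each serialized as a big-endian 32-bit word.
    words = struct.pack(">III", version, seg_id & 0xFFFFFFFF, seg_id >> 32)
    return zlib.crc32(words)


def pack_header_v2(version: int, seg_id: int) -> bytes:
    return HEADER_V2.pack(MAGIC, version, seg_id, header_crc(version, seg_id))


def parse_header_v2(buf: bytes):
    """Return (version, id), raising ValueError on a bad magic or checksum."""
    magic, version, seg_id, crc = HEADER_V2.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if crc != header_crc(version, seg_id):
        raise ValueError("header CRC mismatch")
    return version, seg_id
```

With network byte order, the magic appears on disk as the ASCII bytes "SCLC", which makes a segment file easy to recognize in a hex dump.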
Chunk header
file_pos : uint32_t - the file position of next chunk
crc : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.
Entry
size : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
crc1 : uint32_t - CRC32 of size
data : bytes - actual entry data
crc2 : uint32_t - CRC32 of size, data
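A minimal sketch of writing and verifying a single v2 entry follows. It rests on several assumptions not confirmed by the text: that "full headers" means the three 32-bit fields (12 bytes), that crc2 continues the same running checksum that produced crc1, and that the checksum is the zlib CRC-32 over big-endian words. All names are hypothetical.

```python
import struct
import zlib

ENTRY_OVERHEAD = 12        # size + crc1 + crc2, 4 bytes each (assumed)
MULTI_MARKER = 0xFFFFFFFF  # reserved size value marking a multi-entry


def crc_u32(value: int, crc: int = 0) -> int:
    # Assumption: 32-bit values enter the checksum in network byte order.
    return zlib.crc32(struct.pack(">I", value), crc)


def pack_entry_v2(data: bytes) -> bytes:
    size = len(data) + ENTRY_OVERHEAD
    crc1 = crc_u32(size)
    crc2 = zlib.crc32(data, crc1)  # running checksum over size, then data
    return struct.pack(">II", size, crc1) + data + struct.pack(">I", crc2)


def read_entry_v2(buf: bytes, pos: int = 0):
    """Return (data, position of the next entry); raise ValueError if invalid."""
    size, crc1 = struct.unpack_from(">II", buf, pos)
    if size == MULTI_MARKER:
        raise ValueError("multi-entry marker, not a plain entry")
    if crc1 != crc_u32(size):
        raise ValueError("size CRC mismatch")
    if size < ENTRY_OVERHEAD or pos + size > len(buf):
        raise ValueError("entry size out of range")
    data = bytes(buf[pos + 8 : pos + size - 4])
    (crc2,) = struct.unpack_from(">I", buf, pos + size - 4)
    if crc2 != zlib.crc32(data, crc1):
        raise ValueError("data CRC mismatch")
    return data, pos + size
```

Checking crc1 before trusting the size field means a reader never uses a corrupted length to index into the buffer.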
Multi-entry
magic : multi marker - 0xffffffff (MAX_UINT32)
size : size of all entries in this multi-entry + headers
crc : CRC32 of magic, size
<entries> * N
crc2 : CRC32 of magic, size and data in each entry
Version 3
Modified from v2 to improve error detection and eliminate false positives. This version computes a CRC per written disk block instead of per entry. Every block is also tagged with the id of the file being written, to better distinguish new data from data left over from recycling (reusing old files) and from actual disk/file corruption.
Segment file header
magic : uint32_t - ('S' << 24) | ('C' << 16) | ('L' << 8) | 'C'
version : uint32_t - same as descriptor
id : uint64_t - same as descriptor
alignment : uint32_t - disk block size
crc : uint32_t - CRC32 of version, low 32 of id, high 32 of id, alignment.
Chunk header
file_pos : uint32_t - the file position of next chunk
crc : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.
Entry
size : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
crc : uint32_t - CRC32 of size
data : bytes - actual entry data
Multi-entry
magic : multi marker - 0xffffffff (MAX_UINT32)
size : size of all entries in this multi-entry + headers
crc : CRC32 of magic, size
<entries> * N
Disk block (block size = `alignment`)
0 - <bs - 12> : Interleaved file data, i.e. the content above
<bs - 12> : uint64_t - segment id, same as descriptor.
<bs - 4> : uint32_t - CRC32 of block data up until crc (bs - 4), including segment id
The main benefit of the tagged and CRC-protected block is that if the CRC is broken, we know
that this part of the file is corrupt. If the CRC is correct but the segment id does not match,
we can assume the file is not fully written or ended prematurely. Both cases can mean data
loss, depending on how writing is done as well as on the OS and hardware.
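The two failure cases described above can be told apart with a check along these lines. This is a sketch only: it assumes the zlib CRC-32 computed over the raw block bytes up to the crc field, and zero padding of unused space, neither of which the text specifies. `seal_block` and `check_block` are hypothetical helpers.

```python
import struct
import zlib

TRAILER_SIZE = 12  # 8-byte segment id followed by a 4-byte CRC


def seal_block(payload: bytes, seg_id: int, block_size: int) -> bytes:
    """Build one disk block: payload, padding, segment id tag, then CRC."""
    room = block_size - TRAILER_SIZE
    if len(payload) > room:
        raise ValueError("payload does not fit in one block")
    # Assumption: unused space in the block is zero-filled.
    body = payload.ljust(room, b"\0") + struct.pack(">Q", seg_id)
    return body + struct.pack(">I", zlib.crc32(body))


def check_block(block: bytes, seg_id: int) -> str:
    """Classify a block as 'ok', 'corrupt' (bad CRC) or 'stale' (wrong id)."""
    body = block[:-4]
    (crc,) = struct.unpack(">I", block[-4:])
    (tag,) = struct.unpack(">Q", block[-12:-4])
    if zlib.crc32(body) != crc:
        return "corrupt"  # the block content itself is damaged
    if tag != seg_id:
        return "stale"    # intact block, but written for another segment
    return "ok"
```

A "stale" result corresponds to the recycled-file case: the block is internally consistent, but it was left over from an earlier segment and never overwritten.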
Version 4
Modified from v3 to allow fragmented data entries, i.e. writing a single data entry as a stream across several segments. A fragmented entry is written by splitting data into sub-parts that will fit into the normal restrictions of a write (i.e. smaller than max mutation size, but also trying to fit into existing buffers as best we can to avoid wasting alignment slack).
Fragmented entry
magic : fragmented marker - 0xfffffffe (MAX_UINT32-1)
size : size of this fragmented entry part + headers
id : the stream id
offset : offset of this entry in the data stream
remaining : stream data remaining to write after this entry
crc : CRC32 of magic, size, id, offset and remaining
Each stream has a unique id (a monotonic counter). We must handle reading and assembling streams out of order and interleaved, because we cannot always guarantee the order in which replay will be performed.
The replayer needs to be called with a state storage (replay_state) to handle fragmented entries. When encountering a fragment, we store its data in the state buffer for that id, and once we have all fragments (as determined by id, offset and remaining), we can report the full entry back to the caller.
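The reassembly logic described above can be sketched as follows. This is not the actual replay_state implementation; the `ReplayState` class and its method are hypothetical. It assumes that `remaining` counts the stream bytes that follow the current fragment, so that offset + fragment length + remaining gives the total stream length.

```python
from typing import Dict, Optional


class ReplayState:
    """Collects fragments per stream id until a stream is complete."""

    def __init__(self) -> None:
        # stream id -> {offset in stream: fragment data}
        self._streams: Dict[int, Dict[int, bytes]] = {}

    def add_fragment(self, stream_id: int, offset: int, remaining: int,
                     data: bytes) -> Optional[bytes]:
        """Store one fragment; return the full entry once all parts are present."""
        # Assumption: every fragment lets us derive the total stream length.
        total = offset + len(data) + remaining
        parts = self._streams.setdefault(stream_id, {})
        parts[offset] = data  # a repeated fragment simply overwrites itself
        if sum(len(p) for p in parts.values()) < total:
            return None       # still waiting for more fragments
        del self._streams[stream_id]
        return b"".join(parts[o] for o in sorted(parts))
```

Keying the buffers by stream id handles interleaving, and keying the parts by offset handles out-of-order arrival, since the final join sorts by offset regardless of the order in which fragments were seen.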