147 lines
5.7 KiB
Markdown
147 lines
5.7 KiB
Markdown
ScyllaDB commitlog segment file format
|
|
======================================
|
|
|
|
**Note:** Commitlog file formats are subject to change between scylla versions. Users should not make assumptions about nor rely on them.
|
|
Commitlog files should *never* be used across ScyllaDB updates. This information is provided mainly for ScyllaDB contributors.
|
|
|
|
File descriptor structure
|
|
-------------------------
|
|
|
|
ScyllaDB commitlog segment files are named with a versioned, time-indexed scheme, as
|
|
|
|
```
|
|
<Prefix><version>-<id>.log
|
|
```
|
|
|
|
Where `<Prefix>` is application specific, but typically "Commitlog-", or in case of
|
|
files being recycled "Recycled-Commitlog-", `<version>` is the file format version,
|
|
and `<id>` is the id part of a replay position (timestamp + shard).
|
|
|
|
Segment file data structure
|
|
---------------------------
|
|
|
|
All control data is written in network byte order.
|
|
|
|
The file consists of a file header, followed by any number of chunks. Each chunk has
|
|
its own header + a marker to the start of next chunk, to allow skipping it more easily,
|
|
should any data corruption be present in the chunk's data.
|
|
|
|
Chunks contain data entries, with a small header, stored data + checksums to verify its integrity.
|
|
|
|
An entry can be a "multi-entry", i.e. several entries written as one.
|
|
|
|
Version 2
|
|
---------
|
|
|
|
(Format used in ScyllaDB 1.0 to as of this writing - named '2' because it is a slight deviation on the format used in cassandra)
|
|
|
|
```
|
|
|
|
Segment file header
|
|
|
|
magic : uint32_t - ('S'<<24) |('C'<< 16) | ('L' << 8) | 'C';
|
|
version : uint32_t - same as descriptor
|
|
id : uint64_t - same as descriptor
|
|
crc : uint32_t - CRC32 of version, low 32 of id, high 32 of id.
|
|
|
|
Chunk header
|
|
|
|
file_pos : uint32_t - the file position of next chunk
|
|
crc : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.
|
|
|
|
Entry
|
|
|
|
size : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
|
|
crc1 : uint32_t - CRC32 of size
|
|
data : bytes - actual entry data
|
|
crc2 : uint32_t - CRC32 of size, data
|
|
|
|
Multi-entry
|
|
|
|
magic : multi marker - 0xffffffff (MAX_UINT32)
|
|
size : size of all entries in this multi-entry + headers
|
|
crc : CRC32 of magic, size
|
|
<entries> * N
|
|
crc2 : CRC32 of magic, size and data in each entry
|
|
```
|
|
|
|
Version 3
|
|
---------
|
|
|
|
Modified from v2 to improve error detection/false positive elimination.
|
|
This version does CRC per written disk block instead of actual entry.
|
|
Every block is also tagged with which file is being written, this to
|
|
be able to better distinguish new data from that left over from recycling
|
|
(reusing old files) and actual disk/file corruption.
|
|
|
|
```
|
|
|
|
Segment file header
|
|
|
|
magic : uint32_t - ('S'<<24) |('C'<< 16) | ('L' << 8) | 'C';
|
|
version : uint32_t - same as descriptor
|
|
id : uint64_t - same as descriptor
|
|
alignment : uint32_t - disk block size
|
|
crc : uint32_t - CRC32 of version, low 32 of id, high 32 of id, alignment.
|
|
|
|
Chunk header
|
|
|
|
file_pos : uint32_t - the file position of next chunk
|
|
crc : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.
|
|
|
|
Entry
|
|
|
|
size : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
|
|
crc : uint32_t - CRC32 of size
|
|
data : bytes - actual entry data
|
|
|
|
Multi-entry
|
|
|
|
magic : multi marker - 0xffffffff (MAX_UINT32)
|
|
size : size of all entries in this multi-entry + headers
|
|
crc : CRC32 of magic, size
|
|
<entries> * N
|
|
|
|
Disk block (block size = `alignment`)
|
|
|
|
0 - <bs - 12> : Interleaved file data, i.e. the content above
|
|
<bs - 12> : uint64_t - same as descriptor.
|
|
<bs - 4> : uint32_t - CRC32 of block data up until crc (bs - 4), including segment id
|
|
|
|
The main benefit of the tagged and CRC:ed block is that if the CRC is broken, we know the part
|
|
of this file is corrupt. If the CRC is correct, but segment ID does not match, we can assume
|
|
the file is not fully written/prematurely ended. Both cases can mean data loss, depending on
|
|
how writing is done as well as OS and hardware.
|
|
|
|
```
|
|
|
|
Version 4
|
|
---------
|
|
|
|
Modified from v3 to allow fragmented data entries, i.e. writing a single data entry
|
|
as a stream across several segments.
|
|
A fragmented entry is written by splitting data into sub-parts that will fit into
|
|
the normal restrictions of a write (i.e. smaller than max mutation size, but also
|
|
trying to fit into existing buffers as best we can to avoid wasting alignment slack).
|
|
|
|
```
|
|
Fragmented entry
|
|
|
|
magic : fragmented marker - 0xfffffffe (MAX_UINT32-1)
|
|
size : size of this fragmented entry part + headers
|
|
id : the stream id
|
|
offset : offset of this entry in the data stream
|
|
remaining : stream data remaining to write after this entry
|
|
crc : CRC32 of magic, size, id, offset and remaining
|
|
|
|
```
|
|
|
|
Each stream has a unique id (monotonic counter). We must handle reading and
|
|
assembling streams out of order and interleaved because we can't always
|
|
guarantee order replaying will be performed in.
|
|
|
|
The replayer needs to be called with a state storage (replay_state) to handle
|
|
fragmented entries. When encountering one, we store the data into the state
|
|
buffer for the id, and once we have all fragments (as defined by id, offset and
|
|
remaining), we can report the full entry back to caller.
|