ScyllaDB commitlog segment file format
======================================

**Note:** Commitlog file formats are subject to change between ScyllaDB versions. Users should neither make assumptions about them nor rely on them. Commitlog files should *never* be used across ScyllaDB updates. This information is provided mainly for ScyllaDB contributors.

File descriptor structure
-------------------------

ScyllaDB commitlog segment files are named with a versioned, time-indexed scheme:

```
<prefix><version>-<id>.log
```

Where `<prefix>` is application specific, but typically "Commitlog-" or, for files being recycled, "Recycled-Commitlog-", `<version>` is the file format version, and `<id>` is the id part of a replay position (timestamp + shard).

Segment file data structure
---------------------------

All control data is written in network byte order.

The file consists of a file header, followed by any number of chunks. Each chunk has its own header plus a marker pointing to the start of the next chunk, so that a chunk can be skipped more easily should its data be corrupt.

Chunks contain data entries. Each entry has a small header, the stored data, and checksums to verify its integrity. An entry can be a "multi-entry", i.e. several entries written as one.

Version 2
---------

(Format used from ScyllaDB 1.0 up to the time of this writing. It is named '2' because it deviates slightly from the format used in Cassandra.)

```
Segment file header

magic     : uint32_t - ('S' << 24) | ('C' << 16) | ('L' << 8) | 'C'
version   : uint32_t - same as descriptor
id        : uint64_t - same as descriptor
crc       : uint32_t - CRC32 of version, low 32 of id, high 32 of id.

Chunk header

file_pos  : uint32_t - the file position of next chunk
crc       : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.

Entry

size      : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
crc1      : uint32_t - CRC32 of size
data      : bytes    - actual entry data
crc2      : uint32_t - CRC32 of size, data

Multi-entry

magic     : multi marker - 0xffffffff (MAX_UINT32)
size      : size of all entries in this multi-entry + headers
crc       : CRC32 of magic, size
<entry> * N
crc2      : CRC32 of magic, size and data in each entry
```

Version 3
---------

Modified from v2 to improve error detection and eliminate false positives. This version computes a CRC per written disk block instead of per entry. Every block is also tagged with the id of the file being written, which makes it easier to distinguish new data from data left over from recycling (reusing old files) and from actual disk/file corruption.

```
Segment file header

magic     : uint32_t - ('S' << 24) | ('C' << 16) | ('L' << 8) | 'C'
version   : uint32_t - same as descriptor
id        : uint64_t - same as descriptor
alignment : uint32_t - disk block size
crc       : uint32_t - CRC32 of version, low 32 of id, high 32 of id, alignment.

Chunk header

file_pos  : uint32_t - the file position of next chunk
crc       : uint32_t - CRC32 of low 32 of segment id, high 32 of id and file offset of end of this header.

Entry

size      : uint32_t - size of entry (data + full headers). Must be smaller than MAX_UINT32.
crc       : uint32_t - CRC32 of size
data      : bytes    - actual entry data

Multi-entry

magic     : multi marker - 0xffffffff (MAX_UINT32)
size      : size of all entries in this multi-entry + headers
crc       : CRC32 of magic, size
<entry> * N

Disk block (block size = `alignment`)

0 - (bs - 12) : Interleaved file data, i.e. the content above
id            : uint64_t - same as descriptor.
crc           : uint32_t - CRC32 of block data up until crc (bs - 4), including segment id
```

The main benefit of the tagged and checksummed block is that if the CRC is broken, we know that this part of the file is corrupt. If the CRC is correct but the segment ID does not match, we can assume the file was not fully written or ended prematurely. Both cases can mean data loss, depending on how writing is done as well as on the OS and hardware.
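To make the segment file header layouts above concrete, here is a minimal Python sketch that packs and parses a version 2 or version 3 header. It is an illustration only, not ScyllaDB code: the function names are hypothetical, and it assumes that "CRC32" means the standard zlib CRC32 fed with each listed field as a big-endian 32-bit value, in the order given. The real implementation may feed the checksum differently.

```python
import struct
import zlib

# Magic from the header layout: ('S' << 24) | ('C' << 16) | ('L' << 8) | 'C'
MAGIC = (ord("S") << 24) | (ord("C") << 16) | (ord("L") << 8) | ord("C")


def crc_of_u32s(*values):
    """CRC32 over 32-bit values, each serialized big-endian (an assumption)."""
    crc = 0
    for v in values:
        crc = zlib.crc32(struct.pack(">I", v & 0xFFFFFFFF), crc)
    return crc


def pack_segment_header(version, seg_id, alignment=None):
    """Build a header; `alignment` is only present from version 3 on."""
    lo, hi = seg_id & 0xFFFFFFFF, seg_id >> 32
    if version >= 3:
        if alignment is None:
            raise ValueError("version 3 headers need an alignment")
        crc = crc_of_u32s(version, lo, hi, alignment)
        return struct.pack(">IIQII", MAGIC, version, seg_id, alignment, crc)
    crc = crc_of_u32s(version, lo, hi)
    return struct.pack(">IIQI", MAGIC, version, seg_id, crc)


def parse_segment_header(buf):
    """Parse and verify a header, returning its fields and total size."""
    magic, version, seg_id = struct.unpack_from(">IIQ", buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    fields = [version, seg_id & 0xFFFFFFFF, seg_id >> 32]
    offset = 16  # magic (4) + version (4) + id (8)
    alignment = None
    if version >= 3:
        (alignment,) = struct.unpack_from(">I", buf, offset)
        fields.append(alignment)
        offset += 4
    (crc,) = struct.unpack_from(">I", buf, offset)
    if crc != crc_of_u32s(*fields):
        raise ValueError("header CRC mismatch")
    return {"version": version, "id": seg_id,
            "alignment": alignment, "size": offset + 4}
```

Under these assumptions a version 2 header occupies 20 bytes and a version 3 header 24 bytes, the difference being the extra `alignment` field.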
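The reasoning behind the version 3 disk block tag can also be sketched in a few lines of Python. This is a hedged illustration rather than the actual implementation: `seal_block` and `classify_block` are hypothetical names, the result labels are made up for this example, and the checksum is assumed to be the standard zlib CRC32 over the raw block bytes up to (but not including) the trailing crc field.

```python
import struct
import zlib

# Block trailer: segment id (uint64) followed by CRC32 (uint32), big-endian.
TRAILER_SIZE = 8 + 4


def seal_block(payload, seg_id, block_size):
    """Pad `payload` to fill one block, then append segment id and CRC."""
    room = block_size - TRAILER_SIZE
    if len(payload) > room:
        raise ValueError("payload does not fit in one block")
    body = payload.ljust(room, b"\0") + struct.pack(">Q", seg_id)
    # The CRC covers everything before it, including the segment id.
    return body + struct.pack(">I", zlib.crc32(body))


def classify_block(block, expected_id):
    """Return 'corrupt', 'stale' or 'ok' for a single disk block."""
    body = block[:-4]
    (crc,) = struct.unpack(">I", block[-4:])
    if zlib.crc32(body) != crc:
        return "corrupt"  # CRC broken: this part of the file is damaged
    (seg_id,) = struct.unpack(">Q", body[-8:])
    if seg_id != expected_id:
        return "stale"    # valid block, but written for another segment
    return "ok"
```

A reader would call `classify_block` on each `alignment`-sized block with the id taken from the segment file header. A "stale" result corresponds to the case where the CRC is correct but the segment id does not match, for example content left over from a recycled file.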