mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-20 08:30:35 +00:00
In6977064693("dist: scylla_raid_setup: reduce xfs block size to 1k"), we reduced the XFS block size to 1k when possible. This is because commitlog wants to write the smallest amount of padding it can, and older Linux could only write a multiple of the block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than a filesystem block. However, this doesn't play well with some SSDs that have 512 byte logical sector size and 4096 byte physical sector size - it causes them to issue read-modify-writes. To improve the situation, if we detect that the kernel is recent enough, format the filesystem with its default block size, which should be optimal. Note that commitlog will still issue sub-4k writes, which can translate to RMW. There, we believe that the amplification is reduced since sequential sub-physical-sector writes can be merged, and that the overhead from commitlog space amplification is worse than the RMW overhead. Tested on AWS i4i.large. fsqual report: ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 4096 context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` The sub-block overwrite cases are GOOD. In comparison, the fsqual report for 1k (similar): ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 1024 context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` Fixes #25441. [1]ed1128c2d0Closes scylladb/scylladb#25445 (cherry picked from commit5d1846d783)