scylladb

Author	SHA1	Message	Date
Yaniv Michael Kaul	af8eaa9ea5	scripts: fixes flagged by CodeQL/PyLens Unused imports, unused variables and such. Initially, there were no functional changes, just to get rid of some standard CodeQL warnings. I then broke the CI, as apparently there's an install-time(!?) Python script creation for the sole purpose of product naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE. So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead. While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil). Improvement - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27883	2026-01-09 15:13:12 +02:00
Avi Kivity	bb02295695	setup: add the lazytime XFS mount option In `f828fe0d59` ("setup: add the lazytime XFS version") we added the lazytime mount option to /var/lib/scylla, but it was quickly reverted (`8f5e80e61a`) as it caused a regression on CentOS 7. We reinstate it now with a kernel version check. This will avoid the lazytime mount option on CentOS 7, which is unsupported anyway. The lazytime option avoids marking the inode as dirty if it's only for the purpose of updating mtime/ctime. This won't help much while writing sstables (since the write also updates extent information), but may help a little with commitlog writes, since those are pure overwrites. It likely won't help with the RWF_NOWAIT violations seen in [1], since those are likely due to in-memory locking, not flushing dirty inodes to disk. Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run. The lazytime option was added to the .mount file and showed up in the live mount. [1] https://github.com/scylladb/seastar/issues/2974 Closes scylladb/scylladb#26436 Fixes #26002	2025-10-09 15:55:58 +03:00
Avi Kivity	5d1846d783	dist: scylla_raid_setup: don't override XFS block size on modern kernels In `6977064693` ("dist: scylla_raid_setup: reduce xfs block size to 1k"), we reduced the XFS block size to 1k when possible. This is because commitlog wants to write the smallest amount of padding it can, and older Linux could only write a multiple of the block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than a filesystem block. However, this doesn't play well with some SSDs that have 512 byte logical sector size and 4096 byte physical sector size - it causes them to issue read-modify-writes. To improve the situation, if we detect that the kernel is recent enough, format the filesystem with its default block size, which should be optimal. Note that commitlog will still issue sub-4k writes, which can translate to RMW. There, we believe that the amplification is reduced since sequential sub-physical-sector writes can be merged, and that the overhead from commitlog space amplification is worse than the RMW overhead. Tested on AWS i4i.large. 
fsqual report: ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 4096 context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` The sub-block overwrite cases are GOOD. 
In comparison, the fsqual report for 1k (similar): ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 1024 context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` Fixes #25441. [1] `ed1128c2d0` Closes scylladb/scylladb#25445	2025-09-30 17:14:36 +03:00
Yaniv Michael Kaul	198ecd8039	Do not perform blkdiscard by default on the disks during RAID setup. This is not needed on clean disks, which is often the case with cloud instances, but can be useful on bare metal servers with disks that were used before. Therefore, the default is to skip blkdiscard operation, which makes overall installation faster. If the user wishes to run it anyway, use the newly introduced --blkdiscard option of scylla_raid_setup to perform it. Note: since we either perform online discard or schedule fstrim, the (previously used) space will gradually get trimmed, this way or another. Fixes: https://github.com/scylladb/scylladb/issues/24470 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#24579	2025-06-26 12:25:38 +02:00
Yaron Kaikov	b74565e83f	dist/common/scripts/scylla_raid_setup: reduce XFS metadata overhead The block size of 1k is significantly increasing metadata overhead with xfs since it reserves space upfront for btree expansion. With CRC disabled, this reservation doesn't happen. Smaller btree blocks reduce the fanout factor, increasing btree height and the reservation size. So block size implies a trade-off between write amplification and metadata size. Bigger blocks, smaller metadata, more write ampl. Smaller blocks, more metadata, and less write ampl. Let's disable both `rmapbt` and `reflink` since we replicate data, and we can afford to rebuild a replica on local corruption. Fixes: https://github.com/scylladb/scylladb/issues/22028 Closes scylladb/scylladb#22072	2025-01-07 13:18:21 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Takuya ASADA	7ad5e69c54	scylla_raid_setup: fix failure on SELinux package installation After merging `5a470b2`, we found that scylla_raid_setup fails on offline mode installation. This is because pkg_install() just prints an error and exits the script in offline mode instead of installing packages, since offline mode is not supposed to be able to connect to the internet. It seems to occur because of the missing "policycoreutils-python-utils" package, which provides the "semanage" command. So we need to implement the relabeling patch without using the command. Fixes #21441	2024-11-11 17:27:24 +09:00
Takuya ASADA	0ac450de05	scylla_raid_setup: configure SELinux file context On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only has write access with systemd_coredump_var_lib_t. To make it writable, we need to add a file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla. Fixes #20573	2024-09-13 04:31:52 +09:00
Takuya ASADA	02b20089cb	scylla_raid_setup: install update-initramfs when it's not available scylla_raid_setup may fail on Ubuntu minimal image since it calls update-initramfs without installing. Closes scylladb/scylladb#19651	2024-07-24 11:55:16 +03:00
Kefu Chai	ab07fb25f5	scylla_raid_setup: reference xfsprog on the minimal 1024 block size the quote of "The minimum block size for crc enabled filesystems is 1024" comes from the output of mkfs.xfs, let's quote the source for better maintainability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17094	2024-02-14 08:44:14 +02:00
Kefu Chai	cd3c7a50ed	scylla_raid_setup: drop unused import Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17095	2024-02-02 15:20:40 +01:00
Takuya ASADA	a23278308f	dist: fix local-fs.target dependency systemd man page says: systemd-fstab-generator(3) automatically adds dependencies of type Before= to all mount units that refer to local mount points for this target unit. So "Before=local-fs.target" is the correct dependency for local mount points, but we currently specify "After=local-fs.target", which should be fixed. Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target", since .mount units are not related to multi-user but depend on local filesystems. Fixes #8761 Closes scylladb/scylladb#15647	2023-11-06 18:39:53 +01:00
Takuya ASADA	58d94a54a3	scylla_raid_setup: fall back to other paths when UUID is not available In some environments, such as VMware instances, /dev/disk/by-uuid/<UUID> is not available and scylla_raid_setup will fail while mounting the volume. To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available paths to mount the disk and fall back to other paths like by-partuuid, by-id, by-path or just the real device path like /dev/md0. To get the device path, and also to dump the device status when the UUID is not available, this introduces a UdevInfo class which communicates with udev using pyudev. Related #11359 Closes scylladb/scylladb#13803	2023-10-17 12:24:58 +03:00
Vlad Zolotarov	e13a2b687d	scylla_raid_setup: make --online-discard argument useful This argument was dead since its introduction and 'discard' was always configured regardless of its value. This patch allows actually configuring things using this argument. Fixes #14963 Closes #14964	2023-08-21 12:21:23 +03:00
Takuya ASADA	fdceda20cc	scylla_raid_setup: wipe filesystem signatures from specified disks The discussion on the thread says that when we reformat a volume with another filesystem, the kernel and libblkid may skip populating /dev/disk/by-* since they detected two filesystem signatures, because mkfs.xxx did not clear the previous filesystem signature. To avoid this, we need to run wipefs before running mkfs. Note that this runs wipefs twice, for the target disks and also for the RAID device. wipefs for the RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see a previous filesystem signature on /dev/mdX when we construct a RAID volume multiple times on the same disks). Also dropped the -f option from mkfs.xfs; it will check that wipefs is working as we expected. Fixes #13737 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #13738	2023-05-08 16:53:43 +03:00
Takuya ASADA	a938b009ca	scylla_raid_setup: run uuidpath existence check only after mount failed We added a UUID device file existence check in #11399; we expect the UUID device file to be created before checking, and we wait for its creation with "udevadm settle" after "mkfs.xfs". However, we are actually getting an error saying the UUID device file is missing, which probably means "udevadm settle" doesn't guarantee the device file is created under some conditions. To avoid the error, use var-lib-scylla.mount to wait until the UUID device file is ready, and run the file existence check when the service has failed. Fixes #11617 Closes #11666	2022-10-25 08:54:21 +03:00
Takuya ASADA	8835a34ab6	scylla_raid_setup: prevent mount failed for /var/lib/scylla Just like `4a8ed4c`, we also need to wait for udev event completion to create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the disk just after formatting. Fixes #11359	2022-08-27 03:27:44 +09:00
Takuya ASADA	40134efee4	scylla_raid_setup: check uuid and device path are valid Added code to make sure the uuid and the uuid-based device path are valid.	2022-08-27 03:08:31 +09:00
Takuya ASADA	48b6aec16a	scripts: use "out()" function for all capture_output subprocesses On `acaf0bb` we applied out() just for perftune.py because we had issue #10390 with this script. But the issue can happen with other commands too, let's apply it to all commands which uses capture_output. related #10390 Closes #10414	2022-04-26 13:56:52 +03:00
Takuya ASADA	32f2eb63ac	scylla_raid_setup: use mdmonitor only when RAID level > 0 We found that monitor mode of mdadm does not work on RAID0, and it is not a bug, expected behavior according to RHEL developer. Therefore, we should stop enabling mdmonitor when RAID0 is specified. Fixes #9540	2022-01-26 22:33:07 +09:00
Takuya ASADA	cd57815fff	Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8" This reverts commit `0d8f932f0b`, because RHEL developer explains this is not a bug, it's expected behavior. (mdadm --monitor does not start when RAID level is 0) see: https://bugzilla.redhat.com/show_bug.cgi?id=2031936 So we should stop downgrade mdadm package and modify our script not to enable mdmonitor.service on RAID0, use it only for RAID5.	2022-01-26 22:33:06 +09:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Takuya ASADA	0d8f932f0b	scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8 On CentOS8, mdmonitor.service does not work correctly when using mdadm-4.1-15.el8.x86_64 and later versions. Until we find a solution, let's pin the package version to an older one which does not cause the issue (4.1-14.el8.x86_64). Fixes #9540 Closes #9782	2021-12-27 12:07:34 +02:00
Takuya ASADA	6870938842	scylla_raid_setup: fix typo Closes #9790	2021-12-14 11:15:23 +02:00
Avi Kivity	a19d00ef9b	dist: scylla_raid_setup: mount XFS with online discard Online discard asks the disk to erase flash memory cells as soon as files are deleted. This gives the disk more freedom to choose where to place new files, so it improves performance. On older kernel versions, and on really bad disks, this can reduce performance so we add an option to disable it. Since fstrim is pointless when online discard is enabled, we don't configure it if online discard is selected. I tested it on an AWS i3.large instance, the flag showed up in `mount` after configuration. Closes #9608	2021-11-15 14:16:08 +02:00
Takuya ASADA	42fd73d033	scylla_setup: add RAID5 support This supports optional RAID5 support on scylla_setup. Fixes #9076 Closes #9093	2021-07-27 12:49:29 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Takuya ASADA	3d307919c3	scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem Currently, var-lib-scylla.mount may fail because it can start before the MDRAID volume is initialized. We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for the device to become available, but the systemd manual says it automatically configures the dependency for a mount unit when we specify the filesystem path by "absolute path of a device node". So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>. Fixes #8279 Closes #8681	2021-05-24 14:24:08 +03:00
Avi Kivity	6977064693	dist: scylla_raid_setup: reduce xfs block size to 1k Since Linux 5.12 [1], XFS is able to asynchronously overwrite sub-block ranges without stalling. However, we want good performance on older Linux versions, so this patch reduces the block size to the minimum possible. That turns out to be 1024 for crc-protected filesystems (which we want) and it can also not be smaller than the sector size. So we fetch the sector size and set the block size to that if it is larger than 512. Most SSDs have a sector size of 512, so this isn't a problem. Tested on AWS i3.large. Fixes #8156. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b Closes #8585	2021-05-05 16:07:50 +03:00
Takuya ASADA	c9324634ca	scylla_raid_setup: enabling mdmonitor.service on Debian variants On Debian variants, mdmonitor.service cannot be enabled because it is missing the [Install] section, so 'systemctl enable mdmonitor.service' will fail and mdmonitor will not run after the system is restarted. To force running the service, add Wants=mdmonitor.service on var-lib-scylla.mount. Fixes #8494 Closes #8530	2021-04-28 11:32:27 +03:00
Takuya ASADA	0b01e1a167	dist: add DefaultDependencies=no to .mount units To avoid ordering cycle error on Ubuntu, add DefaultDependencies=no on .mount units. Fixes #8482 Closes #8495	2021-04-19 09:06:42 +03:00
Takuya ASADA	2d9feaacea	scylla_raid_setup: don't abort using raiddev when array_state is 'clear' On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes '/dev/md0 is already using' (issue #7627). So we merged the patch to find free mdX (`587b909`). However, looking into /proc/mdstat of the AMI, it actually says no active md device is available: ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat Personalities : unused devices: <none> We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True, but according to the kernel doc, the file may be available even if the array is STOPPED: clear No devices, no size, no level Writing is equivalent to STOP_ARRAY ioctl https://www.kernel.org/doc/html/v4.15/admin-guide/md.html So we should also check array_state != 'clear', not just array_state existence. Fixes #8219 Closes #8220	2021-03-07 18:30:11 +02:00
Takuya ASADA	2f344cf50d	dist: drop Ubuntu 14.04 code We don't support Ubuntu 14.04 anymore, drop them	2021-01-13 19:32:45 +09:00
Takuya ASADA	fffa8f5ded	dist: support Arch Linux Add support Arch Linux on setup script.	2021-01-13 19:32:45 +09:00
Takuya ASADA	3fefa520bd	dist/common/scripts: drop run() and out(), switch to subprocess.run() We initially implemented run() and out() functions because we couldn't use subprocess.run() since we were on Python 3.4. But since we moved to relocatable python3, we don't need to implement it ourselves. The reason we kept using these functions is that we needed to set an environment variable to set PATH. Since we recently moved that code to the python thunk, we are finally able to drop run() and out() and switch to subprocess.run().	2020-11-22 17:59:27 +02:00
Evgeniy Naydanov	587b909c5c	scylla_raid_setup: try /dev/md[0-9] if no --raiddev provided If scylla_raid_setup script called without --raiddev argument then try to use any of /dev/md[0-9] devices instead of only one /dev/md0. Do it in this way because on Ubuntu 20.04 /dev/md0 used by OS already. Closes #7628	2020-11-18 18:42:31 +02:00
Takuya ASADA	fc1c4f2261	scylla_raid_setup: use sysfs to detect existing RAID volume We may not be able to detect an existing RAID volume by device file existence; we should use sysfs instead to make sure it's running. Fixes #7383 Closes #7399	2020-10-29 09:13:55 +02:00
Takuya ASADA	8e1f7d4fc7	dist/common/scripts: drop makedirs(), use os.makedirs() Since os.makedirs() has exist_ok option, no need to create wrapper function.	2020-09-20 00:48:06 +09:00
Takuya ASADA	82701dc5ed	dist/common/scripts: drop dist_name() and dist_ver() It can be replaced with distro.name() and distro.version(). Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2020-09-20 00:42:27 +09:00
Takuya ASADA	db9e6f50f3	dist/common/scripts: skip internet access on offline installation We need to skip internet access on offline installation. To do this we need following changes: - prevent running yum/apt for each script - set default "NO" for scripts it requires package installation - set default "NO" for scripts it requires internet access, such as NTP See #7153 Fixes #7182	2020-09-16 10:05:20 +09:00
Takuya ASADA	6fbbe836c1	scylla_raid_setup: use mdadm.service on older Debian variants On older Debian variants does not have mdmonitor.service, we should use mdadm.service instead. Fixes #7000	2020-08-11 12:52:24 +03:00
Takuya ASADA	cff3e60f98	scylla_raid_setup: check var-lib-scylla.mount existence before formatting RAID We should run the var-lib-scylla.mount existence check before formatting RAID. Fixes #6965	2020-08-03 17:44:02 +03:00
Takuya ASADA	9e5d548f75	scylla_raid_setup: initialize MDRAID before mounting data volume var-lib-scylla.mount should wait for MDRAID initialization, so we need to add 'After=mdmonitor.service'. However, currently mdmonitor.service fails to start because no mail address is specified; we need to add the entry to mdadm.conf. Fixes #6876	2020-07-31 06:33:52 +09:00
Takuya ASADA	9b5f28a2e3	scylla_raid_setup: fix incorrect block device path To use UUID, we need a tag "UUID=<uuid>". reference: https://www.freedesktop.org/software/systemd/man/systemd.mount.html reference: https://man7.org/linux/man-pages/man8/mount.8.html	2020-07-15 18:22:46 +03:00
Takuya ASADA	e6e4359414	scylla_raid_setup: switch to systemd mount unit Since we already use systemd unit file for coredump bind mount and swapfile, we should move to systemd mount unit for data partition as well.	2020-07-13 17:14:44 +03:00
Takuya ASADA	90e28c5fcf	scylla_raid_setup: daemon-reload after mounts.conf installed systemd requires daemon-reload after adding drop-in file, so we need to do that after writing mounts.conf. Fixes #6674	2020-06-22 14:03:13 +03:00
Takuya ASADA	086f0ffd5a	scylla_raid_setup: create missing directories We need to create hints, view_hints, saved_caches directories on RAID volume. Fixes #5811	2020-03-12 09:29:29 +02:00
Takuya ASADA	98c182ec67	dist/redhat: align dependencies with debian On Debian, we don't add xfsprogs/mdadm on package dependency, install on scylla_raid_setup script instead. Since xfsprogs/mdadm only needed for constructing RAID, we can move dependencies to scylla_raid_setup too.	2020-02-23 15:34:35 +02:00
Alexys Jacob	02656fb00e	dist/common/scripts: add / normalize python3 shebang	2018-11-28 23:55:26 +01:00
Avi Kivity	8f5e80e61a	Revert "setup: add the lazytime XFS version" This reverts commit `f828fe0d59`. It causes scylla_raid_setup to fail on CentOS 7. Fixes #3784.	2018-09-26 11:10:07 +01:00
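
Commits `fc1c4f2261`, `587b909c5c` and `2d9feaacea` above describe how an md device is judged to be in use: the sysfs file `/sys/block/mdX/md/array_state` must exist and must not read `clear`, and `/dev/md[0-9]` are tried in turn when no `--raiddev` is given. The following is a minimal sketch of that logic, not the script's actual code. The function names `md_in_use` and `find_free_md` and the `sysfs_root` parameter are hypothetical, introduced here only so the check can be exercised against a fake directory tree.

```python
import os


def md_in_use(name, sysfs_root='/sys/block'):
    """Return True if sysfs reports an assembled md array for `name` (e.g. 'md0').

    Hypothetical helper: the array counts as in use only when array_state
    exists AND its content is something other than 'clear'.
    """
    state_path = os.path.join(sysfs_root, name, 'md', 'array_state')
    if not os.path.exists(state_path):
        return False
    with open(state_path) as f:
        return f.read().strip() != 'clear'


def find_free_md(sysfs_root='/sys/block'):
    """Return the first of /dev/md0../dev/md9 that is not in use, or None."""
    for i in range(10):
        name = f'md{i}'
        if not md_in_use(name, sysfs_root):
            return f'/dev/{name}'
    return None
```

Taking the sysfs root as a parameter is purely a testing convenience; on a real system the default `/sys/block` would be used.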


83 Commits
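
The block-size rule described in `6977064693` and `5d1846d783` can be summarized in a few lines. This is a sketch under stated assumptions rather than the code in scylla_raid_setup: the helper names are hypothetical, and the 5.12 threshold is taken from the statement in `6977064693` that XFS can overwrite sub-block ranges asynchronously since Linux 5.12. The log does not say which version the script actually tests for.

```python
import re


def parse_kernel_version(release):
    """Turn a release string such as '6.8.0-45-generic' into (6, 8).

    Hypothetical helper; raises ValueError if no major.minor prefix is found.
    """
    m = re.match(r'(\d+)\.(\d+)', release)
    if m is None:
        raise ValueError(f'unrecognized kernel release: {release!r}')
    return int(m.group(1)), int(m.group(2))


def xfs_block_size(kernel_release, sector_size):
    """Return the block size to request from mkfs.xfs, or None for its default.

    Assumption: kernels >= 5.12 handle sub-block overwrites well, so the
    filesystem default is kept. Older kernels get the smallest usable size:
    1024 (the minimum for crc-enabled filesystems), or the sector size when
    that is larger.
    """
    if parse_kernel_version(kernel_release) >= (5, 12):
        return None
    return max(1024, sector_size)
```

For example, `xfs_block_size('5.4.0-100-generic', 512)` gives 1024, while any release string starting with 5.12 or later gives `None`, meaning no block size is passed and mkfs.xfs picks its own default.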