From da260ecd612dc19fc556ff3fdb7d1ebbaf47c8a8 Mon Sep 17 00:00:00 2001
From: Glauber Costa
Date: Mon, 12 Aug 2019 16:45:33 -0400
Subject: [PATCH] systemd: put scylla processes in systemd slices.

It is well known that seastar applications, like Scylla, do not play
well with external processes: CPU usage from external processes may
confuse the I/O and CPU schedulers and create stalls.

We have also recently seen that memory usage from other applications'
anonymous and page cache memory can bring the system to OOM.

Linux has a very good infrastructure for resource control contributed by
amazingly bright engineers in the form of cgroup controllers. This
infrastructure is exposed by systemd in the form of slices: a
hierarchical structure to which controllers can be attached.

In true systemd fashion, the hierarchy is implicit in the filenames of
the slice files. A "-" symbol defines the hierarchy, so the files that
this patch presents, scylla-server and scylla-helper, essentially create
a "scylla" cgroup at the top level with "server" and "helper" children.

Later we mark the services needed to run scylla as belonging to one
or the other through the Slice= directive.

Scylla DBAs can benefit from this setup by using the systemd-run
utility to fire ad-hoc commands.

Let's say for example that someone wants to hypothetically run a backup
and transfer files to an external object store like S3, making sure that
the amount of page cache used won't create swap pressure leading to
database timeouts.

One can then run something like:

```
sudo systemd-run --uid=`id -u scylla` --gid=`id -g scylla` -t --slice=scylla-helper.slice /path/to/my/magical_backup_tool
```

(or even better, the backup tool can itself be a systemd timer)

Changes from last version:
- No longer use the CPUQuota
- Minor typo fixes
- postinstall fixup for small machines

Benchmark results:
==================

Test: read from disk, with 100% disk util using a single i3.xlarge (4 vCPUs).
We have to fill the cache as we read, so this should stress CPU, memory and
disk I/O.

cassandra-stress command:
```
cassandra-stress read no-warmup duration=5m -rate threads=20 -node 10.2.209.188 -pop dist=uniform\(1..150000000\)
```

Baseline results:

```
Results:
Op rate                   : 13,830 op/s [READ: 13,830 op/s]
Partition rate            : 13,830 pk/s [READ: 13,830 pk/s]
Row rate                  : 13,830 row/s [READ: 13,830 row/s]
Latency mean              : 1.4 ms [READ: 1.4 ms]
Latency median            : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile   : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile   : 2.8 ms [READ: 2.8 ms]
Latency 99.9th percentile : 3.4 ms [READ: 3.4 ms]
Latency max               : 12.0 ms [READ: 12.0 ms]
Total partitions          : 4,149,130 [READ: 4,149,130]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

Question 1:
===========

Does putting scylla in a special slice affect its performance?

Results with Scylla running in a slice:

```
Results:
Op rate                   : 13,811 op/s [READ: 13,811 op/s]
Partition rate            : 13,811 pk/s [READ: 13,811 pk/s]
Row rate                  : 13,811 row/s [READ: 13,811 row/s]
Latency mean              : 1.4 ms [READ: 1.4 ms]
Latency median            : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile   : 2.2 ms [READ: 2.2 ms]
Latency 99th percentile   : 2.6 ms [READ: 2.6 ms]
Latency 99.9th percentile : 3.3 ms [READ: 3.3 ms]
Latency max               : 23.2 ms [READ: 23.2 ms]
Total partitions          : 4,151,409 [READ: 4,151,409]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

*Conclusion* : No significant change

Question 2:
===========

What happens when there is a CPU hog running in the same server as scylla?
CPU hog:

```
taskset -c 0 /bin/sh -c "while true; do true; done" &
taskset -c 1 /bin/sh -c "while true; do true; done" &
taskset -c 2 /bin/sh -c "while true; do true; done" &
taskset -c 3 /bin/sh -c "while true; do true; done" &
sleep 330
```

Scenario 1: CPU hog runs freely:

```
Results:
Op rate                   : 2,939 op/s [READ: 2,939 op/s]
Partition rate            : 2,939 pk/s [READ: 2,939 pk/s]
Row rate                  : 2,939 row/s [READ: 2,939 row/s]
Latency mean              : 6.8 ms [READ: 6.8 ms]
Latency median            : 5.3 ms [READ: 5.3 ms]
Latency 95th percentile   : 11.0 ms [READ: 11.0 ms]
Latency 99th percentile   : 14.9 ms [READ: 14.9 ms]
Latency 99.9th percentile : 17.1 ms [READ: 17.1 ms]
Latency max               : 26.3 ms [READ: 26.3 ms]
Total partitions          : 884,460 [READ: 884,460]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

Scenario 2: CPU hog runs inside scylla-helper slice

```
Results:
Op rate                   : 13,527 op/s [READ: 13,527 op/s]
Partition rate            : 13,527 pk/s [READ: 13,527 pk/s]
Row rate                  : 13,527 row/s [READ: 13,527 row/s]
Latency mean              : 1.5 ms [READ: 1.5 ms]
Latency median            : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile   : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile   : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.8 ms [READ: 3.8 ms]
Latency max               : 18.7 ms [READ: 18.7 ms]
Total partitions          : 4,069,934 [READ: 4,069,934]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

*Conclusion*: With systemd slice we can keep the performance very close to
baseline

Question 3:
===========

What happens when there is an I/O hog running in the same server as scylla?
I/O hog: (Data in the cluster is 2x size of memory)

```
while true; do
    find /var/lib/scylla/data -type f -exec grep glauber {} +
done
```

Scenario 1: I/O hog runs freely:

```
Results:
Op rate                   : 7,680 op/s [READ: 7,680 op/s]
Partition rate            : 7,680 pk/s [READ: 7,680 pk/s]
Row rate                  : 7,680 row/s [READ: 7,680 row/s]
Latency mean              : 2.6 ms [READ: 2.6 ms]
Latency median            : 1.3 ms [READ: 1.3 ms]
Latency 95th percentile   : 7.8 ms [READ: 7.8 ms]
Latency 99th percentile   : 10.9 ms [READ: 10.9 ms]
Latency 99.9th percentile : 16.9 ms [READ: 16.9 ms]
Latency max               : 40.8 ms [READ: 40.8 ms]
Total partitions          : 2,306,723 [READ: 2,306,723]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

Scenario 2: I/O hog runs in the scylla-helper systemd slice:

```
Results:
Op rate                   : 13,277 op/s [READ: 13,277 op/s]
Partition rate            : 13,277 pk/s [READ: 13,277 pk/s]
Row rate                  : 13,277 row/s [READ: 13,277 row/s]
Latency mean              : 1.5 ms [READ: 1.5 ms]
Latency median            : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile   : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile   : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.5 ms [READ: 3.5 ms]
Latency max               : 183.4 ms [READ: 183.4 ms]
Total partitions          : 3,984,080 [READ: 3,984,080]
Total errors              : 0 [READ: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:05:00
```

*Conclusion*: With systemd slice we can keep the performance very close to
baseline

Signed-off-by: Glauber Costa
---
 dist/common/systemd/node-exporter.service     |  1 +
 dist/common/systemd/scylla-fstrim.service     |  1 +
 dist/common/systemd/scylla-helper.slice       | 24 +++++++++++++
 ...scylla-housekeeping-daily.service.mustache |  1 +
 .../systemd/scylla-server.service.mustache    |  1 +
 dist/common/systemd/scylla-server.slice       | 19 +++++++++++
 dist/debian/debian/scylla-server.postrm       |  1 +
 dist/debian/scylla-server.install.mustache    |  1 +
 dist/redhat/scylla.spec.mustache              |  3 ++
 install.sh                                    |  1 +
 scripts/create-relocatable-package.py         |  3 ++
 scylla_post_install.sh                        | 34 +++++++++++++++++++
 12 files changed, 90 insertions(+)
 create mode 100644 dist/common/systemd/scylla-helper.slice
 create mode 100644 dist/common/systemd/scylla-server.slice

diff --git a/dist/common/systemd/node-exporter.service b/dist/common/systemd/node-exporter.service
index d2e4e39e00..08ed4b7337 100644
--- a/dist/common/systemd/node-exporter.service
+++ b/dist/common/systemd/node-exporter.service
@@ -6,6 +6,7 @@ Type=simple
 User=scylla
 Group=scylla
 ExecStart=/usr/bin/node_exporter --collector.interrupts
+Slice=scylla-helper.slice
 
 [Install]
 WantedBy=multi-user.target
diff --git a/dist/common/systemd/scylla-fstrim.service b/dist/common/systemd/scylla-fstrim.service
index 9457fa26e0..6755ba8e4e 100644
--- a/dist/common/systemd/scylla-fstrim.service
+++ b/dist/common/systemd/scylla-fstrim.service
@@ -5,6 +5,7 @@ After=network.target
 [Service]
 Type=simple
 ExecStart=/opt/scylladb/scripts/scylla_fstrim
+Slice=scylla-helper.slice
 
 [Install]
 WantedBy=multi-user.target
diff --git a/dist/common/systemd/scylla-helper.slice b/dist/common/systemd/scylla-helper.slice
new file mode 100644
index 0000000000..3cc9b4cf96
--- /dev/null
+++ b/dist/common/systemd/scylla-helper.slice
@@ -0,0 +1,24 @@
+[Unit]
+Description=Slice used to run companion programs to Scylla. Memory, CPU and IO restricted
+Before=slices.target
+
+[Slice]
+MemoryAccounting=true
+IOAccounting=true
+CPUAccounting=true
+
+CPUWeight=10
+IOWeight=10
+
+# MemoryHigh is the throttle point
+MemoryHigh=4%
+# MemoryMax is the OOM point. As Scylla reserves 7% by default, the other two percent goes to
+# the kernel and other non contained processes
+MemoryMax=5%
+# Systemd deprecated settings BlockIOWeight, MemoryLimit and CPUShares. But they are still the ones used in RHEL7
+# Newer SystemD wants MemoryHigh/MemoryMax, IOWeight and CPUWeight instead. Luckily both newer and older SystemD seem to
+# ignore the unwanted option so safest to get both. Using just the old versions would work too but
+# seems less future proof. Using just the new versions does not work at all for RHEL7/
+MemoryLimit=5%
+CPUShares=10
+BlockIOWeight=10
diff --git a/dist/common/systemd/scylla-housekeeping-daily.service.mustache b/dist/common/systemd/scylla-housekeeping-daily.service.mustache
index caa5caf0cb..578c82cd4b 100644
--- a/dist/common/systemd/scylla-housekeeping-daily.service.mustache
+++ b/dist/common/systemd/scylla-housekeeping-daily.service.mustache
@@ -12,6 +12,7 @@ ExecStart=/opt/scylladb/scripts/scylla-housekeeping --uuid-file /var/lib/scylla-
 {{#redhat}}
 ExecStart=/opt/scylladb/scripts/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -q -c /etc/scylla.d/housekeeping.cfg --repo-files '/etc/yum.repos.d/scylla*.repo' version --mode d
 {{/redhat}}
+Slice=scylla-helper.slice
 
 [Install]
 WantedBy=multi-user.target
diff --git a/dist/common/systemd/scylla-server.service.mustache b/dist/common/systemd/scylla-server.service.mustache
index be8f28eea4..1a08a15ed3 100644
--- a/dist/common/systemd/scylla-server.service.mustache
+++ b/dist/common/systemd/scylla-server.service.mustache
@@ -31,6 +31,7 @@ OOMScoreAdjust=-950
 StandardOutput=syslog
 StandardError=syslog
 SyslogLevelPrefix=false
+Slice=scylla-server.slice
 
 [Install]
 WantedBy=multi-user.target
diff --git a/dist/common/systemd/scylla-server.slice b/dist/common/systemd/scylla-server.slice
new file mode 100644
index 0000000000..44644da4e1
--- /dev/null
+++ b/dist/common/systemd/scylla-server.slice
@@ -0,0 +1,19 @@
+[Unit]
+Description=Slice used to run Scylla. Maximum priority for IO and CPU
+Before=slices.target
+
+[Slice]
+MemoryAccounting=true
+IOAccounting=true
+CPUAccounting=true
+# Systemd deprecated settings BlockIOWeight and CPUShares. But they are still the ones used in RHEL7
+# Newer SystemD wants IOWeight and CPUWeight instead. Luckily both newer and older SystemD seem to
+# ignore the unwanted option so safest to get both. Using just the old versions would work too but
+# seems less future proof. Using just the new versions does not work at all for RHEL7/
+BlockIOWeight=1000
+IOWeight=1000
+# Should be zero, but work around https://github.com/systemd/systemd/issues/8363 for older systemd
+MemorySwapMax=1
+CPUShares=1000
+CPUWeight=1000
+
diff --git a/dist/debian/debian/scylla-server.postrm b/dist/debian/debian/scylla-server.postrm
index 4caee0bd13..3e586771a2 100644
--- a/dist/debian/debian/scylla-server.postrm
+++ b/dist/debian/debian/scylla-server.postrm
@@ -5,6 +5,7 @@ set -e
 case "$1" in
     purge|remove)
         rm -rf /etc/systemd/system/scylla-server.service.d/
+        rm -rf /etc/systemd/system/scylla-helper.slice.d/
         ;;
 esac
 
diff --git a/dist/debian/scylla-server.install.mustache b/dist/debian/scylla-server.install.mustache
index b30a61351b..cb95d0bea0 100644
--- a/dist/debian/scylla-server.install.mustache
+++ b/dist/debian/scylla-server.install.mustache
@@ -13,6 +13,7 @@ conf/housekeeping.cfg etc/scylla.d
 dist/common/systemd/scylla-housekeeping-daily.timer /lib/systemd/system
 dist/common/systemd/scylla-housekeeping-restart.timer /lib/systemd/system
 dist/common/systemd/scylla-fstrim.timer /lib/systemd/system
+dist/common/systemd/*.slice /lib/systemd/system
 dist/debian/scripts/scylla_save_coredump opt/scylladb/scripts
 dist/debian/scripts/scylla_delay_fstrim opt/scylladb/scripts
 *.md NOTICE.txt ORIGIN licenses opt/scylladb/share/doc/scylla
diff --git a/dist/redhat/scylla.spec.mustache b/dist/redhat/scylla.spec.mustache
index 250ec352ac..d0e71c4b17 100644
--- a/dist/redhat/scylla.spec.mustache
+++ b/dist/redhat/scylla.spec.mustache
@@ -114,6 +114,7 @@ rm -rf $RPM_BUILD_ROOT
 /opt/scylladb/share/doc/scylla/licenses/
 %{_unitdir}/*.service
 %{_unitdir}/*.timer
+%{_unitdir}/*.slice
 %{_bindir}/scylla
 %{_bindir}/iotune
 %{_bindir}/scyllatop
@@ -135,6 +136,8 @@ rm -rf $RPM_BUILD_ROOT
 %attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/view_hints
 %attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/coredump
 %attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla-housekeeping
+%ghost /etc/systemd/system/scylla-helper.slice.d/
+%ghost /etc/systemd/system/scylla-helper.slice.d/memory.conf
 %ghost /etc/systemd/system/scylla-server.service.d/
 %ghost /etc/systemd/system/scylla-server.service.d/capabilities.conf
 %ghost /etc/systemd/system/scylla-server.service.d/mounts.conf
diff --git a/install.sh b/install.sh
index 3c99bca420..adbf753059 100755
--- a/install.sh
+++ b/install.sh
@@ -142,6 +142,7 @@ if [ -z "$pkg" ] || [ "$pkg" = "server" ]; then
     install -d -m755 "$retc"/scylla "$rusr/lib/systemd/system" "$rusr/bin" "$rprefix/bin" "$rprefix/libexec" "$rprefix/libreloc" "$rprefix/scripts"
     install -m644 build/*.service -Dt "$rusr"/lib/systemd/system
     install -m644 dist/common/systemd/*.service -Dt "$rusr"/lib/systemd/system
+    install -m644 dist/common/systemd/*.slice -Dt "$rusr"/lib/systemd/system
     install -m644 dist/common/systemd/*.timer -Dt "$rusr"/lib/systemd/system
     install -m755 seastar/scripts/seastar-cpu-map.sh -Dt "$rprefix"/scripts
     install -m755 seastar/dpdk/usertools/dpdk-devbind.py -Dt "$rprefix"/scripts
diff --git a/scripts/create-relocatable-package.py b/scripts/create-relocatable-package.py
index b9692da8d6..ab4739bd5b 100755
--- a/scripts/create-relocatable-package.py
+++ b/scripts/create-relocatable-package.py
@@ -159,6 +159,9 @@ ar.add('build/SCYLLA-PRODUCT-FILE', arcname='SCYLLA-PRODUCT-FILE')
 ar.add('seastar/scripts')
 ar.add('seastar/dpdk/usertools')
 ar.add('install.sh')
+# scylla_post_install.sh lives at the top level together with install.sh in the src tree, but while install.sh is
+# not distributed in the .rpm and .deb packages, scylla_post_install is, so we'll add it in the package
+# together with the other scripts that will end up in /usr/lib/scylla
 ar.add('scylla_post_install.sh', arcname="dist/common/scripts/scylla_post_install.sh")
 ar.add('scripts/relocate_python_scripts.py', arcname='relocate_python_scripts.py')
 ar.add('README.md')
diff --git a/scylla_post_install.sh b/scylla_post_install.sh
index 04594343a8..eeabedf634 100755
--- a/scylla_post_install.sh
+++ b/scylla_post_install.sh
@@ -1,4 +1,24 @@
 #!/bin/bash
+#
+# Copyright (C) 2019 ScyllaDB
+#
+
+#
+# This file is part of Scylla.
+#
+# Scylla is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Scylla is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Scylla. If not, see .
+#
 
 if [ ! -d /run/systemd/system ]; then
     exit 0
@@ -35,5 +55,19 @@ EOS
         fi
     fi
 
+    # For systems with not a lot of memory, override default reservations for the slices
+    # seastar has a minimum reservation of 1.5GB that kicks in, and 21GB * 0.07 = 1.5GB.
+    # So for anything smaller than that we will not use percentages in the helper slice
+    MEMTOTAL_BYTES=$(cat /proc/meminfo | grep MemTotal | awk '{print $2 * 1024}')
+    if [ $MEMTOTAL_BYTES -lt 23008753371 ]; then
+        mkdir -p /etc/systemd/system/scylla-helper.slice.d/
+        cat << EOS > /etc/systemd/system/scylla-helper.slice.d/memory.conf
+[Slice]
+MemoryHigh=1200M
+MemoryMax=1400M
+MemoryLimit=1400M
+EOS
+    fi
+
     systemctl --system daemon-reload >/dev/null || true
 fi
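
Note on the 23008753371 constant in scylla_post_install.sh: it appears to be the total memory at which 7% of RAM equals the 1.5GB minimum reservation mentioned in the comment, taking "1.5GB" to mean 1.5 GiB (an assumption, the patch does not say so). A minimal sketch of that arithmetic follows; the function names helper_threshold_bytes and needs_memory_override are hypothetical and not part of the patch.

```shell
#!/bin/bash
# Sketch only: reproduces the small-machine threshold from the patch comment
# "21GB * 0.07 = 1.5GB". Assumes 1.5GB means 1.5 GiB (1.5 * 2^30 bytes).

# Hypothetical helper: total memory (bytes) at which the 7% default
# reservation equals the 1.5 GiB minimum reservation.
helper_threshold_bytes() {
    local reserve_bytes=$(( 3 * 1024 * 1024 * 1024 / 2 ))  # 1.5 GiB
    local reserve_pct=7                                     # 7% default reservation
    echo $(( reserve_bytes * 100 / reserve_pct ))           # integer division
}

# Hypothetical helper: succeeds when a machine with $1 bytes of memory would
# get the fixed-size override instead of the percentage-based limits.
needs_memory_override() {
    [ "$1" -lt "$(helper_threshold_bytes)" ]
}

helper_threshold_bytes
```

Under that assumption the function prints 23008753371, so a 16 GiB machine would get the fixed 1200M/1400M limits while a 32 GiB machine would keep the 4%/5% defaults from scylla-helper.slice.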