scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 00:02:37 +00:00

Author	SHA1	Message	Date
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Dario Mirovic	120f381a9d	pgo: fix maintenance socket path too long Maintenance socket path used for PGO is in the node workdir. When the node workdir path is too long, the maintenance socket path (workdir/cql.m) can exceed the Unix domain socket sun_path limit and fail the PGO training pipeline. To prevent this: - pass an explicit --maintenance-socket override pointing to a short deterministic path in /tmp derived from the MD5 hash of the workdir maintenance socket path - update maintenance_socket_path to return the matching short path so that exec_cql.py connects to the right socket The short path socket files are cleaned up after the cluster stops. The path uses the MD5 hash of the workdir path, so it is deterministic. Fixes SCYLLADB-1070 Closes scylladb/scylladb#29149	2026-03-24 09:17:10 +01:00
Dario Mirovic	5d51501a0b	pgo: use maintenance socket for CQL setup in PGO training The default 'cassandra' superuser was removed from ScyllaDB, which broke PGO training. exec_cql.py relied on username/password auth ('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql and counters.cql. Switch exec_cql.py to connect via the Unix domain maintenance socket instead. The maintenance socket bypasses authentication, so no credentials are needed. Additionally, create the 'cassandra' superuser via the maintenance socket during the populate phase, so that cassandra-stress keeps working. cassandra-stress hardcodes user=cassandra password=cassandra. Changes: - exec_cql.py: replace host/port/username/password arguments with a single --socket argument; add connect_maintenance_socket() with wait-until-ready logic - pgo.py: add maintenance_socket_path() helper; update populate_auth_conns() and populate_counters() to pass the socket path to exec_cql.py Fixes SCYLLADB-1070 Closes scylladb/scylladb#29081	2026-03-19 16:52:36 +02:00
Botond Dénes	25db8f6a70	pgo/pgo.py: don't mutate input params It is considered a dangerous practice with possible unintended side-effects, affecting later calls to the same function. Found by CodeQL "Modification of parameter with default".	2026-01-13 08:33:17 +02:00
Marcin Maliszkiewicz	8aa2825caa	pgo: enable counters workload It was not enabled due to some cqlsh dependency missing. After 3 years it's hard to say if the thing is fixed or not, but anyway we don't need another big dependency while we already have the Python driver used extensively in tests. We use a simple wrapper file exec_cql.py, shared with the auth_conns workload, to conveniently read the needed preparation statements from the file. Additionally we switch tablets off as counters don't support them yet.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	09476a4df8	pgo: add auth connections stress workload It uses some derived roles and permissions to exercise auth code paths and also creates new connection with each stress request to exercise also transport/server.cc connection handling code.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	f2270034ec	pgo: enable auth in training clusters As it's best practice to use auth and we don't want to have 2^n configs to train we just enable auth for every workload.	2025-09-03 15:29:27 +02:00
Botond Dénes	72b2bbac4f	pgo/pgo.py: use tablet repair API for repair Since `a1d7722` tablet keyspaces are not allowed to be repaired via the old /storage_service/repair_async/{keyspace} API, instead the new /storage_service/tablets/repair API has to be used. Adjust the repair code and also add await_completion=true: the script just waits for the repair to finish immediately after starting it. Closes scylladb/scylladb#25455	2025-08-12 20:32:19 +03:00
Avi Kivity	29932a5af1	pgo: drop Java configuration Since `5e1cf90a51` ("build: replace tools/java submodule with packaged cassandra-stress") we run pre-packaged cassandra-stress. As such, we don't need to look for a Java runtime (which is missing on the frozen toolchain) and can rely on the cassandra-stress package finding its own Java runtime. Fix by just dropping all the Java-finding stuff. Note: Java 11 is in fact present on the frozen toolchain, just not in a way that pgo.py can find it. Fixes #24176. Closes scylladb/scylladb#24178	2025-05-26 10:16:03 +02:00
Avi Kivity	5e1cf90a51	build: replace tools/java submodule with packaged cassandra-stress We no longer use tools/java (scylladb/scylla-tools-java.git) for nodetool or cqlsh; only cassandra-stress. Since that is available in package form install that and excise the tools/java submodule from the source tree. pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh submodule). A few jmx references are dropped as well. Frozen toolchain regenerated. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23698	2025-04-15 10:11:28 +03:00
Kefu Chai	de42dce4c4	pgo: use java-11 when running cassandra-stress we updated tools/java/build.xml recently to only build for java-11. so if - the `java` executable in `$PATH` points to a java which is neither java-8 nor java-11. - java-8 is installed java-8 is used to execute the cassandra-stress tool. and we would have the following failure: ``` Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:473) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621) ``` in order to be compatible with the bytecode targeting java-11, let's run cassandra-stress with java-11. we do not need to support java-8, because the new tools/java is now building cassandra-stress targeting java-11 jre. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22142	2025-01-02 16:56:29 +02:00
Marcin Maliszkiewicz	80989556ac	pgo: add alternator workloads training This patch adds a set of alternator workloads to pgo training script. To confirm that added workloads are indeed affecting profile we can compare: ⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata Instrumentation level: IR entry_first = 0 Total functions: 105075 Maximum function count: 1079870885 Maximum internal block count: 2197851358 and ⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata Instrumentation level: IR entry_first = 0 Total functions: 105075 Maximum function count: 5240506052 Maximum internal block count: 9112894084 to see that function counters are on similar levels, they are around 5x higher for alternator but that's because it combines 5 specific sub-workloads. To confirm that final profile contains alternator functions we can inspect: ⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata (...) Instrumentation level: IR entry_first = 0 Functions shown: 356 Total functions: 105075 Number of functions with maximum count (< 100000): 97275 Number of functions with maximum count (>= 100000): 7800 Maximum function count: 7248370728 Maximum internal block count: 13722347326 we can see that 356 functions whose symbol names contain the word alternator were identified as 'hot' (with max count greater than 100'000). Running: ⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata (...) Instrumentation level: IR entry_first = 0 Functions shown: 806 Total functions: 105075 Number of functions with maximum count (< 1): 67036 Number of functions with maximum count (>= 1): 38039 Maximum function count: 7248370728 Maximum internal block count: 13722347326 we can see that 806 alternator functions were executed at least once during training. 
And finally to confirm that alternator specific PGO brings any speedups we run: for workload in read scan write write_gsi write_rmw do ./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median done results BEFORE: median 258137.51910849303 median absolute deviation: 786.06 median 547.2578202937141 median absolute deviation: 6.33 median 145718.19856685458 median absolute deviation: 5689.79 median 89024.67095807113 median absolute deviation: 1302.56 median 43708.101729598646 median absolute deviation: 294.47 results AFTER: median 303968.55333940056 median absolute deviation: 1152.19 median 622.4757636209254 median absolute deviation: 8.42 median 198566.0403745328 median absolute deviation: 1689.96 median 91696.44912842038 median absolute deviation: 1891.84 median 51445.356525664996 median absolute deviation: 1780.15 We can see that single node cluster tps increase is typically 13% - 17% with notable exceptions: the improvement for write_gsi is 3% and for the write workload a whopping 36%. The increase is on top of CQL PGO. Write workload is executed more often because it's also involved as data preparation for read and scan. Some further improvement could be to separate preparation from training as it's done for CQL but it would be a bit odd if ~3x higher counters for one flow had such a big impact. Additional disclaimers: - tests are performing exactly the same workloads as in training so there might be some bias - tests are running single node cluster, more realistic setup will likely show lower improvement Fixes https://github.com/scylladb/scylla-enterprise/issues/4066	2024-12-27 16:16:04 +08:00
Michał Chojnowski	95c8d88b96	pgo: add a repair workload This workload is added to teach PGO about repair. Tests are inconclusive about its alignment with existing workloads, because repair doesn't seem to utilize 100% of the reactor.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	1c9ce0a9ee	pgo: add a counters workload This workload is added to teach PGO about counters. Tests seem to show it's mostly aligned with existing CQL workloads. The config YAML is based on the default cassandra-stress schema.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	47dc0399cb	pgo: add a secondary index workload This workload is added to teach PGO about secondary indexes. Tests seem to show that it's mostly aligned with existing CQL workloads. The config YAML was copied from one of scylla-cluster-tests test cases.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	e67f4a5c51	pgo: add a LWT workload This workload is added to teach PGO about LWT codepaths. Tests seem to show that it's mostly aligned with existing CQL workloads. The config YAML was copied from one of scylla-cluster-tests test cases.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	e217c124a6	pgo: add a decommission workload This workload is added to teach PGO about streaming. Tests show that this workload is mostly orthogonal to CQL workloads (where "orthogonal" means that training on workload A doesn't improve workload B much, while training on workload B doesn't improve workload A much), so adding it to the training is quite important.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	65abecaede	pgo: add a clustering workload In contrast to the basic workload, this workload uses clustering keys, CK range queries, RF=1, logged batches, and more CQL types. Tests seem to show that this workload is mostly aligned with the existing basic workload (where "aligned" means that training on workload A improves workload B about as much as training on workload B). The config YAML is based on the example YAML attached to cassandra-stress sources.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	c1297dbcd2	pgo: add a basic workload This commit adds the default cassandra-stress workload to the PGO training suite.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	f73b122de3	pgo: introduce a PGO training script Profile-guided optimization consists of the following steps: 1. Build the program as usual, but with special options (instrumentation or just some supplementary info tables, depending on the exact flavor of PGO in use). 2. Collect an execution profile from the special binary by running a training workload on it. 3. Rebuild the program again, using the collected profile. This commit introduces a script automating step 2: running PGO training workloads on Scylla. The contents of training workloads will be added in future commits. The changes in configure.py responsible for steps 1. and 3. will also appear in future commits. As input, the script takes a path to the instrumented binary, a path to the output file, and a directory with (optionally) prepopulated datasets for use in training. The output profile file can be then passed to the compiler to perform a PGO build. The script currently supports two kinds of PGO instrumentation: LLVM instrumentation (binary instrumented with -fprofile-generate and -fcs-profile-generate passed to clang during compilation) and BOLT instrumentation (binary instrumented with `llvm-bolt -instrument`, with logs from this operation saved to $binary_path.boltlog). The actual training workloads for generating the profile will be added in later commits.	2024-12-27 16:16:04 +08:00
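
Commit 120f381a9d above describes replacing an over-long maintenance socket path with a short, deterministic path in /tmp derived from the MD5 hash of the workdir socket path. The following is a minimal sketch of that idea, not the code from pgo.py. The function name `short_socket_path`, the `pgo-` file prefix, and the 107-byte limit (the usable `sun_path` length commonly found on Linux) are assumptions made for illustration.

```python
import hashlib
import os

# Assumption: on Linux, sockaddr_un.sun_path holds 108 bytes including the
# terminating NUL, leaving 107 usable bytes. Other platforms differ.
SUN_PATH_MAX = 107


def short_socket_path(workdir: str, tmp_dir: str = "/tmp") -> str:
    """Return a short, deterministic socket path for the given workdir.

    Hypothetical helper: the name and the "pgo-" prefix are illustrative.
    """
    # The path the node would use by default (workdir/cql.m in the commit).
    long_path = os.path.join(workdir, "cql.m")
    # Hashing the long path yields a fixed-length, repeatable file name.
    digest = hashlib.md5(long_path.encode("utf-8")).hexdigest()
    short_path = os.path.join(tmp_dir, f"pgo-{digest}.m")
    if len(short_path.encode("utf-8")) > SUN_PATH_MAX:
        raise ValueError(f"socket path still too long: {short_path}")
    return short_path


if __name__ == "__main__":
    deep_workdir = "/home/user/" + "very-long-directory-name/" * 8 + "node1"
    print(short_socket_path(deep_workdir))
```

Because the name depends only on the workdir path, both the process that starts the node and the process that connects to it can compute the same socket path independently, which is the property the commit message relies on.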

20 Commits
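
The before/after medians quoted in commit 80989556ac can be turned into relative improvements with a few lines of arithmetic. The sketch below assumes the five medians appear in the same order as the workloads in the commit's shell loop (read, scan, write, write_gsi, write_rmw); the helper name `relative_gain_percent` is illustrative.

```python
# Assumption: medians are listed in the same order as the workloads in the
# commit's shell loop.
WORKLOADS = ["read", "scan", "write", "write_gsi", "write_rmw"]

# Median throughput values copied from the commit message.
BEFORE = [
    258137.51910849303,
    547.2578202937141,
    145718.19856685458,
    89024.67095807113,
    43708.101729598646,
]
AFTER = [
    303968.55333940056,
    622.4757636209254,
    198566.0403745328,
    91696.44912842038,
    51445.356525664996,
]


def relative_gain_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Map each workload name to its relative improvement.
GAINS = {
    name: relative_gain_percent(b, a)
    for name, b, a in zip(WORKLOADS, BEFORE, AFTER)
}

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name:<10} {gain:5.1f}%")
```

Under that ordering assumption the numbers agree with the commit's summary: about 3% for write_gsi, about 36% for write, and roughly 14% to 18% for the remaining workloads.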