scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Botond Dénes	69150f0680	Merge 'Fix edge case issues related to tablet draining ' from Tomasz Grabiec Main problem: If we're draining the last node in a DC, we won't have a chance to evaluate candidates and notice that constraints cannot be satisfied (N < RF). Draining will succeed and the node will be removed with replicas still present on that node. This will cause later draining in the same DC to fail when we will have 2 replicas which need relocation for a given tablet. The expected behavior is for draining to fail, because we cannot keep the RF in the DC. This is consistent, for example, with what happens when removing a node in a 2-node cluster with RF=2. Fixes #21826 Secondary problem: We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state. Third problem: We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN. The overload part is not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes. Closes scylladb/scylladb#21928 * github.com:scylladb/scylladb: tablets: load_balancer: Fail when draining with no candidate nodes tablets: load_balancer: Ignore skip_list when draining tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained	2025-01-07 13:04:00 +02:00
Botond Dénes	173fad296a	tools/schema_loader.cc: remove duplicate include of short_streams.hh Closes scylladb/scylladb#21982	2025-01-07 13:03:17 +02:00
David Garcia	66a5e7f672	docs: update Sphinx configuration for unified repository publishing This change is related to the unification of enterprise and open-source repositories. The Sphinx configuration is updated to build documentation either for `docs.scylladb.com/manual` or `opensource.docs.scylladb.com`, depending on the flag passed to Sphinx. By default, it will build docs for `docs.scylladb.com/manual`. If the `opensource` flag is passed, it will build docs for `opensource.docs.scylladb.com`, with a different set of versions. This change will prepare the configuration to publish to `docs.scylladb.com/manual` while allowing the option to keep publishing and editing docs with a different multiversion configuration. Note that this change will continue publishing docs to `opensource.docs.scylladb.com` for now since the `opensource` flag is being passed in the `gh-pages.yml` branch. chore: remove comment chore: update project name Closes scylladb/scylladb#22089	2025-01-07 12:54:51 +02:00
Kefu Chai	e4463b11af	treewide: replace boost::algorithm::join() with fmt::join() Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve performance and reduce dependency on Boost. `fmt::join()` allows direct formatting of ranges and tuples with custom separators without creating intermediate strings. When formatting comma-separated values into another string, fmt::join() avoids the overhead of temporary string creation that `boost::algorithm::join()` requires. This change also helps streamline our dependencies by leveraging the existing fmt library instead of Boost.Algorithm. To avoid the ambiguity, some caller sites were updated to call `seastar::format()` explicitly. See also - boost::algorithm::join(): https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp - fmt::join(): https://fmt.dev/11.0/api/#ranges-api Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22082	2025-01-07 12:45:05 +02:00
Aleksandra Martyniuk	a91e03710a	repair: check tasks local to given shard Currently task_manager_module::is_aborted checks the tasks local to caller's shard on a given shard. Fix the method to check the task map local to the given shard. Fixes: #22156. Closes scylladb/scylladb#22161	2025-01-06 21:53:54 +02:00
Kefu Chai	d3f3e2a6c8	.github: add more subdirectories to CLEANER_DIR in order to prevent future inclusion of unused headers, let's include - mutation_writer - node_ops - redis - replica subdirectories to CLEANER_DIR, so that this workflow can identify the regressions in future. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22050	2025-01-06 21:28:39 +02:00
Avi Kivity	5653d13d48	Merge 'Clean up test/alternator mistakes that service levels introduced' from Nadav Har'El The recent pull request https://github.com/scylladb/scylladb/pull/22031 introduced some regressions into the test/alternator framework. For a long time now, tests can create their own CQL roles for testing role-based features. But the new service levels test changed the "run" script and test.py's "suite.yaml" to create a new role and service level just for one test. This is not only ugly (the test code is now split across two places) and unnecessary, this setup also means that you can't run this test against an already-running copy of Scylla which wasn't prepared with the "right" role and service level. Even worse - the code that was added to test/alternator/run was plain wrong - it used an outdated keyspace name (the code in suite.yaml was fine). So in this patch I remove that extra run and suite.yaml code, and replace it by code inside the service level test to create the role and service level that it wants to test rather than assume it already exists. While at it, I also removed a lot of duplicate and unnecessary code from this test. After this patch, test/alternator/run returns to work correctly, after #22031 broke it. This patch fixes a recent testing-framework regression, so doesn't need to be backported (unless that regression is backported). Fixes #22047. Closes scylladb/scylladb#22172 * github.com:scylladb/scylladb: test/alternator: fix mistakes introduced with test_service_levels.py test/alternator: move "cql" fixture to test/alternator/conftest.py	2025-01-06 17:44:25 +02:00
Anna Stuchlik	047ce13641	doc: add a new KB article about tombstone garbage collection in ICS Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22174	2025-01-06 16:48:50 +02:00
Kefu Chai	8873a4e1aa	test.py: pass "count" to re.sub() with kwarg since Python 3.13, passing count to `re.sub()` as positional argument has been deprecated. and when running `test.py` with Python 3.13, we get the following warning: ``` /home/kefu/dev/scylladb/./test.py:1540: DeprecationWarning: 'count' is passed as positional argument args.modes = re.sub(r'.* List configured modes\n(.*)\n', r'\1', ``` see also https://github.com/python/cpython/issues/56166 in order to silence this distracting warning, let's pass `count` using kwarg. this change was created in the same spirit of `c3be4a36af`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22085	2025-01-06 16:35:38 +02:00
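The keyword form described in this commit can be illustrated with a minimal sketch. The pattern is the one quoted in the warning above; the input string is made up for illustration, and the actual call in `test.py` may pass additional arguments not shown here.

```python
import re

# Hypothetical sample input; the real text comes from the build tooling.
output = "ninja: List configured modes\ndebug release dev\n"

# Pattern and replacement as quoted in the DeprecationWarning above.
pattern = r'.* List configured modes\n(.*)\n'

# Passing count by keyword avoids the Python 3.13 DeprecationWarning that is
# raised when count is given as the fourth positional argument.
modes = re.sub(pattern, r'\1', output, count=1)

print(modes.split(' '))
```

Older Python versions accept the keyword form as well, so the change does not require a version check.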
Avi Kivity	4632e217e3	cql3: grammar: simplify unaliasedSelector production The return variable s only gets a value by assignment from the temporary tmp. Make tmp the return value instead. Closes scylladb/scylladb#22151	2025-01-06 13:06:12 +02:00
Kefu Chai	9396c2ee6c	api: include "smaller" header Previously, `api/service_levels.hh` includes `api/api.hh` for accessing symbols like `api/http_context`. but these symbols are already available in a "smaller" header -- `api/api_init.hh`. so, in order to improve the build efficiency, let's include smaller headers in favor of "larger" ones. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22178	2025-01-06 13:04:33 +02:00
Nadav Har'El	fc22d5214f	Merge 'test.py: check for existence of combined test with correct path' from Kefu Chai test.py: Only check existence of Scylla executable Previously, we had inconsistent behavior around missing executables: - `561e88f0` added early failure if any executable was missing - `8b7a5ca8` added a partial skip for combined_test, but didn't properly handle build paths and artifacts This change: 1. Moves executable existence check to PythonTestSuite class 2. Only adds combined_test suite when the executable exists 3. Eliminates redundant os.access() checks 4. Corrects the path to combined_test when checking for its existence This allows running tests with a partial build while properly handling missing executables, particularly for the combined_test suite. Fixes scylladb/scylladb#22086 --- no need to backport, because the offending commit (`8b7a5ca88d`) is not included by any LTS branches yet. Closes scylladb/scylladb#22163 * github.com:scylladb/scylladb: test.py: Fix path checking for combined_test executable test.py: Throw only if scylla executable is not found	2025-01-06 09:21:01 +02:00
Nadav Har'El	e919794db8	test/alternator: fix mistakes introduced with test_service_levels.py This patch undoes multiple mistakes done when introducing the test for service levels in pull request #22031: 1. The PR introduced in test/alternator/run and test/alternator/suite.yaml a permanent role and service level that the service-level test is supposed to use. This was a mistake - the test can create the service level for its own use, using CQL, it does not need to assume such a service level already exists. It's important to fix this to allow the service level test to run against an installation of Scylla not set up by our own scripts. Moreover, while the code in suite.yaml was correct, the code in "run" was incorrect (used an outdated keyspace name). This patch removes that incorrect code. 2. The PR introduced a duplicate "cql" fixture, copied verbatim from test_cql_rbac.py (including a comment that was correct only in the latter file :-)). Let's de-duplicate it, using the fixture that I moved to conftest.py in the previous patch. 3. The PR used temporary_grant(). This needlessly complicated the test and added even more duplicate code, and this patch removes all that stuff. This test is about service levels, not RBAC and "grant". This test should just use a superuser role that has the permissions to do everything, and doesn't need to be granted specific permissions. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-01-05 19:40:14 +02:00
Nadav Har'El	879c0a3bd6	test/alternator: move "cql" fixture to test/alternator/conftest.py Most Alternator tests use only the DynamoDB API, not CQL. Tests in test_cql_rbac.py did need CQL to set up roles and RBAC, so this file introduced a "cql" fixture to make CQL requests. A recently-introduced test/alternator/test_service_levels.py also needs access to CQL - it currently uses it for misguided reasons but the next patch will need it for creating a role and a service level. So instead of duplicating this fixture, let's move this fixture into test/alternator/conftest.py that all Alternator tests can share. The next patch will clean up this duplication in test_service_levels.py and the other mistakes it introduced. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-01-05 19:33:55 +02:00
Kefu Chai	569f8e9246	treewide: fix misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22154	2025-01-05 16:13:09 +02:00
Raphael S. Carvalho	c973254362	Introduce incremental compaction strategy (ICS) ICS is a compaction strategy that inherits size tiered properties -- therefore it's write optimized too -- but fixes its space overhead of 100% due to input files being only released on completion. That's achieved with the concept of sstable run (similar in concept to LCS levels) which breaks a large sstable into fixed-size chunks (1G by default), known as run fragments. ICS picks similar-sized runs for compaction, and fragments of those runs can be released incrementally as they're compacted, reducing the space overhead to about (number_of_input_runs * 1G). This allows user to increase storage density of nodes (from 50% to ~80%), reducing the cost of ownership. NOTE: test_system_schema_version_is_stable adjusted to account for batchlog using IncrementalCompactionStrategy contains: compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh) compaction/CMakeLists.txt: include ICS cc files configure.py: changes for ICS files, includes test db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported db/system_keyspace: pick ICS for some system tables schema/schema.hh: ICS becomes default test/boost: Add incremental_compaction_test.cc test/boost/sstable_compaction_test.cc: ICS related changes test/cqlpy/test_compaction_strategy_validation.py: ICS related changes docs/architecture/compaction/compaction-strategies.rst: changes to ICS section docs/cql/compaction.rst: changes to ICS section docs/cql/ddl.rst: adds reference to ICS options docs/getting-started/system-requirements.rst: updates sentence mentioning ICS docs/kb/compaction.rst: changes to ICS section docs/kb/garbage-collection-ics.rst: add file docs/kb/index.rst: add reference to <garbage-collection-ics> docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section some relevant commits throughout the ICS history: commit 
434b97699b39c570d0d849d372bf64f418e5c692 Merge: 105586f747 30250749b8 Author: Paweł Dziepak <pdziepak@scylladb.com> Date: Tue Mar 12 12:14:23 2019 +0000 Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael " Introduce new compaction strategy which is essentially like size tiered but will work with the existing incremental compaction. Thus incremental compaction strategy. It works like size tiered, but each element composing a tier is a sstable run, meaning that the compaction strategy will look for N similar-sized sstable runs to compact, not just individual sstables. Parameters: * "sstable_size_in_mb": defines the maximum sstable (fragment) size composing a sstable run, which impacts directly the disk space requirement which is improved with incremental compaction. The lower the value the lower the space requirement for compaction because fragments involved will be released more frequently. * all others available in size tiered compaction strategy HOWTO ===== To change an existing table to use it, do: ALTER TABLE mykeyspace.mytable WITH compaction = {'class' : 'IncrementalCompactionStrategy'}; Set fragment size: ALTER TABLE mykeyspace.mytable WITH compaction = {'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 } " commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1 Merge: dca89ce7a5 e08ef3e1a3 Author: Avi Kivity <avi@scylladb.com> Date: Tue Sep 8 11:31:52 2020 +0300 Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael " A new option, space_amplification_goal (SAG), is being added to ICS. This option will allow ICS user to set a goal on the space amplification (SA). It's not supposed to be an upper bound on the space amplification, but rather, a goal. This new option will be disabled by default as it doesn't benefit write-only (no overwrites) workloads and could hurt severely the write performance. 
The strategy is free to delay triggering this new behavior, in order to increase overall compaction efficiency. The graph below shows how this feature works in practice for different values of space_amplification_goal: https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png When strategy finds space amplification crossed space_amplification_goal, it will work on reducing the SA by doing a cross-tier compaction on the two largest tiers. This feature works only on the two largest tiers, because taking into account others could hurt the compaction efficiency which is based on the fact that the more similar-sized sstables are compacted together the higher the compaction efficiency will be. With SAG enabled, min_threshold only plays an important role on the smallest tiers, given that the second-largest tier could be compacted into the largest tier for a space_amplification_goal value < 2. By making the options space_amplification_goal and min_threshold independent, user will be able to tune write amplification and space amplification, based on the needs. The lower the space_amplification_goal the higher the write amplification, but by increasing the min threshold, the write amplification can be decreased to a desired amount. " commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1 Author: Raphael S. Carvalho <raphaelsc@scylladb.com> Date: Sat Oct 16 13:41:46 2021 -0300 compaction: ICS: Add garbage collection Today, ICS lacks an approach to purge expired tombstones in a timely manner, which is a problem because accumulation of tombstones is known to affect latency considerably. For an expired tombstone to be purged, it has to reach the top of the LSM tree and hope that older overlapping data wasn't introduced at the bottom. The conditions are there and must be satisfied to avoid data resurrection. 
STCS, today, has an inefficient garbage collection approach because it only picks a single sstable, which satisfies the tombstone density threshold and file staleness. That's a problem because overlapping data either on same tier or smaller tiers will prevent tombstones from being purged. Also, nothing is done to push the tombstones to the top of the tree, for the conditions to be eventually satisfied. Due to incremental compaction, ICS can more easily have an efficient GC by doing cross-tier compaction of relevant tiers. The trigger will be file staleness and tombstone density, whose threshold values can be configured by tombstone_compaction_interval and tombstone_threshold, respectively. If ICS finds a tier which meets both conditions, then that tier and the larger[1] and closest-in-size[2] tier will be compacted together. [1]: A larger tier is picked because we want tombstones to eventually reach the top of the tree. [2]: It also has to be the closest-in-size tier as the smaller the size difference the higher the efficiency of the compaction. We want to minimize write amplification as much as possible. The staleness condition is there to prevent the same file from being picked over and over again in a short interval. With this approach, ICS will be continuously working to purge garbage while not hurting overall efficiency on a steady state, as same-tier compactions are prioritized. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22063	2025-01-04 15:43:52 +02:00
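The garbage-collection trigger described in this commit message (a tier that is both stale and tombstone-dense is compacted together with the next larger, closest-in-size tier) can be sketched roughly as follows. This is an illustration only, not the ICS implementation: the `Tier` class, its fields, and `pick_gc_tiers` are hypothetical names, and returning `None` when the qualifying tier is already the largest is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    # All fields are hypothetical stand-ins for what the strategy tracks.
    size_bytes: int
    tombstone_ratio: float  # estimated fraction of droppable tombstones
    age_seconds: float      # time since the tier's files were written


def pick_gc_tiers(tiers, tombstone_threshold, tombstone_compaction_interval):
    """Return (tier, next_larger_tier) for cross-tier GC, or None."""
    by_size = sorted(tiers, key=lambda t: t.size_bytes)
    for i, tier in enumerate(by_size):
        stale = tier.age_seconds >= tombstone_compaction_interval
        dense = tier.tombstone_ratio >= tombstone_threshold
        # The next entry in size order is both larger and closest in size.
        if stale and dense and i + 1 < len(by_size):
            return tier, by_size[i + 1]
    return None


tiers = [
    Tier(size_bytes=100, tombstone_ratio=0.5, age_seconds=100_000),
    Tier(size_bytes=1000, tombstone_ratio=0.0, age_seconds=0),
    Tier(size_bytes=400, tombstone_ratio=0.0, age_seconds=0),
]
picked = pick_gc_tiers(tiers, tombstone_threshold=0.2,
                       tombstone_compaction_interval=86_400)
```

Here the smallest tier qualifies, and its partner is the 400-byte tier rather than the 1000-byte one, matching the closest-in-size rule that keeps write amplification low.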
Kefu Chai	220cafe7c4	test.py: Fix path checking for combined_test executable Previously in `8b7a5ca88d`, we checked for combined_test existence without the "build" component in the path. This caused the test suite to never find the executable, preventing the test cases' cache from being populated. Changes: 1. Use path_to() to check executable existence, which: - Includes the "build" component in path - Handles both CMake and configure.py build paths 2. Move existence check out of _generate_cache() for clarity This ensures combined_test and its included tests are properly discovered and run. Fixes scylladb/scylladb#22086 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-04 06:11:21 +08:00
Kefu Chai	9d0f27e7c1	test.py: Throw only if scylla executable is not found Previously, we had inconsistent behavior around missing executables: - `561e88f0` added early failure if any executable was missing - `8b7a5ca8` added a partial skip for combined_test, but didn't properly handle build paths and artifacts This change: 1. Moves executable existence check to PythonTestSuite class 3. Eliminates redundant os.access() checks This allows running tests with a partial build while properly handling missing executables, particularly for the combined_test suite. In a succeeding change, we will correct the check for combined_tests. Refs scylladb/scylladb#19489 Refs scylladb/scylladb#22086 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-04 06:11:21 +08:00
Tomasz Grabiec	4c89e62470	Merge 'Phased barrier improvements' from Benny Halevy - utils: phased_barrier: advance_and_await: allocate new gate only when needed - utils: phased_barrier: add close() method - and use in existing services * Improvement. No backport needed Closes scylladb/scylladb#22018 * github.com:scylladb/scylladb: utils: phased_barrier: add close() method utils: phased_barrier: advance_and_await: allocate new gate only when needed	2025-01-03 18:51:23 +01:00
Avi Kivity	202f16e799	Merge 'Introduce workload prioritization for service levels' from Piotr Dulikowski This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of the user which has this service level attached (if multiple are attached then the one with the lowest shares wins). Different service levels will be isolated in the following way: - Each service level gets its own scheduling group with the number of shares (corresponding to the service level's number of shares), which controls the priority of the CPU and I/O used for user operations running on that service level. - Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates. - Each service level gets its own TCP connections for RPC to prevent priority inversion issues. Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user created service levels + 1 created by default that cannot be removed. This feature has been previously only available in ScyllaDB Enterprise but has been made available for the source available ScyllaDB. The series was created by comparing the master branch with source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made such as fixing whitespace, removing unused or unnecessary code, adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made. No backport is required. 
Closes scylladb/scylladb#22031 * github.com:scylladb/scylladb: tracing: record scheduling group in trace event record qos: un-shared-from-this standard_service_level_distributed_data_accessor alternator: execute under scheduling group for service level test.py: support multiple commands in prepare_cql in suite.yml docs: add documentation for workload prioritization docs/dev: describe workload prioritization features in service_levels test/auth_cluster: test workload prioritization in service level tests cqlpy/test_service_levels: add workload prioritization tests api: introduce service levels specific API api/cql_server_test: add information about scheduling group db/virtual_tables: add scheduling group column to system.clients test/boost: update service_level_controller_test for workload prio qos: include number of shares in DESCRIBE cql3/statements: update SL statements for workload prioritization transport/server: use scheduling group assigned to current user messaging_service: use separate set of connections per service levels replica/database: add reader concurrency semaphore groups qos: manage and assign scheduling groups to service levels qos: use the shares field in service level reads/writes qos: add shares to service_level_options qos: explicitly specify columns when querying service level tables db/system_distributed_keyspace: add shares column and upgrade code db/system_keyspace: adjust SL schema for workload prioritization gms: introduce WORKLOAD_PRIORITIZATION cluster feature build: increase the max number of scheduling groups qos: return correct error code when SL does not exist	2025-01-02 20:05:36 +02:00
Kefu Chai	0ea8cd2bb8	test/pylib/minio_server: use error level for fatal errors Previously fatal errors like missing Minio executable were logged at INFO level, which could be filtered out by log settings. Switch to ERROR level to ensure these critical issues are always visible to developers. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22084	2025-01-02 20:03:55 +02:00
Botond Dénes	7d42b80228	service/storage_proxy: data_read_resolver::resolve(): remove unneeded maybe_yield() We already have a yield in the loop via apply_gently(), the maybe_yield is superfluous so remove it. Follow-up to https://github.com/scylladb/scylladb/pull/21884 Closes scylladb/scylladb#21984	2025-01-02 16:13:29 +01:00
Kefu Chai	de42dce4c4	pgo: use java-11 when running cassandra-stress we updated tools/java/build.xml recently to only build for java-11. so if - the `java` executable in `$PATH` points to a java which is neither java-8 nor java-11. - java-8 is installed java-8 is used to execute the cassandra-stress tool. and we would get the following failure: ``` Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:473) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621) ``` in order to be compatible with the bytecode targeting java-11, let's run cassandra-stress with java-11. we do not need to support java-8, because the new tools/java is now building cassandra-stress targeting java-11 jre. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22142	2025-01-02 16:56:29 +02:00
Artsiom Mishuta	174199610b	test.py: add more log info if the server is broken A server_broken_reason attribute was added to the server to store the raw information about why the server was broken. Additional information was added to the error messages in the "server broken" case. Fixes: #21630 Closes scylladb/scylladb#22074	2025-01-02 16:54:55 +02:00
Kefu Chai	233e3969c4	utils: correct misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22143	2025-01-02 16:47:57 +02:00
Avi Kivity	1ce373d80b	schema: deinline some speculative_retry methods These string conversion functions are not in any fast path. Deinlining them moves a <boost/lexical_cast.hpp> include out of a common header file. Some files accessed boost::iterator_range via lexical_cast.hpp, so they gain a new dependency. Closes scylladb/scylladb#21950	2025-01-02 12:28:33 +01:00
Avi Kivity	051c310f02	tracing: record scheduling group in trace event record We have a "thread" field (unfortunately not yet displayed in cqlsh, but visible in the table) that records the shard on which a particular event was recorded. Record the scheduling group as well, as this can be useful to understand where the query came from. (cherry picked from commit 3c03b5f66376dca230868e54148ad1c6a1ad0ee2)	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	07fdf9d21f	qos: un-shared-from-this standard_service_level_distributed_data_accessor Apparently, it is not needed for standard_service_level_distributed_data_accessor to derive from enable_shared_from_this.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	b23bc3a5d5	alternator: execute under scheduling group for service level Now, the Alternator API requests are executed under the correct scheduling group of the service level assigned to the currently logged in user.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	67b11e846a	test.py: support multiple commands in prepare_cql in suite.yml This will be needed for alternator tests introduced in the next commit, which will have to execute multiple CQL operations during preparation.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	07b162fb5b	docs: add documentation for workload prioritization The doc pages were slightly adjusted during migration not to mention Scylla Enterprise and to fix some whitespace issues.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	241e710c19	docs/dev: describe workload prioritization features in service_levels The concept of shares, and some helper HTTP APIs, are now described in the developer documentation for service levels.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	473bb44722	test/auth_cluster: test workload prioritization in service level tests Update `test_connections_parameters_auto_update` to also check that the scheduling group of given connections is appropriately changed when a different service level is assigned to the user that the connection uses for authentication. Apart from that, more tests are added: - Check for the logic that forbids setting shares for a service level until all nodes in the cluster are upgraded - Test for handling the case when there are more scheduling groups than allowed (it might happen after upgrade from a non-workload-prio version) - Regression test for a bug where fewer scheduling groups could have been created than allowed due to some metrics not being renamed on scheduling group name change.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	29b153c9e7	cqlpy/test_service_levels: add workload prioritization tests Adjust existing cqlpy tests and add more in order to test the workload prioritization feature: - The DESCRIBE test is updated to check that generated statements contain information about shares - Two tests for shares in the LIST EFFECTIVE SERVICE LEVEL statement - Regression test which checks that we can create as many service levels as promised in the documentation (currently 7), but no more - Test which checks that NULL shares in the service levels table are treated as the default 1000 shares	2025-01-02 07:13:34 +01:00
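Two rules stated in this log can be combined in a small sketch: NULL shares are treated as the default 1000 shares (this commit), and when several service levels are attached to a user, the one with the lowest shares wins (the merge description of the series). The function name and the list-of-values input shape are hypothetical; how ScyllaDB actually resolves the effective service level is not shown here.

```python
DEFAULT_SHARES = 1000  # default stated in the commit message above


def effective_shares(attached_shares):
    """Hypothetical helper: attached_shares holds the shares value (int or
    None for NULL) of each service level attached to one user."""
    normalized = [DEFAULT_SHARES if s is None else s for s in attached_shares]
    # The service level with the lowest shares wins.
    return min(normalized)


print(effective_shares([None, 200]))
print(effective_shares([None, 1500]))
```

An empty list raises `ValueError` from `min()`; what a user with no attached service level gets is outside the scope of this sketch.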
Piotr Dulikowski	49f5fc0e70	api: introduce service levels specific API Introduces two endpoints with operations specific to service levels: - switch_tenants: updates the scheduling group of all connections to be aligned with the service level specific to the logged in user. This is mostly legacy API, as with service levels on raft this is done automatically. - count_connections: for each user and for each scheduling group, counts how many connections are assigned to that user and scheduling group. This API is used in tests.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	a65c0c3735	api/cql_server_test: add information about scheduling group Now, information about connections' scheduling group is included in the HTTP API for querying information about connections' parameters.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	9319d65971	db/virtual_tables: add scheduling group column to system.clients Add the "scheduling_group" column to the system.clients table which names the scheduling group that currently serves the connection/client.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	bbc655ff32	test/boost: update service_level_controller_test for workload prio Adjust some of the existing tests in service_level_controller_test.cc and add some more in order to test the workload prioritization features, i.e. the service level shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ce4032dfc0	qos: include number of shares in DESCRIBE Now, the CREATE statements generated for each service level by the DESCRIBE SCHEMA WITH INTERNALS statement will account for the service level's shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	0f62eb45d1	cql3/statements: update SL statements for workload prioritization Introduce the "SHARES" keyword which can be used in conjunction with existing CQL statements related to the service levels. Adjust the CQL statements for service levels: - CREATE/ALTER now allow setting shares (only if the cluster is fully upgraded) - LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new column - LIST SERVICE LEVEL(S) also returns the number of shares, and has the additional column "percentage of all service level shares"	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	6d90a933cd	transport/server: use scheduling group assigned to current user Now, when the user logs in and the connection becomes authenticated, the processing loop of the connection is switched to the scheduling group that corresponds to the service level assigned to the logged in user. The scheduling group is also updated when the service level assigned to this user changes. Starting from this commit, the scheduling groups managed by the service level controller are actually being used by user workload.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	f1b9737e07	messaging_service: use separate set of connections per service levels In order to make sure that the scheduling group carries over RPC, and also to prevent priority inversion issues between different service levels, modify the messaging service to use separate RPC connections for each service level in order to serve user traffic. The above is achieved by reusing the existing concept of "tenants" in messaging service: when a new service level (or, more accurately, service-level specific scheduling group) is first used in an RPC, a new tenant is created. In addition, extend the service level controller to be able to quickly look up the service level name of the currently active scheduling group in order to speed up the logic for choosing the tenant.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned to some pool of memory on startup and dynamically distributes this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as make memory allocation more fair between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	4cfd26efaf	qos: manage and assign scheduling groups to service levels Introduce the core logic of workload prioritization, responsible for assigning scheduling groups to service levels. The service level controller maintains a pool of scheduling groups for the currently present service levels, as well as a pool of unused scheduling groups which were previously used by some service level that was deleted during the node's lifetime. When a new service level is created, the SL controller either assigns a scheduling group from the unused SG pool, or creates a new one if the pool is empty. The scheduling group is renamed to "sl:<scheduling group name>". When updating shares of a service level (and also when creating a new service level), the shares of the corresponding scheduling group are synchronized with those of the service level. When a service level is deleted, its group is released to the aforementioned pool of unused scheduling groups and the prefix of its name is changed from "sl:" to "sl_deleted:". For now, these scheduling groups are not used by any user operations. This will be changed in subsequent commits.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ff51551a94	qos: use the shares field in service level reads/writes Now, the newly introduced `shares` field is used when service levels are either read from or written into system tables.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	a6f681029f	qos: add shares to service_level_options Add service level shares related fields to service_level_options and slo_effective_names structs, and adjust the existing methods of the former (merge_with, init_effective_names) to account for them.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	2eb35f37d0	qos: explicitly specify columns when querying service level tables The service levels table is queried with a `SELECT * ...` query, by using the `execute_internal` method which prepares and caches the query in a special cache for internal queries, separate from the user query cache. During rolling upgrade from a version which does not support service level shares to the one that does, the `shares` column is added. The aforementioned internal query cache is _not_ invalidated on schema change, so the cache might still contain the prepared query from the time before the column was added, and that prepared query will fetch the old set of columns without the new `shares` column. In order to solve this, explicitly specify the columns in the query string, using the full set of column names from the time when the query is executed. Note that this is a problem only for the legacy, non-raft service levels. Raft-based service levels use a local table for which the schema is determined on startup. Also note that this code only fetches values from the `shares` column but does not make any use of it otherwise. It will be handled by later commits in this series.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ea25b29684	db/system_distributed_keyspace: add shares column and upgrade code Add the "shares" column to the system_distributed_keyspace.service_levels table, which is used by legacy code. Because this table is in a distributed and not local keyspace, adding the column to an existing cluster during rolling upgrade requires a bit of care. A callback is added to the workload prioritization cluster feature which runs when the feature becomes enabled and adds the column for all nodes in the cluster.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	346fc84c3e	db/system_keyspace: adjust SL schema for workload prioritization Add a "shares" column which holds the number of shares allocated to a given service level. It is not used by the code at all right now; subsequent commits will make good use of it.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ecbf8721de	gms: introduce WORKLOAD_PRIORITIZATION cluster feature Information about the number of shares per service level will be stored in an additional column in the service levels table, which is managed through group0. We will need the feature to make sure that all nodes in the cluster know about the new column before any node starts applying group0 commands that would touch the new column. This feature also serves a role for the legacy service levels implementation that uses system_distributed for storage: after all nodes are upgraded to support workload prioritization, one of the nodes will perform a schema change operation and will add the new column.	2025-01-02 07:13:34 +01:00
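
The scheduling group pooling described in commit 4cfd26efaf can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual ScyllaDB code: the class and method names are hypothetical, scheduling groups are modelled as plain dictionaries, and the "sl:" prefix is assumed to be followed by the service level name. The default of 1000 shares is taken from the cqlpy test commit 29b153c9e7.

```python
class SchedulingGroupPool:
    """Toy model of the pooling in commit 4cfd26efaf (all names hypothetical)."""

    def __init__(self):
        self.active = {}  # service level name -> scheduling group (a dict here)
        self.unused = []  # groups released by deleted service levels

    def create_service_level(self, name, shares=1000):
        # Reuse a released group if one exists; otherwise create a new one.
        group = self.unused.pop() if self.unused else {}
        # Assumption: the group is named after the service level it serves.
        group["name"] = f"sl:{name}"
        group["shares"] = shares
        self.active[name] = group
        return group

    def set_shares(self, name, shares):
        # Keep the group's shares in sync with the service level's shares.
        self.active[name]["shares"] = shares

    def delete_service_level(self, name):
        # Release the group to the unused pool and swap the name prefix.
        group = self.active.pop(name)
        group["name"] = "sl_deleted:" + group["name"][len("sl:"):]
        self.unused.append(group)
```

In this model, deleting a service level and then creating another one hands back the same group object under a new "sl:" name, which mirrors the reuse behaviour the commit message describes.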

46050 Commits