Commit Graph

46050 Commits

Author SHA1 Message Date
Botond Dénes
69150f0680 Merge 'Fix edge case issues related to tablet draining ' from Tomasz Grabiec
Main problem:

If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and the node will be removed with replicas
still present on it. This will cause later draining in the same
DC to fail when we have two replicas which need relocation for a
given tablet.

The expected behavior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.

Fixes #21826

Secondary problem:

We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state.

Third problem:

We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN.

The overload part is not a problem in practice currently, since migrations will block on the global topology barrier if there are DOWN nodes.
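
A minimal sketch of the intended check, with hypothetical names (the real
load_balancer code is more involved):

```
// Hypothetical sketch; types and names are illustrative only.
#include <stdexcept>
#include <string>
#include <vector>

struct node_candidate { std::string host_id; };

// When draining, every replica on the drained node needs a new home in the
// same DC. If there are no candidate nodes (e.g. N would drop below RF),
// fail the drain up front instead of leaving replicas on a node that is
// about to enter the "left" state.
void check_drain_feasible(const std::vector<node_candidate>& dc_candidates) {
    if (dc_candidates.empty()) {
        throw std::runtime_error("draining failed: no candidate nodes in DC to satisfy RF");
    }
}
```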

Closes scylladb/scylladb#21928

* github.com:scylladb/scylladb:
  tablets: load_balancer: Fail when draining with no candidate nodes
  tablets: load_balancer: Ignore skip_list when draining
  tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
2025-01-07 13:04:00 +02:00
Botond Dénes
173fad296a tools/schema_loader.cc: remove duplicate include of short_streams.hh
Closes scylladb/scylladb#21982
2025-01-07 13:03:17 +02:00
David Garcia
66a5e7f672 docs: update Sphinx configuration for unified repository publishing
This change is related to the unification of enterprise and open-source repositories.

The Sphinx configuration is updated to build documentation either for `docs.scylladb.com/manual` or `opensource.docs.scylladb.com`, depending on the flag passed to Sphinx.

By default, it will build docs for `docs.scylladb.com/manual`. If the `opensource` flag is passed, it will build docs for `opensource.docs.scylladb.com`, with a different set of versions.

This change will prepare the configuration to publish to `docs.scylladb.com/manual` while allowing the option to keep publishing and editing docs with a different multiversion configuration.

Note that this change will continue publishing docs to `opensource.docs.scylladb.com` for now since the `opensource` flag is being passed in the `gh-pages.yml` workflow.

chore: remove comment

chore: update project name

Closes scylladb/scylladb#22089
2025-01-07 12:54:51 +02:00
Kefu Chai
e4463b11af treewide: replace boost::algorithm::join() with fmt::join()
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.

When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.

To avoid ambiguity, some call sites were updated to call
`seastar::format()` explicitly.
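
A minimal before/after sketch (assuming fmt is available with
`<fmt/ranges.h>`; the container and separator are illustrative):

```
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> names{"a", "b", "c"};

    // Before: boost::algorithm::join(names, ", ") builds a temporary string.
    // After: fmt::join() lets the formatter write elements and separators
    // directly into the output buffer.
    std::string csv = fmt::format("{}", fmt::join(names, ", "));
    fmt::print("[{}]\n", fmt::join(names, ", "));
    return 0;
}
```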

See also

- boost::algorithm::join():
  https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
  https://fmt.dev/11.0/api/#ranges-api

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22082
2025-01-07 12:45:05 +02:00
Aleksandra Martyniuk
a91e03710a repair: check tasks local to given shard
Currently task_manager_module::is_aborted checks the tasks local
to the caller's shard on a given shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161
2025-01-06 21:53:54 +02:00
Kefu Chai
d3f3e2a6c8 .github: add more subdirectories to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's add the

- mutation_writer
- node_ops
- redis
- replica

subdirectories to CLEANER_DIR, so that this workflow can identify
regressions in the future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22050
2025-01-06 21:28:39 +02:00
Avi Kivity
5653d13d48 Merge 'Clean up test/alternator mistakes that service levels introduced' from Nadav Har'El
The recent pull request https://github.com/scylladb/scylladb/pull/22031 introduced some regressions into the test/alternator framework. For a long time now, tests have been able to create their own CQL roles for testing role-based features. But the new service levels test changed the "run" script and test.py's "suite.yaml" to create a new role and service level just for one test. This is not only ugly (the test code is now split across two places) and unnecessary; this setup also means that you can't run this test against an already-running copy of Scylla which wasn't prepared with the "right" role and service level. Even worse, the code that was added to test/alternator/run was plain wrong - it used an outdated keyspace name (the code in suite.yaml was fine).

So in this patch I remove that extra run and suite.yaml code, and replace it with code inside the service level test that creates the role and service level it wants to test, rather than assuming they already exist.
While at it, I also removed a lot of duplicate and unnecessary code from this test.

After this patch, test/alternator/run works correctly again, after #22031 broke it.

This patch fixes a recent testing-framework regression, so it doesn't need to be backported (unless that regression is backported).

Fixes #22047.

Closes scylladb/scylladb#22172

* github.com:scylladb/scylladb:
  test/alternator: fix mistakes introduced with test_service_levels.py
  test/alternator: move "cql" fixture to test/alternator/conftest.py
2025-01-06 17:44:25 +02:00
Anna Stuchlik
047ce13641 doc: add a new KB article about tombstone garbage collection in ICS
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22174
2025-01-06 16:48:50 +02:00
Kefu Chai
8873a4e1aa test.py: pass "count" to re.sub() with kwarg
since Python 3.13, passing count to `re.sub()` as a positional argument
has been deprecated. when running `test.py` with Python 3.13, we
get the following warning:

```
/home/kefu/dev/scylladb/./test.py:1540: DeprecationWarning: 'count' is passed as positional argument
  args.modes = re.sub(r'.* List configured modes\n(.*)\n', r'\1',
```

see also https://github.com/python/cpython/issues/56166

in order to silence this distracting warning, let's pass
`count` as a keyword argument.

this change was created in the same spirit as c3be4a36af.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22085
2025-01-06 16:35:38 +02:00
Avi Kivity
4632e217e3 cql3: grammar: simplify unaliasedSelector production
The return variable s only gets a value by assignment from the
temporary tmp. Make tmp the return value instead.

Closes scylladb/scylladb#22151
2025-01-06 13:06:12 +02:00
Kefu Chai
9396c2ee6c api: include "smaller" header
Previously, `api/service_levels.hh` included `api/api.hh` for
accessing symbols like `api/http_context`. but these symbols are
already available in a "smaller" header -- `api/api_init.hh`. so,
in order to improve build efficiency, let's include smaller
headers in favor of "larger" ones.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22178
2025-01-06 13:04:33 +02:00
Nadav Har'El
fc22d5214f Merge 'test.py: check for existence of combined test with correct path' from Kefu Chai
test.py: Only check existence of Scylla executable

Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
  handle build paths and artifacts

This change:
1. Moves executable existence check to PythonTestSuite class
2. Only adds combined_test suite when the executable exists
3. Eliminates redundant os.access() checks
4. Corrects the path to combined_test when checking for its existence

This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.

Fixes scylladb/scylladb#22086

---

no need to backport, because the offending commit (8b7a5ca88d) is not included by any LTS branches yet.

Closes scylladb/scylladb#22163

* github.com:scylladb/scylladb:
  test.py: Fix path checking for combined_test executable
  test.py: Throw only if scylla executable is not found
2025-01-06 09:21:01 +02:00
Nadav Har'El
e919794db8 test/alternator: fix mistakes introduced with test_service_levels.py
This patch undoes multiple mistakes made when introducing the test
for service levels in pull request #22031:

1. The PR introduced in test/alternator/run and test/alternator/suite.yaml
   a permanent role and service level that the service-level test is
   supposed to use. This was a mistake - the test can create the service
   level for its own use, using CQL; it does not need to assume such a
   service level already exists.
   It's important to fix this to allow the service level test to run
   against an installation of Scylla not set up by our own scripts.
   Moreover, while the code in suite.yaml was correct, the code in
   "run" was incorrect (used an outdated keyspace name). This patch
   removes that incorrect code.

2. The PR introduced a duplicate "cql" fixture, copied verbatim
   from test_cql_rbac.py (including a comment that was correct only
   in the latter file :-)). Let's de-duplicate it, using the fixture
   that I moved to conftest.py in the previous patch.

3. The PR used temporary_grant(). This needlessly complicated the test
   and added even more duplicate code, and this patch removes all that
   stuff. This test is about service levels, not RBAC and "grant".
   This test should just use a superuser role that has the permissions
   to do everything, and does not need to be granted specific permissions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-01-05 19:40:14 +02:00
Nadav Har'El
879c0a3bd6 test/alternator: move "cql" fixture to test/alternator/conftest.py
Most Alternator tests use only the DynamoDB API, not CQL. Tests in
test_cql_rbac.py did need CQL to set up roles and RBAC, so this file
introduced a "cql" fixture to make CQL requests.

A recently-introduced test/alternator/test_service_levels.py also
needs access to CQL - it currently uses it for misguided reasons but
the next patch will need it for creating a role and a service level.
So instead of duplicating this fixture, let's move this fixture into
test/alternator/conftest.py that all Alternator tests can share.

The next patch will clean up this duplication in test_service_levels.py
and the other mistakes it introduced.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-01-05 19:33:55 +02:00
Kefu Chai
569f8e9246 treewide: fix misspellings
these misspellings were identified by codespell. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22154
2025-01-05 16:13:09 +02:00
Raphael S. Carvalho
c973254362 Introduce incremental compaction strategy (ICS)
ICS is a compaction strategy that inherits size tiered properties --
therefore it's write optimized too -- but fixes its space overhead of
100% due to input files being only released on completion. That's
achieved with the concept of sstable run (similar in concept to LCS
levels) which breaks a large sstable into fixed-size chunks (1G by
default), known as run fragments. ICS picks similar-sized runs
for compaction, and fragments of those runs can be released
incrementally as they're compacted, reducing the space overhead
to about (number_of_input_runs * 1G). This allows users to increase
the storage density of nodes (from 50% to ~80%), reducing the cost of
ownership.
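
A back-of-the-envelope illustration of the difference (the numbers below are
made up for the example; only the 1G fragment size comes from the text above):

```
#include <cstdio>

int main() {
    const double run_size_gib = 100.0; // size of each similar-sized input run
    const int input_runs = 4;          // runs picked for one compaction
    const double fragment_gib = 1.0;   // default ICS fragment size

    // Size tiered: input sstables are only released when the output is done,
    // so the temporary headroom is roughly the whole input size.
    double stcs_headroom = input_runs * run_size_gib;   // ~400 GiB

    // ICS: fragments are released as soon as they are compacted, so the
    // headroom is roughly number_of_input_runs * fragment size.
    double ics_headroom = input_runs * fragment_gib;    // ~4 GiB

    std::printf("STCS ~%.0f GiB vs ICS ~%.0f GiB of compaction headroom\n",
                stcs_headroom, ics_headroom);
}
```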

NOTE: test_system_schema_version_is_stable adjusted to account for batchlog
using IncrementalCompactionStrategy

contains:

compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh)
compaction/CMakeLists.txt: include ICS cc files
configure.py: changes for ICS files, includes test
db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported
db/system_keyspace: pick ICS for some system tables
schema/schema.hh: ICS becomes default
test/boost: Add incremental_compaction_test.cc
test/boost/sstable_compaction_test.cc: ICS related changes
test/cqlpy/test_compaction_strategy_validation.py: ICS related changes

docs/architecture/compaction/compaction-strategies.rst: changes to ICS section
docs/cql/compaction.rst: changes to ICS section
docs/cql/ddl.rst: adds reference to ICS options
docs/getting-started/system-requirements.rst: updates sentence mentioning ICS
docs/kb/compaction.rst: changes to ICS section
docs/kb/garbage-collection-ics.rst: add file
docs/kb/index.rst: add reference to <garbage-collection-ics>
docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section

some relevant commits throughout the ICS history:

commit 434b97699b39c570d0d849d372bf64f418e5c692
Merge: 105586f747 30250749b8
Author: Paweł Dziepak <pdziepak@scylladb.com>
Date:   Tue Mar 12 12:14:23 2019 +0000

    Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael

    "
    Introduce new compaction strategy which is essentially like size tiered
    but will work with the existing incremental compaction. Thus incremental
    compaction strategy.

    It works like size tiered, but each element composing a tier is a sstable
    run, meaning that the compaction strategy will look for N similar-sized
    sstable runs to compact, not just individual sstables.

    Parameters:
    * "sstable_size_in_mb": defines the maximum sstable (fragment) size
    composing
    a sstable run, which impacts directly the disk space requirement which is
    improved with incremental compaction.
    The lower the value the lower the space requirement for compaction because
    fragments involved will be released more frequently.
    * all others available in size tiered compaction strategy

    HOWTO
    =====

    To change an existing table to use it, do:
         ALTER TABLE mykeyspace.mytable  WITH compaction =
    {'class' : 'IncrementalCompactionStrategy'};

    Set fragment size:
         ALTER TABLE mykeyspace.mytable  WITH compaction =
    {'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 }

    "

commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1
Merge: dca89ce7a5 e08ef3e1a3
Author: Avi Kivity <avi@scylladb.com>
Date:   Tue Sep 8 11:31:52 2020 +0300

    Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael

    "
    A new option, space_amplification_goal (SAG), is being added to ICS. This option
    will allow ICS user to set a goal on the space amplification (SA). It's not
    supposed to be an upper bound on the space amplification, but rather, a goal.
    This new option will be disabled by default as it doesn't benefit write-only
    (no overwrites) workloads and could hurt severely the write performance.
    The strategy is free to delay triggering this new behavior, in order to
    increase overall compaction efficiency.

    The graph below shows how this feature works in practice for different values
    of space_amplification_goal:
    https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png

    When strategy finds space amplification crossed space_amplification_goal, it
    will work on reducing the SA by doing a cross-tier compaction on the two
    largest tiers. This feature works only on the two largest tiers, because taking
    into account others, could hurt the compaction efficiency which is based on
    the fact that the more similar-sized sstables are compacted together the higher
    the compaction efficiency will be.

    With SAG enabled, min_threshold only plays an important role on the smallest
    tiers, given that the second-largest tier could be compacted into the largest
    tier for a space_amplification_goal value < 2.
    By making the options space_amplification_goal and min_threshold independent,
    user will be able to tune write amplification and space amplification, based on
    the needs. The lower the space_amplification_goal the higher the write
    amplification, but by increasing the min threshold, the write amplification
    can be decreased to a desired amount.
    "

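A hypothetical outline of the SAG trigger described above (not the actual ICS
code; the tier representation is invented for the example):

```
#include <algorithm>
#include <vector>

struct tier { long long total_bytes; };

// When measured space amplification crosses space_amplification_goal, pick
// the two largest tiers for a cross-tier compaction; smaller tiers are left
// to normal same-tier compaction to preserve overall efficiency.
std::vector<tier> pick_sag_job(std::vector<tier> tiers,
                               double space_amp, double goal) {
    if (space_amp <= goal || tiers.size() < 2) {
        return {};
    }
    std::partial_sort(tiers.begin(), tiers.begin() + 2, tiers.end(),
        [] (const tier& a, const tier& b) { return a.total_bytes > b.total_bytes; });
    return {tiers[0], tiers[1]};
}
```
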
commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1
Author: Raphael S. Carvalho <raphaelsc@scylladb.com>
Date:   Sat Oct 16 13:41:46 2021 -0300

    compaction: ICS: Add garbage collection

    Today, ICS lacks an approach to purge expired tombstones in a timely manner,
    which is a problem because accumulation of tombstones is known to affect
    latency considerably.

    For an expired tombstone to be purged, it has to reach the top of the LSM tree
    and hope that older overlapping data wasn't introduced at the bottom.
    These conditions must be satisfied to avoid data resurrection.

    STCS, today, has an inefficient garbage collection approach because it only
    picks a single sstable, which satisfies the tombstone density threshold and
    file staleness. That's a problem because overlapping data either on the same tier
    or smaller tiers will prevent tombstones from being purged. Also, nothing is
    done to push the tombstones to the top of the tree, for the conditions to be
    eventually satisfied.

    Due to incremental compaction, ICS can more easily have an efficient GC by
    doing cross-tier compaction of relevant tiers.

    The trigger will be file staleness and tombstone density, whose threshold
    values can be configured by tombstone_compaction_interval and
    tombstone_threshold, respectively.

    If ICS finds a tier which meets both conditions, then that tier and the
    larger[1] *and* closest-in-size[2] tier will be compacted together.
    [1]: A larger tier is picked because we want tombstones to eventually reach the
    top of the tree.
    [2]: It also has to be the closest-in-size tier as the smaller the size
    difference the higher the efficiency of the compaction. We want to minimize
    write amplification as much as possible.
    The staleness condition is there to prevent the same file from being picked
    over and over again in a short interval.

    With this approach, ICS will be continuously working to purge garbage while
    not hurting overall efficiency in a steady state, as same-tier compactions are
    prioritized.

    Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
    Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com>
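
A hypothetical sketch of the GC tier selection described in the commit above
(illustrative only; field names and structure are not the actual ICS code):

```
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct tier {
    uint64_t size_bytes;
    double tombstone_density;          // estimated ratio of droppable tombstones
    int64_t seconds_since_compaction;  // staleness of the tier
};

// A tier qualifies when both tombstone_threshold and
// tombstone_compaction_interval are exceeded; it is then paired with the
// larger tier that is closest in size, so tombstones move up the tree while
// write amplification stays low.
std::optional<std::pair<std::size_t, std::size_t>>
pick_gc_job(const std::vector<tier>& tiers, double density_threshold, int64_t interval_secs) {
    for (std::size_t i = 0; i < tiers.size(); ++i) {
        const tier& t = tiers[i];
        if (t.tombstone_density < density_threshold || t.seconds_since_compaction < interval_secs) {
            continue;
        }
        std::optional<std::size_t> best;
        for (std::size_t j = 0; j < tiers.size(); ++j) {
            if (tiers[j].size_bytes <= t.size_bytes) {
                continue;  // must be larger than the triggering tier
            }
            if (!best || tiers[j].size_bytes < tiers[*best].size_bytes) {
                best = j;  // and the closest in size among the larger ones
            }
        }
        if (best) {
            return std::make_pair(i, *best);
        }
    }
    return std::nullopt;
}
```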

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22063
2025-01-04 15:43:52 +02:00
Kefu Chai
220cafe7c4 test.py: Fix path checking for combined_test executable
Previously in 8b7a5ca88d, we checked for combined_test existence
without the "build" component in the path. This caused the test
suite to never find the executable, preventing the test cases'
cache from being populated.

Changes:

1. Use path_to() to check executable existence, which:
   - Includes the "build" component in path
   - Handles both CMake and configure.py build paths
2. Move existence check out of _generate_cache() for clarity

This ensures combined_test and its included tests are properly
discovered and run.

Fixes scylladb/scylladb#22086

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-04 06:11:21 +08:00
Kefu Chai
9d0f27e7c1 test.py: Throw only if scylla executable is not found
Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
  handle build paths and artifacts

This change:
1. Moves executable existence check to PythonTestSuite class
2. Eliminates redundant os.access() checks

This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.

In a succeeding change, we will correct the check for combined_tests.

Refs scylladb/scylladb#19489
Refs scylladb/scylladb#22086

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-04 06:11:21 +08:00
Tomasz Grabiec
4c89e62470 Merge 'Phased barrier improvements' from Benny Halevy
- utils: phased_barrier: advance_and_await: allocate new gate only when needed
- utils: phased_barrier: add close() method
  - and use in existing services

* Improvement. No backport needed

Closes scylladb/scylladb#22018

* github.com:scylladb/scylladb:
  utils: phased_barrier: add close() method
  utils: phased_barrier: advance_and_await: allocate new gate only when needed
2025-01-03 18:51:23 +01:00
Avi Kivity
202f16e799 Merge 'Introduce workload prioritization for service levels' from Piotr Dulikowski
This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of a user who has this service level attached (if multiple are attached, the one with the lowest shares wins).

Different service levels will be isolated in the following way:

- Each service level gets its own scheduling group, with shares corresponding to the service level's number of shares, which controls the priority of the CPU and I/O used for user operations running on that service level.
- Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates.
- Each service level gets its own TCP connections for RPC to prevent priority inversion issues.

Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user-created service levels + 1 created by default that cannot be removed.

This feature was previously only available in ScyllaDB Enterprise but has now been made available in the source-available ScyllaDB. The series was created by comparing the master branch with the source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made, such as fixing whitespace, removing unused or unnecessary code, and adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made.

No backport is required.

Closes scylladb/scylladb#22031

* github.com:scylladb/scylladb:
  tracing: record scheduling group in trace event record
  qos: un-shared-from-this standard_service_level_distributed_data_accessor
  alternator: execute under scheduling group for service level
  test.py: support multiple commands in prepare_cql in suite.yml
  docs: add documentation for workload prioritization
  docs/dev: describe workload prioritization features in service_levels
  test/auth_cluster: test workload prioritization in service level tests
  cqlpy/test_service_levels: add workload prioritization tests
  api: introduce service levels specific API
  api/cql_server_test: add information about scheduling group
  db/virtual_tables: add scheduling group column to system.clients
  test/boost: update service_level_controller_test for workload prio
  qos: include number of shares in DESCRIBE
  cql3/statements: update SL statements for workload prioritization
  transport/server: use scheduling group assigned to current user
  messaging_service: use separate set of connections per service levels
  replica/database: add reader concurrency semaphore groups
  qos: manage and assign scheduling groups to service levels
  qos: use the shares field in service level reads/writes
  qos: add shares to service_level_options
  qos: explicitly specify columns when querying service level tables
  db/system_distributed_keyspace: add shares column and upgrade code
  db/system_keyspace: adjust SL schema for workload prioritization
  gms: introduce WORKLOAD_PRIORITIZATION cluster feature
  build: increase the max number of scheduling groups
  qos: return correct error code when SL does not exist
2025-01-02 20:05:36 +02:00
Kefu Chai
0ea8cd2bb8 test/pylib/minio_server: use error level for fatal errors
Previously, fatal errors like a missing Minio executable were logged at INFO level,
which could be filtered out by log settings. Switch to ERROR level to ensure
these critical issues are always visible to developers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22084
2025-01-02 20:03:55 +02:00
Botond Dénes
7d42b80228 service/storage_proxy: data_read_resolver::resolve(): remove unneeded maybe_yield()
We already have a yield in the loop via apply_gently(); the maybe_yield
is superfluous, so remove it.

Follow-up to https://github.com/scylladb/scylladb/pull/21884

Closes scylladb/scylladb#21984
2025-01-02 16:13:29 +01:00
Kefu Chai
de42dce4c4 pgo: use java-11 when running cassandra-stress
we updated tools/java/build.xml recently to only build for java-11. so
if

- the `java` executable in `$PATH` points to a java which is neither
  java-8 nor java-11.
- java-8 is installed

java-8 is used to execute the cassandra-stress tool, and we would have
the following failure:

```
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recogniz
es class file versions up to 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621)
```

in order to be compatible with the bytecode targeting java-11, let's run
cassandra-stress with java-11. we do not need to support java-8, because
the new tools/java is now building cassandra-stress targeting java-11 jre.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22142
2025-01-02 16:56:29 +02:00
Artsiom Mishuta
174199610b test.py: add more log info if the server is broken
The server_broken_reason attribute was introduced on the server to store the raw
information about why the server was broken.

Additional information was added to the error messages in case of "server
broken".

fixes: #21630

Closes scylladb/scylladb#22074
2025-01-02 16:54:55 +02:00
Kefu Chai
233e3969c4 utils: correct misspellings
these misspellings were identified by codespell. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22143
2025-01-02 16:47:57 +02:00
Avi Kivity
1ce373d80b schema: deinline some speculative_retry methods
These string conversion functions are not in any fast path. Deinlining
them moves a <boost/lexical_cast.hpp> include out of a common header file.

Some files accessed boost::iterator_range via lexical_cast.hpp,
so they gain a new dependency.
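
A generic sketch of the deinlining pattern (hypothetical names; the real
speculative_retry code differs):

```
// --- foo.hh: before, the body lived here and dragged
// <boost/lexical_cast.hpp> into every file that included the header;
// now only the declaration remains.
#include <string>
struct speculative_retry_like {
    double value;
    std::string to_string() const;   // declared only
};

// --- foo.cc: the definition (and the heavy include) moved here, so only
// this translation unit pays for <boost/lexical_cast.hpp>.
#include <boost/lexical_cast.hpp>
std::string speculative_retry_like::to_string() const {
    return boost::lexical_cast<std::string>(value);
}
```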

Closes scylladb/scylladb#21950
2025-01-02 12:28:33 +01:00
Avi Kivity
051c310f02 tracing: record scheduling group in trace event record
We have a "thread" field (unfortunately not yet displayed
in cqlsh, but visible in the table) that records the shard
on which a particular event was recorded. Record the scheduling
group as well, as this can be useful to understand where the
query came from.

(cherry picked from commit 3c03b5f66376dca230868e54148ad1c6a1ad0ee2)
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
07fdf9d21f qos: un-shared-from-this standard_service_level_distributed_data_accessor
Apparently, it is not needed for
standard_service_level_distributed_data_accessor to derive from
enable_shared_from_this.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
b23bc3a5d5 alternator: execute under scheduling group for service level
Now, the Alternator API requests are executed under the correct
scheduling group of the service level assigned to the currently logged
in user.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
67b11e846a test.py: support multiple commands in prepare_cql in suite.yml
This will be needed for alternator tests introduced in the next commit,
which will have to execute multiple CQL operations during preparation.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
07b162fb5b docs: add documentation for workload prioritization
The doc pages were slightly adjusted during migration not to mention
Scylla Enterprise and to fix some whitespace issues.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
241e710c19 docs/dev: describe workload prioritization features in service_levels
The concept of shares, and some helper HTTP APIs, are now described in
the developer documentation for service levels.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
473bb44722 test/auth_cluster: test workload prioritization in service level tests
Update `test_connections_parameters_auto_update` to also check that the
scheduling group of given connections is appropriately changed when a
different service level is assigned to the user that the connection uses
for authentication.

Apart from that, more tests are added:

- Check for the logic that forbids setting shares for a service level
  until all nodes in the cluster are upgraded
- Test for handling the case when there are more scheduling groups than
  allowed (this might happen after upgrade from a non-workload-prio
  version)
- Regression test for a bug where fewer scheduling groups could have been
  created than allowed due to some metrics not being renamed on
  scheduling group name change.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
29b153c9e7 cqlpy/test_service_levels: add workload prioritization tests
Adjust existing cqlpy tests and add more in order to test the workload
prioritization feature:

- The DESCRIBE test is updated to check that generated statements
  contain information about shares
- Two tests for shares in the LIST EFFECTIVE SERVICE LEVEL statement
- Regression test which checks that we can create as many service levels
  as promised in the documentation (currently 7), but no more
- Test which checks that NULL shares in the service levels table are
  treated as the default 1000 shares
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
49f5fc0e70 api: introduce service levels specific API
Introduces two endpoints with operations specific to service levels:

- switch_tenants: updates the scheduling group of all connections to be
  aligned with the service level specific to the logged in user. This is
  mostly legacy API, as with service levels on raft this is done
  automatically.
- count_connections: for each user and for each scheduling group, counts
  how many connections are assigned to that user and scheduling group.
  This API is used in tests.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
a65c0c3735 api/cql_server_test: add information about scheduling group
Now, information about connections' scheduling group is included in the
HTTP API for querying information about connections' parameters.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
9319d65971 db/virtual_tables: add scheduling group column to system.clients
Add the "scheduling_group" column to the system.clients table which
names the scheduling group that currently serves the connection/client.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
bbc655ff32 test/boost: update service_level_controller_test for workload prio
Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ce4032dfc0 qos: include number of shares in DESCRIBE
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
0f62eb45d1 cql3/statements: update SL statements for workload prioritization
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to service levels.

Adjust the CQL statements for service levels:

- CREATE/ALTER now allow setting shares (only if the cluster is fully
  upgraded)
- LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new
  column
- LIST SERVICE LEVEL(S) also returns the number of shares, and has the
  additional column "percentage of all service level shares"
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
6d90a933cd transport/server: use scheduling group assigned to current user
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged in user.
The scheduling group is also updated when the service level assigned to
this user changes.

Starting from this commit, the scheduling groups managed by the service
level controller are actually being used by user workload.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
f1b9737e07 messaging_service: use separate set of connections per service levels
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.

The above is achieved by reusing the existing concept of "tenants" in
messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.

In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
7383013f43 replica/database: add reader concurrency semaphore groups
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.

Each group is statically assigned a pool of memory on startup and
dynamically distributes this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.

The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
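
A rough sketch of the proportional split described above (illustrative
numbers, not the actual semaphore-group code):

```
#include <cstdio>

int main() {
    const double pool_bytes = 100.0 * (1 << 20);  // memory given to the group
    const unsigned shares[] = {1000, 500, 200};   // per-service-level shares
    unsigned total = 0;
    for (auto s : shares) total += s;
    for (auto s : shares) {
        std::printf("service level with %u shares -> ~%.1f MiB of read memory\n",
                    s, pool_bytes * s / total / (1 << 20));
    }
}
```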
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
4cfd26efaf qos: manage and assign scheduling groups to service levels
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.

The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during the node's lifetime.

When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".

When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.

When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".

For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
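
A toy model of the pool lifecycle described above (the real controller works
with Seastar scheduling groups; everything here is hypothetical):

```
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>

struct sched_group { std::string name; unsigned shares = 0; };

class sl_scheduling_group_pool {
    std::unordered_map<std::string, sched_group> _in_use; // service level -> group
    std::deque<sched_group> _unused;                      // groups of deleted SLs
public:
    // Service level created: reuse an unused group if possible, otherwise
    // make a new one; rename it to "sl:..." and sync its shares.
    void add(const std::string& sl, unsigned shares) {
        sched_group g;
        if (!_unused.empty()) {
            g = _unused.front();
            _unused.pop_front();
        }
        g.name = "sl:" + sl;
        g.shares = shares;
        _in_use[sl] = g;
    }
    // Shares updated: keep the scheduling group's shares in sync.
    void update_shares(const std::string& sl, unsigned shares) {
        _in_use[sl].shares = shares;
    }
    // Service level deleted: release the group to the unused pool and
    // switch the name prefix from "sl:" to "sl_deleted:".
    void remove(const std::string& sl) {
        auto it = _in_use.find(sl);
        if (it == _in_use.end()) { return; }
        sched_group g = it->second;
        g.name = "sl_deleted:" + sl;
        _unused.push_back(std::move(g));
        _in_use.erase(it);
    }
};
```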
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ff51551a94 qos: use the shares field in service level reads/writes
Now, the newly introduced `shares` field is used when service levels are
either read from or written into system tables.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
a6f681029f qos: add shares to service_level_options
Add service level shares related fields to service_level_options and
slo_effective_names structs, and adjust the existing methods of the
former (merge_with, init_effective_names) to account for them.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
2eb35f37d0 qos: explicitly specify columns when querying service level tables
The service levels table is queried with a `SELECT * ...` query, by
using the `execute_internal` method which prepares and caches the query
in a special cache for internal queries, separate from the user query
cache.

During rolling upgrade from a version which does not support service
level shares to the one that does, the `shares` column is added. The
aforementioned internal query cache is _not_ invalidated on schema
change, so the cache might still contain the prepared query from the
time before the column was added, and that prepared query will fetch the
old set of columns without the new `shares` column.

In order to solve this, explicitly specify the columns in the query
string, using the full set of column names from the time when the query
is executed.

Note that this is a problem only for the legacy, non-raft service
levels. Raft-based service levels use a local table for which the schema
is determined on startup.

Also note that this code only fetches values from the `shares` column
but does not make any use of it otherwise. It will be handled by later
commits in this series.
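
A sketch of the idea with made-up table and column names (illustrative only):

```
#include <string>

int main() {
    // Before: a cached prepared "SELECT *" keeps the column set from the
    // moment it was first prepared, so it can miss a column added later.
    std::string before = "SELECT * FROM ks.service_levels";

    // After: the column list is spelled out from the schema known when the
    // query string is built, so the new `shares` column is always requested.
    std::string after = "SELECT service_level, shares FROM ks.service_levels";

    (void) before;
    (void) after;
}
```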
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ea25b29684 db/system_distributed_keyspace: add shares column and upgrade code
Add the "shares" column to the
system_distributed_keyspace.service_levels table, which is used by
legacy code.

Because this table is in a distributed and not local keyspace, adding
the column to an existing cluster during rolling upgrade requires a bit
of care. A callback is added to the workload prioritization cluster
feature which runs when the feature becomes enabled and adds the column
for all nodes in the cluster.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
346fc84c3e db/system_keyspace: adjust SL schema for workload prioritization
Add a "shares" column which holds the number of shares allocated to a
given service level.

It is not used by the code at all right now; subsequent commits will
make good use of it.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ecbf8721de gms: introduce WORKLOAD_PRIORITIZATION cluster feature
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands that would touch the new column.

This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
2025-01-02 07:13:34 +01:00