Commit Graph

39578 Commits

Author SHA1 Message Date
Benny Halevy
9413afce41 chunked_vector_test: exception_safe_class: count also moved objects
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.

However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:05:38 +03:00
Benny Halevy
8e20379305 utils: chunked_vector: fill ctor: make exception safe
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:05:38 +03:00
Botond Dénes
4e9ed69a75 Merge '[Backport 5.4] mutation_fragment_stream_validating_filter: respect validating_level::none' from ScyllaDB
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.

Fixes: #18662

(cherry picked from commit f6511ca1b0)

(cherry picked from commit e7b07692b6)

(cherry picked from commit 78afb3644c)

Refs #18667

Closes scylladb/scylladb#18724

* github.com:scylladb/scylladb:
  test/boost/mutation_fragment_test.cc: add test for validator validation levels
  mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
  mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
2024-05-20 09:02:52 +03:00
Botond Dénes
7552c4b187 test/boost/mutation_fragment_test.cc: add test for validator validation levels
To make sure that the validator doesn't validate what the validation
level doesn't include.

(cherry picked from commit 78afb3644c)
2024-05-17 07:55:05 +00:00
Botond Dénes
87dcd29ec3 mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.

(cherry picked from commit e7b07692b6)
2024-05-17 07:55:04 +00:00
Botond Dénes
9e7cd767dd mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.

(cherry picked from commit f6511ca1b0)
2024-05-17 07:55:04 +00:00
Botond Dénes
63d1c763fc Merge '[Backport 5.4] tools/scylla-sstable: add scylla sstable shard-of command' from Kefu Chai
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.

in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.

Fixes #16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes https://github.com/scylladb/scylladb/pull/16345

(cherry picked from commit 273ee36bee)

Fixes #18381

- [x] need to backport, because we have needs in production to figure out the mapping from an sstable identifier to the shard which "owns" it.

Closes scylladb/scylladb#18681

* github.com:scylladb/scylladb:
  tools: Make sstable shard-of efficient by loading minimum to compute owners
  test/cql-pytest/test_tools.py: test shard-of with a single partition
  tools/scylla-sstable: add `scylla sstable shard-of` command
2024-05-16 11:07:47 +03:00
Pavel Emelyanov
29c892ea5a functions: Do not crash when schema is missing
Getting token() function first tries to find a schema for underlying
table and continues with nullptr if there's no one. Later, when creating
token_fct, the schema is passed as is and referenced. If it's null crash
happens.

It used to throw before 5983e9e7b2 (cql3: test_assignment: pass optional
schema everywhere) on missing schema, but this commit changed the way
schema is looked up, so nullptr is now possible.

fixes: #18637

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit df8a446437)

Closes scylladb/scylladb#18698
2024-05-16 11:06:25 +03:00
Raphael S. Carvalho
9bb175852d tools: Make sstable shard-of efficient by loading minimum to compute owners
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18440

(cherry picked from commit d7a01598ce)
2024-05-15 14:32:43 +08:00
Kefu Chai
daf4ffb9b4 test/cql-pytest/test_tools.py: test shard-of with a single partition
test_scylla_sstable_shard_of takes lots of time preparing the keys for a
certain shard. with the debug build, it takes 3 minutes to complete the
test.

so in order to test the "shard-of" subcommand in an more efficient way,
in this change, we improve the test in two ways:

1. cache the output of 'scylla types shardof`. so we can avoid the
   overhead of running a seastar application repeatly for the
   same keys.
2. reduce the number of partitions from 42 to 1. as the number of
   partitions in an sstable does not matter when testing the
   output of "shard-of" command of a certain sstable. because,
   the sstable is always generated by a certain shard.

before this change, with pytest-profiling:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000  181.950   60.650 runner.py:219(call_and_report)
      4/3    0.000    0.000  181.948   60.649 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000  181.948   60.649 runner.py:318(from_call)
      4/3    0.000    0.000  181.948   60.649 runner.py:262(<lambda>)
    44/11    0.000    0.000  181.935   16.540 _hooks.py:427(__call__)
    43/11    0.000    0.000  181.935   16.540 _manager.py:103(_hookexec)
    43/11    0.000    0.000  181.935   16.540 _callers.py:30(_multicall)
      361    0.001    0.000  181.531    0.503 contextlib.py:141(__exit__)
   782/81    0.001    0.000  177.578    2.192 {built-in method builtins.next}
     1044    0.006    0.000   92.452    0.089 base_events.py:1894(_run_once)
       11    0.000    0.000   91.129    8.284 fixtures.py:686(<lambda>)
    17/11    0.000    0.000   91.129    8.284 fixtures.py:1025(finish)
        4    0.000    0.000   91.128   22.782 fixtures.py:913(_teardown_yield_fixture)
      2/1    0.000    0.000   91.055   91.055 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000   91.055   91.055 runner.py:119(runtestprotocol)
        2    0.000    0.000   91.052   45.526 conftest.py:50(cql)
        2    0.000    0.000   91.040   45.520 util.py:161(cql_session)
        1    0.000    0.000   91.040   91.040 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000   91.040   91.040 runner.py:509(teardown_exact)
     1945    0.002    0.000   90.722    0.047 events.py:82(_run)
```

after this change:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000    8.271    2.757 runner.py:219(call_and_report)
    44/11    0.000    0.000    8.270    0.752 _hooks.py:427(__call__)
    44/11    0.000    0.000    8.270    0.752 _manager.py:103(_hookexec)
    44/11    0.000    0.000    8.270    0.752 _callers.py:30(_multicall)
      4/3    0.000    0.000    8.269    2.756 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000    8.269    2.756 runner.py:318(from_call)
      4/3    0.000    0.000    8.269    2.756 runner.py:262(<lambda>)
       48    0.000    0.000    8.269    0.172 {method 'send' of 'generator' objects}
       27    0.000    0.000    5.671    0.210 contextlib.py:141(__exit__)
       11    0.000    0.000    4.297    0.391 fixtures.py:686(<lambda>)
      2/1    0.000    0.000    4.228    4.228 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000    4.228    4.228 runner.py:119(runtestprotocol)
        2    0.000    0.000    4.213    2.106 capture.py:877(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:509(teardown_exact)
        2    0.000    0.000    3.628    1.814 capture.py:872(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 runner.py:160(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 python.py:1797(runtest)
   114/81    0.001    0.000    3.505    0.043 {built-in method builtins.next}
       15    0.784    0.052    3.183    0.212 subprocess.py:417(check_output)
```

Fixes #16516
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16523

(cherry picked from commit 642652efab)
2024-05-15 14:32:43 +08:00
Kefu Chai
03a54a4c07 tools/scylla-sstable: add scylla sstable shard-of command
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.

in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.

Fixes #16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16345

(cherry picked from commit 273ee36bee)
2024-05-15 14:32:42 +08:00
Lakshmi Narayanan Sreethar
4b0c60cdc3 compaction: improve partition estimates for garbage collected sstables
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.

Fixes #18283

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18465

(cherry picked from commit d39adf6438)

Closes scylladb/scylladb#18656
2024-05-14 07:53:07 +03:00
Patryk Wrobel
28d0fc1b6b scylla_io_setup: ensure correct RLIMIT_NOFILE for iotune
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with large number of cores.

In such case iotune reports failure due to
unability to create files or to set up seastar
framework.

This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.

The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
(cherry picked from commit ec820e214c)

Closes scylladb/scylladb#18655
2024-05-14 07:48:53 +03:00
Israel Fruchter
393880f355 Update tools/cqlsh submodule to v6.0.17
Mostly a set of fixes in the area of ssl handling

* tools/cqlsh 99b2b777...9d49b385 (21):
  > cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
  > cqlshlib/sslhandling.py: don't use empty userkey/usercert
  > Dockerfile: noninteractive isn't enough for answering yet on apt-get
  > fix cqlsh version print
  > cqlshlib/sslhandling: change `check_hostname` deafult to False
  > Introduce new ssl configuration for disableing check_hostname
  > set the hostname in ssl_options.server_hostname when SSL is used
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73
  > github actions: update `cibuildwheel==v2.16.5`
  > dist/debian: fix the trailer line format
  > `COPY TO STDOUT` shouldn't put None where a function is expected
  > Make cqlsh work with unix domain sockets
  > Bump python-driver version
  > dist/debian: add trailer line
  > dist/debian: wrap long line
  > Draft: explicit build-time packge dependencies
  > stop retruning status_code=2 on schema disagreement
  > Fix minor typos in the code
  > Dockerfile: apt-get update and apt-get upgrade to get latest OS packages

Ref: #18590

Closes scylladb/scylladb#18652
2024-05-14 07:47:37 +03:00
Botond Dénes
e008060f39 Merge 'doc: fix Rust Driver release information' from Anna Stuchlik
This PR removes the incorrect information that the ScyllaDB Rust Driver is not GA.

In addition, it replaces "Scylla" with "ScyllaDB".

Fixes https://github.com/scylladb/scylladb/issues/16178

Closes scylladb/scylladb#16199

* github.com:scylladb/scylladb:
  doc: remove the "preview" label from Rust driver
  doc: fix Rust Driver release information

(cherry picked from commit 56c3515751)
2024-05-10 12:22:03 +02:00
Aleksandra Martyniuk
7589981898 tasks: use default task_ttl in scylla.yaml
Currently default task_ttl_in_seconds is 0, but scylla.yaml changes
the value to 10.

Change task_ttl_in_seconds in scylla.yaml to 0, so that there are
consistent defaults. Comment it out.

Fixes: #16714.
(cherry picked from commit 67bbaad62e)

Closes scylladb/scylladb#18584
2024-05-09 16:10:14 +03:00
Kamil Braun
ed89deab40 direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.

With the change, the phenomenon no longer appears.

(cherry picked from commit 8df6d10e88)

Closes scylladb/scylladb#18559
2024-05-08 14:57:09 +02:00
Pavel Emelyanov
905b8f59bd Update seastar submodule (iotune iodepth underflow fix)
* seastar ae05c138...cfb015d0 (1):
  > iotune: ignore shards with id above max_iodepth

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-06 19:27:31 +03:00
Nadav Har'El
862e2affe0 cql3: Fix invalid JSON parsing for JSON object with different key types
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:

  * blob
  * date
  * inet
  * time
  * timestamp
  * timeuuid
  * uuid

Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happwn to work in JSON, this didn't fail.

The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.

This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even befoe this patch) work with this patch.

Fixes #18477.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 21557cfaa6)

Closes scylladb/scylladb#18522
2024-05-05 23:53:19 +03:00
Pavel Emelyanov
d68d765247 view-builder: Print correct exception in built ste exception handler
Inside .handle_exception() continuation std::current_exception() doesn't
work, there's std::exception ex argument to handler's lambda instead

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349

(cherry picked from commit 4ac30e5337)
2024-05-01 10:19:28 +03:00
Michał Chojnowski
d8df02f490 docs: clarify that DELETE can be used with USING TIMEOUT
The current text seems to suggest that `USING TIMEOUT` doesn't work with `DELETE` and `BATCH`. But that's wrong.

Closes scylladb/scylladb#18424

(cherry picked from commit c1146314a1)
2024-05-01 10:17:05 +03:00
Anna Stuchlik
d85d37921a doc: run repair after changing RF of system_auth
This commit adds the requirement to run repair after changing
the replication factor of the system_auth keyspace
in the procedure of adding a new node to a cluster.

Refs: https://github.com/scylladb/scylla-enterprise/issues/4129

Closes scylladb/scylladb#18466
2024-04-30 19:15:55 +03:00
Asias He
2d4825835c streaming: Fix use after move in fire_stream_event
The event is used in a loop.

Found by clang-tidy:

```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
        listener->handle_stream_event(std::move(event));
                                                ^
streaming/stream_result_future.cc:80:39: note: move occurred here
        listener->handle_stream_event(std::move(event));
                                      ^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
        listener->handle_stream_event(std::move(event));
                                                ^
```

Fixes #18332

Closes scylladb/scylladb#18333

(cherry picked from commit 1ca779d287)
2024-04-26 11:00:01 +03:00
Lakshmi Narayanan Sreethar
201d990072 sstables: reclaim_memory_from_components: do not update _recognised_components
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.

Fixes #18398

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18400

(cherry picked from commit 6af2659b57)
2024-04-26 10:59:13 +03:00
Wojciech Mitros
678948e671 mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

(cherry picked from commit 9789a3dc7c)

Refs #17891

Closes scylladb/scylladb#18108
2024-04-25 12:45:01 +02:00
Raphael S. Carvalho
8acedb9255 sstables: Fix use-after-move in an error path of FS-based sstable writer
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
        remove_file(file_path).get();
                    ^
sstables/storage.cc:145:64: note: move occurred here
    auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));

```

It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.

courtesy of clang-tidy.

Fixes #18323.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2fba1f936d)

Closes scylladb/scylladb#18382
2024-04-25 13:13:48 +03:00
Kefu Chai
3ed0826292 thrift: avoid use-after-move in make_non_overlapping_ranges()
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.

in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.

this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.

Refs 822a315dfa
Fixes #18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)

Closes scylladb/scylladb#18374
2024-04-25 11:35:38 +03:00
Botond Dénes
154fcffbc7 Merge '[branch-5.4] Fix false-positive errors in scrub validate-mode' from Botond Dénes
The new MX-native validator, which validates the index in tandem with the data file, was discovered to print false-positive errors, related to range-tombstones and promoted-index positions.
This series fixes that. But first, it refactors the scrub-related tests. These are currently dominated by boiler-plate code. They are hard to read and hard to write. In the first half of the series, a new scrub_test is introduced, which moves all the boiler-plate to a central place, allowing the tests to focus on just the aspect of scrub that is tested.
Then, all the found bugs in validate are fixed and finally a new test, checking validate with valid sstable is introduced.

This PR backports https://github.com/scylladb/scylladb/pull/16327.

Fixes: https://github.com/scylladb/scylladb/issues/16326

Closes scylladb/scylladb#18404

* github.com:scylladb/scylladb:
  test/boost/sstable_compaction_test: add validation test with valid sstable
  sstablex/mx/reader: validate(): print trace message when finishing the PI block
  sstablex/mx/reader: validate(): make index-data PI position check message consistent
  sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
  sstablex/mx/reader: validate(): reset the current PI block on partition-start
  sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
  sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
  test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
  test/boost/sstable_compaction_test: fix indentation
  test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
  test/boost/sstable_compaction_test: introduce scrub_test_framework
  test/lib/random_schema: add uncompatible_timestamp_generator()
2024-04-25 08:36:30 +03:00
Jenkins Promoter
7e946925c3 Update ScyllaDB version to: 5.4.7 2024-04-24 23:13:50 +03:00
Botond Dénes
d09e2a2311 test/boost/sstable_compaction_test: add validation test with valid sstable
Add a positive test, as it turns out we had some false-positive
validation bugs in the validator and we need a regression test for this.

(cherry picked from commit 2335f42b2b)
2024-04-24 09:38:57 -04:00
Botond Dénes
6462a2f391 sstablex/mx/reader: validate(): print trace message when finishing the PI block
(cherry picked from commit a19a2d76c9)
2024-04-24 09:37:15 -04:00
Botond Dénes
2e19e7cb6d sstablex/mx/reader: validate(): make index-data PI position check message consistent
The message says "index-data" but when printing the position, the data
position is printed first, causing confusion. Fix this and while at it,
also print the position of the partition start.

(cherry picked from commit 677be168c4)
2024-04-24 09:33:36 -04:00
Botond Dénes
fef7498da2 sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
The validate() consumes the content of partitions in a consume-loop.
Every time the consumer asks for a "break", the next PI block is loaded
and set on the validator, so it can validate that further clustering
elements are indeed from this block.
This loop assumed the consumer would only request interruption when the
current clustering block is finished. This is wrong, the consumer can
also request interruption when yielding is needed. When this is the
case, the next PI block doesn't have to be loaded yet, the current one
is not exhausted yet. Check this condition, before loading the next PI
block, to prevent false positive errors, due to mismatched PI block
and clustering elements from the sstable.

(cherry picked from commit 5bff7c40d3)
2024-04-24 09:32:28 -04:00
Botond Dénes
f45c878149 sstablex/mx/reader: validate(): reset the current PI block on partition-start
It is possible that the next partition has no PI and thus there won't be
a new PI block to overwrite the old one. This will result in
false-positive messages about rows being outside of the finished PI
block.

(cherry picked from commit e073df1dbb)
2024-04-24 09:31:10 -04:00
Botond Dénes
7225410af7 sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
Promoted index entries can be written on any clustering elements,
icluding range tombstones. So the validating consumer also has the check
whether the current expected clustering block is finished, when
consuming a range tombstone. If it is, consumption has to be
interrupted, so that the outer-loop can load up the next promoted index
block, before moving on to the next clustering element.

(cherry picked from commit 2737899c21)
2024-04-24 09:29:32 -04:00
Botond Dénes
564b01fda9 sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
For range tombstone end-bounds, the validate_fragment_order() should be
passed a null tombstone, not a disengaged optional. The latter means no
change in the current tombstone. This caused the end bound of range
tombstones to not make it to the validator and the latter complained
later on partition-end that the partition has unclosed range tombstone.

(cherry picked from commit f46b458f0d)
2024-04-24 09:28:18 -04:00
Botond Dénes
1d80427888 test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
It is not used anymore.

(cherry picked from commit 8be97884ec)
2024-04-24 09:26:46 -04:00
Botond Dénes
ff17ec81e4 test/boost/sstable_compaction_test: fix indentation
(cherry picked from commit da0f4d3a9f)
2024-04-24 09:26:30 -04:00
Botond Dénes
121f2a530e test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit c35092aff6)
2024-04-24 09:03:44 -04:00
Botond Dénes
b77581c84c test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 3f76aad609)
2024-04-24 08:58:18 -04:00
Botond Dénes
ea176bf4ce test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
The test becomes a lot shorter and it now uses random schema and random
data. The test is also split in two: one test for abort mode and one for
skip mode.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 5237e8133b)
2024-04-24 08:45:53 -04:00
Botond Dénes
3835fd681d test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 76785baf43)
2024-04-24 07:57:54 -04:00
Botond Dénes
14da273c4c test/boost/sstable_compaction_test: introduce scrub_test_framework
Scrub tests require a lot of boilerplate code to work. This has a lot of
disadvantages:
* Tests are long
* The "meat" of the test is lost between all the boiler-plate, it is
  hard to glean what a test actually does
* Tests are hard to write, so we have only a few of them and they test
  multiple things.
* The boiler-plate differs sligthly from test-to-test.

To solve this, this patch introduces a new class, `scrub_test_frawmework`,
which is a central place for all the boiler-plate code needed to write
scrub-related tests. In the next patches, we will migrate scrub related
tests to this class.

(cherry picked from commit b6f0c4efa0)
2024-04-24 07:57:54 -04:00
Botond Dénes
33d5f27244 test/lib/random_schema: add uncompatible_timestamp_generator()
Guarantees that produced mutations will not be compactible.

(cherry picked from commit e412673c44)
2024-04-24 07:57:54 -04:00
Wojciech Mitros
c4515a9b99 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718e0a)

Closes scylladb/scylladb#18107
2024-04-24 09:42:18 +02:00
Asias He
10f137e367 repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.

Fixes #18140

Closes scylladb/scylladb#18141

(cherry picked from commit 642f9a1966)
scylla-5.4.6 scylla-5.4.6-candidate
2024-04-18 11:40:06 +03:00
Kamil Braun
53e1ed0ebb Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

(cherry picked from commit a0b331b310)

(cherry picked from commit 72955093eb)

Refs scylladb/scylladb#18184

Closes scylladb/scylladb#18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-17 17:50:30 +02:00
Botond Dénes
1aedc7372d Merge '[Backport 5.4] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.

The feature can be manually verified by doing the following :

1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB

The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.

Fixes https://github.com/scylladb/scylladb/issues/17747

Backported from #17771 to 5.4.

Closes scylladb/scylladb#18248

* github.com:scylladb/scylladb:
  test_bloom_filter.py: disable reclaiming memory from components
  sstable_datafile_test: add tests to verify auto reclamation of components
  test/lib: allow overriding available memory via test_env_config
  sstables_manager: support reclaiming memory from components
  sstables_manager: store available memory size
  sstables_manager: add variable to track component memory usage
  db/config: add a new variable to limit memory used by table components
  sstable_datafile_test: add testcase to verify reclamation from sstables
  sstables: support reclaiming memory from components
2024-04-17 14:33:19 +03:00
Kamil Braun
28781ca37e test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.

(cherry picked from commit 72955093eb)

Backport note: removed `timeout` parameter passed to `server_add`,
missing on this branch. (If server adding hangs, it will timeout after
`TOPOLOGY_TIMEOUT` from scylla_cluster.py)

Removed `force_gossip_join_boot` error injection from test, not present
in this branch. Starting nodes with `experimental_features` disabled.

Added missing `handle_state_normal.*finished` message.
2024-04-17 13:09:39 +02:00
Beni Peled
9218bbb9b9 test.py: add the pytest junit_suite_name parameter
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename

This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename

Closes scylladb/scylladb#18172

(cherry picked from commit 223275b4d1)
2024-04-17 07:02:33 +03:00