Commit Graph

39560 Commits

Author SHA1 Message Date
Nadav Har'El
862e2affe0 cql3: Fix invalid JSON parsing for JSON object with different key types
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:

  * blob
  * date
  * inet
  * time
  * timestamp
  * timeuuid
  * uuid

Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happwn to work in JSON, this didn't fail.

The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.

This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even befoe this patch) work with this patch.

Fixes #18477.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 21557cfaa6)

Closes scylladb/scylladb#18522
2024-05-05 23:53:19 +03:00
Pavel Emelyanov
d68d765247 view-builder: Print correct exception in built ste exception handler
Inside .handle_exception() continuation std::current_exception() doesn't
work, there's std::exception ex argument to handler's lambda instead

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349

(cherry picked from commit 4ac30e5337)
2024-05-01 10:19:28 +03:00
Michał Chojnowski
d8df02f490 docs: clarify that DELETE can be used with USING TIMEOUT
The current text seems to suggest that `USING TIMEOUT` doesn't work with `DELETE` and `BATCH`. But that's wrong.

Closes scylladb/scylladb#18424

(cherry picked from commit c1146314a1)
2024-05-01 10:17:05 +03:00
Anna Stuchlik
d85d37921a doc: run repair after changing RF of system_auth
This commit adds the requirement to run repair after changing
the replication factor of the system_auth keyspace
in the procedure of adding a new node to a cluster.

Refs: https://github.com/scylladb/scylla-enterprise/issues/4129

Closes scylladb/scylladb#18466
2024-04-30 19:15:55 +03:00
Asias He
2d4825835c streaming: Fix use after move in fire_stream_event
The event is used in a loop.

Found by clang-tidy:

```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
        listener->handle_stream_event(std::move(event));
                                                ^
streaming/stream_result_future.cc:80:39: note: move occurred here
        listener->handle_stream_event(std::move(event));
                                      ^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
        listener->handle_stream_event(std::move(event));
                                                ^
```

Fixes #18332

Closes scylladb/scylladb#18333

(cherry picked from commit 1ca779d287)
2024-04-26 11:00:01 +03:00
Lakshmi Narayanan Sreethar
201d990072 sstables: reclaim_memory_from_components: do not update _recognised_components
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.

Fixes #18398

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18400

(cherry picked from commit 6af2659b57)
2024-04-26 10:59:13 +03:00
Wojciech Mitros
678948e671 mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

(cherry picked from commit 9789a3dc7c)

Refs #17891

Closes scylladb/scylladb#18108
2024-04-25 12:45:01 +02:00
Raphael S. Carvalho
8acedb9255 sstables: Fix use-after-move in an error path of FS-based sstable writer
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
        remove_file(file_path).get();
                    ^
sstables/storage.cc:145:64: note: move occurred here
    auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));

```

It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.

courtesy of clang-tidy.

Fixes #18323.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2fba1f936d)

Closes scylladb/scylladb#18382
2024-04-25 13:13:48 +03:00
Kefu Chai
3ed0826292 thrift: avoid use-after-move in make_non_overlapping_ranges()
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.

in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.

this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.

Refs 822a315dfa
Fixes #18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)

Closes scylladb/scylladb#18374
2024-04-25 11:35:38 +03:00
Botond Dénes
154fcffbc7 Merge '[branch-5.4] Fix false-positive errors in scrub validate-mode' from Botond Dénes
The new MX-native validator, which validates the index in tandem with the data file, was discovered to print false-positive errors, related to range-tombstones and promoted-index positions.
This series fixes that. But first, it refactors the scrub-related tests. These are currently dominated by boiler-plate code. They are hard to read and hard to write. In the first half of the series, a new scrub_test is introduced, which moves all the boiler-plate to a central place, allowing the tests to focus on just the aspect of scrub that is tested.
Then, all the found bugs in validate are fixed and finally a new test, checking validate with valid sstable is introduced.

This PR backports https://github.com/scylladb/scylladb/pull/16327.

Fixes: https://github.com/scylladb/scylladb/issues/16326

Closes scylladb/scylladb#18404

* github.com:scylladb/scylladb:
  test/boost/sstable_compaction_test: add validation test with valid sstable
  sstablex/mx/reader: validate(): print trace message when finishing the PI block
  sstablex/mx/reader: validate(): make index-data PI position check message consistent
  sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
  sstablex/mx/reader: validate(): reset the current PI block on partition-start
  sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
  sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
  test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
  test/boost/sstable_compaction_test: fix indentation
  test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
  test/boost/sstable_compaction_test: introduce scrub_test_framework
  test/lib/random_schema: add uncompatible_timestamp_generator()
2024-04-25 08:36:30 +03:00
Jenkins Promoter
7e946925c3 Update ScyllaDB version to: 5.4.7 2024-04-24 23:13:50 +03:00
Botond Dénes
d09e2a2311 test/boost/sstable_compaction_test: add validation test with valid sstable
Add a positive test, as it turns out we had some false-positive
validation bugs in the validator and we need a regression test for this.

(cherry picked from commit 2335f42b2b)
2024-04-24 09:38:57 -04:00
Botond Dénes
6462a2f391 sstablex/mx/reader: validate(): print trace message when finishing the PI block
(cherry picked from commit a19a2d76c9)
2024-04-24 09:37:15 -04:00
Botond Dénes
2e19e7cb6d sstablex/mx/reader: validate(): make index-data PI position check message consistent
The message says "index-data" but when printing the position, the data
position is printed first, causing confusion. Fix this and while at it,
also print the position of the partition start.

(cherry picked from commit 677be168c4)
2024-04-24 09:33:36 -04:00
Botond Dénes
fef7498da2 sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
The validate() consumes the content of partitions in a consume-loop.
Every time the consumer asks for a "break", the next PI block is loaded
and set on the validator, so it can validate that further clustering
elements are indeed from this block.
This loop assumed the consumer would only request interruption when the
current clustering block is finished. This is wrong, the consumer can
also request interruption when yielding is needed. When this is the
case, the next PI block doesn't have to be loaded yet, the current one
is not exhausted yet. Check this condition, before loading the next PI
block, to prevent false positive errors, due to mismatched PI block
and clustering elements from the sstable.

(cherry picked from commit 5bff7c40d3)
2024-04-24 09:32:28 -04:00
Botond Dénes
f45c878149 sstablex/mx/reader: validate(): reset the current PI block on partition-start
It is possible that the next partition has no PI and thus there won't be
a new PI block to overwrite the old one. This will result in
false-positive messages about rows being outside of the finished PI
block.

(cherry picked from commit e073df1dbb)
2024-04-24 09:31:10 -04:00
Botond Dénes
7225410af7 sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
Promoted index entries can be written on any clustering elements,
icluding range tombstones. So the validating consumer also has the check
whether the current expected clustering block is finished, when
consuming a range tombstone. If it is, consumption has to be
interrupted, so that the outer-loop can load up the next promoted index
block, before moving on to the next clustering element.

(cherry picked from commit 2737899c21)
2024-04-24 09:29:32 -04:00
Botond Dénes
564b01fda9 sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
For range tombstone end-bounds, the validate_fragment_order() should be
passed a null tombstone, not a disengaged optional. The latter means no
change in the current tombstone. This caused the end bound of range
tombstones to not make it to the validator and the latter complained
later on partition-end that the partition has unclosed range tombstone.

(cherry picked from commit f46b458f0d)
2024-04-24 09:28:18 -04:00
Botond Dénes
1d80427888 test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
It is not used anymore.

(cherry picked from commit 8be97884ec)
2024-04-24 09:26:46 -04:00
Botond Dénes
ff17ec81e4 test/boost/sstable_compaction_test: fix indentation
(cherry picked from commit da0f4d3a9f)
2024-04-24 09:26:30 -04:00
Botond Dénes
121f2a530e test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit c35092aff6)
2024-04-24 09:03:44 -04:00
Botond Dénes
b77581c84c test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 3f76aad609)
2024-04-24 08:58:18 -04:00
Botond Dénes
ea176bf4ce test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
The test becomes a lot shorter and it now uses random schema and random
data. The test is also split in two: one test for abort mode and one for
skip mode.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 5237e8133b)
2024-04-24 08:45:53 -04:00
Botond Dénes
3835fd681d test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 76785baf43)
2024-04-24 07:57:54 -04:00
Botond Dénes
14da273c4c test/boost/sstable_compaction_test: introduce scrub_test_framework
Scrub tests require a lot of boilerplate code to work. This has a lot of
disadvantages:
* Tests are long
* The "meat" of the test is lost between all the boiler-plate, it is
  hard to glean what a test actually does
* Tests are hard to write, so we have only a few of them and they test
  multiple things.
* The boiler-plate differs sligthly from test-to-test.

To solve this, this patch introduces a new class, `scrub_test_frawmework`,
which is a central place for all the boiler-plate code needed to write
scrub-related tests. In the next patches, we will migrate scrub related
tests to this class.

(cherry picked from commit b6f0c4efa0)
2024-04-24 07:57:54 -04:00
Botond Dénes
33d5f27244 test/lib/random_schema: add uncompatible_timestamp_generator()
Guarantees that produced mutations will not be compactible.

(cherry picked from commit e412673c44)
2024-04-24 07:57:54 -04:00
Wojciech Mitros
c4515a9b99 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718e0a)

Closes scylladb/scylladb#18107
2024-04-24 09:42:18 +02:00
Asias He
10f137e367 repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.

Fixes #18140

Closes scylladb/scylladb#18141

(cherry picked from commit 642f9a1966)
scylla-5.4.6 scylla-5.4.6-candidate
2024-04-18 11:40:06 +03:00
Kamil Braun
53e1ed0ebb Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

(cherry picked from commit a0b331b310)

(cherry picked from commit 72955093eb)

Refs scylladb/scylladb#18184

Closes scylladb/scylladb#18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-17 17:50:30 +02:00
Botond Dénes
1aedc7372d Merge '[Backport 5.4] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.

The feature can be manually verified by doing the following :

1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB

The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.

Fixes https://github.com/scylladb/scylladb/issues/17747

Backported from #17771 to 5.4.

Closes scylladb/scylladb#18248

* github.com:scylladb/scylladb:
  test_bloom_filter.py: disable reclaiming memory from components
  sstable_datafile_test: add tests to verify auto reclamation of components
  test/lib: allow overriding available memory via test_env_config
  sstables_manager: support reclaiming memory from components
  sstables_manager: store available memory size
  sstables_manager: add variable to track component memory usage
  db/config: add a new variable to limit memory used by table components
  sstable_datafile_test: add testcase to verify reclamation from sstables
  sstables: support reclaiming memory from components
2024-04-17 14:33:19 +03:00
Kamil Braun
28781ca37e test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.

(cherry picked from commit 72955093eb)

Backport note: removed `timeout` parameter passed to `server_add`,
missing on this branch. (If server adding hangs, it will timeout after
`TOPOLOGY_TIMEOUT` from scylla_cluster.py)

Removed `force_gossip_join_boot` error injection from test, not present
in this branch. Starting nodes with `experimental_features` disabled.

Added missing `handle_state_normal.*finished` message.
2024-04-17 13:09:39 +02:00
Beni Peled
9218bbb9b9 test.py: add the pytest junit_suite_name parameter
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename

This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename

Closes scylladb/scylladb#18172

(cherry picked from commit 223275b4d1)
2024-04-17 07:02:33 +03:00
Kamil Braun
8faef135cb gossiper: lock local endpoint when updating heart_beat
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b310)
2024-04-16 14:39:05 +02:00
Lakshmi Narayanan Sreethar
75962d3e94 test_bloom_filter.py: disable reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
034304127c sstable_datafile_test: add tests to verify auto reclamation of components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d261f0fbea)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
95068d3c00 test/lib: allow overriding available memory via test_env_config
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 169629dd40)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
3dca49c524 sstables_manager: support reclaiming memory from components
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.

Fixes #17747

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
505a0714a6 sstables_manager: store available memory size
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
b8bd650e32 sstables_manager: add variable to track component memory usage
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
fd0b083414 db/config: add a new variable to limit memory used by table components
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
1609b77b45 sstable_datafile_test: add testcase to verify reclamation from sstables
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e0b6186d16)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
aec4d157da sstables: support reclaiming memory from components
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
2024-04-16 13:05:36 +05:30
Tzach Livyatan
80d7bf7366 Update Driver root page
The right term is Amazon DynamoDB not AWS DynamoDB
See https://aws.amazon.com/dynamodb/

Closes scylladb/scylladb#18214

(cherry picked from commit 289793d964)
2024-04-16 09:53:53 +03:00
Pavel Emelyanov
0dc50ac449 view_update_generator: Unplug from database later
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.

This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.

fixes: #16001

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18132

(cherry picked from commit 3471f30b58)
2024-04-15 14:33:56 +03:00
Botond Dénes
e9d38b3c9e Merge '[Backport 5.4] repair: fix memory counting in repair' from ScyllaDB
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.

Modify what's counted to repair memory limit.

Fixes: #16710.

(cherry picked from commit a4dc6553ab)

(cherry picked from commit 51c09a84cc)

Refs #17785

Closes scylladb/scylladb#18205

* github.com:scylladb/scylladb:
  test: add test for repair_row::size()
  repair: fix memory accounting in repair_row
2024-04-15 14:05:43 +03:00
Aleksandra Martyniuk
bfc4104eb9 test: add test for repair_row::size()
Add test which checs whether repair_row::size() considers external
memory.

(cherry picked from commit 51c09a84cc)
2024-04-06 22:44:51 +00:00
Aleksandra Martyniuk
c1c1fde90f repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.

Include other repair_row's members' memory size in repair memory
limit.

(cherry picked from commit a4dc6553ab)
2024-04-06 22:44:51 +00:00
Ferenc Szili
94a551e671 logging: Don't log PK/CK in large partition/row/cell warning
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.

This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.

The logged data will look like this:

Large cells:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db

Large rows:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db

Large partitions:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db

Fixes #18041

Closes scylladb/scylladb#18166

(cherry picked from commit f1cc6252fd)
2024-04-05 16:02:22 +03:00
Kefu Chai
d6c7a26419 utils/logalloc: do not allocate memory in reclaim_timer::report()
before this change, `reclaim_timer::report()` calls

```c++
fmt::format(", at {}", current_backtrace())
```

which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.

in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.

Fixes #18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18100

(cherry picked from commit fcf7ca5675)
2024-04-02 15:13:49 +03:00
Kamil Braun
7e05a54b9c schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.

(cherry picked from commit 5223d32fab)
2024-04-02 12:53:58 +02:00