Commit Graph

39537 Commits

Author SHA1 Message Date
Botond Dénes
ea176bf4ce test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
The test becomes a lot shorter and it now uses random schema and random
data. The test is also split in two: one test for abort mode and one for
skip mode.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 5237e8133b)
2024-04-24 08:45:53 -04:00
Botond Dénes
3835fd681d test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 76785baf43)
2024-04-24 07:57:54 -04:00
Botond Dénes
14da273c4c test/boost/sstable_compaction_test: introduce scrub_test_framework
Scrub tests require a lot of boilerplate code to work. This has a lot of
disadvantages:
* Tests are long
* The "meat" of the test is lost between all the boiler-plate, it is
  hard to glean what a test actually does
* Tests are hard to write, so we have only a few of them and they test
  multiple things.
* The boiler-plate differs sligthly from test-to-test.

To solve this, this patch introduces a new class, `scrub_test_frawmework`,
which is a central place for all the boiler-plate code needed to write
scrub-related tests. In the next patches, we will migrate scrub related
tests to this class.

(cherry picked from commit b6f0c4efa0)
2024-04-24 07:57:54 -04:00
Botond Dénes
33d5f27244 test/lib/random_schema: add uncompatible_timestamp_generator()
Guarantees that produced mutations will not be compactible.

(cherry picked from commit e412673c44)
2024-04-24 07:57:54 -04:00
Asias He
10f137e367 repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.

Fixes #18140

Closes scylladb/scylladb#18141

(cherry picked from commit 642f9a1966)
scylla-5.4.6 scylla-5.4.6-candidate
2024-04-18 11:40:06 +03:00
Kamil Braun
53e1ed0ebb Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

(cherry picked from commit a0b331b310)

(cherry picked from commit 72955093eb)

Refs scylladb/scylladb#18184

Closes scylladb/scylladb#18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-17 17:50:30 +02:00
Botond Dénes
1aedc7372d Merge '[Backport 5.4] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.

The feature can be manually verified by doing the following :

1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB

The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.

Fixes https://github.com/scylladb/scylladb/issues/17747

Backported from #17771 to 5.4.

Closes scylladb/scylladb#18248

* github.com:scylladb/scylladb:
  test_bloom_filter.py: disable reclaiming memory from components
  sstable_datafile_test: add tests to verify auto reclamation of components
  test/lib: allow overriding available memory via test_env_config
  sstables_manager: support reclaiming memory from components
  sstables_manager: store available memory size
  sstables_manager: add variable to track component memory usage
  db/config: add a new variable to limit memory used by table components
  sstable_datafile_test: add testcase to verify reclamation from sstables
  sstables: support reclaiming memory from components
2024-04-17 14:33:19 +03:00
Kamil Braun
28781ca37e test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.

(cherry picked from commit 72955093eb)

Backport note: removed `timeout` parameter passed to `server_add`,
missing on this branch. (If server adding hangs, it will timeout after
`TOPOLOGY_TIMEOUT` from scylla_cluster.py)

Removed `force_gossip_join_boot` error injection from test, not present
in this branch. Starting nodes with `experimental_features` disabled.

Added missing `handle_state_normal.*finished` message.
2024-04-17 13:09:39 +02:00
Beni Peled
9218bbb9b9 test.py: add the pytest junit_suite_name parameter
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename

This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename

Closes scylladb/scylladb#18172

(cherry picked from commit 223275b4d1)
2024-04-17 07:02:33 +03:00
Kamil Braun
8faef135cb gossiper: lock local endpoint when updating heart_beat
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b310)
2024-04-16 14:39:05 +02:00
Lakshmi Narayanan Sreethar
75962d3e94 test_bloom_filter.py: disable reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
034304127c sstable_datafile_test: add tests to verify auto reclamation of components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d261f0fbea)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
95068d3c00 test/lib: allow overriding available memory via test_env_config
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 169629dd40)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
3dca49c524 sstables_manager: support reclaiming memory from components
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.

Fixes #17747

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
505a0714a6 sstables_manager: store available memory size
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
b8bd650e32 sstables_manager: add variable to track component memory usage
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
fd0b083414 db/config: add a new variable to limit memory used by table components
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
1609b77b45 sstable_datafile_test: add testcase to verify reclamation from sstables
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e0b6186d16)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
aec4d157da sstables: support reclaiming memory from components
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
2024-04-16 13:05:36 +05:30
Tzach Livyatan
80d7bf7366 Update Driver root page
The right term is Amazon DynamoDB not AWS DynamoDB
See https://aws.amazon.com/dynamodb/

Closes scylladb/scylladb#18214

(cherry picked from commit 289793d964)
2024-04-16 09:53:53 +03:00
Pavel Emelyanov
0dc50ac449 view_update_generator: Unplug from database later
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.

This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.

fixes: #16001

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18132

(cherry picked from commit 3471f30b58)
2024-04-15 14:33:56 +03:00
Botond Dénes
e9d38b3c9e Merge '[Backport 5.4] repair: fix memory counting in repair' from ScyllaDB
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.

Modify what's counted to repair memory limit.

Fixes: #16710.

(cherry picked from commit a4dc6553ab)

(cherry picked from commit 51c09a84cc)

Refs #17785

Closes scylladb/scylladb#18205

* github.com:scylladb/scylladb:
  test: add test for repair_row::size()
  repair: fix memory accounting in repair_row
2024-04-15 14:05:43 +03:00
Aleksandra Martyniuk
bfc4104eb9 test: add test for repair_row::size()
Add test which checs whether repair_row::size() considers external
memory.

(cherry picked from commit 51c09a84cc)
2024-04-06 22:44:51 +00:00
Aleksandra Martyniuk
c1c1fde90f repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.

Include other repair_row's members' memory size in repair memory
limit.

(cherry picked from commit a4dc6553ab)
2024-04-06 22:44:51 +00:00
Ferenc Szili
94a551e671 logging: Don't log PK/CK in large partition/row/cell warning
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.

This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.

The logged data will look like this:

Large cells:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db

Large rows:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db

Large partitions:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db

Fixes #18041

Closes scylladb/scylladb#18166

(cherry picked from commit f1cc6252fd)
2024-04-05 16:02:22 +03:00
Kefu Chai
d6c7a26419 utils/logalloc: do not allocate memory in reclaim_timer::report()
before this change, `reclaim_timer::report()` calls

```c++
fmt::format(", at {}", current_backtrace())
```

which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.

in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.

Fixes #18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18100

(cherry picked from commit fcf7ca5675)
2024-04-02 15:13:49 +03:00
Kamil Braun
7e05a54b9c schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.

(cherry picked from commit 5223d32fab)
2024-04-02 12:53:58 +02:00
Tzach Livyatan
2fa581d8fb Docs: Fix link fro scylla-sstable.rst to /architecture/sstable/
Fix https://github.com/scylladb/scylladb/issues/18096

Closes scylladb/scylladb#18097

(cherry picked from commit 4930095d39)
2024-04-02 13:46:43 +03:00
Wojciech Mitros
892c97295b mv: adjust the overhead estimation for view updates
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.

After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes

This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.

The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.

The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.

Fixes #17364

(cherry picked from commit 4c767c379c)

Closes scylladb/scylladb#18105
2024-04-02 10:26:28 +02:00
Jenkins Promoter
5f6b1dc5b9 Update ScyllaDB version to: 5.4.6 2024-04-01 23:37:05 +03:00
Kefu Chai
f1d547e74e bytes.hh: stop at '}' in fmt::formatter<fmt_hex>
according to {fmt}'s document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,

```
  // the range will contain "f} continued". The formatter should parse
  // specifiers until '}' or the end of the range. In this example the
  // formatter should parse the 'f' specifier and return an iterator
  // pointing to '}'.
```

so we should check for _both_ '}' and end of the range. when building
scylla with {fmt} 10.2.1, it fails to build code like

```c++
fmt::format_to(out, "{}", fmt_hex(frag))
```

as {fmt}'s compile-time checker fails to parse this format string
along with given argument, as at compile time,
```c++
throw format_error("invalid group_size")
```
is executed.

so, in this change, we check both '}' and the end of range.

the change which introduced this formatter was
2f9dfba800

Refs 2f9dfba800
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ef2afa75d1)

Closes scylladb/scylladb#18086
2024-03-29 09:45:00 +02:00
Michał Chojnowski
3df5de60a9 cache_flat_mutation_reader: only call get_iterator_in_latest() when pointing at a row
Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not
pointing at a row. In particular, the iterator returned by such call might be
dangling.

We have observed this to cause a use-after-free in the field, when a reverse
read called `maybe_add_to_cache` after `_latest_it` was left dangling after
a dead row removal in `copy_from_cache_to_buffer`.

To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest`
is pointing at a row.

Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous.
As far as I can see, other occurrences can't break anything as of now.
But we apply fixes to them anyway.

(cherry picked from commit 04db6d4bb1)

Closes scylladb/scylladb#18075
2024-03-28 11:04:28 +01:00
Botond Dénes
fd7d57b9fa tools/toolchain: update python driver
Backports scylladb/scylladb#17604 and scylladb/scylladb#17956.

Fixes scylladb/scylladb#16709
Fixes scylladb/scylladb#17353

Closes scylladb/scylladb#17653
scylla-5.4.5-candidate scylla-5.4.5
2024-03-26 13:27:34 +02:00
Botond Dénes
8a6c543033 Merge '[Backport 5.4] tests: utils: error injection: print time duration instead of count' from ScyllaDB
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.

Fixes #15902

(cherry picked from commit 8a5689e7a7)

(cherry picked from commit 1d33a68dd7)

Closes scylladb/scylladb#17909

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-03-25 17:40:31 +02:00
Pavel Emelyanov
27beb0fe60 Update seastar submodule (dupliex IO queue activation fix)
* seastar 9d44e5eb...ae05c138 (1):
  > fair_queue: Do not pop unplugged class immediately

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 13:01:07 +03:00
Botond Dénes
e868ade258 repair: resolve start-up deadlock
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:

node1                              node2
repair1_master (takes permit)      repair1_follower (waits on permit)
repair2_master (waits for permit)  repair2_follower (takes permit)

In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.

Refs: #17644
Fixes: #17591

Closes scylladb/scylladb#17646

(cherry picked from commit 65b9e10543)

Closes scylladb/scylladb#17950
2024-03-21 13:12:40 +02:00
Kefu Chai
57fa61e2ca tests: utils: error injection: print time duration instead of count
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)

test/boost/error_injection_test.cc(167): fatal error:
    in "test_inject_future_disabled":
      critical check wait_time > sleep_msec has failed [23839ns <= 10ms]

Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
2024-03-20 09:32:00 +00:00
Kefu Chai
6fc3c62223 error_injection: do not cast to milliseconds when injecting timeout
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.

Fixes #15902

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
2024-03-20 09:32:00 +00:00
Anna Stuchlik
8414aa292a doc: fix the image upgrade page
This commit updates the Upgrade ScyllaDB Image page.

- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.

(cherry picked from commit fc90112b97)

Closes scylladb/scylladb#17886
2024-03-19 14:38:22 +02:00
Aleksandra Martyniuk
881ac7a9af test: fix regular compaction tasks check
Since 6b87778 regular compaction tasks are removed from task manager
immediately after they are finished.

test_regular_compaction_task lists compaction tasks and then requests
their statuses. Only one regular compaction task is guaranteed to still
be running at that time, the rest of them may finish before their status
is requested and so it will no longer be in task manager, causing the test
to fail.

Fix statuses check to consider the possibility of a regular compaction
task being removed from task manager.

Fixes: #17776.
(cherry picked from commit 80c5eb4ecb)

Closes scylladb/scylladb#17810
2024-03-15 08:54:00 +02:00
Tomasz Grabiec
24db04dbe4 Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.4.

We haven't identified any specific issues in test or field in 5.4/2024.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.

For 5.4 it's even more important than 5.2 because in 5.4 in Raft mode schema pulls are disabled.

Closes scylladb/scylladb#17641

* github.com:scylladb/scylladb:
  raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
  migration_manager: fix indentation after the previous patch.
  messaging_service: process migration_request rpc on shard 0
  migration_manager: take group0 lock during raft snapshot taking
2024-03-14 23:42:49 +01:00
Jenkins Promoter
9fcb7baa9e Update ScyllaDB version to: 5.4.5 2024-03-13 16:18:59 +02:00
Nadav Har'El
8a1f01ad88 alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172

(cherry picked from commit 21e7deafeb)
2024-03-13 14:46:03 +02:00
Raphael S. Carvalho
db1c8e8754 Fix potential data resurrection when another compaction type does cleanup work
Since commit f1bbf70, many compaction types can do cleanup work, but turns out
we forgot to invalidate cache on their completion.

So if a node regains ownership of token that had partition deleted in its previous
owner (and tombstone is already gone), data can be resurrected.

Tablet is not affected, as it explicitly invalidates cache during migration
cleanup stage.

Scylla 5.4 is affected.

Fixes #17501.
Fixes #17452.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17502

(cherry picked from commit f07c233ad5)
2024-03-13 14:10:05 +02:00
Kamil Braun
6e01e821d7 test: unflake test_topology_remove_garbage_group0
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
   peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```

Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.

Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.

Fixes scylladb/scylladb#16373

(cherry picked from commit a68701ed4f)

Closes scylladb/scylladb#17702
2024-03-12 10:50:27 +01:00
Israel Fruchter
02182caff4 Update tools/cqlsh submodule
* tools/cqlsh 426fa0ea...99b2b777 (1):
  > `COPY TO STDOUT` shouldn't put None where a function is expected

Closes scylladb/scylladb#17728
2024-03-11 21:52:02 +02:00
Michał Chojnowski
42d9e36454 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.

Fixes #17625

(cherry picked from commit 7a7b8972e5)

Closes scylladb/scylladb#17717
2024-03-11 12:05:55 +02:00
Lakshmi Narayanan Sreethar
05d2078911 reader_permit: store schema_ptr instead of raw schema pointer
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.

Fixes #16180

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#16658

(cherry picked from commit 76f0d5e35b)

Closes scylladb/scylladb#17677
2024-03-08 08:29:22 +02:00
Gleb Natapov
7908f69b7c raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations a valid on shard 0 only. Assert that.

(cherry picked from commit 9847e272f9)
2024-03-05 20:36:44 +01:00
Gleb Natapov
a762ab6283 migration_manager: fix indentation after the previous patch.
(cherry picked from commit 77907b97f1)
2024-03-05 20:36:34 +01:00