2890 Commits

Author SHA1 Message Date
Lakshmi Narayanan Sreethar
705ec24977 db/config.cc: increment components_memory_reclaim_threshold config default
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes #18607

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)

Closes #19011
2024-06-04 07:13:28 +03:00
Kamil Braun
b68c06cc3a direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.

With the change, the phenomenon no longer appears.

(cherry picked from commit 8df6d10e88)

Closes #18558
2024-05-08 15:46:59 +02:00
Pavel Emelyanov
5f5acc813a view-builder: Print correct exception in built ste exception handler
Inside .handle_exception() continuation std::current_exception() doesn't
work, there's std::exception ex argument to handler's lambda instead

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349

(cherry picked from commit 4ac30e5337)
2024-05-01 10:20:26 +03:00
Lakshmi Narayanan Sreethar
786c08aa59 db/config: add a new variable to limit memory used by table components
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
2024-04-16 15:30:39 +05:30
Ferenc Szili
2bb5fe7311 logging: Don't log PK/CK in large partition/row/cell warning
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.

This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.

The logged data will look like this:

Large cells:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db

Large rows:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db

Large partitions:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db

Fixes #18041

Closes scylladb/scylladb#18166

(cherry picked from commit f1cc6252fd)
2024-04-05 16:03:08 +03:00
Wojciech Mitros
c0c34d2af0 mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

(cherry picked from commit 9789a3dc7c)

Closes #18104
2024-04-02 10:09:01 +02:00
Wojciech Mitros
44bcaca929 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718)

Closes #17999
2024-03-26 09:38:17 +02:00
Nadav Har'El
08077ff3e8 alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172

(cherry picked from commit 21e7deafeb)
2024-03-13 15:08:32 +02:00
Nadav Har'El
6a6115cd86 mv: fix missing view deletions in some cases of range tombstones
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.

This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).

Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.

This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:

1. The counter "_op_count" that decides where to break the output mutation
   should only be incremented when adding rows to this output mutation.
   The existing code strangely incrmented it on every read (!?) which
   resulted in the counter being incremented on every *input* fragment,
   and in particular could reach the limit 100 between two range
   tombstone pieces.

2. Moreover, the length of output was checked in the wrong place...
   The existing code could get to 100 rows, not check at that point,
   read the next input - half a range tombstone - and only *then*
   check that we reached 100 rows and stop. The fix is to calculate
   the number of rows in the right place - exactly when it's needed,
   not before the step.

The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.

This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).

By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).

Fixes #17117

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17164

(cherry picked from commit 14315fcbc3)
2024-02-22 15:36:58 +02:00
Kamil Braun
784695e3ac system_keyspace: use system memory for system.raft table
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb/scylladb#15622

Closes scylladb/scylladb#15972

(cherry picked from commit f094e23d84)
2024-01-25 17:59:49 +01:00
Calle Wilund
aaa25e1a78 Commitlog replayer: Range-check skip call
Fixes #15269

If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.

v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
    ensure we log the fact that things are broken

Closes scylladb/scylladb#15270

(cherry picked from commit 6ffb482bf3)
2024-01-05 09:19:45 +02:00
Kamil Braun
287546923e Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak
Fixes #9405

`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.

Closes #14534

* github.com:scylladb/scylladb:
  db: hints: add checksum to sync point encoding
  db: hints: add the version_size constant

(cherry picked from commit eb6202ef9c)

The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.

Closes #16458
2023-12-19 17:39:50 +02:00
Michael Huang
5499f7b5a8 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428

(cherry picked from commit 62a8a31be7)
2023-12-19 13:43:23 +01:00
Botond Dénes
46a29e9a02 Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.

The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.

Fixes #6389.

Closes #13150

* github.com:scylladb/scylladb:
  test/alternator: test concurrent TagResource / UntagResource
  db/tags: drop unsafe update_tags() utility function
  alternator: isolate concurrent modification to tags
  db/tags: add safe modify_tags() utility functions
  migration_manager: expose access to storage_proxy

(cherry picked from commit dba1d36aa6)

Closes #16453
2023-12-19 10:19:31 +02:00
Petr Gusev
b9178bd853 hints: send_one_hint: extend the scope of file_send_gate holder
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.

Fixes #15110

(cherry picked from commit 9fd3df13a2)
2023-12-18 13:03:23 +02:00
Kefu Chai
0da3453f95 db: schema_tables: capture reference to temporary value by value
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.

this also silences warning from GCC-13:

```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                              ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fixes #13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13721

(cherry picked from commit 135b4fd434)
2023-12-15 13:55:57 +02:00
Kamil Braun
4101c8beab schema_tables: remove default value for reload in merge_schema
To avoid bugs like the one fixed in the previous commit.

(cherry picked from commit 4376854473)
2023-11-21 01:29:28 +01:00
Kamil Braun
c994ed2057 schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.

(cherry picked from commit 48164e1d09)
2023-11-21 01:29:28 +01:00
Avi Kivity
40eed1f1c5 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables

(cherry picked from commit cf81eef370)
2023-11-21 01:29:28 +01:00
Marcin Maliszkiewicz
900754d377 db: view: run local materialized view mutations on a separate smp service group
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.

This fix affects also alternator's secondary indexes.

Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000

Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.

Fixes https://github.com/scylladb/scylladb/issues/15844

Closes scylladb/scylladb#15845

(cherry picked from commit 020a9c931b)
2023-11-19 18:54:46 +02:00
Tomasz Grabiec
573ef87245 Merge ' tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes
scylla-sstable currently has two ways to obtain the schema:

    * via a `schema.cql` file.
    * load schema definition from memory (only works for system tables).

This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.

This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.

Example:

```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```

As seen above, subdirectories like qurantine, staging etc are also supported.

Fixes: https://github.com/scylladb/scylladb/issues/10126

Closes #13448

* github.com:scylladb/scylladb:
  test/cql-pytest: test_tools.py: add tests for schema loading
  test/cql-pytest: add no_autocompaction_context
  docs: scylla-sstable.rst: remove accidentally added copy-pasta
  docs: scylla-sstable.rst: remove paragraph with schema limitations
  docs: scylla-sstable.rst: update schema section
  test/cql-pytest: nodetool.py: add flush_keyspace()
  tools/scylla-sstable: reform schema loading mechanism
  tools/schema_loader: add load_schema_from_schema_tables()
  db/schema_tables: expose types schema

(cherry picked from commit 952b455310)

Closes #15386
2023-11-02 17:25:18 +02:00
Benny Halevy
14113dc23e schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
2023-10-29 19:14:06 +02:00
Benny Halevy
4d5a99f3b8 database: create_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
2023-10-29 19:13:55 +02:00
Benny Halevy
ffe28b3e3f database: update_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
2023-10-29 19:06:45 +02:00
Benny Halevy
1459306603 database: drop_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
2023-10-29 19:06:25 +02:00
Patryk Jędrzejczak
4cd5847761 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4ad7)
2023-08-02 18:05:39 +02:00
Nadav Har'El
e34c62c567 Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()

(cherry picked from commit 056d04954c)
2023-07-31 03:43:44 -04:00
Michał Chojnowski
75933b9906 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but we this patch makes it variable for
unit testing purposes.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
fc7b02c8e4 view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-11 09:44:00 +02:00
Botond Dénes
486483b379 Merge '[Backport 5.2]: node ops backports' from Benny Halevy
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes #12798)

Closes #13531

* github.com:scylladb/scylladb:
  Merge 'Do not mask node operation errors' from Benny Halevy
  Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
  storage_service: Wait for normal state handler to finish in replace
  storage_service: Wait for normal state handler to finish in bootstrap
  storage_service: Send heartbeat earlier for node ops
2023-05-17 16:46:49 +03:00
Raphael S. Carvalho
26b4d2c3c1 db/view/build_progress_virtual_reader: Fix use-after-move
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.

static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
                _db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),

Fixes #13395.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
2023-05-15 20:26:01 +03:00
Benny Halevy
5785550e24 view: view_builder: start: demote sleep_aborted log error
This is not really an error, so print it in debug log_level
rather than error log_level.

Fixes #13374

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13462

(cherry picked from commit cc42f00232)
2023-05-14 21:21:59 +03:00
Marcin Maliszkiewicz
a2fed1588e db: view: use deferred_close for closing staging_sstable_reader
When consume_in_thread throws the reader should still be closed.

Related https://github.com/scylladb/scylla-enterprise/issues/2661

Closes #13398
Refs: scylladb/scylla-enterprise#2661
Fixes: #13413

(cherry picked from commit 99f8d7dcbe)
2023-05-08 09:41:07 +03:00
Kamil Braun
42fd3704e4 Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.

If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.

It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.

It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.

There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.

We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.

The race can still happen, but the time window is now much smaller.

The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.

Fixes #12989
Refs #12969

Closes #13028

* github.com:scylladb/scylladb:
  storage_service: node ops: Extract node_ops_insert() to reduce code duplication
  storage_service: Make node operations safer by detecting asymmetric abort
  storage_service: node ops: Add error injections
  service: node_ops: Make watchdog and heartbeat intervals configurable

(cherry picked from commit 2b44631ded)
2023-04-30 18:58:28 +03:00
Botond Dénes
50095cc3a5 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID

(cherry picked from commit 10c1f1dc80)
2023-04-23 16:03:02 +03:00
Wojciech Mitros
5fd4bb853b uda: return aggregate functions as shared pointers
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.

(cherry picked from commit 20069372e7)
2023-04-17 13:14:24 +02:00
Botond Dénes
128050e984 Merge 'commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off' from Calle Wilund
Fixes #12810

We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.

Closes #12950

* github.com:scylladb/scylladb:
  commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
  commitlog: change type of stored size

(cherry picked from commit e70be47276)
2023-04-03 08:57:43 +03:00
Botond Dénes
e380c24c69 Merge 'Improve database shutdown verbosity' from Pavel Emelyanov
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.

refs: #13100
refs: #10941

Closes #13141

* github.com:scylladb/scylladb:
  database: Increase verbosity of database::stop() method
  large_data_handler: Increase verbosity on shutdown
  large_data_handler: Coroutinize .stop() method

(cherry picked from commit e22b27a107)
2023-03-30 17:01:24 +03:00
Botond Dénes
c013336121 db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, it even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.

Fixes: #11905
Refs: #11836

Closes #11860

(cherry picked from commit 84a69b6adb)
2023-03-22 09:03:50 +02:00
Botond Dénes
bd4f9e3615 Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`

Fixes: #12249

Closes #12978

* github.com:scylladb/scylladb:
  flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
  make_nonforwardable: test through run_mutation_source_tests
  make_nonforwardable: next_partition and fast_forward_to when single_partition is true
  make_forwardable: fix next_partition
  flat_mutation_reader_v2: drop forward_buffer_to
  nonforwardable reader: fix indentation
  nonforwardable reader: refactor, extract reset_partition
  nonforwardable reader: add more tests
  nonforwardable reader: no partition_end after fast_forward_to()
  nonforwardable reader: no partition_end after next_partition()
  nonforwardable reader: no partition_end for empty reader
  row_cache: pass partition_start though nonforwardable reader

(cherry picked from commit 46efdfa1a1)
2023-03-16 10:42:03 +02:00
Kefu Chai
b2699743cc db: system_keyspace: take the reserved_memory into account
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.

after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.

Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12573

(cherry picked from commit 4a0134a097)
2023-02-05 18:30:05 +02:00
Benny Halevy
0f9fe61d91 view: row_lock: lock_ck: find or construct row_lock under partition lock
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.

This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.

This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only us slows down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.

This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.

Fixes #12632

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
2023-02-05 17:22:31 +02:00
Michał Chojnowski
608ef92a71 commitlog: fix total_size_on_disk accounting after segment file removal
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.

`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.

The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.

The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.

Fixes #12645

Closes #12646

(cherry picked from commit fa7e904cd6)
2023-02-01 21:54:37 +02:00
Kamil Braun
a483915c62 db: system_keyspace: add a virtual table with raft configuration
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.

Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.

Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
2023-01-17 12:28:00 +01:00
Kamil Braun
2bfe85ce9b db: system_keyspace: improve system.raft_snapshot_config schema
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.

Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
2023-01-17 12:28:00 +01:00
Nadav Har'El
5bf94ae220 cql: allow disabling of USING TIMESTAMP sanity checking
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).

This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).

So in this patch we add a new configuration option, restrict_future_timestamp
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
sinced #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, like is allowed Cassanda (and was allowed in
Scylla prior to #5619.

The error message in the case where a futuristic timestamp is rejected
now mentions the configuration paramter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).

This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.

Fixes #12527

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12537
2023-01-16 23:18:56 +02:00
Benny Halevy
1577aa8098 db: config: describe replace_address* options as deprecated
The replace_address options are still supported
But mention in their description that they are now deprecated
and the user should use replace_node_first_boot instead.

While at it fix a typo in ignore_dead_nodes_for_replace

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
32e79185d4 db: config: add replace_node_first_boot option
For replacing a node given its (now unique) Host ID.

The existing options for replace_address*
will be deprecated in the following patches
and eventually we will stop supporting them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Kamil Braun
be390285b6 db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
A single node will run a single Raft server in any given Raft group,
so this column is not necessary.
2023-01-12 16:48:50 +01:00
Kamil Braun
bed555d1e5 db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).

In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
2023-01-12 16:21:26 +01:00