Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closes#19011
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
(cherry picked from commit 8df6d10e88)
Closes#18558
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
(cherry picked from commit f1cc6252fd)
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
Fixes#17890
(cherry picked from commit 9789a3dc7c)
Closes#18104
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.
Fixes#17854
(cherry picked from commit efcb718)
Closes#17999
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.
We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!
We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.
The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
Fixes#17119
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17172
(cherry picked from commit 21e7deafeb)
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incrmented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes#17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17164
(cherry picked from commit 14315fcbc3)
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).
This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.
This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.
Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.
In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.
Fixes: scylladb/scylladb#15622Closesscylladb/scylladb#15972
(cherry picked from commit f094e23d84)
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
(cherry picked from commit 6ffb482bf3)
Fixes#9405
`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.
To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.
Closes#14534
* github.com:scylladb/scylladb:
db: hints: add checksum to sync point encoding
db: hints: add the version_size constant
(cherry picked from commit eb6202ef9c)
The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.
Closes#16458
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.
Closesscylladb/scylladb#15428
(cherry picked from commit 62a8a31be7)
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.
The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.
Fixes#6389.
Closes#13150
* github.com:scylladb/scylladb:
test/alternator: test concurrent TagResource / UntagResource
db/tags: drop unsafe update_tags() utility function
alternator: isolate concurrent modification to tags
db/tags: add safe modify_tags() utility functions
migration_manager: expose access to storage_proxy
(cherry picked from commit dba1d36aa6)
Closes#16453
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.
Fixes#15110
(cherry picked from commit 9fd3df13a2)
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
(cherry picked from commit 135b4fd434)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
(cherry picked from commit 48164e1d09)
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
(cherry picked from commit cf81eef370)
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.
This fix affects also alternator's secondary indexes.
Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844Closesscylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
scylla-sstable currently has two ways to obtain the schema:
* via a `schema.cql` file.
* load schema definition from memory (only works for system tables).
This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.
This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.
Example:
```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```
As seen above, subdirectories like qurantine, staging etc are also supported.
Fixes: https://github.com/scylladb/scylladb/issues/10126Closes#13448
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add tests for schema loading
test/cql-pytest: add no_autocompaction_context
docs: scylla-sstable.rst: remove accidentally added copy-pasta
docs: scylla-sstable.rst: remove paragraph with schema limitations
docs: scylla-sstable.rst: update schema section
test/cql-pytest: nodetool.py: add flush_keyspace()
tools/scylla-sstable: reform schema loading mechanism
tools/schema_loader: add load_schema_from_schema_tables()
db/schema_tables: expose types schema
(cherry picked from commit 952b455310)
Closes#15386
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
(cherry picked from commit 5b167a4ad7)
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
Fixes: #14819Closes#14821
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
(cherry picked from commit 056d04954c)
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes#14503
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes#12798)
Closes#13531
* github.com:scylladb/scylladb:
Merge 'Do not mask node operation errors' from Benny Halevy
Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
storage_service: Wait for normal state handler to finish in replace
storage_service: Wait for normal state handler to finish in bootstrap
storage_service: Send heartbeat earlier for node ops
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes#13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes#13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13462
(cherry picked from commit cc42f00232)
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.
If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.
It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.
It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.
There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.
We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.
The race can still happen, but the time window is now much smaller.
The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.
Fixes#12989
Refs #12969Closes#13028
* github.com:scylladb/scylladb:
storage_service: node ops: Extract node_ops_insert() to reduce code duplication
storage_service: Make node operations safer by detecting asymmetric abort
storage_service: node ops: Add error injections
service: node_ops: Make watchdog and heartbeat intervals configurable
(cherry picked from commit 2b44631ded)
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
Fixes#13594Closes#13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
(cherry picked from commit 10c1f1dc80)
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.
(cherry picked from commit 20069372e7)
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.
Closes#12950
* github.com:scylladb/scylladb:
commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
commitlog: change type of stored size
(cherry picked from commit e70be47276)
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.
refs: #13100
refs: #10941Closes#13141
* github.com:scylladb/scylladb:
database: Increase verbosity of database::stop() method
large_data_handler: Increase verbosity on shutdown
large_data_handler: Coroutinize .stop() method
(cherry picked from commit e22b27a107)
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, it even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
Fixes: #11905
Refs: #11836Closes#11860
(cherry picked from commit 84a69b6adb)
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`
Fixes: #12249Closes#12978
* github.com:scylladb/scylladb:
flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
make_nonforwardable: test through run_mutation_source_tests
make_nonforwardable: next_partition and fast_forward_to when single_partition is true
make_forwardable: fix next_partition
flat_mutation_reader_v2: drop forward_buffer_to
nonforwardable reader: fix indentation
nonforwardable reader: refactor, extract reset_partition
nonforwardable reader: add more tests
nonforwardable reader: no partition_end after fast_forward_to()
nonforwardable reader: no partition_end after next_partition()
nonforwardable reader: no partition_end for empty reader
row_cache: pass partition_start though nonforwardable reader
(cherry picked from commit 46efdfa1a1)
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.
after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
Fixes#12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12573
(cherry picked from commit 4a0134a097)
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only us slows down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes#12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes#12645Closes#12646
(cherry picked from commit fa7e904cd6)
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.
Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).
This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).
So in this patch we add a new configuration option, restrict_future_timestamp
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
sinced #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, like is allowed Cassanda (and was allowed in
Scylla prior to #5619.
The error message in the case where a futuristic timestamp is rejected
now mentions the configuration paramter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).
This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.
Fixes#12527
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12537
The replace_address options are still supported
But mention in their description that they are now deprecated
and the user should use replace_node_first_boot instead.
While at it fix a typo in ignore_dead_nodes_for_replace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For replacing a node given its (now unique) Host ID.
The existing options for replace_address*
will be deprecated in the following patches
and eventually we will stop supporting them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).
In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.