Compare commits

...

431 Commits

Author SHA1 Message Date
Beni Peled
447a3beb47 release: prepare for 5.2.17 2024-03-27 14:35:37 +02:00
Botond Dénes
aca50c46b7 tools/toolchain: update python driver
Backports scylladb/scylladb#17604 and scylladb/scylladb#17956.

Fixes scylladb/scylladb#16709
Fixes scylladb/scylladb#17353

Closes #17661
2024-03-27 08:48:25 +02:00
Wojciech Mitros
44bcaca929 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718)

Closes #17999
2024-03-26 09:38:17 +02:00
Botond Dénes
2e2bf79092 Merge '[Backport 5.2] tests: utils: error injection: print time duration instead of count' from ScyllaDB
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.

Fixes #15902

(cherry picked from commit 8a5689e7a7)

(cherry picked from commit 1d33a68dd7)

Closes #17911

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-03-25 17:41:23 +02:00
Pavel Emelyanov
616199f79c Update seastar submodule (dupliex IO queue activation fix)
* seastar ad0f2d5d...eb093f8a (1):
  > fair_queue: Do not pop unplugged class immediately

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 12:58:24 +03:00
Wojciech Mitros
5dfb6c9ead mv: adjust the overhead estimation for view updates
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.

After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes

This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.

The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.

The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.

Fixes #17364

(cherry picked from commit 5ab3586135)

Closes #17858
2024-03-20 13:52:23 +02:00
Kefu Chai
6209f5d6d4 tests: utils: error injection: print time duration instead of count
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)

test/boost/error_injection_test.cc(167): fatal error:
    in "test_inject_future_disabled":
      critical check wait_time > sleep_msec has failed [23839ns <= 10ms]

Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
2024-03-20 09:40:16 +00:00
Kefu Chai
ac288684c6 error_injection: do not cast to milliseconds when injecting timeout
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.

Fixes #15902

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
2024-03-20 09:40:16 +00:00
Raphael S. Carvalho
fc1d126f31 replica: Fix major compaction semantics by performing off-strategy first
Major compaction semantics is that all data of a table will be compacted
together, so user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that all data in maintenance set won't be included
for major, until they're promoted into main set by off-strategy.
So user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in maintenance
set will be made available by major. A similar approach is done for
data in memtable, so flush is performed before major starts.
The only exception will be data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization in comapaction manager of reshape jobs guarantee
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes #11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15792

(cherry picked from commit ea6c281b9f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #17901
2024-03-20 08:48:17 +02:00
Anna Stuchlik
70edebd8d7 doc: fix the image upgrade page
This commit updates the Upgrade ScyllaDB Image page.

- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.

(cherry picked from commit fc90112b97)

Closes #17885
2024-03-19 16:47:06 +02:00
Petr Gusev
dffc0fb720 repair_meta: get_estimated_partitions fix
The shard_range parameter was unused.

Fixes: #17863

(cherry picked from commit b9f527bfa8)
2024-03-18 14:27:45 +02:00
Kamil Braun
0bb338c521 test: remove test_writes_to_recent_previous_cdc_generations
The test in its original form relies on the
`error_injections_at_startup` feature, which 5.2 doesn't have, so I
adapted the test to enable error injections after bootstrapping nodes in
the backport (9c44bbce67). That is however
incorrect, it's important for the injection to be enabled while the
nodes are booting, otherwise the test will be flaky, as we observed.
Details in scylladb/scylladb#17749.

Remove the test from 5.2 branch.

Fixes scylladb/scylladb#17749

Closes #17750
2024-03-15 10:22:22 +02:00
Tomasz Grabiec
cefa19eb93 Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.2.

We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.

Closes #17640

* github.com:scylladb/scylladb:
  migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
  raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
  migration_manager: fix indentation after the previous patch.
  messaging_service: process migration_request rpc on shard 0
  migration_manager: take group0 lock during raft snapshot taking
2024-03-14 23:41:02 +01:00
Nadav Har'El
08077ff3e8 alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172

(cherry picked from commit 21e7deafeb)
2024-03-13 15:08:32 +02:00
Kamil Braun
405387c663 test: unflake test_topology_remove_garbage_group0
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
   peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```

Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.

Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.

Fixes scylladb/scylladb#16373

(cherry picked from commit a68701ed4f)

Closes #17703
2024-03-12 13:43:21 +01:00
Kamil Braun
b567364af1 migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
Jumping to shard 0 during group 0 snapshot transfer is required because
we take group 0 lock, onyl available on shard 0. But outside of Raft
mode it only pessimizes performance unnecessarily, so don't do it.
2024-03-12 11:19:31 +01:00
Botond Dénes
3897d44893 repair: resolve start-up deadlock
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:

node1                              node2
repair1_master (takes permit)      repair1_follower (waits on permit)
repair2_master (waits for permit)  repair2_follower (takes permit)

In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.

Refs: #17644
Fixes: #17591

Closes #17646

(cherry picked from commit c6e108a)

Backport notes:

The fix above does not apply to 5.2, because on 5.2 the reader is
created immediately when the repair-meta is created. So we don't need
the game with a fake inactive read, we can just pause the already
created reader in the repair-reader constructor.

Closes #17730
2024-03-12 08:24:26 +02:00
Michał Chojnowski
54048e5613 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.

Fixes #17625

(cherry picked from commit 7a7b8972e5)

Closes #17725
2024-03-11 16:17:32 +02:00
Lakshmi Narayanan Sreethar
e8736ae431 reader_permit: store schema_ptr instead of raw schema pointer
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.

Fixes #16180

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#16658

(cherry picked from commit 76f0d5e35b)

Closes #17694
2024-03-08 10:56:12 +02:00
Gleb Natapov
42d25f1911 raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations a valid on shard 0 only. Assert that.

(cherry picked from commit 9847e272f9)
2024-03-05 16:51:13 +01:00
Gleb Natapov
619f75d1de migration_manager: fix indentation after the previous patch.
(cherry picked from commit 77907b97f1)
2024-03-05 16:50:24 +01:00
Gleb Natapov
6dd31dcade messaging_service: process migration_request rpc on shard 0
Commit 0c376043eb added access to group0
semaphore which can be done on shard0 only. Unlike all other group0 rpcs
(that already always forwarded to shard0) migration_request does not
since it is an rpc that what reused from non raft days. The patch adds
the missing jump to shard0 before executing the rpc.

(cherry picked from commit 4a3c79625f)
2024-03-05 16:49:23 +01:00
Gleb Natapov
dd65bf151b migration_manager: take group0 lock during raft snapshot taking
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.

Fixes scylladb/scylladb#16821

(cherry picked from commit 0c376043eb)

Backport note: introduced missing
`raft_group0_client::hold_read_apply_mutex`
2024-03-05 16:40:00 +01:00
Yaron Kaikov
f4a7804596 release: prepare for 5.2.16 2024-03-03 14:33:44 +02:00
Botond Dénes
ba373f83e4 Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.

To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.

Use the method in streaming and repair to continue them when
the table is dropped in the meantime.

Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.

Closes #17528

* github.com:scylladb/scylladb:
  repair: handle no_such_column_family from remote node gracefully
  test: test drop table on receiver side during streaming
  streaming: fix indentation
  streaming: handle no_such_column_family from remote node gracefully
  repair: add methods to skip dropped table
2024-02-28 16:33:01 +02:00
Kamil Braun
d82c757323 Merge 'misc_services: fix data race from bad usage of get_next_version' from Piotr Dulikowski
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.

The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.

The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.

There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:

1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.

Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:

1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.

The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.

Refs: #17493

Closes scylladb/scylladb#17499

* github.com:scylladb/scylladb:
  version_generator: check that get_next_version is called on shard 0
  misc_services: fix data race from bad usage of get_next_version

(cherry picked from commit fd32e2ee10)
2024-02-28 14:28:03 +01:00
Aleksandra Martyniuk
78aeb990a6 repair: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.

(cherry picked from commit cf36015591)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
23493bb342 test: test drop table on receiver side during streaming
(cherry picked from commit 2ea5d9b623)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
d19afd7059 streaming: fix indentation
(cherry picked from commit b08f539427)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
4e200aa250 streaming: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.

(cherry picked from commit 219e1eda09)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
afca1142cd repair: add methods to skip dropped table
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.

Add method which waits until all raft changes are applied and then
checks whether given table exists.

Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.

(cherry picked from commit 5202bb9d3c)
2024-02-28 11:45:54 +01:00
Botond Dénes
ce1a422c9c Merge '[Backport 5.2] sstables: close index_reader in has_partition_key' from Aleksandra Martyniuk
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

Fixes: https://github.com/scylladb/scylladb/issues/17232.

Closes #17532

* github.com:scylladb/scylladb:
  test: add test to check if reader is closed
  sstables: close index_reader in has_partition_key
2024-02-27 16:12:17 +02:00
Aleksandra Martyniuk
296be93714 test: add test to check if reader is closed
Add test to check if reader is closed in sstable::has_partition_key.

(cherry picked from commit 4530be9e5b)
2024-02-26 16:17:12 +01:00
Aleksandra Martyniuk
6feb802d54 sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

(cherry picked from commit 5227336a32)
2024-02-26 16:17:12 +01:00
Avi Kivity
9c44bbce67 Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.

This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.

The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.

Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.

Apart from this change, this PR adds tests for it and updates
the documentation.

This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.

Fixes scylladb/scylladb#7251
Fixes scylladb/scylladb#15260

Closes scylladb/scylladb#17134

* github.com:scylladb/scylladb:
  docs: using-scylla: cdc: remove info about failing writes to old generations
  docs: dev: cdc: document writing to previous CDC generations
  test: add test_writes_to_previous_cdc_generations
  cdc: generation: allow increasing generation_leeway through error injection
  cdc: metadata: allow sending writes to the previous generations

(cherry picked from commit 9bb4482ad0)

Backport note: replaced `servers_add` with `server_add` loop in tests
replaced `error_injections_at_startup` (not implemented in 5.2) with
`enable_injection` post-boot
2024-02-22 15:05:19 +01:00
Nadav Har'El
6a6115cd86 mv: fix missing view deletions in some cases of range tombstones
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.

This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).

Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.

This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:

1. The counter "_op_count" that decides where to break the output mutation
   should only be incremented when adding rows to this output mutation.
   The existing code strangely incrmented it on every read (!?) which
   resulted in the counter being incremented on every *input* fragment,
   and in particular could reach the limit 100 between two range
   tombstone pieces.

2. Moreover, the length of output was checked in the wrong place...
   The existing code could get to 100 rows, not check at that point,
   read the next input - half a range tombstone - and only *then*
   check that we reached 100 rows and stop. The fix is to calculate
   the number of rows in the right place - exactly when it's needed,
   not before the step.

The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.

This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).

By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).

Fixes #17117

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17164

(cherry picked from commit 14315fcbc3)
2024-02-22 15:36:58 +02:00
Avi Kivity
e0e46fbc50 Regenerate frozen toolchain
For gnutls 3.8.3.

Since Fedora 37 is end-of-life, pick the package from Fedora 38. libunistring
needs to be updated to satisfy the dependency solver.

Fixes #17285.

Closes scylladb/scylladb#17287

Signed-off-by: Avi Kivity <avi@scylladb.com>

Closes #17411
2024-02-20 12:34:46 +02:00
Wojciech Mitros
27ab3b1744 rust: update dependencies
The currently used version of "rustix" depency had a minor
security vulnerability. This patch updates the corresponding
crate.
The update was performed using "cargo update" on "rustix"
package and version "0.36.17"
relevant package and the corresponding version.

Refs #15772

Closes #17408
2024-02-19 22:12:50 +02:00
Michał Jadwiszczak
0d22471222 schema::describe: print 'synchronous_updates' only if it was specified
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.

Fixes: #14924

Closes #14928

(cherry picked from commit b92d47362f)
2024-02-19 09:10:34 +02:00
Botond Dénes
422a731e85 query: do not kill unpaged queries when they reach the tombstone-limit
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.

So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.

Fixes: #17241

Closes scylladb/scylladb#17242

(cherry picked from commit f068d1a6fa)
2024-02-15 12:50:30 +02:00
Yaron Kaikov
1fa8327504 release: prepare for 5.2.15 2024-02-11 14:17:31 +02:00
Pavel Emelyanov
f3c215aaa1 Update seastar submodule
* seastar 29badd99...ad0f2d5d (1):
  > Merge "Slowdown IO scheduler based on dispatched/completed ratio" into branch-5.2

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 12:22:58 +03:00
Botond Dénes
94af1df2cf Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El
This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.

The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.

Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.

Fixes #17035

Closes scylladb/scylladb#17073

* github.com:scylladb/scylladb:
  utils: replace assert() by on_internal_error()
  utils: add on_internal_error with common logger
  utils: add a timeuuid minimum, like we had maximum
  test/cql-pytest: tests for "date" type

(cherry picked from commit 2a4b991772)
2024-02-07 14:19:32 +02:00
Botond Dénes
9291eafd4a Merge '[Backport 5.2] Raft snapshot fixes' from Kamil Braun
Backports required to fix scylladb/scylladb#16683 in 5.2:
- when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot.

Closes #17087

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
  raft: server: add workaround for scylladb/scylladb#12972
  raft: Store snapshot update and truncate log atomically
  service: raft: force initial snapshot transfer in new cluster
  raft_sys_table_storage: give initial snapshot a non zero value
2024-02-07 11:55:20 +02:00
Michał Chojnowski
4546d0789f row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes #16759

Closes scylladb/scylladb#17138

(cherry picked from commit ed98102c45)
2024-02-04 14:46:57 +02:00
Kamil Braun
4e257c5c74 test_raft_snapshot_request: fix flakiness (again)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.

To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.

Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.

Closes scylladb/scylladb#17106

(cherry picked from commit c911bf1a33)
2024-02-02 11:31:19 +01:00
Kamil Braun
08021dc906 test_raft_snapshot_request: fix flakiness
Add workaround for scylladb/python-driver#295.

Also an assert made at the end of the test was false, it is fixed with
appropriate comment added.

(cherry picked from commit 74bf60a8ca)
2024-02-02 11:31:19 +01:00
Botond Dénes
db586145aa Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.

Also add a test.

Fixes scylladb/scylladb#16683

Closes scylladb/scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f248)

Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
2024-02-01 15:39:14 +01:00
Botond Dénes
ce0ed29ad6 Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb/scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32f81)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
2024-02-01 15:38:51 +01:00
Kamil Braun
cbe8e05ef6 raft: server: add workaround for scylladb/scylladb#12972
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec77c6 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from interrupted snapshot transfer by
retrying it eventually, and everything works. However an ERROR is
reported due to that one failed snapshot transfer, and dtests dont like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
2024-02-01 14:29:56 +01:00
Michael Huang
84004ab83c raft: Store snapshot update and truncate log atomically
In case the snapshot update fails, we don't truncate commit log.

Fixes scylladb/scylladb#9603

Closes scylladb/scylladb#15540

(cherry picked from commit 1640f83fdc)
2024-02-01 13:10:05 +01:00
Kamil Braun
753e2d3c57 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336

(cherry picked from commit ff386e7a44)

Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because of tombstones in current schema state expire hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over
latest schema state.

Ref: scylladb/scylladb#16683
2024-01-31 17:00:10 +01:00
Gleb Natapov
42cf25bcbb raft_sys_table_storage: give initial snapshot a non zero value
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though since the state of group0 will not be re-created from the
log on restart because the entries will have outdated id and will be
skipped. As a result in memory state machine state will not be restored.
This is not a problem now since schema state it restored outside of raft
code.

Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070722)
2024-01-31 16:50:42 +01:00
Aleksandra Martyniuk
f85375ff99 api: ignore future in task_manager_json::wait_task
Before returning task status, wait_task waits for it to finish with
done() method and calls get() on a resulting future.

If requested task fails, an exception will be thrown and user will
get internal server error instead of failed task status.

Result of done() method is ignored.

Fixes: #14914.
(cherry picked from commit ae67f5d47e)

Closes #16438
2024-01-30 10:54:33 +02:00
Aleksandra Martyniuk
35a0a459db compaction: ignore future explicitly
discard_result ignores only successful futures. Thus, if
perform_compaction<regular_compaction_task_executor> call fails,
a failure is considered abandoned, causing tests to fail.

Explicitly ignore failed future.

Fixes: #14971.

Closes #15000

(cherry picked from commit 7a28cc60ec)

Closes #16441
2024-01-30 10:53:09 +02:00
Kamil Braun
784695e3ac system_keyspace: use system memory for system.raft table
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb/scylladb#15622

Closes scylladb/scylladb#15972

(cherry picked from commit f094e23d84)
2024-01-25 17:59:49 +01:00
Avi Kivity
351d6d6531 Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani
When a base table changes and altered, so does the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns).
However the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.

Fixes https://github.com/scylladb/scylladb/issues/16392

This series also adds a test which fails without this fix and passes when the fix is applied.

Closes scylladb/scylladb#16897

* github.com:scylladb/scylladb:
  Add test for mv prepared statements invalidation on base alter
  query processor: treat view changes at least as table changes

(cherry picked from commit 5810396ba1)
2024-01-23 21:31:47 +02:00
Takuya ASADA
5a05ccc2f8 scylla_raid_setup: faillback to other paths when UUID not avialable
On some environment such as VMware instance, /dev/disk/by-uuid/<UUID> is
not available, scylla_raid_setup will fail while mounting volume.

To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fallback to other paths like by-partuuid,
by-id, by-path or just using real device path like /dev/md0.

To get device path, and also to dumping device status when UUID is not
available, this will introduce UdevInfo class which communicate udev
using pyudev.

Related #11359

Closes scylladb/scylladb#13803

(cherry picked from commit 58d94a54a3)

[syuu: renegerate tools/toolchain/image for new python3-pyudev package]

Closes #16938
2024-01-23 16:05:28 +02:00
Botond Dénes
a1603bcb40 readers/multishard: evictable_reader::fast_forward_to(): close reader on exception
When the reader is currently paused, it is resumed, fast-forwarded, then
paused again. The fast forwarding part can throw and this will lead to
destroying the reader without it being closed first.
Add a try-catch surrounding this part in the code. Also mark
`maybe_pause()` and `do_pause()` as noexcept, to make it clear why
that part doesn't need to be in the try-catch.

Fixes: #16606

Closes scylladb/scylladb#16630

(cherry picked from commit 204d3284fa)
2024-01-16 16:57:28 +02:00
Michał Jadwiszczak
29da20b9e0 schema: add scylla specific options to schema description
Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates`
options to schema description.

Fixes: #12389
Fixes: scylladb/scylla-enterprise#2979

Closes #16786
2024-01-16 09:56:08 +02:00
Botond Dénes
7c4ec8cf4b Update tools/java submodule
* tools/java 843096943e...a1eed2f381 (1):
  > Update JNA dependency to 5.14.0

Fixes: https://github.com/scylladb/scylla-tools-java/issues/371
2024-01-15 15:51:32 +02:00
Aleksandra Martyniuk
5def443cf0 tasks: keep task's children in list
If std::vector is resized its iterators and references may
get invalidated. While task_manager::task::impl::_children's
iterators are avoided throughout the code, references to its
elements are being used.

Since children vector does not need random access to its
elements, change its type to std::list<foreign_task_ptr>, which
iterators and references aren't invalidated on element insertion.

Fixes: #16380.

Closes scylladb/scylladb#16381

(cherry picked from commit 9b9ea1193c)

Closes #16777
2024-01-15 15:38:00 +02:00
Anna Mikhlin
c0604a31fa release: prepare for 5.2.14 2024-01-14 16:34:38 +02:00
Pavel Emelyanov
96bb602c62 Update seastar submodule (token bucket duration underflow)
* seastar 43a1ce58...29badd99 (1):
  > shared_token_bucket: Fix duration_for() underflow

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-12 15:15:56 +03:00
Botond Dénes
d96440e8b6 Merge '[Backport 5.2] Validate compaction strategy options in prepare' from Aleksandra Martyniuk
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.

Check table properties when create and alter statements are prepared.

Fixes: https://github.com/scylladb/scylladb/issues/14710.

Closes #16750

* github.com:scylladb/scylladb:
  cql3: statements: delete execute override
  cql3: statements: call check_restricted_table_properties in prepare
  cql3: statements: pass data_dictionary::database to check_restricted_table_properties
2024-01-12 10:56:54 +02:00
Aleksandra Martyniuk
ea41a811d6 cql3: statements: delete execute override
Delete overriden create_table_statement::execute as it only calls its
direct parent's (schema_altering_statement) execute method anyway.

(cherry picked from commit 6c7eb7096e)
2024-01-11 16:43:17 +01:00
Aleksandra Martyniuk
8b77fbc904 cql3: statements: call check_restricted_table_properties in prepare
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.

Check table properties when create and alter statements are prepared.

The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.

(cherry picked from commit 60fdc44bce)
2024-01-11 16:10:26 +01:00
Aleksandra Martyniuk
3ab3a2cc1b cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Pass data_dictionary::database to check_restricted_table_properties
as an arguemnt instead of query_processor as the method will be called
from a context which does not have access to query processor.

(cherry picked from commit ec98b182c8)
2024-01-11 16:10:26 +01:00
Botond Dénes
7e9107cc97 Update tools/java submodule
* tools/java 79fa02d8a3...843096943e (1):
  > build.xml: update io.airlift to 0.9

Fixes: scylladb/scylla-tools-java#374
2024-01-11 11:03:29 +02:00
Botond Dénes
abb7ae4309 Update ./tools/jmx submodule
* tools/jmx f21550e...50909d6 (1):
  > scylla-apiclient: drop hk2-locator dependency

Fixes: scylladb/scylla-jmx#231
2024-01-10 14:22:14 +02:00
Botond Dénes
2820c63734 Update tools/java submodule
* tools/java d7ec9bf45f...79fa02d8a3 (2):
  > build.xml: update scylla-driver-core to 3.11.5.1
  > treewide: update "guava" package

Fixes: scylla-tools-java#365
Fixes: scylla-tools-java#343

Closes #16693
2024-01-10 08:19:43 +02:00
Nadav Har'El
ac0056f4bc Merge 'Fix partition estimation with TWCS tables during streaming' from Raphael "Raph" Carvalho
TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows.

Turns out we had two problems in this area that leads to suboptimal bloom filters.

1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed.
2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count.

For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong.

Fixes https://github.com/scylladb/scylladb/issues/15704.

Closes scylladb/scylladb#15938

* github.com:scylladb/scylladb:
  streaming: Improve partition estimation with TWCS
  streaming: Don't adjust partition estimate if segregation is postponed

(cherry picked from commit 64d1d5cf62)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #16672
2024-01-08 09:06:43 +02:00
Calle Wilund
aaa25e1a78 Commitlog replayer: Range-check skip call
Fixes #15269

If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.

v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
    ensure we log the fact that things are broken

Closes scylladb/scylladb#15270

(cherry picked from commit 6ffb482bf3)
2024-01-05 09:19:45 +02:00
Beni Peled
c57a0a7a46 release: prepare for 5.2.13 2024-01-03 17:48:59 +02:00
Botond Dénes
740ba3ac2a tools/schema_loader: read_schema_table_mutation(): close the reader
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: #16519

Closes scylladb/scylladb#16521

(cherry picked from commit da033343b7)
2023-12-31 18:13:10 +02:00
Gleb Natapov
76c3dda640 storage_service: register schema version observer before joining group0 and starting gossiper
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.

Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.

Fixes: #15078

Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
(cherry picked from commit d1654ccdda)
2023-12-20 11:14:27 +01:00
Kamil Braun
287546923e Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak
Fixes #9405

`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.

Closes #14534

* github.com:scylladb/scylladb:
  db: hints: add checksum to sync point encoding
  db: hints: add the version_size constant

(cherry picked from commit eb6202ef9c)

The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.

Closes #16458
2023-12-19 17:39:50 +02:00
Botond Dénes
c0dab523f9 Update tools/java submodule
* tools/java e2aad6e3a0...d7ec9bf45f (1):
  > Merge "build: take care of old libthrift" from Piotr Grabowski

Fixes: scylladb/scylla-tools-java#352

Closes #16464
2023-12-19 17:37:27 +02:00
Michael Huang
5499f7b5a8 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428

(cherry picked from commit 62a8a31be7)
2023-12-19 13:43:23 +01:00
Piotr Grabowski
7055ac45d1 test: use more frequent reconnection policy
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.

This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.

Fix the issue by using a exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with reconnection exception stack trace.

Fixes #15104

Closes #15112

(cherry picked from commit 17e3e367ca)
2023-12-19 13:43:23 +01:00
Gleb Natapov
4ff29d1637 raft: drop assert in server_impl::apply_snapshot for a condition that may happen
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.

Drop the assert that verifies the assumption and return an error in this
case instead.

Fixes: #15222

Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
(cherry picked from commit 55f047f33f)
2023-12-19 13:43:23 +01:00
Alexey Novikov
6bcf9e6631 When add duration field to UDT check whether this UDT is used in some clustering key
Having values of the duration type is not allowed for clustering
columns, because duration can't be ordered. This is correctly validated
when creating a table but do not validated when we alter the type.

Fixes #12913

Closes scylladb/scylladb#16022

(cherry picked from commit bd73536b33)
2023-12-19 06:58:41 -05:00
Takuya ASADA
74dd8f08e3 dist: fix local-fs.target dependency
systemd man page says:

systemd-fstab-generator(3) automatically adds dependencies of type Before= to
all mount units that refer to local mount points for this target unit.

So "Before=local-fs.taget" is the correct dependency for local mount
points, but we currently specify "After=local-fs.target", it should be
fixed.

Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target",
since .mount are not related with multi-user but depends local
filesystems.

Fixes #8761

Closes scylladb/scylladb#15647

(cherry picked from commit a23278308f)
2023-12-19 13:15:00 +02:00
Botond Dénes
68507ed4d9 Merge '[Backport 5.2] Shard of shard repair task impl' from Aleksandra Martyniuk
Shard id is logged twice in repair (once explicitly, once added by logger).
Redundant occurrence is deleted.

shard_repair_task_impl::id (which contains global repair shard)
is renamed to avoid further confusion.

Fixes: https://github.com/scylladb/scylladb/issues/12955

Closes #16439

* github.com:scylladb/scylladb:
  repair: rename shard_repair_task_impl::id
  repair: delete redundant shard id from logs
2023-12-19 10:28:57 +02:00
Botond Dénes
46a29e9a02 Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.

The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.

Fixes #6389.

Closes #13150

* github.com:scylladb/scylladb:
  test/alternator: test concurrent TagResource / UntagResource
  db/tags: drop unsafe update_tags() utility function
  alternator: isolate concurrent modification to tags
  db/tags: add safe modify_tags() utility functions
  migration_manager: expose access to storage_proxy

(cherry picked from commit dba1d36aa6)

Closes #16453
2023-12-19 10:19:31 +02:00
Botond Dénes
23fd6939eb Merge '[Backport to 5.2] gossiper: mark_alive: use deferred_action to unmark pending' from Benny Halevy
Backport the following patches to 5.2:
- gossiper: mark_alive: enter background_msg gate (#14791)
- gossiper: mark_alive: use deferred_action to unmark pending (#14839)

Closes #16452

* github.com:scylladb/scylladb:
  gossiper: mark_alive: use deferred_action to unmark pending
  gossiper: mark_alive: enter background_msg gate
2023-12-19 09:06:37 +02:00
Botond Dénes
1cf499cfea Update tools/java submodule
* tools/java 80701efa8d...e2aad6e3a0 (2):
  > build: update logback dependency
  > build: update `netty` dependency

Fixes: https://github.com/scylladb/scylla-tools-java/issues/363
Fixes: https://github.com/scylladb/scylla-tools-java/issues/364

Closes #16444
2023-12-18 18:19:20 +02:00
Nadav Har'El
91e05dc646 cql: fix SELECT toJson() or SELECT JSON of time column
The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column
of type "time" forgot to put the time string in quotes. The result was
invalid JSON. This is patch is a one-liner fixing this bug.

This patch also removes the "xfail" marker from one xfailing test
for this issue which now starts to pass. We also add a second test for
this issue - the existing test was for "SELECT TOJSON(t)", and the second
test shows that "SELECT JSON t" had exactly the same bug - and both are
fixed by the same patch.

We also had a test translated from Cassandra which exposed this bug,
but that test continues to fail because of other bugs, so we just
need to update the xfail string.

The patch also fixes one C++ test, test/boost/json_cql_query_test.cc,
which enshrined the *wrong* behavior - JSON output that isn't even
valid JSON - and had to be fixed. Unlike the Python tests, the C++ test
can't be run against Cassandra, and doesn't even run a JSON parser
on the output, which explains how it came to enshrine wrong output
instead of helping to discover the bug.

Fixes #7988

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16121

(cherry picked from commit 8d040325ab)
2023-12-18 18:19:20 +02:00
Benny Halevy
a2009c4a8c gossiper: mark_alive: use deferred_action to unmark pending
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.

Fixes #14839

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14840

(cherry picked from commit 1e7e2eeaee)
2023-12-18 14:44:22 +02:00
Benny Halevy
999a6bfaae gossiper: mark_alive: enter background_msg gate
The function dispatch a background operation that must be
waited on in stop().

\Fixes scylladb/scylladb#14791

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 868e436901)
2023-12-18 14:42:52 +02:00
Kefu Chai
faef786c88 reloc: strip.sh: always generate symbol list with posix format
we compare the symbols lists of stripped ELF file ($orig.stripped) and
that of the one including debugging symbols ($orig.debug) to get a
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).

but we generate the symbol list of stripped ELF file using the
sysv format, while generate the one from the unstripped one using
posix format. the former is always padded the symbol names with spaces
so that their the length at least the same as the section name after
we split the fields with "|".

that's why the diff includes the stuff we don't expect. and hence,
we have tons of warnings like:

```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```

when using objcopy to filter the ELF file to keep only the
symbols we are interested in.

so, in this change

* use the same format when dumping the symbols from unstripped ELF
  file
* include the symbols in the text area -- the code, by checking
  "T" and "t" in the dumped symbols. this was achieved by matching
  the lines with "FUNC" before this change.
* include the the symbols in .init data section -- the global
  variables which are initialized at compile time. they could
  be also interesting when debugging an application.

Fixes #15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15514

(cherry picked from commit 50c937439b)
2023-12-18 13:58:14 +02:00
Michał Chojnowski
7e9bdef8bb row_cache: when the constructor fails, clear _partitions in the right allocator
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.

Fix that.

Fixes #15632

Closes scylladb/scylladb#15633

(cherry picked from commit 330d221deb)
2023-12-18 13:55:16 +02:00
Michael Huang
af38b255c8 cql3: Fix invalid JSON parsing for JSON objects with ASCII keys
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.

Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.

Fixes: #7949

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Closes scylladb/scylladb#15499

(cherry picked from commit 75109e9519)
2023-12-18 13:45:57 +02:00
Kefu Chai
c4b699525a sstables: throw at seeing invalid chunk_len
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is throw in place
of an `assert()`, which is supposed to verify the programming
invariants, not for identifying corrupted data file.

Fixes #15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15264

(cherry picked from commit 1ed894170c)
2023-12-18 13:29:02 +02:00
Nadav Har'El
3a24b8c435 sstables: stop warning when auto-snapshot leaves non-empty directory
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:

```
WARN  2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```

It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.

And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.

To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.

Fixes #13538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14557

(cherry picked from commit edfb89ef65)
2023-12-18 13:26:40 +02:00
Kefu Chai
9e9a488da3 streaming: cast the progress to a float before formatting it
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.

so, let's make the percentage a float before formatting it.

Fixes #14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14588

(cherry picked from commit 1eb76d93b7)
2023-12-18 13:16:58 +02:00
Aleksandra Martyniuk
614d15b9f6 repair: rename shard_repair_task_impl::id
shard_repair_task_impl::id stores global repair id. To avoid confusion
with the task id, the field is renamed to global_repair_id.

(cherry picked from commit d889a599e8)
2023-12-18 12:08:00 +01:00
Aleksandra Martyniuk
fc2799096f repair: delete redundant shard id from logs
In repair shard id is logged twice. Delete repeated occurence.

(cherry picked from commit f7c88edec5)
2023-12-18 12:03:26 +01:00
Petr Gusev
b9178bd853 hints: send_one_hint: extend the scope of file_send_gate holder
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.

Fixes #15110

(cherry picked from commit 9fd3df13a2)
2023-12-18 13:03:23 +02:00
Kefu Chai
12aacea997 compound_compat: do not format an sstring with {:d}
before this change, we format a sstring with "{:d}", fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.

so, in this change, we just print the int8_t using `seastar::format()`
in a single pass. and with the format specifier of `#02x` instead of
adding the "0x" prefix manually.

Fixes #14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14578

(cherry picked from commit 27d6ff36df)
2023-12-18 12:47:16 +02:00
Kefu Chai
df30f66bfa tools/scylla-sstable: dump column_desc as an object
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:

```
        "regular_columns": [
          "name",
          "clustering_order",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "column_name_bytes",
          "type_name",
          "org.apache.cassandra.db.marshal.BytesType",
          "name",
          "kind",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "position",
          "type_name",
          "org.apache.cassandra.db.marshal.Int32Type",
          "name",
          "type",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type"
        ]
```

but according
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,

> $SERIALIZATION_HEADER_METADATA := {
>     "min_timestamp_base": Uint64,
>     "min_local_deletion_time_base": Uint64,
>     "min_ttl_base": Uint64",
>     "pk_type_name": String,
>     "clustering_key_types_names": [String, ...],
>     "static_columns": [$COLUMN_DESC, ...],
>     "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
>     "name": String,
>     "type_name": String
> }

"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columnes". this schema makes sense,
as each column should be considered as a single object which
is composed of two properties. but we dump them like a list.

so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.

after the change, "regular_columns" are printed like:
```
        "regular_columns": [
          {
            "name": "clustering_order",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "column_name_bytes",
            "type_name": "org.apache.cassandra.db.marshal.BytesType"
          },
          {
            "name": "kind",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "position",
            "type_name": "org.apache.cassandra.db.marshal.Int32Type"
          },
          {
            "name": "type",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          }
        ]
```

Fixes #15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15037

(cherry picked from commit c82f1d2f57)
2023-12-18 12:26:36 +02:00
Michał Sala
2427bda737 forward_service: introduce shutdown checks
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.

The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.

This fix should resolve the issue where forward_service retries could
block the shutdown.

Fixes #12604

Closes #13922

(cherry picked from commit e0855b1de2)
2023-12-18 12:25:25 +02:00
Petr Gusev
27adf340ef storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrary large.

Fixes: #12504
(cherry picked from commit 5adbb6cde2)
2023-12-18 12:20:40 +02:00
Botond Dénes
5c33c9d6a6 Merge 'thrift: return address in listen_addresses() only after server is ready' from Marcin Maliszkiewicz
This is used for readiness API: /storage_service/rpc_server and the fix prevents from returning 'true' prematurely.

Some improvement for readiness was added in a51529dd15 but thrift implementation wasn't fully done.

Fixes https://github.com/scylladb/scylladb/issues/12376

Closes #13319

* github.com:scylladb/scylladb:
  thrift: return address in listen_addresses() only after server is ready
  thrift: simplify do_start_server() with seastar:async

(cherry picked from commit 9a024f72c4)
2023-12-18 12:20:40 +02:00
Kamil Braun
9aaaa66981 Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.

Fixes #13657

Closes #13658

* github.com:scylladb/scylladb:
  cql3: fix printing of column_specification::name in some error messages
  cql3: fix printing of column_definition::name in some error messages

(cherry picked from commit a29b8cd02b)
2023-12-18 09:55:37 +02:00
Avi Kivity
b21ec82894 Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.

get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.

\Fixes #13899

\Closes #13900

* github.com:scylladb/scylladb:
  storage_service: do not preempt while traversing endpoint_state_map
  gossiper: do not preempt while traversing endpoint_state_map

(cherry picked from commit d2d53fc1db)

Closes #16431
2023-12-18 09:35:42 +02:00
Yaron Kaikov
5052890ae8 release: prepare for 5.2.12 2023-12-17 14:28:03 +02:00
Kefu Chai
0da3453f95 db: schema_tables: capture reference to temporary value by value
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.

this also silences warning from GCC-13:

```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                              ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fixes #13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13721

(cherry picked from commit 135b4fd434)
2023-12-15 13:55:57 +02:00
Benny Halevy
6d7b2bc02f sstables: compressed_file_data_source_impl: get: throw malformed_sstable_exception on premature eof
Currently, the reader might dereference a null pointer
if the input stream reaches eof prematurely,
and read_exactly returns an empty temporary_buffer.

Detect this condition before dereferencing the buffer
and sstables::malformed_sstable_exception.

Fixes #13599

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13600

(cherry picked from commit 77b70dbdb7)
2023-12-15 13:54:42 +02:00
Wojciech Mitros
119c8279dd rust: update wasmtime dependency
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.

We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.

Fixes #13807

Closes #13804

(cherry picked from commit 6bc16047ba)
2023-12-15 13:54:42 +02:00
Michał Chojnowski
3af6dfe4ac database: fix reads_memory_consumption for system semaphore
The metric shows the opposite of what its name suggests.
It shows available memory rather than consumed memory.
Fix that.

Fixes #13810

Closes #13811

(cherry picked from commit 0813fa1da0)
2023-12-15 13:54:42 +02:00
Eliran Sinvani
0230798db3 use_statement: Covert an exception to a future exception
The use statement execution code can throw if the keyspace is
doesn't exist, this can be a problem for code that will use
execute in a fiber since the exception will break the fiber even
if `then_wrapped` is used.

Fixes #14449

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes scylladb/scylladb#14394

(cherry picked from commit c5956957f3)
2023-12-15 13:54:42 +02:00
Botond Dénes
64503a7137 Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.

This series fixes the bug and adds a minimal reproducer test.

Fixes #10598

Closes scylladb/scylladb#16003

* github.com:scylladb/scylladb:
  mutation_query_test: test that range tombstones are sent in reverse queries
  mutation_query: properly send range tombstones in reverse queries

(cherry picked from commit 65e42e4166)
2023-12-14 12:53:07 +02:00
Yaron Kaikov
b013877629 build_docker.sh: Upgrade package during creation and remove sshd service
When scanning our latest docker image using `trivy` (command: `trivy
image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS
packages which are out of date.

Also removing `openssh-server` and `openssh-client` since we don't use
it for our docker images

Fixes: https://github.com/scylladb/scylladb/issues/16222

Closes scylladb/scylladb#16224

(cherry picked from commit 7ce6962141)

Closes #16360
2023-12-11 10:57:16 +02:00
Botond Dénes
33d2da94ab reader_concurrency_semaphore: execution_loop(): trigger admission check when _ready_list is empty
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.

Fixes: #13540

Closes #13541

(cherry picked from commit b790f14456)
2023-12-07 16:04:55 +02:00
Paweł Zakrzewski
dac69be4a4 auth: fix error message when consistency level is not met
Propagate `exceptions::unavailable_exception` error message to the
client such as cqlsh.

Fixes #2339

(cherry picked from commit 400aa2e932)
2023-12-07 14:49:47 +02:00
Botond Dénes
763e583cf2 Merge 'row_cache: abort on exteral_updater::execute errors' from Benny Halevy
Currently the cache updaters aren't exception safe
yet they are intended to be.

Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.

Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.

\Fixes scylladb/scylladb#15576

\Closes scylladb/scylladb#15577

* github.com:scylladb/scylladb:
  row_cache: abort on exteral_updater::execute errors
  row_cache: do_update: simplify _prev_snapshot_pos setup

(cherry picked from commit 4a0f16474f)

Closes scylladb/scylladb#16256
2023-12-07 09:16:42 +02:00
Nadav Har'El
b331b4a4bb Backport fixes for nodetool commands with Alternator GSI in the database
Fixes #16153

* java e716e1bd1d...80701efa8d (1):
  > NodeProbe: allow addressing table name with colon in it

/home/nyh/scylla/tools$ git submodule summary jmx | cat
* jmx bc4f8ea...f21550e (3):
  > ColumnFamilyStore: only quote table names if necessary
  > APIBuilder: allow quoted scope names
  > ColumnFamilyStore: don't fail if there is a table with ":" in its name

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #16296
2023-12-06 10:48:49 +02:00
Anna Stuchlik
d9448a298f doc: fix rollback in the 4.6-to-5.0 upgrade guide
This commit fixes the rollback procedure in
the 4.6-to-5.0 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4, branch-5.2, and branch-5.1

Closes scylladb/scylladb#16155

(cherry picked from commit 1e80bdb440)
2023-12-05 15:10:21 +02:00
Anna Stuchlik
a82fd96b6a doc: fix rollback in the 5.0-to-5.1 upgrade guide
This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4, branch-5.2, and branch-5.1

Closes scylladb/scylladb#16154

(cherry picked from commit 7ad0b92559)
2023-12-05 15:08:25 +02:00
Anna Stuchlik
ae79fb9ce0 doc: fix rollback in the 5.1-to-5.2 upgrade guide
This commit fixes the rollback procedure in
the 5.1-to-5.2 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4 and branch-5.2.

Closes scylladb/scylladb#16152

(cherry picked from commit 91cddb606f)
2023-12-05 14:58:21 +02:00
Pavel Emelyanov
d83f4b9240 Update seastar submodule
* seastar eda297fc...43a1ce58 (2):
  > io_queue: Add iogroup label to metrics
  > io_queue: Remove ioshard metrics label

refs: scylladb/seastar#1591

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-12-05 10:46:07 +03:00
Raphael S. Carvalho
1b8c078cab test: Fix sporadic failures of database_test
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.

The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.

That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth is timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.

Fixes #13660.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13851

(cherry picked from commit a7ceb987f5)
2023-11-30 17:31:07 +02:00
Benny Halevy
1592a84b80 task_manager: module: make_task: enter gate when the task is created
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.

Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.

Fixes scylladb/scylladb#15211

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f9a7635390)
2023-11-30 17:16:57 +02:00
Michał Chojnowski
bfeadae1bd position_in_partition: make operator= exception-safe
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.

The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.

The easy fix is to perform the only potentially-throwing step first.

Fixes #15822

Closes scylladb/scylladb#15864

(cherry picked from commit 93ea3d41d8)
2023-11-30 15:01:22 +02:00
Avi Kivity
2c219a65f8 Update seastar submodule (spins on epoll)
* seastar 45f4102428...eda297fcb5 (1):
  > epoll: Avoid spinning on aborted connections

Fixes #12774
Fixes #7753
Fixes #13337
2023-11-30 14:09:22 +02:00
Piotr Grabowski
7054b1ab1e install-dependencies.sh: update node_exporter to 1.7.0
Update node_exporter to 1.7.0.

The previous version (1.6.1) was flagged by security scanners (such as
Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that
problem.

[Botond: regenerate frozen toolchain]

Fixes #16085

Closes scylladb/scylladb#16086

Closes scylladb/scylladb#16090

(cherry picked from commit 321459ec51)

[avi: regenerate frozen toolchain]
2023-11-27 18:17:38 +00:00
Anna Mikhlin
a65838ee9c re-spin: 5.2.11 2023-11-26 16:17:58 +02:00
Botond Dénes
68faf18ad9 Update ./tools/jmx and ./tools/java submodules
* tools/jmx 88d9bdc...bc4f8ea (1):
  > Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski

* tools/java f8f556d802...e716e1bd1d (1):
  > Merge 'build: update several dependencies' from Piotr Grabowski

Update build dependencies which were flagged by security scanners.

Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351

Closes #16150
2023-11-23 15:29:00 +02:00
Beni Peled
44d1b55253 release: prepare for 5.2.11 2023-11-22 14:22:13 +02:00
Tomasz Grabiec
bfd8401477 api, storage_service: Recalculate table digests on relocal_schema api call
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.

Use like this:

  curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema

Fixes #15380

Closes #15381

(cherry picked from commit c27d212f4b)
2023-11-21 01:29:28 +01:00
Botond Dénes
e31f2224f5 migration_manager: also reload schema on enabling digest_insensitive_to_expiry
Currently, when said feature is enabled, we recalcuate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalulation. However this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidently, by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.

Fixes: #16004

Closes scylladb/scylladb#16013

(cherry picked from commit 22381441b0)
2023-11-21 01:29:28 +01:00
Kamil Braun
4101c8beab schema_tables: remove default value for reload in merge_schema
To avoid bugs like the one fixed in the previous commit.

(cherry picked from commit 4376854473)
2023-11-21 01:29:28 +01:00
Kamil Braun
c994ed2057 schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.

(cherry picked from commit 48164e1d09)
2023-11-21 01:29:28 +01:00
Avi Kivity
40eed1f1c5 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables

(cherry picked from commit cf81eef370)
2023-11-21 01:29:28 +01:00
Gleb Natapov
f233c8a9e4 database: fix do_apply_many() to handle empty array of mutations
Currently the code will assert because cl pointer will be null and it
will be null because there is no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>

(cherry picked from commit 941407b905)

Backport needed by #4485.
2023-11-21 01:29:17 +01:00
Botond Dénes
0f3e31975d api/storage_service: start/stop native transport in the statement sg
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.

Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.

I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.

Fixes: #15485

Closes scylladb/scylladb#16019

(cherry picked from commit dfd7981fa7)
2023-11-20 20:00:56 +02:00
Takuya ASADA
c98b22afce scylla_post_install.sh: detect RHEL correctly
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".

Fixes #16040

Closes scylladb/scylladb#16041

(cherry picked from commit 338a9492c9)
2023-11-20 19:36:22 +02:00
Marcin Maliszkiewicz
900754d377 db: view: run local materialized view mutations on a separate smp service group
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.

This fix affects also alternator's secondary indexes.

Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000

Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.

Fixes https://github.com/scylladb/scylladb/issues/15844

Closes scylladb/scylladb#15845

(cherry picked from commit 020a9c931b)
2023-11-19 18:54:46 +02:00
Botond Dénes
fbb356aa88 repair/repair.cc: do_repair_ranges(): prevent stalls when skipping ranges
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.

Fixes: #14330

Closes scylladb/scylladb#15879

(cherry picked from commit 90a8489809)
2023-11-08 21:10:30 +02:00
Michał Jadwiszczak
e8871c02a1 cql3:statements:describe_statement: check pointer to UDF/UDA
While looking for specific UDF/UDA, result of
`functions::functions::find()` needs to be filtered out based on
function's type.

Fixes: #14360
(cherry picked from commit d498451cdf)
2023-11-08 20:16:41 +02:00
Pavel Emelyanov
f76ba217e7 Merge 'api: failure_detector: invoke on shard 0' from Kamil Braun
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be easily and
without risk be backported to older versions; this is the fix.

Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.

Fixes: scylladb/scylladb#15816

Closes scylladb/scylladb#15970

* github.com:scylladb/scylladb:
  test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
  api: failure_detector: fix indentation
  api: failure_detector: invoke on shard 0

(cherry picked from commit 9443253f3d)
2023-11-07 15:12:12 +01:00
Botond Dénes
17e4d535db test/cql-pytest/nodetool.py: no_autocompaction_context: use the correct API
This `with` context is supposed to disable, then re-enable
autocompaction for the given keyspaces, but it used the wrong API for
it, it used the column_family/autocompaction API, which operates on
column families, not keyspaces. This oversight led to a silent failure
because the code didn't check the result of the request.
Both are fixed in this patch:
* switch to use `storage_service/auto_compaction/{keyspace}` endpoint
* check the result of the API calls and report errors as exceptions

Fixes: #13553

Closes #13568

(cherry picked from commit 66ee73641e)
2023-11-07 13:59:01 +02:00
Aleksandra Martyniuk
75b792e260 repair: release resources of shard_repair_task_impl
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.

In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.

Release shard_repair_task_impl resources immediately after shard
repair is finished.

Fixes: #15505.
(cherry picked from commit 0474e150a9)

Closes #15875
2023-11-07 09:40:05 +02:00
Tomasz Grabiec
573ef87245 Merge ' tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes
scylla-sstable currently has two ways to obtain the schema:

    * via a `schema.cql` file.
    * load schema definition from memory (only works for system tables).

This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.

This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.

Example:

```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```

As seen above, subdirectories like qurantine, staging etc are also supported.

Fixes: https://github.com/scylladb/scylladb/issues/10126

Closes #13448

* github.com:scylladb/scylladb:
  test/cql-pytest: test_tools.py: add tests for schema loading
  test/cql-pytest: add no_autocompaction_context
  docs: scylla-sstable.rst: remove accidentally added copy-pasta
  docs: scylla-sstable.rst: remove paragraph with schema limitations
  docs: scylla-sstable.rst: update schema section
  test/cql-pytest: nodetool.py: add flush_keyspace()
  tools/scylla-sstable: reform schema loading mechanism
  tools/schema_loader: add load_schema_from_schema_tables()
  db/schema_tables: expose types schema

(cherry picked from commit 952b455310)

Closes #15386
2023-11-02 17:25:18 +02:00
Beni Peled
454e5a7110 release: prepare for 5.2.10 2023-11-02 15:08:11 +00:00
Avi Kivity
9967c0bda4 Update tools/pythion3 submodule (tar file timestamps)
* tools/python3 cf7030a...6ad2e5a (1):
  > create-relocatable-package.py: fix timestamp of executable files

Fixes #13415.
2023-11-02 12:37:09 +01:00
Botond Dénes
48509c5c00 Merge '[Backport 5.2] properly update storage service after schema changes' from Benny Halevy
This is a backport of https://github.com/scylladb/scylladb/pull/14158 to branch 5.2

Closes #15872

* github.com:scylladb/scylladb:
  migration_notifier: get schema_ptr by value
  migration_manager: propagate listener notification exceptions
  storage_service: keyspace_changed: execute only on shard 0
  database: modify_keyspace_on_all_shards: execute func first on shard 0
  database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
  database: add modify_keyspace_on_all_shards
  schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
  database: create_keyspace_on_all_shards
  database: update_keyspace_on_all_shards
  database: drop_keyspace_on_all_shards
2023-10-31 10:27:08 +02:00
Botond Dénes
d606e9bfa2 Merge '[branch-5.2] Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.

Fixes https://github.com/scylladb/scylladb/issues/14992.

Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**

Closes #15793

* github.com:scylladb/scylladb:
  test: Verify that off-strategy can do incremental compaction
  compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
  compaction: Clear pending_replacement list when tombstone GC is disabled
  compaction: Enable incremental compaction on off-strategy
  compaction: Extend reshape type to allow for incremental compaction
  compaction: Move reshape_compaction in the source
  compaction: Enable incremental compaction only if replacer callback is engaged
  replica: Make compaction_group responsible for deleting off-strategy compaction input
2023-10-30 12:00:54 +02:00
Benny Halevy
cd7abb3833 migration_notifier: get schema_ptr by value
To prevent use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097
where a temp schema_ptr retrieved from a global_schema_ptr
get destroyed when the notification function yielded.

Capturing the schema_ptr on the coroutine frame
is inexpensive since its a shared ptr and it makes sure
that the schema remains valid throughput the coroutine
life time.

\Fixes scylladb/scylladb#15097

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

\Closes #15098

(cherry picked from commit 0f54e24519)
2023-10-29 19:39:17 +02:00
Benny Halevy
8064fface9 migration_manager: propagate listener notification exceptions
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards.  Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.

We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```

The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).

Refs #3389

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 825d617a53)
2023-10-29 19:32:55 +02:00
Benny Halevy
0cf6891c6d storage_service: keyspace_changed: execute only on shard 0
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.

Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a690f0e81f)
2023-10-29 19:27:52 +02:00
Benny Halevy
16a594d564 database: modify_keyspace_on_all_shards: execute func first on shard 0
When creating or altering a keyspace, we create a new
effective_replication_map instance.

It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate to new e_r_m (and reach
the same result).  When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 13dd92e618)
2023-10-29 19:22:01 +02:00
Benny Halevy
096c312821 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.

\Fixes #13137

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ba15786059)
2023-10-29 19:21:34 +02:00
Benny Halevy
5c27dacad5 database: add modify_keyspace_on_all_shards
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3b8c913e61)
2023-10-29 19:16:56 +02:00
Benny Halevy
14113dc23e schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
2023-10-29 19:14:06 +02:00
Benny Halevy
4d5a99f3b8 database: create_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
2023-10-29 19:13:55 +02:00
Benny Halevy
ffe28b3e3f database: update_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
2023-10-29 19:06:45 +02:00
Benny Halevy
1459306603 database: drop_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
2023-10-29 19:06:25 +02:00
Kefu Chai
30a4eb0ea7 sstables: writer: delegate flush() in checksummed_file_data_sink_impl
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.

this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystem
implements the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data lost if the application / machine /
network breaks when the data is considerered persisted but they are
_not_!

but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.

the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.

in this change, we just delegate the `flush()` call to the
wrapped class.

Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15134

(cherry picked from commit d2d1141188)
2023-10-26 16:48:17 +03:00
Avi Kivity
ea198d884d cql3: grammar: reject intValue with no contents
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).

Easily fixed by removing the empty alternative. A unit test is
added.

Fixes #14705.

Closes #14707

(cherry picked from commit e00811caac)
2023-10-25 19:15:28 +03:00
Raphael S. Carvalho
b8c8794e14 test: Verify that off-strategy can do incremental compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Benny Halevy
0c2bb5f0b3 compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
Prevent div-by-zero byt returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Raphael S. Carvalho
61316d8e88 compaction: Clear pending_replacement list when tombstone GC is disabled
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.

Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It's not a problem until now
where we want reshape to do incremental compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Raphael S. Carvalho
b8e2739596 compaction: Enable incremental compaction on off-strategy
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.

Fixes #14992.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:29 -03:00
Raphael S. Carvalho
ba87dfefd1 compaction: Extend reshape type to allow for incremental compaction
That's done by inheriting regular_compaction, which implement
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
6b8499f4d8 compaction: Move reshape_compaction in the source
That's in preparation to next change that will make reshape
inherit from regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
8c8a80a03d compaction: Enable incremental compaction only if replacer callback is engaged
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
fdec5e62d0 replica: Make compaction_group responsible for deleting off-strategy compaction input
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.

Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
				sstables::offstrategy offstrategy)

So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.

Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13432

(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:06 -03:00
Raphael S. Carvalho
6798f9676f Resurrect optimization to avoid bloom filter checks during compaction
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.

Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.

This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.

The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.

Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
    45.64%    45.64%  sstable_compact  sstable_compaction_test_g
        [.] utils::filter::bloom_filter::is_present

With its resurrection, the problem is gone.

This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.

Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])

[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.

After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).

With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.

Cherry picked from commit 38b226f997

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes #14091.

Closes #13908

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15744
2023-10-20 09:34:53 +03:00
Botond Dénes
2642f32c38 Merge '[5.2 backport] doc: remove recommended image upgrade with OS from previous releases' from Anna Stuchlik
This is a backport of PR  https://github.com/scylladb/scylladb/pull/15740.

This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).

The scope of this commit:

- Remove the information from the 5.0-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to.-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides.
    /upgrade/_common/upgrade-image-opensource.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst

Closes #15768

* github.com:scylladb/scylladb:
  doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
  doc: remove wrong image upgrade info (4.6-to-5.0)
  doc: remove wrong image upgrade info (5.0-to-5.1)
2023-10-19 12:30:55 +03:00
Anna Stuchlik
fcbcf1eafd doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides.

/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst

(cherry picked from commit dd1207cabb)
2023-10-19 08:47:25 +02:00
Anna Stuchlik
3a14fd31d0 doc: remove wrong image upgrade info (4.6-to-5.0)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

(cherry picked from commit 526d543b95)
2023-10-19 08:41:24 +02:00
Anna Stuchlik
c7b6152a81 doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

(cherry picked from commit 9852130c5b)
2023-10-19 08:40:27 +02:00
Asias He
ac45d8d092 repair: Use the updated estimated_partitions to create writer
The estimated_partitions is estimated after the repair_meta is created.

Currently, the default estimated_partitions was used to create the
write which is not correct.

To fix, use the updated estimated_partitions.

Reported by Petr Gusev

Closes #14179
Fixes #15748

(cherry picked from commit 4592bbe182)
2023-10-18 13:58:28 +03:00
Anna Stuchlik
d319c2a83f doc: remove recommended image upgrade with OS
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).

The scope of this commit:
- Remove the information from the 5.1-to.-5.2
  upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
  page.
- Remove outdated info (about previous releases)
  from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
  in the page tree.

Refs: https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit f6767f6d6e)

Closes #15754
2023-10-18 13:57:08 +03:00
Nadav Har'El
cb7e7f15ac Cherry-pick Seastar patch
Backported Seastar commit 4f4e84bb2cec5f11b4742396da7fc40dbb3f162f:

  > sstring: refactor to_sstring() using fmt::format_to()

Refs https://github.com/scylladb/scylladb/issues/15127

Closes #15663
2023-10-09 10:02:21 +03:00
Raphael S. Carvalho
00d431bd20 reader_concurrency_semaphore: Fix stop() in face of evictable reads becoming inactive
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, inactive read registration path.

1) service level drop invoke stop of reader concurrency semaphore, which will
wait for in flight requests

2) turns out it stops first the gate used for closing readers that will
become inactive.

3) proceeds to wait for in-flight reads by closing the reader permit gate.

4) one of evictable reads take the inactive read registration path, and
finds the gate for closing readers closed.

5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.

By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.

Fixes #15534.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15535

(cherry picked from commit 914cbc11cf)
2023-09-29 09:24:37 +03:00
Botond Dénes
ca8723a6fd Merge 'gossiper: add get_unreachable_members_synchronized and use over api' from Benny Halevy
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.

Fixes #12261
Fixes #15088

Also, add rest_api unit test for those apis

Closes #15093

* github.com:scylladb/scylladb:
  test: rest_api: add test_gossiper
  gossiper: add get_unreachable_members_synchronized

(cherry picked from commit 57deeb5d39)

Backport note: `gossiper::lock_endpoint_update_semaphore` helper
function was missing, replaced with
`get_units(g._endpoint_update_semaphore, 1)`
2023-09-27 15:09:32 +02:00
Beni Peled
5709d00439 release: prepare for 5.2.9 2023-09-20 12:34:43 +03:00
Konstantin Osipov
7202634789 raft: do not update raft address map with obsolete gossip data
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.

A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.

Fixes #14257
Refs #14357

Closes #14329

(cherry picked from commit b9c2b326bc)

Backport note: replaced `gms::generation_type` with `int64_t` because
the branch is missing the refactor which introduced `generation_type`
(7f04d8231d)
2023-09-19 11:12:38 +02:00
Avi Kivity
34e0afb18a Merge "auth: do not grant permissions to creator without actually creating" from Wojciech Mitros
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:

The query has a IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.

Additionally, a test is added that reproduces both of the mentioned
cases.

CVE-2023-33972

Fixes #15467.

* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
  auth: do not grant permissions to creator without actually creating
  transport: add is_schema_change() method to result_message

(cherry picked from commit ab6988c52f)
2023-09-19 01:47:27 +03:00
Anna Stuchlik
99e906499d doc: fix internal links
Fixes https://github.com/scylladb/scylladb/issues/14490

This commit fixes mulitple links that were broken
after the documentation is published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.

Closes #14664

(cherry picked from commit a93fd2b162)
2023-09-18 09:32:12 +03:00
Anna Stuchlik
b8ff392e8b doc: add info - support for FIPS-compliant systems
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.

This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.

Closes #15415

(cherry picked from commit fb635dccaa)
2023-09-18 09:17:59 +03:00
Raphael S. Carvalho
a65e5120ab compaction: base compaction throughput on amount of data read
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.

A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.

Fixes #14533.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14615

(cherry picked from commit 3b1829f0d8)
2023-09-14 21:30:22 +03:00
Jan Ciolek
cd9458eeb1 cql.g: make the parser reject INSERT JSON without a JSON value
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```

When no JSON value is specified, the query should be rejected.

Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.

A unit test is added to prevent regressions.

Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

\Closes #14785

(cherry picked from commit cbc97b41d4)
2023-09-14 21:07:21 +03:00
Nadav Har'El
e917b874f9 test/alternator: fix flaky test test_ttl_expiration_gsi_lsi
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!

So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.

This patch also fixes a small bug which prevented this test from running
on DynamoDB.

Fixes #14495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14496

(cherry picked from commit 599636b307)
2023-09-14 20:43:51 +03:00
Pavel Emelyanov
a27c391cba Update seastar submodule
* seastar 85147cfd...872e0bc6 (3):
  > rpc: Abort server connection streams on stop
  > rpc: Do not register stream to dying parent
  > rpc: Fix client-side stream registration race

refs: #13100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 12:33:30 +03:00
Beni Peled
455ab99b6c release: prepare for 5.2.8 2023-08-31 22:02:06 +03:00
Michał Chojnowski
adcf296bcf reader_concurrency_semaphore: fix a deadlock between stop() and execution_loop()
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.

Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.

A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.

After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.

Fixes #15198

Closes #15199

(cherry picked from commit 2000a09859)
2023-08-31 08:13:09 +03:00
Calle Wilund
198297a08a generic_server: Handle TLS error codes indicating broken pipe
Fixes  #14625

In broken pipe detection, handle also TLS error codes.

Requires https://github.com/scylladb/seastar/pull/1729

Closes #14626

(cherry picked from commit 890f1f4ad3)
2023-08-29 15:38:21 +03:00
Botond Dénes
9a9b5b691d Update seastar submodule
* seastar 534cb38c...85147cfd (1):
  > tls: Export error_category instance used by tls + some common error codes

Refs: #14625
2023-08-29 15:37:24 +03:00
Alejo Sanchez
610b682cf4 gms, service: replicate live endpoints on shard 0
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.

Move lock to the callers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13240

(cherry picked from commit da00052ad8)
2023-08-29 12:28:00 +02:00
Kamil Braun
05f4640360 Merge 'api: gossiper: get alive nodes after reaching current shard 0 version' from Alecco
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.

Closes #12540

* github.com:scylladb/scylladb:
  api: gossiper: fix alive nodes
  gms, service: lock live endpoint copy
  gms, service: live endpoint copy method

(cherry picked from commit b919373cce)
2023-08-29 12:27:52 +02:00
Kefu Chai
8ed58c7dca sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)

Closes #15150
2023-08-25 10:13:19 +03:00
Petr Gusev
a83c0a8bbc test_secondary_index_collections: change insert/create index order
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.

The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.

The patch just moves the index creation statements
from the second part to the first.

Fixes: #14076

Closes #14090

(cherry picked from commit 0415ac3d5f)

Closes #15101
2023-08-24 14:09:08 +03:00
Botond Dénes
df71753498 Merge '[Backport 5.2] distributed_loader: process_sstable_dir: do not verify snapshots' from Benny Halevy
This mini-series backports the fix for #12010 along with low-risk patches it depends on.

Fixes: #12010

Closes #15137

* github.com:scylladb/scylladb:
  distributed_loader: process_sstable_dir: do not verify snapshots
  utils/directories: verify_owner_and_mode: add recursive flag
  utils: Restore indentation after previous patch
  utils: Coroutinize verify_owner_and_mode()
2023-08-23 15:50:29 +03:00
Benny Halevy
6588ecd66f distributed_loader: process_sstable_dir: do not verify snapshots
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
2023-08-23 13:19:55 +03:00
Benny Halevy
03640cc15b utils/directories: verify_owner_and_mode: add recursive flag
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
6d4d576460 utils: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2eb88945ea)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
96aca473b4 utils: Coroutinize verify_owner_and_mode()
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
2023-08-23 13:19:30 +03:00
Aleksandra Martyniuk
29e6dc8c1b compaction: do not swallow compaction_stopped_exception for reshape
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.

Rethrow an exception in perfrom task for reshape compaction.

Fixes: #15058.

(cherry picked from commit e0ce711e4f)

Closes #15122
2023-08-23 12:11:58 +03:00
Vlad Zolotarov
9a414d440d scylla_raid_setup: make --online-discard argument useful
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.

Fixes #14963

Closes #14964

(cherry picked from commit e13a2b687d)
2023-08-22 10:40:37 +03:00
Anna Mikhlin
e0ebc95025 release: prepare for 5.2.7 2023-08-21 14:44:56 +03:00
Botond Dénes
b7ab42b61c Merge 'Ignore no such column family in repair' from Aleksandra Martyniuk
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.

When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.

Fixes: #13045

Closes #13068

* github.com:scylladb/scylladb:
  repair: fix indentation
  repair: continue user requested repair if no_such_column_family is thrown
  repair: add find_column_family_if_exists function

(cherry picked from commit 9859bae54f)
2023-08-20 19:49:21 +03:00
Botond Dénes
098baaef48 Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL  `counterasblob()` and `blobascounter()` functions (issue #14742).

As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.

Fixes #14501
Fixes #14742

Closes #14745

* github.com:scylladb/scylladb:
  test/cql-pytest: test confirming that casting to counter doesn't work
  cql: support casting of counter to other types
  cql: implement missing counterasblob() and blobascounter() functions
  cql: implement missing type functions for "counters" type

(cherry picked from commit a637ddd09c)
2023-08-13 14:53:48 +03:00
Nadav Har'El
e11561ef65 cql-pytest: translate Cassandra's tests for compact tables
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.

This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.

All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:

Already known issues:

Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
            restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k

New issues:

Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
             ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
             Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12816

(cherry picked from commit 328cdb2124)
2023-08-13 14:44:19 +03:00
Nadav Har'El
e03c21a83b cql-pytest: translate Cassandra's tests for CAST operations
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.

There are 13 tests, 9 of them currently xfail.

The failures are caused by one recently-discovered issue:

Refs #14501: Cannot Cast Counter To Double

and by three previously unknown or undocumented issues:

Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
             milliseconds
Refs #14522: Support CAST function not only in SELECT

Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14528

(cherry picked from commit f08bc83cb2)
2023-08-13 14:41:36 +03:00
Nadav Har'El
79b5befe65 test/cql-pytest: add tests for data casts and inf in sums
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:

1. A new test file, test_cast_data.py, for testing data casts (a
   CAST (...) as ... in a SELECT), starting with testing casts from
   varint to other types.

   The test uncovers a lot of interesting cases (it is heavily
   commented to explain these cases) but nothing there is wrong
   and all tests pass on Scylla.

2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
   that this caused #13551. In Cassandra and older Scylla, the sum
   returned a NaN. In Scylla today, it generates a misleading
   error message.

As usual, the tests were run on both Cassandra (4.1.1) and Scylla.

Refs #13551.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
2023-08-13 14:40:08 +03:00
Petr Gusev
aca9e41a44 topology.cc: remove_endpoint: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.

This is a backport of bcb1d7c to branch-5.2.

Refs: #14184

Closes #14893
2023-08-11 14:29:37 +03:00
Pavel Emelyanov
ff22807ed2 Update seastar submodule
* seastar 29a0e645...534cb38c (1):
  > rpc: Abort connection if send_entry() fails

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-09 11:30:57 +03:00
Botond Dénes
bcb8f6a8dd Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one

(cherry picked from commit a8feb7428d)
2023-08-09 10:20:53 +03:00
Kefu Chai
9ce3695a0d compaction_manager: prevent gc-only sstables from being compacted
before this change, there are chances that the temporary sstables
created for collecting the GC-able data create by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.

in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have good chance to
overlap the latter. these GC-only SSTables are assigned with
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.

so, in this change,

* to prevent sstables in the transient stage from being picked
  up by regular compactions, a new interface class is introduced
  so that the sstable is always added to registration before
  it is added to sstable set, and removed from registration after
  it is removed from sstable set. the struct helps to consolidate
  the regitration related logic in a single place, and helps to
  make it more obvious that the timespan of an sstable in
  the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
  overlap with the output sstable run. the run_id for the
  gc sstable run is created only when the gc sstable writer
  is created. because the gc sstables is not always created
  for all compactions.

please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` passes a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.

Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14725

(cherry picked from commit fdf61d2f7c)

Closes #14827
2023-08-04 09:59:10 +03:00
Patryk Jędrzejczak
4cd5847761 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4ad7)
2023-08-02 18:05:39 +02:00
Botond Dénes
2b7f1cd906 Update tools/java submodule
* tools/java 83b2168b19...f8f556d802 (1):
  > Use EstimatedHistogram in metricPercentilesAsArray

Fixes: #10089
2023-07-31 12:13:01 +03:00
Nadav Har'El
e34c62c567 Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()

(cherry picked from commit 056d04954c)
2023-07-31 03:43:44 -04:00
Nadav Har'El
992c50173a Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved.
We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT

(cherry picked from commit 49c8c06b1b)
2023-07-31 09:14:55 +03:00
Beni Peled
58acf071bf release: prepare for 5.2.6 2023-07-30 14:19:28 +03:00
Raphael S. Carvalho
d2369fc546 cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`

cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818

(cherry picked from commit 050ce9ef1d)
2023-07-28 13:56:28 +02:00
Kamil Braun
6273c4df35 test: use correct timestamp resolution in test_group0_history_clearing_old_entries
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.

Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.

Fix it by using microseconds everywhere.

Fixes #14653

Closes #14670

(cherry picked from commit 9d4b3c6036)
2023-07-27 15:46:37 +02:00
Raphael S. Carvalho
752984e774 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813

(cherry picked from commit 0ac43ea877)
2023-07-26 14:30:32 +03:00
Raphael S. Carvalho
986491447b table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1d8cb32a5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14764
2023-07-20 16:46:15 +03:00
Takuya ASADA
a05bb26cd6 scylla_fstrim_setup: start scylla-fstrim.timer on setup
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.

Also, unmask is unnecessary for enabling the timer.

Fixes #14249

Closes #14252

(cherry picked from commit c70a9cbffe)

Closes #14421
2023-07-18 16:05:09 +03:00
Michał Chojnowski
41aef6dc96 partition_snapshot_reader.hh: fix iterator invalidation in do_refresh_state
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.

Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.

Fixes #14696

Closes #14697
2023-07-17 14:20:37 +02:00
Botond Dénes
aa5e904c40 repair: Release permit earlier when the repair_reader is done
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0414)
2023-07-14 18:18:43 +03:00
Marcin Maliszkiewicz
eff2fe79b1 alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454

(cherry picked from commit 6424dd5ec4)
2023-07-13 23:27:46 +03:00
Nadav Har'El
ee8b26167b Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation

(cherry picked from commit d2e089777b)
2023-07-13 23:27:38 +03:00
Yaron Kaikov
02bc54d4b6 release: prepare for 5.2.5 2023-07-13 14:23:18 +03:00
Avi Kivity
c9a5c4c876 Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types

(cherry picked from commit a7c2c9f92b)
2023-07-12 15:31:48 +03:00
Tomasz Grabiec
1a6f4389ae Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14649
2023-07-12 10:09:56 +03:00
Calle Wilund
1088c3e24a storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14636
2023-07-12 09:24:56 +03:00
Botond Dénes
c9cb8dcfd0 Merge '[backport 5.2] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Backported from c25201c1a3. `view_build_test.cc` needed some tiny adjustments for the backport.

Closes #14619
Fixes #14503

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-11 15:04:23 +03:00
Takuya ASADA
91c1feec51 scylla_raid_setup: wipe filesystem signatures from specified disks
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip to populate /dev/disk/by-* since it
detected two filesystem signatures, because mkfs.xxx did not cleared previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.

Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple time on same disks).

Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.

Fixes #13737

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #13738

(cherry picked from commit fdceda20cc)
2023-07-11 15:00:03 +03:00
Piotr Dulikowski
57d0310dcc combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452

(cherry picked from commit ee9bfb583c)

Closes #14605
2023-07-11 11:09:25 +03:00
Michał Chojnowski
78f25f2d36 test: view_build_test: add range tombstones to test_view_update_generator_buffering
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
14fa3ee34e test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
A random mutation test for view_updating_consumer's buffering logic.
Reproduces #14503.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
75933b9906 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but we this patch makes it variable for
unit testing purposes.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
fc7b02c8e4 view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-11 09:44:00 +02:00
Jan Ciolek
0f4f8638c5 forward_service: fix forgetting case-sensitivity in aggregates
There was a bug that caused aggregates to fail when
used on column-sensitive columns.

For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.

The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
2023-07-10 15:22:58 +03:00
Botond Dénes
0ba37fa431 Merge 'doc: fix rollback in the 4.3-to-2021.1, 5.0-to-2022.1, and 5.1-to-2022.2 upgrade guides' from Anna Stuchlik
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).

This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

- [x]  5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)

(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)

Closes #14444

* github.com:scylladb/scylladb:
  doc: fix rollback in 4.3-to-2021.1 upgrade guide
  doc: fix rollback in 5.0-to-2022.1 upgrade guide
  doc: fix rollback in 5.1-to-2022.2 upgrade guide

(cherry picked from commit 8a7261fd70)
2023-07-10 15:16:24 +03:00
Raphael S. Carvalho
55edbded47 compaction: avoid excessive reallocation and during input list formatting
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.

in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13907

(cherry picked from commit 5544d12f18)

Fixes scylladb/scylladb#14071
2023-07-09 23:54:18 +03:00
Marcin Maliszkiewicz
9f79c9f41d docs: link general repairs page to RBNO page
Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already.

Closes #12984

(cherry picked from commit bd7caefccf)
2023-07-07 16:38:32 +02:00
Kamil Braun
6dd09bb4ea storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300

(cherry picked from commit 732feca115)

Backport note: the original `assert` was never added to branch-5.2, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
2023-07-05 13:14:24 +02:00
Mikołaj Grzebieluch
f431345ab6 raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.

This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.

Fixes #13543

(cherry picked from commit a45e0765e4)
2023-07-05 13:01:57 +02:00
Anna Stuchlik
009601d374 doc: fix rollback in 5.2-to-2023.1 upgrade guide
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.

This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

Closes #14373

(cherry picked from commit f4ae2c095b)
2023-06-29 12:07:41 +03:00
Botond Dénes
8e63b2f3e3 Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.

The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.

There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to detect this case.

After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.

Fixes #13491

Closes #14375

* github.com:scylladb/scylladb:
  readers: evictable_reader: don't accidentally consume the entire partition
  test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position

(cherry picked from commit 586102b42e)
2023-06-29 12:04:35 +03:00
Benny Halevy
483c0b183a repair: use fmt::join to print ks_erms|boost::adaptors::map_keys
This is a minimal fix for #13146 for branch-5.2

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14405
2023-06-27 14:15:28 +03:00
Anna Stuchlik
d5063e6347 doc: add Ubuntu 22 to 2021.1 OS support
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036

This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.

This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.

Closes #14372

(cherry picked from commit 74fc69c825)
2023-06-26 13:43:58 +03:00
Anna Stuchlik
543fa04e4d doc: udpate the OSS docs landing page
Fixes https://github.com/scylladb/scylladb/issues/14333

This commit replaces the documentation landing page with
the Open Source-only documentation landing page.

This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.

(cherry picked from commit f60f89df17)

Closes #14370
2023-06-23 14:00:33 +02:00
Anna Mikhlin
cebbf6c5df release: prepare for 5.2.4 2023-06-22 16:23:46 +03:00
Avi Kivity
73b8669953 Update seastar submodule (default priority class shares)
* seastar 32ab15cda6...29a0e64513 (1):
  > reactor: change shares for default IO class from 1 to 200

Fixes #13753.

In 5.3: 37e6e65211
2023-06-21 21:23:14 +03:00
Botond Dénes
9efca96cf2 Merge 'Backport 5.2 test.py stability/UX improvemenets' from Kamil Braun
Backport the following improvements for test.py topology tests for CI stability:
- https://github.com/scylladb/scylladb/pull/12652
- https://github.com/scylladb/scylladb/pull/12630
- https://github.com/scylladb/scylladb/pull/12619
- https://github.com/scylladb/scylladb/pull/12686
- picked from https://github.com/scylladb/scylladb/pull/12726: 9ceb6aba81
- picked from https://github.com/scylladb/scylladb/pull/12173: fc60484422
- https://github.com/scylladb/scylladb/pull/12765
- https://github.com/scylladb/scylladb/pull/12804
- https://github.com/scylladb/scylladb/pull/13342
- https://github.com/scylladb/scylladb/pull/13589
- picked from https://github.com/scylladb/scylladb/pull/13135: 7309a1bd6b
- picked from https://github.com/scylladb/scylladb/pull/13134: 21b505e67c, a4411e9ec4, c1d0ee2bce, 8e3392c64f, 794d0e4000, e407956e9f
- https://github.com/scylladb/scylladb/pull/13271
- https://github.com/scylladb/scylladb/pull/13399
- picked from https://github.com/scylladb/scylladb/pull/12699: 3508a4e41e, 08d754e13f, 62a945ccd5, 041ee3ffdd
- https://github.com/scylladb/scylladb/pull/13438 (but skipped the test_mutation_schema_change.py fix since I didn't backport this new test)
- https://github.com/scylladb/scylladb/pull/13427
- https://github.com/scylladb/scylladb/pull/13756
- https://github.com/scylladb/scylladb/pull/13789
- https://github.com/scylladb/scylladb/pull/13933 (but skipped the test_snapshot.py fix since I didn't backport this new test)

Closes #14215

* github.com:scylladb/scylladb:
  test: pylib: fix `read_barrier` implementation
  test: pylib: random_tables: perform read barrier in `verify_schema`
  test: issue a read barrier before checking ring consistency
  Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
  test/pylib: ManagerClient helpers to wait for...
  test: pylib: Add a way to create cql connections with particular coordinators
  test/pylib: get gossiper alive endpoints
  test/topology: default replication factor 3
  test/pylib: configurable replication factor
  scylla_cluster.py: optimize node logs reading
  test/pylib: RandomTables.add_column with value column
  scylla_cluster.py: add start flag to server_add
  ServerInfo: drop host_id
  scylla_cluster.py: add config to server_add
  scylla_cluster.py: add expected_error to server_start
  scylla_cluster.py: ScyllaServer.start, refactor error reporting
  scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
  test: improve logging in ScyllaCluster
  test: topology smp test with custom cluster
  test/pylib: topology: support clusters of initial size 0
  Merge 'test/pylib: split and refactor topology tests' from Alecco
  Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
  test: Increase START_TIMEOUT
  test/pylib: one-shot error injection helper
  test: topology: wait for token ring/group 0 consistency after decommission
  test: topology: verify that group 0 and token ring are consistent
  Merge 'pytest: start after ungraceful stop' from Alecco
  Merge 'test.py: improve test failure handling' from Kamil Braun
2023-06-15 07:19:39 +03:00
Pavel Emelyanov
210e3d1999 Backport 'Merge 'Enlighten messaging_service::shutdown()''
This includes seastar update titled
  'Merge 'Split rpc::server stop into two parts''

* br-5.2-backport-ms-shutdown:
  messaging_service: Shutdown rpc server on shutdown
  messaging_service: Generalize stop_servers()
  messaging_service: Restore indentation after previous patch
  messaging_service: Coroutinize stop()
  messaging_service: Coroutinize stop_servers()
  Update seastar submodule

refs: #14031
2023-06-14 09:14:06 +03:00
Pavel Emelyanov
702d622b38 messaging_service: Shutdown rpc server on shutdown
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:04:04 +03:00
Pavel Emelyanov
db44630254 messaging_service: Generalize stop_servers()
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:59 +03:00
Pavel Emelyanov
5d3d64bafe messaging_service: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:53 +03:00
Pavel Emelyanov
079f5d8eca messaging_service: Coroutinize stop()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:48 +03:00
Pavel Emelyanov
fd7310b104 messaging_service: Coroutinize stop_servers()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:42 +03:00
Pavel Emelyanov
991d00964d Update seastar submodule
* seastar 8c86e6de...32ab15cd (1):
  > rpc: Introduce server::shutdown()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:02:46 +03:00
Anna Stuchlik
0137ddaec8 doc: remove support for Ubuntu 18
Fixes https://github.com/scylladb/scylladb/issues/14097

This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.

The update is in sync with the change made for
ScyllaDB 5.2.

This commit must be backported to branch-5.2 and
branch-5.3.

Closes #14118

(cherry picked from commit b7022cd74e)
2023-06-13 12:06:56 +03:00
Raphael S. Carvalho
58f88897c8 compaction: Fix incremental compaction for sstable cleanup
After c7826aa910, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038

(cherry picked from commit 23443e0574)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14193
2023-06-13 09:53:46 +03:00
Kamil Braun
f4115528d6 test: pylib: fix read_barrier implementation
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.

Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.

This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).

Closes #13933

(cherry picked from commit 64dc76db55)
Backport note: skipped the test_snapshot.py change, as the test doesn't
exist on this branch.
2023-06-12 12:40:22 +02:00
Kamil Braun
9c941aba0b test: pylib: random_tables: perform read barrier in verify_schema
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.

However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.

To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.

Fixes #13788

Closes #13789

(cherry picked from commit 3f3dcf451b)
2023-06-12 12:40:22 +02:00
Konstantin Osipov
094bcac399 test: issue a read barrier before checking ring consistency
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times, it only guarantees the
same order of events on all replicas.

When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.

To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.

Fixes #13518 (flaky topology test).

Closes #13756

(cherry picked from commit e7c9ca560b)
2023-06-12 12:40:22 +02:00
Kamil Braun
e49a531aaa Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.

Closes #13427

* github.com:scylladb/scylladb:
  pylib_test: add tests for read_last_line
  pytest: add pylib_test directory
  scylla_cluster.py: fix read_last_line
  scylla_cluster.py: move read_last_line to util.py

(cherry picked from commit 70f2b09397)
2023-06-12 12:40:22 +02:00
Alejo Sanchez
bcf99a37cd test/pylib: ManagerClient helpers to wait for...
server to see other servers after start/restart

When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.

Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).

Fixes #13147

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13438

(cherry picked from commit 11561a73cb)
Backport note: skipped the test_mutation_schema_change.py fix as the
test doesn't exist on this branch.
2023-06-12 12:40:08 +02:00
Tomasz Grabiec
fe4af95745 test: pylib: Add a way to create cql connections with particular coordinators
Usage:

  await manager.driver_connect(server=servers[0])
  manager.cql.execute(f"...", execution_profile='whitelist')

(cherry picked from commit 041ee3ffdd)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
ac5dff7de0 test/pylib: get gossiper alive endpoints
Helper to get list of gossiper alive endpoints from REST API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 62a945ccd5)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
ad99456a9d test/topology: default replication factor 3
For most tests there will be nodes down, increase replication factor to
3 to avoid having problems for partitions belonging to down nodes.

Use replication factor 1 for raft upgrade tests.

(cherry picked from commit 08d754e13f)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
937e890fba test/pylib: configurable replication factor
Make replication factor configurable for the RandomTables helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 3508a4e41e)
2023-06-12 12:38:15 +02:00
Petr Gusev
12eec5bb2b scylla_cluster.py: optimize node logs reading
There are two occasions in scylla_cluster
where we read the node logs, and in both of
them we read the entire file in memory.
This is not efficient and may cause an OOM.

In the first case we need the last line of the
log file, so we seek at the end and move backwards
looking for a new line symbol.

In the second case we look through the
log file to find the expected_error.
The readlines() method returns a Python
list object, which means it reads the entire
file in memory. It's sufficient to just remove
it since iterating over the file instance
already yields lines lazily one by one.

This is a follow-up for #13134.

Closes #13399

(cherry picked from commit 09636b20f3)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
59847389d4 test/pylib: RandomTables.add_column with value column
When adding extra columns in a test, make them value column. Name them
with the "v_" prefix and use the value column number counter.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13271

(cherry picked from commit 81b40c10de)
2023-06-12 12:38:15 +02:00
Petr Gusev
7a8c5db55b scylla_cluster.py: add start flag to server_add
Sometimes when creating a node it's useful
to just install it and not start. For example,
we may want to try to start it later with
expected error.

The ScyllaServer.install method has been made
exception safe, if an exception occurs, it
reverts to the original state. This allows
to not duplicate the try/except logic
in two of its call sites.

(cherry picked from commit e407956e9f)
2023-06-12 12:38:15 +02:00
Petr Gusev
15ea5bf53f ServerInfo: drop host_id
We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.

(cherry picked from commit 794d0e4000)
2023-06-12 12:38:15 +02:00
Petr Gusev
3ab610753e scylla_cluster.py: add config to server_add
Sometimes when creating a node it's useful
to pass a custom node config.

(cherry picked from commit 8e3392c64f)
2023-06-12 12:38:15 +02:00
Petr Gusev
1959eddf86 scylla_cluster.py: add expected_error to server_start
Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.

(cherry picked from commit c1d0ee2bce)
2023-06-12 12:38:15 +02:00
Petr Gusev
43525aec83 scylla_cluster.py: ScyllaServer.start, refactor error reporting
Extract the function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement expected_error feature.

(cherry picked from commit a4411e9ec4)
2023-06-12 12:38:15 +02:00
Petr Gusev
930c4e65d6 scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.

(cherry picked from commit 21b505e67c)
2023-06-12 12:38:15 +02:00
Konstantin Osipov
d2caaef188 test: improve logging in ScyllaCluster
Print IP addresses and cluster identifiers in more log messages,
it helps debugging.

(cherry picked from commit 7309a1bd6b)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
6474edd691 test: topology smp test with custom cluster
Instead of decommission of initial cluster, use custom cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13589

(cherry picked from commit ce87aedd30)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
b39cdadff3 test/pylib: topology: support clusters of initial size 0
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.

Add a proof-of-concept test to be removed later.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13342

(cherry picked from commit e3b462507d)
2023-06-12 12:38:15 +02:00
Nadav Har'El
7b60cddae7 Merge 'test/pylib: split and refactor topology tests' from Alecco
Move long running topology tests out of  `test_topology.py` and into their own files, so they can be run in parallel.

While there, merge simple schema tests.

Closes #12804

* github.com:scylladb/scylladb:
  test/topology: rename topology test file
  test/topology: lint and type for topology tests
  test/topology: move topology ip tests to own file
  test/topology: move topology test remove garbaje...
  test/topology: move topology rejoin test to own file
  test/topology: merge topology schema tests and...
  test/topology: isolate topology smp params test
  test/topology: move topology helpers to common file

(cherry picked from commit a24600a662)
2023-06-12 12:38:15 +02:00
Botond Dénes
ea80fe20ad Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.

Closes #12765

* github.com:scylladb/scylladb:
  test/pylib: use larger timeout for decommission/removenode
  test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT

(cherry picked from commit e55f475db1)
2023-06-12 12:38:15 +02:00
Asias He
f90fe6f312 test: Increase START_TIMEOUT
It is observed that CI machine is slow to run the test. Increase the
timeout of adding servers.

(cherry picked from commit fc60484422)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
6e2c547388 test/pylib: one-shot error injection helper
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.

Fix tests using the previous helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 9ceb6aba81)
2023-06-12 12:38:05 +02:00
Kamil Braun
91aa2cd8d7 test: topology: wait for token ring/group 0 consistency after decommission
There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.

Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.

Closes #12686

(cherry picked from commit 40142a51d0)
2023-06-12 11:58:32 +02:00
Kamil Braun
05c3f7ecef test: topology: verify that group 0 and token ring are consistent
After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.

Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.

Fixes: #12153

Closes #12619

(cherry picked from commit fa9cf81af2)
2023-06-12 11:58:02 +02:00
Kamil Braun
3aa73e8b5a Merge 'pytest: start after ungraceful stop' from Alecco
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.

Fixes #12218

Closes #12630

* github.com:scylladb/scylladb:
  pytest: test start after ungraceful stop
  test.py: enable force_schema_commit_log

(cherry picked from commit 5eadea301e)
2023-06-12 11:57:09 +02:00
Nadav Har'El
a0ba3b3350 Merge 'test.py: improve test failure handling' from Kamil Braun
Improve logging by printing the cluster at the end of each test.

Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.

Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.

Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.

Closes #12652

* github.com:scylladb/scylladb:
  test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
  test/topology: don't drop random_tables keyspace after a failed test
  test/pylib: mark cluster as dirty after a failed test
  test: pylib, topology: don't perform operations after test on a dirty cluster
  test/pylib: print cluster at the end of test

(cherry picked from commit 2653865b34)
2023-06-12 11:47:54 +02:00
Anna Mikhlin
ea08d409f1 release: prepare for 5.2.3 2023-06-08 22:04:50 +03:00
Avi Kivity
f32971b81f Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.

Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.

Fixes: #13784

Closes #13790

* github.com:scylladb/scylladb:
  test/boost/database_test: add unit test for semaphore mismatch on range scans
  partition_slice_builder: add set_specific_ranges()
  multishard_mutation_query: make reader_context::lookup_readers() exception safe
  multishard_mutation_query: lookup_readers(): make inner lambda a coroutine

(cherry picked from commit 1c0e8c25ca)
2023-06-08 04:29:51 -04:00
Michał Chojnowski
8872157422 data_dictionary: fix forgetting of UDTs on ALTER KEYSPACE
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.

Fixes #14139

Closes #14143

(cherry picked from commit 1a521172ec)
2023-06-06 21:52:47 +03:00
Kamil Braun
b5785ed434 auth: don't use infinite timeout in default_role_row_satisfies query
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.

Use the same timeout configuration as other queries in this module do.

Fixes #13545.

Closes #14134

(cherry picked from commit f51312e580)
2023-06-06 19:39:29 +03:00
Pavel Emelyanov
70f93767fd Update seastar submodule
* seastar 98504c4b...8c86e6de (1):
  > rpc: Wait for server socket to stop before killing conns

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-30 20:10:44 +03:00
Tzach Livyatan
bb3751334c Remove Ubuntu 18.04 support from 5.2
Ubuntu [18.04 will be soon out of standard support](https://ubuntu.com/blog/18-04-end-of-standard-support), and can be removed from 5.2 supported list
https://github.com/scylladb/scylla-pkg/issues/3346

Closes #13529

(cherry picked from commit e655060429)
2023-05-30 16:25:42 +03:00
Beni Peled
9dd70a58c3 release: prepare for 5.2.2 2023-05-18 14:03:20 +03:00
Anna Stuchlik
0bc6694ac5 doc: fix the links to the Enterprise docs
Fixes https://github.com/scylladb/scylladb/issues/13915

This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.

This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.

Closes #13917

(cherry picked from commit 6f4a68175b)
2023-05-18 08:40:02 +03:00
Botond Dénes
486483b379 Merge '[Backport 5.2]: node ops backports' from Benny Halevy
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes #12798)

Closes #13531

* github.com:scylladb/scylladb:
  Merge 'Do not mask node operation errors' from Benny Halevy
  Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
  storage_service: Wait for normal state handler to finish in replace
  storage_service: Wait for normal state handler to finish in bootstrap
  storage_service: Send heartbeat earlier for node ops
2023-05-17 16:46:49 +03:00
Tzach Livyatan
9afaec5b12 Update Azure recommended instances type from the Lsv2-series to the Lsv3-series
Closes #13835

(cherry picked from commit a73fde6888)
2023-05-17 15:41:47 +03:00
Anna Stuchlik
9c99dc36b5 doc: add OS support for version 2023.1
Fixes https://github.com/scylladb/scylladb/issues/13857

This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.

After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1

Closes: #13858
(cherry picked from commit 84ed95f86f)
2023-05-16 10:11:21 +03:00
Tomasz Grabiec
548a7f73d3 Merge 'range_tombstone_change_generator: fix an edge case in flush()' from Michał Chojnowski
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes https://github.com/scylladb/scylladb/issues/12462

Closes #13894

* github.com:scylladb/scylladb:
  tests: row_cache: Add reproducer for reader producing missing closing range tombstone
  range_tombstone_change_generator: fix an edge case in flush()
2023-05-15 23:29:08 +02:00
Raphael S. Carvalho
5c66875dbe sstables: Fix use-after-move when making reader in reverse mode
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
            legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);

Fixes #13394.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
2023-05-15 20:27:34 +03:00
Raphael S. Carvalho
26b4d2c3c1 db/view/build_progress_virtual_reader: Fix use-after-move
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.

static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
                _db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),

Fixes #13395.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
2023-05-15 20:26:01 +03:00
Raphael S. Carvalho
874062b72a index/built_indexes_virtual_reader.hh: Fix use-after-move
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
                _db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),

Fixes #13396.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
2023-05-15 20:24:35 +03:00
Raphael S. Carvalho
71ec750a59 replica: Fix use-after-move in table::make_streaming_reader
Variant used by
streaming/stream_transfer_task.cc:        , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))

as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.

static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
        return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());

Fixes #13397.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
2023-05-15 20:21:48 +03:00
Tomasz Grabiec
7c1bdc6553 tests: row_cache: Add reproducer for reader producing missing closing range tombstone
Adds a reproducer for #12462.

The bug manifests by reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the fix range_tombstone_change_generator::flush()
was used with end_of_range=true to produce the closing range_tombstone_change
and it did not handle correctly the case when there are two adjacent range
tombstones and flush(pos, end_of_range=true) is called such that pos is the
boundary between the two.

Cherry-picked from a717c803c7.
2023-05-15 18:02:40 +02:00
Michał Chojnowski
24d966f806 range_tombstone_change_generator: fix an edge case in flush()
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes #12462
2023-05-15 17:48:24 +02:00
Asias He
05a3a1bf55 tombstone_gc: Fix gc_before for immediate mode
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.

Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.

The following procedure reproduces the issue:

- Start 2 nodes

- Insert data

```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```

- Run nodetool flush and nodetool compact

- Compaction drops all data

```
~128 total partitions merged to 0.
```

Fixes #13572

Closes #13800

(cherry picked from commit 7fcc403122)
2023-05-15 10:33:29 +03:00
Takuya ASADA
f148a6be1d scylla_kernel_check: suppress verbose iotune messages
Stop printing verbose iotune messages while the check, just print error
message.

Fixes #13373.

Closes #13362

(cherry picked from commit 160c184d0b)
2023-05-14 21:25:57 +03:00
Benny Halevy
5785550e24 view: view_builder: start: demote sleep_aborted log error
This is not really an error, so print it in debug log_level
rather than error log_level.

Fixes #13374

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13462

(cherry picked from commit cc42f00232)
2023-05-14 21:21:59 +03:00
Avi Kivity
401de17c82 Update seastar submodule (condition_variable tasktrace fix)
* seastar aa46b980ec...98504c4bb6 (1):
  > condition-variable: replace the coroutine wakeup task with a promise

Fixes #13368
2023-05-14 21:12:12 +03:00
Raphael S. Carvalho
94c9553e8a Fix use-after-move when initializing row cache with dummy entry
Courtersy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});

The use-after-move is UB, as for it to happen, depends on evaluation order.

We haven't hit it yet as clang is left-to-right.

Fixes #13400.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13401

(cherry picked from commit d2d151ae5b)
2023-05-14 21:02:24 +03:00
Anna Mikhlin
f1c45553bc release: prepare for 5.2.1 2023-05-08 22:15:46 +03:00
Botond Dénes
1a288e0a78 Update seastar submodule
* seastar 1488aaf8...aa46b980 (1):
  > core/on_internal_error: always log error with backtrace

Fixes: #13786
2023-05-08 10:30:10 +03:00
Marcin Maliszkiewicz
a2fed1588e db: view: use deferred_close for closing staging_sstable_reader
When consume_in_thread throws the reader should still be closed.

Related https://github.com/scylladb/scylla-enterprise/issues/2661

Closes #13398
Refs: scylladb/scylla-enterprise#2661
Fixes: #13413

(cherry picked from commit 99f8d7dcbe)
2023-05-08 09:41:07 +03:00
Botond Dénes
f07a06d390 Merge 'service:forward_service: use long type instead of counter in function mocking' from Michał Jadwiszczak
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.

Fixes: #12939

Closes #12963

* github.com:scylladb/scylladb:
  test:boost: counter column parallelized aggregation test
  service:forward_service: use long type when column is counter

(cherry picked from commit 61e67b865a)
2023-05-07 14:27:29 +03:00
Anna Stuchlik
4ec531d807 doc: remove the sequential repair option from docs
Fixes https://github.com/scylladb/scylladb/issues/12132

The sequential repair mode is not supported. This commit
removes the incorrect information from the documentation.

Closes #13544

(cherry picked from commit 3d25edf539)
2023-05-07 14:27:29 +03:00
Asias He
4867683f80 storage_service: Fix removing replace node as pending
Consider

- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2

  ```
  auto host_id = _gossiper.get_host_id(endpoint);
  auto existing = tmptr->get_endpoint_for_host_id(host_id);
  ```

  host_id = new host id
  existing = empty

  As a result, del_replacing_endpoint() will not be called.

This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.

This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.

To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.

Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3

After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3

Tests: https://github.com/scylladb/scylla-dtest/pull/3126

Closes #13677

Fixes: https://github.com/scylladb/scylla-enterprise/issues/2852

(cherry picked from commit a8040306bb)
2023-05-03 14:15:13 +03:00
Botond Dénes
0e42defe06 readers: evictable_reader: skip progress guarantee when next pos is partition start
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is garanteed anyway, because there is exactly one
partition-start fragment in each partition.

Fixes: #13491

Closes #13563

(cherry picked from commit 72003dc35c)
2023-05-02 21:58:41 +03:00
Avi Kivity
f73d017f05 tools: toolchain: regenerate
Fixes #13744
2023-05-02 13:16:59 +03:00
Pavel Emelyanov
3723678b82 scylla-gdb: Parse and eval _all_threads without quotes
I've no idea why the quotes are there at all, it works even without
them. However, with quotes gdb-13 fails to find the _all_threads static
thread-local variable _unless_ it's printed with gdb "p" command
beforehand.

fixes: #13125

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13132

(cherry picked from commit 537510f7d2)
2023-05-02 13:16:59 +03:00
Botond Dénes
ea506f50cc Merge 'Do not mask node operation errors' from Benny Halevy
This series handles errors when aborting node operations and prints them rather letting them leak and be exposed to the user.

Also, cleanup the node_ops logging formats when aborting different node ops
and add more error logging around errors in the "worker" nodes.

Closes #12799

* github.com:scylladb/scylladb:
  storage_service: node_ops_signal_abort: print a warning when signaling abort
  storage_service: s/node_ops_singal_abort/node_ops_signal_abort/
  storage_service: node_ops_abort: add log messages
  storage_service: wire node_ops_ctl for node operations
  storage_service: add node_ops_ctl class to formalize all node_ops flow
  repair: node_ops_cmd_request: add print function
  repair: do_decommission_removenode_with_repair: log ignore_nodes
  repair: replace_with_repair: get ignore_nodes as unordered_set
  gossiper: get_generation_for_nodes: get nodes as unordered_set
  storage_service: don't let node_ops abort failures mask the real error

(cherry picked from commit 6373452b31)
2023-04-30 18:58:28 +03:00
Kamil Braun
42fd3704e4 Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.

If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.

It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.

It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.

There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.

We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.

The race can still happen, but the time window is now much smaller.

The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.

Fixes #12989
Refs #12969

Closes #13028

* github.com:scylladb/scylladb:
  storage_service: node ops: Extract node_ops_insert() to reduce code duplication
  storage_service: Make node operations safer by detecting asymmetric abort
  storage_service: node ops: Add error injections
  service: node_ops: Make watchdog and heartbeat intervals configurable

(cherry picked from commit 2b44631ded)
2023-04-30 18:58:28 +03:00
Asias He
c9d19b3595 storage_service: Wait for normal state handler to finish in replace
Similar to "storage_service: Wait for normal state handler to finish in
bootstrap", this patch enables the check on the replace procedure.

(cherry picked from commit 5856e69462)
2023-04-30 18:58:28 +03:00
Asias He
9a873bf4b3 storage_service: Wait for normal state handler to finish in bootstrap
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes #12764
Fixes #12956

(cherry picked from commit 53636167ca)
2023-04-30 18:58:28 +03:00
Asias He
51a00280a2 storage_service: Send heartbeat earlier for node ops
Node ops has the following procedure:

1   for node in sync_nodes
      send prepare cmd to node

2   for node in sync_nodes
      send heartbeat cmd to node

If any of the prepare cmd in step 1 takes longer than the heartbeat
watchdog timeout, the heartbeat in step 2 will be too late to update the
watchdog, as a result the watchdog will abort the operation.

To prevent slow prepare cmd kills the node operations, we can start the
heartbeat earlier in the procedure.

Fixes #11011
Fixes #12969

Closes #12980

(cherry picked from commit ba919aa88a)
2023-04-30 18:58:28 +03:00
Wojciech Mitros
b0a7c02e09 rust: update dependencies
Cranelift-codegen 0.92.0 and wasmtime 5.0.0 have security issues
potentially allowing malicious UDFs to read some memory outside
the wasm sandbox. This patch updates them to versions 0.92.1
and 5.0.1 respectively, where the issues are fixed.

Fixes #13157

Closes #13171

(cherry picked from commit aad2afd417)
2023-04-27 22:01:44 +03:00
Wojciech Mitros
f18c49dcc6 rust: update dependencies
Wasmtime added some improvements in recent releases - particularly,
two security issues were patched in version 2.0.2. There were no
breaking changes for our use other than the strategy of returning
Traps - all of them are now anyhow::Errors instead, but we can
still downcast to them, and read the corresponding error message.

The cxx, anyhow and futures dependency versions now match the
versions saved in the Cargo.lock.

Closes #12830

(cherry picked from commit 8b756cb73f)

Ref #13157
2023-04-27 22:00:54 +03:00
Anna Stuchlik
35dfec78d1 doc: fixes https://github.com/scylladb/scylladb/issues/12964, removes the information that the CDC options are experimental
Closes #12973

(cherry picked from commit 4dd1659d0b)
2023-04-27 21:06:49 +03:00
Raphael S. Carvalho
dbd8ca4ade replica: Fix undefined behavior in table::generate_and_propagate_view_updates()
Undefined behavior because the evaluation order is undefined.

With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().

The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for row, which access schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.

Fixes #13093.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13092

(cherry picked from commit 3fae46203d)
2023-04-27 19:56:38 +03:00
Anna Stuchlik
1be4afb842 doc: remove incorrect info about BYPASS CACHE
Fixes https://github.com/scylladb/scylladb/issues/13106

This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.

Closes #13316

(cherry picked from commit 1cfea1f13c)
2023-04-27 19:54:04 +03:00
Kefu Chai
7cc9f5a05f dist/redhat: enforce dependency on %{release} also
* tools/python3 279b6c1...cf7030a (1):
  > dist: redhat: provide only a single version

s/%{version}/%{version}-%{release}/ in `Requires:` sections.

this enforces the runtime dependencies of exactly the same
releases between scylla packages.

Fixes #13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
2023-04-27 19:27:34 +03:00
Nadav Har'El
bf7fc9709d test/rest_api: fix flaky test for toppartitions
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second which we
considered more than enough - and indeed typically even 1ms is enough!
but very rarely (only know of only one occurance, in issue #13223) one
second is not enough.

Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: The tests starts with a 0.01 second
duration, and is then retried with increasing durations until it succeeds
or a 5-seconds duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of 1 seconds which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second, now it takes 0.07 seconds on a dev build on my laptop).

A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.

Fixes #13223.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13238

(cherry picked from commit c550e681d7)
2023-04-27 19:16:58 +03:00
Nadav Har'El
00a8c3a433 test/alternator: increase CQL connection timeout
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.

The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.

Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.

Fixes #13239

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13291

(cherry picked from commit 4fdcee8415)
2023-04-27 19:15:39 +03:00
Tomasz Grabiec
c08ed39a33 direct_failure_detector: Avoid throwing exceptions in the success path
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if happens often.

Also, it spams the logs if --logger-log-level exception=trace is enabled.

Avoid by swallowing the exception on cancellation.

Fixes #13278.

Closes #13279

(cherry picked from commit 99cb948eac)
2023-04-27 19:14:31 +03:00
Kefu Chai
04424f8956 test: cql-pytest: test_describe: clamp bloom filter's fp rate
before this change, we use `round(random.random(), 5)` for
the value of `bloom_filter_fp_chance` config option. there are
chances that this expression could return a number lower or equal
to 6.71e-05.

but we do have a minimal for this option, which is defined by
`utils::bloom_calculations::probs`. and the minimal false positive
rate is 6.71e-05.

we are observing test failures where the we are using 0 for
the option, and scylla right rejected it with the error message of
```
bloom_filter_fp_chance must be larger than 6.71e-05 and less than or equal to 1.0 (got 0)
```.

so, in this change, to address the test failure, we always use a number
slightly greater or equal to a number slightly greater to the minimum to
ensure that the randomly picked number is in the range of supported
false positive rate.

Fixes #13313
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13314

(cherry picked from commit 33f4012eeb)
2023-04-27 19:12:53 +03:00
Beni Peled
429b696bbc release: prepare for 5.2.0 2023-04-27 16:26:43 +03:00
Beni Peled
a89867d8c2 release: prepare for 5.2.0-rc5 2023-04-25 14:37:54 +03:00
Benny Halevy
6ad94fedf3 utils: clear_gently: do not clear null unique_ptr
Otherwise the null pointer is dereferenced.

Add a unit test reproducing the issue
and testing this fix.

Fixes #13636

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
2023-04-24 17:51:01 +03:00
Anna Stuchlik
a6188d6abc doc: document tombstone_gc as not experimental
The tombstone_gc was documented as experimental in version 5.0.
It is no longer experimental in version 5.2.
This commit updates the information about the option.

Closes #13469

(cherry picked from commit a68b976c91)
2023-04-24 11:54:06 +03:00
Botond Dénes
50095cc3a5 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID

(cherry picked from commit 10c1f1dc80)
2023-04-23 16:03:02 +03:00
Botond Dénes
7b2215d8e0 Merge 'Backport bugfixes regarding UDT, UDF, UDA interactions to branch-5.2' from Wojciech Mitros
This patch backports https://github.com/scylladb/scylladb/pull/12710 to branch-5.2. To resolve the conflicts that it's causing, it also includes
* https://github.com/scylladb/scylladb/pull/12680
* https://github.com/scylladb/scylladb/pull/12681

Closes #13542

* github.com:scylladb/scylladb:
  uda: change the UDF used in a UDA if it's replaced
  functions: add helper same_signature method
  uda: return aggregate functions as shared pointers
  udf: also check reducefunc to confirm that a UDF is not used in a UDA
  udf: fix dropping UDFs that share names with other UDFs used in UDAs
  pytest: add optional argument for new_function argument types
  udt: disallow dropping a user type used in a user function
2023-04-19 01:38:08 -04:00
Botond Dénes
da9f90362d Merge 'Compaction reevaluation bug fixes' from Raphael "Raph" Carvalho
A problem in compaction reevaluation can cause the SSTable set to be left uncompacted for indefinite amount of time, potentially causing space and read amplification to be suboptimal.

Two revaluation problems are being fixed, one after off-strategy compaction ended, and another in compaction manager which intends to periodically reevaluate a need for compaction.

Fixes https://github.com/scylladb/scylladb/issues/13429.
Fixes https://github.com/scylladb/scylladb/issues/13430.

Closes #13431

* github.com:scylladb/scylladb:
  compaction: Make compaction reevaluation actually periodic
  replica: Reevaluate regular compaction on off-strategy completion

(cherry picked from commit 9a02315c6b)
2023-04-19 01:14:33 -04:00
Botond Dénes
c9a17c80f6 mutation/mutation_compactor: consume_partition_end(): reset _stop
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retreiving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partitions is interrputed
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engagned optional from `detach_state()`, because there is not state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, but overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_partition::no` and `detach_state()` will return a disengaged
optional as it should in this case.

Fixes: #12629

Closes #13365

(cherry picked from commit bae62f899d)
2023-04-18 02:32:24 -04:00
Wojciech Mitros
7242c42089 uda: change the UDF used in a UDA if it's replaced
Currently, if a UDA uses a UDF that's being replaced,
the UDA will still keep using the old UDF until the
node is restarted.
This patch fixes this behavior by checking all UDAs
when replacing a UDF and updating them if necessary.

Fixes #12709

(cherry picked from commit 02bfac0c66)
2023-04-17 13:14:46 +02:00
Wojciech Mitros
70ff69afab functions: add helper same_signature method
When deciding whether two functions have the same
signature, we have to check if they have the same name
and parameter types. Additionally, if they're represented
by pointers, we need to check if any of them is a nullptr.
This logic is used multiple times, so it's extracted to
a separate function.
To use this function, the `used_by_user_aggregate` method
takes now a function instead of name and types list - we
can do it because we always use it with an existing user
function (that we're trying to drop).
The method will also be useful when we'll be not dropping,
but replacing a user function.

(cherry picked from commit 58987215dc)
2023-04-17 13:14:40 +02:00
Wojciech Mitros
5fd4bb853b uda: return aggregate functions as shared pointers
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.

(cherry picked from commit 20069372e7)
2023-04-17 13:14:24 +02:00
Wojciech Mitros
313649e86d udf: also check reducefunc to confirm that a UDF is not used in a UDA
When dropping a UDF we're checking if it's not begin used in any UDAs
and fail otherwise. However, we're only checking its state function
and final function, and it may also be used as its reduce function.
This patch adds the missing checks and a test for them.

(cherry picked from commit ef1dac813b)
2023-04-17 13:14:16 +02:00
Wojciech Mitros
14d8cec130 udf: fix dropping UDFs that share names with other UDFs used in UDAs
Currently, when dropping a function, we only check if there exist
an aggregate that uses a function with the same name as its state
function or final function. This may cause the drop to fail even
when it's just another UDF with the same name that's used in the
aggregate, even when the actual dropped function is not used there.
This patch fixes this by checking whether not only the name of the
UDA's sfunc and finalfunc, but also their argument types.

(cherry picked from commit 49077dd144)
2023-04-17 13:14:09 +02:00
Wojciech Mitros
203cbb79a1 pytest: add optional argument for new_function argument types
When multiple functions with the same name but different argument types
are created, the default drop statement for these functions will fail
because it does not include the argument types.
With this change, this problem can be worked around by specifying
argument types when creating the function, as this will cause the drop
statement to include them.

(cherry picked from commit 8791b0faf5)
2023-04-17 13:13:59 +02:00
Wojciech Mitros
51f19d1b8c udt: disallow dropping a user type used in a user function
Currently, nothing prevents us from dropping a user type
used in a user function, even though doing so may make us
unable to use the function correctly.
This patch prevents this behavior by checking all function
argument and return types when executing a drop type statement
and preventing it from completing if the type is referenced
by any of them.

(cherry picked from commit 86c61828e6)
2023-04-17 13:13:35 +02:00
Anna Stuchlik
83735ae77f doc: update the metrics between 5.2 and 2023.1
Related: https://github.com/scylladb/scylla-enterprise/issues/2794

This commit adds the information about the metric changes
in version 2023.1 compared to version 5.2.

This commit is part of the 5.2-to-2023.1 upgrade guide and
must be backported to branch-5.2.

Closes #13506

(cherry picked from commit 989a75b2f7)
2023-04-17 11:29:43 +02:00
Avi Kivity
9d384e3af2 Merge 'Backport "reader_concurrency_semaphore: don't evict inactive readers needlessly" to branch-5.2' from Botond Dénes
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.

Fixes: https://github.com/scylladb/scylladb/issues/11803

Closes #13515

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: don't evict inactive readers needlessly
  reader_concurrency_semaphore: add stats to record reason for queueing permits
  reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
2023-04-17 12:25:21 +03:00
Nadav Har'El
0da0c94f49 cql: USING TTL 0 means unlimited, not default TTL
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.

By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).

So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass and the second test passes this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".

The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.

Fixes #6447

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13079

(cherry picked from commit a4a318f394)
2023-04-17 10:41:08 +03:00
Nadav Har'El
1a9f51b767 cql: fix empty aggregation, and add more tests
This patch fixes #12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).

The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).

The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.

The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":

Refs #12477: Combining COUNT with GROUP by results with empty results
             in Cassandra, and one result with empty count in Scylla.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12715

(cherry picked from commit 3ba011c2be)
2023-04-17 10:41:08 +03:00
Raphael S. Carvalho
dba0e604a7 table: Fix disk-space related metrics
total disk space used metric is incorrectly telling the amount of
disk space ever used, which is wrong. It should tell the size of
all sstables being used + the ones waiting to be deleted.
live disk space used, by this defition, shouldn't account the
ones waiting to be deleted.
and live sstable count, shouldn't account sstables waiting to
be deleted.

Fix all that.

Fixes #12717.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
2023-04-16 22:14:01 +03:00
Michał Chojnowski
4ea67940cb locator: token_metadata: get rid of a quadratic behaviour in get_address_ranges()
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.

This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.

Refs #10337
Refs #10817
Refs #10836
Refs #10837
Fixes #12724

(cherry picked from commit 9e57b21e0c)
2023-04-16 21:59:14 +03:00
Jan Ciolek
a8c49c44e5 cql/query_options: add a check for missing bind marker name
There was a missing check in validation of named
bind markers.

Let's say that a user prepares a query like:
```cql
INSERT INTO ks.tab (pk, ck, v) VALUES (:pk, :ck, :v)
```
Then they execute the query, but specify only
values for `:pk` and `:ck`.

We should detect that a value for :v is missing
and throw an invalid_request_exception.

Until now there was no such check, in case of a missing variable
invalid `query_options` were created and Scylla crashed.

Sadly it's impossible to create a regression test
using `cql-pytest` or `boost`.

`cql-pytest` uses the python driver, which silently
ignores mising named bind variables, deciding
that the user meant to send an UNSET_VALUE for them.
When given values like `{'pk': 1, 'ck': 2}`, it will automaticaly
extend them to `{'pk': 1, 'ck': 2, 'v': UNSET_VALUE}`.

In `boost` I tried to use `cql_test_env`,
but it only has methods which take valid `query_options`
as a parameter. I could create a separate unit tests
for the creation and validation of `query_options`
but it won't be a true end-to-end test like `cql-pytest`.

The bug was found using the rust driver,
the reproducer is available in the issue description.

Fixes: #12727

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #12730

(cherry picked from commit 2a5ed115ca)
2023-04-16 21:57:28 +03:00
Nadav Har'El
12a29edf90 test/alternator: fix flaky test for partition-tombstone scan
The test test_scan.py::test_scan_long_partition_tombstone_string
checks that a full-table Scan operation ends a page in the middle of
a very long string of partition tombstones, and does NOT scan the
entire table in one page (if we did that, getting a single page could
take an unbounded amount of time).

The test is currently flaky, having failed in CI runs three times in
the past two months.

The reason for the flakiness is that we don't know exactly how long
we need to make the sequence of partition tombstones in the test before
we can be absolutely sure a single page will not read this entire sequence.
For single-partition scans we have the "query_tombstone_page_limit"
configuration parameter, which tells us exactly how long we need to
make the sequence of row tombstones. But for a full-table scan of
partition tombstones, the situation is more complicated - because the
scan is done in parallel on several vnodes in parallel and each of
them needs to read query_tombstone_page_limit before it stops.

In my experiments, using query_tombstone_limit * 4 consecutive tombstones
was always enough - I ran this test hundreds of times and it didn't fail
once. But since it did fail on Jenkins very rarely (3 times in the last
two months), maybe the multiplier 4 isn't enough. So this patch doubles
it to 8. Hopefully this would be enough for anyone (TM).

This makes this test even bigger and slower than it was. To make it
faster, I changed this test's write isolation mode from the default
always_use_lwt to forbid_rmw (not use LWT). This leaves the test's
total run time to be similar to what it was before this patch - around
0.5 seconds in dev build mode on my laptop.

Fixes #12817

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12819

(cherry picked from commit 14cdd034ee)
2023-04-14 11:54:45 +03:00
Botond Dénes
3e10c3fc89 reader_concurrency_semaphore: don't evict inactive readers needlessly
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated end requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-agressive eviction and
checking that it doesn't happen anymore.

Fixes: #11803

Closes #13286

(cherry picked from commit bd57471e54)
2023-04-14 10:37:30 +03:00
Botond Dénes
f11deb5074 reader_concurrency_semaphore: add stats to record reason for queueing permits
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in a new stats, one for each reason a permit
can be queued.

(cherry picked from commit 7b701ac52e)
2023-04-14 10:37:30 +03:00
Botond Dénes
1baf9dddd7 reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
So caller can bump the appropriate counters or log the reason why the
the request cannot be admitted.

(cherry picked from commit bb00405818)
2023-04-14 09:30:02 +03:00
Kamil Braun
9717ff5057 docs: cleaning up after failed membership change
After a failed topology operation, like bootstrap / decommission /
removenode, the cluster might contain a garbage entry in either token
ring or group 0. This entry can be cleaned-up by executing removenode on
any other node, pointing to the node that failed to bootstrap or leave
the cluster.

Document this procedure, including a method of finding the host ID of a
garbage entry.

Add references in other documents.

Fixes: #13122

Closes #13186

(cherry picked from commit c2a2996c2b)
2023-04-13 10:35:02 +02:00
Anna Stuchlik
b293b1446f doc: remove Enterprise upgrade guides from OSS doc
This commit removes the Enterprise upgrade guides from
the Open Source documentation. The Enterprise upgrade guides
should only be available in the Enterprise documentation,
with the source files stored in scylla-enterprise.git.

In addition, this commit:
- adds the links to the Enterprise user guides in the Enterprise
documentation at https://enterprise.docs.scylladb.com/
- adds the redirections for the removed pages to avoid
breaking any links.

This commit must be reverted in scylla-enterprise.git.

(cherry picked from commit 61bc05ae49)

Closes #13473
2023-04-11 14:26:35 +03:00
Yaron Kaikov
e6f7ac17f6 doc: update supported os for 2022.1
ubuntu22.04 is already supported on both `5.0` and `2022.1`

updating the table

Closes #13340

(cherry picked from commit c80ab78741)
2023-04-05 13:56:07 +03:00
Anna Stuchlik
36619fc7d9 doc: add upgrade guide from 5.2 to 2023.1
Related: https://github.com/scylladb/scylla-enterprise/issues/2770

This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to ScyllaDB Enterprise 2023.1.
This commit does not cover metric updates (the metrics file has no
content, which needs to be added in another PR).

As this is an upgrade guide, this commit must be merged to master and
backported to branch-5.2 and branch-2023.1 in scylla-enterprise.git.

Closes #13294

(cherry picked from commit 595325c11b)
2023-04-05 06:43:01 +03:00
Anna Stuchlik
750414c196 doc: update Raft doc for versions 5.2 and 2023.1
Fixes https://github.com/scylladb/scylladb/issues/13345
Fixes https://github.com/scylladb/scylladb/issues/13421

This commit updates the Raft documentation page to be up to date in versions 5.2 and 2023.1.

- Irrelevant information about previous releases is removed.
- Some information is clarified.
- Mentions of version 5.2 are either removed (if possible) or version 2023.1 is added.

Closes #13426

(cherry picked from commit 447ce58da5)
2023-04-05 06:42:28 +03:00
Botond Dénes
128050e984 Merge 'commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off' from Calle Wilund
Fixes #12810

We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.

Closes #12950

* github.com:scylladb/scylladb:
  commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
  commitlog: change type of stored size

(cherry picked from commit e70be47276)
2023-04-03 08:57:43 +03:00
Yaron Kaikov
d70751fee3 release: prepare for 5.2.0-rc4 2023-04-02 16:40:56 +03:00
Tzach Livyatan
1fba43c317 docs: minor improvments to the Raft Handling Failures and recovery procedure sections
Closes #13292

(cherry picked from commit 46e6c639d9)
2023-03-31 11:22:20 +02:00
Botond Dénes
e380c24c69 Merge 'Improve database shutdown verbosity' from Pavel Emelyanov
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.

refs: #13100
refs: #10941

Closes #13141

* github.com:scylladb/scylladb:
  database: Increase verbosity of database::stop() method
  large_data_handler: Increase verbosity on shutdown
  large_data_handler: Coroutinize .stop() method

(cherry picked from commit e22b27a107)
2023-03-30 17:01:24 +03:00
Avi Kivity
76a76a95f4 Update tools/java submodule (hdrhistogram with Java 11)
* tools/java 1c4e1e7a7d...83b2168b19 (1):
  > Fix cassandra-stress -log hdrfile=... with java 11

Fixes #13287
2023-03-29 14:10:27 +03:00
Anna Stuchlik
f6837afec7 doc: update the Ubuntu version used in the image
Starting from 5.2 and 2023.1 our images are based on Ubuntu:22.04.
See https://github.com/scylladb/scylladb/issues/13138#issuecomment-1467737084

This commit adds that information to the docs.
It should be merged and backported to branch-5.2.

Closes #13301

(cherry picked from commit 9e27f6b4b7)
2023-03-27 14:08:57 +03:00
Botond Dénes
6350c8836d Revert "repair: Reduce repair reader eviction with diff shard count"
This reverts commit c6087cf3a0.

Said commit can cause a deadlock when 2 or more repairs compete for
locks on 2 or more nodes. Consider the following scenario:

Node n1 and n2 in the cluster, 1 shard per node, rf = 2, each shard has
1 available unit for the reader lock

    n1 starts repair r1
    r1-n1 (instance of r1 on node1) takes the reader lock on node1
    n2 starts repair r2
    r2-n2 (instance of r2 on node2) takes the reader lock on node2
    r1-n2 will fail to take the reader lock on node2
    r2-n1 will fail to take the reader lock on node1

As a result, r1 and r2 could not make progress and deadlock happens.

The complexity comes from the fact that a repair job needs lock on more
than one node. It is not guaranteed that all the participant nodes could
take the lock in one short.

There is no simple solution to this so we have to revert this locking
mechanism and look for another way to prevent reader trashing when
repairing nodes with mismatching shard count.

Fixes: #12693

Closes #13266

(cherry picked from commit 7699904c54)
2023-03-24 09:44:16 +02:00
Avi Kivity
5457948437 Update seastar submodule (rpc cancellation during negotiation)
* seastar 8889cbc198...1488aaf842 (1):
  > Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov

Fixes #11507.
2023-03-23 17:15:00 +02:00
Avi Kivity
da41001b5c .gitmodules: point seastar submodule at scylla-seastar.git
This allows is to backport seastar commits.

Ref #11507.
2023-03-23 17:11:43 +02:00
Anna Stuchlik
dd61e8634c doc: related https://github.com/scylladb/scylladb/issues/12754; add the missing information about reporting latencies to the upgrade guide 5.1 to 5.2
Closes #12935

(cherry picked from commit 26bb36cdf5)
2023-03-22 10:38:28 +02:00
Anna Stuchlik
b642b4c30e doc: fix the service name in upgrade guides
Fixes https://github.com/scylladb/scylladb/issues/13207

This commit fixes the service and package names in
the upgrade guides 5.0-to-2022.1 and 5.1-to-2022.2.
Service name: scylla-server
Package name: scylla-enterprise

Previous PRs to fix the same issue in other
upgrade guides:
https://github.com/scylladb/scylladb/pull/12679
https://github.com/scylladb/scylladb/pull/12698

This commit must be backported to branch-5.1 and branch 5.2.

Closes #13225

(cherry picked from commit 922f6ba3dd)
2023-03-22 10:37:12 +02:00
Botond Dénes
c013336121 db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, it even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.

Fixes: #11905
Refs: #11836

Closes #11860

(cherry picked from commit 84a69b6adb)
2023-03-22 09:03:50 +02:00
Kamil Braun
b6b35ce061 service: storage_proxy: sequence CDC preimage select with Paxos learn
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.

Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.

`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.

Fixes #12098

(cherry picked from commit 1ef113691a)
2023-03-21 20:23:19 +02:00
Petr Gusev
069e38f02d transport server: fix unexpected server errors handling
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was timeout error.

A new test_batch_with_error is added. It is quite
difficult to reproduce error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.

Closes: scylladb#12104
(cherry picked from commit a4cf509c3d)
2023-03-21 20:23:09 +02:00
Anna Mikhlin
61a8003ad1 release: prepare for 5.2.0-rc3 2023-03-20 10:10:27 +02:00
Botond Dénes
8a17066961 Merge 'doc: Updates the recommended OS to be Ubuntu 22.04' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/13138
Fixes https://github.com/scylladb/scylladb/issues/13153

This PR:

- Fixes outdated information about the recommended OS. Since version 5.2, the recommended OS should be Ubuntu 22.04 because that OS is used for building the ScyllaDB image.
- Adds the OS support information for version 5.2.

This PR (both commits) needs to be backported to branch-5.2.

Closes #13188

* github.com:scylladb/scylladb:
  doc: Add OS support for version 5.2
  doc: Updates the recommended OS to be Ubuntu 22.04

(cherry picked from commit f4b5679804)
2023-03-17 10:30:06 +02:00
Pavel Emelyanov
487ba9f3e1 Merge '[backport] reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()' from Botond Dénes
This PR backports 2f4a793457 to branch-5.2. Said patch depends on some other patches that are not part of any release yet.
This PR should apply to 5.1 and 5.0 too.

Closes #13162

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
  reader_permit: expose operator<<(reader_permit::state)
  reader_permit: add get_state() accessor
2023-03-16 18:41:08 +03:00
Botond Dénes
bd4f9e3615 Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`

Fixes: #12249

Closes #12978

* github.com:scylladb/scylladb:
  flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
  make_nonforwardable: test through run_mutation_source_tests
  make_nonforwardable: next_partition and fast_forward_to when single_partition is true
  make_forwardable: fix next_partition
  flat_mutation_reader_v2: drop forward_buffer_to
  nonforwardable reader: fix indentation
  nonforwardable reader: refactor, extract reset_partition
  nonforwardable reader: add more tests
  nonforwardable reader: no partition_end after fast_forward_to()
  nonforwardable reader: no partition_end after next_partition()
  nonforwardable reader: no partition_end for empty reader
  row_cache: pass partition_start though nonforwardable reader

(cherry picked from commit 46efdfa1a1)
2023-03-16 10:42:03 +02:00
Botond Dénes
c68deb2461 reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in severeal ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)

The list goes on. We already have an evict() method that all this
correctly, use that instead of the current badly open-coded alternative.

This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.

Fixes: #13048

Closes #13049

(cherry picked from commit 2f4a793457)
2023-03-14 09:50:16 +02:00
Botond Dénes
dd96d3017a reader_permit: expose operator<<(reader_permit::state)
(cherry picked from commit ec1c615029)
2023-03-14 09:50:16 +02:00
Botond Dénes
6ca80ee118 reader_permit: add get_state() accessor
(cherry picked from commit 397266f420)
2023-03-14 09:40:11 +02:00
Jan Ciolek
eee8f750cc cql3: preserve binary_operator.order in search_and_replace
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.

`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.

Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.

Fixes: #13055

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13056

(cherry picked from commit aa604bd935)
2023-03-09 12:52:39 +02:00
Botond Dénes
8d5206e6c6 sstables/sstable: validate_checksums(): force-check EOF
EOF is only guarateed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.

Fixes: #11190

Closes #12696

(cherry picked from commit 693c22595a)
2023-03-09 12:30:44 +02:00
Anna Stuchlik
cfa40402f4 doc: Update the documentation landing page
This commit makes the following changes to the docs landing page:

- Adds the ScyllaDB enterprise docs as one of three tiles.

- Modifies the three tiles to reflect the three flavors of ScyllaDB.

- Moves the "New to ScyllaDB? Start here!" under the page title.

- Renames "Our Products" to "Other Products" to list the products other
  than ScyllaDB itself. In addtition, the boxes are enlarged from to
  large-4 to look better.

The major purpose of this commit is to expose the ScyllaDB
documentation.

docs: fix the link
(cherry picked from commit 27bb8c2302)

Closes #13086
2023-03-06 14:18:15 +02:00
Botond Dénes
2d170e51cf Merge 'doc: specify the versions where Alternator TTL is no longer experimental' from Anna Stuchlik
This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental.

The challenge here is that OSS and Enterprise are (still) **documented together**, but they're **not in sync** in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon).

We can take one of the following approaches:
a) Merge this PR with master and ask the 2022.2 users to refer to master.
b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts.

I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting.

Closes #12295

* github.com:scylladb/scylladb:
  doc: fix the version in the comment on removing the note
  doc: specify the versions where Alternator TTL is no longer experimental

(cherry picked from commit d5dee43be7)
2023-03-02 12:09:16 +02:00
Anna Stuchlik
860e79e4b1 doc: fixes https://github.com/scylladb/scylladb/issues/12954, adds the minimal version from which the 2021.1-to-2022.1 upgrade is supported for Ubuntu, Debian, and image
Closes #12974

(cherry picked from commit 91b611209f)
2023-02-28 13:02:05 +02:00
Anna Mikhlin
908a82bea0 release: prepare for 5.2.0-rc2 2023-02-28 10:13:06 +02:00
Gleb Natapov
39158f55d0 lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more then once
If on the first call the capture is destroyed the second call may crash.

Fixes: #12958

Message-Id: <Y/sks73Sb35F+PsC@scylladb.com>
(cherry picked from commit 1ce7ad1ee6)
2023-02-27 14:19:37 +02:00
Raphael S. Carvalho
22c1685b3d sstables: Temporarily disable loading of first and last position metadata
It's known that reading large cells in reverse cause large allocations.
Source: https://github.com/scylladb/scylladb/issues/11642

The loading is preliminary work for splitting large partitions into
fragments composing a run and then be able to later read such a run
in an efficiency way using the position metadata.

The splitting is not turned on yet, anywhere. Therefore, we can
temporarily disable the loading, as a way to avoid regressions in
stable versions. Large allocations can cause stalls due to foreground
memory eviction kicking in.
The default values for position metadata say that first and last
position include all clustering rows, but they aren't used anywhere
other than by sstable_run to determine if a run is disjoint at
clustering level, but given that no splitting is done yet, it
does not really matter.

Unit tests relying on position metadata were adjusted to enable
the loading, such that they can still pass.

Fixes #11642.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12979

(cherry picked from commit d73ffe7220)
2023-02-27 08:58:34 +02:00
Botond Dénes
9ba6fc73f1 mutation_compactor: only pass consumed range-tombstone-change to validator
Currently all consumed range tombstone changes are unconditionally
forwarded to the validator. Even if they are shadowed by a higher level
tombstone and/or purgable. This can result in a situation where a range
tombstone change was seen by the validator but not passed to the
consumer. The validator expects the range tombstone change to be closed
by end-of-partition but the end fragment won't come as the tombstone was
dropped, resulting in a false-positive validation failure.
Fix by only passing tombstones to the validator, that are actually
passed to the consumer too.

Fixes: #12575

Closes #12578

(cherry picked from commit e2c9cdb576)
2023-02-23 22:52:47 +02:00
Botond Dénes
f2e2c0127a types: unserialize_value for multiprecision_int,bool: don't read uninitialized memory
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.

Fixes: #12821
Fixes: #12823
Fixes: #12708

Closes #12824

(cherry picked from commit ef548e654d)
2023-02-23 22:38:03 +02:00
Gleb Natapov
363ea87f51 raft: abort applier fiber when a state machine aborts
After 5badf20c7a applier fiber does not
stop after it gets abort error from a state machine which may trigger an
assertion because previous batch is not applied. Fix it.

Fixes #12863

(cherry picked from commit 9bdef9158e)
2023-02-23 14:12:12 +02:00
Kefu Chai
c49fd6f176 tools/schema_loader: do not return ref to a local variable
we should never return a reference to local variable.
so in this change, a reference to a static variable is returned
instead. this should address following warning from Clang 17:

```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
        return {};
               ^~
```

Fixes #12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12876

(cherry picked from commit 6eab8720c4)
2023-02-22 22:02:43 +02:00
Takuya ASADA
3114589a30 scylla_coredump_setup: fix coredump timeout settings
We currently configure only TimeoutStartSec, but probably it's not
enough to prevent coredump timeout, since TimeoutStartSec is maximum
waiting time for service startup, and there is another directive to
specify maximum service running time (RuntimeMaxSec).

To fix the problem, we should specify RunTimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).

Fixes #5430

Closes #12757

(cherry picked from commit bf27fdeaa2)
2023-02-19 21:13:36 +02:00
Anna Stuchlik
34f68a4c0f doc: related https://github.com/scylladb/scylladb/issues/12658, fix the service name in the upgrade guide from 2022.1 to 2022.2
Closes #12698

(cherry picked from commit 826f67a298)
2023-02-17 12:17:48 +02:00
Botond Dénes
b336e11f59 Merge 'doc: fix the service name from "scylla-enterprise-server" "to "scylla-server"' from Anna Stuchlik
Related https://github.com/scylladb/scylladb/issues/12658.

This issue fixes the bug in the upgrade guides for the released versions.

Closes #12679

* github.com:scylladb/scylladb:
  doc: fix the service name in the upgrade guide for patch releases versions 2022
  doc: fix the service name in the upgrade guide from 2021.1 to 2022.1

(cherry picked from commit 325246ab2a)
2023-02-17 12:16:52 +02:00
Anna Stuchlik
9ef73d7e36 doc: fixes https://github.com/scylladb/scylladb/issues/12754, document the metric update in 5.2
Closes #12891

(cherry picked from commit bcca706ff5)
2023-02-17 12:16:13 +02:00
Botond Dénes
8700a72b4c Merge 'Backport compaction-backlog-tracker fixes to branch-5.2' from Raphael "Raph" Carvalho
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change.

No conflicts when backporting.

Regression since 1d9f53c881, which is present in branch 5.1 onwards.

Closes #12851

* github.com:scylladb/scylladb:
  compaction: Fix inefficiency when updating LCS backlog tracker
  table: Fix quadratic behavior when inserting sstables into tracker on schema change
2023-02-15 07:22:25 +02:00
Raphael S. Carvalho
886dd3e1d2 compaction: Fix inefficiency when updating LCS backlog tracker
LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker
is calling STCS tracker's replace_sstables() with empty arguments
even when higher levels (> 0) *only* had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.

As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.

Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.

This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantial higher number of sstables,
therefore updating STCS tracker only when level 0 changes, reduces
significantly the number of times L0 backlog is recomputed.

Refs #12499.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12676

(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-14 12:14:27 -03:00
Raphael S. Carvalho
f565f3de06 table: Fix quadratic behavior when inserting sstables into tracker on schema change
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of backlog which complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.

Fixes #12499.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12593

(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-14 12:14:21 -03:00
Anna Stuchlik
76ff6d981c doc: related https://github.com/scylladb/scylladb/issues/12754, add the requirement to upgrade Monitoring to version 4.3
Closes #12784

(cherry picked from commit c7778dd30b)
2023-02-10 10:28:35 +02:00
Botond Dénes
f924f59055 Merge 'Backport test.py improvements to 5.2' from Kamil Braun
Backport the following improvements for test.py efficiency and user experience:
- https://github.com/scylladb/scylladb/pull/12542
- https://github.com/scylladb/scylladb/pull/12560
- https://github.com/scylladb/scylladb/pull/12564
- https://github.com/scylladb/scylladb/pull/12563
- https://github.com/scylladb/scylladb/pull/12588
- https://github.com/scylladb/scylladb/pull/12613
- https://github.com/scylladb/scylladb/pull/12569
- https://github.com/scylladb/scylladb/pull/12612
- https://github.com/scylladb/scylladb/pull/12549
- https://github.com/scylladb/scylladb/pull/12678

Fixes #12617

Closes #12770

* github.com:scylladb/scylladb:
  test/pylib: put UNIX-domain socket in /tmp
  Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun
  Merge 'test.py: manual cluster pool handling for Python suite' from Alecco
  Merge 'test.py: handle broken clusters for Python suite' from Alecco
  test/pylib: scylla_cluster: don't leak server if stopping it fails
  Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun
  test/pylib: scylla_cluster: return error details from test framework endpoints
  test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
  test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
  test: disable commitlog O_DSYNC, preallocation
2023-02-08 15:09:09 +02:00
Nadav Har'El
d5cef05810 test/pylib: put UNIX-domain socket in /tmp
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: In Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins run, the
test directory names are very long, and we sometimes go over this length
limit and the result is that test.py fails creating this socket.

In this patch we simply put the socket in /tmp instead of the test
directory. We only need to do this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).

Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.

Fixes #12622

Closes #12678

(cherry picked from commit 681a066923)
2023-02-07 17:12:14 +01:00
Nadav Har'El
e0f4e99e9b Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.

Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager` would start, because there was free
space in the pool.

This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.

Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.

The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).

Fixes #11757

Closes #12549

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
  test/pylib: pool: introduce `replace_dirty`
  test/pylib: pool: replace `steal` with `put(is_dirty=True)`

(cherry picked from commit 132af20057)
2023-02-07 17:08:17 +01:00
Kamil Braun
6795715011 Merge 'test.py: manual cluster pool handling for Python suite' from Alecco
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.

Closes #12612

* github.com:scylladb/scylladb:
  test.py: manual cluster handling for PythonSuite
  test.py: stop cluster if PythonSuite fails to start
  test.py: minor fix for failed PythonSuite test

(cherry picked from commit 5bc7f0732e)
2023-02-07 17:07:43 +01:00
Nadav Har'El
aa9e91c376 Merge 'test.py: handle broken clusters for Python suite' from Alecco
If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it.

Ignore exception re-raised by the context manager.

Fixes #12360

Closes #12569

* github.com:scylladb/scylladb:
  test.py: handle broken clusters for Python suite
  test.py: Pool discard method

(cherry picked from commit 54f174a1f4)
2023-02-07 17:07:36 +01:00
Kamil Braun
ddfb9ebab2 test/pylib: scylla_cluster: don't leak server if stopping it fails
`ScyllaCluster.server_stop` had this piece of code:
```
        server = self.running.pop(server_id)
        if gracefully:
            await server.stop_gracefully()
        else:
            await server.stop()
        self.stopped[server_id] = server
```

We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).

Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.

Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.

Closes #12613

(cherry picked from commit a0ff33e777)
2023-02-07 17:05:20 +01:00
Nadav Har'El
d58a3e4d16 Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.

Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.

Closes #12588

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: better logging for timeout on server startup
  test/pylib: scylla_cluster: use less expensive query to check for CQL availability

(cherry picked from commit ccc2c6b5dd)
2023-02-07 17:05:02 +01:00
Kamil Braun
2ebac52d2d test/pylib: scylla_cluster: return error details from test framework endpoints
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.

Do it by wrapping every handler into a catch clause that returns
the exception message.

Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)

Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```

After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```

Closes #12563

(cherry picked from commit 2f84e820fd)
2023-02-07 17:04:37 +01:00
Kamil Braun
b536614913 test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.

Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.

Closes #12564

(cherry picked from commit 3ed3966f13)
2023-02-07 17:04:19 +01:00
Kamil Braun
85df0fd2b1 test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
    def before_test(self, name) -> None:
        """Check that  the cluster is ready for a test. If
        there was a start error, throw it here - the server is
        running when it's added to the pool, which can't be attributed
        to any specific test, throwing it here would stop a specific
        test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.

Prevent this by marking the cluster as dirty the first time we rethrow
the exception.

Closes #12560

(cherry picked from commit 147dd73996)
2023-02-07 17:03:56 +01:00
Avi Kivity
cdf9fe7023 test: disable commitlog O_DSYNC, preallocation
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).

However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.

Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.

Closes #12542

(cherry picked from commit 9029b8dead)
2023-02-07 17:02:59 +01:00
Beni Peled
8ff4717fd0 release: prepare for 5.2.0-rc1 2023-02-06 22:13:53 +02:00
Kamil Braun
291b1f6e7f service/raft: raft_group0: prevent double abort
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.

Fixes: #12512

Closes #12514

(cherry picked from commit 54170749b8)
2023-02-05 18:31:50 +02:00
Kefu Chai
b2699743cc db: system_keyspace: take the reserved_memory into account
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.

after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.

Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12573

(cherry picked from commit 4a0134a097)
2023-02-05 18:30:05 +02:00
Botond Dénes
50ae73a4bd types: is_tuple(): handle reverse types
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will loose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail as the only frozen UDTs are
allowed in primary key components.

Fixes: #12576

Closes #12579

(cherry picked from commit ebc100f74f)
2023-02-05 18:20:21 +02:00
Calle Wilund
c3dd4a2b87 alterator::streams: Sort tables in list_streams to ensure no duplicates
Fixes #12601 (maybe?)

Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.

(cherry picked from commit da8adb4d26)
2023-02-05 17:44:00 +02:00
Benny Halevy
0f9fe61d91 view: row_lock: lock_ck: find or construct row_lock under partition lock
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.

This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.

This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only us slows down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.

This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.

Fixes #12632

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
2023-02-05 17:22:31 +02:00
Anna Stuchlik
59d30ff241 docs: fixes https://github.com/scylladb/scylladb/issues/12654, update the links to the Download Center
Closes #12655

(cherry picked from commit 64cc4c8515)
2023-02-05 17:19:56 +02:00
Anna Stuchlik
fb82dff89e doc: fixes https://github.com/scylladb/scylladb/issues/12672, fix the redirects to the Cloud docs
Closes #12673

(cherry picked from commit 2be131da83)
2023-02-05 17:17:35 +02:00
Kefu Chai
b588b19620 cql3/selection: construct string_view using char* not size
before this change, we construct a sstring from a comma statement,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.

in this change

* instead of passing the size of the string_view,
  both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
  performance, as we don't need to perform a deep copy

the issue is reported by GCC-13:

```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
        auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
                                                           ^~~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12666

(cherry picked from commit 186ceea009)

Fixes #12739.
2023-02-05 13:50:48 +02:00
Michał Chojnowski
608ef92a71 commitlog: fix total_size_on_disk accounting after segment file removal
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.

`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.

The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.

The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.

Fixes #12645

Closes #12646

(cherry picked from commit fa7e904cd6)
2023-02-01 21:54:37 +02:00
Kamil Braun
d2732b2663 Merge 'Enable Raft by default in new clusters' from Kamil Braun
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.

People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.

Also update the docs: cluster creation and node addition procedures.

Fixes #12572.

Closes #12585

* github.com:scylladb/scylladb:
  docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
  conf: enable `consistent_cluster_management` by default

(cherry picked from commit 5c886e59de)
2023-01-26 12:21:55 +01:00
Anna Mikhlin
34ab98e1be release: prepare for 5.2.0-rc0 2023-01-18 14:54:36 +02:00
486 changed files with 16565 additions and 8368 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -72,7 +72,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.2.0-dev
VERSION=5.2.17
if test -f version
then

View File

@@ -88,17 +88,20 @@ json::json_return_type make_streamed(rjson::value&& value) {
// move objects to coroutine frame.
auto los = std::move(os);
auto lrs = std::move(rs);
std::exception_ptr ex;
try {
co_await rjson::print(*lrs, los);
co_await los.flush();
co_await los.close();
} catch (...) {
// at this point, we cannot really do anything. HTTP headers and return code are
// already written, and quite potentially a portion of the content data.
// just log + rethrow. It is probably better the HTTP server closes connection
// abruptly or something...
elogger.error("Unhandled exception in data streaming: {}", std::current_exception());
throw;
ex = std::current_exception();
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return;
};
@@ -761,7 +764,6 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
co_return api_error::access_denied("Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
const rjson::value* tags = rjson::find(request, "Tags");
if (!tags || !tags->IsArray()) {
co_return api_error::validation("Cannot parse tags");
@@ -769,8 +771,9 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
if (tags->Size() < 1) {
co_return api_error::validation("The number of tags must be at least 1") ;
}
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
co_await db::update_tags(_mm, schema, std::move(tags_map));
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
co_return json_string("");
}
@@ -788,9 +791,9 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
co_await db::update_tags(_mm, schema, std::move(tags_map));
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
co_return json_string("");
}
@@ -2358,21 +2361,22 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
return item;
}
std::vector<rjson::value> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get) {
cql3::selection::result_set_builder builder(selection, gc_clock::now());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
describe_single_item(selection, result_row, attrs_to_get, item);
describe_single_item(*selection, result_row, *attrs_to_get, item);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
return ret;
co_return ret;
}
static bool check_needs_read_before_write(const parsed::value& v) {
@@ -3254,8 +3258,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
std::vector<rjson::value> jsons = describe_multi_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
return make_ready_future<std::vector<rjson::value>>(std::move(jsons));
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get));
});
response_futures.push_back(std::move(f));
}

View File

@@ -222,11 +222,11 @@ public:
const query::result&,
const std::optional<attrs_to_get>&);
static std::vector<rjson::value> describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get);
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,

View File

@@ -145,19 +145,24 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
auto cfs = db.get_tables();
auto i = cfs.begin();
auto e = cfs.end();
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
// TODO: the unordered_map here is not really well suited for partial
// querying - we're sorting on local hash order, and creating a table
// between queries may or may not miss info. But that should be rare,
// and we can probably expect this to be a single call.
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id().uuid() < t2.schema()->id().uuid();
});
auto i = cfs.begin();
auto e = cfs.end();
if (streams_start) {
i = std::find_if(i, e, [&](data_dictionary::table t) {
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id().uuid() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())

View File

@@ -94,24 +94,25 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
co_return api_error::validation("TTL is already enabled");
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
throw api_error::validation("TTL is already enabled");
}
tags_map[TTL_TAG_KEY] = attribute_name;
} else {
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
throw api_error::validation("TTL is already disabled");
} else if (i->second != attribute_name) {
throw api_error::validation(format(
"Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",
attribute_name, i->second));
}
tags_map.erase(TTL_TAG_KEY);
}
tags_map[TTL_TAG_KEY] = attribute_name;
} else {
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
co_return api_error::validation("TTL is already disabled");
} else if (i->second != attribute_name) {
co_return api_error::validation(format(
"Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",
attribute_name, i->second));
}
tags_map.erase(TTL_TAG_KEY);
}
co_await db::update_tags(_mm, schema, std::move(tags_map));
});
// Prepare the response, which contains a TimeToLiveSpecification
// basically identical to the request's
rjson::value response = rjson::empty_object();

43
api/api-doc/raft.json Normal file
View File

@@ -0,0 +1,43 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/raft",
"produces":[
"application/json"
],
"apis":[
{
"path":"/raft/trigger_snapshot/{group_id}",
"operations":[
{
"method":"POST",
"summary":"Triggers snapshot creation and log truncation for the given Raft group",
"type":"string",
"nickname":"trigger_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"group_id",
"description":"The ID of the group which should get snapshotted",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -1946,7 +1946,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",
"type":"void",
"nickname":"reset_local_schema",
"produces":[

View File

@@ -31,6 +31,7 @@
#include "api/config.hh"
#include "task_manager.hh"
#include "task_manager_test.hh"
#include "raft.hh"
logging::logger apilog("api");
@@ -277,6 +278,18 @@ future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::confi
#endif
future<> set_server_raft(http_context& ctx, sharded<service::raft_group_registry>& raft_gr) {
auto rb = std::make_shared<api_registry_builder>(ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &raft_gr] (routes& r) {
rb->register_function(r, "raft", "The Raft API");
set_raft(ctx, r, raft_gr);
});
}
future<> unset_server_raft(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });
}
void req_params::process(const request& req) {
// Process mandatory parameters
for (auto& [name, ent] : params) {

View File

@@ -22,6 +22,7 @@ namespace service {
class load_meter;
class storage_proxy;
class storage_service;
class raft_group_registry;
} // namespace service
@@ -116,5 +117,7 @@ future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx);
future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg);
future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);
future<> unset_server_raft(http_context&);
}

View File

@@ -17,36 +17,42 @@ namespace fd = httpd::failure_detector_json;
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {
std::vector<fd::endpoint_state> res;
for (auto i : g.get_endpoint_states()) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
val.generation = i.second.get_heart_beat_state().get_generation();
val.version = i.second.get_heart_beat_state().get_heart_beat_version();
val.update_time = i.second.get_update_timestamp().time_since_epoch().count();
for (auto a : i.second.get_application_state_map()) {
fd::version_value version_val;
// We return the enum index and not it's name to stay compatible to origin
// method that the state index are static but the name can be changed.
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);
version_val.value = a.second.value;
version_val.version = a.second.version;
val.application_state.push(version_val);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::vector<fd::endpoint_state> res;
for (auto i : g.get_endpoint_states()) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
val.generation = i.second.get_heart_beat_state().get_generation();
val.version = i.second.get_heart_beat_state().get_heart_beat_version();
val.update_time = i.second.get_update_timestamp().time_since_epoch().count();
for (auto a : i.second.get_application_state_map()) {
fd::version_value version_val;
// We return the enum index and not it's name to stay compatible to origin
// method that the state index are static but the name can be changed.
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);
version_val.value = a.second.value;
version_val.version = a.second.version;
val.application_state.push(version_val);
}
res.push_back(val);
}
res.push_back(val);
}
return make_ready_future<json::json_return_type>(res);
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
int res = g.get_up_endpoint_count();
return make_ready_future<json::json_return_type>(res);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
int res = g.get_up_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
int res = g.get_down_endpoint_count();
return make_ready_future<json::json_return_type>(res);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
int res = g.get_down_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_phi_convict_threshold.set(r, [] (std::unique_ptr<request> req) {
@@ -54,11 +60,13 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
});
fd::set_phi_convict_threshold.set(r, [](std::unique_ptr<request> req) {
@@ -67,13 +75,15 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);
return make_ready_future<json::json_return_type>(sstring(ss.str()));
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);
return make_ready_future<json::json_return_type>(sstring(ss.str()));
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {

View File

@@ -6,6 +6,8 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "gossiper.hh"
#include "api/api-doc/gossiper.json.hh"
#include "gms/gossiper.hh"
@@ -14,19 +16,23 @@ namespace api {
using namespace json;
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_unreachable_members();
return container_to_vec(res);
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto res = co_await g.get_unreachable_members_synchronized();
co_return json::json_return_type(container_to_vec(res));
});
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_live_members();
return container_to_vec(res);
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {
return g.get_live_members_synchronized().then([] (auto res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {
gms::inet_address ep(req.param["addr"]);
return g.get_endpoint_downtime(ep);
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->param["addr"]);
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<request> req) {

70
api/raft.cc Normal file
View File

@@ -0,0 +1,70 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "api/api.hh"
#include "api/api-doc/raft.json.hh"
#include "service/raft/raft_group_registry.hh"
using namespace seastar::httpd;
extern logging::logger apilog;
namespace api {
namespace r = httpd::raft_json;
using namespace json;
void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr) {
r::trigger_snapshot.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
raft::group_id gid{utils::UUID{req->param["group_id"]}};
auto timeout_dur = std::invoke([timeout_str = req->get_query_param("timeout")] {
if (timeout_str.empty()) {
return std::chrono::seconds{60};
}
auto dur = std::stoll(timeout_str);
if (dur <= 0) {
throw std::runtime_error{"Timeout must be a positive number."};
}
return std::chrono::seconds{dur};
});
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout_dur, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
auto* srv = raft_gr.find_server(gid);
if (!srv) {
co_return;
}
found_srv = true;
abort_on_expiry aoe(lowres_clock::now() + timeout_dur);
apilog.info("Triggering Raft group {} snapshot", gid);
auto result = co_await srv->trigger_snapshot(&aoe.abort_source());
if (result) {
apilog.info("New snapshot for Raft group {} created", gid);
} else {
apilog.info("Could not create new snapshot for Raft group {}, no new entries applied", gid);
}
});
if (!found_srv) {
throw std::runtime_error{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_void{};
});
}
void unset_raft(http_context&, httpd::routes& r) {
r::trigger_snapshot.unset(r);
}
}

18
api/raft.hh Normal file
View File

@@ -0,0 +1,18 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api_init.hh"
namespace api {
void set_raft(http_context& ctx, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr);
void unset_raft(http_context& ctx, httpd::routes& r);
}

View File

@@ -243,17 +243,21 @@ future<json::json_return_type> set_tables_autocompaction(http_context& ctx, cons
}
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl) {
ss::start_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
ss::start_native_transport.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.start_server();
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] {
return ctl.start_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
ss::stop_native_transport.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.request_stop_server();
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] {
return ctl.request_stop_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -275,17 +279,21 @@ void unset_transport_controller(http_context& ctx, routes& r) {
}
void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl) {
ss::stop_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.request_stop_server();
ss::stop_rpc_server.set(r, [&ctx, &ctl] (std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx, &ctl] {
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] () mutable {
return ctl.request_stop_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.start_server();
ss::start_rpc_server.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx, &ctl] {
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] () mutable {
return ctl.start_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1041,14 +1049,11 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(res);
});
ss::reset_local_schema.set(r, [&sys_ks](std::unique_ptr<request> req) {
ss::reset_local_schema.set(r, [&ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& sp = service::get_storage_proxy();
auto& fs = sp.local().features();
apilog.info("reset_local_schema");
return db::schema_tables::recalculate_schema_version(sys_ks, sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
co_await ss.local().reload_schema();
co_return json_void();
});
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {

View File

@@ -176,7 +176,9 @@ void set_task_manager(http_context& ctx, routes& r) {
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
f.get();
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
return make_foreign(task);
});
}));
@@ -204,8 +206,8 @@ void set_task_manager(http_context& ctx, routes& r) {
while (!q.empty()) {
auto& current = q.front();
res.push_back(co_await retrieve_status(current));
for (auto i = 0; i < current->get_children().size(); ++i) {
q.push(co_await current->get_children()[i].copy());
for (auto& child: current->get_children()) {
q.push(co_await child.copy());
}
q.pop();
}

View File

@@ -66,36 +66,48 @@ atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
// Based on Cassandra's resolveRegular function:
// - https://github.com/apache/cassandra/blob/e4f31b73c21b04966269c5ac2d3bd2562e5f6c63/src/java/org/apache/cassandra/db/rows/Cells.java#L79-L119
//
// Note: the ordering algorithm for cell is the same as for rows,
// except that the cell value is used to break a tie in case all other attributes are equal.
// See compare_row_marker_for_merge.
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
// Largest write timestamp wins.
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
// Tombstones always win reconciliation with live cells of the same timestamp
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
// Prefer expiring cells (which will become tombstones at some future date) over live cells.
// See https://issues.apache.org/jira/browse/CASSANDRA-14592
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
// If both are expiring, choose the cell with the latest expiry or derived write time.
if (left.is_live_and_has_ttl()) {
// Prefer cell with latest expiry
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
} else if (right.ttl() != left.ttl()) {
// The cell write time is derived by (expiry - ttl).
// Prefer the cell that was written later,
// so it survives longer after it expires, until purged,
// as it become purgeable gc_grace_seconds after it was written.
//
// Note that this is an extension to Cassandra's algorithm
// which stops at the expiration time, and if equal,
// move forward to compare the cell values.
return right.ttl() <=> left.ttl();
}
}
// The cell with the largest value wins, if all other attributes of the cells are identical.
// This is quite arbitrary, but still required to break the tie in a deterministic way.
return compare_unsigned(left.value(), right.value());
} else {
// Both are deleted

View File

@@ -229,6 +229,8 @@ future<authenticated_user> password_authenticator::authenticate(
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (exceptions::authentication_exception& e) {
std::throw_with_nested(e);
} catch (exceptions::unavailable_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.get_message()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));
}

View File

@@ -55,6 +55,7 @@ future<bool> default_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {

View File

@@ -25,6 +25,7 @@
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/UUID_gen.hh"
#include "utils/error_injection.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
@@ -44,8 +45,16 @@ static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
api::timestamp_clock::duration get_generation_leeway() {
static thread_local auto generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
utils::get_local_injector().inject("increase_cdc_generation_leeway", [&] {
generation_leeway = std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::minutes(5));
});
return generation_leeway;
}
static void copy_int_to_bytes(int64_t i, size_t offset, bytes& b) {
i = net::hton(i);
@@ -160,18 +169,18 @@ bool token_range_description::operator==(const token_range_description& o) const
&& sharding_ignore_msb == o.sharding_ignore_msb;
}
topology_description::topology_description(std::vector<token_range_description> entries)
topology_description::topology_description(utils::chunked_vector<token_range_description> entries)
: _entries(std::move(entries)) {}
bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const& {
const utils::chunked_vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
utils::chunked_vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
@@ -263,7 +272,7 @@ public:
topology_description generate() const {
const auto tokens = get_tokens();
std::vector<token_range_description> vnode_descriptions;
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
vnode_descriptions.push_back(
@@ -331,7 +340,7 @@ future<cdc::generation_id> generation_service::make_new_generation(const std::un
auto new_generation_timestamp = [add_delay, ring_delay = _cfg.ring_delay] {
auto ts = db_clock::now();
if (add_delay && ring_delay != 0ms) {
ts += 2 * ring_delay + duration_cast<milliseconds>(generation_leeway);
ts += 2 * ring_delay + duration_cast<milliseconds>(get_generation_leeway());
}
return ts;
};

View File

@@ -46,6 +46,8 @@ namespace gms {
namespace cdc {
api::timestamp_clock::duration get_generation_leeway();
class stream_id final {
bytes _value;
public:
@@ -94,13 +96,13 @@ struct token_range_description {
* in the `_entries` vector. See the comment above `token_range_description` for explanation.
*/
class topology_description {
std::vector<token_range_description> _entries;
utils::chunked_vector<token_range_description> _entries;
public:
topology_description(std::vector<token_range_description> entries);
topology_description(utils::chunked_vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
const utils::chunked_vector<token_range_description>& entries() const&;
utils::chunked_vector<token_range_description>&& entries() &&;
};
/**

View File

@@ -15,10 +15,6 @@
extern logging::logger cdc_log;
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway;
} // namespace cdc
static api::timestamp_type to_ts(db_clock::time_point tp) {
// This assumes that timestamp_clock and db_clock have the same epochs.
return std::chrono::duration_cast<api::timestamp_clock::duration>(tp.time_since_epoch()).count();
@@ -40,7 +36,7 @@ static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
const utils::chunked_vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {
on_internal_error(cdc_log, "get_stream: entries empty");
@@ -73,7 +69,7 @@ bool cdc::metadata::streams_available() const {
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + generation_leeway.count()) {
if (ts > now + get_generation_leeway().count()) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the future\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps arbitrarily into the future, because we don't"
@@ -86,27 +82,43 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
// Nothing protects us from that until we start using transactions for generation switching.
}
auto it = gen_used_at(now);
if (it == _gens.end()) {
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
}
if (ts <= now - get_generation_leeway().count()) {
// We reject the write if `ts <= now - generation_leeway` and the write is not to the current generation, which
// happens iff one of the following is true:
// - the write is to no generation,
// - the write is to a generation older than the generation under `it`,
// - the write is to the generation under `it` and that generation is not the current generation.
// Note that we cannot distinguish the first and second cases because we garbage-collect obsolete generations,
// but we can check if one of them takes place (`it == _gens.end() || ts < it->first`). These three conditions
// are sufficient. The write with `ts <= now - generation_leeway` cannot be to one of the generations following
// the generation under `it` because that generation was operating at `now - generation_leeway`.
bool is_previous_gen = it != _gens.end() && std::next(it) != _gens.end() && std::next(it)->first <= now;
if (it == _gens.end() || ts < it->first || is_previous_gen) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the past\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties.\n"
"We *do* allow sending writes into the near past, but our ability to do that is limited."
" Are you using client-side timestamps? Make sure your clocks are well-synchronized"
" with the database's clocks.", format_timestamp(ts), format_timestamp(now)));
}
}
it = _gens.begin();
if (it == _gens.end() || ts < it->first) {
throw std::runtime_error(format(
"cdc::metadata::get_stream: could not find any CDC stream (current time: {})."
" Are we in the middle of a cluster upgrade?", format_timestamp(now)));
"cdc::metadata::get_stream: could not find any CDC stream for timestamp {}."
" Are we in the middle of a cluster upgrade?", format_timestamp(ts)));
}
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
if (it->first > ts) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream from an earlier generation than the currently used one."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties (write timestamp: {}, current generation started at: {})",
format_timestamp(ts), format_timestamp(it->first)));
}
// With `generation_leeway` we allow sending writes to the near future. It might happen
// that `ts` doesn't belong to the current generation ("current" according to our clock),
// but to the next generation. Adjust for this case:
// Find the generation operating at `ts`.
{
auto next_it = std::next(it);
while (next_it != _gens.end() && next_it->first <= ts) {
@@ -147,8 +159,8 @@ bool cdc::metadata::known_or_obsolete(db_clock::time_point tp) const {
++it;
}
// Check if some new generation has already superseded this one.
return it != _gens.end() && it->first <= api::new_timestamp();
// Check if the generation is obsolete.
return it != _gens.end() && it->first <= api::new_timestamp() - get_generation_leeway().count();
}
bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen) {
@@ -157,7 +169,7 @@ bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen)
}
auto now = api::new_timestamp();
auto it = gen_used_at(now);
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.

View File

@@ -42,7 +42,9 @@ class metadata final {
container_t::const_iterator gen_used_at(api::timestamp_type ts) const;
public:
/* Is a generation with the given timestamp already known or superseded by a newer generation? */
/* Is a generation with the given timestamp already known or obsolete? It is obsolete if and only if
* it is older than the generation operating at `now - get_generation_leeway()`.
*/
bool known_or_obsolete(db_clock::time_point) const;
/* Are there streams available. I.e. valid for time == now. If this is false, any writes to
@@ -54,8 +56,9 @@ public:
*
* If the provided timestamp is too far away "into the future" (where "now" is defined according to our local clock),
* we reject the get_stream query. This is because the resulting stream might belong to a generation which we don't
* yet know about. The amount of leeway (how much "into the future" we allow `ts` to be) is defined
* by the `cdc::generation_leeway` constant.
* yet know about. Similarly, we reject queries to the previous generations if the timestamp is too far away "into
* the past". The amount of leeway (how much "into the future" or "into the past" we allow `ts` to be) is defined by
* `get_generation_leeway()`.
*/
stream_id get_stream(api::timestamp_type ts, dht::token tok);

View File

@@ -168,7 +168,7 @@ std::ostream& operator<<(std::ostream& os, pretty_printed_throughput tp) {
}
static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk) {
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks) {
auto timestamp = table_s.min_memtable_timestamp();
std::optional<utils::hashed_key> hk;
for (auto&& sst : boost::range::join(selector.select(dk).sstables, table_s.compacted_undeleted_sstables())) {
@@ -179,6 +179,7 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
hk = sstables::sstable::make_hashed_key(*table_s.schema(), dk.key());
}
if (sst->filter_has_key(*hk)) {
bloom_filter_checks++;
timestamp = std::min(timestamp, sst->get_stats_metadata().min_timestamp);
}
}
@@ -414,9 +415,12 @@ private:
class formatted_sstables_list {
bool _include_origin = true;
std::vector<sstring> _ssts;
std::vector<std::string> _ssts;
public:
formatted_sstables_list() = default;
void reserve(size_t n) {
_ssts.reserve(n);
}
explicit formatted_sstables_list(const std::vector<shared_sstable>& ssts, bool include_origin) : _include_origin(include_origin) {
_ssts.reserve(ssts.size());
for (const auto& sst : ssts) {
@@ -435,9 +439,7 @@ public:
};
std::ostream& operator<<(std::ostream& os, const formatted_sstables_list& lst) {
os << "[";
os << boost::algorithm::join(lst._ssts, ",");
os << "]";
fmt::print(os, "[{}]", fmt::join(lst._ssts, ","));
return os;
}
@@ -462,6 +464,7 @@ protected:
uint64_t _start_size = 0;
uint64_t _end_size = 0;
uint64_t _estimated_partitions = 0;
uint64_t _bloom_filter_checks = 0;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
bool _can_split_large_partition = false;
@@ -518,7 +521,7 @@ protected:
auto max_sstable_size = std::max<uint64_t>(_max_sstable_size, 1);
uint64_t estimated_sstables = std::max(1UL, uint64_t(ceil(double(_start_size) / max_sstable_size)));
return std::min(uint64_t(ceil(double(_estimated_partitions) / estimated_sstables)),
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimated_partitions));
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimated_partitions, _schema));
}
void setup_new_sstable(shared_sstable& sst) {
@@ -573,13 +576,13 @@ protected:
return bool(_sstable_set);
}
compaction_writer create_gc_compaction_writer() const {
compaction_writer create_gc_compaction_writer(run_id gc_run) const {
auto sst = _sstable_creator(this_shard_id());
auto&& priority = _io_priority;
auto monitor = std::make_unique<compaction_write_monitor>(sst, _table_s, maximum_timestamp(), _sstable_level);
sstable_writer_config cfg = _table_s.configure_writer("garbage_collection");
cfg.run_identifier = _run_identifier;
cfg.run_identifier = gc_run;
cfg.monitor = monitor.get();
auto writer = sst->get_writer(*schema(), partitions_per_sstable(), cfg, get_encoding_stats(), priority);
return compaction_writer(std::move(monitor), std::move(writer), std::move(sst));
@@ -600,8 +603,14 @@ protected:
// When compaction finishes, all the temporary sstables generated here will be deleted and removed
// from table's sstable set.
compacted_fragments_writer get_gc_compacted_fragments_writer() {
// because the temporary sstable run can overlap with the non-gc sstables run created by
// get_compacted_fragments_writer(), we have to use a different run_id. the gc_run_id is
// created here as:
// 1. it can be shared across all sstables created by this writer
// 2. it is optional, as gc writer is not always used
auto gc_run = run_id::create_random_id();
return compacted_fragments_writer(*this,
[this] (const dht::decorated_key&) { return create_gc_compaction_writer(); },
[this, gc_run] (const dht::decorated_key&) { return create_gc_compaction_writer(gc_run); },
[this] (compaction_writer* cw) { stop_gc_compaction_writer(cw); },
_stop_request_observable);
}
@@ -618,8 +627,8 @@ protected:
return _used_garbage_collected_sstables;
}
bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max();
virtual bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max() && bool(_replacer);
}
public:
compaction& operator=(const compaction&) = delete;
@@ -641,6 +650,7 @@ private:
future<> setup() {
auto ssts = make_lw_shared<sstables::sstable_set>(make_sstable_set_for_input());
formatted_sstables_list formatted_msg;
formatted_msg.reserve(_sstables.size());
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
min_max_tracker<api::timestamp_type> timestamp_tracker;
@@ -757,6 +767,7 @@ protected:
.ended_at = ended_at,
.start_size = _start_size,
.end_size = _end_size,
.bloom_filter_checks = _bloom_filter_checks,
},
};
@@ -776,7 +787,7 @@ protected:
log_info("{} {} sstables to {}. {} to {} (~{}% of original) in {}ms = {}. ~{} total partitions merged to {}.",
report_finish_desc(),
_input_sstable_generations.size(), new_sstables_msg, pretty_printed_data_size(_start_size), pretty_printed_data_size(_end_size), int(ratio * 100),
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count(), pretty_printed_throughput(_end_size, duration),
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count(), pretty_printed_throughput(_start_size, duration),
_cdata.total_partitions, _cdata.total_keys_written);
return ret;
@@ -797,7 +808,7 @@ private:
};
}
return [this] (const dht::decorated_key& dk) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk);
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks);
};
}
@@ -997,51 +1008,6 @@ void compacted_fragments_writer::consume_end_of_stream() {
}
}
class reshape_compaction : public compaction {
public:
reshape_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: compaction(table_s, std::move(descriptor), cdata) {
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
}
flat_mutation_reader_v2 make_sstable_reader() const override {
return _compacting->make_local_shard_sstable_reader(_schema,
_permit,
query::full_partition_range,
_schema->full_slice(),
_io_priority,
tracing::trace_state_ptr(),
::streamed_mutation::forwarding::no,
::mutation_reader::forwarding::no,
default_read_monitor_generator());
}
std::string_view report_start_desc() const override {
return "Reshaping";
}
std::string_view report_finish_desc() const override {
return "Reshaped";
}
virtual compaction_writer create_compaction_writer(const dht::decorated_key& dk) override {
auto sst = _sstable_creator(this_shard_id());
setup_new_sstable(sst);
sstable_writer_config cfg = make_sstable_writer_config(compaction_type::Reshape);
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
if (writer) {
finish_new_sstable(writer);
}
}
};
class regular_compaction : public compaction {
// keeps track of monitors for input sstable, which are responsible for adjusting backlog as compaction progresses.
mutable compaction_read_monitor_generator _monitor_generator;
@@ -1151,12 +1117,13 @@ private:
}
void update_pending_ranges() {
if (!_sstable_set || _sstable_set->all()->empty() || _cdata.pending_replacements.empty()) { // set can be empty for testing scenario.
auto pending_replacements = std::exchange(_cdata.pending_replacements, {});
if (!_sstable_set || _sstable_set->all()->empty() || pending_replacements.empty()) { // set can be empty for testing scenario.
return;
}
// Releases reference to sstables compacted by this compaction or another, both of which belongs
// to the same column family
for (auto& pending_replacement : _cdata.pending_replacements) {
for (auto& pending_replacement : pending_replacements) {
for (auto& sst : pending_replacement.removed) {
// Set may not contain sstable to be removed because this compaction may have started
// before the creation of that sstable.
@@ -1170,7 +1137,70 @@ private:
}
}
_selector.emplace(_sstable_set->make_incremental_selector());
_cdata.pending_replacements.clear();
}
};
class reshape_compaction : public regular_compaction {
private:
bool has_sstable_replacer() const noexcept {
return bool(_replacer);
}
public:
reshape_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: regular_compaction(table_s, std::move(descriptor), cdata) {
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
}
// Unconditionally enable incremental compaction if the strategy specifies a max output size, e.g. LCS.
virtual bool enable_garbage_collected_sstable_writer() const noexcept override {
return _max_sstable_size != std::numeric_limits<uint64_t>::max() && bool(_replacer);
}
flat_mutation_reader_v2 make_sstable_reader() const override {
return _compacting->make_local_shard_sstable_reader(_schema,
_permit,
query::full_partition_range,
_schema->full_slice(),
_io_priority,
tracing::trace_state_ptr(),
::streamed_mutation::forwarding::no,
::mutation_reader::forwarding::no,
default_read_monitor_generator());
}
std::string_view report_start_desc() const override {
return "Reshaping";
}
std::string_view report_finish_desc() const override {
return "Reshaped";
}
virtual compaction_writer create_compaction_writer(const dht::decorated_key& dk) override {
auto sst = _sstable_creator(this_shard_id());
setup_new_sstable(sst);
sstable_writer_config cfg = make_sstable_writer_config(compaction_type::Reshape);
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
if (writer) {
if (has_sstable_replacer()) {
regular_compaction::stop_sstable_writer(writer);
} else {
finish_new_sstable(writer);
}
}
}
virtual void on_end_of_compaction() override {
if (has_sstable_replacer()) {
regular_compaction::on_end_of_compaction();
}
}
};
@@ -1590,7 +1620,7 @@ private:
uint64_t partitions_per_sstable(shard_id s) const {
uint64_t estimated_sstables = std::max(uint64_t(1), uint64_t(ceil(double(_estimation_per_shard[s].estimated_size) / _max_sstable_size)));
return std::min(uint64_t(ceil(double(_estimation_per_shard[s].estimated_partitions) / estimated_sstables)),
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimation_per_shard[s].estimated_partitions));
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimation_per_shard[s].estimated_partitions, _schema));
}
public:
resharding_compaction(table_state& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata)

View File

@@ -92,12 +92,15 @@ struct compaction_stats {
uint64_t start_size = 0;
uint64_t end_size = 0;
uint64_t validation_errors = 0;
// Bloom filter checks during max purgeable calculation
uint64_t bloom_filter_checks = 0;
compaction_stats& operator+=(const compaction_stats& r) {
ended_at = std::max(ended_at, r.ended_at);
start_size += r.start_size;
end_size += r.end_size;
validation_errors += r.validation_errors;
bloom_filter_checks += r.bloom_filter_checks;
return *this;
}
friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {

View File

@@ -7,10 +7,12 @@
*/
#include "compaction_manager.hh"
#include "compaction_descriptor.hh"
#include "compaction_strategy.hh"
#include "compaction_backlog_manager.hh"
#include "sstables/sstables.hh"
#include "sstables/sstables_manager.hh"
#include <memory>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/switch_to.hh>
@@ -78,6 +80,23 @@ public:
_compacting.erase(sst);
}
}
class update_me : public compaction_manager::task::on_replacement {
compacting_sstable_registration& _registration;
public:
update_me(compacting_sstable_registration& registration)
: _registration{registration} {}
void on_removal(const std::vector<sstables::shared_sstable>& sstables) override {
_registration.release_compacting(sstables);
}
void on_addition(const std::vector<sstables::shared_sstable>& sstables) override {
_registration.register_compacting(sstables);
}
};
auto update_on_sstable_replacement() {
return update_me(*this);
}
};
sstables::compaction_data compaction_manager::create_compaction_data() {
@@ -279,7 +298,7 @@ compaction_manager::task::task(compaction_manager& mgr, compaction::table_state*
, _description(std::move(desc))
{}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_manager::task> task) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_manager::task> task, throw_if_stopping do_throw_if_stopping) {
_tasks.push_back(task);
auto unregister_task = defer([this, task] {
_tasks.remove(task);
@@ -292,6 +311,9 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_tas
co_return res;
} catch (sstables::compaction_stopped_exception& e) {
cmlog.info("{}: stopped, reason: {}", *task, e.what());
if (do_throw_if_stopping) {
throw;
}
} catch (sstables::compaction_aborted_exception& e) {
cmlog.error("{}: aborted, reason: {}", *task, e.what());
_stats.errors++;
@@ -310,14 +332,14 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_tas
co_return std::nullopt;
}
future<sstables::compaction_result> compaction_manager::task::compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted, can_purge_tombstones can_purge) {
future<sstables::compaction_result> compaction_manager::task::compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, can_purge_tombstones can_purge) {
if (!descriptor.sstables.size()) {
// if there is nothing to compact, just return.
co_return sstables::compaction_result{};
}
bool should_update_history = this->should_update_history(descriptor.options.type());
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), cdata, std::move(release_exhausted), std::move(can_purge));
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), cdata, on_replace, std::move(can_purge));
if (should_update_history) {
co_await update_history(*_compacting_table, res, cdata);
@@ -325,8 +347,11 @@ future<sstables::compaction_result> compaction_manager::task::compact_sstables_a
co_return res;
}
future<sstables::compaction_result> compaction_manager::task::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted, can_purge_tombstones can_purge) {
future<sstables::compaction_result> compaction_manager::task::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
compaction::table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
}
@@ -334,15 +359,26 @@ future<sstables::compaction_result> compaction_manager::task::compact_sstables(s
auto sst = t.make_sstable();
return sst;
};
descriptor.replacer = [this, &t, release_exhausted] (sstables::compaction_completion_desc desc) {
descriptor.replacer = [this, &t, &on_replace, offstrategy] (sstables::compaction_completion_desc desc) {
t.get_compaction_strategy().notify_completion(desc.old_sstables, desc.new_sstables);
_cm.propagate_replacement(t, desc.old_sstables, desc.new_sstables);
// on_replace updates the compacting registration with the old and new
// sstables. while on_compaction_completion() removes the old sstables
// from the table's sstable set, and adds the new ones to the sstable
// set.
// since the regular compactions exclude the sstables in the sstable
// set which are currently being compacted, if we want to ensure the
// exclusive access of compactions to an sstable we should guard it
// with the registration when adding/removing it to/from the sstable
// set. otherwise, the regular compaction would pick it up in the time
// window, where the sstables:
// - are still in the main set
// - are not being compacted.
on_replace.on_addition(desc.new_sstables);
auto old_sstables = desc.old_sstables;
t.on_compaction_completion(std::move(desc), sstables::offstrategy::no).get();
// Calls compaction manager's task for this compaction to release reference to exhausted SSTables.
if (release_exhausted) {
release_exhausted(old_sstables);
}
t.on_compaction_completion(std::move(desc), offstrategy).get();
on_replace.on_removal(old_sstables);
};
co_return co_await sstables::compact_sstables(std::move(descriptor), cdata, t);
@@ -387,9 +423,7 @@ protected:
sstables::compaction_strategy cs = t->get_compaction_strategy();
sstables::compaction_descriptor descriptor = cs.get_major_compaction_job(*t, _cm.get_candidates(*t));
auto compacting = compacting_sstable_registration(_cm, descriptor.sstables);
auto release_exhausted = [&compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting.release_compacting(exhausted_sstables);
};
auto on_replace = compacting.update_on_sstable_replacement();
setup_new_compaction(descriptor.run_identifier);
cmlog.info0("User initiated compaction started on behalf of {}.{}", t->schema()->ks_name(), t->schema()->cf_name());
@@ -401,7 +435,7 @@ protected:
// the exclusive lock can be freed to let regular compaction run in parallel to major
lock_holder.return_all();
co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, std::move(release_exhausted));
co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace);
finish_compaction();
@@ -448,12 +482,12 @@ protected:
}
};
future<> compaction_manager::run_custom_job(compaction::table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&)> job) {
future<> compaction_manager::run_custom_job(compaction::table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&)> job, throw_if_stopping do_throw_if_stopping) {
if (_state != state::enabled) {
return make_ready_future<>();
}
return perform_task(make_shared<custom_compaction_task>(*this, &t, type, desc, std::move(job))).discard_result();
return perform_task(make_shared<custom_compaction_task>(*this, &t, type, desc, std::move(job)), do_throw_if_stopping).discard_result();
}
future<> compaction_manager::update_static_shares(float static_shares) {
@@ -647,6 +681,7 @@ sstables::compaction_stopped_exception compaction_manager::task::make_compaction
compaction_manager::compaction_manager(config cfg, abort_source& as)
: _cfg(std::move(cfg))
, _compaction_submission_timer(compaction_sg().cpu, compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), static_shares(), [this] () -> float {
_last_backlog = backlog();
auto b = _last_backlog / available_memory();
@@ -681,6 +716,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as)
compaction_manager::compaction_manager()
: _cfg(config{ .available_memory = 1 })
, _compaction_submission_timer(compaction_sg().cpu, compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), 1, [] () -> float { return 1.0; }))
, _backlog_manager(_compaction_controller)
, _throughput_updater(serialized_action([this] { return update_throughput(throughput_mbs()); }))
@@ -738,7 +774,7 @@ void compaction_manager::register_metrics() {
void compaction_manager::enable() {
assert(_state == state::none || _state == state::disabled);
_state = state::enabled;
_compaction_submission_timer.arm(periodic_compaction_submission_interval());
_compaction_submission_timer.arm_periodic(periodic_compaction_submission_interval());
_waiting_reevalution = postponed_compactions_reevaluation();
}
@@ -982,9 +1018,7 @@ protected:
}
auto compacting = compacting_sstable_registration(_cm, descriptor.sstables);
auto weight_r = compaction_weight_registration(&_cm, weight);
auto release_exhausted = [&compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting.release_compacting(exhausted_sstables);
};
auto on_replace = compacting.update_on_sstable_replacement();
cmlog.debug("Accepted compaction job: task={} ({} sstable(s)) of weight {} for {}.{}",
fmt::ptr(this), descriptor.sstables.size(), weight, t.schema()->ks_name(), t.schema()->cf_name());
@@ -993,7 +1027,7 @@ protected:
try {
bool should_update_history = this->should_update_history(descriptor.options.type());
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, std::move(release_exhausted));
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, on_replace);
finish_compaction();
if (should_update_history) {
// update_history can take a long time compared to
@@ -1034,7 +1068,7 @@ void compaction_manager::submit(compaction::table_state& t) {
// OK to drop future.
// waited via task->stop()
(void)perform_task(make_shared<regular_compaction_task>(*this, t));
(void)perform_task(make_shared<regular_compaction_task>(*this, t)).then_wrapped([] (auto f) { f.ignore_ready_future(); });
}
bool compaction_manager::can_perform_regular_compaction(compaction::table_state& t) {
@@ -1093,54 +1127,40 @@ public:
}
private:
future<> run_offstrategy_compaction(sstables::compaction_data& cdata) {
// This procedure will reshape sstables in maintenance set until it's ready for
// integration into main set.
// It may require N reshape rounds before the set satisfies the strategy invariant.
// This procedure also only updates maintenance set at the end, on success.
// Otherwise, some overlapping could be introduced in the set after each reshape
// round, progressively degrading read amplification until integration happens.
// The drawback of this approach is the 2x space requirement as the old sstables
// will only be deleted at the end. The impact of this space requirement is reduced
// by the fact that off-strategy is serialized across all tables, meaning that the
// actual requirement is the size of the largest table's maintenance set.
// Incrementally reshape the SSTables in maintenance set. The output of each reshape
// round is merged into the main set. The common case is that off-strategy input
// is mostly disjoint, e.g. repair-based node ops, then all the input will be
// reshaped in a single round. The incremental approach allows us to be space
// efficient (avoiding a 100% overhead) as we will incrementally replace input
// SSTables from maintenance set by output ones into main set.
compaction::table_state& t = *_compacting_table;
const auto& maintenance_sstables = t.maintenance_sstable_set();
// Filter out sstables that require view building, to avoid a race between off-strategy
// and view building. Refs: #11882
const auto old_sstables = boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_sstables.all()
| boost::adaptors::filtered([] (const sstables::shared_sstable& sst) {
return !sst->requires_view_building();
}));
std::vector<sstables::shared_sstable> reshape_candidates = old_sstables;
std::vector<sstables::shared_sstable> sstables_to_remove;
std::unordered_set<sstables::shared_sstable> new_unused_sstables;
auto cleanup_new_unused_sstables_on_failure = defer([&new_unused_sstables] {
for (auto& sst : new_unused_sstables) {
sst->mark_for_deletion();
}
});
auto get_reshape_candidates = [&t] () {
auto maintenance_ssts = t.maintenance_sstable_set().all();
return boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_ssts
| boost::adaptors::filtered([](const sstables::shared_sstable& sst) {
return !sst->requires_view_building();
}));
};
auto get_next_job = [&] () -> std::optional<sstables::compaction_descriptor> {
auto& iop = service::get_local_streaming_priority(); // run reshape in maintenance mode
auto desc = t.get_compaction_strategy().get_reshaping_job(reshape_candidates, t.schema(), iop, sstables::reshape_mode::strict);
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), iop, sstables::reshape_mode::strict);
return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
};
std::exception_ptr err;
while (auto desc = get_next_job()) {
desc->creator = [this, &new_unused_sstables, &t] (shard_id dummy) {
auto sst = t.make_sstable();
new_unused_sstables.insert(sst);
return sst;
};
auto input = boost::copy_range<std::unordered_set<sstables::shared_sstable>>(desc->sstables);
auto compacting = compacting_sstable_registration(_cm, desc->sstables);
auto on_replace = compacting.update_on_sstable_replacement();
sstables::compaction_result ret;
try {
ret = co_await sstables::compact_sstables(std::move(*desc), cdata, t);
sstables::compaction_result _ = co_await compact_sstables(std::move(*desc), _compaction_data, on_replace,
compaction_manager::can_purge_tombstones::no,
sstables::offstrategy::yes);
} catch (sstables::compaction_stopped_exception&) {
// If off-strategy compaction stopped on user request, let's not discard the partial work.
// Therefore, both un-reshaped and reshaped data will be integrated into main set, allowing
@@ -1149,36 +1169,20 @@ private:
break;
}
_performed = true;
// update list of reshape candidates without input but with output added to it
auto it = boost::remove_if(reshape_candidates, [&] (auto& s) { return input.contains(s); });
reshape_candidates.erase(it, reshape_candidates.end());
std::move(ret.new_sstables.begin(), ret.new_sstables.end(), std::back_inserter(reshape_candidates));
// If compaction strategy is unable to reshape input data in a single round, it may happen that a SSTable A
// created in round 1 will be compacted in a next round producing SSTable B. As SSTable A is no longer needed,
// it can be removed immediately. Let's remove all such SSTables immediately to reduce off-strategy space requirement.
// Input SSTables from maintenance set can only be removed later, as SSTable sets are only updated on completion.
auto can_remove_now = [&] (const sstables::shared_sstable& s) { return new_unused_sstables.contains(s); };
for (auto&& sst : input) {
if (can_remove_now(sst)) {
co_await sst->unlink();
new_unused_sstables.erase(std::move(sst));
} else {
sstables_to_remove.push_back(std::move(sst));
}
}
}
// at this moment reshape_candidates contains a set of sstables ready for integration into main set
auto completion_desc = sstables::compaction_completion_desc{
.old_sstables = std::move(old_sstables),
.new_sstables = std::move(reshape_candidates)
};
co_await t.on_compaction_completion(std::move(completion_desc), sstables::offstrategy::yes);
// There might be some remaining sstables in maintenance set that didn't require reshape, or the
// user has aborted off-strategy. So we can only integrate them into the main set, such that
// they become candidates for regular compaction. We cannot hold them forever in maintenance set,
// as that causes read and space amplification issues.
if (auto sstables = get_reshape_candidates(); sstables.size()) {
auto completion_desc = sstables::compaction_completion_desc{
.old_sstables = sstables, // removes from maintenance set.
.new_sstables = sstables, // adds into main set.
};
co_await t.on_compaction_completion(std::move(completion_desc), sstables::offstrategy::yes);
}
cleanup_new_unused_sstables_on_failure.cancel();
co_await sstables::sstable_directory::delete_atomically(std::move(sstables_to_remove));
if (err) {
co_await coroutine::return_exception_ptr(std::move(err));
}
@@ -1201,9 +1205,11 @@ protected:
std::exception_ptr ex;
try {
compaction::table_state& t = *_compacting_table;
auto maintenance_sstables = t.maintenance_sstable_set().all();
cmlog.info("Starting off-strategy compaction for {}.{}, {} candidates were found",
t.schema()->ks_name(), t.schema()->cf_name(), maintenance_sstables->size());
{
auto maintenance_sstables = t.maintenance_sstable_set().all();
cmlog.info("Starting off-strategy compaction for {}.{}, {} candidates were found",
t.schema()->ks_name(), t.schema()->cf_name(), maintenance_sstables->size());
}
co_await run_offstrategy_compaction(_compaction_data);
finish_compaction();
cmlog.info("Done with off-strategy compaction for {}.{}", t.schema()->ks_name(), t.schema()->cf_name());
@@ -1276,9 +1282,7 @@ private:
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, _options);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
auto release_exhausted = [this] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
_compacting.release_compacting(exhausted_sstables);
};
auto on_replace = _compacting.update_on_sstable_replacement();
setup_new_compaction(descriptor.run_identifier);
@@ -1287,7 +1291,7 @@ private:
std::exception_ptr ex;
try {
sstables::compaction_result res = co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, std::move(release_exhausted), _can_purge);
sstables::compaction_result res = co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace, _can_purge);
finish_compaction();
_cm.reevaluate_postponed_compactions();
co_return res; // done with current sstable
@@ -1444,14 +1448,26 @@ protected:
co_return std::nullopt;
}
private:
// Releases reference to cleaned files such that respective used disk space can be freed.
void release_exhausted(std::vector<sstables::shared_sstable> exhausted_sstables) {
_compacting.release_compacting(exhausted_sstables);
}
future<> run_cleanup_job(sstables::compaction_descriptor descriptor) {
co_await coroutine::switch_to(_cm.compaction_sg().cpu);
// Releases reference to cleaned files such that respective used disk space can be freed.
using update_registration = compacting_sstable_registration::update_me;
class release_exhausted : public update_registration {
sstables::compaction_descriptor& _desc;
public:
release_exhausted(compacting_sstable_registration& registration, sstables::compaction_descriptor& desc)
: update_registration{registration}
, _desc{desc} {}
void on_removal(const std::vector<sstables::shared_sstable>& sstables) override {
auto exhausted = boost::copy_range<std::unordered_set<sstables::shared_sstable>>(sstables);
std::erase_if(_desc.sstables, [&] (const sstables::shared_sstable& sst) {
return exhausted.contains(sst);
});
update_registration::on_removal(sstables);
}
};
release_exhausted on_replace{_compacting, descriptor};
for (;;) {
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_cm._compaction_controller.backlog_of_shares(200), _cm.available_memory()));
_cm.register_backlog_tracker(user_initiated);
@@ -1459,8 +1475,7 @@ private:
std::exception_ptr ex;
try {
setup_new_compaction(descriptor.run_identifier);
co_await compact_sstables_and_update_history(descriptor, _compaction_data,
std::bind(&cleanup_sstables_compaction_task::release_exhausted, this, std::placeholders::_1));
co_await compact_sstables_and_update_history(descriptor, _compaction_data, on_replace);
finish_compaction();
_cm.reevaluate_postponed_compactions();
co_return; // done with current job

View File

@@ -32,6 +32,7 @@
#include "compaction.hh"
#include "compaction_weight_registration.hh"
#include "compaction_backlog_manager.hh"
#include "compaction/compaction_descriptor.hh"
#include "strategy_control.hh"
#include "backlog_controller.hh"
#include "seastarx.hh"
@@ -49,6 +50,8 @@ public:
boost::icl::interval_map<dht::token, gc_clock::time_point, boost::icl::partial_absorber, std::less, boost::icl::inplace_max> map;
};
using throw_if_stopping = bool_class<struct throw_if_stopping_tag>;
// Compaction manager provides facilities to submit and track compaction jobs on
// behalf of existing tables.
class compaction_manager {
@@ -137,11 +140,20 @@ public:
virtual ~task();
// called when a compaction replaces the exhausted sstables with the new set
struct on_replacement {
virtual ~on_replacement() {}
// called after the replacement completes
// @param sstables the old sstable which are replaced in this replacement
virtual void on_removal(const std::vector<sstables::shared_sstable>& sstables) = 0;
// called before the replacement happens
// @param sstables the new sstables to be added to the table's sstable set
virtual void on_addition(const std::vector<sstables::shared_sstable>& sstables) = 0;
};
protected:
virtual future<compaction_stats_opt> do_run() = 0;
using throw_if_stopping = bool_class<struct throw_if_stopping_tag>;
state switch_state(state new_state);
future<semaphore_units<named_semaphore_exception_factory>> acquire_semaphore(named_semaphore& sem, size_t units = 1);
@@ -158,12 +170,10 @@ public:
// otherwise, returns stop_iteration::no after sleep for exponential retry.
future<stop_iteration> maybe_retry(std::exception_ptr err, bool throw_on_abort = false);
// Compacts set of SSTables according to the descriptor.
using release_exhausted_func_t = std::function<void(const std::vector<sstables::shared_sstable>& exhausted_sstables)>;
future<sstables::compaction_result> compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted,
can_purge_tombstones can_purge = can_purge_tombstones::yes);
future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted,
future<sstables::compaction_result> compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,
can_purge_tombstones can_purge = can_purge_tombstones::yes);
future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstables::offstrategy offstrategy = sstables::offstrategy::no);
future<> update_history(compaction::table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata);
bool should_update_history(sstables::compaction_type ct) {
return ct == sstables::compaction_type::Compaction;
@@ -296,10 +306,10 @@ private:
std::function<void()> compaction_submission_callback();
// all registered tables are reevaluated at a constant interval.
// Submission is a NO-OP when there's nothing to do, so it's fine to call it regularly.
timer<lowres_clock> _compaction_submission_timer = timer<lowres_clock>(compaction_submission_callback());
static constexpr std::chrono::seconds periodic_compaction_submission_interval() { return std::chrono::seconds(3600); }
config _cfg;
timer<lowres_clock> _compaction_submission_timer;
compaction_controller _compaction_controller;
compaction_backlog_manager _backlog_manager;
optimized_optional<abort_source::subscription> _early_abort_subscription;
@@ -315,7 +325,7 @@ private:
per_table_history_maps _repair_history_maps;
tombstone_gc_state _tombstone_gc_state;
private:
future<compaction_stats_opt> perform_task(shared_ptr<task>);
future<compaction_stats_opt> perform_task(shared_ptr<task>, throw_if_stopping do_throw_if_stopping = throw_if_stopping::no);
future<> stop_tasks(std::vector<shared_ptr<task>> tasks, sstring reason);
future<> update_throughput(uint32_t value_mbs);
@@ -460,7 +470,7 @@ public:
// parameter type is the compaction type the operation can most closely be
// associated with, use compaction_type::Compaction, if none apply.
// parameter job is a function that will carry the operation
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&)> job);
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&)> job, throw_if_stopping do_throw_if_stopping);
class compaction_reenabler {
compaction_manager& _cm;

View File

@@ -65,7 +65,7 @@ bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& s
return sst->estimate_droppable_tombstone_ratio(gc_before) >= _tombstone_threshold;
}
uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) {
return partition_estimate;
}
@@ -409,7 +409,9 @@ public:
l0_old_ssts.push_back(std::move(sst));
}
}
_l0_scts.replace_sstables(std::move(l0_old_ssts), std::move(l0_new_ssts));
if (l0_old_ssts.size() || l0_new_ssts.size()) {
_l0_scts.replace_sstables(std::move(l0_old_ssts), std::move(l0_new_ssts));
}
}
};
@@ -748,8 +750,8 @@ compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, iop, mode);
}
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate);
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) {
return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate, std::move(schema));
}
reader_consumer_v2 compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) {

View File

@@ -108,7 +108,7 @@ public:
compaction_backlog_tracker make_backlog_tracker();
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr);
reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer);

View File

@@ -70,7 +70,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() = 0;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema);
virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer);

View File

@@ -144,6 +144,8 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
auto max_sstable_size_in_bytes = _max_sstable_size_in_mb * 1024 * 1024;
leveled_manifest::logger.debug("get_reshaping_job: mode={} input.size={} max_sstable_size_in_bytes={}", mode == reshape_mode::relaxed ? "relaxed" : "strict", input.size(), max_sstable_size_in_bytes);
for (auto& sst : input) {
auto sst_level = sst->get_sstable_level();
if (sst_level > leveled_manifest::MAX_LEVELS - 1) {
@@ -227,6 +229,9 @@ leveled_compaction_strategy::get_cleanup_compaction_jobs(table_state& table_s, s
}
unsigned leveled_compaction_strategy::ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size) {
if (!max_sstable_size) {
return 1;
}
auto log_fanout = [fanout = leveled_manifest::leveled_fan_out] (double x) {
double inv_log_fanout = 1.0f / std::log(fanout);
return log(x) * inv_log_fanout;

View File

@@ -100,16 +100,27 @@ public:
};
};
uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
if (!ms_meta.min_timestamp || !ms_meta.max_timestamp) {
// Not enough information, we assume the worst
return partition_estimate / max_data_segregation_window_count;
}
const auto min_window = get_window_for(_options, *ms_meta.min_timestamp);
const auto max_window = get_window_for(_options, *ms_meta.max_timestamp);
const auto window_size = get_window_size(_options);
uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr s) {
// If not enough information, we assume the worst
auto estimated_window_count = max_data_segregation_window_count;
auto default_ttl = std::chrono::duration_cast<std::chrono::microseconds>(s->default_time_to_live());
bool min_and_max_ts_available = ms_meta.min_timestamp && ms_meta.max_timestamp;
auto estimate_window_count = [this] (timestamp_type min_window, timestamp_type max_window) {
const auto window_size = get_window_size(_options);
return (max_window + (window_size - 1) - min_window) / window_size;
};
auto estimated_window_count = (max_window + (window_size - 1) - min_window) / window_size;
if (!min_and_max_ts_available && default_ttl.count()) {
auto min_window = get_window_for(_options, timestamp_type(0));
auto max_window = get_window_for(_options, timestamp_type(default_ttl.count()));
estimated_window_count = estimate_window_count(min_window, max_window);
} else if (min_and_max_ts_available) {
auto min_window = get_window_for(_options, *ms_meta.min_timestamp);
auto max_window = get_window_for(_options, *ms_meta.max_timestamp);
estimated_window_count = estimate_window_count(min_window, max_window);
}
return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));
}

View File

@@ -157,7 +157,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) override;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr s) override;
virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) override;

View File

@@ -560,7 +560,7 @@ public:
auto marker = it->second;
++it;
if (it != e && marker != composite::eoc::none) {
throw runtime_exception(format("non-zero component divider found ({:d}) mid", format("0x{:02x}", composite::eoc_type(marker) & 0xff)));
throw runtime_exception(format("non-zero component divider found ({:#02x}) mid", composite::eoc_type(marker) & 0xff));
}
}
return ret;

View File

@@ -65,6 +65,13 @@ commitlog_sync_period_in_ms: 10000
# is reasonable.
commitlog_segment_size_in_mb: 32
# The size of the individual schema commitlog file segments.
# The segment size puts a limit on the mutation size that can be
# written at once, and some schema mutation writes are much larger
# than average.
schema_commitlog_segment_size_in_mb: 32
# seed_provider class_name is saved for future use.
# A seed address is mandatory.
seed_provider:
@@ -553,4 +560,16 @@ murmur3_partitioner_ignore_msb_bits: 12
# WARNING: It's unsafe to set this to false if the node previously booted
# with the schema commit log enabled. In such case, some schema changes
# may be lost if the node was not cleanly stopped.
force_schema_commit_log: true
force_schema_commit_log: true
# Use Raft to consistently manage schema information in the cluster.
# Refer to https://docs.scylladb.com/master/architecture/raft.html for more details.
# The 'Handling Failures' section is especially important.
#
# Once enabled in a cluster, this cannot be turned off.
# If you want to bootstrap a new cluster without Raft, make sure to set this to `false`
# before starting your nodes for the first time.
#
# A cluster not using Raft can be 'upgraded' to use Raft. Refer to the aforementioned
# documentation, section 'Enabling Raft in ScyllaDB 5.2 and further', for the procedure.
consistent_cluster_management: true

View File

@@ -409,6 +409,7 @@ scylla_tests = set([
'test/boost/limiting_data_source_test',
'test/boost/linearizing_input_stream_test',
'test/boost/loading_cache_test',
'test/boost/locator_topology_test',
'test/boost/log_heap_test',
'test/boost/estimated_histogram_test',
'test/boost/summary_test',
@@ -697,6 +698,7 @@ scylla_core = (['message/messaging_service.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'utils/on_internal_error.cc',
'converting_mutation_partition_applier.cc',
'readers/combined.cc',
'readers/multishard.cc',
@@ -968,6 +970,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -1076,6 +1079,8 @@ api = ['api/api.cc',
Json2Code('api/api-doc/error_injection.json'),
'api/authorization_cache.cc',
Json2Code('api/api-doc/authorization_cache.json'),
'api/raft.cc',
Json2Code('api/api-doc/raft.json'),
]
alternator = [
@@ -1268,7 +1273,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'hashers.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']

View File

@@ -463,8 +463,7 @@ orderByClause[raw::select_statement::parameters::orderings_type& orderings]
;
jsonValue returns [expression value]
:
| s=STRING_LITERAL { $value = untyped_constant{untyped_constant::string, $s.text}; }
: s=STRING_LITERAL { $value = untyped_constant{untyped_constant::string, $s.text}; }
| m=marker { $value = std::move(m); }
;
@@ -1578,8 +1577,7 @@ marker returns [expression value]
;
intValue returns [expression value]
:
| t=INTEGER { $value = untyped_constant{untyped_constant::integer, $t.text}; }
: t=INTEGER { $value = untyped_constant{untyped_constant::integer, $t.text}; }
| e=marker { $value = std::move(e); }
;

View File

@@ -10,6 +10,7 @@
#include "cql3/attributes.hh"
#include "cql3/column_identifier.hh"
#include <optional>
namespace cql3 {
@@ -55,9 +56,9 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
}
}
int32_t attributes::get_time_to_live(const query_options& options) {
std::optional<int32_t> attributes::get_time_to_live(const query_options& options) {
if (!_time_to_live.has_value() || _time_to_live_unset_guard.is_unset(options))
return 0;
return std::nullopt;
cql3::raw_value tval = expr::evaluate(*_time_to_live, options);
if (tval.is_null()) {

View File

@@ -45,7 +45,7 @@ public:
int64_t get_timestamp(int64_t now, const query_options& options);
int32_t get_time_to_live(const query_options& options);
std::optional<int32_t> get_time_to_live(const query_options& options);
db::timeout_clock::duration get_timeout(const query_options& options) const;

View File

@@ -1416,7 +1416,7 @@ expression search_and_replace(const expression& e,
};
},
[&] (const binary_operator& oper) -> expression {
return binary_operator(recurse(oper.lhs), oper.op, recurse(oper.rhs));
return binary_operator(recurse(oper.lhs), oper.op, recurse(oper.rhs), oper.order);
},
[&] (const column_mutation_attribute& cma) -> expression {
return column_mutation_attribute{cma.kind, recurse(cma.column)};

View File

@@ -78,7 +78,7 @@ static
void
usertype_constructor_validate_assignable_to(const usertype_constructor& u, data_dictionary::database db, const sstring& keyspace, const column_specification& receiver) {
if (!receiver.type->is_user_type()) {
throw exceptions::invalid_request_exception(format("Invalid user type literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid user type literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
auto ut = static_pointer_cast<const user_type_impl>(receiver.type);
@@ -90,7 +90,7 @@ usertype_constructor_validate_assignable_to(const usertype_constructor& u, data_
const expression& value = u.elements.at(field);
auto&& field_spec = usertype_field_spec_of(receiver, i);
if (!assignment_testable::is_assignable(test_assignment(value, db, keyspace, *field_spec))) {
throw exceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}", receiver.name, field, field_spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}", *receiver.name, field, field_spec->type->as_cql3_type()));
}
}
}
@@ -313,7 +313,7 @@ set_validate_assignable_to(const collection_constructor& c, data_dictionary::dat
return;
}
throw exceptions::invalid_request_exception(format("Invalid set literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid set literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
auto&& value_spec = set_value_spec_of(receiver);
@@ -501,18 +501,18 @@ void
tuple_constructor_validate_assignable_to(const tuple_constructor& tc, data_dictionary::database db, const sstring& keyspace, const column_specification& receiver) {
auto tt = dynamic_pointer_cast<const tuple_type_impl>(receiver.type->underlying_type());
if (!tt) {
throw exceptions::invalid_request_exception(format("Invalid tuple type literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid tuple type literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
for (size_t i = 0; i < tc.elements.size(); ++i) {
if (i >= tt->size()) {
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: too many elements. Type {} expects {:d} but got {:d}",
receiver.name, tt->as_cql3_type(), tt->size(), tc.elements.size()));
*receiver.name, tt->as_cql3_type(), tt->size(), tc.elements.size()));
}
auto&& value = tc.elements[i];
auto&& spec = component_spec_of(receiver, i);
if (!assignment_testable::is_assignable(test_assignment(value, db, keyspace, *spec))) {
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}", receiver.name, i, spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}", *receiver.name, i, spec->type->as_cql3_type()));
}
}
}

View File

@@ -165,8 +165,6 @@ static data_value castas_fctn_from_dv_to_string(data_value from) {
return from.type()->to_string_impl(from);
}
// FIXME: Add conversions for counters, after they are fully implemented...
static constexpr unsigned next_power_of_2(unsigned val) {
unsigned ret = 1;
while (ret <= val) {
@@ -370,6 +368,26 @@ castas_fctn get_castas_fctn(data_type to_type, data_type from_type) {
return castas_fctn_from_dv_to_string;
case cast_switch_case_val(kind::utf8, kind::ascii):
return castas_fctn_simple<sstring, sstring>;
case cast_switch_case_val(kind::byte, kind::counter):
return castas_fctn_simple<int8_t, int64_t>;
case cast_switch_case_val(kind::short_kind, kind::counter):
return castas_fctn_simple<int16_t, int64_t>;
case cast_switch_case_val(kind::int32, kind::counter):
return castas_fctn_simple<int32_t, int64_t>;
case cast_switch_case_val(kind::long_kind, kind::counter):
return castas_fctn_simple<int64_t, int64_t>;
case cast_switch_case_val(kind::float_kind, kind::counter):
return castas_fctn_simple<float, int64_t>;
case cast_switch_case_val(kind::double_kind, kind::counter):
return castas_fctn_simple<double, int64_t>;
case cast_switch_case_val(kind::varint, kind::counter):
return castas_fctn_simple<utils::multiprecision_int, int64_t>;
case cast_switch_case_val(kind::decimal, kind::counter):
return castas_fctn_from_integer_to_decimal<int64_t>;
case cast_switch_case_val(kind::ascii, kind::counter):
case cast_switch_case_val(kind::utf8, kind::counter):
return castas_fctn_to_string<int64_t>;
}
throw exceptions::invalid_request_exception(format("{} cannot be cast to {}", from_type->name(), to_type->name()));
}

View File

@@ -13,6 +13,7 @@
#include "cql3/lists.hh"
#include "cql3/constants.hh"
#include "cql3/user_types.hh"
#include "cql3/ut_name.hh"
#include "cql3/type_json.hh"
#include "cql3/functions/user_function.hh"
#include "cql3/functions/user_aggregate.hh"
@@ -52,6 +53,13 @@ bool abstract_function::requires_thread() const { return false; }
bool as_json_function::requires_thread() const { return false; }
static bool same_signature(const shared_ptr<function>& f1, const shared_ptr<function>& f2) {
if (f1 == nullptr || f2 == nullptr) {
return false;
}
return f1->name() == f2->name() && f1->arg_types() == f2->arg_types();
}
thread_local std::unordered_multimap<function_name, shared_ptr<function>> functions::_declared = init();
void functions::clear_functions() noexcept {
@@ -97,11 +105,6 @@ functions::init() noexcept {
if (type == cql3_type::blob) {
continue;
}
// counters are not supported yet
if (type.is_counter()) {
warn(unimplemented::cause::COUNTERS);
continue;
}
declare(make_to_blob_function(type.get_type()));
declare(make_from_blob_function(type.get_type()));
@@ -143,22 +146,56 @@ void functions::replace_function(shared_ptr<function> func) {
with_udf_iter(func->name(), func->arg_types(), [func] (functions::declared_t::iterator i) {
i->second = std::move(func);
});
auto scalar_func = dynamic_pointer_cast<scalar_function>(func);
if (!scalar_func) {
return;
}
for (auto& fit : _declared) {
auto aggregate = dynamic_pointer_cast<user_aggregate>(fit.second);
if (aggregate && (same_signature(aggregate->sfunc(), scalar_func)
|| (same_signature(aggregate->finalfunc(), scalar_func))
|| (same_signature(aggregate->reducefunc(), scalar_func))))
{
// we need to replace at least one underlying function
shared_ptr<scalar_function> sfunc = same_signature(aggregate->sfunc(), scalar_func) ? scalar_func : aggregate->sfunc();
shared_ptr<scalar_function> finalfunc = same_signature(aggregate->finalfunc(), scalar_func) ? scalar_func : aggregate->finalfunc();
shared_ptr<scalar_function> reducefunc = same_signature(aggregate->reducefunc(), scalar_func) ? scalar_func : aggregate->reducefunc();
fit.second = ::make_shared<user_aggregate>(aggregate->name(), aggregate->initcond(), sfunc, reducefunc, finalfunc);
}
}
}
void functions::remove_function(const function_name& name, const std::vector<data_type>& arg_types) {
with_udf_iter(name, arg_types, [] (functions::declared_t::iterator i) { _declared.erase(i); });
}
std::optional<function_name> functions::used_by_user_aggregate(const function_name& name) {
std::optional<function_name> functions::used_by_user_aggregate(shared_ptr<user_function> func) {
for (const shared_ptr<function>& fptr : _declared | boost::adaptors::map_values) {
auto aggregate = dynamic_pointer_cast<user_aggregate>(fptr);
if (aggregate && (aggregate->sfunc().name() == name || (aggregate->has_finalfunc() && aggregate->finalfunc().name() == name))) {
if (aggregate && (same_signature(aggregate->sfunc(), func)
|| (same_signature(aggregate->finalfunc(), func))
|| (same_signature(aggregate->reducefunc(), func))))
{
return aggregate->name();
}
}
return {};
}
std::optional<function_name> functions::used_by_user_function(const ut_name& user_type) {
for (const shared_ptr<function>& fptr : _declared | boost::adaptors::map_values) {
for (auto& arg_type : fptr->arg_types()) {
if (arg_type->references_user_type(user_type.get_keyspace(), user_type.get_user_type_name())) {
return fptr->name();
}
}
if (fptr->return_type()->references_user_type(user_type.get_keyspace(), user_type.get_user_type_name())) {
return fptr->name();
}
}
return {};
}
lw_shared_ptr<column_specification>
functions::make_arg_spec(const sstring& receiver_ks, const sstring& receiver_cf,
const function& fun, size_t i) {

View File

@@ -71,7 +71,8 @@ public:
static void add_function(shared_ptr<function>);
static void replace_function(shared_ptr<function>);
static void remove_function(const function_name& name, const std::vector<data_type>& arg_types);
static std::optional<function_name> used_by_user_aggregate(const function_name& name);
static std::optional<function_name> used_by_user_aggregate(shared_ptr<user_function>);
static std::optional<function_name> used_by_user_function(const ut_name& user_type);
private:
template <typename F>
static void with_udf_iter(const function_name& name, const std::vector<data_type>& arg_types, F&& f);

View File

@@ -37,14 +37,14 @@ public:
virtual sstring element_type() const override { return "aggregate"; }
virtual std::ostream& describe(std::ostream& os) const override;
const scalar_function& sfunc() const {
return *_sfunc;
seastar::shared_ptr<scalar_function> sfunc() const {
return _sfunc;
}
const scalar_function& reducefunc() const {
return *_reducefunc;
seastar::shared_ptr<scalar_function> reducefunc() const {
return _reducefunc;
}
const scalar_function& finalfunc() const {
return *_finalfunc;
seastar::shared_ptr<scalar_function> finalfunc() const {
return _finalfunc;
}
const bytes_opt& initcond() const {
return _initcond;

View File

@@ -32,9 +32,9 @@ operation::set_element::prepare(data_dictionary::database db, const sstring& key
using exceptions::invalid_request_exception;
auto rtype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!rtype) {
throw invalid_request_exception(format("Invalid operation ({}) for non collection column {}", to_string(receiver), receiver.name()));
throw invalid_request_exception(format("Invalid operation ({}) for non collection column {}", to_string(receiver), receiver.name_as_text()));
} else if (!rtype->is_multi_cell()) {
throw invalid_request_exception(format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name()));
throw invalid_request_exception(format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name_as_text()));
}
if (rtype->get_kind() == abstract_type::kind::list) {
@@ -47,7 +47,7 @@ operation::set_element::prepare(data_dictionary::database db, const sstring& key
return make_shared<lists::setter_by_index>(receiver, std::move(idx), std::move(lval));
}
} else if (rtype->get_kind() == abstract_type::kind::set) {
throw invalid_request_exception(format("Invalid operation ({}) for set column {}", to_string(receiver), receiver.name()));
throw invalid_request_exception(format("Invalid operation ({}) for set column {}", to_string(receiver), receiver.name_as_text()));
} else if (rtype->get_kind() == abstract_type::kind::map) {
auto key = prepare_expression(_selector, db, keyspace, nullptr, maps::key_spec_of(*receiver.column_specification));
auto mval = prepare_expression(_value, db, keyspace, nullptr, maps::value_spec_of(*receiver.column_specification));
@@ -136,11 +136,11 @@ operation::addition::prepare(data_dictionary::database db, const sstring& keyspa
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
if (!receiver.is_counter()) {
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non counter column {}", to_string(receiver), receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non counter column {}", to_string(receiver), receiver.name_as_text()));
}
return make_shared<constants::adder>(receiver, std::move(v));
} else if (!ctype->is_multi_cell()) {
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name_as_text()));
}
if (ctype->get_kind() == abstract_type::kind::list) {
@@ -169,14 +169,14 @@ operation::subtraction::prepare(data_dictionary::database db, const sstring& key
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
if (!receiver.is_counter()) {
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non counter column {}", to_string(receiver), receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non counter column {}", to_string(receiver), receiver.name_as_text()));
}
auto v = prepare_expression(_value, db, keyspace, nullptr, receiver.column_specification);
return make_shared<constants::subtracter>(receiver, std::move(v));
}
if (!ctype->is_multi_cell()) {
throw exceptions::invalid_request_exception(
format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name()));
format("Invalid operation ({}) for frozen collection column {}", to_string(receiver), receiver.name_as_text()));
}
if (ctype->get_kind() == abstract_type::kind::list) {
@@ -211,9 +211,9 @@ operation::prepend::prepare(data_dictionary::database db, const sstring& keyspac
auto v = prepare_expression(_value, db, keyspace, nullptr, receiver.column_specification);
if (!dynamic_cast<const list_type_impl*>(receiver.type.get())) {
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non list column {}", to_string(receiver), receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for non list column {}", to_string(receiver), receiver.name_as_text()));
} else if (!receiver.type->is_multi_cell()) {
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for frozen list column {}", to_string(receiver), receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid operation ({}) for frozen list column {}", to_string(receiver), receiver.name_as_text()));
}
return make_shared<lists::prepender>(receiver, std::move(v));
@@ -340,9 +340,9 @@ operation::element_deletion::affected_column() const {
shared_ptr<operation>
operation::element_deletion::prepare(data_dictionary::database db, const sstring& keyspace, const column_definition& receiver) const {
if (!receiver.type->is_collection()) {
throw exceptions::invalid_request_exception(format("Invalid deletion operation for non collection column {}", receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid deletion operation for non collection column {}", receiver.name_as_text()));
} else if (!receiver.type->is_multi_cell()) {
throw exceptions::invalid_request_exception(format("Invalid deletion operation for frozen collection column {}", receiver.name()));
throw exceptions::invalid_request_exception(format("Invalid deletion operation for frozen collection column {}", receiver.name_as_text()));
}
auto ctype = static_pointer_cast<const collection_type_impl>(receiver.type);
if (ctype->get_kind() == abstract_type::kind::list) {

View File

@@ -135,12 +135,21 @@ void query_options::prepare(const std::vector<lw_shared_ptr<column_specification
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
bool found_value_for_name = false;
for (size_t j = 0; j < names.size(); j++) {
if (names[j] == spec_name) {
ordered_values.emplace_back(_value_views[j]);
found_value_for_name = true;
break;
}
}
// No bound value was found with the name `spec_name`.
// This means that the user forgot to include a bound value with such name.
if (!found_value_for_name) {
throw exceptions::invalid_request_exception(
format("Missing value for bind marker with name: {}", spec_name));
}
}
_value_views = std::move(ordered_values);
}

View File

@@ -22,6 +22,7 @@
#include "db/config.hh"
#include "data_dictionary/data_dictionary.hh"
#include "hashers.hh"
#include "utils/error_injection.hh"
namespace cql3 {
@@ -600,6 +601,14 @@ query_processor::get_statement(const sstring_view& query, const service::client_
std::unique_ptr<raw::parsed_statement>
query_processor::parse_statement(const sstring_view& query) {
try {
{
const char* error_injection_key = "query_processor-parse_statement-test_failure";
utils::get_local_injector().inject(error_injection_key, [&]() {
if (query.find(error_injection_key) != sstring_view::npos) {
throw std::runtime_error(error_injection_key);
}
});
}
auto statement = util::do_with_parser(query, std::mem_fn(&cql3_parser::CqlParser::query));
if (!statement) {
throw exceptions::syntax_exception("Parsing failed");
@@ -923,6 +932,9 @@ void query_processor::migration_subscriber::on_update_aggregate(const sstring& k
void query_processor::migration_subscriber::on_update_view(
const sstring& ks_name,
const sstring& view_name, bool columns_changed) {
// scylladb/scylladb#16392 - Materialized views are also tables so we need at least handle
// them as such when changed.
on_update_column_family(ks_name, view_name, columns_changed);
}
void query_processor::migration_subscriber::on_drop_keyspace(const sstring& ks_name) {

View File

@@ -80,7 +80,7 @@ public:
virtual sstring assignment_testable_source_context() const override {
auto&& name = _type->field_name(_field);
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
auto sname = std::string_view(reinterpret_cast<const char*>(name.data()), name.size());
return format("{}.{}", _selected, sname);
}

View File

@@ -404,20 +404,19 @@ alter_table_statement::prepare_schema_mutations(query_processor& qp, api::timest
std::unique_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_table_statement::prepare(data_dictionary::database db, cql_stats& stats) {
auto t = db.try_find_table(keyspace(), column_family());
std::optional<schema_ptr> s = t ? std::make_optional(t->schema()) : std::nullopt;
std::optional<sstring> warning = check_restricted_table_properties(db, s, keyspace(), column_family(), *_properties);
if (warning) {
mylogger.warn("{}", *warning);
}
return std::make_unique<prepared_statement>(make_shared<alter_table_statement>(*this));
}
future<::shared_ptr<messages::result_message>>
alter_table_statement::execute(query_processor& qp, service::query_state& state, const query_options& options) const {
auto s = validation::validate_column_family(qp.db(), keyspace(), column_family());
std::optional<sstring> warning = check_restricted_table_properties(qp, s, keyspace(), column_family(), *_properties);
return schema_altering_statement::execute(qp, state, options).then([this, warning = std::move(warning)] (::shared_ptr<messages::result_message> msg) {
if (warning) {
msg->add_warning(*warning);
mylogger.warn("{}", *warning);
}
return msg;
});
validation::validate_column_family(qp.db(), keyspace(), column_family());
return schema_altering_statement::execute(qp, state, options);
}
}

View File

@@ -141,6 +141,18 @@ user_type alter_type_statement::add_or_alter::do_add(data_dictionary::database d
throw exceptions::invalid_request_exception(format("Cannot add new field to type {}: maximum number of fields reached", _name));
}
if (_field_type->is_duration()) {
auto&& ks = db.find_keyspace(keyspace());
for (auto&& schema : ks.metadata()->cf_meta_data() | boost::adaptors::map_values) {
for (auto&& column : schema->clustering_key_columns()) {
if (column.type->references_user_type(_name.get_keyspace(), _name.get_user_type_name())) {
throw exceptions::invalid_request_exception(format("Cannot add new field to type {} because it is used in the clustering key column {} of table {}.{} where durations are not allowed",
_name.to_string(), column.name_as_text(), schema->ks_name(), schema->cf_name()));
}
}
}
}
std::vector<bytes> new_names(to_update->field_names());
new_names.push_back(_field_name->name());
std::vector<data_type> new_types(to_update->field_types());

View File

@@ -120,6 +120,9 @@ std::optional<mutation> cas_request::apply(foreign_ptr<lw_shared_ptr<query::resu
const update_parameters::prefetch_data::row* cas_request::find_old_row(const cas_row_update& op) const {
static const clustering_key empty_ckey = clustering_key::make_empty();
if (_key.empty()) {
throw exceptions::invalid_request_exception("Empty partition key range");
}
const partition_key& pkey = _key.front().start()->value().key().value();
// We must ignore statement clustering column restriction when
// choosing a row to check the conditions. If there is no
@@ -131,6 +134,9 @@ const update_parameters::prefetch_data::row* cas_request::find_old_row(const cas
// CREATE TABLE t(p int, c int, s int static, v int, PRIMARY KEY(p, c));
// INSERT INTO t(p, s) VALUES(1, 1);
// UPDATE t SET v=1 WHERE p=1 AND c=1 IF s=1;
if (op.ranges.empty()) {
throw exceptions::invalid_request_exception("Empty clustering range");
}
const clustering_key& ckey = op.ranges.front().start() ? op.ranges.front().start()->value() : empty_ckey;
auto row = _rows.find_row(pkey, ckey);
if (row == nullptr && !ckey.is_empty() &&

View File

@@ -185,6 +185,10 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
if (_properties.properties()->get_synchronous_updates_flag()) {
throw exceptions::invalid_request_exception(format("The synchronous_updates option is only applicable to materialized views, not to base tables"));
}
std::optional<sstring> warning = check_restricted_table_properties(db, std::nullopt, keyspace(), column_family(), *_properties.properties());
if (warning) {
mylogger.warn("{}", *warning);
}
const bool has_default_ttl = _properties.properties()->get_default_time_to_live() > 0;
auto stmt = ::make_shared<create_table_statement>(*_cf_name, _properties.properties(), _if_not_exists, _static_columns, _properties.properties()->get_id());
@@ -426,7 +430,7 @@ void create_table_statement::raw_statement::add_column_alias(::shared_ptr<column
// legal but restricted by the configuration. Checks for other of errors
// in the table's options are done elsewhere.
std::optional<sstring> check_restricted_table_properties(
query_processor& qp,
data_dictionary::database db,
std::optional<schema_ptr> schema,
const sstring& keyspace, const sstring& table,
const cf_prop_defs& cfprops)
@@ -450,7 +454,7 @@ std::optional<sstring> check_restricted_table_properties(
auto cs = (strategy) ? strategy : current_strategy;
if (strategy && *strategy == sstables::compaction_strategy_type::date_tiered) {
switch(qp.db().get_config().restrict_dtcs()) {
switch(db.get_config().restrict_dtcs()) {
case db::tri_mode_restriction_t::mode::TRUE:
throw exceptions::configuration_exception(
"DateTieredCompactionStrategy is deprecated, and "
@@ -471,7 +475,7 @@ std::optional<sstring> check_restricted_table_properties(
std::map<sstring, sstring> options = (strategy) ? cfprops.get_compaction_type_options() : (*schema)->compaction_strategy_options();
sstables::time_window_compaction_strategy_options twcs_options(options);
long ttl = (cfprops.has_property(cf_prop_defs::KW_DEFAULT_TIME_TO_LIVE)) ? cfprops.get_default_time_to_live() : current_ttl.count();
auto max_windows = qp.db().get_config().twcs_max_window_count();
auto max_windows = db.get_config().twcs_max_window_count();
// It may happen that an user tries to update an unrelated table property. Allow the request through.
if (!cfprops.has_property(cf_prop_defs::KW_DEFAULT_TIME_TO_LIVE) && !strategy) {
@@ -491,7 +495,7 @@ std::optional<sstring> check_restricted_table_properties(
"highly discouraged.", ttl, twcs_options.get_sstable_window_size().count(), window_count, max_windows));
}
} else {
switch (qp.db().get_config().restrict_twcs_without_default_ttl()) {
switch (db.get_config().restrict_twcs_without_default_ttl()) {
case db::tri_mode_restriction_t::mode::TRUE:
throw exceptions::configuration_exception(
"TimeWindowCompactionStrategy tables without a strict default_time_to_live setting "
@@ -510,18 +514,6 @@ std::optional<sstring> check_restricted_table_properties(
return std::nullopt;
}
future<::shared_ptr<messages::result_message>>
create_table_statement::execute(query_processor& qp, service::query_state& state, const query_options& options) const {
std::optional<sstring> warning = check_restricted_table_properties(qp, std::nullopt, keyspace(), column_family(), *_properties);
return schema_altering_statement::execute(qp, state, options).then([this, warning = std::move(warning)] (::shared_ptr<messages::result_message> msg) {
if (warning) {
msg->add_warning(*warning);
mylogger.warn("{}", *warning);
}
return msg;
});
}
}
}

View File

@@ -79,9 +79,6 @@ public:
virtual future<> grant_permissions_to_creator(const service::client_state&) const override;
virtual future<::shared_ptr<messages::result_message>>
execute(query_processor& qp, service::query_state& state, const query_options& options) const override;
schema_ptr get_cf_meta_data(const data_dictionary::database) const;
class raw_statement;
@@ -129,7 +126,7 @@ public:
};
std::optional<sstring> check_restricted_table_properties(
query_processor& qp,
data_dictionary::database db,
std::optional<schema_ptr> schema,
const sstring& keyspace, const sstring& table,
const cf_prop_defs& cfprops);

View File

@@ -178,7 +178,13 @@ future<std::vector<description>> function(replica::database& db, const sstring&
auto udfs = boost::copy_range<std::vector<shared_ptr<const keyspace_element>>>(fs | boost::adaptors::transformed([] (const auto& f) {
return dynamic_pointer_cast<const functions::user_function>(f.second);
}) | boost::adaptors::filtered([] (const auto& f) {
return f != nullptr;
}));
if (udfs.empty()) {
throw exceptions::invalid_request_exception(format("Function '{}' not found in keyspace '{}'", name, ks));
}
co_return co_await generate_descriptions(db, udfs, true);
}
@@ -191,13 +197,19 @@ future<std::vector<description>> functions(replica::database& db,const sstring&
future<std::vector<description>> aggregate(replica::database& db, const sstring& ks, const sstring& name) {
auto fs = functions::functions::find(functions::function_name(ks, name));
if(fs.empty()) {
if (fs.empty()) {
throw exceptions::invalid_request_exception(format("Aggregate '{}' not found in keyspace '{}'", name, ks));
}
auto udas = boost::copy_range<std::vector<shared_ptr<const keyspace_element>>>(fs | boost::adaptors::transformed([] (const auto& f) {
return dynamic_pointer_cast<const functions::user_aggregate>(f.second);
}) | boost::adaptors::filtered([] (const auto& f) {
return f != nullptr;
}));
if (udas.empty()) {
throw exceptions::invalid_request_exception(format("Aggregate '{}' not found in keyspace '{}'", name, ks));
}
co_return co_await generate_descriptions(db, udas, true);
}

View File

@@ -35,7 +35,7 @@ drop_function_statement::prepare_schema_mutations(query_processor& qp, api::time
if (!user_func) {
throw exceptions::invalid_request_exception(format("'{}' is not a user defined function", func));
}
if (auto aggregate = functions::functions::used_by_user_aggregate(user_func->name()); bool(aggregate)) {
if (auto aggregate = functions::functions::used_by_user_aggregate(user_func)) {
throw exceptions::invalid_request_exception(format("Cannot delete function {}, as it is used by user-defined aggregate {}", func, *aggregate));
}
m = co_await qp.get_migration_manager().prepare_function_drop_announcement(user_func, ts);

View File

@@ -10,6 +10,7 @@
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/query_processor.hh"
#include "cql3/functions/functions.hh"
#include "boost/range/adaptor/map.hpp"
@@ -109,6 +110,9 @@ void drop_type_statement::validate_while_executing(query_processor& qp) const {
}
}
if (auto&& fun_name = functions::functions::used_by_user_function(_name)) {
throw exceptions::invalid_request_exception(format("Cannot drop user type {}.{} as it is still used by function {}", keyspace, type->get_name_as_string(), *fun_name));
}
} catch (data_dictionary::no_such_keyspace& e) {
throw exceptions::invalid_request_exception(format("Cannot drop type in unknown keyspace {}", keyspace()));
}

View File

@@ -17,6 +17,7 @@
#include "cql3/util.hh"
#include "validation.hh"
#include "db/consistency_level_validations.hh"
#include <optional>
#include <seastar/core/shared_ptr.hh>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/map.hpp>
@@ -95,8 +96,9 @@ bool modification_statement::is_timestamp_set() const {
return attrs->is_timestamp_set();
}
gc_clock::duration modification_statement::get_time_to_live(const query_options& options) const {
return gc_clock::duration(attrs->get_time_to_live(options));
std::optional<gc_clock::duration> modification_statement::get_time_to_live(const query_options& options) const {
std::optional<int32_t> ttl = attrs->get_time_to_live(options);
return ttl ? std::make_optional<gc_clock::duration>(*ttl) : std::nullopt;
}
future<> modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
@@ -305,6 +307,10 @@ modification_statement::execute_with_condition(query_processor& qp, service::que
throw exceptions::invalid_request_exception(format("Unrestricted partition key in a conditional {}",
type.is_update() ? "update" : "deletion"));
}
if (ranges.empty()) {
throw exceptions::invalid_request_exception(format("Unrestricted clustering key in a conditional {}",
type.is_update() ? "update" : "deletion"));
}
auto request = seastar::make_shared<cas_request>(s, std::move(keys));
// cas_request can be used for batches as well single statements; Here we have just a single

View File

@@ -130,7 +130,7 @@ public:
bool is_timestamp_set() const;
gc_clock::duration get_time_to_live(const query_options& options) const;
std::optional<gc_clock::duration> get_time_to_live(const query_options& options) const;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -120,7 +120,10 @@ schema_altering_statement::execute(query_processor& qp, service::query_state& st
}
return execute0(qp, state, options).then([this, &state, internal](::shared_ptr<messages::result_message> result) {
auto permissions_granted_fut = internal
// We don't want to grant the permissions to the supposed creator even if the statement succeeded if it's an internal query
// or if the query did not actually create the item, i.e. the query is bounced to another shard or it's a IF NOT EXISTS
// query where the item already exists.
auto permissions_granted_fut = internal || !result->is_schema_change()
? make_ready_future<>()
: grant_permissions_to_creator(state.get_client_state());
return permissions_granted_fut.then([result = std::move(result)] {

View File

@@ -60,7 +60,11 @@ void use_statement::validate(query_processor&, const service::client_state& stat
future<::shared_ptr<cql_transport::messages::result_message>>
use_statement::execute(query_processor& qp, service::query_state& state, const query_options& options) const {
state.get_client_state().set_keyspace(qp.db().real_database(), _keyspace);
try {
state.get_client_state().set_keyspace(qp.db().real_database(), _keyspace);
} catch(...) {
return make_exception_future<::shared_ptr<cql_transport::messages::result_message>>(std::current_exception());
}
auto result =::make_shared<cql_transport::messages::result_message::set_keyspace>(_keyspace);
return make_ready_future<::shared_ptr<cql_transport::messages::result_message>>(result);
}

View File

@@ -151,14 +151,19 @@ static bytes from_json_object_aux(const map_type_impl& t, const rjson::value& va
std::map<bytes, bytes, serialized_compare> raw_map(t.get_keys_type()->as_less_comparator());
for (auto it = value.MemberBegin(); it != value.MemberEnd(); ++it) {
bytes value = from_json_object(*t.get_values_type(), it->value);
if (!t.get_keys_type()->is_compatible_with(*utf8_type)) {
if (t.get_keys_type()->underlying_type() == ascii_type ||
t.get_keys_type()->underlying_type() == utf8_type) {
raw_map.emplace(from_json_object(*t.get_keys_type(), it->name), std::move(value));
} else {
// Keys in maps can only be strings in JSON, but they can also be a string representation
// of another JSON type, which needs to be reparsed. Example - map<frozen<list<int>>, int>
// will be represented like this: { "[1, 3, 6]": 3, "[]": 0, "[1, 2]": 2 }
rjson::value map_key = rjson::parse(rjson::to_string_view(it->name));
raw_map.emplace(from_json_object(*t.get_keys_type(), map_key), std::move(value));
} else {
raw_map.emplace(from_json_object(*t.get_keys_type(), it->name), std::move(value));
try {
rjson::value map_key = rjson::parse(rjson::to_string_view(it->name));
raw_map.emplace(from_json_object(*t.get_keys_type(), map_key), std::move(value));
} catch (rjson::error& e) {
throw marshal_exception(format("Failed parsing map_key {}: {}", it->name, e.what()));
}
}
}
return map_type_impl::serialize_to_bytes(raw_map);
@@ -488,7 +493,7 @@ struct to_json_string_visitor {
sstring operator()(const tuple_type_impl& t) { return to_json_string_aux(t, bv); }
sstring operator()(const user_type_impl& t) { return to_json_string_aux(t, bv); }
sstring operator()(const simple_date_type_impl& t) { return quote_json_string(t.to_string(bv)); }
sstring operator()(const time_type_impl& t) { return t.to_string(bv); }
sstring operator()(const time_type_impl& t) { return quote_json_string(t.to_string(bv)); }
sstring operator()(const empty_type_impl& t) { return "null"; }
sstring operator()(const duration_type_impl& t) {
auto v = t.deserialize(bv);

View File

@@ -93,7 +93,7 @@ public:
};
// Note: value (mutation) only required to contain the rows we are interested in
private:
const gc_clock::duration _ttl;
const std::optional<gc_clock::duration> _ttl;
// For operations that require a read-before-write, stores prefetched cell values.
// For CAS statements, stores values of conditioned columns.
// Is a reference to an outside prefetch_data container since a CAS BATCH statement
@@ -106,7 +106,7 @@ public:
const query_options& _options;
update_parameters(const schema_ptr schema_, const query_options& options,
api::timestamp_type timestamp, gc_clock::duration ttl, const prefetch_data& prefetched)
api::timestamp_type timestamp, std::optional<gc_clock::duration> ttl, const prefetch_data& prefetched)
: _ttl(ttl)
, _prefetched(prefetched)
, _timestamp(timestamp)
@@ -127,11 +127,7 @@ public:
}
atomic_cell make_cell(const abstract_type& type, const raw_value_view& value, atomic_cell::collection_member cm = atomic_cell::collection_member::no) const {
auto ttl = _ttl;
if (ttl.count() <= 0) {
ttl = _schema->default_time_to_live();
}
auto ttl = this->ttl();
return value.with_value([&] (const FragmentedView auto& v) {
if (ttl.count() > 0) {
@@ -143,11 +139,7 @@ public:
};
atomic_cell make_cell(const abstract_type& type, const managed_bytes_view& value, atomic_cell::collection_member cm = atomic_cell::collection_member::no) const {
auto ttl = _ttl;
if (ttl.count() <= 0) {
ttl = _schema->default_time_to_live();
}
auto ttl = this->ttl();
if (ttl.count() > 0) {
return atomic_cell::make_live(type, _timestamp, value, _local_deletion_time + ttl, ttl, cm);
@@ -169,7 +161,7 @@ public:
}
gc_clock::duration ttl() const {
return _ttl.count() > 0 ? _ttl : _schema->default_time_to_live();
return _ttl.value_or(_schema->default_time_to_live());
}
gc_clock::time_point expiry() const {

View File

@@ -216,7 +216,7 @@ keyspace_metadata::keyspace_metadata(std::string_view name,
std::move(strategy_options),
durable_writes,
std::move(cf_defs),
user_types_metadata{},
std::move(user_types),
storage_options{}) { }
keyspace_metadata::keyspace_metadata(std::string_view name,
@@ -254,6 +254,11 @@ keyspace_metadata::new_keyspace(std::string_view name,
return ::make_lw_shared<keyspace_metadata>(name, strategy_name, options, durables_writes, cf_defs, user_types_metadata{}, storage_opts);
}
lw_shared_ptr<keyspace_metadata>
keyspace_metadata::new_keyspace(const keyspace_metadata& ksm) {
return new_keyspace(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.durable_writes(), std::vector<schema_ptr>{}, ksm.get_storage_options());
}
void keyspace_metadata::add_user_type(const user_type ut) {
_user_types.add_type(ut);
}

View File

@@ -55,6 +55,8 @@ public:
bool durables_writes,
std::vector<schema_ptr> cf_defs = std::vector<schema_ptr>{},
storage_options storage_opts = {});
static lw_shared_ptr<keyspace_metadata>
new_keyspace(const keyspace_metadata& ksm);
void validate(const locator::topology&) const;
const sstring& name() const {
return _name;

View File

@@ -59,7 +59,7 @@ public:
}
_end_of_stream = false;
forward_buffer_to(pr.start());
clear_buffer();
return _underlying->fast_forward_to(std::move(pr));
}

View File

@@ -1671,9 +1671,9 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
align = f.disk_write_dma_alignment();
auto is_overwrite = false;
auto existing_size = f.known_size();
if ((flags & open_flags::dsync) != open_flags{}) {
auto existing_size = f.known_size();
is_overwrite = true;
// would be super nice if we just could mmap(/dev/zero) and do sendto
// instead of this, but for now we must do explicit buffer writes.
@@ -1683,8 +1683,6 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
if (existing_size > max_size) {
co_await f.truncate(max_size);
} else if (existing_size < max_size) {
totals.total_size_on_disk += (max_size - existing_size);
clogger.trace("Pre-writing {} of {} KB to segment {}", (max_size - existing_size)/1024, max_size/1024, filename);
// re-open without o_dsync for pre-alloc. The reason/rationale
@@ -1732,6 +1730,12 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
co_await f.truncate(max_size);
}
// #12810 - we did not update total_size_on_disk unless o_dsync was
// on. So kept running with total == 0 -> free for all in creating new segment.
// Always update total_size_on_disk. Will wrap-around iff existing_size > max_size.
// That is ok.
totals.total_size_on_disk += (max_size - existing_size);
if (cfg.extensions && !cfg.extensions->commitlog_file_extensions().empty()) {
for (auto * ext : cfg.extensions->commitlog_file_extensions()) {
auto nf = co_await ext->wrap_file(filename, f, flags);
@@ -2116,6 +2120,9 @@ future<> db::commitlog::segment_manager::do_pending_deletes() {
clogger.debug("Discarding segments {}", ftd);
for (auto& [f, mode] : ftd) {
// `f.remove_file()` resets known_size to 0, so remember the size here,
// in order to subtract it from total_size_on_disk accurately.
auto size = f.known_size();
try {
if (f) {
co_await f.close();
@@ -2132,7 +2139,6 @@ future<> db::commitlog::segment_manager::do_pending_deletes() {
}
}
auto size = f.known_size();
auto usage = totals.total_size_on_disk;
auto next_usage = usage - size;
@@ -2165,7 +2171,7 @@ future<> db::commitlog::segment_manager::do_pending_deletes() {
// or had such an exception that we consider the file dead
// anyway. In either case we _remove_ the file size from
// footprint, because it is no longer our problem.
totals.total_size_on_disk -= f.known_size();
totals.total_size_on_disk -= size;
}
// #8376 - if we had an error in recycling (disk rename?), and no elements
@@ -2585,12 +2591,20 @@ db::commitlog::read_log_file(sstring filename, sstring pfx, seastar::io_priority
return eof || next == pos;
}
future<> skip(size_t bytes) {
pos += bytes;
if (pos > file_size) {
auto n = std::min(file_size - pos, bytes);
pos += n;
if (pos == file_size) {
eof = true;
pos = file_size;
}
return fin.skip(bytes);
if (n < bytes) {
// if we are trying to skip past end, we have at least
// the bytes skipped or the source from where we read
// this corrupt. So add at least four bytes. This is
// inexact, but adding the full "bytes" is equally wrong
// since it could be complete garbled junk.
corrupt_size += std::max(n, sizeof(uint32_t));
}
return fin.skip(n);
}
void stop() {
eof = true;

View File

@@ -418,6 +418,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, commitlog_segment_size_in_mb(this, "commitlog_segment_size_in_mb", value_status::Used, 64,
"Sets the size of the individual commitlog file segments. A commitlog segment may be archived, deleted, or recycled after all its data has been flushed to SSTables. This amount of data can potentially include commitlog segments from every table in the system. The default size is usually suitable for most commitlog archiving, but if you want a finer granularity, 8 or 16 MB is reasonable. See Commit log archive configuration.\n"
"Related information: Commit log archive configuration")
, schema_commitlog_segment_size_in_mb(this, "schema_commitlog_segment_size_in_mb", value_status::Used, 32,
"Sets the size of the individual schema commitlog file segments. The segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average.\n"
"Related information: Commit log archive configuration")
/* Note: does not exist on the listing page other than in above comment, wtf? */
, commitlog_sync_period_in_ms(this, "commitlog_sync_period_in_ms", value_status::Used, 10000,
"Controls how long the system waits for other writes before performing a sync in \"periodic\" mode.")
@@ -826,6 +829,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_sstables_mc_format(this, "enable_sstables_mc_format", value_status::Unused, true, "Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, enable_sstables_md_format(this, "enable_sstables_md_format", value_status::Unused, true, "Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, sstable_format(this, "sstable_format", value_status::Used, "me", "Default sstable file format", {"md", "me"})
, table_digest_insensitive_to_expiry(this, "table_digest_insensitive_to_expiry", liveness::MustRestart, value_status::Used, true,
"When enabled, per-table schema digest calculation ignores empty partitions.")
, enable_dangerous_direct_import_of_cassandra_counters(this, "enable_dangerous_direct_import_of_cassandra_counters", value_status::Used, false, "Only turn this option on if you want to import tables from Cassandra containing counters, and you are SURE that no counters in that table were created in a version earlier than Cassandra 2.1."
" It is not enough to have ever since upgraded to newer versions of Cassandra. If you EVER used a version earlier than 2.1 in the cluster where these SSTables come from, DO NOT TURN ON THIS OPTION! You will corrupt your data. You have been warned.")
, enable_shard_aware_drivers(this, "enable_shard_aware_drivers", value_status::Used, true, "Enable native transport drivers to use connection-per-shard for better performance")
@@ -909,6 +914,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, force_schema_commit_log(this, "force_schema_commit_log", value_status::Used, false,
"Use separate schema commit log unconditionally rater than after restart following discovery of cluster-wide support for it.")
, task_ttl_seconds(this, "task_ttl_in_seconds", liveness::LiveUpdate, value_status::Used, 10, "Time for which information about finished task stays in memory.")
, nodeops_watchdog_timeout_seconds(this, "nodeops_watchdog_timeout_seconds", liveness::LiveUpdate, value_status::Used, 120, "Time in seconds after which node operations abort when not hearing from the coordinator")
, nodeops_heartbeat_interval_seconds(this, "nodeops_heartbeat_interval_seconds", liveness::LiveUpdate, value_status::Used, 10, "Period of heartbeat ticks in node operations")
, cache_index_pages(this, "cache_index_pages", liveness::LiveUpdate, value_status::Used, false,
"Keep SSTable index pages in the global cache after a SSTable read. Expected to improve performance for workloads with big partitions, but may degrade performance for workloads with small partitions.")
, x_log2_compaction_groups(this, "x_log2_compaction_groups", value_status::Used, 0, "Controls static number of compaction groups per table per shard. For X groups, set the option to log (base 2) of X. Example: Value of 3 implies 8 groups.")

View File

@@ -175,6 +175,7 @@ public:
named_value<uint32_t> failure_detector_timeout_in_ms;
named_value<sstring> commitlog_sync;
named_value<uint32_t> commitlog_segment_size_in_mb;
named_value<uint32_t> schema_commitlog_segment_size_in_mb;
named_value<uint32_t> commitlog_sync_period_in_ms;
named_value<uint32_t> commitlog_sync_batch_window_in_ms;
named_value<int64_t> commitlog_total_space_in_mb;
@@ -331,6 +332,7 @@ public:
named_value<bool> enable_sstables_mc_format;
named_value<bool> enable_sstables_md_format;
named_value<sstring> sstable_format;
named_value<bool> table_digest_insensitive_to_expiry;
named_value<bool> enable_dangerous_direct_import_of_cassandra_counters;
named_value<bool> enable_shard_aware_drivers;
named_value<bool> enable_ipv6_dns_lookup;
@@ -388,6 +390,8 @@ public:
named_value<bool> force_schema_commit_log;
named_value<uint32_t> task_ttl_seconds;
named_value<uint32_t> nodeops_watchdog_timeout_seconds;
named_value<uint32_t> nodeops_heartbeat_interval_seconds;
named_value<bool> cache_index_pages;
@@ -401,6 +405,10 @@ public:
named_value<uint64_t> wasm_udf_yield_fuel;
named_value<uint64_t> wasm_udf_total_fuel;
named_value<size_t> wasm_udf_memory_limit;
// wasm_udf_reserved_memory is static because the options in db::config
// are parsed using seastar::app_template, while this option is used for
// configuring the Seastar memory subsystem.
static constexpr size_t wasm_udf_reserved_memory = 50 * 1024 * 1024;
seastar::logging_settings logging_settings(const log_cli::options&) const;

View File

@@ -860,7 +860,8 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
ctx_ptr->mark_hint_as_in_progress(rp);
// Future is waited on indirectly in `send_one_file()` (via `ctx_ptr->file_send_gate`).
(void)with_gate(ctx_ptr->file_send_gate, [this, secs_since_file_mod, &fname, buf = std::move(buf), rp, ctx_ptr] () mutable {
auto h = ctx_ptr->file_send_gate.hold();
(void)std::invoke([this, secs_since_file_mod, &fname, buf = std::move(buf), rp, ctx_ptr] () mutable {
try {
auto m = this->get_mutation(ctx_ptr, buf);
gc_clock::duration gc_grace_sec = m.s->gc_grace_seconds();
@@ -896,7 +897,7 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
return make_exception_future<>(std::move(eptr));
}
return make_ready_future<>();
}).then_wrapped([this, units = std::move(units), rp, ctx_ptr] (future<>&& f) {
}).then_wrapped([this, units = std::move(units), rp, ctx_ptr, h = std::move(h)] (future<>&& f) {
// Information about the error was already printed somewhere higher.
// We just need to account in the ctx that sending of this hint has failed.
if (!f.failed()) {

View File

@@ -17,13 +17,22 @@
#include "idl/hinted_handoff.dist.hh"
#include "idl/hinted_handoff.dist.impl.hh"
#include "utils/base64.hh"
#include "xx_hasher.hh"
namespace db {
namespace hints {
// Sync points can be encoded in two formats: V1 and V2. V2 extends V1 by adding
// a checksum. Currently, we use the V2 format, but sync points encoded in the V1
// format still can be safely decoded.
//
// Format V1 (encoded in base64):
// uint8_t 0x01 - version of format
// sync_point_v1 - encoded using IMR
// sync_point_v1 - encoded using IDL
//
// Format V2 (encoded in base64):
// uint8_t 0x02 - version of format
// sync_point_v1 - encoded using IDL
// uint64_t - checksum computed using the xxHash algorithm
//
// sync_point_v1:
// UUID host_id - ID of the host which created the sync point
@@ -41,6 +50,9 @@ namespace hints {
// Flattened representation was chosen in order to save space on
// vector lengths etc.
static constexpr size_t version_size = sizeof(uint8_t);
static constexpr size_t checksum_size = sizeof(uint64_t);
static std::vector<sync_point::shard_rps> decode_one_type_v1(uint16_t shard_count, const per_manager_sync_point_v1& v1) {
std::vector<sync_point::shard_rps> ret;
@@ -67,16 +79,37 @@ static std::vector<sync_point::shard_rps> decode_one_type_v1(uint16_t shard_coun
return ret;
}
static uint64_t calculate_checksum(const sstring_view s) {
xx_hasher h;
h.update(s.data(), s.size());
return h.finalize_uint64();
}
sync_point sync_point::decode(sstring_view s) {
bytes raw = base64_decode(s);
if (raw.empty()) {
throw std::runtime_error("Could not decode the sync point - not a valid hex string");
}
if (raw[0] != 1) {
throw std::runtime_error(format("Unsupported sync point format version: {}", int(raw[0])));
sstring_view raw_s(reinterpret_cast<const char*>(raw.data()), raw.size());
seastar::simple_memory_input_stream in{raw_s.data(), raw_s.size()};
uint8_t version = ser::serializer<uint8_t>::read(in);
if (version == 2) {
if (raw_s.size() < version_size + checksum_size) {
throw std::runtime_error("Could not decode the sync point encoded in the V2 format - serialized blob is too short");
}
seastar::simple_memory_input_stream in_checksum{raw_s.end() - checksum_size, checksum_size};
uint64_t checksum = ser::serializer<uint64_t>::read(in_checksum);
if (checksum != calculate_checksum(raw_s.substr(0, raw_s.size() - checksum_size))) {
throw std::runtime_error("Could not decode the sync point encoded in the V2 format - wrong checksum");
}
}
else if (version != 1) {
throw std::runtime_error(format("Unsupported sync point format version: {}", int(version)));
}
seastar::simple_memory_input_stream in{reinterpret_cast<const char*>(raw.data()) + 1, raw.size() - 1};
sync_point_v1 v1 = ser::serializer<sync_point_v1>::read(in);
return sync_point{
@@ -133,11 +166,16 @@ sstring sync_point::encode() const {
seastar::measuring_output_stream measure;
ser::serializer<sync_point_v1>::write(measure, v1);
// Reserve 1 byte for the version
bytes serialized{bytes::initialized_later{}, 1 + measure.size()};
serialized[0] = 1;
seastar::simple_memory_output_stream out{reinterpret_cast<char*>(serialized.data()), measure.size(), 1};
// Reserve version_size bytes for the version and checksum_size bytes for the checksum
bytes serialized{bytes::initialized_later{}, version_size + measure.size() + checksum_size};
// Encode using V2 format
seastar::simple_memory_output_stream out{reinterpret_cast<char*>(serialized.data()), serialized.size()};
ser::serializer<uint8_t>::write(out, 2);
ser::serializer<sync_point_v1>::write(out, v1);
sstring_view serialized_s(reinterpret_cast<const char*>(serialized.data()), version_size + measure.size());
uint64_t checksum = calculate_checksum(serialized_s);
ser::serializer<uint64_t>::write(out, checksum);
return base64_encode(serialized);
}

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/print.hh>
#include <seastar/core/coroutine.hh>
#include "db/system_keyspace.hh"
#include "db/large_data_handler.hh"
#include "sstables/sstables.hh"
@@ -55,11 +56,11 @@ void large_data_handler::start() {
}
future<> large_data_handler::stop() {
if (!running()) {
return make_ready_future<>();
if (running()) {
_running = false;
large_data_logger.info("Waiting for {} background handlers", max_concurrency - _sem.available_units());
co_await _sem.wait(max_concurrency);
}
_running = false;
return _sem.wait(max_concurrency);
}
void large_data_handler::plug_system_keyspace(db::system_keyspace& sys_ks) noexcept {

View File

@@ -24,6 +24,10 @@ enum class schema_feature {
PER_TABLE_PARTITIONERS,
SCYLLA_KEYSPACES,
SCYLLA_AGGREGATES,
// When enabled, schema_mutations::digest() will skip empty mutations (with only tombstones),
// so that the digest remains the same after schema tables are compacted.
TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,
};
using schema_features = enum_set<super_enum<schema_feature,
@@ -33,7 +37,8 @@ using schema_features = enum_set<super_enum<schema_feature,
schema_feature::CDC_OPTIONS,
schema_feature::PER_TABLE_PARTITIONERS,
schema_feature::SCYLLA_KEYSPACES,
schema_feature::SCYLLA_AGGREGATES
schema_feature::SCYLLA_AGGREGATES,
schema_feature::TABLE_DIGEST_INSENSITIVE_TO_EXPIRY
>>;
}

View File

@@ -93,15 +93,18 @@ static bool is_extra_durable(const sstring& ks_name, const sstring& cf_name) {
/** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
namespace db {
schema_ctxt::schema_ctxt(const db::config& cfg, std::shared_ptr<data_dictionary::user_types_storage> uts)
: _extensions(cfg.extensions())
schema_ctxt::schema_ctxt(const db::config& cfg, std::shared_ptr<data_dictionary::user_types_storage> uts,
const gms::feature_service& features, replica::database* db)
: _db(db)
, _features(features)
, _extensions(cfg.extensions())
, _murmur3_partitioner_ignore_msb_bits(cfg.murmur3_partitioner_ignore_msb_bits())
, _schema_registry_grace_period(cfg.schema_registry_grace_period())
, _user_types(std::move(uts))
{}
schema_ctxt::schema_ctxt(const replica::database& db)
: schema_ctxt(db.get_config(), db.as_user_types_storage())
schema_ctxt::schema_ctxt(replica::database& db)
: schema_ctxt(db.get_config(), db.as_user_types_storage(), db.features(), &db)
{}
schema_ctxt::schema_ctxt(distributed<replica::database>& db)
@@ -148,7 +151,8 @@ static future<> merge_tables_and_views(distributed<service::storage_proxy>& prox
std::map<table_id, schema_mutations>&& tables_before,
std::map<table_id, schema_mutations>&& tables_after,
std::map<table_id, schema_mutations>&& views_before,
std::map<table_id, schema_mutations>&& views_after);
std::map<table_id, schema_mutations>&& views_after,
bool reload);
struct [[nodiscard]] user_types_to_drop final {
seastar::noncopyable_function<future<> ()> drop;
@@ -161,7 +165,7 @@ static future<user_types_to_drop> merge_types(distributed<service::storage_proxy
static future<> merge_functions(distributed<service::storage_proxy>& proxy, schema_result before, schema_result after);
static future<> merge_aggregates(distributed<service::storage_proxy>& proxy, schema_result before, schema_result after, schema_result scylla_before, schema_result scylla_after);
static future<> do_merge_schema(distributed<service::storage_proxy>&, std::vector<mutation>, bool do_flush);
static future<> do_merge_schema(distributed<service::storage_proxy>&, std::vector<mutation>, bool do_flush, bool reload);
using computed_columns_map = std::unordered_map<bytes, column_computation_ptr>;
static computed_columns_map get_computed_columns(const schema_mutations& sm);
@@ -941,18 +945,18 @@ future<> update_schema_version_and_announce(sharded<db::system_keyspace>& sys_ks
* @throws ConfigurationException If one of metadata attributes has invalid value
* @throws IOException If data was corrupted during transportation or failed to apply fs operations
*/
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, gms::feature_service& feat, std::vector<mutation> mutations)
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, gms::feature_service& feat, std::vector<mutation> mutations, bool reload)
{
if (this_shard_id() != 0) {
// mutations must be applied on the owning shard (0).
co_await smp::submit_to(0, [&, fmuts = freeze(mutations)] () mutable -> future<> {
return merge_schema(sys_ks, proxy, feat, unfreeze(fmuts));
return merge_schema(sys_ks, proxy, feat, unfreeze(fmuts), reload);
});
co_return;
}
co_await with_merge_lock([&] () mutable -> future<> {
bool flush_schema = proxy.local().get_db().local().get_config().flush_schema_tables_after_modification();
co_await do_merge_schema(proxy, std::move(mutations), flush_schema);
co_await do_merge_schema(proxy, std::move(mutations), flush_schema, reload);
co_await update_schema_version_and_announce(sys_ks, proxy, feat.cluster_schema_features());
});
}
@@ -1095,7 +1099,7 @@ future<> store_column_mapping(distributed<service::storage_proxy>& proxy, schema
// and or filesystem calls, e.g. fsync.
constexpr size_t max_concurrent = 8;
static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush)
static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush, bool reload)
{
slogger.trace("do_merge_schema: {}", mutations);
schema_ptr s = keyspaces();
@@ -1110,6 +1114,12 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
delete_schema_version(mutation);
}
if (reload) {
for (auto&& ks : proxy.local().get_db().local().get_non_system_keyspaces()) {
keyspaces.emplace(ks);
}
}
// current state of the schema
auto&& old_keyspaces = co_await read_schema_for_keyspaces(proxy, KEYSPACES, keyspaces);
auto&& old_column_families = co_await read_tables_for_keyspaces(proxy, keyspaces, tables());
@@ -1145,18 +1155,16 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
auto types_to_drop = co_await merge_types(proxy, std::move(old_types), std::move(new_types));
co_await merge_tables_and_views(proxy,
std::move(old_column_families), std::move(new_column_families),
std::move(old_views), std::move(new_views));
std::move(old_views), std::move(new_views), reload);
co_await merge_functions(proxy, std::move(old_functions), std::move(new_functions));
co_await merge_aggregates(proxy, std::move(old_aggregates), std::move(new_aggregates), std::move(old_scylla_aggregates), std::move(new_scylla_aggregates));
co_await types_to_drop.drop();
co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
// it is safe to drop a keyspace only when all nested ColumnFamilies where deleted
for (auto keyspace_to_drop : keyspaces_to_drop) {
db.drop_keyspace(keyspace_to_drop);
co_await db.get_notifier().drop_keyspace(keyspace_to_drop);
}
});
auto& sharded_db = proxy.local().get_db();
// it is safe to drop a keyspace only when all nested ColumnFamilies where deleted
for (auto keyspace_to_drop : keyspaces_to_drop) {
co_await replica::database::drop_keyspace_on_all_shards(sharded_db, keyspace_to_drop);
}
}
future<lw_shared_ptr<query::result_set>> extract_scylla_specific_keyspace_info(distributed<service::storage_proxy>& proxy, const schema_result_value_type& partition) {
@@ -1205,19 +1213,18 @@ future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& p
slogger.info("Altering keyspace {}", key);
altered.emplace_back(key);
}
co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
for (auto&& val : created) {
auto scylla_specific_rs = co_await extract_scylla_specific_keyspace_info(proxy, val);
auto ksm = create_keyspace_from_schema_partition(val, std::move(scylla_specific_rs));
co_await db.create_keyspace(ksm, proxy.local().get_erm_factory());
co_await db.get_notifier().create_keyspace(ksm);
}
{
for (auto& name : altered) {
co_await db.update_keyspace(proxy, name);
};
}
});
auto& sharded_db = proxy.local().get_db();
for (auto&& val : created) {
auto scylla_specific_rs = co_await extract_scylla_specific_keyspace_info(proxy, val);
auto ksm = create_keyspace_from_schema_partition(val, std::move(scylla_specific_rs));
co_await replica::database::create_keyspace_on_all_shards(sharded_db, proxy, *ksm);
}
for (auto& name : altered) {
auto v = co_await db::schema_tables::read_schema_partition_for_keyspace(proxy, db::schema_tables::KEYSPACES, name);
auto scylla_specific_rs = co_await db::schema_tables::extract_scylla_specific_keyspace_info(proxy, v);
auto tmp_ksm = db::schema_tables::create_keyspace_from_schema_partition(v, scylla_specific_rs);
co_await replica::database::update_keyspace_on_all_shards(sharded_db, proxy, *tmp_ksm);
}
co_return dropped;
}
@@ -1252,6 +1259,7 @@ enum class schema_diff_side {
static schema_diff diff_table_or_view(distributed<service::storage_proxy>& proxy,
std::map<table_id, schema_mutations>&& before,
std::map<table_id, schema_mutations>&& after,
bool reload,
noncopyable_function<schema_ptr (schema_mutations sm, schema_diff_side)> create_schema)
{
schema_diff d;
@@ -1272,6 +1280,13 @@ static schema_diff diff_table_or_view(distributed<service::storage_proxy>& proxy
slogger.info("Altering {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
d.altered.emplace_back(schema_diff::altered_schema{s_before, s});
}
if (reload) {
for (auto&& key: diff.entries_in_common) {
auto s = create_schema(std::move(after.at(key)), schema_diff_side::right);
slogger.info("Reloading {}.{} id={} version={}", s->ks_name(), s->cf_name(), s->id(), s->version());
d.altered.emplace_back(schema_diff::altered_schema {s, s});
}
}
return d;
}
@@ -1284,12 +1299,13 @@ static future<> merge_tables_and_views(distributed<service::storage_proxy>& prox
std::map<table_id, schema_mutations>&& tables_before,
std::map<table_id, schema_mutations>&& tables_after,
std::map<table_id, schema_mutations>&& views_before,
std::map<table_id, schema_mutations>&& views_after)
std::map<table_id, schema_mutations>&& views_after,
bool reload)
{
auto tables_diff = diff_table_or_view(proxy, std::move(tables_before), std::move(tables_after), [&] (schema_mutations sm, schema_diff_side) {
auto tables_diff = diff_table_or_view(proxy, std::move(tables_before), std::move(tables_after), reload, [&] (schema_mutations sm, schema_diff_side) {
return create_table_from_mutations(proxy, std::move(sm));
});
auto views_diff = diff_table_or_view(proxy, std::move(views_before), std::move(views_after), [&] (schema_mutations sm, schema_diff_side side) {
auto views_diff = diff_table_or_view(proxy, std::move(views_before), std::move(views_after), reload, [&] (schema_mutations sm, schema_diff_side side) {
// The view schema mutation should be created with reference to the base table schema because we definitely know it by now.
// If we don't do it we are leaving a window where write commands to this schema are illegal.
// There are 3 possibilities:
@@ -2216,15 +2232,15 @@ std::vector<mutation> make_create_aggregate_mutations(schema_features features,
mutation& m = p.first;
clustering_key& ckey = p.second;
data_type state_type = aggregate->sfunc().arg_types()[0];
data_type state_type = aggregate->sfunc()->arg_types()[0];
if (aggregate->has_finalfunc()) {
m.set_clustered_cell(ckey, "final_func", aggregate->finalfunc().name().name, timestamp);
m.set_clustered_cell(ckey, "final_func", aggregate->finalfunc()->name().name, timestamp);
}
if (aggregate->initcond()) {
m.set_clustered_cell(ckey, "initcond", state_type->deserialize(*aggregate->initcond()).to_parsable_string(), timestamp);
}
m.set_clustered_cell(ckey, "return_type", aggregate->return_type()->as_cql3_type().to_string(), timestamp);
m.set_clustered_cell(ckey, "state_func", aggregate->sfunc().name().name, timestamp);
m.set_clustered_cell(ckey, "state_func", aggregate->sfunc()->name().name, timestamp);
m.set_clustered_cell(ckey, "state_type", state_type->as_cql3_type().to_string(), timestamp);
std::vector<mutation> muts = {m};
@@ -2233,7 +2249,7 @@ std::vector<mutation> make_create_aggregate_mutations(schema_features features,
auto sa_p = get_mutation(sa_schema, *aggregate);
mutation& sa_mut = sa_p.first;
clustering_key& sa_ckey = sa_p.second;
sa_mut.set_clustered_cell(sa_ckey, "reduce_func", aggregate->reducefunc().name().name, timestamp);
sa_mut.set_clustered_cell(sa_ckey, "reduce_func", aggregate->reducefunc()->name().name, timestamp);
sa_mut.set_clustered_cell(sa_ckey, "state_type", state_type->as_cql3_type().to_string(), timestamp);
muts.emplace_back(sa_mut);
@@ -2964,7 +2980,7 @@ schema_ptr create_table_from_mutations(const schema_ctxt& ctxt, schema_mutations
if (version) {
builder.with_version(*version);
} else {
builder.with_version(sm.digest());
builder.with_version(sm.digest(ctxt.features().cluster_schema_features()));
}
if (auto partitioner = sm.partitioner()) {
@@ -3195,7 +3211,7 @@ view_ptr create_view_from_mutations(const schema_ctxt& ctxt, schema_mutations sm
if (version) {
builder.with_version(*version);
} else {
builder.with_version(sm.digest());
builder.with_version(sm.digest(ctxt.features().cluster_schema_features()));
}
auto base_id = table_id(row.get_nonnull<utils::UUID>("base_table_id"));
@@ -3524,7 +3540,8 @@ view_ptr maybe_fix_legacy_secondary_index_mv_schema(replica::database& db, const
if (v->clustering_key_size() == 0) {
return view_ptr(nullptr);
}
const column_definition& first_view_ck = v->clustering_key_columns().front();
const auto ck_cols = v->clustering_key_columns();
const column_definition& first_view_ck = ck_cols.front();
if (first_view_ck.is_computed()) {
return view_ptr(nullptr);
}

View File

@@ -14,6 +14,7 @@
#include "schema_fwd.hh"
#include "schema_features.hh"
#include "hashing.hh"
#include "gms/feature_service.hh"
#include "schema_mutations.hh"
#include "types/map.hh"
#include "query-result-set.hh"
@@ -66,8 +67,8 @@ class config;
class schema_ctxt {
public:
schema_ctxt(const config&, std::shared_ptr<data_dictionary::user_types_storage> uts);
schema_ctxt(const replica::database&);
schema_ctxt(const config&, std::shared_ptr<data_dictionary::user_types_storage> uts, const gms::feature_service&, replica::database* = nullptr);
schema_ctxt(replica::database&);
schema_ctxt(distributed<replica::database>&);
schema_ctxt(distributed<service::storage_proxy>&);
@@ -87,7 +88,17 @@ public:
return *_user_types;
}
const gms::feature_service& features() const {
return _features;
}
replica::database* get_db() const {
return _db;
}
private:
replica::database* _db;
const gms::feature_service& _features;
const db::extensions& _extensions;
const unsigned _murmur3_partitioner_ignore_msb_bits;
const uint32_t _schema_registry_grace_period;
@@ -128,6 +139,7 @@ schema_ptr indexes();
schema_ptr tables();
schema_ptr scylla_tables(schema_features features = schema_features::full());
schema_ptr views();
schema_ptr types();
schema_ptr computed_columns();
// Belongs to the "system" keyspace
schema_ptr scylla_table_schema_history();
@@ -184,7 +196,7 @@ future<mutation> read_keyspace_mutation(distributed<service::storage_proxy>&, co
// Must be called on shard 0.
future<semaphore_units<>> hold_merge_lock() noexcept;
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, gms::feature_service& feat, std::vector<mutation> mutations);
future<> merge_schema(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, gms::feature_service& feat, std::vector<mutation> mutations, bool reload);
// Recalculates the local schema version.
//

View File

@@ -295,7 +295,7 @@ future<> size_estimates_mutation_reader::fast_forward_to(const dht::partition_ra
}
future<> size_estimates_mutation_reader::fast_forward_to(position_range pr) {
forward_buffer_to(pr.start());
clear_buffer();
_end_of_stream = false;
if (_partition_reader) {
return _partition_reader->fast_forward_to(std::move(pr));

View File

@@ -486,7 +486,7 @@ system_distributed_keyspace::read_cdc_topology_description(
return {};
}
std::vector<cdc::token_range_description> entries;
utils::chunked_vector<cdc::token_range_description> entries;
auto entries_val = value_cast<list_type_impl::native_type>(
cdc_generation_description_type->deserialize(cql_result->one().get_view("description")));
@@ -580,7 +580,7 @@ system_distributed_keyspace::insert_cdc_generation(
future<std::optional<cdc::topology_description>>
system_distributed_keyspace::read_cdc_generation(utils::UUID id) {
std::vector<cdc::token_range_description> entries;
utils::chunked_vector<cdc::token_range_description> entries;
auto num_ranges = 0;
co_await _qp.query_internal(
// This should be a local read so 20s should be more than enough

View File

@@ -2276,7 +2276,10 @@ public:
add_partition(mutation_sink, "trace_probability", format("{:.2}", tracing::tracing::get_local_tracing_instance().get_trace_probability()));
co_await add_partition(mutation_sink, "memory", [this] () {
struct stats {
uint64_t total = 0;
// take the pre-reserved memory into account, as seastar only returns
// the stats of memory managed by the seastar allocator, but we instruct
// it to reserve addition memory for system.
uint64_t total = db::config::wasm_udf_reserved_memory;
uint64_t free = 0;
static stats reduce(stats a, stats b) { return stats{a.total + b.total, a.free + b.free}; }
};
@@ -2836,8 +2839,7 @@ static void install_virtual_readers(db::system_keyspace& sys_ks, replica::databa
static bool maybe_write_in_user_memory(schema_ptr s) {
return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
|| s == system_keyspace::v3::scylla_views_builds_in_progress()
|| s == system_keyspace::raft();
|| s == system_keyspace::v3::scylla_views_builds_in_progress();
}
future<> system_keyspace_make(db::system_keyspace& sys_ks, distributed<replica::database>& dist_db, distributed<service::storage_service>& dist_ss, sharded<gms::gossiper>& dist_gossiper, distributed<service::raft_group_registry>& dist_raft_gr, db::config& cfg, table_selector& tables) {
@@ -3344,11 +3346,11 @@ mutation system_keyspace::make_group0_history_state_id_mutation(
using namespace std::chrono;
assert(*gc_older_than >= gc_clock::duration{0});
auto ts_millis = duration_cast<milliseconds>(microseconds{ts});
auto gc_older_than_millis = duration_cast<milliseconds>(*gc_older_than);
assert(gc_older_than_millis < ts_millis);
auto ts_micros = microseconds{ts};
auto gc_older_than_micros = duration_cast<microseconds>(*gc_older_than);
assert(gc_older_than_micros < ts_micros);
auto tomb_upper_bound = utils::UUID_gen::min_time_UUID(ts_millis - gc_older_than_millis);
auto tomb_upper_bound = utils::UUID_gen::min_time_UUID(ts_micros - gc_older_than_micros);
// We want to delete all entries with IDs smaller than `tomb_upper_bound`
// but the deleted range is of the form (x, +inf) since the schema is reversed.
auto range = query::clustering_range::make_starting_with({

View File

@@ -11,6 +11,8 @@
#include "db/tags/extension.hh"
#include "schema_builder.hh"
#include "schema_registry.hh"
#include "service/storage_proxy.hh"
#include "data_dictionary/data_dictionary.hh"
namespace db {
@@ -38,14 +40,27 @@ std::optional<std::string> find_tag(const schema& s, const sstring& tag) {
}
}
future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map) {
co_await mm.container().invoke_on(0, [s = global_schema_ptr(std::move(schema)), tags_map = std::move(tags_map)] (service::migration_manager& mm) -> future<> {
future<> modify_tags(service::migration_manager& mm, sstring ks, sstring cf,
std::function<void(std::map<sstring, sstring>&)> modify) {
co_await mm.container().invoke_on(0, [ks = std::move(ks), cf = std::move(cf), modify = std::move(modify)] (service::migration_manager& mm) -> future<> {
// FIXME: the following needs to be in a loop. If mm.announce() below
// fails, we need to retry the whole thing.
auto group0_guard = co_await mm.start_group0_operation();
// After getting the schema-modification lock, we need to read the
// table's *current* schema - it might have changed before we got
// the lock, by some concurrent modification. If the table is gone,
// this will throw no_such_column_family.
schema_ptr s = mm.get_storage_proxy().data_dictionary().find_schema(ks, cf);
const std::map<sstring, sstring>* tags_ptr = get_tags_of_table(s);
std::map<sstring, sstring> tags;
if (tags_ptr) {
// tags_ptr is a constant pointer to schema data. To allow func()
// to modify the tags, we must make a copy.
tags = *tags_ptr;
}
modify(tags);
schema_builder builder(s);
builder.add_extension(tags_extension::NAME, ::make_shared<tags_extension>(tags_map));
builder.add_extension(tags_extension::NAME, ::make_shared<tags_extension>(tags));
auto m = co_await mm.prepare_column_family_update_announcement(builder.build(), false, std::vector<view_ptr>(), group0_guard.write_timestamp());

View File

@@ -33,9 +33,18 @@ const std::map<sstring, sstring>* get_tags_of_table(schema_ptr schema);
// tags exist but not this tag.
std::optional<std::string> find_tag(const schema& s, const sstring& tag);
// FIXME: Updating tags currently relies on updating schema, which may be subject
// to races during concurrent updates of the same table. Once Scylla schema updates
// are fixed, this issue will automatically get fixed as well.
future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map);
// modify_tags() atomically modifies the tags on a given table: It reads the
// existing tags, passes them as a map to the given function which can modify
// the map, and finally writes the modified tags. This read-modify-write
// operation is atomic - isolated from other concurrent schema operations.
//
// The isolation requirement is also why modify_tags() takes the table's name
// ks,cf and not a schema object - the current schema may not be relevant by
// the time the tags are modified, due to some other concurrent modification.
// If a table (ks, cf) doesn't exist, no_such_column_family is thrown.
//
// If the table didn't have the tags schema extension, it's fine: The function
// is passed an empty map, and the tags it adds will be added to the table.
future<> modify_tags(service::migration_manager& mm, sstring ks, sstring cf,
std::function<void(std::map<sstring, sstring>&)> modify_func);
}

View File

@@ -172,7 +172,7 @@ class build_progress_virtual_reader {
}
virtual future<> fast_forward_to(position_range range) override {
forward_buffer_to(range.start());
clear_buffer();
_end_of_stream = false;
return _underlying.fast_forward_to(std::move(range));
}
@@ -197,7 +197,7 @@ public:
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr) {
return flat_mutation_reader_v2(std::make_unique<build_progress_reader>(
std::move(s),
s,
std::move(permit),
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
range,

View File

@@ -85,29 +85,25 @@ future<row_locker::lock_holder>
row_locker::lock_ck(const dht::decorated_key& pk, const clustering_key_prefix& cpk, bool exclusive, db::timeout_clock::time_point timeout, stats& stats) {
mylog.debug("taking shared lock on partition {}, and {} lock on row {} in it", pk, (exclusive ? "exclusive" : "shared"), cpk);
auto tracker = latency_stats_tracker(exclusive ? stats.exclusive_row : stats.shared_row);
auto ck = cpk;
// Create a two-level lock entry for the partition if it doesn't exist already.
auto i = _two_level_locks.try_emplace(pk, this).first;
// The two-level lock entry we've just created is guaranteed to be kept alive as long as it's locked.
// Initiating read locking in the background below ensures that even if the two-level lock is currently
// write-locked, releasing the write-lock will synchronously engage any waiting
// locks and will keep the entry alive.
future<lock_type::holder> lock_partition = i->second._partition_lock.hold_read_lock(timeout);
auto j = i->second._row_locks.find(cpk);
if (j == i->second._row_locks.end()) {
// Not yet locked, need to create the lock. This makes a copy of cpk.
try {
j = i->second._row_locks.emplace(cpk, lock_type()).first;
} catch(...) {
// If this emplace() failed, e.g., out of memory, we fail. We
// could do nothing - the partition lock we already started
// taking will be unlocked automatically after being locked.
// But it's better form to wait for the work we started, and it
// will also allow us to remove the hash-table row we added.
return lock_partition.then([ex = std::current_exception()] (auto lock) {
// The lock is automatically released when "lock" goes out of scope.
// TODO: unlock (lock = {}) now, search for the partition in the
// hash table (we know it's still there, because we held the lock until
// now) and remove the unused lock from the hash table if still unused.
return make_exception_future<row_locker::lock_holder>(std::current_exception());
});
return lock_partition.then([this, pk = &i->first, row_locks = &i->second._row_locks, ck = std::move(ck), exclusive, tracker = std::move(tracker), timeout] (auto lock1) mutable {
auto j = row_locks->find(ck);
if (j == row_locks->end()) {
// Not yet locked, need to create the lock.
j = row_locks->emplace(std::move(ck), lock_type()).first;
}
}
return lock_partition.then([this, pk = &i->first, cpk = &j->first, &row_lock = j->second, exclusive, tracker = std::move(tracker), timeout] (auto lock1) mutable {
auto* cpk = &j->first;
auto& row_lock = j->second;
// Like to the two-level lock entry above, the row_lock entry we've just created
// is guaranteed to be kept alive as long as it's locked.
// Initiating read/write locking in the background below ensures that.
auto lock_row = exclusive ? row_lock.hold_write_lock(timeout) : row_lock.hold_read_lock(timeout);
return lock_row.then([this, pk, cpk, exclusive, tracker = std::move(tracker), lock1 = std::move(lock1)] (auto lock2) mutable {
lock1.release();

View File

@@ -487,37 +487,56 @@ mutation_partition& view_updates::partition_for(partition_key&& key) {
}
size_t view_updates::op_count() const {
return _op_count++;;
return _op_count;
}
row_marker view_updates::compute_row_marker(const clustering_or_static_row& base_row) const {
/*
* We need to compute both the timestamp and expiration.
* We need to compute both the timestamp and expiration for view rows.
*
* There are 3 cases:
* 1) There is a column that is not in the base PK but is in the view PK. In that case, as long as that column
* lives, the view entry does too, but as soon as it expires (or is deleted for that matter) the entry also
* should expire. So the expiration for the view is the one of that column, regardless of any other expiration.
* To take an example of that case, if you have:
* CREATE TABLE t (a int, b int, c int, PRIMARY KEY (a, b))
* CREATE MATERIALIZED VIEW mv AS SELECT * FROM t WHERE c IS NOT NULL AND a IS NOT NULL AND b IS NOT NULL PRIMARY KEY (c, a, b)
* INSERT INTO t(a, b) VALUES (0, 0) USING TTL 3;
* UPDATE t SET c = 0 WHERE a = 0 AND b = 0;
* then even after 3 seconds elapsed, the row will still exist (it just won't have a "row marker" anymore) and so
* the MV should still have a corresponding entry.
* This cell determines the liveness of the view row.
* 2) The columns for the base and view PKs are exactly the same, and all base columns are selected by the view.
* In that case, all components (marker, deletion and cells) are the same and trivially mapped.
* 3) The columns for the base and view PKs are exactly the same, but some base columns are not selected in the view.
* Use the max timestamp out of the base row marker and all the unselected columns - this ensures we can keep the
* view row alive. Do the same thing for the expiration, if the marker is dead or will expire, and so
* will all unselected columns.
* Below there are several distinct cases depending on how many new key
* columns the view has - i.e., how many of the view's key columns were
* regular columns in the base. base_regular_columns_in_view_pk.size():
*
* Zero new key columns:
* The view rows key is composed only from base key columns, and those
* cannot be changed in an update, so the view row remains alive as
* long as the base row is alive. We need to return the same row
* marker as the base for the view - to keep an empty view row alive
* for as long as an empty base row exists.
* Note that in this case, if there are *unselected* base columns, we
* may need to keep an empty view row alive even without a row marker
* because the base row (which has additional columns) is still alive.
* For that we have the "virtual columns" feature: In the zero new
* key columns case, we put unselected columns in the view as empty
* columns, to keep the view row alive.
*
* One new key column:
* In this case, there is a regular base column that is part of the
* view key. This regular column can be added or deleted in an update,
* or its expiration be set, and those can cause the view row -
* including its row marker - to need to appear or disappear as well.
* So the liveness of cell of this one column determines the liveness
* of the view row and the row marker that we return.
*
* Two or more new key columns:
* This case is explicitly NOT supported in CQL - one cannot create a
* view with more than one base-regular columns in its key. In general
* picking one liveness (timestamp and expiration) is not possible
* if there are multiple regular base columns in the view key, as
* those can have different liveness.
* However, we do allow this case for Alternator - we need to allow
* the case of two (but not more) because the DynamoDB API allows
* creating a GSI whose two key columns (hash and range key) were
* regular columns.
* We can support this case in Alternator because it doesn't use
* expiration (the "TTL" it does support is different), and doesn't
* support user-defined timestamps. But, the two columns can still
* have different timestamps - this happens if an update modifies
* just one of them. In this case the timestamp of the view update
* (and that of the row marker we return) is the later of these two
* updated columns.
*/
// WARNING: The code assumes that if multiple regular base columns are present in the view key,
// they share liveness information. It's true especially in the only case currently allowed by CQL,
// which assumes there's up to one non-pk column in the view key. It's also true in alternator,
// which does not carry TTL information.
const auto& col_ids = base_row.is_clustering_row()
? _base_info->base_regular_columns_in_view_pk()
: _base_info->base_static_columns_in_view_pk();
@@ -525,7 +544,20 @@ row_marker view_updates::compute_row_marker(const clustering_or_static_row& base
auto& def = _base->column_at(base_row.column_kind(), col_ids[0]);
// Note: multi-cell columns can't be part of the primary key.
auto cell = base_row.cells().cell_at(col_ids[0]).as_atomic_cell(def);
return cell.is_live_and_has_ttl() ? row_marker(cell.timestamp(), cell.ttl(), cell.expiry()) : row_marker(cell.timestamp());
auto ts = cell.timestamp();
if (col_ids.size() > 1){
// As explained above, this case only happens in Alternator,
// and we may need to pick a higher ts:
auto& second_def = _base->column_at(base_row.column_kind(), col_ids[1]);
auto second_cell = base_row.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have TTL or more than two col_ids!
if (col_ids.size() != 2 || cell.is_live_and_has_ttl() || second_cell.is_live_and_has_ttl()) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {} or has TTL", col_ids.size()));
}
}
return cell.is_live_and_has_ttl() ? row_marker(ts, cell.ttl(), cell.expiry()) : row_marker(ts);
}
return base_row.marker();
@@ -923,8 +955,22 @@ void view_updates::do_delete_old_entry(const partition_key& base_key, const clus
// Note: multi-cell columns can't be part of the primary key.
auto& def = _base->column_at(kind, col_ids[0]);
auto cell = existing.cells().cell_at(col_ids[0]).as_atomic_cell(def);
auto ts = cell.timestamp();
if (col_ids.size() > 1) {
// This is the Alternator-only support for two regular base
// columns that become view key columns. See explanation in
// view_updates::compute_row_marker().
auto& second_def = _base->column_at(kind, col_ids[1]);
auto second_cell = existing.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have more than two col_ids!
if (col_ids.size() != 2) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {}", col_ids.size()));
}
}
if (cell.is_live()) {
r->apply(shadowable_tombstone(cell.timestamp(), now));
r->apply(shadowable_tombstone(ts, now));
}
} else {
// "update" caused the base row to have been deleted, and !col_id
@@ -1308,11 +1354,12 @@ void view_update_builder::generate_update(static_row&& update, const tombstone&
future<stop_iteration> view_update_builder::on_results() {
constexpr size_t max_rows_for_view_updates = 100;
size_t rows_for_view_updates = std::accumulate(_view_updates.begin(), _view_updates.end(), 0, [] (size_t acc, const view_updates& vu) {
return acc + vu.op_count();
});
const bool stop_updates = rows_for_view_updates >= max_rows_for_view_updates;
auto should_stop_updates = [this] () -> bool {
size_t rows_for_view_updates = std::accumulate(_view_updates.begin(), _view_updates.end(), 0, [] (size_t acc, const view_updates& vu) {
return acc + vu.op_count();
});
return rows_for_view_updates >= max_rows_for_view_updates;
};
if (_update && !_update->is_end_of_partition() && _existing && !_existing->is_end_of_partition()) {
auto cmp = position_in_partition::tri_compare(*_schema)(_update->position(), _existing->position());
if (cmp < 0) {
@@ -1335,7 +1382,7 @@ future<stop_iteration> view_update_builder::on_results() {
: std::nullopt;
generate_update(std::move(update), _update_partition_tombstone, std::move(existing), _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_updates();
return should_stop_updates() ? stop() : advance_updates();
}
if (cmp > 0) {
// We have something existing but no update (which will happen either because it's a range tombstone marker in
@@ -1371,7 +1418,7 @@ future<stop_iteration> view_update_builder::on_results() {
generate_update(std::move(update), _update_partition_tombstone, { std::move(existing) }, _existing_partition_tombstone);
}
}
return stop_updates ? stop () : advance_existings();
return should_stop_updates() ? stop () : advance_existings();
}
// We're updating a row that had pre-existing data
if (_update->is_range_tombstone_change()) {
@@ -1393,8 +1440,9 @@ future<stop_iteration> view_update_builder::on_results() {
mutation_fragment_v2::printer(*_schema, *_update), mutation_fragment_v2::printer(*_schema, *_existing)));
}
generate_update(std::move(*_update).as_static_row(), _update_partition_tombstone, { std::move(*_existing).as_static_row() }, _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_all();
return should_stop_updates() ? stop() : advance_all();
}
auto tombstone = std::max(_update_partition_tombstone, _update_current_tombstone);
@@ -1409,7 +1457,7 @@ future<stop_iteration> view_update_builder::on_results() {
auto update = static_row();
generate_update(std::move(update), _update_partition_tombstone, { std::move(existing) }, _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_existings();
return should_stop_updates() ? stop() : advance_existings();
}
// If we have updates and it's a range tombstone, it removes nothing pre-exisiting, so we can ignore it
@@ -1430,7 +1478,7 @@ future<stop_iteration> view_update_builder::on_results() {
: std::nullopt;
generate_update(std::move(*_update).as_static_row(), _update_partition_tombstone, std::move(existing), _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_updates();
return should_stop_updates() ? stop() : advance_updates();
}
return stop();
@@ -1609,6 +1657,13 @@ static bool should_update_synchronously(const schema& s) {
return *tag_opt == "true";
}
size_t memory_usage_of(const frozen_mutation_and_schema& mut) {
// Overhead of sending a view mutation, in terms of data structures used by the storage_proxy, as well as possible background tasks
// allocated for a remote view update.
constexpr size_t base_overhead_bytes = 2288;
return base_overhead_bytes + mut.fm.representation().size();
}
// Take the view mutations generated by generate_view_updates(), which pertain
// to a modification of a single base partition, and apply them to the
// appropriate paired replicas. This is done asynchronously - we do not wait
@@ -1630,7 +1685,7 @@ future<> mutate_MV(
auto& keyspace_name = mut.s->ks_name();
auto target_endpoint = get_view_natural_endpoint(keyspace_name, base_token, view_token);
auto remote_endpoints = service::get_local_storage_proxy().get_token_metadata_ptr()->pending_endpoints_for(view_token, keyspace_name);
auto sem_units = pending_view_updates.split(mut.fm.representation().size());
auto sem_units = pending_view_updates.split(memory_usage_of(mut));
const bool update_synchronously = should_update_synchronously(*mut.s);
if (update_synchronously) {
@@ -1676,7 +1731,7 @@ future<> mutate_MV(
auto mut_ptr = remote_endpoints.empty() ? std::make_unique<frozen_mutation>(std::move(mut.fm)) : std::make_unique<frozen_mutation>(mut.fm);
tracing::trace(tr_state, "Locally applying view update for {}.{}; base token = {}; view token = {}",
mut.s->ks_name(), mut.s->cf_name(), base_token, view_token);
local_view_update = service::get_local_storage_proxy().mutate_locally(mut.s, *mut_ptr, tr_state, db::commitlog::force_sync::no).then_wrapped(
local_view_update = service::get_local_storage_proxy().mutate_mv_locally(mut.s, *mut_ptr, tr_state, db::commitlog::force_sync::no).then_wrapped(
[s = mut.s, &stats, &cf_stats, tr_state, base_token, view_token, my_address, mut_ptr = std::move(mut_ptr),
units = sem_units.split(sem_units.count())] (future<>&& f) {
--stats.writes;
@@ -1825,6 +1880,8 @@ future<> view_builder::start(service::migration_manager& mm) {
(void)_build_step.trigger();
return make_ready_future<>();
});
}).handle_exception_type([] (const seastar::sleep_aborted& e) {
vlogger.debug("start aborted: {}", e.what());
}).handle_exception([] (std::exception_ptr eptr) {
vlogger.error("start failed: {}", eptr);
return make_ready_future<>();
@@ -2523,32 +2580,33 @@ update_backlog node_update_backlog::add_fetch(unsigned shard, update_backlog bac
return std::max(backlog, _max.load(std::memory_order_relaxed));
}
future<bool> check_view_build_ongoing(db::system_distributed_keyspace& sys_dist_ks, const sstring& ks_name, const sstring& cf_name) {
return sys_dist_ks.view_status(ks_name, cf_name).then([] (std::unordered_map<locator::host_id, sstring>&& view_statuses) {
return boost::algorithm::any_of(view_statuses | boost::adaptors::map_values, [] (const sstring& view_status) {
return view_status == "STARTED";
future<bool> check_view_build_ongoing(db::system_distributed_keyspace& sys_dist_ks, const locator::token_metadata& tm, const sstring& ks_name,
const sstring& cf_name) {
using view_statuses_type = std::unordered_map<locator::host_id, sstring>;
return sys_dist_ks.view_status(ks_name, cf_name).then([&tm] (view_statuses_type&& view_statuses) {
return boost::algorithm::any_of(view_statuses, [&tm] (const view_statuses_type::value_type& view_status) {
// Only consider status of known hosts.
return view_status.second == "STARTED" && tm.get_endpoint_for_host_id(view_status.first);
});
});
}
future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_dist_ks, const replica::table& t, streaming::stream_reason reason) {
future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_dist_ks, const locator::token_metadata& tm, const replica::table& t,
streaming::stream_reason reason) {
if (is_internal_keyspace(t.schema()->ks_name())) {
return make_ready_future<bool>(false);
}
if (reason == streaming::stream_reason::repair && !t.views().empty()) {
return make_ready_future<bool>(true);
}
return do_with(t.views(), [&sys_dist_ks] (auto& views) {
return do_with(t.views(), [&sys_dist_ks, &tm] (auto& views) {
return map_reduce(views,
[&sys_dist_ks] (const view_ptr& view) { return check_view_build_ongoing(sys_dist_ks, view->ks_name(), view->cf_name()); },
[&sys_dist_ks, &tm] (const view_ptr& view) { return check_view_build_ongoing(sys_dist_ks, tm, view->ks_name(), view->cf_name()); },
false,
std::logical_or<bool>());
});
}
const size_t view_updating_consumer::buffer_size_soft_limit{1 * 1024 * 1024};
const size_t view_updating_consumer::buffer_size_hard_limit{2 * 1024 * 1024};
void view_updating_consumer::do_flush_buffer() {
_staging_reader_handle.pause();
@@ -2571,6 +2629,10 @@ void view_updating_consumer::do_flush_buffer() {
}
void view_updating_consumer::flush_builder() {
_buffer.emplace_back(_mut_builder->flush());
}
void view_updating_consumer::end_builder() {
_mut_builder->consume_end_of_partition();
if (auto mut_opt = _mut_builder->consume_end_of_stream()) {
_buffer.emplace_back(std::move(*mut_opt));
@@ -2579,11 +2641,9 @@ void view_updating_consumer::flush_builder() {
}
void view_updating_consumer::maybe_flush_buffer_mid_partition() {
if (_buffer_size >= buffer_size_hard_limit) {
if (_buffer_size >= _buffer_size_hard_limit) {
flush_builder();
auto dk = _buffer.back().decorated_key();
do_flush_buffer();
consume_new_partition(dk);
}
}

View File

@@ -213,7 +213,7 @@ class view_updates final {
schema_ptr _base;
base_info_ptr _base_info;
std::unordered_map<partition_key, mutation_partition, partition_key::hashing, partition_key::equality> _updates;
mutable size_t _op_count = 0;
size_t _op_count = 0;
public:
explicit view_updates(view_and_base vab)
: _view(std::move(vab.view))
@@ -327,6 +327,8 @@ future<> mutate_MV(
service::allow_hints allow_hints,
wait_for_all_updates wait_for_all);
size_t memory_usage_of(const frozen_mutation_and_schema& mut);
/**
* create_virtual_column() adds a "virtual column" to a schema builder.
* The definition of a "virtual column" is based on the given definition

View File

@@ -22,9 +22,13 @@ class system_distributed_keyspace;
}
namespace locator {
class token_metadata;
}
namespace db::view {
future<bool> check_view_build_ongoing(db::system_distributed_keyspace& sys_dist_ks, const sstring& ks_name, const sstring& cf_name);
future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_dist_ks, const replica::table& t, streaming::stream_reason reason);
future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_dist_ks, const locator::token_metadata& tm, const replica::table& t,
streaming::stream_reason reason);
}

View File

@@ -157,11 +157,11 @@ future<> view_update_generator::start() {
service::get_local_streaming_priority(),
nullptr,
::mutation_reader::forwarding::no);
auto close_sr = deferred_close(staging_sstable_reader);
inject_failure("view_update_generator_consume_staging_sstable");
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(s, std::move(permit), *t, sstables, _as, staging_sstable_reader_handle),
dht::incremental_owned_ranges_checker::make_partition_filter(_db.get_keyspace_local_ranges(s->ks_name())));
staging_sstable_reader.close().get();
if (result == stop_iteration::yes) {
break;
}

View File

@@ -33,8 +33,17 @@ public:
// We prefer flushing on partition boundaries, so at the end of a partition,
// we flush on reaching the soft limit. Otherwise we continue accumulating
// data. We flush mid-partition if we reach the hard limit.
static const size_t buffer_size_soft_limit;
static const size_t buffer_size_hard_limit;
static constexpr size_t buffer_size_soft_limit_default = 1 * 1024 * 1024;
static constexpr size_t buffer_size_hard_limit_default = 2 * 1024 * 1024;
private:
size_t _buffer_size_soft_limit = buffer_size_soft_limit_default;
size_t _buffer_size_hard_limit = buffer_size_hard_limit_default;
public:
// Meant only for usage in tests.
void set_buffer_size_limit_for_testing_purposes(size_t sz) {
_buffer_size_soft_limit = sz;
_buffer_size_hard_limit = sz;
}
private:
schema_ptr _schema;
@@ -49,6 +58,7 @@ private:
private:
void do_flush_buffer();
void flush_builder();
void end_builder();
void maybe_flush_buffer_mid_partition();
public:
@@ -71,7 +81,11 @@ public:
void consume_new_partition(const dht::decorated_key& dk) {
_mut_builder.emplace(_schema);
_mut_builder->consume_new_partition(dk);
// Further accounting is inaccurate as we base it on the consumed
// mutation-fragments, not on their final form in the mutation.
// This is good enough, as long as the difference is small and mostly
// constant (per fragment).
_buffer_size += _mut_builder->consume_new_partition(dk).memory_usage(*_schema);
}
void consume(tombstone t) {
@@ -113,8 +127,8 @@ public:
if (_as->abort_requested()) {
return stop_iteration::yes;
}
flush_builder();
if (_buffer_size >= buffer_size_soft_limit) {
end_builder();
if (_buffer_size >= _buffer_size_soft_limit) {
do_flush_buffer();
}
return stop_iteration::no;

View File

@@ -478,7 +478,15 @@ static future<bool> ping_with_timeout(pinger::endpoint_id id, clock::timepoint_t
auto f = pinger.ping(id, timeout_as);
auto sleep_and_abort = [] (clock::timepoint_t timeout, abort_source& timeout_as, clock& c) -> future<> {
co_await c.sleep_until(timeout, timeout_as);
co_await c.sleep_until(timeout, timeout_as).then_wrapped([&timeout_as] (auto&& f) {
// Avoid throwing if sleep was aborted.
if (f.failed() && timeout_as.abort_requested()) {
// Expected (if ping() resolved first or we were externally aborted).
f.ignore_ready_future();
return make_ready_future<>();
}
return f;
});
if (!timeout_as.abort_requested()) {
// We resolved before `f`. Abort the operation.
timeout_as.request_abort();
@@ -501,8 +509,6 @@ static future<bool> ping_with_timeout(pinger::endpoint_id id, clock::timepoint_t
// Wait on the sleep as well (it should return shortly, being aborted) so we don't discard the future.
try {
co_await std::move(sleep_and_abort);
} catch (const sleep_aborted&) {
// Expected (if `f` resolved first or we were externally aborted).
} catch (...) {
// There should be no other exceptions, but just in case... log it and discard,
// we want to propagate exceptions from `f`, not from sleep.

View File

@@ -42,7 +42,8 @@ if __name__ == '__main__':
if systemd_unit.available('systemd-coredump@.service'):
dropin = '''
[Service]
TimeoutStartSec=infinity
RuntimeMaxSec=infinity
TimeoutSec=infinity
'''[1:-1]
os.makedirs('/etc/systemd/system/systemd-coredump@.service.d', exist_ok=True)
with open('/etc/systemd/system/systemd-coredump@.service.d/timeout.conf', 'w') as f:
@@ -61,8 +62,7 @@ ExternalSizeMax=1024G
[Unit]
Description=Save coredump to scylla data directory
Conflicts=umount.target
Before=scylla-server.service
After=local-fs.target
Before=local-fs.target scylla-server.service
DefaultDependencies=no
[Mount]
@@ -72,7 +72,7 @@ Type=none
Options=bind
[Install]
WantedBy=multi-user.target
WantedBy=local-fs.target
'''[1:-1]
with open('/etc/systemd/system/var-lib-systemd-coredump.mount', 'w') as f:
f.write(dot_mount)

View File

@@ -16,7 +16,7 @@ if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
sys.exit(1)
systemd_unit('scylla-fstrim.timer').unmask()
systemd_unit('scylla-fstrim.timer').enable()
systemd_unit('scylla-fstrim.timer').start()
if is_redhat_variant() or is_arch() or is_suse_variant():
systemd_unit('fstrim.timer').disable()

View File

@@ -25,7 +25,7 @@ if __name__ == '__main__':
run('dd if=/dev/zero of=/var/tmp/kernel-check.img bs=1M count=128', shell=True, check=True, stdout=DEVNULL, stderr=DEVNULL)
run('mkfs.xfs /var/tmp/kernel-check.img', shell=True, check=True, stdout=DEVNULL, stderr=DEVNULL)
run('mount /var/tmp/kernel-check.img /var/tmp/mnt -o loop', shell=True, check=True, stdout=DEVNULL, stderr=DEVNULL)
ret = run('iotune --fs-check --evaluation-directory /var/tmp/mnt', shell=True).returncode
ret = run('iotune --fs-check --evaluation-directory /var/tmp/mnt --default-log-level error', shell=True).returncode
run('umount /var/tmp/mnt', shell=True, check=True)
shutil.rmtree('/var/tmp/mnt')
os.remove('/var/tmp/kernel-check.img')

View File

@@ -9,15 +9,90 @@
import os
import argparse
import distutils.util
import pwd
import grp
import sys
import stat
import distro
import logging
import pyudev
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
LOGGER = logging.getLogger(__name__)
class UdevInfo:
def __init__(self, device_file):
self.context = pyudev.Context()
self.device = pyudev.Devices.from_device_file(self.context, device_file)
def verify(self):
if not self.id_fs_uuid:
LOGGER.error('ID_FS_UUID does not found')
if self.id_fs_type != 'xfs':
LOGGER.error('ID_FS_TYPE is not "xfs"')
if self.id_fs_usage != 'filesystem':
LOGGER.error('ID_FS_USAGE is not "filesystem"')
def dump_variables(self):
LOGGER.error(f' sys_path: {self.device.sys_path}')
LOGGER.error(f' sys_name: {self.device.sys_name}')
LOGGER.error(f' sys_number: {self.device.sys_number}')
LOGGER.error(f' device_path: {self.device.device_path}')
LOGGER.error(f' tags: {list(self.device.tags)}')
LOGGER.error(f' subsystem: {self.device.subsystem}')
LOGGER.error(f' driver: {self.device.driver}')
LOGGER.error(f' device_type: {self.device.device_type}')
LOGGER.error(f' device_node: {self.device.device_node}')
LOGGER.error(f' device_number: {self.device.device_number}')
LOGGER.error(f' device_links: {list(self.device.device_links)}')
LOGGER.error(f' is_initialized: {self.device.is_initialized}')
LOGGER.error(f' time_since_initialized: {self.device.time_since_initialized}')
for k, v in self.device.properties.items():
LOGGER.error(f' {k}: {v}')
@property
def id_fs_uuid(self):
return self.device.properties.get('ID_FS_UUID')
@property
def id_fs_type(self):
return self.device.properties.get('ID_FS_TYPE')
@property
def id_fs_usage(self):
return self.device.properties.get('ID_FS_USAGE')
@property
def uuid_link(self):
for l in self.device.device_links:
if l.startswith('/dev/disk/by-uuid/'):
return l
@property
def label_link(self):
for l in self.device.device_links:
if l.startswith('/dev/disk/by-label/'):
return l
@property
def partuuid_link(self):
for l in self.device.device_links:
if l.startswith('/dev/disk/by-partuuid/'):
return l
@property
def path_link(self):
for l in self.device.device_links:
if l.startswith('/dev/disk/by-path/'):
return l
@property
def id_links(self):
return [l for l in self.device.device_links if l.startswith('/dev/disk/by-id')]
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -37,11 +112,14 @@ if __name__ == '__main__':
help='force constructing RAID when only one disk is specified')
parser.add_argument('--raid-level', default='0',
help='specify RAID level')
parser.add_argument('--online-discard', default=True,
parser.add_argument('--online-discard', default="True",
help='Enable XFS online discard (trim SSD cells after file deletion)')
args = parser.parse_args()
# Allow args.online_discard to be used as a boolean value
args.online_discard = distutils.util.strtobool(args.online_discard)
root = args.root.rstrip('/')
if args.volume_role == 'all':
mount_at=root
@@ -125,9 +203,12 @@ if __name__ == '__main__':
procs.append(proc)
for proc in procs:
proc.wait()
for disk in disks:
run(f'wipefs -a {disk}', shell=True, check=True)
if raid:
run('udevadm settle', shell=True, check=True)
run('mdadm --create --verbose --force --run {raid} --level={level} -c1024 --raid-devices={nr_disk} {disks}'.format(raid=fsdev, level=args.raid_level, nr_disk=len(disks), disks=args.disks.replace(',', ' ')), shell=True, check=True)
run(f'wipefs -a {fsdev}', shell=True, check=True)
run('udevadm settle', shell=True, check=True)
major_minor = os.stat(fsdev).st_rdev
@@ -138,7 +219,7 @@ if __name__ == '__main__':
# and it also cannot be smaller than the sector size.
block_size = max(1024, sector_size)
run('udevadm settle', shell=True, check=True)
run(f'mkfs.xfs -b size={block_size} {fsdev} -f -K', shell=True, check=True)
run(f'mkfs.xfs -b size={block_size} {fsdev} -K', shell=True, check=True)
run('udevadm settle', shell=True, check=True)
if is_debian_variant():
@@ -154,35 +235,51 @@ if __name__ == '__main__':
os.makedirs(mount_at, exist_ok=True)
uuid = out(f'blkid -s UUID -o value {fsdev}')
if not uuid:
raise Exception(f'Failed to get UUID of {fsdev}')
udev_info = UdevInfo(fsdev)
mount_dev = None
if udev_info.uuid_link:
mount_dev = udev_info.uuid_link
else:
if udev_info.label_link:
mount_dev = udev_info.label_link
dev_type = 'label'
elif udev_info.partuuid_link:
mount_dev = udev_info.partuuid_link
dev_type = 'partuuid'
elif udev_info.path_link:
mount_dev = udev_info.path_link
dev_type = 'path'
elif udev_info.id_links:
mount_dev = udev_info.id_links[0]
dev_type = 'id'
else:
mount_dev = fsdev
dev_type = 'realpath'
LOGGER.error(f'Failed to detect uuid, using {dev_type}: {mount_dev}')
uuidpath = f'/dev/disk/by-uuid/{uuid}'
after = 'local-fs.target'
after = ''
wants = ''
if raid and args.raid_level != '0':
after += f' {md_service}'
wants = f'\nWants={md_service}'
after = wants = 'md_service'
opt_discard = ''
if args.online_discard:
opt_discard = ',discard'
unit_data = f'''
[Unit]
Description=Scylla data directory
Before=scylla-server.service
After={after}{wants}
Before=local-fs.target scylla-server.service
After={after}
Wants={wants}
DefaultDependencies=no
[Mount]
What={uuidpath}
What={mount_dev}
Where={mount_at}
Type=xfs
Options=noatime{opt_discard}
[Install]
WantedBy=multi-user.target
WantedBy=local-fs.target
'''[1:-1]
with open(f'/etc/systemd/system/{mntunit_bn}', 'w') as f:
f.write(unit_data)
@@ -202,10 +299,18 @@ WantedBy=multi-user.target
mount = systemd_unit(mntunit_bn)
mount.start()
except SubprocessError as e:
if not os.path.exists(uuidpath):
print(f'\nERROR: {uuidpath} is not found\n')
elif not stat.S_ISBLK(os.stat(uuidpath).st_mode):
print(f'\nERROR: {uuidpath} is not block device\n')
if mount_dev != fsdev:
if not os.path.islink(mount_dev):
LOGGER.error('{mount_dev} is not found')
if not os.path.exists(mount_dev):
LOGGER.error('{mount_dev} is broken link')
if not os.path.exists(fsdev):
LOGGER.error('{fsdev} is not found')
if not stat.S_ISBLK(os.stat(fsdev).st_mode):
LOGGER.error('{fsdev} is not block device')
LOGGER.error(f'Error detected, dumping udev env parameters on {fsdev}')
udev_info.verify()
udev_info.dump_variables()
raise e
if args.enable_on_nextboot:
@@ -221,3 +326,8 @@ WantedBy=multi-user.target
if is_debian_variant():
run('update-initramfs -u', shell=True, check=True)
if not udev_info.uuid_link:
LOGGER.error(f'Error detected, dumping udev env parameters on {fsdev}')
udev_info.verify()
udev_info.dump_variables()

View File

@@ -63,7 +63,6 @@ bcp "${packages[@]}" packages/
bcp dist/docker/etc etc/
bcp dist/docker/scylla-housekeeping-service.sh /scylla-housekeeping-service.sh
bcp dist/docker/sshd-service.sh /sshd-service.sh
bcp dist/docker/scyllasetup.py /scyllasetup.py
bcp dist/docker/commandlineparser.py /commandlineparser.py
@@ -73,10 +72,11 @@ bcp dist/docker/scylla_bashrc /scylla_bashrc
run apt-get -y clean expire-cache
run apt-get -y update
run apt-get -y upgrade
run apt-get -y install dialog apt-utils
run bash -ec "echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections"
run bash -ec "rm -rf /etc/rsyslog.conf"
run apt-get -y install hostname supervisor openssh-server openssh-client openjdk-11-jre-headless python2 python3 python3-yaml curl rsyslog sudo
run apt-get -y install hostname supervisor openjdk-11-jre-headless python2 python3 python3-yaml curl rsyslog sudo
run bash -ec "echo LANG=C.UTF-8 > /etc/default/locale"
run bash -ec "dpkg -i packages/*.deb"
run apt-get -y clean all

View File

@@ -1,6 +0,0 @@
[program:sshd]
command=/sshd-service.sh
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

View File

@@ -1,15 +0,0 @@
#!/bin/bash
if [ ! -f /run/sshd ]; then
mkdir -p /run/sshd
fi
if [ ! -f /etc/ssh/ssh_host_ed25519_key ]; then
ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ''
fi
if [ ! -f /etc/ssh/ssh_host_rsa_key ]; then
ssh-keygen -t rsa -b 4096 -f /etc/ssh/ssh_host_rsa_key -N ''
fi
/usr/sbin/sshd -D

View File

@@ -7,7 +7,7 @@ Group: Applications/Databases
License: AGPLv3
URL: http://www.scylladb.com/
Source0: %{reloc_pkg}
Requires: %{product}-server = %{version} %{product}-conf = %{version} %{product}-python3 = %{version} %{product}-kernel-conf = %{version} %{product}-jmx = %{version} %{product}-tools = %{version} %{product}-tools-core = %{version} %{product}-node-exporter = %{version}
Requires: %{product}-server = %{version}-%{release} %{product}-conf = %{version}-%{release} %{product}-python3 = %{version}-%{release} %{product}-kernel-conf = %{version}-%{release} %{product}-jmx = %{version}-%{release} %{product}-tools = %{version}-%{release} %{product}-tools-core = %{version}-%{release} %{product}-node-exporter = %{version}-%{release}
Obsoletes: scylla-server < 1.1
%global _debugsource_template %{nil}
@@ -54,7 +54,7 @@ Group: Applications/Databases
Summary: The Scylla database server
License: AGPLv3
URL: http://www.scylladb.com/
Requires: %{product}-conf = %{version} %{product}-python3 = %{version}
Requires: %{product}-conf = %{version}-%{release} %{product}-python3 = %{version}-%{release}
Conflicts: abrt
AutoReqProv: no

View File

@@ -1,6 +1,77 @@
### a dictionary of redirections
#old path: new path
# removing the Enterprise upgrade guides from the Open Source documentation
/stable/upgrade/upgrade-enterprise/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-ubuntu.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-ubuntu.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-image.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2021.1-to-2022.1-image.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/metric-update-2021.1-to-2022.1.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/metric-update-2021.1-to-2022.1.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-ubuntu-16-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-ubuntu-16-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-ubuntu-18-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-ubuntu-18-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/upgrade-guide-from-2020.1-to-2021.1-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/metric-update-2020.1-to-2021.1.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.1-to-2021.1/metric-update-2020.1-to-2021.1.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-ubuntu-16-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-ubuntu-16-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-ubuntu-18-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-ubuntu-18-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/upgrade-guide-from-2019.1-to-2020.1-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/metric-update-2019.1-to-2020.1.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.1-to-2020.1/metric-update-2019.1-to-2020.1.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/upgrade-guide-from-2018.1-to-2019.1-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/upgrade-guide-from-2018.1-to-2019.1-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/upgrade-guide-from-2018.1-to-2019.1-ubuntu-16-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/upgrade-guide-from-2018.1-to-2019.1-ubuntu-16-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/metric-update-2018.1-to-2019.1.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.1-to-2019.1/metric-update-2018.1-to-2019.1.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-ubuntu.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-ubuntu.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/upgrade-guide-from-2017.1-to-2018.1-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/metric-update-2017.1-to-2018.1.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.1-to-2018.1/metric-update-2017.1-to-2018.1.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-ubuntu-14-to-16.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-ubuntu-14-to-16.html
/stable/getting-started/install-scylla/unified-installer.html#unified-installed-upgrade: https://enterprise.docs.scylladb.com/stable/getting-started/install-scylla/unified-installer.html#unified-installed-upgrade
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-image.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-image.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-ubuntu-18-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-ubuntu-18-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-ubuntu-20-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-ubuntu-20-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-debian-10.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2022.x.y-to-2022.x.z/upgrade-guide-from-2022.x.y-to-2022.x.z-debian-10.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-16-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-16-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-18-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-18-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-20-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-ubuntu-20-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-debian-9.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-debian-9.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-debian-10.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.x.y-to-2021.x.z/upgrade-guide-from-2021.x.y-to-2021.x.z-debian-10.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-ubuntu-16-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-ubuntu-16-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-ubuntu-18-04.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-ubuntu-18-04.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-debian-9.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-debian-9.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-debian-10.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2020.x.y-to-2020.x.z/upgrade-guide-from-2020.x.y-to-2020.x.z-debian-10.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-ubuntu.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-ubuntu.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2019.x.y-to-2019.x.z/upgrade-guide-from-2019.x.y-to-2019.x.z-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-ubuntu.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-ubuntu.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2018.x.y-to-2018.x.z/upgrade-guide-from-2018.x.y-to-2018.x.z-debian.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/index.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/index.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-rpm.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-rpm.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-ubuntu.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-ubuntu.html
/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-debian.html: https://enterprise.docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2017.x.y-to-2017.x.z/upgrade-guide-from-2017.x.y-to-2017.x.z-debian.html
# removing the Enterprise-only content from the Open Source documentation
/stable/using-scylla/workload-prioritization: https://enterprise.docs.scylladb.com//stable/using-scylla/workload-prioritization.html
/stable/operating-scylla/security/encryption-at-rest: https://enterprise.docs.scylladb.com/stable/operating-scylla/security/encryption-at-rest.html
/stable/operating-scylla/security/ldap-authentication: https://enterprise.docs.scylladb.com/stable/operating-scylla/security/ldap-authentication.html
/stable/operating-scylla/security/ldap-authorization: https://enterprise.docs.scylladb.com/stable/operating-scylla/security/ldap-authorization.html
/stable/operating-scylla/security/auditing: https://enterprise.docs.scylladb.com/stable/operating-scylla/security/auditing.html
# unifying the Ubunut upgrade guide for different Ubuntu versions: from 5.0 to 2022.1
/stable/upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/upgrade-guide-from-5.0-to-2022.1-ubuntu-18-04.html: /stable/upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/upgrade-guide-from-5.0-to-2022.1-ubuntu.html
@@ -1112,14 +1183,14 @@ tls-ssl/index.html: /stable/operating-scylla/security
/using-scylla/integrations/integration_kairos/index.html: /stable/using-scylla/integrations/integration-kairos
/upgrade/ami_upgrade/index.html: /stable/upgrade/ami-upgrade
/scylla-cloud/cloud-setup/gcp-vpc-peering/index.html: /stable/scylla-cloud/cloud-setup/GCP/gcp-vpc-peering
/scylla-cloud/cloud-setup/GCP/gcp-vcp-peering/index.html: /stable/scylla-cloud/cloud-setup/GCP/gcp-vpc-peering
/scylla-cloud/cloud-setup/gcp-vpc-peering/index.html: https://cloud.docs.scylladb.com/stable/cloud-setup/gcp-vpc-peering.html
/scylla-cloud/cloud-setup/GCP/gcp-vcp-peering/index.html: https://cloud.docs.scylladb.com/stable/cloud-setup/gcp-vpc-peering.html
# move scylla cloud for AWS to dedicated directory
/scylla-cloud/cloud-setup/aws-vpc-peering/index.html: /stable/scylla-cloud/cloud-setup/AWS/aws-vpc-peering
/scylla-cloud/cloud-setup/cloud-prom-proxy/index.html: /stable/scylla-cloud/cloud-setup/AWS/cloud-prom-proxy
/scylla-cloud/cloud-setup/outposts/index.html: /stable/scylla-cloud/cloud-setup/AWS/outposts
/scylla-cloud/cloud-setup/scylla-cloud-byoa/index.html: /stable/scylla-cloud/cloud-setup/AWS/scylla-cloud-byoa
/scylla-cloud/cloud-setup/aws-vpc-peering/index.html: https://cloud.docs.scylladb.com/stable/cloud-setup/aws-vpc-peering.html
/scylla-cloud/cloud-setup/cloud-prom-proxy/index.html: https://cloud.docs.scylladb.com/stable/monitoring/cloud-prom-proxy.html
/scylla-cloud/cloud-setup/outposts/index.html: https://cloud.docs.scylladb.com/stable/cloud-setup/outposts.html
/scylla-cloud/cloud-setup/scylla-cloud-byoa/index.html: https://cloud.docs.scylladb.com/stable/cloud-setup/scylla-cloud-byoa.html
/scylla-cloud/cloud-services/scylla_cloud_costs/index.html: /stable/scylla-cloud/cloud-services/scylla-cloud-costs
/scylla-cloud/cloud-services/scylla_cloud_managin_versions/index.html: /stable/scylla-cloud/cloud-services/scylla-cloud-managin-versions
/scylla-cloud/cloud-services/scylla_cloud_support_alerts_sla/index.html: /stable/scylla-cloud/cloud-services/scylla-cloud-support-alerts-sla

View File

@@ -161,6 +161,10 @@ events appear in the Streams API as normal deletions - without the
distinctive marker on deletions which are really expirations.
See <https://github.com/scylladb/scylla/issues/5060>.
<!--- REMOVE IN FUTURE VERSIONS - Remove the note below in version 5.3/2023.1 -->
> **Note** This feature is experimental in versions earlier than ScyllaDB Open Source 5.2 and ScyllaDB Enterprise 2022.2.
---

View File

@@ -5,7 +5,7 @@ Raft Consensus Algorithm in ScyllaDB
Introduction
--------------
ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB 5.x has turned to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB has turned to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
Raft is a consensus algorithm that implements a distributed, consistent, replicated log across members (nodes). Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
@@ -13,9 +13,9 @@ Raft uses a heartbeat mechanism to trigger a leader election. All servers start
Leader selection is described in detail in the `Raft paper <https://raft.github.io/raft.pdf>`_.
ScyllaDB 5.x may use Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of a node or network failures.
ScyllaDB can use Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of a node or network failures.
Following ScyllaDB 5.x releases will use Raft to guarantee consistent topology updates similarly.
Upcoming ScyllaDB releases will use Raft to guarantee consistent topology updates similarly.
.. _raft-quorum-requirement:
@@ -26,90 +26,55 @@ Raft requires at least a quorum of nodes in a cluster to be available. If multip
and the quorum is lost, the cluster is unavailable for schema updates. See :ref:`Handling Failures <raft-handling-failures>`
for information on how to handle failures.
Upgrade Considerations for ScyllaDB 5.0 and Later
==================================================
Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one
of the DCs is down.
**We recommend configuring three DCs per cluster to ensure that the cluster remains available and operational when one DC is down.**
.. _enabling-raft-existing-cluster:
Enabling Raft
---------------
Enabling Raft in ScyllaDB 5.0 and 5.1
=====================================
.. warning::
In ScyllaDB 5.0 and 5.1, Raft is an experimental feature.
It is not possible to enable Raft in an existing cluster in ScyllaDB 5.0 and 5.1.
In order to have a Raft-enabled cluster in these versions, you must create a new cluster with Raft enabled from the start.
.. warning::
**Do not** use Raft in production clusters in ScyllaDB 5.0 and 5.1. Such clusters won't be able to correctly upgrade to ScyllaDB 5.2.
Use Raft only for testing and experimentation in clusters which can be thrown away.
.. warning::
Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.
When creating a new cluster, add ``raft`` to the list of experimental features in your ``scylla.yaml`` file:
.. code-block:: yaml
experimental_features:
- raft
.. _enabling-raft-existing-cluster:
Enabling Raft in ScyllaDB 5.2 and further
=========================================
.. TODO include enterprise versions in this documentation
.. note::
In ScyllaDB 5.2, Raft is Generally Available and can be safely used for consistent schema management.
In ScyllaDB 5.3 it will become enabled by default.
In further versions it will be mandatory.
In ScyllaDB 5.2 and ScyllaDB Enterprise 2023.1 Raft is Generally Available and can be safely used for consistent schema management.
In further versions, it will be mandatory.
ScyllaDB 5.2 and later comes equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **internal Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic>`).
ScyllaDB Open Source 5.2 and later, and ScyllaDB Enterprise 2023.1 and later come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`).
.. warning::
Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.
To enable Raft in an existing cluster in Scylla 5.2 and beyond:
To enable Raft in an existing cluster, you need to enable the ``consistent_cluster_management`` option in the ``scylla.yaml`` file
for **each node** in the cluster:
* ensure that the schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and ensuring that the schema version is the same on all nodes,
* then perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>`, updating the ``scylla.yaml`` file for **each node** in the cluster before restarting it to enable the ``consistent_cluster_management`` flag:
#. Ensure that the schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and ensuring that the schema version is the same on all nodes.
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>`, updating the ``scylla.yaml`` file for **each node** in the cluster before restarting it to enable the ``consistent_cluster_management`` option:
.. code-block:: yaml
.. code-block:: yaml
consistent_cluster_management: true
consistent_cluster_management: true
When all the nodes in the cluster and updated and restarted, the cluster will start the **internal Raft upgrade procedure**.
**You must then verify** that the internal Raft upgrade procedure has finished successfully. Refer to the :ref:`next section <verify-raft-procedure>`.
When all the nodes in the cluster and updated and restarted, the cluster will start the **Raft upgrade procedure**.
**You must then verify** that the Raft upgrade procedure has finished successfully. Refer to the :ref:`next section <verify-raft-procedure>`.
You can also enable the ``consistent_cluster_management`` flag while performing :doc:`rolling upgrade from 5.1 to 5.2 </upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic>`: update ``scylla.yaml`` before restarting each node. The internal Raft upgrade procedure will start as soon as the last node was upgraded and restarted. As above, this requires :ref:`verifying <verify-raft-procedure>` that this internal procedure successfully finishes.
Alternatively, you can enable the ``consistent_cluster_management`` option when you are:
Finally, you can enable the ``consistent_cluster_management`` flag when creating a new cluster. This does not use the internal Raft upgrade procedure; instead, Raft is functioning in the cluster and managing schema right from the start.
* Performing a rolling upgrade from version 5.1 to 5.2 or version 2022.x to 2023.1 by updating ``scylla.yaml`` before restarting each node. The Raft upgrade procedure will start as soon as the last node was upgraded and restarted. As above, this requires :ref:`verifying <verify-raft-procedure>` that the procedure successfully finishes.
* Creating a new cluster. This does not use the Raft upgrade procedure; instead, Raft is functioning in the cluster and managing schema right from the start.
Until all nodes are restarted with ``consistent_cluster_management: true``, it is still possible to turn this option back off. Once enabled on every node, it must remain turned on (or the node will refuse to restart).
.. _verify-raft-procedure:
Verifying that the internal Raft upgrade procedure finished successfully
Verifying that the Raft upgrade procedure finished successfully
========================================================================
.. versionadded:: 5.2
The internal Raft upgrade procedure starts as soon as every node in the cluster restarts with ``consistent_cluster_management`` flag enabled in ``scylla.yaml``.
The Raft upgrade procedure starts as soon as every node in the cluster restarts with ``consistent_cluster_management`` flag enabled in ``scylla.yaml``.
.. TODO: update the above sentence once 5.3 and later are released.
The procedure requires **full cluster availability** to correctly setup the Raft algorithm; after the setup finishes, Raft can proceed with only a majority of nodes, but this initial setup is an exception.
An unlucky event, such as a hardware failure, may cause one of your nodes to fail. If this happens before the internal Raft upgrade procedure finishes, the procedure will get stuck and your intervention will be required.
An unlucky event, such as a hardware failure, may cause one of your nodes to fail. If this happens before the Raft upgrade procedure finishes, the procedure will get stuck and your intervention will be required.
To verify that the procedure finishes, look at the log of every Scylla node (using ``journalctl _COMM=scylla``). Search for the following patterns:
@@ -204,8 +169,6 @@ If some nodes are **dead and irrecoverable**, you'll need to perform a manual re
Verifying that Raft is enabled
===============================
.. versionadded:: 5.2
You can verify that Raft is enabled on your cluster by performing the following query on each node:
.. code-block:: sql
@@ -224,7 +187,7 @@ The query should return:
on every node.
If the query returns 0 rows, or ``value`` is ``synchronize`` or ``use_pre_raft_procedures``, it means that the cluster is in the middle of the internal Raft upgrade procedure; consult the :ref:`relevant section <verify-raft-procedure>`.
If the query returns 0 rows, or ``value`` is ``synchronize`` or ``use_pre_raft_procedures``, it means that the cluster is in the middle of the Raft upgrade procedure; consult the :ref:`relevant section <verify-raft-procedure>`.
If ``value`` is ``recovery``, it means that the cluster is in the middle of the manual recovery procedure. The procedure must be finished. Consult :ref:`the section about Raft recovery <recover-raft-procedure>`.
@@ -276,12 +239,8 @@ Examples
- Schema updates are possible and safe.
- Try restarting the node. If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
* - 2 nodes
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Data is available for reads and writes, schema changes are impossible.
- Restart at least 1 of the 2 nodes that are down to regain quorum. If you cant recover at least 1 of the 2 nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
* - 1 datacenter
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- When the DC comes back online, restart the nodes. If the DC does not come back online and nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
.. list-table:: Cluster B: 2 datacenters, 6 nodes (3 nodes per DC)
:widths: 20 40 40
@@ -294,10 +253,10 @@ Examples
- Schema updates are possible and safe.
- Try restarting the node(s). If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
* - 3 nodes
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Data is available for reads and writes, schema changes are impossible.
- Restart 1 of the 3 nodes that are down to regain quorum. If you cant recover at least 1 of the 3 failed nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
* - 1DC
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Data is available for reads and writes, schema changes are impossible.
- When the DCs come back online, restart the nodes. If the DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
@@ -315,7 +274,7 @@ Examples
- Schema updates are possible and safe.
- When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, :doc:`add 3 new nodes in a new region </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
* - 2 DCs
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Data is available for reads and writes, schema changes are impossible.
- When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
.. _recover-raft-procedure:
@@ -323,26 +282,24 @@ Examples
Raft manual recovery procedure
==============================
.. versionadded:: 5.2
The manual Raft recovery procedure applies to the following situations:
* :ref:`The internal Raft upgrade procedure <verify-raft-procedure>` got stuck because one of your nodes failed in the middle of the procedure and is irrecoverable,
* :ref:`The Raft upgrade procedure <verify-raft-procedure>` got stuck because one of your nodes failed in the middle of the procedure and is irrecoverable,
* or the cluster was running Raft but a majority of nodes (e.g. 2 our of 3) failed and are irrecoverable. Raft cannot progress unless a majority of nodes is available.
.. warning::
Perform the manual recovery procedure **only** if you're dealing with **irrecoverable** nodes. If it is possible to restart your nodes, do that instead of manual recovery.
.. warning::
.. note::
Before proceeding, make sure that the irrecoverable nodes are truly dead, and not, for example, temporarily partitioned away due to a network failure. If it is possible for the 'dead' nodes to come back to life, they might communicate and interfere with the recovery procedure and cause unpredictable problems.
If you have no means of ensuring that these irrecoverable nodes won't come back to life and communicate with the rest of the cluster, setup firewall rules or otherwise isolate your alive nodes to reject any communication attempts from these dead nodes.
During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove all faulty nodes (using the standard :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`), delete the internal Raft data, and restart the cluster. This will cause the cluster to perform the internal Raft upgrade procedure again, initializing the Raft algorithm from scratch. The manual recovery procedure is applicable both to clusters which were not running Raft in the past and then had Raft enabled, and to clusters which were bootstrapped using Raft.
During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove all faulty nodes (using the standard :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`), delete the internal Raft data, and restart the cluster. This will cause the cluster to perform the Raft upgrade procedure again, initializing the Raft algorithm from scratch. The manual recovery procedure is applicable both to clusters which were not running Raft in the past and then had Raft enabled, and to clusters which were bootstrapped using Raft.
.. warning::
.. note::
Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while some nodes are already dead may lead to unavailability of data queries (assuming that you haven't lost it already). For example, if you're using the standard RF=3, CL=QUORUM setup, and you're recovering from a stuck of upgrade procedure because one of your nodes is dead, restarting another node will cause temporary data query unavailability (until the node finishes restarting). Prepare your service for downtime before proceeding.
@@ -393,4 +350,3 @@ Learn More About Raft
* `Making Schema Changes Safe with Raft <https://www.scylladb.com/presentations/making-schema-changes-safe-with-raft/>`_ - A Scylla Summit talk by Konstantin Osipov (register for access)
* `The Future of Consensus in ScyllaDB 5.0 and Beyond <https://www.scylladb.com/presentations/the-future-of-consensus-in-scylladb-5-0-and-beyond/>`_ - A Scylla Summit talk by Tomasz Grabiec (register for access)

View File

@@ -70,9 +70,7 @@ You can use the ``nodetool`` command to describe different aspects of your nodes
``$ nodetool ring <keyspace>``
Outputs all tokens of a node, and displays the token ring information_. It produces output as follows for a single datacenter:
.. _information: /operating-scylla/nodetool-commands/ring/
Outputs all tokens of a node, and displays the :doc:`token ring information </operating-scylla/nodetool-commands/ring>`. It produces output as follows for a single datacenter:
.. code-block:: shell

Some files were not shown because too many files have changed in this diff Show More