Compare commits

...

992 Commits

Author SHA1 Message Date
Anna Stuchlik
cd7a6b8892 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:24:08 +02:00
Anna Stuchlik
a592393d5d doc: add the 6.0-to-2024.2 upgrade guide-from-6
This commit adds an upgrade guide from ScyllDB 6.0
to ScyllaDB Enterprise 2024.2.

Fixes https://github.com/scylladb/scylladb/issues/20063
Fixes https://github.com/scylladb/scylladb/issues/20062
Refs https://github.com/scylladb/scylla-enterprise/issues/4544

(cherry picked from commit 3d4b7e41ef)

Closes scylladb/scylladb#21618
2024-11-18 17:17:56 +02:00
Gleb Natapov
8c8316e21d topology coordinator: take a copy of a replication state in raft_topology_cmd_handler
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.

Fix it by taking a copy instead. This is a slow path anyway.

Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)

Closes scylladb/scylladb#21372
2024-11-14 17:33:16 +02:00
Tomasz Grabiec
5d07be19c0 node-exporter: Disable hwmon collector
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.

By default, the monitoring stack queries it once every 60s.

(cherry picked from commit 93777fa907)

Closes scylladb/scylladb#21306
2024-10-31 14:06:17 +01:00
Lakshmi Narayanan Sreethar
fd80dd2284 [Backport 6.0] replica/table: check memtable before discarding tombstone during read
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.

Fixes #20916

`perf-simple-query` stats before and after this fix :

`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59393 insns/op,   24029 cycles/op,        0 errors)
97551.14 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59376 insns/op,   23966 cycles/op,        0 errors)
96599.92 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59367 insns/op,   23998 cycles/op,        0 errors)
97774.91 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59370 insns/op,   23968 cycles/op,        0 errors)
97796.13 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59368 insns/op,   23947 cycles/op,        0 errors)

         throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
  cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19

// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59392 insns/op,   24058 cycles/op,        0 errors)
97311.48 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59375 insns/op,   24005 cycles/op,        0 errors)
98043.10 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59381 insns/op,   23941 cycles/op,        0 errors)
96750.31 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59396 insns/op,   24025 cycles/op,        0 errors)
93381.21 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59390 insns/op,   24097 cycles/op,        0 errors)

         throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
  cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```

This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.

Closes scylladb/scylladb#20985

* github.com:scylladb/scylladb:
  topology-custom: add test to verify tombstone gc in read path
  replica/table: check memtable before discarding tombstone during read
  compaction_group: track maximum timestamp across all sstables

(cherry picked from commit 519e167611)

Backported from #20985 to 6.0

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#21249
2024-10-25 11:20:24 +03:00
Botond Dénes
21cdf8833f Merge '[Backport 6.0] tablet: Fix single-sstable split when attaching new unsplit sstables' from Raphael Raph Carvalho
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.

So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.

This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.

It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.

Fixes https://github.com/scylladb/scylladb/issues/20626.

(cherry picked from commit 999f1f1318)

(cherry picked from commit 38ce2c605d)

Refs https://github.com/scylladb/scylladb/pull/20737

Closes scylladb/scylladb#21201

* github.com:scylladb/scylladb:
  tablet: Fix single-sstable split when attaching new unsplit sstables
  replica: Fix tablet split execute after restart
2024-10-25 11:18:58 +03:00
Botond Dénes
79b7aee58b Merge '[Backport 6.0] Check system.tablets update before putting it into the table' from ScyllaDB
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

fixes #20043

(cherry picked from commit f09fe4f351)

(cherry picked from commit e5bf376cbc)

(cherry picked from commit 1863ccd900)

 Refs #21020

Closes scylladb/scylladb#21112

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
  replica/tablets: Add to_tablet_metadata_(row_)?key helpers
  replica/tablets: extract tablet_replica_set_from_cell()
2024-10-25 11:18:32 +03:00
Benny Halevy
e45477811c storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)

Closes scylladb/scylladb#20724
2024-10-25 11:18:12 +03:00
Tomasz Grabiec
86066f5313 Merge '[Backport 6.0] replica: Fix tombstone GC during tablet split preparation' from Raphael Raph Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:

sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

(cherry picked from commit bcd358595f)

(cherry picked from commit 93815e0649)

Refs https://github.com/scylladb/scylladb/pull/20939

Closes scylladb/scylladb#21204

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-10-23 11:48:45 +02:00
Pavel Emelyanov
b71753a4bc tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 12:36:00 +03:00
Pavel Emelyanov
29e562af1a group0_client: Introduce change validation
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 12:35:44 +03:00
Pavel Emelyanov
ae5885abf5 group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 12:35:44 +03:00
Pavel Emelyanov
a0f029ceaa replica/tablets: Add to_tablet_metadata_(row_)?key helpers
Extraceted from larger patch f5976aa87b (replica/tablets: add
get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint())
by Botond. The helpers are needed to decode mutations with tablets
update to validate them later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 12:35:42 +03:00
Kefu Chai
648b017a36 replica/tablets: extract tablet_replica_set_from_cell()
so it can be reused to implement a low-level tool which reads tablets
data from sstables

Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 12:24:57 +03:00
Botond Dénes
f84cdc7569 Merge '[Backport 6.0] atomic_delete: allow deletion of sstables from several prefixes' from ScyllaDB
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).

The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups.  Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.

Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.

Fixes scylladb/scylladb#18862

Needs backport to 6.0 since tablets require this capability

(cherry picked from commit a7b92d7b6f)

(cherry picked from commit 027e64876a)

(cherry picked from commit 44bd183187)

(cherry picked from commit f47b5e60bc)

 Refs #19555

Closes scylladb/scylladb#20645

* github.com:scylladb/scylladb:
  sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
  sstables: storage: keep base directory in base class
  sstables: storage: define opened_directory in header file
  sstable_directory: use only dirlog
2024-10-22 09:21:04 +03:00
Benny Halevy
3170f9abec view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d34878e96c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21040
2024-10-22 09:18:49 +03:00
Raphael S. Carvalho
553803ac0f replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
2024-10-20 20:38:21 -03:00
Raphael S. Carvalho
fde550360e tablet: Fix single-sstable split when attaching new unsplit sstables
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.

So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.

This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.

It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.

Fixes #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 38ce2c605d)
2024-10-20 20:16:53 -03:00
Raphael S. Carvalho
3ba613b833 replica: Fix tablet split execute after restart
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.

1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty

[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.

This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.

The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.

Refs #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 999f1f1318)
2024-10-20 20:16:51 -03:00
Benny Halevy
f61e0e0f3e sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.

Note that this requires loading and processing
the base directory first.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f47b5e60bc)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

# Conflicts:
#	sstables/sstable_directory.hh
2024-10-20 09:17:06 +03:00
Benny Halevy
f0511ab4eb sstables: storage: keep base directory in base class
so we can use the base (table) directory for
e.g. pending_delete logs, in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 44bd183187)
2024-10-20 09:17:06 +03:00
Benny Halevy
38bc9ee175 sstables: storage: define opened_directory in header file
So it can be used outside the storage module
in the following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 027e64876a)
2024-10-20 09:17:06 +03:00
Benny Halevy
75e19cdeb1 sstable_directory: use only dirlog
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7b92d7b6f)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

# Conflicts:
#	sstables/sstable_directory.cc
2024-10-20 09:17:06 +03:00
Piotr Smaron
cdc5ee84ec test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit 3969ffb39f)

Closes scylladb/scylladb#21134
2024-10-17 11:03:14 +03:00
Piotr Smaron
5fa7c5dbc0 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240
(cherry picked from commit e0c1a51642)

Closes scylladb/scylladb#21024
2024-10-17 09:36:35 +03:00
Kamil Braun
dddd9837b2 Merge '[Backport 6.0] cql: improve validating RF's change in ALTER tablets KS' from Piotr Smaron
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: https://github.com/scylladb/scylladb/issues/20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

(cherry picked from commit 042825247f)

(cherry picked from commit adf453af3f)

(cherry picked from commit 9c5950533f)

(cherry picked from commit 47acdc1f98)

(cherry picked from commit 93d61d7031)

(cherry picked from commit 6676e47371)

(cherry picked from commit 2aabe7f09c)

(cherry picked from commit ee56bbfe61)

Refs https://github.com/scylladb/scylladb/pull/20208

Closes scylladb/scylladb#21047

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-15 12:45:40 +02:00
Kefu Chai
38f432b598 install.sh: install seastar/scripts/addr2line.py as well
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
    from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```

in this change, we redistribute `addr2line.py` as well. this
should address the issue above.

Fixes scylladb/scylladb#21077

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)

Closes scylladb/scylladb#21086
2024-10-14 13:37:25 +03:00
Botond Dénes
45a3a1d460 Merge '[Backport 6.0] storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from ScyllaDB
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in  filter_for_query(): the map is considered incorrect if the list  of replicas contains a node from a data center whose replication factor is 0.

 Please note: This PR does not fix the issue found in scylladb/scylladb#20282;   it only adds condition checks to prevent undefined behavior in cases of  inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

(cherry picked from commit 132358dc92)

(cherry picked from commit ae23d42889)

(cherry picked from commit ad93cf5753)

(cherry picked from commit 8db6d6bd57)

(cherry picked from commit c373edab2d)

 Refs #20851

Closes scylladb/scylladb#21069

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-14 13:37:08 +03:00
Michał Chojnowski
a4eab7cab6 reader_concurrency_semaphore: in stats, fix swapped count_resources and memory_resources
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.

This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.

(cherry picked from commit 6cf3747c5f)

Closes scylladb/scylladb#21032
2024-10-14 13:35:47 +03:00
Sergey Zolotukhin
d490178a11 Add conditions checking for get_read_executor
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
  get_endpoints_for_reading(): the map is considered incorrect the number of
  read replica nodes is higher than replication factor. The check is
  applied only when built in non release mode.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.

Refs scylladb/scylladb#20625

(cherry picked from commit c373edab2d)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
a8114ab91c Avoid an extra call to block_for in db::filter_for_query.
(cherry picked from commit 8db6d6bd57)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
3b0a161d14 Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.

(cherry picked from commit ad93cf5753)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
e35746064a tools: Add build_info header with functions providing build type information
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.

(cherry picked from commit ae23d42889)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
c5d7b66a5e tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625

(cherry picked from commit 132358dc92)
2024-10-11 18:20:42 +00:00
Botond Dénes
9e5bdc069e repair/row_level: remove reader timeout
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.

The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.

Refs: #18269
(cherry picked from commit 3ebb124eb2)

Closes scylladb/scylladb#20957
2024-10-11 14:49:49 +03:00
Calle Wilund
edfa692b80 database: Also forced new schema commitlog segment on user initiated memtable flush
Refs #20686
Refs #15607

In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.

(cherry picked from commit 60f8a9f39d)

Closes scylladb/scylladb#20706
2024-10-11 14:46:32 +03:00
Avi Kivity
16eab130fa Merge '[Backport 6.0] scylla_raid_setup: configure SELinux file context' from ScyllaDB
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #19325

(cherry picked from commit 56c971373c)

(cherry picked from commit 0ac450de05)

 Refs #20528

Closes scylladb/scylladb#20872

* github.com:scylladb/scylladb:
  scylla_raid_setup: configure SELinux file context
  scylla_coredump_setup: fix SELinux configuration for RHEL9
2024-10-10 19:02:11 +03:00
Gleb Natapov
899c696a3e storage_proxy: make sure there is no end iterator in _live_iterators array
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.

The patch makes sure that there is no end iterator in _live_iterators.

Fixes scylladb/scylladb#20874

(cherry picked from commit da084d6441)

Closes scylladb/scylladb#21005
2024-10-10 18:57:41 +03:00
Piotr Smaron
2557991f92 cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.

(cherry picked from commit ee56bbfe61)
2024-10-10 12:38:00 +02:00
Piotr Smaron
f05b9adba6 cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.

(cherry picked from commit 2aabe7f09c)
2024-10-10 12:37:54 +02:00
Kefu Chai
e0f1076f63 auth: capture boost::regex_error not std::regex_error
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.

since we decided to use Boost.Regex, let's catch `boost::regex_error`.

Refs a3db5401
Fixes scylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 439c52c7c5)

Closes scylladb/scylladb#20954
2024-10-09 21:56:09 +03:00
Piotr Smaron
7aceb8a763 cql: fix validation of ALTERing RFs in tablets KS
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
   needs their RFs to be validated.

(cherry picked from commit 6676e47371)
2024-10-08 18:08:23 +00:00
Piotr Smaron
0913a15dc1 cql: harden alter_keyspace_statement.cc::validate_rf_difference
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.

(cherry picked from commit 93d61d7031)
2024-10-08 18:08:22 +00:00
Piotr Smaron
e03cc8aa6c cql: validate RF change for new DCs in ALTER tablets KS
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.

Refs: #20039
(cherry picked from commit 47acdc1f98)
2024-10-08 18:08:22 +00:00
Piotr Smaron
4172d34c5c cql: extend test_alter_tablet_keyspace_rf
Added cases to also test decreasing RF and setting the same RF.
Also added extra explanatory comments.

(cherry picked from commit 9c5950533f)
2024-10-08 18:08:22 +00:00
Piotr Smaron
b3fcd9fc5b cql: refactor test_tablets::test_alter_tablet_keyspace
1. Renamed the testcase to emphasize that it only focuses on testing
   changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8

(cherry picked from commit adf453af3f)
2024-10-08 18:08:21 +00:00
Piotr Smaron
9810fb3efd cql: remove unused helper function from test_tablets
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.

(cherry picked from commit 042825247f)
2024-10-08 18:08:21 +00:00
Raphael S. Carvalho
c4cdfb1d78 service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.

Fixes #20890.

(cherry picked from commit bcd358595f)
2024-10-04 11:17:41 +00:00
Pavel Emelyanov
1ff582f808 cql: Check that CREATEing tablets/vnodes is consistent with the CLI
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:

    if (strategy.supports_tablets) {
         if (cql.with_tablets) {
             if (cfg.enable_tablets) {
                 return create_keyspace_with_tablets();
             } else {
                 throw "tablets are not enabled";
             }
         } else if (cql.with_tablets = off) {
              return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
              if (cfg.enable_tablets) {
                  return create_keyspace_with_tablets();
              } else {
                  return create_keyspace_without_tablets();
              }
         }
     } else { // strategy doesn't support tablets
         if (cql.with_tablets == on) {
             throw "invalid cql parameter";
         } else if (cql.with_tablets == off) {
             return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
             return create_keyspace_without_tablets();
         }
     }

closes: #20088

In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.

There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20929
2024-10-03 17:08:26 +03:00
Michael Litvak
e392531ca9 mv: skip building view updates on a pending replica
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.

This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.

This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.

The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.

The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.

Fixes scylladb/scylladb#19152

Closes scylladb/scylladb#19488

Fixes scylladb/scylladb#20787

(cherry picked from commit 08b29460fc)

Closes scylladb/scylladb#20934
2024-10-03 11:17:13 +02:00
Calle Wilund
c311a93f0a commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.

Test included.

(cherry picked from commit ee5e71172f)

Closes scylladb/scylladb#20913
2024-10-03 09:12:33 +03:00
Kamil Braun
3baa06d349 Merge ' [Backport 6.0] Populate raft address map from gossiper on raft configuration change' from Gleb Natapov
For each new node added to the raft config populate it&#39;s ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration
long after it first appears in the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

(cherry picked from commit bddaf498df)

(cherry picked from commit 9e4cd32096)

Closes scylladb/scylladb#20866

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-30 17:05:01 +02:00
Takuya ASADA
f0e5f37943 scylla_raid_setup: configure SELinux file context
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #20573

(cherry picked from commit 0ac450de05)
2024-09-29 13:23:04 +00:00
Takuya ASADA
81e922e8c2 scylla_coredump_setup: fix SELinux configuration for RHEL9
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.

Fixes #19325

(cherry picked from commit 56c971373c)
2024-09-29 13:23:04 +00:00
Jenkins Promoter
71a2b14e08 Update ScyllaDB version to: 6.0.5 2024-09-29 15:23:43 +03:00
Gleb Natapov
b4d028f51f test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
(cherry picked from commit 9e4cd32096)
2024-09-29 12:52:15 +03:00
Kamil Braun
f6ae04a59c Merge ' [Backport 6.0] mark node as being replaced earlier' from Gleb Natapov
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: https://github.com/scylladb/scylladb/issues/20629

Need to be backported since this is a regression

(cherry picked from commit 644e7a2012)

(cherry picked from commit c0939d86f9)

(cherry picked from commit 1b4c255ffd)

Closes scylladb/scylladb#20835

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-27 16:10:42 +02:00
Kamil Braun
fecb76cff3 service: raft: fix rpc error message
What it called "leader" is actually the destination of the RPC.

Trivial fix, should be backported to all affected versions.

(cherry picked from commit 84dd0e922b)

Closes scylladb/scylladb#20828
2024-09-27 11:22:47 +02:00
Gleb Natapov
35a02eb235 group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.

(cherry picked from commit bddaf498df)
2024-09-26 21:14:14 +00:00
Gleb Natapov
12495ccea5 test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
(cherry picked from commit 1b4c255ffd)
2024-09-26 12:53:06 +03:00
Gleb Natapov
d565cb3501 topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

(cherry picked from commit c0939d86f9)
2024-09-26 12:53:06 +03:00
Gleb Natapov
2312a7cd23 topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.

(cherry picked from commit 644e7a2012)
2024-09-26 12:53:06 +03:00
Nadav Har'El
05ed0a033a alternator: exclude CDC log table from ListTables
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.

We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".

This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.

Fixes #19911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19914

(cherry picked from commit d293a5787f)
2024-09-26 11:24:57 +03:00
Kefu Chai
b715df1c2e docs: explain precedence of configure options
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1aa030a8cd)

Closes scylladb/scylladb#20776
2024-09-26 10:59:49 +03:00
Botond Dénes
276af11202 Merge '[Backport 6.0] replica: ignore cleanup of deallocated storage group' from Aleksandra Martyniuk
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.

Ignore cleanup of deallocated storage groups.

Fixes: https://github.com/scylladb/scylladb/issues/19752.

Needs to be backported to all branches with tablets (6.0 and later)

(cherry picked from commit 20d6cf55f2)

(cherry picked from commit 2c4b1d6b45)

Refs https://github.com/scylladb/scylladb/pull/20584

Closes scylladb/scylladb#20628

* github.com:scylladb/scylladb:
  test: check if cleanup of deallocated sg is ignored
  replica: ignore cleanup of deallocated storage group
2024-09-26 10:58:01 +03:00
Lakshmi Narayanan Sreethar
24c6fce266 [Backport 6.0] database::get_all_tables_flushed_at: fix return value
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.

Fixes #20301

Closes scylladb/scylladb#20471

* github.com:scylladb/scylladb:
  cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
  database::get_all_tables_flushed_at: fix return value

(cherry picked from commit 0e5b444777)

Backported from #20471 to 6.0.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20580
2024-09-26 10:52:02 +03:00
Kamil Braun
45f01a886f test: fix topology_custom/test_raft_recovery_stuck flakiness
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.

Fixes scylladb/scylladb#20791

Should be backported to all branches containing this test to reduce
flakiness.

(cherry picked from commit f390d4020a)

Closes scylladb/scylladb#20810
2024-09-25 15:12:30 +02:00
Abhinav
d8b66cf6ef raft topology: add error for removal of non-normal nodes
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)

Closes scylladb/scylladb#20801
2024-09-25 11:36:02 +02:00
Gleb Natapov
c75d58aef5 test: skip test_lwt_semaphore::test_cas_semaphore in aarch64 debug mode
The test configures write timeout to much smaller value to make the test
run faster since for some writes sleep is inserted to hit the timeout,
but it makes aarch64 debug flaky since timeout happens when it should
not because of a natural slowness.

(cherry picked from commit 71a5b1c6dd)

Closes scylladb/scylladb#20778
2024-09-24 15:20:56 +02:00
Benny Halevy
2e8ad3ec35 time_window_compaction_strategy: get_reshaping_job: restrict sort of multi_window vector to its size
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.

Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.

Fixes scylladb/scylladb#20608

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 39ce358d82)

Closes scylladb/scylladb#20664
2024-09-23 16:01:27 +03:00
Botond Dénes
7d421abec4 Merge '[Manual Backport 6.0] generic_server: convert connection tracking to seastar::gate' from Laszlo Ersek
This is a manual backport of #20212 to 6.0, superseding #20346 (which had run into conflicts).

Please see the individual commit messages for backport notes.

Fixes #10305

Closes scylladb/scylladb#20349

* github.com:scylladb/scylladb:
  generic_server: make server::stop() idempotent
  generic_server: coroutinize server::shutdown()
  generic_server: make server::shutdown() idempotent
  test/generic_server: add test case
  configure, cmake: sort the lists of boost unit tests
  generic_server: convert connection tracking to seastar::gate
2024-09-19 09:18:48 +03:00
Tomasz Grabiec
e38b42cedf Merge '[Backport 6.0] tablets: Fix race between repair and split ' from Raphael "Raph" Carvalho
Consider the following:

```
T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.

Fixes https://github.com/scylladb/scylladb/issues/19378.
Fixes https://github.com/scylladb/scylladb/issues/19416.

Please replace this line with justification for the backport/* labels added to this PR

(cherry picked from commit 239344ab55)

(cherry picked from commit 74612ad358)

Refs https://github.com/scylladb/scylladb/pull/19427

Closes scylladb/scylladb#20593

* github.com:scylladb/scylladb:
  tablets: Fix race between repair and split
  compaction: Allow "offline" sstable to be split
2024-09-17 13:24:36 +02:00
Gleb Natapov
c04ce4ce64 paxos_state: release semaphore units before checking if a semaphore can be dropped
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.

Fixes: scylladb/scylladb#20602
(cherry picked from commit 9cc54932ae)

Closes scylladb/scylladb#20622
2024-09-16 22:39:03 +03:00
Aleksandra Martyniuk
62593ee85f test: check if cleanup of deallocated sg is ignored
(cherry picked from commit 2c4b1d6b45)
2024-09-16 17:49:59 +02:00
Aleksandra Martyniuk
ae385b29fe replica: ignore cleanup of deallocated storage group
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.

Ignore cleanup of deallocated storage group.

(cherry picked from commit 20d6cf55f2)
2024-09-16 17:21:14 +02:00
Piotr Dulikowski
977c458555 Merge '[Backport 6.0]: hints: send hints with CL=ALL if target is leaving' from Piotr Dulikowski
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.

Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylladb#20558
Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835

This is a backport of the original PR without the tests, done avoid the need of resolving merge conflicts in that area.

Closes scylladb/scylladb#20559

* github.com:scylladb/scylladb:
  hints: send hints with CL=ALL if target is leaving
  hints: inline do_send_one_mutation
2024-09-16 10:26:00 +02:00
Raphael S. Carvalho
c67967b65a tablets: Fix race between repair and split
Consider the following:

T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes

If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.

Fixes #19378.
Fixes #19416.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 74612ad358)
2024-09-13 21:11:25 -03:00
Abhi
d8a71ca6db raft: Add descriptions for requested abort errors
Fixes: scylladb/scylladb#18902

This PR is intended to make debugging easier, hence backporting it to
previous versions shall be useful while debugging issues there

(cherry picked from commit a616f10)

For fixing the backport,  parentheses () were added after variable captures
in lambdas, absence of which wasn't supported in earlier versions of C++.

Closes scylladb/scylladb#20564
2024-09-13 10:18:18 +03:00
Botond Dénes
2f3f734cfb docs/cql/ddl.rst: fix description of sstable_compression
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.

(cherry picked from commit 08f109724b)

Closes scylladb/scylladb#20525
2024-09-13 10:17:12 +03:00
Gleb Natapov
6bd8c9fae5 db/consistency_level: do not use result from hit weighted load balancer if it contains duplicates
Because of https://github.com/scylladb/scylladb/issues/9285 hit weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.

(cherry picked from commit 807e37502a)

Closes scylladb/scylladb#20470
2024-09-13 10:16:39 +03:00
Anna Stuchlik
3948167fca doc: add a page with ScyllaDB limits
This commit adds a page listing the ScyllDB limits
we know today.
The page can and should be extended when other limits
are confirmed.

Closes scylladb/scylladb#19399

(cherry picked from commit 072542a5cc)
2024-09-12 13:32:50 +03:00
Piotr Dulikowski
c423ae1688 hints: send hints with CL=ALL if target is leaving
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only
  send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.

Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835

(cherry picked from commit 61ac0a336d)
2024-09-12 10:58:25 +02:00
Piotr Dulikowski
24e70895d5 hints: inline do_send_one_mutation
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.

(cherry picked from commit 8abb06ab82)
2024-09-12 10:58:22 +02:00
Nadav Har'El
9c1f4d0953 Merge '[Backport 6.0] cql3: add option to not unify bind variables with the same name' from Avi Kivity
Bind variables in CQL have two formats: positional (?) where a variable is referred to by its relative position in the statement, and named (:var), where the user is expected to supply a name->value mapping.

In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the dialect and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes https://github.com/scylladb/scylladb/issues/15559

This may be useful to users transitioning from Cassandra, so merits a backport.

(cherry picked from commit f9322799af)

(cherry picked from commit d69bf4f010)

(cherry picked from commit ea8441dfa3)

Refs https://github.com/scylladb/scylladb/pull/19493

Subsumes #20389

Closes scylladb/scylladb#20551

* github.com:scylladb/scylladb:
  cql3: add option to not unify bind variables with the same name
  cql3: introduce dialect infrastructure
  cql3: prepared_statement_cache: drop cache key default constructor
  test: cql-pytest: config_value_context: remove strange ast.literal_eval call
  Merge 'config: round-trip boolean configuration variables' from Avi Kivity
2024-09-12 11:21:34 +03:00
Avi Kivity
ad52caac55 cql3: add option to not unify bind variables with the same name
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.

In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes #15559

(cherry picked from commit ea8441dfa3)
(cherry picked from commit edb3068ecf)
2024-09-11 22:55:22 +03:00
Avi Kivity
aabad7e88f cql3: introduce dialect infrastructure
A dialect is a different way to interpret the same CQL statement.

Examples:
 - how duplicate bind variable names are handled (later in this series)
 - whether `column = NULL` in LWT can return true (as is now) or
   whether it always returns NULL (as in SQL)

Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.

The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.

(cherry picked from commit d69bf4f010)
2024-09-11 22:55:22 +03:00
Avi Kivity
f8f958030a cql3: prepared_statement_cache: drop cache key default constructor
It's unnecessary, and interferes with the following patch where
we change the cache key type.

(cherry picked from commit f9322799af)
2024-09-11 22:55:22 +03:00
Avi Kivity
643da8e3d8 test: cql-pytest: config_value_context: remove strange ast.literal_eval call
cql-pytest's config_value_context is used to run a code sequence with
different ScyllaDB configuration applied for a while. When it reads
the original value (in order to restore it later), it applies
ast.literal_eval() to it. This is strange, since the config variable isn't
a Python literal.

It was added in 8c464b2ddb ("guardrails: restrict replication
strategy (RS)"). Presumably, as a workaround for #19604 - it sufficiently
massaged the input we read via SELECT to be acceptable later via UPDATE.

Now that #19604 is fixed, we can remove the call to ast.literal_eval,
but have to fix up the parameters to config_value_context to something
that will be accepted without further massaging.

This is a step towards fixing #15559, where we want to run some tests
with a boolean configuration variable changed, and literal_eval is
transforming the string representation of integers to integers and
confusing the driver.

Closes scylladb/scylladb#19696

(cherry picked from commit d5af86bd8a)
2024-09-11 22:55:22 +03:00
Nadav Har'El
79879be753 Merge 'config: round-trip boolean configuration variables' from Avi Kivity
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.

Not a regression, so a backport isn't strictly necessary.

Closes scylladb/scylladb#19792

* github.com:scylladb/scylladb:
  config: specialize from-string conversion for bool
  config: wrap boost::lexical_cast<> when converting from strings

(cherry picked from commit 9eb47b3ef0)
2024-09-11 22:55:22 +03:00
Kamil Braun
553174251b test: test_raft_no_quorum: increase raft timeout in debug mode
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).

Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.

Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.

Fixes: scylladb/scylladb#20111
(cherry picked from commit 52fdf5b4c9)

Closes scylladb/scylladb#20478
2024-09-10 11:56:21 +03:00
Kefu Chai
cd743a4cfa docs: do not install scylla/ppa repo when perform upgrade
for following reasons:

1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.

so, in this change, we remove all references to enabling the Scylla/PPA repository.

Fixes scylladb/scylladb#20449

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit fe0e961856)

Closes scylladb/scylladb#20454
2024-09-10 11:54:09 +03:00
Nadav Har'El
44b1fe628c alternator ttl: fix use-after-free
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.

The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.

Fixes #19988

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 15f8046fcb)

Closes scylladb/scylladb#20437
2024-09-10 11:51:08 +03:00
Kefu Chai
1e35328161 sstables: correct the debugging message printed when removing temp dir
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.

in this change, the related change is reverted.

Fixes scylladb/scylladb#20408

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit d26bb9ae30)

Closes scylladb/scylladb#20435
2024-09-10 11:50:01 +03:00
Takuya ASADA
46458226bb install.sh: fix more incorrect permission on strict umask
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.

To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.

Fixes #14383
Related #19775

(cherry picked from commit 9d7fed40b5)

Closes scylladb/scylladb#20433
2024-09-10 11:47:41 +03:00
Kefu Chai
6fdb124914 dist: drop %pretrans section
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans

> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.

but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.

in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.

Fixes scylladb/scylladb#20321

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 6970c502c9)

Closes scylladb/scylladb#20386
2024-09-10 11:46:30 +03:00
Avi Kivity
093bff385b docs: cql: document ZstdCompressor for CREATE TABLE
Adjust the wording slightly to be less awkward.

(cherry picked from commit 60acfd8c08)

Closes scylladb/scylladb#20381
2024-09-10 11:45:55 +03:00
Raphael S. Carvalho
a4f6811d5f storage_service: avoid processing same table unnecessarily in split monitor
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.

during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 26facd807e)

Closes scylladb/scylladb#20344
2024-09-10 11:45:06 +03:00
Botond Dénes
8d5f2d8943 Merge '[Backport 6.0] repair: throw if batchlog manager isn't initialized' from Aleksandra Martyniuk
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Throw if batchlog manager isn't initialized.

Fixes: https://github.com/scylladb/scylladb/issues/20236.

Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.

(cherry picked from commit d8e4393418)

(cherry picked from commit f38bb6483a)

Refs https://github.com/scylladb/scylladb/pull/20251

Closes scylladb/scylladb#20392

* github.com:scylladb/scylladb:
  test: add test to ensure repair won't fail with uninitialized bm
  repair: throw if batchlog manager isn't initialized
2024-09-09 15:13:38 +03:00
Jenkins Promoter
0e5108ed7f Update ScyllaDB version to: 6.0.4 2024-09-04 15:34:01 +03:00
Kamil Braun
96064e9647 Merge '[Backport 6.0] Fix node replace with inter-dc encryption enabled.' from Gleb Natapov
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.

The series adds the test for this scenario and the fix for the
chicken&egg problem above.

The series (or at least the fix itself) needs to be backported because
this is a serious regression.

Fixes: https://github.com/scylladb/scylladb/issues/19025

(cherry picked from commit 84757a4ed3)

(cherry picked from commit b98282a976)

(cherry picked from commit 2f1b1fd45e)

(cherry picked from commit 17f4a151ce)

(cherry picked from commit 32a59ba98f)

Refs https://github.com/scylladb/scylladb/pull/20290

Closes scylladb/scylladb#20399

* github.com:scylladb/scylladb:
  topology coordinator: fix indentation after the last patch
  topology coordinator: do not add replacing node without a ring to topology
  test: add test for replace in clusters with encryption enabled
  test.py: add server encryption support to cluster manager
  .gitignore: fix pattern for resources to match only one specific directory
2024-09-03 12:24:35 +02:00
Gleb Natapov
8400e6947b topology coordinator: fix indentation after the last patch
(cherry picked from commit 32a59ba98f)
2024-09-02 17:04:42 +03:00
Gleb Natapov
8510568eda topology coordinator: do not add replacing node without a ring to topology
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.

Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.

To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.

The patch effectively reverts b8ee8911ca

Fixes: scylladb/scylladb#19025
(cherry picked from commit 17f4a151ce)
2024-09-02 17:04:42 +03:00
Gleb Natapov
cd324b8513 test: add test for replace in clusters with encryption enabled
(cherry picked from commit 2f1b1fd45e)
2024-09-02 17:04:42 +03:00
Gleb Natapov
d441d93e63 test.py: add server encryption support to cluster manager
(cherry picked from commit b98282a976)
2024-09-02 17:04:42 +03:00
Gleb Natapov
84c47df5e3 .gitignore: fix pattern for resources to match only one specific directory
(cherry picked from commit 84757a4ed3)
2024-09-02 15:21:11 +03:00
Aleksandra Martyniuk
ca3cbae70b test: add test to ensure repair won't fail with uninitialized bm
(cherry picked from commit f38bb6483a)
2024-09-02 10:37:18 +02:00
Aleksandra Martyniuk
3e25eadf12 repair: throw if batchlog manager isn't initialized
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.

Throw if batchlog manager isn't initialized. That won't cause repair
to fail.

(cherry picked from commit d8e4393418)
2024-08-30 13:55:48 +00:00
Laszlo Ersek
f765591886 generic_server: make server::stop() idempotent
After server::shutdown(), make server::stop() more robust too, by allowing
callers (internal or external) to call it several times (not concurrently
though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 49bff3b1ab)
2024-08-30 15:33:00 +02:00
Laszlo Ersek
f17516c4c1 generic_server: coroutinize server::shutdown()
By turning server::shutdown() into a coroutine, we need not dynamically
allocate "nr_conn".

Verified as follows:

(1) In terminal #1:

    build/Dev/scylla --overprovisioned --developer-mode=yes \
        --memory=2G --smp=1 --default-log-level error \
        --logger-log-level cql_server=debug:cql_server_controller=debug

> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:9042 (unencrypted,
>                                     non-shard-aware)
> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:19042 (unencrypted,
>                                     shard-aware)

(2) In terminals #2 and #3:

    tools/cqlsh/bin/cqlsh.py

(3) Press ^C in terminal #1:

> DEBUG [...] cql_server - abort accept nr_total=2
> DEBUG [...] cql_server - abort accept 1 out of 2 done
> DEBUG [...] cql_server - abort accept 2 out of 2 done
> DEBUG [...] cql_server - shutdown connection nr_total=4
> DEBUG [...] cql_server - shutdown connection 1 out of 4 done
> DEBUG [...] cql_server - shutdown connection 2 out of 4 done
> DEBUG [...] cql_server - shutdown connection 3 out of 4 done
> DEBUG [...] cql_server - shutdown connection 4 out of 4 done
> INFO  [...] cql_server_controller - CQL server stopped

This patch is best viewed with "git show --word-diff=color".

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 1138347e7e)
2024-08-30 15:32:55 +02:00
Laszlo Ersek
b053f794d7 generic_server: make server::shutdown() idempotent
Make server::shutdown() more robust by allowing callers (internal or
external) to call it several times (not concurrently though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 2216275ebd)
2024-08-30 15:32:49 +02:00
Laszlo Ersek
272c409b26 test/generic_server: add test case
Check whether we can stop a generic server without first asking it to
listen.

The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:

> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us

Backport notes for 6.0:

- Replace

    #include "utils/assert.hh"
    SCYLLA_ASSERT(false);

  with

    #include <cassert>
    assert(false);

  due to 6.0 lacking commit aa1270a00c ("treewide: change assert() to
  SCYLLA_ASSERT()", 2024-08-05). The header file "utils/assert.hh"
  wouldn't be difficult to backport, but separating it from the treewide
  changes in commit aa1270a00c might not be the best idea.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit dbc0ca6354)
2024-08-30 15:22:18 +02:00
Laszlo Ersek
5490092abf configure, cmake: sort the lists of boost unit tests
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.

Backport notes for 6.0:

- Conflicts in "configure.py" and "test/boost/CMakeLists.txt",
  unsurprisingly. For the backport, I sorted the boost unit test list in
  each file manually, from scratch.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 931f2f8d73)
2024-08-30 14:43:26 +02:00
Laszlo Ersek
58695724b6 generic_server: convert connection tracking to seastar::gate
If we call server::stop() right after "server" construction, it hangs:

With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,

    co_await _all_connections_stopped.get_future();

at the end of server::stop() deadlocks.

Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when

- cserver->start() (sharded<cql_server>::start()) constructs a
  "server"-derived object,

- start_listening_on_tcp_sockets() throws an exception before reaching
  listen_on_all_shards() (for example because it fails to set up client
  encryption -- certificate file is inaccessible etc.),

- the "deferred_action"

      cserver->stop().get();

  is invoked during cleanup.

(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)

Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.

NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:

- server::shutdown() + server::stop()

- or just server::stop().

Fixes #10305

Backport notes for 6.0:

- Conflict in "generic_server.hh", due to 6.0 not having commit
  324b3c43c0 ("generic_server: use async function in
  `for_each_gently()`", 2024-08-08), which is part of the feature series
  "service levels: update connections parameters automatically"
  <https://github.com/scylladb/scylladb/pull/19085>.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 5a04743663)
2024-08-30 14:12:47 +02:00
Botond Dénes
e33fcfe27b Merge '[Backport 6.0] Make Summary support histogram with infinite bucket vlaues' from ScyllaDB
This series fixes an issue where histogram Summaries return an infinite value.

It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.

The series adds a test for summaries with a specific test case for this scenario.

Fixes #20255
Need backport to 6.0, 6.1 and 2023.1 and above

(cherry picked from commit 011aa91a8c)

(cherry picked from commit 644e6f0121)

 Refs #20257

Closes scylladb/scylladb#20304

* github.com:scylladb/scylladb:
  test/estimated_histogram_test Add summary tests
  utils/histogram.hh: Make summary support inifinite bucket.
2024-08-29 07:52:36 +03:00
Botond Dénes
0020d37a20 Merge '[Backport 6.0] repair: do_rebuild_replace_with_repair: use source_dc only when safe' from ScyllaDB
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.

Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.

Fixes #16826

Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.

* Requires backport to live versions.  Issue hit in the filed with 2022.2.14

(cherry picked from commit 8b1877f3ca)

(cherry picked from commit 0419b1d522)

(cherry picked from commit b5d0ab092c)

(cherry picked from commit 9729dd21c3)

(cherry picked from commit 8665eef98c)

(cherry picked from commit 5f655e41e3)

 Refs #16827

Closes scylladb/scylladb#20229

* github.com:scylladb/scylladb:
  raft_rebuild: propagate source_dc force option to rebuild_option
  repair: do_rebuild_replace_with_repair: use source_dc only when safe
  repair: replace_with_repair: pass the replace_node downstream
  repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
  repair: replace_rebuild_with_repair: pass ks_erms from caller
  nodetool: rebuild: add force option
  Add and use utils::optional_param to pass source_dc
2024-08-29 07:36:39 +03:00
Botond Dénes
97fe5213f7 Merge '[Backport 6.0] schema_tables: calculate_schema_digest: prevent stalls due to large m…' from ScyllaDB
…utations vector

With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.

Fixes #18173

(cherry picked from commit 95a5fba0ea)

(cherry picked from commit 52234214e5)

 Refs #18174

Closes scylladb/scylladb#20247

* github.com:scylladb/scylladb:
  schema_tables: calculate_schema_digest: filter the key earlier
  schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
2024-08-29 07:33:28 +03:00
Benny Halevy
81f4036143 raft_rebuild: propagate source_dc force option to rebuild_option
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.

This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.

\Fixes scylladb/scylladb#20242

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

\Closes scylladb/scylladb#20249

(cherry picked from commit 18c45f7502)

Closes scylladb/scylladb#20312
2024-08-28 12:12:15 +03:00
Benny Halevy
18644625a5 repair: do_rebuild_replace_with_repair: use source_dc only when safe
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.

This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
   lost in that DC (replaced/rebuilt node + additional ignored nodes)
   is greater than 1, or it has 1 lost node and rf <= 1 in the DC.

b. If the source_dc arg is forced, as with the new
   `nodetool rebuild --force <source_dc>` option,
   we use it anyway, even if it's considered to be unsafe.
   A warning is printed in this case.

c. If the source_dc arg is user-provided, (using nodetool rebuild),
   an error exception is thrown, advising to use an alternative dc,
   if available, omit source_dc to sync with all nodes, or use the
   --force option to use the given source_dc anyhow.

d. Otherwise, we look for an alternative source datacenter,
   that has not lost any node. If such datacenter is found
   we use it as source_dc for the keyspace, and log a warning.

e. If no alternative dc is found (and source_dc is implicit), then:
   log a warning and fall back to using replicas from all nodes in the cluster.

Fixes #16826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5f655e41e3)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-28 12:12:15 +03:00
Benny Halevy
453f4b0277 repair: replace_with_repair: pass the replace_node downstream
To be used by the next path to count how many nodes
are lost in each datacenter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8665eef98c)
2024-08-28 12:12:15 +03:00
Benny Halevy
ee9202ac1b repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,

Refs scylladb/scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9729dd21c3)
2024-08-28 12:12:15 +03:00
Benny Halevy
53abb7fdcc repair: replace_rebuild_with_repair: pass ks_erms from caller
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.

Note that it's already done on the rebuild path
for streaming based rebuild.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b5d0ab092c)
2024-08-28 12:12:15 +03:00
Benny Halevy
7d63c9c62b nodetool: rebuild: add force option
To be used to force usage of source_dc, even
when it is unsafe for rebuild.

Update docs and add test/nodetool/test_rebuild.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0419b1d522)
2024-08-28 12:12:11 +03:00
Benny Halevy
4443a21c2a Add and use utils::optional_param to pass source_dc
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.

This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8b1877f3ca)
2024-08-28 12:02:20 +03:00
Botond Dénes
c945adcb67 Merge '[Backport 6.0] select from mutation_fragments() + tablets: handle reads for non-owned partitions' from ScyllaDB
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.

Fixes: https://github.com/scylladb/scylladb/issues/18786

Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1

(cherry picked from commit de5329157c)

(cherry picked from commit 46563d719f)

(cherry picked from commit 4e2d7aa2a2)

 Refs #20109

Closes scylladb/scylladb#20314

* github.com:scylladb/scylladb:
  test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
  replica/mutation_dump: enfore pinning of effective replication map
  replica/mutation_dump: handle un-owned tokens (with tablets)
2024-08-28 06:40:04 +03:00
Lakshmi Narayanan Sreethar
d3b41635de test/pylib: fix keyspace_compaction method
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.

Fixes #20264

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4823a1e203)

Closes scylladb/scylladb#20275
2024-08-28 06:38:44 +03:00
Michał Chojnowski
0cee97f5c7 cql_test_env: ensure shutdown() before stop() for system_keyspace
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.

Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.

Fixes scylladb/scylla-enterprise#4380

(cherry picked from commit 4d77faa61e)

Closes scylladb/scylladb#20147
2024-08-28 06:34:46 +03:00
Lakshmi Narayanan Sreethar
3f23780650 boost/sstable_datafile_test: wait for total memory reclaimed update
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.

Fixes #19879

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#19883

(cherry picked from commit 27b305b9d1)

Closes scylladb/scylladb#19963
2024-08-28 06:33:29 +03:00
Benny Halevy
e06be56c28 sstable_directory: delete_atomically: allow sstables from multiple prefixes
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
   single tablet replica

In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.

Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes.  This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.

Refs scylladb/scylladb#18862

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 26abad23d9)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#19920
2024-08-28 06:32:12 +03:00
Aleksandra Martyniuk
5276701881 test: tasks: adjust tests to new wait_task behavior
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.

Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.

Fixes: #19351.
(cherry picked from commit dfe3af40ed)

Closes scylladb/scylladb#19840
2024-08-28 06:29:38 +03:00
Łukasz Paszkowski
9698ebe975 api/system: add highest_supported_sstable_format path
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.

Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.

This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.

Fixes scylladb/scylladb#19772

Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.

(cherry picked from commit 781eb7517c)

Closes scylladb/scylladb#19815
2024-08-28 06:28:55 +03:00
Tomas Nozicka
37a0e149ae Allow configuring default loglevel with args for container images
(cherry picked from commit 26466a3043)

Closes scylladb/scylladb#19702
2024-08-28 06:28:10 +03:00
Avi Kivity
9d3ee6e920 config, enum_option: allow round-trip string conversion
The default configuration for replication_strategy_warn_list is
["SimpleStrategy"], but one cannot set this via CQL:

cqlsh> select * from system.config where name = 'replication_strategy_warn_list';

 name                           | source  | type                      | value
--------------------------------+---------+---------------------------+--------------------
 replication_strategy_warn_list | default | replication strategy list | ["SimpleStrategy"]

(1 rows)
cqlsh> update system.config set value  = '[NetworkTopologyStrategy]' where name = 'replication_strategy_warn_list';
cqlsh> select * from system.config where name = 'replication_strategy_warn_list';

 name                           | source | type                      | value
--------------------------------+--------+---------------------------+-----------------------------
 replication_strategy_warn_list |    cql | replication strategy list | ["NetworkTopologyStrategy"]

(1 rows)
cqlsh> update system.config set value  = '["NetworkTopologyStrategy"]' where name = 'replication_strategy_warn_list';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for system.config - received 0 responses and 1 failures from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1}

Fix by allowing quotes in enum_set parsing.

Bug present since 8c464b2ddb ("guardrails: restrict
replication strategy (RS)", 6.0).

Fixes #19604.

(cherry picked from commit 45e27c0da2)

Closes scylladb/scylladb#19691
2024-08-28 06:27:08 +03:00
Yaron Kaikov
c20f5c8408 .github/script/label_promoted_commit.py: add label only if ref is PR
we got a failure during check-commit action:
```
Run python .github/scripts/label_promoted_commits.py --commit_before_merge 30e82a81e8 --commit_after_merge f31d5e3204 --repository scylladb/scylladb --ref refs/heads/master

Commit sha is: d5a149fc01
Commit sha is: 415457be2b
Commit sha is: d3b1ccd03a
Commit sha is: 1fca341514
Commit sha is: f784be6a7e
Commit sha is: 80986c17c3
Commit sha is: 492d0a5c86
Commit sha is: 7b3f55a65f
Commit sha is: 78d6471ce4
Commit sha is: 7a69d9070f
Commit sha is: a9e985fcc9
master branch, pr number is: 19213
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/label_promoted_commits.py", line 87, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/label_promoted_commits.py", line 81, in main
    pr = repo.get_pull(pr_number)
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 2746, in get_pull
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/pulls/pulls#get-a-pull-request", "status": "404"}
Error: Process completed with exit code 1.
```

The reason for this failure is since in one of the promoted commits
(a9e985fcc9) had a reference of `Closes`
to an issue.

Fixes: https://github.com/scylladb/scylladb/issues/19677
(cherry picked from commit e33126fc3e)

Closes scylladb/scylladb#19690
2024-08-28 06:26:24 +03:00
Botond Dénes
c33556c0f1 Merge '[Backport 6.0] conf: scylla.yaml: update documentation for tablets' from ScyllaDB
Tablets are no longer in experimental_features since 83d491a, so remove them from the experimental_features section documentation.

Also, expand the documentation for the `enable_tablets` option.

Fixes #19456

Needs backport to 6.0

(cherry picked from commit 92f8d219b3)

(cherry picked from commit 7f05f95ec4)

 Refs #19516

Closes scylladb/scylladb#19688

* github.com:scylladb/scylladb:
  conf: scylla.yaml: enable_tablets: expand documentation
  conf: scylla.yaml: remove tablets from experimental_features doc comment
2024-08-28 06:25:26 +03:00
Pavel Emelyanov
e50a8ad209 test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace

[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.

refs: #18786

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4e2d7aa2a2)
2024-08-27 23:43:14 +00:00
Botond Dénes
12bd371152 replica/mutation_dump: enfore pinning of effective replication map
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.

(cherry picked from commit 46563d719f)
2024-08-27 23:43:14 +00:00
Botond Dénes
431d4740e3 replica/mutation_dump: handle un-owned tokens (with tablets)
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.

(cherry picked from commit de5329157c)
2024-08-27 23:43:14 +00:00
Aleksandra Martyniuk
9b5d33ac4f replica: add/remove table atomically
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.

Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.

Analogically for remove_table.

Fixes: #19833.
(cherry picked from commit 483d89ed6d)

Closes scylladb/scylladb#20245
2024-08-27 20:47:55 +03:00
Amnon Heiman
043855c574 test/estimated_histogram_test Add summary tests
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.

The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.

Relates to #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 644e6f0121)
2024-08-27 12:12:39 +00:00
Amnon Heiman
615101bd90 utils/histogram.hh: Make summary support inifinite bucket.
This patch handles an edge cases related to The infinite bucket  
limit.

Summaries are the P50, P95, and P99 quantiles.

The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.

In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;

with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.

Fixes #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 011aa91a8c)
2024-08-27 12:12:39 +00:00
Botond Dénes
0509548dbc Update tools/java submodule
* tools/java 6dfc187a...c0735a9d (1):
  > cassandra-stress: Make default repl. strategy NetworkTopologyStrategy

Fixes: scylladb/scylla-tools-java#400

Closes scylladb/scylladb#20200
2024-08-27 07:39:39 +03:00
Benny Halevy
b7de15dc60 schema_tables: calculate_schema_digest: filter the key earlier
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.

This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 52234214e5)
2024-08-22 09:06:28 +00:00
Benny Halevy
7e7a8e44b0 schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.

Fixes #18173

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 95a5fba0ea)
2024-08-22 09:06:28 +00:00
Anna Stuchlik
c8bbfbecad doc: extract the info about tablets defaut to a separate file
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.

The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.

Refs https://github.com/scylladb/scylla-enterprise/issues/4585

(cherry picked from commit 107708434c)

Closes scylladb/scylladb#20222
2024-08-21 11:07:57 +03:00
Raphael S. Carvalho
b8099b9e83 compaction: Allow "offline" sstable to be split
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.

It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.

Refs #19378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 239344ab55)
2024-08-20 10:39:31 +00:00
Avi Kivity
3a67423ac9 Merge '[Backport 6.0] replica: fix copy constructor of tablet_sstable_set' from ScyllaDB
Commit 9f93dd9fa3 changed `tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and in addition, `std::set<size_t> _sstable_set_ids` was added. `_sstable_set_ids` is set up in the `tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm, const locator::tablet_map& tmap)` constructor, but it is not copied in `tablet_sstable_set(const tablet_sstable_set& o)`.

This affects the `tablet_sstable_set::tablet_sstable_set` method as it depends on the copy constructor. Since sstable set can be cloned when a new sstable set is added, the issue will cause ids not being copied into the new sstable set. It's healed only after compaction, since the sstable set is rebuilt from scratch there.

This PR fixes this issue by removing the existing copy constructor of `tablet_sstable_set` to enable the implicit default copy constructor.

Fixes #19519

(cherry picked from commit 44583eed9e)

(cherry picked from commit ec47b50859)

 Refs #20115

Closes scylladb/scylladb#20204

* github.com:scylladb/scylladb:
  boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
  replica: fix copy constructor of tablet_sstable_set
2024-08-19 21:31:33 +03:00
Michał Jadwiszczak
4e236f3392 cql3/statements/create_service_level: forbid creating SL starting with $
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.

(cherry picked from commit d729d1b272)

Closes scylladb/scylladb#20198
2024-08-19 16:20:48 +03:00
Anna Stuchlik
b127b492cf doc: fix a link on the RBAC page
This commit fixes an external link on the Role Based Access Control page.

Fixes https://github.com/scylladb/scylladb/issues/20166

(cherry picked from commit c56c3ce469)

Closes scylladb/scylladb#20203
2024-08-19 15:30:34 +03:00
Lakshmi Narayanan Sreethar
ab6b8be69a boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit ec47b50859)
2024-08-19 12:13:11 +00:00
Lakshmi Narayanan Sreethar
5d8543221b replica: fix copy constructor of tablet_sstable_set
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.

Fixes #19519

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 44583eed9e)
2024-08-19 12:13:10 +00:00
Avi Kivity
cdae15ced9 Merge '[Backport 6.0] db/view: drop view updates to replaced node marked as left' from ScyllaDB
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.

As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.

In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.

Fixes: scylladb/scylladb#19439

This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.

(cherry picked from commit 6af7882c59)

(cherry picked from commit 5ec8c06561)

 Refs #19765

Closes scylladb/scylladb#19896

* github.com:scylladb/scylladb:
  test: regression test for MV crash with tablets during decommission
  db/view: drop view updates to replaced node marked as left
2024-08-14 22:32:07 +03:00
Tzach Livyatan
0da973d50d Improve tombstone_compaction_interval description
(cherry picked from commit 6dbe6a79b5)

Closes scylladb/scylladb#20026
2024-08-14 22:27:47 +03:00
Tzach Livyatan
b9abdfb3ea Update tracing.rst - fix table node_slow_log_time name
(cherry picked from commit 03f69a6d88)

Closes scylladb/scylladb#20024
2024-08-14 22:27:24 +03:00
Aleksandra Martyniuk
6f04fa66ce repair: use find_column_family in insert_repair_meta
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.

Use find_column_family at each table occurrence in insert_repair_meta
instead.

Fixes: #20057

(cherry picked from commit 719999b34c)

Refs #19953

Closes scylladb/scylladb#20077
2024-08-14 22:21:12 +03:00
Dawid Medrek
5abd343cf9 db/hints: Make commitlog use commitlog IO scheduling group
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.

(cherry picked from commit 6a7fb18b52)

Closes scylladb/scylladb#20094
2024-08-14 22:15:28 +03:00
Avi Kivity
7b6f1a1e2f Merge '[Backport 6.0] replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho
rwlock was added to protect iterations against concurrent updates to the map.

the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).

the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time).

to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out.

Fixes https://github.com/scylladb/scylladb/issues/18821.

```
WRITE
=====

./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets --write

- BEFORE

65559.52 tps ( 59.6 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   52841 insns/op,   30946 cycles/op,        0 errors)
67408.05 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53018 insns/op,   30874 cycles/op,        0 errors)
67714.72 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53026 insns/op,   30881 cycles/op,        0 errors)
67825.57 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53015 insns/op,   30821 cycles/op,        0 errors)
67810.74 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53009 insns/op,   30828 cycles/op,        0 errors)

         throughput: mean=67263.72 standard-deviation=967.40 median=67714.72 median-absolute-deviation=547.02 maximum=67825.57 minimum=65559.52
instructions_per_op: mean=52981.61 standard-deviation=79.09 median=53014.96 median-absolute-deviation=36.54 maximum=53025.79 minimum=52840.56
  cpu_cycles_per_op: mean=30869.90 standard-deviation=50.23 median=30874.06 median-absolute-deviation=42.11 maximum=30945.94 minimum=30820.89

- AFTER
65448.76 tps ( 59.5 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   52788 insns/op,   31013 cycles/op,        0 errors)
67290.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53025 insns/op,   30950 cycles/op,        0 errors)
67646.81 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53025 insns/op,   30909 cycles/op,        0 errors)
67565.90 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53058 insns/op,   30951 cycles/op,        0 errors)
67537.32 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   52983 insns/op,   30963 cycles/op,        0 errors)

         throughput: mean=67097.93 standard-deviation=931.44 median=67537.32 median-absolute-deviation=467.97 maximum=67646.81 minimum=65448.76
instructions_per_op: mean=52975.85 standard-deviation=108.07 median=53024.55 median-absolute-deviation=49.45 maximum=53057.99 minimum=52788.49
  cpu_cycles_per_op: mean=30957.17 standard-deviation=37.43 median=30951.31 median-absolute-deviation=7.51 maximum=31013.01 minimum=30908.62

READ
=====

./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets

- BEFORE

79423.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41840 insns/op,   26820 cycles/op,        0 errors)
81076.70 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41837 insns/op,   26583 cycles/op,        0 errors)
80927.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41829 insns/op,   26629 cycles/op,        0 errors)
80539.44 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41841 insns/op,   26735 cycles/op,        0 errors)
80793.10 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41864 insns/op,   26662 cycles/op,        0 errors)

         throughput: mean=80551.99 standard-deviation=661.12 median=80793.10 median-absolute-deviation=375.37 maximum=81076.70 minimum=79423.36
instructions_per_op: mean=41842.20 standard-deviation=13.26 median=41840.14 median-absolute-deviation=5.68 maximum=41864.50 minimum=41829.29
  cpu_cycles_per_op: mean=26685.88 standard-deviation=93.31 median=26662.18 median-absolute-deviation=56.47 maximum=26820.08 minimum=26582.68

- AFTER
79464.70 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41799 insns/op,   26761 cycles/op,        0 errors)
80954.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41803 insns/op,   26605 cycles/op,        0 errors)
81160.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41811 insns/op,   26555 cycles/op,        0 errors)
81263.10 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41814 insns/op,   26527 cycles/op,        0 errors)
81162.97 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41806 insns/op,   26549 cycles/op,        0 errors)

         throughput: mean=80801.25 standard-deviation=755.54 median=81160.90 median-absolute-deviation=361.72 maximum=81263.10 minimum=79464.70
instructions_per_op: mean=41806.47 standard-deviation=5.85 median=41806.05 median-absolute-deviation=4.05 maximum=41813.86 minimum=41799.36
  cpu_cycles_per_op: mean=26599.22 standard-deviation=94.84 median=26554.54 median-absolute-deviation=50.51 maximum=26761.06 minimum=26527.05
```
(cherry picked from commit ad5c5bca5f)

(cherry picked from commit c539b7c861)

Refs https://github.com/scylladb/scylladb/pull/19469

Closes scylladb/scylladb#19808

* github.com:scylladb/scylladb:
  replica: Fix race between split compaction and migration
  replica: remove rwlock for protecting iteration over storage group map
  replica: remove linear search when picking memtable_list for range scan with tablets
  replica: get rid of fragile compaction group intrusive list
2024-08-14 21:24:21 +03:00
Avi Kivity
2c7ce97af2 Merge '[Backport 6.0] Prevent ALTERing non-existing KS with tablets' from ScyllaDB
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.

The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).

Fixes: #19576

Requires 6.0 backport

(cherry picked from commit 5b089d8e10)

(cherry picked from commit 0ea2128140)

(cherry picked from commit ddb5204929)

 Refs #19666

Closes scylladb/scylladb#20144

* github.com:scylladb/scylladb:
  tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
  cql: refactor rf_change indentation
  Prevent ALTERing non-existing KS with tablets
2024-08-14 20:50:42 +03:00
Piotr Smaron
b10bf17df7 tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.

(cherry picked from commit ddb5204929)
2024-08-14 10:37:59 +00:00
Piotr Smaron
d0ceaa01ee cql: refactor rf_change indentation
(cherry picked from commit 0ea2128140)
2024-08-14 10:37:59 +00:00
Piotr Smaron
352c159460 Prevent ALTERing non-existing KS with tablets
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
   to execute this req,
2. global topo req is executed by topo coordinator, which reads data
   attached to the req.

The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126

Fixes: #19576
(cherry picked from commit 5b089d8e10)
2024-08-14 10:37:59 +00:00
Aleksandra Martyniuk
ee18de71b4 repair: drop timeout from table_sync_and_check
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.

Fixes: #18490.
(cherry picked from commit f947cc5477)

Refs #18752

Closes scylladb/scylladb#20130
2024-08-14 11:52:40 +02:00
Raphael S. Carvalho
721f0ff46d replica: Fix race between split compaction and migration
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.

Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.

The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.

The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
   --------
   seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::(anonymous namespace)::thread_wake_task
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...

From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.

Fixes: #19873.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 5af1f41ecd)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
31451ec2a0 replica: remove rwlock for protecting iteration over storage group map
rwlock was added to protect iterations against concurrent updates to the map.

the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).

the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).

to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.

Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.

Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.

Fixes #18821.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c539b7c861)
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
f6da439ac6 replica: remove linear search when picking memtable_list for range scan with tablets
with tablets, we're expected to have a worst of ~100 tablets in a given
table and shard, so let's avoid linear search when looking for the
memtable_list in a range scan. we're bounded by ~100 elements, so
shouldn't be a big problem, but it's an inefficiency we can easily
get rid of.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#19286

(cherry picked from commit f143f5b90d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
39eb44dfa0 replica: get rid of fragile compaction group intrusive list
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.

Fixes #18506.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
2024-08-13 12:26:11 -03:00
Kamil Braun
a56f7ce21a storage_service: raft topology: warn when raft_topology_cmd_handler fails due to abort
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.

Turn exceptions from aborts into WARN level.

Also improve logging by printing the command that failed.

Fixes scylladb/scylladb#19754

(cherry picked from commit 7506709573)

Closes scylladb/scylladb#20072
2024-08-08 18:14:24 +02:00
Kamil Braun
4948029666 raft topology: improve logging
Add more logging for raft-based topology operations in INFO and DEBUG
levels.

Improve the existing logging, adding more details.

Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).

Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.

These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).

Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)

Closes scylladb/scylladb#20049
2024-08-08 11:59:34 +03:00
Tomasz Grabiec
89a93a784e tablets: Do not allocate tablets on nodes being decommissioned
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.

This also violates invariants about replica placement and this state
cannot be fixed by topology operations.

One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:

  load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at:  ...

Fixes #20032

(cherry picked from commit f5c74a5df2)

Closes scylladb/scylladb#20067
2024-08-08 11:57:09 +03:00
Dawid Medrek
d065d6f05d db/hints: Log when ignoring invalid hint directories
In 58784cd, aa4b06a and other commits migrating
hinted handoff from IPs to host IDs (scylladb/scylladb#15567),
we started ignoring hint directories of invalid names,
i.e. those that represent neither an IP address, nor a host ID.
They remain on disk and are taken into account while computing
e.g. the total size of hints, but they're not used in any way.

These changes add logs informing the user when Scylla
encounters such a directory.

Closes scylladb/scylladb#17566

(cherry picked from commit a5528a2093)

Closes scylladb/scylladb#19892
2024-08-07 10:55:06 +02:00
Michael Litvak
ccd01caed8 db: fix waiting for counter update operations on table stop
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.

Fixes scylladb/scylla-enterprise#4475

Closes scylladb/scylladb#20017
2024-08-05 12:52:32 +02:00
Piotr Dulikowski
78e3f0f208 Merge '[Backport 6.0] hinted handoff: migrate sync point to host ID ' from Dawid Mędrek
Change the format of sync points to use host ID instead of IPs, to be consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores host IDs instead of IPs.
The decoding supports both formats with host IDs and IPs, so a sync point contains now a variant of either types, and in the case of new type the translation is avoided.

Fixes scylladb/scylladb#18653

(cherry picked from commit scylladb/scylladb@b824e73)
(cherry picked from commit scylladb/scylladb@afc9a1a)
(cherry picked from commit scylladb/scylladb@c56de90)
(cherry picked from commit scylladb/scylladb@222dbf2)

In scylladb/scylladb#18733, we were experiencing a test failure
because the test code was receiving the reply `"DONE"` instead of
`"IN_PROGRESS"` when awaiting a sync point. The cluster consisted
of two nodes and the last few steps of the test that are relevant were:

1. Stop node 2.
2. Enable an error injection on node 1 to prevent it from sending hints.
3. Perform mutations on node 1 leading it to save hints towards node 2.
4. Start node 2 again.
5. Create a sync point on node 1.
6. Decommission node 2.
7. Await the created sync point on node 1.

Decommissioning node 2 led to node 1 trying to drain hints saved
towards it. However, due to the error injection, the draining process
was stuck and never finished. Because of that, when node 1 received
a request to await the sync point, the hint endpoint manager
corresponding to node 2 was still present -- all of that was expected
by the test.

What was unexpected by the test was the fact that now that hinted
handoff has started identifying nodes by their host IDs, but sync points
themselves still used IP addresses internally, there had to be a point
in the code where mapping one data type to the other would happen.
That place in the code is `manager::wait_for_sync_point()`.

When a node is decommissioned/removed, its host ID--IP mapping
is removed from the locator::token_metadata. Since node 2 had been
decommissioned, we no longer had access to the mapping we needed
and so the code used the "default" replay position, which, when compared,
is smaller than any other replay position except for itself.

Because of that, Scylla thought that all of the hints corresponding to
the sync point it got had been replayed and returned `"DONE"` to
the test's code, effectively leading to its failure.

These changes prevent that from happening as we start using host IDs
in the internal format used by sync points. Similar failures might still
occur if a sync point is created before the migration to host-ID-based
hinted handoff takes place, but awaited only after the migration.
However, the chances that that would happen are quite slim. The test
itself should proceed without any failures now.

Fixes scylladb/scylladb#18733

Closes scylladb/scylladb#19967

* github.com:scylladb/scylladb:
  test/boost: include test/lib/test_utils.hh
  test/boost/hint_test.cc: Add missing parse() callback
  db/hints: migrate sync point to host ID
  db/hints: rename sync point structures with _v1 suffix to _v1_v2
2024-08-05 09:46:48 +02:00
Kefu Chai
e1dab2779d test/boost: include test/lib/test_utils.hh
this change was created in the same spirit of 505900f18f. because
we are deprecating the operator<< for vector and unorderd_map in
Seastar, some tests do not compile anymore if we disable these
operators. so to be prepared for the change disabling them, let's
include test/lib/test_utils.hh for accessing the printer dedicated
for Boost.test. and also '#include <fmt/ranges.h>' when necessary,
because, in order to format the ranges using {fmt}, we need to
use fmt/ranges.h.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-02 15:04:34 +02:00
Kamil Braun
7a2396867b Merge '[Backport 6.0] raft: fix the shutdown phase being stuck' from Emil Maskovsky
Some of the calls inside the raft_group0_client::start_operation() method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the ownership of the raft group semaphore (get_units(semaphore)) - these waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes #19223

(cherry picked from commit 2dbe9ef2f2)

(cherry picked from commit 5dfc50d354)

Refs #19860

Closes scylladb/scylladb#19985

* github.com:scylladb/scylladb:
  raft: fix the shutdown phase being stuck
  raft: use the abort source reference in raft group0 client interface
2024-08-02 11:25:37 +02:00
Emil Maskovsky
b99d87863d raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-08-01 19:37:02 +02:00
Emil Maskovsky
0770069dda raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-01 19:36:00 +02:00
Dawid Medrek
13183069f7 test/boost/hint_test.cc: Add missing parse() callback
Before these changes, compilation was failing with the following
error:

In file included from test/boost/hint_test.cc:12:
/usr/include/fmt/ranges.h:298:7: error: no member named 'parse' in 'fmt::formatter<db::hints::sync_point::host_id_or_addr>'
  298 |     f.parse(ctx);
      |     ~ ^

We add the missing callback.

Closes scylladb/scylladb#19375
2024-08-01 14:49:36 +02:00
Michael Litvak
df0503afd6 db/hints: migrate sync point to host ID
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
2024-07-31 18:00:28 +02:00
Michael Litvak
42ee9f9e59 db/hints: rename sync point structures with _v1 suffix to _v1_v2
rename sync point types and variables to have v1/v2 suffix according to
their use.
2024-07-31 17:59:08 +02:00
Kamil Braun
9572674f25 docs: extend "forbidden operations" section for Raft-topology upgrade
The Raft-topology upgrade procedure must not be run concurrently with
version upgrade.

(cherry picked from commit bb0c3cdc65)

Closes scylladb/scylladb#19837
2024-07-29 16:53:01 +02:00
Tomasz Grabiec
416cbafd16 Merge '[Backport 6.0] sstables: fix some mixups between the writer's schema and the sstable's schema' from Michał Chojnowski
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.

Fixes scylladb/scylladb#16065

The bug is present in all supported versions, so the patch has to be backported to all of them.

(cherry picked from commit a1834efd82)

(cherry picked from commit d10b38ba5b)

(cherry picked from commit 1a8ee69a43)

Refs scylladb/scylladb#19695

Closes scylladb/scylladb#19877

* github.com:scylladb/scylladb:
  sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
  sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
  sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
2024-07-29 15:36:52 +02:00
Jenkins Promoter
36cb61589d Update ScyllaDB version to: 6.0.3 2024-07-29 15:21:14 +03:00
Piotr Dulikowski
dec02b38ae test: regression test for MV crash with tablets during decommission
Regression test for scylladb/scylladb#19439.

Co-authored-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 5ec8c06561)
2024-07-26 14:02:51 +00:00
Piotr Dulikowski
a1436f1ce2 db/view: drop view updates to replaced node marked as left
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.

As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.

Fixes: scylladb/scylladb#19439
(cherry picked from commit 6af7882c59)
2024-07-26 14:02:51 +00:00
Takuya ASADA
fefa76bffc scylla_raid_setup: install update-initramfs when it's not available
scylla_raid_setup may fail on Ubuntu minimal image since it calls
update-initramfs without installing.

(cherry picked from commit b6dedf1ee1)

Closes scylladb/scylladb#19871
2024-07-25 13:58:11 +03:00
Michał Chojnowski
43ba44ce97 sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.

(cherry picked from commit 1a8ee69a43)
2024-07-25 12:23:58 +02:00
Michał Chojnowski
d6d3a91283 sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.

This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.

(cherry picked from commit d10b38ba5b)
2024-07-25 12:23:58 +02:00
Michał Chojnowski
0555d4c30b sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
As of this patch, those static_casts are actually invalid in some cases
(they cast to the wrong type) because of an oversight.
A later patch will fix that. But to even write a reliable reproducer
for the problem, we must force the invalid casts to manifest as a crash
(instead of weird results).

This patch both allows writing a reproducer for the bug and serves
as a bit of defensive programming for the future.

(cherry picked from commit a1834efd82)

# Conflicts:
#	sstables/sstables.cc
2024-07-25 12:23:58 +02:00
Nadav Har'El
af39675c38 alternator: fix "/localnodes" to not return nodes still joining
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.

The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.

The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().

But the hard part of this test is the testing:

1. An existing multi-node test for "/localnodes" assummed that right after
   a new node was created, it appears on "/localnodes". But after this
   patch, it may take a bit more time for the bootstrapping to complete
   and the new node to appear in /localnodes - so I had to add a retry loop.

2. I added a test that reproduces the bug fixed here, and verifies its
   fix. The test is in the multi-node topology framework. It adds an
   injection which delays the bootstrap, which leaves a new node in JOINING
   state for a long time. The test then verifies that the new node is
   alive (as checked by the REST API), but is not returned by "/localnodes".

3. The new injection for delaying the bootstrap is unfortunately not
   very pretty - I had to do it in three places because we have several
   code paths of how bootstrap works without repair, with repair, without
   Raft and with Raft - and I wanted to delay all of them.

Fixes #19694.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19725

(cherry picked from commit bac7c33313)
(deleted test for cherry-pick)
2024-07-24 11:29:36 +03:00
Lakshmi Narayanan Sreethar
3c1fd843c8 [Backport 6.0]: sstables: do not reload components of unlinked sstables
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes https://github.com/scylladb/scylladb/issues/19722

Closes scylladb/scylladb#19717

* github.com:scylladb/scylladb:
  boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
  sstables: do not reload components of unlinked sstables
  sstables/sstables_manager: introduce on_unlink method

(cherry picked from commit 591876b44e)

Backported from #19717 to 6.0

Closes scylladb/scylladb#19830
2024-07-23 23:16:53 +03:00
Piotr Dulikowski
9cc20e7c4d Merge '[Backport 6.0] schema: fix describe of indexes on collections' from ScyllaDB
If the index was created on collection (both frozen or not), its description wasn't a correct create statement.
This patch fixes the bug and includes functions like `full()`, `keys()`, `values()`, ... used to create index on collections.

Fixes scylladb/scylladb#19278

(cherry picked from commit 253feb6811)

(cherry picked from commit b65a4c66f0)

 Refs #19381

Closes scylladb/scylladb#19700

* github.com:scylladb/scylladb:
  cql-pytest/test_describe: add a test for describe indexes
  schema/schema: fix column names in index description
2024-07-22 12:33:47 +02:00
Kamil Braun
9efaca0bd2 Merge '[Backport 6.0] test: raft: fix the flaky test_raft_recovery_stuck' from ScyllaDB
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit ef3393bd36)

(cherry picked from commit a89facbc74)

 Refs #19771

Closes scylladb/scylladb#19809

* github.com:scylladb/scylladb:
  test: raft: fix the flaky `test_raft_recovery_stuck`
  test: raft: code cleanup in `test_raft_recovery_stuck`
2024-07-22 11:14:44 +02:00
Emil Maskovsky
62c9709f4a test: raft: fix the flaky test_raft_recovery_stuck
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is
fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit a89facbc74)
2024-07-20 02:17:50 +00:00
Emil Maskovsky
64d414f10a test: raft: code cleanup in test_raft_recovery_stuck
Cleaning up the imports.

(cherry picked from commit ef3393bd36)
2024-07-20 02:17:50 +00:00
Kamil Braun
f32ed716ed Merge '[Backport 6.0] Fix lwt semaphore guard accounting' from ScyllaDB
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.

Should be backported to all supported versions.

(cherry picked from commit 87beebeed0)

(cherry picked from commit 4178589826)

 Refs #19699

Closes scylladb/scylladb#19796

* github.com:scylladb/scylladb:
  test: add test to check that coordinator lwt semaphore continues functioning after locking failures
  paxos: do not signal semaphore if it was not acquired
2024-07-19 19:06:36 +02:00
Gleb Natapov
c437c8be36 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-18 15:34:17 +00:00
Gleb Natapov
1c04b95c68 paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.

Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
2024-07-18 15:34:16 +00:00
Emil Maskovsky
5649b55e08 test: raft: fix the flaky test_change_ip
The python driver might currently trigger spurios reconnects that cause
the `NoHostAvailable` to be thrown, which is not expected.

This patch adds a retry mechanism to the test to make skip this failure
if it occurs, as a work-around.

The proper fix is expected to be done in the scylladb/python-driver#295,
once fixed there this work-around can be reverted.

Fixes: scylladb/scylla#18547
(cherry picked from commit 6b9992737a)

Closes scylladb/scylladb#19773
2024-07-18 15:06:23 +02:00
Emil Maskovsky
b4745406da raft: Fix crash in leader_host API handler
The leader_host API handler was eventually using the `req` unique_ptr
after it has been already destroyed (passed down to the future lambda
by reference). This was causing an occassional crash in some tests.

Reworked the leader_host handler to use the req only outside of the
future lambda.

Also updated the code to handle the possibility that the non-default
leader group (other than Group 0) might reside on a different shard
than the shard 0 - using the same concept of calling on all shards via
`invoke_on_all()` as done for the other requests.

Fixes scylladb/scylladb#19714

(cherry picked from commit b2db8f4b9b)

Closes scylladb/scylladb#19742
2024-07-16 13:29:37 +02:00
Anna Stuchlik
27faec3015 doc: replace a link on the CDC+Kafka page
This commit replaces a link to the installation section with a link to the getting started section.

(cherry picked from commit f90867c740)

Closes scylladb/scylladb#19712
2024-07-16 13:15:45 +02:00
Emil Maskovsky
06c356df8f test: raft: fix the topology failure recovery test flakiness
Setting the error condition for all nodes in the cluster to avoid
having to check which one is the coordinator. This should make the test
more stable and avoid the flakiness observed when the coordinator node
is the one that got the error condition injected.

Randomizing the retrieved running servers to reproduce the issue more
frequently and to avoid making any assumptions about the order of the
servers.

Note that only the "raft_topology_barrier_fail" needs to run
on a non-coordinator node, the other error "stream_ranges_fail" can be
injected on any node (including the coordinator).

Fixes: #18614
(cherry picked from commit 9dbad34205)

Closes scylladb/scylladb#19708
2024-07-15 16:27:22 +02:00
Michael Litvak
815a707b0a storage_proxy: remove response handler if no targets
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.

Currently, in that case, a write response handler is created for the
request and it remains active until it timeouts because it is not
removed anywhere, even though the write is completed successfuly after
storing the hint. The response handler should be removed usually when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.

In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.

Fixes scylladb/scylladb#19529

(cherry picked from commit a9fdd0a93a)

Closes scylladb/scylladb#19680
2024-07-15 08:24:18 +02:00
Botond Dénes
16452f9cf5 Merge '[Backport 6.0] scylla-sstable: add method to load the schema from the sstable itself' from ScyllaDB
As it turns out, each sstable carries its own schema in its serialization header (Statistics component). This schema is incomplete -- the names of the key columns are not stored, just their type. Static and regular columns do have names and types stored however. This bare-bones schema is enough to parse and display the content of the sstable. Another thing missing is schema options (the stuff after the `WITH` keyword, except the clustering order). The only options stored are the compression options (in the CompressionInfo component), this is actually needed to read the Data component.

This series adds a new method to `tools/schema_loader.cc` to extract the schema stored in the sstable itself. This new schema load method is used as the last fall-back for obtaining the schema, in case scylla-sstable is trying to autodetect the schema of the sstable. Although, right now this bare-bones schema is enough for everything scylla-sstable does, it is more future proof to stick to the "full" schema if possible, so this new method is the last resort for now.

Fixes: https://github.com/scylladb/scylladb/issues/17869
Fixes: https://github.com/scylladb/scylladb/issues/18809

New functionality, no backport needed.

(cherry picked from commit 435c01d1e6)

(cherry picked from commit 0d7335dd27)

(cherry picked from commit 8f2ba03465)

(cherry picked from commit 43c44f0af5)

(cherry picked from commit 145a67f77c)

 Refs #19169

Closes scylladb/scylladb#19711

* github.com:scylladb/scylladb:
  tools/scylla-sstable: log loaded schema with trace level
  tools/scylla-sstable: load schema from the sstable as fallback
  tools/schema_loader: introduce load_schema_from_sstable()
  test/lib/random_schema: remove assert on min number of regular columns
  sstables: introduce load_metadata()
2024-07-12 16:55:44 +03:00
Botond Dénes
5d94a08250 tools/scylla-sstable: log loaded schema with trace level
The schema of the sstable can be interesting, so log it with trace
level. Unfortunately, this is not the nice CQL statement we are used to
(that requires a database object), but the not-nearly-so-nice CFMetadata
printout. Still, it is better then nothing.

(cherry picked from commit 145a67f77c)
2024-07-12 10:36:59 +00:00
Botond Dénes
4f74e6f28e tools/scylla-sstable: load schema from the sstable as fallback
When auto-detecting the schema of the sstable, if all other methods
failed, load the schema from the sstable's serialization header. This
schema is incomplete. It is just enough to parse and display the content
of the sstable. Although parsing and displaying the content of the
sstable is all scylla-sstable does, it is more future-compatible to us
the full schema when possible. So the always-available but minimal
schema that each sstable has on itself, is used just as a fallback.

The test which tested the case when all schema load attempts fail,
doesn't work now, because loading the serialization header always
succeeds. So convert this test into two positive tests, testing the
serialization header schema fallback instead.

(cherry picked from commit 43c44f0af5)
2024-07-12 10:36:59 +00:00
Botond Dénes
f42e8e872a tools/schema_loader: introduce load_schema_from_sstable()
Allows loading the schema from an sstable's serialization header. This
schema is incomplete, but it is enough to parse and display the content
of the sstable.

(cherry picked from commit 8f2ba03465)
2024-07-12 10:36:59 +00:00
Botond Dénes
f7c8c32929 test/lib/random_schema: remove assert on min number of regular columns
It is legal for a schema to have 0 regular columns, so remove the assert
on the schema specification's regular column count.

(cherry picked from commit 0d7335dd27)
2024-07-12 10:36:59 +00:00
Botond Dénes
4f165eb3e9 sstables: introduce load_metadata()
Loads just the metadata components. No validation.
Split off from load(), to allow scylla-sstable to partially load an
sstable.

(cherry picked from commit 435c01d1e6)
2024-07-12 10:36:59 +00:00
Michał Jadwiszczak
25f8fd0b5c cql-pytest/test_describe: add a test for describe indexes
(cherry picked from commit b65a4c66f0)
2024-07-11 12:59:27 +00:00
Michał Jadwiszczak
67764e7d66 schema/schema: fix column names in index description
Previously description of index didn't include functions for
indexes on collections like full(), keys(), values(), etc...

(cherry picked from commit 253feb6811)
2024-07-11 12:59:27 +00:00
Benny Halevy
8b109e5a16 conf: scylla.yaml: enable_tablets: expand documentation
The exiting documentation comment for `enable_tablets`
is very terse and lacks details about the effect of enabling
or disabling tablets.

This change adds more details about the impact of `enable_tablets`
on newly created keyspaces, and hot to disable tablets when
keyspaces are created.

Also, a note was added to warn about the irreversibility
of the tablets enablement per keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7f05f95ec4)
2024-07-10 18:03:38 +00:00
Benny Halevy
b9ad792eeb conf: scylla.yaml: remove tablets from experimental_features doc comment
tablets are no longer in experimental_features
since 83d491af02.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 92f8d219b3)
2024-07-10 18:03:38 +00:00
Tomasz Grabiec
43ff19273c Merge '[Backport 6.0] mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion' from ScyllaDB
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes https://github.com/scylladb/scylladb/issues/19552

(cherry picked from commit f784be6a7e)

(cherry picked from commit 7b3f55a65f)

(cherry picked from commit 78d6471ce4)

Refs #19617

Closes scylladb/scylladb#19675

* github.com:scylladb/scylladb:
  mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
  logalloc: add hold_reserve
  logalloc: generalize refill_emergency_reserve()
2024-07-10 14:28:01 +02:00
Michał Chojnowski
aee0150506 mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes scylladb/scylladb#19552

(cherry picked from commit 78d6471ce4)
2024-07-10 08:36:11 +00:00
Michał Chojnowski
c5c19e90ac logalloc: add hold_reserve
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.

This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.

(cherry picked from commit 7b3f55a65f)
2024-07-10 08:36:11 +00:00
Michał Chojnowski
985f5a50f6 logalloc: generalize refill_emergency_reserve()
In the next patch, we will want to do the thing as
refill_emergency_reserve() does, just with a quantity different
than _emergency_reserve_max. So we split off the shareable part
to a new function, and use it to implement refill_emergency_reserve().

(cherry picked from commit f784be6a7e)
2024-07-10 08:36:11 +00:00
Botond Dénes
ae11381d7c Merge '[Backport 6.0] reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache.
This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly.

This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.

Fixes: https://github.com/scylladb/scylladb/issues/19017

This PR backports https://github.com/scylladb/scylladb/pull/19018 and its follow-up https://github.com/scylladb/scylladb/pull/19600.

Closes scylladb/scylladb#19644

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
  test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
  test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
  reader_concurrency_semaphore: wire in the configurable cpu concurrency
  reader_concurrency_semaphore: add cpu_concurrency constructor parameter
  db/config: introduce reader_concurrency_semahore_cpu_concurrency
2024-07-10 07:23:08 +03:00
Anna Stuchlik
4ec5a06101 doc: update Scylla Doctor installation
This commit updates the instuctions on how to download and run Scylla Doctor,
following the changes in how Scylla Doctor is released.

(cherry picked from commit 2ffda9b262)

Closes scylladb/scylladb#19525
2024-07-09 14:32:21 +03:00
Anna Stuchlik
dcf4c757b2 doc: remove support for Debian 10
This PR removes support for Debian 10, which reached end of life on June 30, 2024.

Refs https://github.com/scylladb/scylla-enterprise/issues/4377

(cherry picked from commit 1f340428ea)

Closes scylladb/scylladb#19630
2024-07-09 12:55:11 +02:00
Wojciech Przytuła
a7fe9eeffd storage_proxy: fix uninitialized LWT contention counter
When debugging the issue of high LWT contention metric, we (the
drivers team) discovered that at least 3 drivers (Go, Java, Rust)
cause high numbers in that metrics in LWT workloads - we doubted that
all those drivers route LWT queries badly. We tried to understand that
metric and its semantics. It took 3 people over 10 hours to figure out
what it is supposed to count.

People from core team suspected that it was the drivers sending
requests to different shards, causing contention. Then we ran the
workload against a single node single shard cluster... and observed
contention. Finally, we looked into the Scylla code and saw it.

**Uninitialized stack value.**

The core member was shocked. But we, the drivers people, felt we always
knew it. It's yet another time that we are blamed for a server-side
issue. We rebuilt scylla with the variable initialized to 0 and the
metric kept being 0.

To prevent such errors in the future, let's consider some lints that
warn against uninitialized variables. This is such an obvious feature
of e.g. Rust, and yet this has shown to be cause a painful bug in 2024.

Fixes: scylladb/scylladb#19654
(cherry picked from commit 36a125bf97)

Closes scylladb/scylladb#19657
2024-07-09 11:41:10 +02:00
Michael Litvak
ad6eb1cadf view: drain view builder before database
The view builder is doing write operations to the database.
In order for the view builder to shutdown gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.

Fixes scylladb/scylladb#18929

(cherry picked from commit 9d9318c564)

Closes scylladb/scylladb#19636
2024-07-08 19:16:26 +02:00
Botond Dénes
dadc0c32e1 reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
Now that the CPU concurency limit is configurable, new reads might be
ready to execute right after the current one was executed. So move the
poll for admitting new reads into the inner loop, to prevent the
situation where the inner loop yields and a concurrent
do_wait_admission() finds that there are waiters (queued because at the
time they arrived to the semaphore, the _ready_list was not empty) but it
is is possible to admit a new read. When this happens the semaphore will
dump diagnostics to help debug the apparent contradiction, which can
generate a lot of log spam. Moving the poll into the inner loop prevents
the false-positive contradiction detection from firing.

Refs: scylladb/scylladb#19017

Closes scylladb/scylladb#19600

(cherry picked from commit 155acbb306)
2024-07-08 08:13:40 +03:00
Botond Dénes
88d3c2eb4b test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
(cherry picked from commit b4f3809ad2)
2024-07-08 08:13:07 +03:00
Botond Dénes
4307631950 test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).

(cherry picked from commit 9cbdd8ef92)
2024-07-08 08:12:34 +03:00
Botond Dénes
abc4a9b635 reader_concurrency_semaphore: wire in the configurable cpu concurrency
Before this patch, the semaphore was hard-wired to stop admission, if
there is even a single permit, which is in the need_cpu state.
Therefore, keeping the CPU concurrency at 1.
This patch makes use of the new cpu_concurrency parameter, which was
wired in in the last patches, allowing for a configurable amount of
concurrent need_cpu permits. This is to address workloads where some
small subset of reads are expected to be slow, and can hold up faster
reads behind them in the semaphore queue.

(cherry picked from commit 07c0a8a6f8)
2024-07-08 08:12:34 +03:00
Botond Dénes
052cef2621 reader_concurrency_semaphore: add cpu_concurrency constructor parameter
In the case of the user semaphore, this receives the new
reader_concurrency_semaphore_cpu_limit config item.
Not used yet.

(cherry picked from commit 59faa6d4ff)
2024-07-08 08:12:20 +03:00
Botond Dénes
5a7af93c7c db/config: introduce reader_concurrency_semahore_cpu_concurrency
To allow increasing the semaphore's CPU concurrency, which is currently
hard-limited to 1. Not wired yet.

(cherry picked from commit c7317be09a)
2024-07-08 08:06:28 +03:00
Pavel Emelyanov
78f3fc8890 tablet_allocator: Put more info into failed-to-drain exception
When balancer fails to find a node to balance drained tablets into, it
throws an exception with tablet id and node id, but it's also good to
know more details about the balancing state that lead to failure

refs: #19504

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c3d9831c5f)

Closes scylladb/scylladb#19619
2024-07-05 11:17:37 +03:00
None
3e06c882f0 .github: remove pull_request_template
The reason for the pr template is to explain why do we need to backport
a PR.

On release branches there is no need for it

Closes scylladb/scylladb#19615
2024-07-04 16:52:27 +03:00
Avi Kivity
c6e8a7f762 Merge '[Backport 6.0] Close output_stream in get_compaction_history() API handler' from ScyllaDB
If an httpd body writer is called with output_stream<>, it mist close the stream on its own regardless of any exceptions it may generate while working, otherwise stream destructor may step on non-closed assertion. Stepped on with different handler, see #19541

Coroutinize the handler as the first step while at it (though the fix would have been notably shorter if done with .finally() lambda)

(cherry picked from commit acb351f4ee)

(cherry picked from commit 6d4ba98796)

(cherry picked from commit b4f9387a9d)

 Refs #19543

Closes scylladb/scylladb#19603

* github.com:scylladb/scylladb:
  api: Close response stream of get_compaction_history()
  api: Flush output stream in get_compaction_history() call
  api: Coroutinize get_compaction_history inner function
2024-07-04 15:08:08 +03:00
Pavel Emelyanov
941ec80a00 api: Close response stream of get_compaction_history()
The function must close the stream even if it throws along the way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b4f9387a9d)
2024-07-03 18:30:17 +00:00
Pavel Emelyanov
ab5041cb03 api: Flush output stream in get_compaction_history() call
It's currently implicitly flushed on its close, but in that case close
can throw while flusing. Next patch wants close not to throw and that's
possible if flushing the stream in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 6d4ba98796)
2024-07-03 18:30:17 +00:00
Pavel Emelyanov
009f5eb69e api: Coroutinize get_compaction_history inner function
The handler returns a function which is then invoked with output_stream
argument to render the json into. This function is converted into
coroutine. It has yet another inner lambda that's passed into
compaction_manager::get_compaction_history() as consumer lambda. It's
coroutinized too.

The indentation looks weird as preparation for future patching.
Hopefullly it's still possible to understand what's going on.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit acb351f4ee)
2024-07-03 18:30:17 +00:00
Tzach Livyatan
c9cd171f42 Docs: Fix a typo in sstable-corruption.rst
(cherry picked from commit a7115124ce)

Closes scylladb/scylladb#19591
2024-07-03 10:24:44 +02:00
Piotr Dulikowski
8b9e62e107 Merge '[Backport 6.0] cql3/statement/select_statement: do not parallelize single-partition aggregations' from Michał Jadwiszczak
This patch adds a check if aggregation query is doing single-partition read and if so, makes the query to not use forward_service and do not parallelize the request.

Fixes scylladb/scylladb#19349

(cherry picked from commit e9ace7c203)

(cherry picked from commit 8eb5ca8202)

Refs scylladb/scylladb#19350

Closes scylladb/scylladb#19499

* github.com:scylladb/scylladb:
  test/boost/cql_query_test: add test for single-partition aggregation
  cql3/select_statement: do not parallelize single-partition aggregations
2024-07-02 21:03:24 +02:00
Kamil Braun
4e21421ddc Merge '[Backport 6.0] Do not expire local addres in raft address map since the local node cannot disappear' from ScyllaDB
A node may wait in the topology coordinator queue for awhile before been
joined. Since the local address is added as expiring entry to the raft
address map it may expire meanwhile and the bootstrap will fail. The
series makes the entry non expiring.

Fixes  scylladb/scylladb#19523

Needs to be backported to 6.0 since the bug may cause bootstrap to fail.

(cherry picked from commit 5d8f08c0d7)

(cherry picked from commit 3f136cf2eb)

 Refs #19557

Closes scylladb/scylladb#19574

* github.com:scylladb/scylladb:
  test: add test that checks that local address cannot expire between join request placemen and its processing
  storage_service: make node's entry non expiring in raft address map
2024-07-01 16:20:17 +02:00
Gleb Natapov
724ec62e22 test: add test that checks that local address cannot expire between join request placemen and its processing
(cherry picked from commit 3f136cf2eb)
2024-07-01 10:44:31 +00:00
Gleb Natapov
a6c5f8192d storage_service: make node's entry non expiring in raft address map
Local address map entry should never expire in the address map.

(cherry picked from commit 5d8f08c0d7)
2024-07-01 10:44:31 +00:00
Pavel Emelyanov
20b99246fd Merge '[Backport 6.0] Close output stream in task manager's API get_tasks handler' from ScyllaDB
If client stops reading response early, the server-side stream throws but must be closed anyway. Seen in another endpoint and fixed by #19541

(cherry picked from commit 4897d8f145)

(cherry picked from commit 986a04cb11)

(cherry picked from commit 1be8b2fd25)

 Refs #19542

Closes scylladb/scylladb#19562

* github.com:scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Close response stream on error
  api: Flush response output stream before closing
2024-07-01 10:47:30 +03:00
Pavel Emelyanov
8e74ac5140 Merge '[Backport 6.0] Close output_stream in get_snapshot_details() API handler' from ScyllaDB
All streams used by httpd handlers are to be closed by the handler itself, caller doesn't take care of that.

fixes: #19494

(cherry picked from commit d1fd886608)

(cherry picked from commit a0c1552cea)

(cherry picked from commit 1839030e3b)

 Refs #19541

Closes scylladb/scylladb#19563

* github.com:scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Close output_stream on error
  api: Flush response output stream before closing
2024-07-01 10:47:08 +03:00
Pavel Emelyanov
4e17a5a1c2 api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1839030e3b)
2024-06-30 19:20:11 +00:00
Pavel Emelyanov
c5c168a1db api: Close output_stream on error
If the get_snapshot_details() lambda throws, the output stream remains
non-closed which is bad. Close it regardless of what.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a0c1552cea)
2024-06-30 19:20:10 +00:00
Pavel Emelyanov
09272d2478 api: Flush response output stream before closing
Otherwise close() may throw and this is what next patch will want not to
happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1fd886608)
2024-06-30 19:20:10 +00:00
Pavel Emelyanov
1e7f377b0a api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1be8b2fd25)
2024-06-30 19:19:52 +00:00
Pavel Emelyanov
b038177f19 api: Close response stream on error
The handler's lambda is called with && stream object and must close the
stream on its own regardless of what.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 986a04cb11)
2024-06-30 19:19:52 +00:00
Pavel Emelyanov
426bc6a4e1 api: Flush response output stream before closing
The .close() method flushes the stream, but it may throw doing it. Next
patch will want .close() not to throw, for that stream must be flushed
explicitly before closing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4897d8f145)
2024-06-30 19:19:52 +00:00
Piotr Smaron
6a1e0489c6 cql: forbid switching from tablets to vnodes in ALTER KS
This check is already in place, but isn't fully working, i.e.
switching from a vnode KS to a tablets KS is not allowed, but
this check doesn't work in the other direction. To fix the
latter, `ks_prop_defs::get_initial_tablets()` has been changed
to handle 3 states: (1) init_tablets is set, (2) it was skipped,
(3) tablets are disabled. These couldn't fit into std::optional,
so a new local struct to hold these states has been introduced.
Callers of this function have been adjusted to set init_tablets
to an appropriate value according to the circumstances, i.e. if
tablets are globally enabled, but have been skipped in the CQL,
init_tablets is automatically set to 0, but if someone executes
ALTER KS and doesn't provide tablets options, they're inherited
from the old KS.
I tried various approaches and this one resulted in the least
lines of code changed. I also provided testcases to explain how
the code behaves.

Fixes: #18795
(cherry picked from commit 758139c8b2)

Closes scylladb/scylladb#19540
2024-06-28 17:58:35 +03:00
Yaron Kaikov
1577765a20 .github/scripts/label_promoted_commits.py: fix adding labels when PR is closed
`prs = response.json().get("items", [])` will return empty when there are no merged PRs, and this will just skip the all-label replacement process.

This is a regression following the work done in #19442

Adding another part to handle closed PRs (which is the majority of the cases we have in Scylla core)

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit 2eb8344b9a)

Closes scylladb/scylladb#19527
2024-06-27 19:35:18 +03:00
Botond Dénes
c4f1f129c3 Merge '[Backport 6.0] batchlog replay: bypass tombstones generated by past replays' from ScyllaDB
The `system.batchlog` table has a partition for each batch that failed to complete. After finally applying the batch, the partition is deleted. Although the table has gc_grace_second = 0, tombstones can still accumulate in memory, because we don't purge partition tombstones from either the memtable or the cache. This can lead to the cache and memtable of this table to accumulate many thousands of even millions of tombstones, making batchlog replay very slow. We didn't notice this before, because we would only replay all failed batches on unbootstrap, which is rare and a heavy and slow operation on its own right already.
With repair-based tombstone-gc however, we do a full batchlog replay at the beginning of each repair, and now this extra delay is noticeable.
Fix this by making sure batchlog replays don't have to scan through all the tombstones generated by previous replays:
* flush the `system.batchlog` memtable at the end of each batchlog replay, so it is cleared of tombstones
* bypass the cache

Fixes: https://github.com/scylladb/scylladb/issues/19376

Although this is not a regression -- replay was like this since forever -- now that repair calls into batchlog replay, every release which uses repair-based tombstone-gc should get this fix

(cherry picked from commit 4e96e320b4)

(cherry picked from commit 2dd057c96d)

(cherry picked from commit 29f610d861)

(cherry picked from commit 31c0fa07d8)

 Refs #19377

Closes scylladb/scylladb#19502

* github.com:scylladb/scylladb:
  db/batchlog_manager: bypass cache when scanning batchlog table
  db/batchlog_manager: replace open-coded paging with internal one
  db/batchlog_manager: implement cleanup after all batchlog replay
  cql3/query_processor: for_each_cql_result(): move func to the coro frame
2024-06-27 14:46:50 +03:00
Botond Dénes
fa644c6269 Merge '[Backport 6.0] tasks: fix tasks abort' from Aleksandra Martyniuk
Currently if task_manager::task::impl::abort preempts before children are recursively aborted and then the task gets unregistered, we hit use after free since abort uses children vector which is no longer alive.

Modify abort method so that it goes over all tasks in task manager and aborts those with the given parent.

Fixes: https://github.com/scylladb/scylladb/issues/19304.

Requires backport to all versions containing task manager

(cherry picked from commit 3463f495b1)

(cherry picked from commit 50cb797d95)

Refs https://github.com/scylladb/scylladb/pull/19305

Closes scylladb/scylladb#19437

* github.com:scylladb/scylladb:
  test: add test for abort while a task is being unregistered
  tasks: fix tasks abort
2024-06-27 14:45:34 +03:00
Botond Dénes
cb4b4fe678 Merge '[Backport 6.0] test_tablets: add test_tablet_storage_freeing' from ScyllaDB
Before work on tablets was completed, it was noticed that — due to some missing pieces of implementation — Scylla doesn't properly close sstables for migrated-away tablets. Because of this, disk space wasn't being reclaimed properly.

Since the missing pieces of implementation were added, the problem should be gone now. This patch adds a test which was used to reproduce the problem earlier. It's expected to pass now, validating that the issue was fixed.

Should be backported to branch-6.0, because the tested problem was also affecting that branch.

Fixes #16946

(cherry picked from commit 7741491b47)

(cherry picked from commit 823da140dd)

 Refs #18906

Closes scylladb/scylladb#19295

* github.com:scylladb/scylladb:
  test_tablets: add test_tablet_storage_freeing
  test: pylib: add get_sstables_disk_usage()
2024-06-27 14:40:06 +03:00
Kamil Braun
aca08bb1d1 Merge '[Backport 6.0] join_token_ring, gossip topology: recalculate sync nodes in wait_alive' from ScyllaDB
The node booting in gossip topology waits until all NORMAL
nodes are UP. If we removed a different node just before,
the booting node could still see it as NORMAL and wait for
it to be UP, which would time out and fail the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
of the `wait_alive` loop.

Although the issue fixed by this PR caused only test flakiness,
it could also manifest in real clusters. It's best to backport this
PR to 5.4 and 6.0.

Fixes scylladb/scylladb#17526

(cherry picked from commit 017134fd38)

(cherry picked from commit 7735bd539b)

(cherry picked from commit bcc0a352b7)

 Refs #19387

Closes scylladb/scylladb#19419

* github.com:scylladb/scylladb:
  join_token_ring, gossip topology: update obsolete comment
  join_token_ring, gossip topology: fix indendation after previous patch
  join_token_ring, gossip topology: recalculate sync nodes in wait_alive
2024-06-26 12:38:06 +02:00
Yaron Kaikov
9f31426ead .github/workflow: close and replace label when backport promoted
Today after Mergify opened a Backport PR, it will stay open until someone manually close the backport PR , also we can't track using labels which backport was done or not since there is no indication for that except digging into the PR and looking for a comment or a commit ref

The following changes were made in this PR:
* trigger add-label-when-promoted.yaml also when the push was made to `branch-x.y`
* Replace label `backport/x.y` with `backport/x.y-done` in the original PR (this will automatically update the original Issue as well)
* Add a comment on the backport PR and close it

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit 394cba3e4b)

Closes scylladb/scylladb#19496
2024-06-26 12:42:34 +03:00
Botond Dénes
22622a94ca db/batchlog_manager: bypass cache when scanning batchlog table
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.

(cherry picked from commit 31c0fa07d8)
2024-06-26 09:05:14 +00:00
Botond Dénes
35a64856b0 db/batchlog_manager: replace open-coded paging with internal one
query_processor has built-in paging support, no need to open-code paging
in batchlog manager code.

(cherry picked from commit 29f610d861)
2024-06-26 09:05:13 +00:00
Botond Dénes
4e66b3c9ce db/batchlog_manager: implement cleanup after all batchlog replay
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).

(cherry picked from commit 2dd057c96d)
2024-06-26 09:05:13 +00:00
Botond Dénes
5e422ceefb cql3/query_processor: for_each_cql_result(): move func to the coro frame
Said method has a func parameter (called just f), which it receives as
rvalue ref and just uses as a reference. This means that if caller
doesn't keep the func alive, for_each_cql_result() will run into
use-after-free after the first suspention point. This is unexpected for
callers, who don't expect to have to keep something alive, which they
passed in with std::move().
Adjust the signature to take a value instead, value parameters are moved
to the coro frame and survive suspention points.
Adjust internal callers (query_internal()) the same way.

There are no known vulnerable external callers.

(cherry picked from commit 4e96e320b4)
2024-06-26 09:05:13 +00:00
Michał Jadwiszczak
29c6a4cf44 test/boost/cql_query_test: add test for single-partition aggregation
(cherry picked from commit 8eb5ca8202)
2024-06-25 23:56:49 +02:00
Dawid Medrek
7201efc2f2 db/hints: Initialize endpoint managers only for valid hint directories
Before these changes, it could happen that Scylla initialized
endpoint managers for hint directories representing

* host IDs before migrating hinted handoff to using host IDs,
* IP addresses after the migration.

One scenario looked like this:

1. Start Scylla and upgrade the cluster to using host IDs.
2. Create, by hand, a hint directory representing an IP address.
3. Trigger changing the host filter in hinted handoff; it could
   be achieved by, for example, restricting the set of data
   centers Scylla is allowed to save hints for.

When changing the host filter, we browse the hint directories
and create endpoint managers if we can send hints towards
the node corresponding to a given hint directory. We only
accepted hint directories representing IP addresses
and host IDs. However, we didn't check whether the local node
has already been upgraded to host-ID-based hinted handoff
or not. As a result, endpoint managers were created for
both IP addresses and host IDs, no matter whether we were
before or after the migration.

These changes make sure that any time we browse the hint
directories, we take that into account.

Fixes scylladb/scylladb#19172

(cherry picked from commit c9bb0a4da6)

Closes scylladb/scylladb#19426
2024-06-23 19:32:57 +03:00
Kefu Chai
1b2f10a4e7 sstables: use "me" sstable format by default
in 7952200c, we changed the `selected_format` from `mc` to `me`,
but to be backward compatible the cluster starts with "md", so
when the nodes in cluster agree on the "ME_SSTABLE_FORMAT" feature,
the format selector believes that the node is already using "ME",
which is specified by `_selected_format`. even it is actually still
using "md", which is specified by `sstable_manager::_format`, as
changed by 54d49c04. as explained above, it was specified to "md"
in hope to be backward compatible when upgrading from an existign
installation which might be still using "md". but after a second
thought, since we are able to read sstables persisted with older
formats, this concern is not valid.

in other words, 7952200c introduced a regression which changed the
"default" sstable format from `me` to `md`.

to address this, we just change `sstable_manager::_format` to "me",
so that all sstables are created using "me" format.

a test is added accordingly.

Fixes #18995
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 5a0d30f345)

Closes scylladb/scylladb#19422
2024-06-23 19:26:53 +03:00
Jenkins Promoter
1f2bbf52cc Update ScyllaDB version to: 6.0.2 2024-06-23 15:15:46 +03:00
Aleksandra Martyniuk
169dfaf037 test: add test for abort while a task is being unregistered
(cherry picked from commit 50cb797d95)
2024-06-22 15:47:03 +02:00
Botond Dénes
cfac9d8bef Merge '[Backport 6.0] Reduce TWCS off-strategy space overhead' from ScyllaDB
Normally, the space overhead for TWCS is 1/N, where is number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.

Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.

That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.

Fixes #16514.

(cherry picked from commit b8bd4c51c2)

(cherry picked from commit 51c7ee889e)

(cherry picked from commit 0ce8ee03f1)

(cherry picked from commit ace4e5111e)

 Refs #18137

Closes scylladb/scylladb#19404

* github.com:scylladb/scylladb:
  compaction: Reduce twcs off-strategy space overhead to 10% of free space
  compaction: wire storage free space into reshape procedure
  sstables: Allow to get free space from underlying storage
  replica: don't expose compaction_group to reshape task
2024-06-21 20:00:10 +03:00
Anna Stuchlik
aca9d657ca doc: remove the link to Scylladb Google group
The group is no longer active and should be removed from resources.

(cherry picked from commit 027cf3f47d)

Closes scylladb/scylladb#19402
2024-06-21 19:59:35 +03:00
Anna Stuchlik
5fe531fcb2 doc: separate Entrprise- from OSS-only content
This commit adds files that contain Open Source-specific information
and includes these files with the .. scylladb_include_flag:: directive.
The files include a) a link and b) Table of Contents.

The purpose of this update is to enable adding
Open Source/Enterprise-specific information in the Reference section.

(cherry picked from commit 680405b465)

Closes scylladb/scylladb#19395
2024-06-21 19:58:14 +03:00
Botond Dénes
673d49dba3 docs: nodetool status: document keyspace and table arguments
Also fix the example nodetool status invocation.

Fixes: #17840

Closes scylladb/scylladb#18037

(cherry picked from commit 6e3b997e04)

Closes scylladb/scylladb#19394
2024-06-21 19:57:37 +03:00
Botond Dénes
fe9435924a Merge '[Backport 6.0] schema: Make "describe" use extensions to string' from ScyllaDB
Fixes #19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

Note: required to make enterprise CI happy again.

(cherry picked from commit d27620e146)

(cherry picked from commit 73abc56d79)

 Refs #19337

Closes scylladb/scylladb#19359

* github.com:scylladb/scylladb:
  schema: Make "describe" use extensions to string
  schema_extensions: Add an option to string method
2024-06-21 19:56:02 +03:00
Nadav Har'El
0715038dbe test: unflake test test_alternator_ttl_scheduling_group
This test in topology_experimental_raft/test_alternator.py wants to
check that during Alternator TTL's expiration scans, ALL of the CPU was
used in the "streaming" scheduling group and not in the "statement"
scheduling group. But to allow for some fluke requests (e.g., from the
driver), the test actually allows work in the statement group to be
up to 1% of the work.

Unfortunately, in one test run - a very slow debug+aarch64 run - we
saw the work on the statement group reach 1.4%, failing the test.
I don't know exactly where this work comes from, perhaps the driver,
but before this bug was fixed we saw more than 58% of the work in the
wrong scheduling group, so neither 1% or 1.4% is a sign that the bug
came back. In fact, let's just change the threshold in the test to 10%,
which is also much lower than the pre-fix value of 58%, so is still a
valid regression test.

Fixes #19307

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9fc70a28ca)

Closes scylladb/scylladb#19333
2024-06-21 19:55:09 +03:00
Patryk Jędrzejczak
e129c4ad43 join_token_ring, gossip topology: update obsolete comment
The code mentioned in the comment has already been added. We change
the comment to prevent confusion.

(cherry picked from commit bcc0a352b7)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
37bb6e0a43 join_token_ring, gossip topology: fix indendation after previous patch
(cherry picked from commit 7735bd539b)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
e5e8b970ed join_token_ring, gossip topology: recalculate sync nodes in wait_alive
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
of the `wait_alive` loop.

(cherry picked from commit 017134fd38)
2024-06-21 12:05:42 +00:00
Michał Jadwiszczak
b275ee9a1c cql3/select_statement: do not parallelize single-partition aggregations
Currently reads with WHERE clause which limits them to be
single-partition reads, are unnecessarily parallelized.

This commit checks this condition and the query doesn't use
forward_service in single-partition reads.

(cherry picked from commit e9ace7c203)
2024-06-21 09:31:39 +00:00
Raphael S. Carvalho
3d9aa9d49e compaction: Reduce twcs off-strategy space overhead to 10% of free space
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.

To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.

Fixes #16514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ace4e5111e)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
ef72075920 compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
37f1af2646 sstables: Allow to get free space from underlying storage
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
56f551f740 replica: don't expose compaction_group to reshape task
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8bd4c51c2)
2024-06-20 20:41:41 +00:00
Calle Wilund
fd59176a73 main/minio_server.py: Respect any preexisting AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY vars
Fixes scylladb/scylla-pkg#3845

Don't overwrite (or rather change) AWS credentials variables if already set in
enclosing environment. Ensures EAR tests for AWS KMS can run properly in CI.

v2:
* Allow environment variables in reading obj storage config - allows CI to
  use real credentials in env without risking putting them info less seure
  files
* Don't write credentials info from miniserver into config, instead use said
  environment vars to propagate creds.

v3:
* Fix python launch scripts to not clear environment, thus retaining above aws envs.

(cherry picked from commit 5056a98289)

Closes scylladb/scylladb#19330
2024-06-20 18:08:51 +03:00
Aleksandra Martyniuk
634c0d44ef tasks: fix tasks abort
Currently if task_manager::task::impl::abort preempts before children
are recursively aborted and then the task gets unregistered, we hit
use after free since abort uses children vector which is no
longer alive.

Modify abort method so that it goes over all tasks in task manager
and aborts those with the given parent.

Fixes: #19304.
(cherry picked from commit 3463f495b1)
2024-06-20 14:47:14 +00:00
Botond Dénes
3baa68f8af Merge '[Backport 6.0] doc: add 6.x.y to 6.x.z and remove 5.x.y to 5.x.z upgrade guide' from ScyllaDB
This PR removes the 5.x.y to 5.x.z upgrade guide and adds the 6.x.y to 6.x.z upgrade guide.

The previous maintenance upgrade guides, such as from 5.x.y to 5.x.z, consisted of several documents - separate for each platform.
The new 6.x.y to 6.x.z upgrade guide is one document - there are tabs to include platform-specific information (we've already done it for other upgrade guides as one generic document is more convenient to use and maintain).

I did not modify the procedures. At some point, they have been reviewed for previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/19322

-  This PR must be backported to branch-6.0, as it adds 6.x specific content.

(cherry picked from commit ead201496d)

(cherry picked from commit ea35982764)

 Refs #19340

Closes scylladb/scylladb#19360

* github.com:scylladb/scylladb:
  doc: remove the 5.x.y to 5.x.z upgrade guide
  doc: add the 6.x.y to 6.x.z upgrade guide-6
2024-06-19 14:36:54 +03:00
Gleb Natapov
0e49180cef topology coordinator: add more trace level logging for debugging
Add more logging that provide more visibility into what happens during
topology loading.

Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>

(cherry picked from commit fb764720d3)
2024-06-18 16:38:51 +02:00
Anna Stuchlik
a97a074813 doc: remove the 5.x.y to 5.x.z upgrade guide
This commit removes the upgrade guide from 5.x.y to 5.x.z.
It is reduntant in version 6.x.

(cherry picked from commit ea35982764)
2024-06-18 14:13:57 +00:00
Anna Stuchlik
e869eae5fa doc: add the 6.x.y to 6.x.z upgrade guide-6
This commit adds the upgrade guide from 6.x.y to 6.x.z.

(cherry picked from commit ead201496d)
2024-06-18 14:13:57 +00:00
Calle Wilund
dd4f483668 schema: Make "describe" use extensions to string
Fixes #19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

(cherry picked from commit 73abc56d79)
2024-06-18 14:13:51 +00:00
Calle Wilund
d18be9a7dc schema_extensions: Add an option to string method
Allow an extension to describe itself as the CQL property
string that created it (and is serialized to schema tables)

Only paxos extension requires override.

(cherry picked from commit d27620e146)
2024-06-18 14:13:51 +00:00
Botond Dénes
6682b50868 Merge '[Backport 6.0] doc: document keyspace and table for nodetool ring' from ScyllaDB
these two arguments are critical when tablets are enabled.

Fixes https://github.com/scylladb/scylladb/issues/19296

---

6.0 is the first release with tablets support. and `nodetool ring` is an important tool to understand the data distribution. so we need to backport this document change to 6.0

(cherry picked from commit aef1718833)

(cherry picked from commit ea3b8c5e4f)

 Refs #19297

Closes scylladb/scylladb#19309

* github.com:scylladb/scylladb:
  doc: document `keyspace` and `table` for `nodetool ring`
  doc: replace tab with space
2024-06-17 09:33:34 +03:00
Wojciech Mitros
d70cf46af0 mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The seconds one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.

This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severly impact other work.

Fixes: scylladb/scylladb#19232
(cherry picked from commit d31437b589)

Closes scylladb/scylladb#19302
2024-06-17 09:32:29 +03:00
Botond Dénes
869f2637b8 Merge '[Backport 6.0] Fix usage of utils/chunked_vector::reserve_partial' from ScyllaDB
utils/chunked_vector::reserve_partial: fix usage in callers

The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.

Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.

This PR updates the method comment and all the callers to use the
right way.

Fixes #19254

(cherry picked from commit 64768b58e5)

(cherry picked from commit 29f036a777)

(cherry picked from commit 0a22759c2a)

(cherry picked from commit d4f8b91bd6)

(cherry picked from commit 310c5da4bb)

(cherry picked from commit 83190fa075)

(cherry picked from commit c49f6391ab)

 Refs #19279

Closes scylladb/scylladb#19310

* github.com:scylladb/scylladb:
  utils/large_bitset: remove unused includes identified by clangd
  utils/large_bitset: use thread::maybe_yield()
  test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
  utils/lsa/chunked_managed_vector: fix reserve_partial()
  utils/chunked_vector: return void from reserve_partial and make_room
  test/boost/chunked_vector_test: fix testcase tests_reserve_partial
  utils/chunked_vector::reserve_partial: fix usage in callers
2024-06-17 09:31:28 +03:00
Israel Fruchter
d79f9156ed Update tools/cqlsh submodule v6.0.20
* tools/cqlsh c8158555...0d58e5ce (6):
  > cqlsh.py: fix server side describe after login command
  > cqlsh: try server-side DESCRIBE, then client-side
  > Refactor tests to accept both client and server side describe
  > github actions: support testing with enterprise release
  > Add the tab-completion support of SERVICE_LEVEL statements
  > reloc/build_reloc.sh: don't use `--no-build-isolation`

Closes: scylladb/scylladb#18989
(cherry picked from commit 1fd600999b)

Closes scylladb/scylladb#19132
2024-06-17 09:05:46 +03:00
Lakshmi Narayanan Sreethar
2fa4cb69b6 utils/large_bitset: remove unused includes identified by clangd
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit c49f6391ab)
2024-06-14 15:48:57 +00:00
Lakshmi Narayanan Sreethar
87397f43f6 utils/large_bitset: use thread::maybe_yield()
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 83190fa075)
2024-06-14 15:48:57 +00:00
Lakshmi Narayanan Sreethar
e64e659ef1 test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
Update the maximum size tested by the testcase. The test always created
only one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (12.8 KB). So, use twice the
max_chunk_capacity as the test size distribution upper limit to verify
that partial_reserve can reserve multiple chunks.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 310c5da4bb)
2024-06-14 15:48:57 +00:00
Kefu Chai
fb9a1b4e38 doc: document keyspace and table for nodetool ring
these two arguments are critical when tablets are enabled.

Fixes #19296
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ea3b8c5e4f)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
397b04b2a4 utils/lsa/chunked_managed_vector: fix reserve_partial()
Fix the method comment and return types of chunked_managed_vector's
reserve_partial() similar to chunked_vector's reserve_partial() as it
has the same issues mentioned in #19254. Also update the usage in the
chunked_managed_vector_test.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d4f8b91bd6)
2024-06-14 15:48:56 +00:00
Kefu Chai
8f3d693c8a doc: replace tab with space
more consistent this way, also easier to format in a regular editor
without additional setup.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit aef1718833)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
8dc662ebde utils/chunked_vector: return void from reserve_partial and make_room
Since reserve_partial does not depend on the number of remaining
capacity to be reserved, there is no need to return anything from it and
the make_room method.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0a22759c2a)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
4e68599b17 test/boost/chunked_vector_test: fix testcase tests_reserve_partial
Fix the usage of reserve_partial in the testcase. Also update the
maximum chunk size used by the testcase. The test always created only
one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (128 KB). So, use smaller chunk size,
512 bytes, to verify that partial_reserve can reserve multiple chunks.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 29f036a777)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
7072f7e706 utils/chunked_vector::reserve_partial: fix usage in callers
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.

Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.

This commit updates the method comment and all the callers to use the
right way.

Fixes #19254

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 64768b58e5)
2024-06-14 15:48:56 +00:00
Kefu Chai
57980b77d3 test: test_topology_ops: adapt to tablets
in e7d4e080, we reenabled the background writes in this test, but
when running with tablets enabled, background writes are still
disabled because of #17025, which was fixed last week. so we can
enable background writes with tablets.

in this change,

* background writes are enabled with tablets.
* increase the number of nodes by 1 so that we have enough nodes
  to fulfill the needs of tablets, which enforces that the number
  of replicas should always satisfy RF.
* pass rf to `start_writes()` explicitly, so we have less
  magic numbers in the test, and make the data dependencies
  more obvious.

Fixes #17589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 77f0259a63)

Closes scylladb/scylladb#19184
2024-06-14 15:54:36 +03:00
Botond Dénes
5139e74058 Merge '[Backport 6.0] Improve handling of outdated --experimental-features' from ScyllaDB
Some time ago it turned out that if unrecognized feature name is met in scylla.yaml, the whole experimental features list is ignored, but scylla continues to boot. There's UNUSED feature which is the proper way to deprecate a feature, and this PR improves its handling in several ways.

1. The recently removed "tablets" feature is partially brought back, but marked as UNUSED
2. Any UNUSED features met while parsing are printed into logs
3. The enum_option<> helper is enlightened along the way

refs: #18968

(cherry picked from commit f56cdb1cac)

(cherry picked from commit 0c0a7d9b9a)

(cherry picked from commit b85a02a3fe)

(cherry picked from commit b2520b8185)

 Refs #19230

Closes scylladb/scylladb#19266

* github.com:scylladb/scylladb:
  config: Mark tablets feature as unused
  main: Warn unused features
  enum_option: Carry optional key on board
  enum_option: Remove on-board _map member
2024-06-14 15:43:17 +03:00
Michał Chojnowski
ddcaefefdc test_tablets: add test_tablet_storage_freeing
Tests that tablet storage is freed after it is migrated away.

Fixes #16946

(cherry picked from commit 823da140dd)
2024-06-14 10:19:32 +00:00
Michał Chojnowski
f466dcfa5f test: pylib: add get_sstables_disk_usage()
Adds an util for measuring the disk usage of the given table on the given
node.
Will be used in a follow-up patch for testing that sstables are freed
properly.

(cherry picked from commit 7741491b47)
2024-06-14 10:19:32 +00:00
Benny Halevy
6122f9454d storage_service: join_token_ring: reject replace on different dc or rack
Do not allow replacing a node on one dc/rack
with a node on a different dc/rack as this violates
the assumption of replace node operation that
all token ranges previously owned by the dead
node would be rebuilt on the new node.

Fixes #16858
Refs scylladb/scylla-enterprise#3518

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 34dfa4d3a3)

Closes scylladb/scylladb#19281
2024-06-14 07:43:58 +03:00
Botond Dénes
b18d9e5d0d Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series.

Fixes: https://github.com/scylladb/scylladb/issues/18674

- [x] This patch has to be backported because it fixes broken functionality

(cherry picked from commit dbccb61636)

(cherry picked from commit 4590026b38)

(cherry picked from commit feea609e37)

(cherry picked from commit 0c61b1822c)

(cherry picked from commit 8ef4fbdb87)

 Refs #18705

Closes scylladb/scylladb#19240

* github.com:scylladb/scylladb:
  test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
  test/pylib: rest_client: add get_injection()
  api/error_injection: add getter for error_injection
  utils/error_injection: add set_parameter()
  replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
2024-06-13 12:45:23 +03:00
Kamil Braun
cb6a97d0dc raft: fsm: add details to on_internal_error_noexcept message
If we receive a message in the same term but from a different leader
than we expect, we print:
```
Got append request/install snapshot/read_quorum from an unexpected leader
```
For some reason the message did not include the details (who the leader
was and who the sender was) which requires almost zero effort and might
be useful for debugging. So let's include them.

Ref: scylladb/scylla-enterprise#4276
(cherry picked from commit 99a0599e1e)

Closes scylladb/scylladb#19265
2024-06-13 11:25:11 +02:00
Wojciech Mitros
813fef44d3 exceptions: make view update timeouts inherit from timed_out_error
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to it's type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.

Fixes #19261
(cherry picked from commit 4aa7ada771)

Closes scylladb/scylladb#19262
2024-06-13 12:01:12 +03:00
Botond Dénes
1c67c6cf78 Merge '[Backport 6.0] test: memtable_test: increase unspooled_dirty_soft_limit ' from ScyllaDB
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes https://github.com/scylladb/scylladb/issues/19034

---

the issue applies to both 5.4 and 6.0, and this issue hurts the CI stability, hence we should backport it.

(cherry picked from commit 2df4e9cfc2)

(cherry picked from commit 223fba3243)

 Refs #19252

Closes scylladb/scylladb#19258

* github.com:scylladb/scylladb:
  test: memtable_test: increase unspooled_dirty_soft_limit
  test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
2024-06-13 07:26:43 +03:00
Pavel Emelyanov
5811df4d4b config: Mark tablets feature as unused
This features used to be there for a while, but then it was removed by
83d491af02. This patch partially takes it
back, but maps to UNUSED, so that if met in config, it's warned, but
other features are parsed as well.

refs: #18968

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b2520b8185)
2024-06-12 18:35:32 +00:00
Pavel Emelyanov
cb9d6e080c main: Warn unused features
When seeing an UNUSED feature -- print it into log. This is where the
enum_option::key is in use. The thing is that experimental features map
different unused feature names into the single UNUSED feature enum
value, so once the feature is parsed its configured name only persists
in the option's key member (saved by previous patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b85a02a3fe)
2024-06-12 18:35:32 +00:00
Pavel Emelyanov
86068790ec enum_option: Carry optional key on board
It facilitates option formatting, but the main purpose is to be able to
find out the exact keys, not values, later (see next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 0c0a7d9b9a)
2024-06-12 18:35:31 +00:00
Pavel Emelyanov
3501ede024 enum_option: Remove on-board _map member
The map in question is immutable and can obtained from the Mapper type
at any time, there's no need in keeping its copy on each enum_option

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit f56cdb1cac)
2024-06-12 18:35:31 +00:00
Anna Stuchlik
bc89aac9d0 doc: reorganize ToC of the Reference section
This commit adds a proper ToC to the Reference section to improve
how it renders.

(cherry picked from commit 63084c6798)

Closes scylladb/scylladb#19257
2024-06-12 19:12:53 +02:00
Kefu Chai
b39c0a1d15 test: memtable_test: increase unspooled_dirty_soft_limit
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
2024-06-12 15:44:11 +00:00
Kefu Chai
548fd01bd4 test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.

after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.

Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
2024-06-12 15:44:11 +00:00
Pavel Emelyanov
2306c3b522 test: Reduce failure detector timeout for failed tablets migration test
Most of the time this test spends waiting for a node to die. Helps 3x times

Was
  real	9m21,950s
  user	1m11,439s
  sys	1m26,022s

Now
  real	3m37,780s
  user	0m58,439s
  sys	1m13,698s

refs: #17764

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a4e8f9340a)

Closes scylladb/scylladb#19233
2024-06-12 10:02:45 +03:00
Tomasz Grabiec
6d90ff84d9 Merge '[Backport 6.0] tablets: Filter-out left nodes in get_natural_endpoints()' from ScyllaDB
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".

Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:

  storage_proxy - No mapping for :: in the passed effective replication map

It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.

Users which need replica list stability should use the host_id-based API.

Fixes #18843

(cherry picked from commit 3e1ba4c859)

(cherry picked from commit 0d596a425c)

 Refs #18955

Closes scylladb/scylladb#19143

* github.com:scylladb/scylladb:
  tablets: Filter-out left nodes in get_natural_endpoints()
  test: pylib: Extract start_writes() load generator utility
2024-06-12 01:31:38 +02:00
Botond Dénes
0d13c51dd4 test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
Avoid this the live-update feature of this config item breaking
silently.

(cherry picked from commit 8ef4fbdb87)
2024-06-11 17:32:37 +00:00
Botond Dénes
d4563e2b28 test/pylib: rest_client: add get_injection()
The /v2/error_injection/{injection} endpoint now has a GET method too,
expose this.

(cherry picked from commit 0c61b1822c)
2024-06-11 17:32:37 +00:00
Botond Dénes
bb18a8152e api/error_injection: add getter for error_injection
Allow external code to obtain information about an error injection
point, including whether it is enabled, and importantly, what its
parameters are. Together with the `set_parameter()` added in the
previous patch, this allows tests to read out the values of internal
parameters, via a set_parameter() injection point.

(cherry picked from commit feea609e37)
2024-06-11 17:32:37 +00:00
Botond Dénes
1947290c74 utils/error_injection: add set_parameter()
Allow injection points to write values into the parameter map, which
external code can then examine. This allows exfiltrating the values if
internal variables, to be examined by tests, without exposing these
variables via an "official" path.

(cherry picked from commit 4590026b38)
2024-06-11 17:32:36 +00:00
Botond Dénes
d121fc1264 replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.

(cherry picked from commit dbccb61636)
2024-06-11 17:32:36 +00:00
Michał Chojnowski
80ac0da11c storage_proxy: avoid infinite growth of _throttled_writes
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to into _throttled_writes.

Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.

The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.

This patch is intended to be a good-enough-in-practice fix for the problem.

Fixes #17476
Fixes #1834

(cherry picked from commit fee48f67ef)

Closes scylladb/scylladb#19180
2024-06-11 18:33:38 +03:00
Raphael S. Carvalho
d4c3a43b34 replica: Refresh mutation source when allocating tablet replicas
Consider the following:

1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so read on step 5 is not being able to find sstable set for tablet migrating in

Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"

which means loss of write on pending replica.

The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.

Fixes #19052.
Fixes #19033.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 7b41630299)

Closes scylladb/scylladb#19229
2024-06-11 18:12:43 +03:00
Kefu Chai
31ba5561e7 build: remove coverage compiling options from the cxx_flags
in 44e85c7d, we remove coverage compiling options from the cflags
when building abseil. but in 535f2b21, these options were brought
back as parts of cxx_flags.

so we need to remove them again from cxx_flags.
Fixes #19219
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit d05db52d11)

Closes scylladb/scylladb#19237
2024-06-11 18:11:35 +03:00
Tomasz Grabiec
7479167af2 tablets: Filter-out left nodes in get_natural_endpoints()
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".

Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:

  storage_proxy - No mapping for :: in the passed effective replication map

It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.

Users which need replica list stability should use the host_id-based API.

Fixes #18843

(cherry picked from commit 0d596a425c)
2024-06-11 12:18:17 +02:00
Tomasz Grabiec
e35ab96f8b test: pylib: Extract start_writes() load generator utility
(cherry picked from commit 3e1ba4c859)
2024-06-11 12:18:17 +02:00
Guilherme Nogueira
1ace370ecd Remove comma that breaks CQL DML on tablets.rst
The current sample reads:

```cql
CREATE KEYSPACE my_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'replication_factor': 3,
} AND tablets = {
    'enabled': false
};
```

The additional comma after `'replication_factor': 3` breaks the query execution.

(cherry picked from commit cf157e4423)

Closes scylladb/scylladb#19194
2024-06-10 20:24:22 +03:00
Kefu Chai
3e7de910ab docs: correct the link pointing to Scylla U
before this change it points to
https://university.scylladb.com/courses/scylla-operations/lessons/change-data-capture-cdc/
which then redirects the browser to
https://university.scylladb.com/courses/scylla-operations/,
but it should have point to
https://university.scylladb.com/courses/data-modeling/lessons/change-data-capture-cdc/

in this change, the hyperlink is corrected.

Fixes #19163
Refs 6e97b83b60
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit b5dce7e3d0)

Closes scylladb/scylladb#19198
2024-06-10 20:23:08 +03:00
Kefu Chai
9cf0d618d0 build: populate cxxflags to abseil
before this change, when building abseil, we don't pass cxxflags
to compiler, and abseil libraries are build with the default
optimization level. in the case of clang, its default optimization
level is `-O0`, it compiles the fastest, but the performance of
the emitted code is not optimized for runtime performance. but we
expect good performance for the release build. a typical command line
for building abseil looks like
```
clang++  -I/home/kefu/dev/scylladb/master/abseil -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -MF absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o.d -o absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/base/internal/scoped_set_env.cc
```

so, in this change, we populate cxxflags to abseil, so that the
per-mode `-O` option can be populated when building abseil.

after this change, the command line building abseil in release mode
looks like

```
clang++  -I/home/kefu/dev/scylladb/master/abseil -ffunction-sections -fdata-sections  -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -MF absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o.d -o absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/flags/internal/commandlineflag.cc
```

Refs 0b0e661a85
Fixes #19161
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 535f2b2134)

Closes scylladb/scylladb#19200
2024-06-10 20:22:00 +03:00
Nadav Har'El
4810937ddf test/alternator: fix flaky test test_item_latency
The Alternator test test_metrics.py::test_item_latency confirms that
for several operation types (PutItem, GetItem, DeleteItem, UpdateItem)
we did not forget to measure their latencies.

The test checked that a latency was updated by checking that two metrics
increases:
    scylla_alternator_op_latency_count
    scylla_alternator_op_latency_sum

However, it turns out that the "sum" is only an approximate sum of all
latencies, and when the total sum grows large it sometimes does *not*
increase when a short latency is added to the statistics. When this
happens, this test fails on the assertion that the "sum" increases after
an operation. We saw this happening sometimes in CI runs.

The simple fix is to stop checking _sum at all, and only verify that
the _count increases - this is really an integer counter that
unconditionally increases when a latency is added to the histogram.

Don't worry that the strength of this test is reduced - this test was
never meant to check the accuracy or correctness of the histograms -
we should have different (and better) tests for that, unrelated to
Alternator. The purpose of *this* test is only to verify that for some
specific operation like PutItem, Alternator didn't forget to measure its
latency and update the histogram. We want to avoid a bug like we had
in counters in the past (#9406).

Fixes #18847.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 13cf6c543d)

Closes scylladb/scylladb#19193
2024-06-10 20:20:54 +03:00
Tomasz Grabiec
a3e4dc7b6c test: tablets: Fix flakiness of test_removenode_with_ignored_node due to read timeout
The check query may be executed on a node which doesn't yet see that
the downed server is down, as it is not shut down gracefully. The
query coordinator can choose the down node as a CL=1 replica for read
and time out.

To fix, wait for all nodes to notice the node is down before executing
the checking query.

Fixes #17938

(cherry picked from commit c8f71f4825)

Closes scylladb/scylladb#19199
2024-06-10 20:12:56 +03:00
Botond Dénes
7a6ff12ace Merge '[Backport 6.0] alternator: keep TTL work in the maintenance scheduling group' from ScyllaDB
Alternator has a custom TTL implementation. This is based on a loop, which scans existing rows in the table, then decides whether each row have reached its end-of-life and deletes it if it did. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads, negatively affecting its latencies. This was found to be causes by the reads and writes done on behalf of the alternator TTL, which looses its maintenance scheduling group when these have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group, when statement verbs like read or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and system (default scheduling group), as we used to have only user-initiated operations and sytsem (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).
This series adds a streaming tenant to the messaging service configuration and it adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of alternator TTL are running in the statement group, disturbing user-workloads, this PR has to be backported to fix this.

(cherry picked from commit 5d3f7c13f9)

(cherry picked from commit 1fe8f22d89)

 Refs #18729

Closes scylladb/scylladb#19196

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config
2024-06-10 19:58:38 +03:00
Anna Stuchlik
e38d675cb9 doc: mark tablets as GA in the CREATE KEYSPACE section
This commit removes the information that tablets are an experimental feature
from the CREATE KEYSPACE section.

In addition, it removes the notes and cautions that are redundant when
a feature is GA, especially the information and warnings about the future
plans.

Fixes https://github.com/scylladb/scylladb/issues/18670

Closes scylladb/scylladb#19063

(cherry picked from commit 55ed18db07)
2024-06-10 18:53:47 +03:00
Gleb Natapov
45ff4d2c41 group0, topology coordinator: run group0 and the topology coordinator in gossiper scheduling group
Currently they both run in streaming group and it may become busy during
repair/mv building and affect group0 functionality. Move it to the
gossiper group where it should have more time to run.

Fixes #18863

(cherry picked from commit a74fbab99a)

Closes scylladb/scylladb#19175
2024-06-10 10:34:29 +02:00
Nadav Har'El
0662e80917 alternator, scheduler: test reproducing RPC scheduling group bug
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.

Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.

The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.

The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1fe8f22d89)
2024-06-10 07:42:23 +00:00
Botond Dénes
5b546ad4b1 main: add maintenance tenant to messaging_service's scheduling config
Currently only the user tenant (statement scheduling group) and system
(default scheduling group) tenants exist, as we used to have only
user-initiated operations and sytem (internal) ones. Now there is need
to distinguish between two kinds of system operation: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).

(cherry picked from commit 5d3f7c13f9)
2024-06-10 07:42:22 +00:00
Piotr Dulikowski
e04378fdf0 Merge ' [Backport 6.0] db/hints: Use host ID to IP mappings to choose the ep manager to drain when node is leaving' from Dawid Mędrek
In [d0f5873](d0f58736c8), we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.

This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.

We also introduce error injections to be able to test that it indeed happens.

Fixes scylladb/scylladb#18761

(cherry picked from commit [745a9c6](745a9c6ab8))

(cherry picked from commit [e855794](e855794327))

Refs scylladb/scylladb#18764

Closes scylladb/scylladb#19114

* github.com:scylladb/scylladb:
  db/hints: Introduce an error injection to test draining
  db/hints: Ensure that draining happens
2024-06-10 09:11:07 +02:00
Tomasz Grabiec
f8243cbf19 Merge '[Backport 6.0] Serialize repair with tablet migration' from ScyllaDB
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.

Fixes #17658.
Fixes #18561.

(cherry picked from commit 6c64cf33df)

(cherry picked from commit 1513d6f0b0)

(cherry picked from commit 476c076a21)

(cherry picked from commit c45ce41330)

(cherry picked from commit e97acf4e30)

(cherry picked from commit 98323be296)

(cherry picked from commit 5ca54a6e88)

 Refs #18641

Closes scylladb/scylladb#19144

* github.com:scylladb/scylladb:
  test: pylib: Do not block async reactor while removing directories
  repair: Exclude tablet migrations with tablet repair
  repair_service: Propagate topology_state_machine to repair_service
  main, storage_service: Move topology_state_machine outside storage_service
  storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
  tablet_scheduler: Make disabling of balancing interrupt shuffle mode
  tablet_scheduler: Log whether balancing is considered as enabled
2024-06-09 00:20:44 +02:00
Tomasz Grabiec
27f01bf4e3 test: pylib: Do not block async reactor while removing directories
This fixes a problem where suite cleanup schedules lots of uninstall()
tasks for servers started in the suite, which schedules lots of tasks,
which synchronously call rmtree(). These take over a minute to finish,
which blocks other tasks for tests which are still executing.

In particular, this was observed to case
ManagerClient.server_stop_gracefully() to time-out. It has a timeout
of 60 seconds. The server was stopped quickly, but the RESTful API
response was not processed in time and the call timed out when it got
the async reactor.

(cherry picked from commit 5ca54a6e88)
2024-06-08 16:31:18 +02:00
Tomasz Grabiec
ded9aca6ee repair: Exclude tablet migrations with tablet repair
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requets start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

Fixes #17658.
Fixes #18561.

(cherry picked from commit 98323be296)
2024-06-08 16:31:18 +02:00
Tomasz Grabiec
ccd441a4de repair_service: Propagate topology_state_machine to repair_service
(cherry picked from commit e97acf4e30)
2024-06-08 16:31:15 +02:00
Jenkins Promoter
79e4e411b3 Update ScyllaDB version to: 6.0.1 2024-06-07 09:31:05 +03:00
Kefu Chai
f8ba94a960 doc: document "enable_tablets" option
it sets the cluster feature of tablets, and is a prerequisite for
using tablets.

Refs #18670
Fixes #19157
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit bac7e1e942)

Closes scylladb/scylladb#19158
2024-06-07 07:03:30 +03:00
Tzach Livyatan
dfe89157c6 Docs: fix start command in Update replace-dead-node.rst
Fix #18920

(cherry picked from commit c30f81c389)

Closes scylladb/scylladb#19142
2024-06-07 07:02:02 +03:00
Kefu Chai
50d8fa6b77 topology_coordinator: handle/wait futures when stopping topology_coordinator
before this change, unlike other services in scylla,
topology_coordinator is not properly stopped when it is aborted,
because the scylla instance is no longer a leader or is being shut down.
its `run()` method just stops the grand loop and bails out before
topology_coordinator is destroyed. but we are tracking the migration
state of tablets using a bunch of futures, which might not be
handled yet, and some of them could carry failures. in that case,
when the `future` instances with failure state get destroyed,
seastar calls `report_failed_future`. and seastar considers this
practice a source a bug -- as one just fails to handle an error.
that's why we have following error:

```
WARN  2024-05-19 23:00:42,895 [shard 0:strm] seastar - Exceptional future ignored: seastar::rpc::unknown_verb_error (unknown verb), backtrace: /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c14e /home/bhalevy/.ccm/scylla-repository/local_tarball/libre
loc/libseastar.so+0x56c770 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56ca58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x38c6ad 0x29cdd07 0x29b376b 0x29a5b65 0x108105a /home/bhalevy/.ccm/scylla-repository/local_tarbal
l/libreloc/libseastar.so+0x3ff1df /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x400367 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff838 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36de58
 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d092 0x1017cba 0x1055080 0x1016ba7 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a 0x1015524
```
and the backtrace looks like:
```
seastar::current_backtrace_tasklocal() at ??:?
seastar::current_tasktrace() at ??:?
seastar::current_backtrace() at ??:?
seastar::report_failed_future(seastar::future_state_base::any&&) at ??:?
service::topology_coordinator::tablet_migration_state::~tablet_migration_state() at topology_coordinator.cc:?
service::topology_coordinator::~topology_coordinator() at topology_coordinator.cc:?
service::run_topology_coordinator(seastar::sharded<db::system_distributed_keyspace>&, gms::gossiper&, netw::messaging_service&, locator::shared_token_metadata&, db::system_keyspace&, replica::database&, service::raft_group0&, service::topology_state_machine&, seastar::abort_source&, raft::server&, seastar::noncopyable_function<seastar::future<service::raft_topology_cmd_result> (utils::tagged_tagged_integer<raft::internal::non_final, raft::term_tag, unsigned long>, unsigned long, service::raft_topology_cmd const&)>, service::tablet_allocator&, std::chrono::duration<long, std::ratio<1l, 1000l> >, service::endpoint_lifecycle_notifier&) [clone .resume] at topology_coordinator.cc:?
seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at main.cc:?
seastar::reactor::run_some_tasks() at ??:?
seastar::reactor::do_run() at ??:?
seastar::reactor::run() at ??:?
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ??:?
```

and even worse, these futures are indirectly owned by `topology_coordinator`.
so there are chances that they could be used even after `topology_coordinator`
is destroyed. this is a use-after-free issue. because the
`run_topology_coordinator` fiber exits when the scylla instance retires
from the leader's role, this use-after-free could be fatal to a
running instance due to undefined behavior of use after free.

so, in this change, we handle the futures in `_tablets`, and note
down the failures carried by them if any.

Fixes #18745
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 4a36918989)

Closes scylladb/scylladb#19139
2024-06-07 07:00:25 +03:00
Jenkins Promoter
a77615adf3 Update ScyllaDB version to: 6.0.0 2024-06-06 16:03:39 +03:00
Tomasz Grabiec
e518bb68b2 main, storage_service: Move topology_state_machine outside storage_service
It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service

(cherry picked from commit c45ce41330)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
af2caeb2de storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
Will be used later in a place which doesn't have access to storage_service
but has to toplogy_state_machine.

It's not necessary to start group0 operation around polling because
the busy() state can be checked atomically and if it's false it means
the topology is no longer busy.

(cherry picked from commit 476c076a21)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
d5ebfea1ff tablet_scheduler: Make disabling of balancing interrupt shuffle mode
Tests will rely on that, they will run in shuffle mode, and disable
balancing around section which otherwise would be infinitely blocked
by ongoing shuffling (like repair).

(cherry picked from commit 1513d6f0b0)
2024-06-06 13:01:18 +00:00
Tomasz Grabiec
3fec9e1344 tablet_scheduler: Log whether balancing is considered as enabled
(cherry picked from commit 6c64cf33df)
2024-06-06 13:01:18 +00:00
Kamil Braun
5d3dde50f4 Merge '[Backport 6.0] Fail bootstrap if ip mapping is missing during double write stage' from ScyllaDB
If a node restart just before it stores bootstrapping node's IP it will
not have ID to IP mapping for bootstrapping node which may cause failure
on a write path. Detect this and fail bootstrapping if it happens.

(cherry picked from commit 1faef47952)

(cherry picked from commit 27445f5291)

(cherry picked from commit 6853b02c00)

(cherry picked from commit f91db0c1e4)

 Refs #18927

Closes scylladb/scylladb#19118

* github.com:scylladb/scylladb:
  raft topology: fix indentation after previous commit
  raft topology: do not add bootstrapping node without IP as pending
  test: add test of bootstrap where the coordinator crashes just before storing IP mapping
  schema_tables: remove unused code
2024-06-06 11:35:13 +02:00
Tomasz Grabiec
b7fe4412d0 test: pylib: Fetch all pages by default in run_async
Fetching only the first page is not the intuitive behavior expected by users.

This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.

Fixes #18774

(cherry picked from commit 2c3f7c996f)

Closes scylladb/scylladb#19130
2024-06-06 08:22:45 +03:00
Benny Halevy
fd7284ec06 gms: endpoint_state: get_dc_rack: do not assign to uninitialized memory
Assigning to a member of an uninitialized optional
does not initialize the object before assigning to it.
This resulted in the AddressSanitizer detecting attempt
to double-free when the uninitialized string contained
apprently a bogus pointer.

The change emplaces the returned optional when needed
without resorting to the copy-assignment operator.
So it's not suceptible to assigning to uninitialized
memory, and it's more efficient as well...

Fixes scylladb/scylladb#19041

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b2fa954d82)

Closes scylladb/scylladb#19117
2024-06-06 08:21:05 +03:00
Botond Dénes
8d12eeee62 Merge '[Backport 6.0] tasks: introduce task manager's task folding' from Aleksandra Martyniuk
Task manager's tasks stay in memory after they are finished.
Moreover, even if a child task is unregistered from task manager,
it is still alive since its parent keeps a foreign pointer to it. Also,
when a task has finished successfully there is no point in keeping
all of its descendants in memory.

The patch introduces folding of task manager's tasks. Whenever
a task which has a parent is finished it is unregistered from task
manager and foreign_ptr to it (kept in its parent) is replaced
with its status. Children's statuses of the task are dropped unless
they or one of their descendants failed. So for each operation we
keep a tree of tasks which contains:
- a root task and its direct children (status if they are finished, a task
  otherwise);
- running tasks and their direct children (same as above);
- a statuses path from root to failed tasks.

/task_manager/wait_task/ does not unregister tasks anymore.

Refs: https://github.com/scylladb/scylladb/issues/16694.

- [ ] ** Backport reason (please explain below if this patch should be backported or not) **
Requires backport to 6.0 as task number exploded with tablets.

(cherry picked from commit 6add9edf8a)

(cherry picked from commit 319e799089)

(cherry picked from commit e6c50ad2d0)

(cherry picked from commit a82a2f0624)

(cherry picked from commit c1b2b8cb2c)

(cherry picked from commit 30f97ea133)

(cherry picked from commit fc0796f684)

(cherry picked from commit d7e80a6520)

(cherry picked from commit beef77a778)

Refs https://github.com/scylladb/scylladb/pull/18735

Closes scylladb/scylladb#19104

* github.com:scylladb/scylladb:
  docs: describe task folding
  test: rest_api: add test for task tree structure
  test: rest_api: modify new_test_module
  tasks: test: modify test_task methods
  api: task_manager: do not unregister task in /task_manager/wait_task/
  tasks: unregister tasks with parents when they are finished
  tasks: fold finished tasks info their parents
  tasks: make task_manager::task::impl::finish_failed noexcept
  tasks: change _children type
2024-06-06 07:56:12 +03:00
Gleb Natapov
e11827f37e raft topology: fix indentation after previous commit
(cherry picked from commit f91db0c1e4)
2024-06-05 13:55:29 +00:00
Gleb Natapov
0acfc223ab raft topology: do not add bootstrapping node without IP as pending
If there is no mapping from host id to ip while a node is in bootstrap
state there is no point adding it to pending endpoint since write
handler will not be able to map it back to host id anyway. If the
transition sate requires double writes though we still want to fail.
In case the state is write_both_read_old we fail the barrier that will
cause topology operation to rollback and in case of write_both_read_new
we assert but this should not happen since the mapping is persisted by
this point (or we failed in write_both_read_old state).

Fixes: scylladb/scylladb#18676
(cherry picked from commit 6853b02c00)
2024-06-05 13:55:28 +00:00
Gleb Natapov
c53cd98a41 test: add test of bootstrap where the coordinator crashes just before storing IP mapping
On the next boot there is no host ID to IP mapping which causes node to
crash again with "No mapping for :: in the passed effective replication map"
assertion.

(cherry picked from commit 27445f5291)
2024-06-05 13:55:28 +00:00
Gleb Natapov
fa6a7cf144 schema_tables: remove unused code
(cherry picked from commit 1faef47952)
2024-06-05 13:55:28 +00:00
Patryk Jędrzejczak
65021c4b1c [Backport 6.0] test: test_topology_ops: run correctly without tablets
The values of `tablets_enabled` were nonempty strings, so they
always evaluated to `True` in the if statement responsible for
enabling writing workers only if tablets are disabled. Hence, the
writing workers were always disabled.

The original commit, ea4717da65,
contains one more change, which is not needed (and conflicting)
in 6.0 because scylladb/scylladb#18898 has been backported first.

Closes scylladb/scylladb#19111
2024-06-05 15:15:00 +02:00
Botond Dénes
341c29bd74 Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.

Fixes https://github.com/scylladb/scylladb/issues/18085.

(cherry picked from commit abcc68dbe7)

(cherry picked from commit 551bf9dd58)

(cherry picked from commit e7246751b6)

Refs https://github.com/scylladb/scylladb/pull/18287

Closes scylladb/scylladb#19095

* github.com:scylladb/scylladb:
  topology_experimental_raft/test_tablets: restore usage of check_with_down
  test: Fix flakiness in topology_experimental_raft/test_tablets
  service: Use tablet read selector to determine which replica to account table stats
  storage_service: Fix race between tablet split and stats retrieval
2024-06-05 13:06:32 +03:00
Aleksandra Martyniuk
e963631859 docs: describe task folding
(cherry picked from commit beef77a778)
2024-06-05 10:09:13 +02:00
Jenkins Promoter
c6f0a3267e Update ScyllaDB version to: 6.0.0-rc3 2024-06-05 10:03:47 +03:00
Marcin Maliszkiewicz
f02f2fef40 docs: remove note about performance degradation with default superuser
This doesn't apply for auth-v2 as we improved data placement and
removed cassandra quirk which was setting different CL for some
default superuser involved operations.

Fixes #18773

(cherry picked from commit 9adf74ae6c)

Closes scylladb/scylladb#18860
2024-06-05 09:04:45 +03:00
Benny Halevy
f8ae38a68c data_dictionary: keyspace_metadata: format: print also initial_tablets
Currently, there is no indication of tablets in the logged KSMetaData.
Print the tablets configuration of either the`initial`  number of tablets,
if enabled, or {'enabled':false} otherwise.

For example:
```
migration_manager - Create new Keyspace: KSMetaData{name=tablets_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"initial":0}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004d446a8}

migration_manager - Create new Keyspace: KSMetaData{name=vnodes_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"enabled":false}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004c33ea8}

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4fe700a962)

Closes scylladb/scylladb#19009
2024-06-05 08:31:21 +03:00
Botond Dénes
8a064daccf Update tools/java submodule
* tools/java 4ee15fd9...6dfc187a (1):
  > Update Scylla Java driver to 3.11.5.3.

[botond: regenerate frozen toolchain]

Closes scylladb/scylladb#18999
2024-06-05 08:00:19 +03:00
Botond Dénes
7f540407c9 Merge '[Backport 6.0] repair: Introduce new primary replica selection algorithm for tablets' from ScyllaDB
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix cause the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc

Use the tablet_map get_primary_replica or get_primary_replica_within_dc,
respectively to see if this node is the primary replica for each tablet
or not.

Fixes https://github.com/scylladb/scylladb/issues/17752

No backport is required before 6.0 as tablets (and tablet repair) are introduced in 6.0

(cherry picked from commit c52f70f92c)

(cherry picked from commit 2de79c39dc)

(cherry picked from commit 84761acc31)

(cherry picked from commit 009767455d)

(cherry picked from commit 18df36d920)

 Refs #18784

Closes scylladb/scylladb#19068

* github.com:scylladb/scylladb:
  repair: repair_tablets: use get_primary_replica
  repair: repair_tablets: no need to check ranges_specified per tablet
  locator: tablet_map: add get_primary_replica_within_dc
  locator: tablet_map: get_primary_replica: do not copy tablet info
  locator: tablet_map: get_primary_replica: return tablet_replica
2024-06-05 07:47:24 +03:00
Aleksandra Martyniuk
50e1369d1d test: rest_api: add test for task tree structure
Add test which checks whether the tasks are folded into their parent
as expected.

(cherry picked from commit d7e80a6520)
2024-06-04 14:42:10 +00:00
Aleksandra Martyniuk
21e860453c test: rest_api: modify new_test_module
Remove remaining test tasks when a test module is removed, so that
a node could shutdown even if a test fails.

(cherry picked from commit fc0796f684)
2024-06-04 14:42:10 +00:00
Dawid Medrek
fc3d2d8fde db/hints: Introduce an error injection to test draining
We want to verify that a hint directory is drained
when any of the nodes correspodning to it leaves
the cluster. The test scenario should happen before
the whole cluster has been migrated to
the host-ID-based hinted handoff, so when we still
rely on the mappings between hint endpoint managers
and the hint directories managed by them.

To make such a test possible, in these changes we
introduce an error injection rejecting incoming
hints. We want to test a scenario when:

1. hints are saved towards a given node -- node N1,
2. N1 changes its IP to a different one,
3. some other node -- node N2 -- changes its IP
   to the original IP of N1,
4. hints are saved towards N2 and they are stored
   in the same directory as the hints saved towards
   N1 before,
5. we start draining N2.

Because at some point N2 needs to be stopped,
it may happen that some mutations towards
a distributed system table generate a hint
to N2 BEFORE it has finished changing its IP,
effectively creating another hint directory
where ALL of the hints towards the node
will be stored from there on. That would disturb
the test scenario. Hence, this error injection is
necessary to ensure that all of the steps in the
test proceed as expected.

(cherry picked from commit e855794327)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
1d34da21a9 tasks: test: modify test_task methods
Wait until the task is done in test_task::finish_failed and
test_task::finish to ensure that it is folded into its parent.

(cherry picked from commit 30f97ea133)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
377bc345f1 api: task_manager: do not unregister task in /task_manager/wait_task/
If /task_manager/wait_task/ unregisters the task, then there is no
way to examine children failures, since their statuses can be checked
only through their parent.

(cherry picked from commit c1b2b8cb2c)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
607be221b8 tasks: unregister tasks with parents when they are finished
Unregister children that are finished from task manager. They can be
examined through they parents.

(cherry picked from commit a82a2f0624)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
cb242ad48c tasks: fold finished tasks info their parents
Currently, when a child task is unregistered, it is still kept by its parent. This leads
to excessive memory usage, especially when the tasks are configured to be kept in task
manager after they are finished (task_ttl_in_seconds).

Introduce task_essentials struct which keeps only data necesarry for task manager API.
When a task which has a parent is finished, a foreign pointer to it in its parent is replaced
with respective task_essentials. Once a parent task is finished it is also folded into
its parent (if it has one). Children details of a folded task are lost, unless they
(or some of their subtrees) failed. That is, when a task is finished, we keep:
- a root task (until it is unregistered);
- task_essentials of root's direct children;
- a path (of task_essentials) from root to each failed task (so that the reason
  of a failure could be examined).

(cherry picked from commit e6c50ad2d0)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
7258f4f73c tasks: make task_manager::task::impl::finish_failed noexcept
(cherry picked from commit 319e799089)
2024-06-04 14:42:09 +00:00
Dawid Medrek
82d635b6a7 db/hints: Ensure that draining happens
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.

Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.

Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.

(cherry picked from commit 745a9c6ab8)
2024-06-04 14:42:08 +00:00
Aleksandra Martyniuk
baf0385728 tasks: change _children type
Keep task children in a map. It's a preparation for further changes.

(cherry picked from commit 6add9edf8a)
2024-06-04 14:42:08 +00:00
Raphael S. Carvalho
a373ed52a5 topology_experimental_raft/test_tablets: restore usage of check_with_down
e7246751b6 incorrectly dropped its usage in
test_tablet_missing_data_repair.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-04 11:01:22 -03:00
Raphael S. Carvalho
9a341a65af replica: Only consume memtable of the tablet intersecting with range read
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.

Note this is a performance issue, not correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).

That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.

Refs #18904.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 832fb43fb4)

Closes scylladb/scylladb#18983
2024-06-04 16:17:47 +03:00
Kefu Chai
35b4b47d74 build: add sanitizer compiling options directly
before this change, in order to avoid repeating/hardwiring the
compiling options set by Seastar, we just inherit the compiling
options of Seastar for building Abseil, as the former exposes the
options to enable sanitizers.

this works fine, despite that, strictly speaking, not all options
are necessary for building abseil, as abseil is not a Seastar
application -- it is just a C++ library.

but when we introduce dependencies which are only generated at
build time, and these dependencies are passed to the compiler
at build time, this breaks the build of Abseil. because these
dependencies are exposed by the Seastar's .pc file, and consumed
by Abseil. when building Abseil, apparently, the building process
driven by ninja is not started yet, so we are not able to build
Abseil with these settings due to missing dependencies.

so instead of inheriting the compiling options from Seastar, just
set the sanitizer related compiling options directly, to avoid
referencing these missing dependencies.

the upside is that we pass a much smaller set of compiling options
to compiler when building Abseil, the downside is that we hardwire
these options related to sanitizer manually, they are also detected
by Seastar's building system. but fortunately, these options are
relatively stable across the building environements we support.

Fixes #19055
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit c436dfd2db)

Closes scylladb/scylladb#19064
2024-06-04 15:00:43 +03:00
Benny Halevy
6d7388c689 repair: repair_tablets: use get_primary_replica
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix cause the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc

Use the tablet_map get_primary_replica* functions to get
the primary replica for each tablet, possibly within a dc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 18df36d920)
2024-06-03 19:50:40 +00:00
Benny Halevy
6ac34f7acf repair: repair_tablets: no need to check ranges_specified per tablet
The code already turns off `primary_replica_only`
if `!ranges_specified.empty()`, so there's no need to
check it again inside the per-tablet loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 009767455d)
2024-06-03 19:50:40 +00:00
Benny Halevy
bdf3e71f62 locator: tablet_map: add get_primary_replica_within_dc
Will be needed by repair in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 84761acc31)
2024-06-03 19:50:40 +00:00
Benny Halevy
ec30bdc483 locator: tablet_map: get_primary_replica: do not copy tablet info
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2de79c39dc)
2024-06-03 19:50:40 +00:00
Benny Halevy
21f87c9cfa locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair when it will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
2024-06-03 19:50:39 +00:00
Botond Dénes
a38d5463ef Merge '[Backport 6.0] tablets: load balancer: Use random selection of candidates when moving tablets' from ScyllaDB
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and another one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.

Refs #16824

(cherry picked from commit 3be6120e3b)

(cherry picked from commit c9bcb5e400)

(cherry picked from commit 7b1eea794b)

(cherry picked from commit 603abddca9)

 Refs #18885

Closes scylladb/scylladb#19036

* github.com:scylladb/scylladb:
  tablets: load balancer: Use random selection of candidates when moving tablets
  test: perf: Add test for tablet load balancer effectiveness
  load_sketch: Extract get_shard_minmax()
  load_sketch: Allow populating only for a given table
2024-06-03 12:25:05 +03:00
Raphael S. Carvalho
3cb71c5b88 replica: Fix race of tablet snapshot with compaction
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.

Fixes #18977.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)

Closes scylladb/scylladb#19008
2024-06-03 12:21:52 +03:00
Lakshmi Narayanan Sreethar
85805f6472 db/config.cc: increment components_memory_reclaim_threshold config default
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes #18607

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)

Closes scylladb/scylladb#19014
2024-06-03 12:19:16 +03:00
Pavel Emelyanov
62a23fd86a config: Remove experimental TABLETS feature
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)

Closes scylladb/scylladb#19012
2024-06-03 12:16:41 +03:00
Tomasz Grabiec
b9c88fdf4b tablets: load balancer: Use random selection of candidates when moving tablets
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and the one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.

Refs #16824

(cherry picked from commit 603abddca9)
2024-06-02 22:40:46 +00:00
Tomasz Grabiec
0c1b6fed16 test: perf: Add test for tablet load balancer effectiveness
(cherry picked from commit 7b1eea794b)
2024-06-02 22:40:45 +00:00
Tomasz Grabiec
fb7a33be13 load_sketch: Extract get_shard_minmax()
(cherry picked from commit c9bcb5e400)
2024-06-02 22:40:44 +00:00
Tomasz Grabiec
b208953e07 load_sketch: Allow populating only for a given table
(cherry picked from commit 3be6120e3b)
2024-06-02 22:40:44 +00:00
Michał Jadwiszczak
803662351d docs/procedures/backup-restore: use DESC SCHEMA WITH INTERNALS
Update docs for backup procedure to use `DESC SCHEMA WITH INTERNALS`
instead of plain `DESC SCHEMA`.
Add a note to use cqlsh in a proper version (at least 6.0.19).

Closes scylladb/scylladb#18953

(cherry picked from commit 5b4e688668)
2024-06-02 23:15:49 +02:00
Marcin Maliszkiewicz
cbf47319c1 db: auth: move auth tables to system keyspace
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769

(cherry picked from commit 2ab143fb40)

[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
2024-06-02 21:41:14 +03:00
Jenkins Promoter
64388bcf22 Update ScyllaDB version to: 6.0.0-rc2 2024-06-02 15:35:58 +03:00
Anna Stuchlik
83dfe6bfd6 doc: add support for Ubuntu 24.04
(cherry picked from commit e81afa05ea)

Closes scylladb/scylladb#19010
2024-05-31 18:33:57 +03:00
Wojciech Mitros
3c47ab9851 mv: handle different ERMs for base and view table
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

(cherry-picked from 519317dc58)

This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.

Fixes: https://github.com/scylladb/scylladb/issues/18941

(cherry-picked from 0de3a5f3ff)

Closes scylladb/scylladb#18974
2024-05-30 09:13:31 +02:00
Anna Stuchlik
bef3777a5f doc: add the tablets information to the nodetool describering command
This commit adds an explanation of how the `nodetool describering` command
works if tablets are enabled.

(cherry picked from commit 888d7601a2)

Closes scylladb/scylladb#18981
2024-05-30 09:22:49 +03:00
Pavel Emelyanov
b25dd2696f Backport Merge 'tablets: alter keyspace' from Piotr Smaron
This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablets replicas through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.

refs: scylladb/scylladb#16723

* br-backport-alter-ks-tablets:
  test: Do not check tablets mutations on nodes that don't have them
  test: Fix the way tablets RF-change test parses mutation_fragments
  test/tablets: Unmark RF-changing test with xfail
  docs: document ALTER KEYSPACE with tablets
  Return response only when tablets are reallocated
  cql-pytest: Verify RF is changes by at most 1 when tablets on
  cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
  Reject ALTER with 'replication_factor' tag
  Implement ALTER tablets KEYSPACE statement support
  Parameterize migration_manager::announce by type to allow executing different raft commands
  Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
  Extend system.topology with 3 new columns to store data required to process alter ks global topo req
  Allow query_processor to check if global topo queue is empty
  Introduce new global topo `keyspace_rf_change` req
  New raft cmd for both schema & topo changes
  Add storage service to query processor
  tablets: tests for adding/removing replicas
  tablet_allocator: make load_balancer_stats_manager configurable by name
2024-05-30 08:33:58 +03:00
Pavel Emelyanov
57d267a97e test: Do not check tablets mutations on nodes that don't have them
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
5b8523273b test: Fix the way tablets RF-change test parses mutation_fragments
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and errorneously fails by seeing, that RF=3 is not equal to the
number of mutations found, which is 4.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
6497ed68ed test/tablets: Unmark RF-changing test with xfail
Now the scailing works and test must check it does

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Piotr Smaron
39c1237e25 docs: document ALTER KEYSPACE with tablets 2024-05-30 08:33:26 +03:00
Piotr Smaron
e04964ba17 Return response only when tablets are reallocated
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
2024-05-30 08:33:26 +03:00
Dawid Medrek
fb5b9012e6 cql-pytest: Verify RF is changes by at most 1 when tablets on
This commit adds a test verifying that we can only
change the RF of a keyspace for any DC by at most 1
when using tablets.

Fixes #18029
2024-05-30 08:33:26 +03:00
Dawid Medrek
749197f0a4 cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
We want to ensure that when the replication factor
of a keyspace changes, it changes by at most 1 per DC
if it uses tablets. The rationale for that is to make
sure that the old and new quorums overlap by at least
one node.

After these changes, attempts to change the RF of
a keyspace in any DC by more than 1 will fail.
2024-05-30 08:33:26 +03:00
Piotr Smaron
1f4428153f Reject ALTER with 'replication_factor' tag
This patch removes the support for the "wildcard" replication_factor
option for ALTER KEYSPACE when the keyspace supports tablets.
It will still be supported for CREATE KEYSPACE so that a user doesn't
have to know all datacenter names when creating the keyspace,
but ALTER KEYSPACE will require that and the user will have to
specify the exact change in replication factors they wish to make by
explicitly specifying the datacenter names.
Expanding the replication_factor option in the ALTER case is
unintuitive and it's a trap many users fell into.

See #8881, #15391, #16115
2024-05-30 08:33:26 +03:00
Piotr Smaron
544c424e89 Implement ALTER tablets KEYSPACE statement support
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
2024-05-30 08:33:25 +03:00
Piotr Smaron
73b59b244d Parameterize migration_manager::announce by type to allow executing different raft commands
Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instances of
each template exists across the whole code base, to avoid compiling it
in each translation unit.
2024-05-30 08:33:15 +03:00
Piotr Smaron
5afa3028a3 Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks 2024-05-30 08:33:15 +03:00
Piotr Smaron
885c7309ee Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
2024-05-30 08:33:15 +03:00
Piotr Smaron
adfad686b3 Allow query_processor to check if global topo queue is empty
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
2024-05-30 08:33:15 +03:00
Piotr Smaron
1a70db17a6 Introduce new global topo keyspace_rf_change req
It will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).
2024-05-30 08:33:15 +03:00
Piotr Smaron
bd4b781dc8 New raft cmd for both schema & topo changes
Allows executing combined topology & schema mutations under a single RAFT command
2024-05-30 08:33:15 +03:00
Piotr Smaron
51b8b04d97 Add storage service to query processor
Query processor needs to access storage service to check if global
topology request is still ongoing and to be able to wait until it
completes.
2024-05-30 08:33:15 +03:00
Paweł Zakrzewski
242caa14fe tablets: tests for adding/removing replicas
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
2024-05-30 08:33:15 +03:00
Paweł Zakrzewski
cedb47d843 tablet_allocator: make load_balancer_stats_manager configurable by name
This is needed, because the same name cannot be used for 2 separate
entities, because we're getting double-metrics-registration error, thus
the names have to be configurable, not hardcoded.
2024-05-30 08:33:15 +03:00
Pavel Emelyanov
da816bf50c test/tablets: Check that after RF change data is replicated properly
There's a test that checks system.tablets contents to see that after
changing ks replication factor via ALTER KEYSPACE the tablet map is
updated properly. This patch extends this test that also validates that
mutations themselves are replicated according to the desired replication
factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18644
2024-05-30 08:31:48 +03:00
Botond Dénes
8bff078a89 Merge '[Backport 6.0] cdc, raft topology: fix and test cdc in the recovery mode' from ScyllaDB
This PR ensures that CDC keeps working correctly in the recovery
mode after leaving the raft-based topology.

We update `system.cdc_local` in `topology_state_load` to ensure
a node restarting in the recovery mode sees the last CDC generation
created by the topology coordinator.

Additionally, we extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator.

Fixes scylladb/scylladb#17409
Fixes scylladb/scylladb#17819

(cherry picked from commit 4351eee1f6)

(cherry picked from commit 68b6e8e13e)

(cherry picked from commit 388db33dec)

(cherry picked from commit 2111cb01df)

 Refs #18820

Closes scylladb/scylladb#18938

* github.com:scylladb/scylladb:
  test: test_topology_recovery_basic: test CDC during recovery
  test: util: start_writes_to_cdc_table: add FIXME to increase CL
  test: util: start_writes_to_cdc_table: allow restarting with new cql
  storage_service: update system.cdc_local in topology_state_load
2024-05-29 16:14:02 +03:00
Botond Dénes
68d12daa7b Merge '[Backport 6.0] Fix parsing of initial tablets by ALTER' from ScyllaDB
If the user wants to change the default initial tablets value, it uses ALTER KEYSPACE statement. However, specifying `WITH tablets = { initial: $value }`  will take no effect, because statement analyzer only applies `tablets` parameters together with the `replication` ones, so the working statement should be `WITH replication = $old_parameters AND tablets = ...` which is not very convenient.

This PR changes the analyzer so that altering `tablets` happens independently from `replication`. Test included.

fixes: #18801

(cherry picked from commit 8a612da155)

(cherry picked from commit a172ef1bdf)

(cherry picked from commit 1003391ed6)

 Refs #18899

Closes scylladb/scylladb#18918

* github.com:scylladb/scylladb:
  cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
  cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
  cql3: Remove unused ks_prop_defs/prepare_options() argument
2024-05-29 16:13:26 +03:00
Patryk Jędrzejczak
e1616a2970 test: test_topology_ops: stop a write worker after the first error
`test_topology_ops` is flaky, which has been uncovered by gating
in scylladb/scylladb#18707. However, debugging it is harder than it
should be because write workers can flood the logs. They may send
a lot of failed writes before the test fails. Then, the log file
can become huge, even up to 20 GB.

Fix this issue by stopping a write worker after the first error.

This test is important for 6.0, so we can backport this change.

(cherry picked from commit 7c1e6ba8b3)

Closes scylladb/scylladb#18914
2024-05-29 16:11:46 +03:00
Kefu Chai
62f5171a55 docs: fix typos in upgrade document
s/Montioring/Monitoring/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit f1f3f009e7)

Closes scylladb/scylladb#18912
2024-05-29 16:11:08 +03:00
Piotr Smaron
fd928601ad cql: fix a crash lurking in ks_prop_defs::get_initial_tablets
`tablets_options->erase(it);` invalidates `it`, but it's still referred
to later in the code in the last `else`, and when that code is invoked,
we get a `heap-use-after-free` crash.

Fixes: #18926
(cherry picked from commit 8a77a74d0e)

Closes scylladb/scylladb#18949
2024-05-29 16:10:24 +03:00
Aleksandra Martyniuk
ae474f6897 test: fix test_tombstone_gc.py
Tests in test_tombstone_gc.py are parametrized with string instead
of bool values. Fix that. Use the value to create a keyspace with
or without tablets.

Fixes: #18888.
(cherry picked from commit b7ae7e0b0e)

Closes scylladb/scylladb#18948
2024-05-29 16:09:45 +03:00
Anna Stuchlik
099338b766 doc: add the tablet limitation to the manual recovery procedure
This commit adds the information that the manual recovery procedure
is not supported if tablets are enabled.

In addition, the content in the Manual Recovery Procedure is reorganized
by adding the Prerequisites and Procedure subsections - in this way,
we can limit the number of Note and Warning boxes that made the page
hard to follow.

Fixes https://github.com/scylladb/scylladb/issues/18895

(cherry picked from commit cfa3cd4c94)

Closes scylladb/scylladb#18947
2024-05-29 16:08:44 +03:00
Anna Stuchlik
375610ace8 doc: document RF limitation
This commit adds the information that the Replication Factor
must be the same or higher than the number of nodes.

(cherry picked from commit 87f311e1e0)

Closes scylladb/scylladb#18946
2024-05-29 16:08:05 +03:00
Botond Dénes
1b64e80393 Merge '[Backport 6.0] Harden the repair_service shutdown path' from ScyllaDB
This series ignores errors in `load_history()` to prevent `abort_requested_exception` coming from `get_repair_module().check_in_shutdown()` from escaping during `repair_service::stop()`, causing
```
repair_service::~repair_service(): Assertion `_stopped' failed.
```

Fixes https://github.com/scylladb/scylladb/issues/18889

Backport to 6.0 required due to 523895145d

(cherry picked from commit 38845754c4)

(cherry picked from commit c32c418cd5)

 Refs #18890

Closes scylladb/scylladb#18944

* github.com:scylladb/scylladb:
  repair: load_history: warn and ignore all errors
  repair_service: debug stop
2024-05-29 16:07:40 +03:00
Benny Halevy
fa330a6a4d repair: load_history: warn and ignore all errors
Currently, the call to `get_repair_module().check_in_shutdown()`
may throw `abort_requested_exception` that causes
`repair_service::stop()` to fail, and trigger assertion
failure in `~repair_service`.

We alredy ignore failure from `update_repair_time`,
so expand the logic to cover the whole function body.

Fixes scylladb/scylladb#18889

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c32c418cd5)
2024-05-28 17:07:26 +00:00
Benny Halevy
68544d5bb3 repair_service: debug stop
Seen the following unexplained assertion failure with
pytest -s -v --scylla-version=local_tarball --tablets repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
```
INFO  2024-05-27 11:18:05,081 [shard 0:main] init - Shutting down repair service
INFO  2024-05-27 11:18:05,081 [shard 0:main] task_manager - Stopping module repair
INFO  2024-05-27 11:18:05,081 [shard 0:main] task_manager - Unregistered module repair
INFO  2024-05-27 11:18:05,081 [shard 1:main] task_manager - Stopping module repair
INFO  2024-05-27 11:18:05,081 [shard 1:main] task_manager - Unregistered module repair
scylla: repair/row_level.cc:3230: repair_service::~repair_service(): Assertion `_stopped' failed.
Aborting on shard 0.
Backtrace:
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3f040c
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x41c7a1
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x3dbaf
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x8e883
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x3dafd
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x2687e
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x2679a
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x36186
  0x26f2428
  0x10fb373
  0x10fc8b8
  0x10fc809
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x456c6d
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x456bcf
  0x10fc65b
  0x10fc5bc
  0x10808d0
  0x1080800
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff22f
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x4003b7
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff888
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36dea8
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d0e2
  0x101cefa
  0x105a390
  0x101bde7
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a
  0x101a764
```

Decoded:
```
~repair_service at ./repair/row_level.cc:3230
~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:491
 (inlined by) ~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:491
~shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:569
 (inlined by) seastar::shared_ptr<repair_service>::operator=(seastar::shared_ptr<repair_service>&&) at ././seastar/include/seastar/core/shared_ptr.hh:582
 (inlined by) seastar::shared_ptr<repair_service>::operator=(decltype(nullptr)) at ././seastar/include/seastar/core/shared_ptr.hh:588
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:727
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&>(seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&) at ././seastar/include/seastar/core/future.hh:2035
 (inlined by) seastar::futurize<std::invoke_result<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::type>::type seastar::smp::submit_to<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) at ././seastar/include/seastar/core/smp.hh:367
seastar::futurize<std::invoke_result<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::type>::type seastar::smp::submit_to<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) at ././seastar/include/seastar/core/smp.hh:394
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:725
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>(std::__invoke_other, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>, seastar::future<void> >::type std::__invoke_r<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>(seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:114
 (inlined by) std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>::_M_invoke(std::_Any_data const&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
```

FWIW, gdb crashed when opening the coredump.

This commit will help catch the issue earlier
when repair_service::stop() fails (and it must never fail)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 38845754c4)
2024-05-28 17:07:26 +00:00
Piotr Dulikowski
bc711a169d Merge '[Backport 6.0] qos/raft_service_level_distributed_data_accessor: print correct error message when trying to modify a service level in recovery mode' from ScyllaDB
Raft service levels are read-only in recovery mode. This patch adds check and proper error message when a user tries to modify service levels in recovery mode.

Fixes https://github.com/scylladb/scylladb/issues/18827

(cherry picked from commit 2b56158d13)

(cherry picked from commit ee08d7fdad)

(cherry picked from commit af0b6bcc56)

 Refs #18841

Closes scylladb/scylladb#18913

* github.com:scylladb/scylladb:
  test/auth_cluster/test_raft_service_levels: try to create sl in recovery
  service/qos/raft_sl_dda: reject changes to service levels in recovery mode
  service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
2024-05-28 16:45:52 +02:00
Patryk Jędrzejczak
0d0c037e1d test: test_topology_recovery_basic: test CDC during recovery
In topology on raft, management of CDC generations is moved to the
topology coordinator. We extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator. A node restarting in the recovery mode
should learn about the active generation from `system.cdc_local`
(or from gossip, but we don't want to rely on it). Then, it should
load its data from `system.cdc_generations_v3`.

Fixes scylladb/scylladb#17409

(cherry picked from commit 2111cb01df)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
4d616ccb8c test: util: start_writes_to_cdc_table: add FIXME to increase CL
(cherry picked from commit 388db33dec)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
25d3398b93 test: util: start_writes_to_cdc_table: allow restarting with new cql
This patch allows us to restart writing (to the same table with
CDC enabled) with a new CQL session. It is useful when we want to
continue writing after closing the first CQL session, which
happens during the `reconnect_driver` call. We must stop writing
before calling `reconnect_driver`. If a write started just before
the first CQL session was closed, it would time out on the client.

We rename `finish_and_verify` - `stop_and_verify` is a better
name after introducing `restart`.

(cherry picked from commit 68b6e8e13e)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
ed3ac1eea4 storage_service: update system.cdc_local in topology_state_load
When the node with CDC enabled and with the topology on raft
disabled bootstraps, it reads system.cdc_local for the last
generation. Nodes with both enabled use group0 to get the last
generation.

In the following scenario with a cluster of one node:
1. the node is created with CDC and the topology on raft enabled
2. the user creates table T
3. the node is restarted in the recovery mode
4. the CDC log of T is extended with new entries
5. the node restarts in normal mode
The generation created in the step 3 is seen in
system_distributed.cdc_generation_timestamps but not in
system.cdc_generations_v3, thus there are used streams that the CDC
based on raft doesn't know about. Instead of creating a new
generation, the node should use the generation already committed
to group0.

Save the last CDC generation in the system.cdc_local during loading
the topology state so that it is visible for CDC not based on raft.

Fixes scylladb/scylladb#17819

(cherry picked from commit 4351eee1f6)
2024-05-28 14:01:59 +00:00
Anna Stuchlik
7229c820cf doc: describe Tablets in ScyllaDB
This commit adds the main description of tablets and their
benefits.
The article can be used as a reference in other places
across the docs where we mention tablets.

(cherry picked from commit b5c006aadf)

Closes scylladb/scylladb#18916
2024-05-28 11:27:53 +02:00
Pavel Emelyanov
67878af591 cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
There's a test that checks how ALTER changes the initial tablets value,
but it equips the statement with `replication` parameters because of
limitations that parser used to impose. Now the `tablets` parameters can
come on their own, so add a new test. The old one is kept from
compatibility considerations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1003391ed6)
2024-05-28 02:07:58 +00:00
Pavel Emelyanov
2dbc555933 cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
When the `WITH` doesn't include the `replication` parameters, the
`tablets` one is ignoded, even if it's present in the statement. That's
not great, those two parameter sets are pretty much independent and
should be parsed individually.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a172ef1bdf)
2024-05-28 02:07:58 +00:00
Pavel Emelyanov
3b9c86dcf5 cql3: Remove unused ks_prop_defs/prepare_options() argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 8a612da155)
2024-05-28 02:07:58 +00:00
Raphael S. Carvalho
b6f3891282 test: Fix flakiness in topology_experimental_raft/test_tablets
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode
due to gossiper being aborted prematurely, and causing reconnection
storm.

Another is test_tablet_missing_data_repair which is flaky due an issue
in python driver that session might not reconnect on rolling restart
(tracked by https://github.com/scylladb/python-driver/issues/230)

Refs #15356.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit e7246751b6)
2024-05-27 18:21:21 +00:00
Raphael S. Carvalho
46220bd839 service: Use tablet read selector to determine which replica to account table stats
Since we introduced the ability to revert migrations, we can no longer
rely on ordering of transition stages to determine whether to account
pending or leaving replica. Let's use read selector instead, which
correctly has info which replica type has correct stats info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 551bf9dd58)
2024-05-27 18:21:21 +00:00
Raphael S. Carvalho
55a45e3486 storage_service: Fix race between tablet split and stats retrieval
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.

It's worked by feeding local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).

Fixes #18085.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
2024-05-27 18:21:21 +00:00
Michał Jadwiszczak
1dd522edc8 test/auth_cluster/test_raft_service_levels: try to create sl in recovery
(cherry picked from commit af0b6bcc56)
2024-05-27 18:20:36 +00:00
Michał Jadwiszczak
6d655e6766 service/qos/raft_sl_dda: reject changes to service levels in recovery
mode

When a cluster goes into recovery mode and service levels were migrated
to raft, service levels become temporarily read-only.

This commit adds a proper error message in case a user tries to do any
changes.

(cherry picked from commit ee08d7fdad)
2024-05-27 18:20:36 +00:00
Michał Jadwiszczak
54b9fdab03 service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
When setting/dropping a service level using raft data accessor, the same
validation steps are executed (this_shard_id = 0 and guard is present).
To not duplicate the calls in both functions, they can be extracted to a
helper function.

(cherry picked from commit 2b56158d13)
2024-05-27 18:20:36 +00:00
Raphael S. Carvalho
13f8486cd7 replica: Fix tablet's compaction_groups_for_token_range() with unowned range
File-based tablet streaming calls every shard to return data of every
group that intersects with a given range.
After dynamic group allocation, that breaks as the tablet range will
only be present in a single shard, so an exception is thrown causing
migration to halt during streaming phase.
Ideally, only one shard is invoked, but that's out of the scope of this
fix and compaction_groups_for_token_range() should return empty result
if none of the local groups intersect with the range.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit eb8ef38543)

Closes scylladb/scylladb#18859
2024-05-27 15:20:04 +03:00
Kefu Chai
747ffd8776 migration_manager: do not reference moved-away smart pointer
this change is inspired by clang-tidy. it warns like:
```
[752/852] Building CXX object service/CMakeFiles/service.dir/migration_manager.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/migration_manager.cc:891:71: warning: 'view' used after it was moved [bugprone-use-after-move]
  891 |             db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts);
      |                                                                       ^
/home/runner/work/scylladb/scylladb/service/migration_manager.cc:886:86: note: move occurred here
  886 |             auto mutations = db::schema_tables::make_create_view_mutations(keyspace, std::move(view), ts);
      |                                                                                      ^
```
in which,  `view` is an instance of view_ptr which is a type with the
semantics of shared pointer, it's backed by a member variable of
`seastar::lw_shared_ptr<const schema>`, whose move-ctor actually resets
the original instance. so we are actually accessing the moved-away
pointer in

```c++
db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts)
```

so, in this change, instead of moving away from `view`, we create
a copy, and pass the copy to
`db::schema_tables::make_create_view_mutations()`. this should be fine,
as the behavior of `db::schema_tables::make_create_view_mutations()`
does not rely on if the `view` passed to it is a moved away from it or not.

the change which introduced this use-after-move was 88a5ddabce

Refs 88a5ddabce
Fixes #18837
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 125464f2d9)

Closes scylladb/scylladb#18873
2024-05-27 15:18:29 +03:00
Anna Stuchlik
a87683c7be doc: remove outdated MV error from Troubleshooting
This commit removes the MV error message, which only
affect older versions of ScyllaDB, from the Troubleshooting section.

Fixes https://github.com/scylladb/scylladb/issues/17205

(cherry picked from commit 92bc8053e2)

Closes scylladb/scylladb#18855
2024-05-27 15:12:22 +03:00
Anna Stuchlik
eff7b0d42d doc: replace Raft-disabled with Raft-enabled procedure
This commit fixes the incorrect Raft-related information on the Handling Cluster Membership Change Failures page
introduced with https://github.com/scylladb/scylladb/pull/17500.

The page describes the procedure for when Raft is disabled. Since 6.0, Raft for consistent schema management
is enabled and mandatory (cannot be disabled), this commit adds the procedure for Raft-enabled setups.

(cherry picked from commit 6626d72520)

Closes scylladb/scylladb#18858
2024-05-27 15:11:09 +03:00
David Garcia
7dbcfe5a39 docs: docs: autogenerate metrics
Autogenerates metrics documentation using the scripts/get_description.py script introduced in #17479

docs: add beta
(cherry picked from commit 9eef3d6139)

Closes scylladb/scylladb#18857
2024-05-27 15:10:48 +03:00
Jenkins Promoter
d078bafa00 Update ScyllaDB version to: 6.0.0-rc1 2024-05-23 15:35:32 +03:00
Yaron Kaikov
1b4d5d02ef Update ScyllaDB version to: 6.0.0-rc0 2024-05-22 14:07:45 +03:00
Anna Stuchlik
a86fb293fe doc: update Raft information in 6.0
This commit updates the documentation about Raft in version 6.0.

- "Introduction": The outdated information about consistent topology updates not being supported
  is removed and replaced with the correct information.
- "Enabling Raft": The relevant information is moved to other sections. The irrelevant information
   is removed. The section no longer exists.
- "Verifying that the Raft upgrade procedure finished successfully" - moved under Schema
   (in the same document). I additionally removed the include saying that after you verify
   that schema on Raft is enabled, you MUST enable topology changes on Raft (it is not mandatory;
   also, it should be part of the upgrade guide, not the Raft document).
- Unnecessary or incorrect references to versions are removed.

Refs https://github.com/scylladb/scylladb/issues/18580

Closes scylladb/scylladb#18689
2024-05-21 11:45:36 +02:00
Anna Stuchlik
eefa4a7333 doc: replace 5.4-to-5.5 with 5.4-to-6.0 upgrade guide
This commit replaces the 5.4-to-5.5 upgrade guide with the 5.4-to-6.0 upgrade guide,
including the metrics update information.

The guide references the "Enable Consistent Topology Updates" document,
as enabling consistent topology updates is a new step when upgrading to version 6.0.

Also, a procedure for image upgrades has been added (as verified by @yaronkaikov).

Fixes scylladb/scylladb#18254
Fixes scylladb/scylladb#17896
Refs scylladb/scylladb#18580

Closes scylladb/scylladb#18728
2024-05-21 11:31:04 +02:00
Piotr Dulikowski
9820472277 main: introduce schema commitlog scheduling group
Currently, we do not explicitly set a scheduling group for the schema
commitlog which causes it to run in the default scheduling group (called
"main"). However:

- It is important and significant enough that it should run in a
  scheduling group that is separate from the main one,
- It should not run in the existing "commitlog" group as user writes may
  sometimes need to wait for schema commitlog writes (e.g. read barrier
  done to learn the schema necessary to interpret the user write) and we
  want to avoid priority inversion issues.

Therefore, introduce a new scheduling group dedicated to the schema
commitlog.

Fixes: scylladb/scylladb#15566

Closes scylladb/scylladb#18715
2024-05-21 11:29:57 +02:00
Kefu Chai
5db315930e sstables: fix a typo in comment: s/Mimicks/Mimics/
this typo was identified by the codespell workflow

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18781
2024-05-21 12:14:10 +03:00
Nadav Har'El
dcd26d8a16 Merge 'docs: update isolation.md' from Botond Dénes
Update `docs/dev/isolation.d`:
* Update the list of scheduling groups
* Remove IO priority groups (they were folded into scheduling groups)
* Add section on RPC isolation

Closes scylladb/scylladb#18749

* github.com:scylladb/scylladb:
  docs: isolation.md: add section on RPC call isolation
  docs: isolation.md: remove mention of IO priority groups
  docs: isolation.md: update scheduling group list, add aliases
2024-05-21 11:46:57 +03:00
Kefu Chai
44e85c7d79 build: "undo" the coverage compiling options added to abseil
we are not interseted in the code coverage of abseil library, so no need
to apply the compiling options enabling the coverage instrumentation
when building the abseil library.

moreover, since the path of the file passed to `-fprofile-list` is a relative
path. when building with coverage enabled, the build fails when building
abseil, like:

```
 /usr/lib64/ccache/clang++  -I/jenkins/workspace/scylla-master/scylla-ci/scylla/abseil -std=c++20 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -fprofile-instr-generate -fcoverage-mapping -fprofile-list=./coverage_sources.list -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/strings/CMakeFiles/strings.dir/str_cat.cc.o -MF absl/strings/CMakeFiles/strings.dir/str_cat.cc.o.d -o absl/strings/CMakeFiles/strings.dir/str_cat.cc.o -c /jenkins/workspace/scylla-master/scylla-ci/scylla/abseil/absl/strings/str_cat.cc
clang-16: error: no such file or directory: './coverage_sources.list'`
```

in this change, we just remove the compiling options enabling the
coverage instrumentation from the cflags when building abseil.

Fixes #18686
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18748
2024-05-21 11:43:16 +03:00
Botond Dénes
11fa79a537 docs: isolation.md: add section on RPC call isolation 2024-05-21 03:12:22 -04:00
Kefu Chai
86b988a70b test/lib: do not use variable which could be moved away
C++ standard does not define the order in which the parameters
passed to a function are evaluated. so in theory, in
```c++
reusable_sst(sst->get_schema(), std::move(sst));
```
`std::move(sst)` could be evaluated before `sst->get_schema`.
but please note, `std::move(sst)` does not move `sst`
away, it merely cast `sst` to a rvalue reference, it is
`reusable_sst()` which *could* move `sst` away by
consuming it. so following call is much more dangerous
than the above one:
```c++
reusable_sst(sst->get_schema(), modify_sst(std::move(sst)))
```
nevertheless, this usage is still confusing. so instead
of passing a copy of `sst` to `reusable_sst`.

this change is inspired by clang-tidy, it warns like:

```
Warning: /home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:25: warning: 'sst' used after it was moved [bugprone-use-after-move]
  397 |     return reusable_sst(sst->get_schema(), std::move(sst));
      |                         ^
/home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:44: note: move occurred here
  397 |     return reusable_sst(sst->get_schema(), std::move(sst));
      |                                            ^
/home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:25: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
  397 |     return reusable_sst(sst->get_schema(), std::move(sst));
      |
```

per the analysis above, this is a false alarm.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18775
2024-05-21 10:02:10 +03:00
Pavel Emelyanov
428e0bd7d4 locator: Remove unused lshift-operator for topology
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18714
2024-05-21 09:46:30 +03:00
Pavel Emelyanov
b24fb8dc87 inet_address: Remove to_sstring() in favor of fmt::to_string
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is declared in .cc file, while
form formatter is in the header and is equipeed with constexprs so
that converting an address to string is done as much as possible
compile-time.

Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18712
2024-05-21 09:43:08 +03:00
Kefu Chai
b6e2d6868b build: add dependencies from binaries to abseil libraries
in 0b0e661a, we brought abseil submodule back. but we didn't update
the build.ninja rules properly -- we should have add the abseil
libraries to the dependencies of the binaries so that the abseil
libraries are always generated before a certain binary is built.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18753
2024-05-21 08:50:48 +03:00
Avi Kivity
33ec6ccea9 test: boost: chunked_vector_test: include <optional>
std::optional is used but not imported. This fails on libstdc++-14.

Closes scylladb/scylladb#18739
2024-05-21 07:37:11 +03:00
Yaron Kaikov
bc596a3e76 pull_request_template: clearify the template and remove checkbox verification
It seems that having the checkbox in the PR template and failing the action is confusing and not very clear. Let's remove it completely and just add to the template an explanation to explain the backport reason

Closes scylladb/scylladb#18708
2024-05-20 18:24:28 +03:00
Botond Dénes
f239339a29 Merge 'Improve modularity of some per-table API endpoints' from Pavel Emelyanov
There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under api/ directory and duplicate code of each other. This PR generalizes those handlers, places them next to each other, fixes leak on stop and, as a nice side effect, enlightens database.hh header.

Closes scylladb/scylladb#18703

* github.com:scylladb/scylladb:
  api,database: Move auto-compaction toggle guard
  api: Move some table manipulation helpers from storage_service
  api: Move table-related calls from storage_service domain
  api: Reimplement some endpoints using existing helpers
  api: Lost unset of tombstone-gc endpoints
2024-05-20 18:01:54 +03:00
Avi Kivity
61505d057e Merge 'Sort user-defined types in describe statements' from Michał Jadwiszczak
User-defined types can depend on each other, creating directed acyclic graph.

In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.

This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.

Fixes #18539

Closes scylladb/scylladb#18302

* github.com:scylladb/scylladb:
  test/cql-pytest/test_describe: add test for UDTs ordering
  cql3/statements/describe_statement: UDTs topological sorting
  cql3/statements/describe_statement: allow to skip alphabetical sorting
  types: add a method to get all referenced user types
  db/cql_type_parser: use generic topological sorting
  db/cql_type_parses: futurize raw_builder::build()
  test/boost: add test for topological sorting
  utils: introduce generic topological sorting algorithm
2024-05-20 16:58:17 +03:00
Pavel Emelyanov
159e44d08a test.py: Make it possible to avoid wildcard test names matching
There's a nasty scenario when this searching plays bad joke.

When CI picks up a new branch and notices, that a test had changed, it
spawns a custom job with test.py --repeat 100 $changed_test_name in
it. Next, when the test.py tries opt-in test name matching, it uses the
wildcard search and can pick up extra unwanted tests into the run.

To solve this, the case-selection syntax is extended. Now if the caller
specifies `suite/test::*` as test, the test file is selected by exact
name match, but the specific test-case is not selected, the `*` makes it
run all cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18704
2024-05-20 15:50:47 +02:00
Botond Dénes
e1c4e6c151 Merge 'sstables_manager: use maintenance scheduling group to run components reload fiber' from Lakshmi Narayanan Sreethar
PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group.

Fixes #18675

Closes scylladb/scylladb#18721

* github.com:scylladb/scylladb:
  sstables_manager: use maintenance scheduling group to run components reload fiber
  sstables_manager: add member to store maintenance scheduling group
2024-05-20 16:38:42 +03:00
Takuya ASADA
33af97ca5a dist/docker: revert dropping systemd package
On 7ce6962141 we dropped openssh-server,
it also dropped systemd package and caused an error on Scylla Operator
(#17787).

This reverts dropping systemd package and fix the issue.

Fix #17787

Closes scylladb/scylladb#18643
2024-05-20 16:38:15 +03:00
Andrei Chekun
bce53efd36 Enrich test results produced by test.py
This PR resolves issue with double count of the test result for topology tests. It will not appear in the consolidated report anymore.
Another fix is to provide a better view which test failed by modifying the test case name in the report enriching it with mode and run id, so making them unique across the run.

The scope of this change is:
1. Modify the test name to have run id in name
2. Add handlers to get logs of test.py and pytest in one file that are related to test, rather than to the full suite
3. Remove topology tests from aggregating them on a suite level in Junit results
4. Add a link to the logs related to the failed tests in Junit results, so it will be easier to navigate to all logs related to test
5. Gather logs related to the failed test to one directory for better logs investigation

Ref: scylladb/scylladb#17851

Closes scylladb/scylladb#18277
2024-05-20 15:33:57 +02:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Kefu Chai
a517fcf970 service/storage_proxy: capture tr_state by copy in handle_paxos_accept()
this change is inspired by following warning from clang-tidy

```
Warning: /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:884:13: warning: 'tr_state' used after it was moved [bugprone-use-after-move]
  884 |         if (tr_state) {
      |             ^
/home/runner/work/scylladb/scylladb/service/storage_proxy.cc:872:139: note: move occurred here
  872 |         auto f = get_schema_for_read(proposal.update.schema_version(), src_addr, *timeout).then([&sp = _sp, &sys_ks = _sys_ks, tr_state = std::move(tr_state),
      |                                                                                                                                           ^
```

this is not a false positive. as `tr_state` is a captured by move for
constructing a variable in the captured list of a lambda which is in
turn passed to the expression evaluated to `f`.

even the expression itself is not evaluated yet when we reference
`tr_state` to check if it is empty after preparing the expression,
`tr_state` is already moved away into the captured variable. so
at that moment, the statement of `f = f.finally(...)` is never
evaluated, because `tr_state` is always empty by then.

so before this change, the trace message is never recorded.

in this change, we address this issue by capturing `tr_state` by
copying it. as `tr_state` is backed by a `lw_shared_ptr`, the overhead is
neglectable.

after this change, the tracing message is recorded.

the change introduced this issue was 548767f91e.

please note, we could coroutinize this function to improve its
readability, but since this is a fix and should be backported,
let's start with a minimal fix, and worry about the readability
in a follow-up change.

Refs 548767f91e
Fixes #18725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18702
2024-05-20 12:58:49 +03:00
Kefu Chai
40ce52c3cc test: use generic boost_test_print_type()
in this change, we trade the `boost_test_print_type()` overloads
for the generic template of `boost_test_print_type()`, except for
those in the very small tests, which presumably want to keep
themselves relative self-contained.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18727
2024-05-20 12:56:20 +03:00
Botond Dénes
0e23cd45ad Merge 'feature: grandfather some old cluster features' from Avi Kivity
This series grandfathers the following features:

  MD_SSTABLE_FORMAT
  ME_SSTABLE feature
  VIEW_VIRTUAL_COLUMNS
  DIGEST_INSENSITIVE_TO_EXPIRY
  CDC
  NONFROZEN_UDTS
  PER_TABLE_PARTITIONERS
  PER_TABLE_CACHING
  DIGEST_FOR_NULL_VALUES
  CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX

Note that for the last (CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX) some code remains to support indexes created before the new feature was adopted.

Each patch names the version where the feature was introduced.

Closes scylladb/scylladb#18428

* github.com:scylladb/scylladb:
  feature, index: grandfather CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX
  feature: grandfather DIGEST_FOR_NULL_VALUES
  storage_proxy: drop use of MD5 as a digest algorithm
  feature: grandfather PER_TABLE_CACHING
  feature: grandfather LWT
  feature: grandfather HINTED_HANDOFF_SEPARATE_CONNECTION
  feature: grandfather PER_TABLE_PARTITIONERS
  test: schema_change_test: regenerate digest for PER_TABLE_PARTITIONERS
  test: test_schema_change_digest: drop unneeded reference digests
  feature: grandfather NONFROZEN_UDTS
  feature: grandfather CDC
  feature: grandfather DIGEST_INSENSITIVE_TO_EXPIRY
  feature: grandfather VIEW_VIRTUAL_COLUMNS
  feature: grandfather ME_SSTABLE feature
  feature: grandfather MD_SSTABLE_FORMAT
2024-05-20 11:48:07 +03:00
Botond Dénes
936a7e282b docs: isolation.md: remove mention of IO priority groups
They were folded into CPU scheduling groups, which now apply to both CPU
and IO.
2024-05-20 03:33:24 -04:00
Botond Dénes
8f61468322 docs: isolation.md: update scheduling group list, add aliases 2024-05-20 03:30:04 -04:00
Lakshmi Narayanan Sreethar
6f58768c46 sstables_manager: use maintenance scheduling group to run components reload fiber
Fixes #18675

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-19 15:23:45 +05:30
Lakshmi Narayanan Sreethar
79f6746298 sstables_manager: add member to store maintenance scheduling group
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-19 15:23:45 +05:30
Avi Kivity
54a82fed6b feature, index: grandfather CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX
This feature corrected how we store the token in secondary indexes. It
was introduced in 7ff72b0ba5 (2020; 4.4) and can now be assumed present
everywhere. Note that we still support indexes created with the old format.
2024-05-18 00:24:11 +03:00
Avi Kivity
2fbd78c769 feature: grandfather DIGEST_FOR_NULL_VALUES
The DIGEST_FOR_NULL_VALUES feature was added in 21a77612b3 (2020; 4.4)
and can now be assumed to be always present. The hasher which it invoked
is removed.
2024-05-18 00:24:00 +03:00
Avi Kivity
879583c489 storage_proxy: drop use of MD5 as a digest algorithm
The XXHASH feature was introduced in 0bab3e59c2 (2017; 2.2) and made
mandatory in defe6f49df (2020; 4.4), but some vestiges remain.
Remove them now. Note that md5_hasher itself is still in use by
other components, so it cannot be removed.
2024-05-18 00:23:47 +03:00
Avi Kivity
7c264e8a71 feature: grandfather PER_TABLE_CACHING
The PER_TABLE_CACHING feature was added in 0475dab359 (2020; 4.2)
and can now be assumed to be always present.
2024-05-18 00:23:30 +03:00
Avi Kivity
d52c424a5f feature: grandfather LWT
LWT was make non-experimental in 9948f548a5 (2020; 4.1) and can now be
assumed to be always present.
2024-05-18 00:20:53 +03:00
Avi Kivity
93088d0921 feature: grandfather HINTED_HANDOFF_SEPARATE_CONNECTION
The HINTED_HANDOFF_SEPARATE_CONNECTION feature was introduced in 3a46b1bb2b (2019; 3.3)
and can be assumed always present.
2024-05-18 00:18:27 +03:00
Avi Kivity
3bead8cea0 feature: grandfather PER_TABLE_PARTITIONERS
The PER_TABLE_PARTITIONERS feature was added in 90df9a44ce (2020; 4.0)
and can now be assumed to be always present. We also remove the associated
schema_feature.
2024-05-18 00:15:07 +03:00
Avi Kivity
6b532fd40b test: schema_change_test: regenerate digest for PER_TABLE_PARTITIONERS
The first digest tested was generated without the PER_TABLE_PARTITIONERS
schema feature. We're about to make that feature mandatory, so we won't
be able (and won't need) to generate a digest without it.

Update the digest to include the feature. Note it wasn't untested before,
we have a test with schema_features::full().
2024-05-18 00:14:43 +03:00
Avi Kivity
c4d8b17f4c test: test_schema_change_digest: drop unneeded reference digests
digests[0] was used by the VIEW_VIRTUAL_COLUMNS feature, which
no longer exists.

digests[1] is the same as digests[2], so drop it.
2024-05-17 20:41:20 +03:00
Avi Kivity
93113da01b feature: grandfather NONFROZEN_UDTS
The NONFROZEN_UDTS feature was added in e74b5deb5d (2019; 3.2)
and can now be assumed to be always present.
2024-05-17 20:41:20 +03:00
Avi Kivity
c7d7ca2c23 feature: grandfather CDC
The CDC feature was made non-experimental in e9072542c1 (2020; 4.4)
and can now be assumed to be always present. We also remove the corresponding
schema_feature.
2024-05-17 20:41:20 +03:00
Avi Kivity
82ad2913ca feature: grandfather DIGEST_INSENSITIVE_TO_EXPIRY
The DIGEST_INSENSITIVE_TO_EXPIRY feature was added in 9de071d214 (2019; 3.2)
and can now be assumed to be always present. We enable the corresponding
schema_feature unconditionally.

We do not remove the corresponding schema feature, because it can be disabled
when the related TABLE_DIGEST_INSENSITIVE_TO_EXPIRY is present.
2024-05-17 20:41:19 +03:00
Avi Kivity
b5f6021a6b feature: grandfather VIEW_VIRTUAL_COLUMNS
The VIEW_VIRTUAL_COLUMNS feature was added in a108df09f9 (2019; 3.1)
and can now be assumed to be always present.

The corresponding schema_feature is removed. Note schema_features are not sent
over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.
2024-05-17 20:41:19 +03:00
Avi Kivity
7952200c8c feature: grandfather ME_SSTABLE feature
"me" format sstables were introduced in d370558279 (Jan 2022; 5.1)
and so can be assumed always present. The listener that checks when
the cluster understands ME_SSTABLE was removed and in its place
we default to sstable_version_types::me (and call on_enabled()
immediately).
2024-05-17 20:41:19 +03:00
Avi Kivity
6d0c0b542c feature: grandfather MD_SSTABLE_FORMAT
"md" sstable support was introduced in e8d7744040 (2020; 4.4)
and so can be assumed to be present on all versions we upgrade from.
Nothing appears to depend on it.
2024-05-17 20:41:19 +03:00
Anna Stuchlik
c93a7d2664 doc: replace 5.5 with 6.0 in SStable docs (me)
This commit replaces the version number 5.5 with 6.0,
because 5.5 has never been released.

This is a follow-up to https://github.com/scylladb/scylladb/pull/16716.

Refs https://github.com/scylladb/scylladb/issues/16551
Refs https://github.com/scylladb/scylladb/issues/18580

Closes scylladb/scylladb#18730
2024-05-17 16:34:18 +03:00
Botond Dénes
db70e8dd5f test/cql-pytest: test_tombstone_limit.py: enable xfailing tests
These tests were marked as xfail because they use to fail with tablets.
They don't anymore, so remove the xfail.

Fixes: #16486

Closes scylladb/scylladb#18671
2024-05-16 20:14:47 +03:00
Nadav Har'El
c7aa47354a Merge 'mutation_fragment_stream_validating_filter: respect validating_level::none' from Botond Dénes
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.

Fixes: #18662

Closes scylladb/scylladb#18667

* github.com:scylladb/scylladb:
  test/boost/mutation_fragment_test.cc: add test for validator validation levels
  mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
  mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
2024-05-16 19:57:49 +03:00
Kamil Braun
734c5de314 Merge 'fix test teardown race with ongoing test operation' from Artsiom Mishuta
This commit brings several new features in scylla_cluster.py to fix runaway asyncio task problems in topology tests

- Start-Stop Lock and Stop Event in ScyllaServer
- Tasks History, Wait for tasks from Tasks History and Manager broken state in ScyllaClusterManager
- make ManagerClient object function scope
- test_finished_event in ManagerClient

Fixes: scylladb/scylladb#16472
Fixes: scylladb/scylladb#16651

Closes scylladb/scylladb#18236

* github.com:scylladb/scylladb:
  test/pylib: Introduce ManagerClient.test_finished_event
  test/topology: make ManagerClient object function scope
  test/pylib: Introduce Manager broken state:
  test/pylib: Wait for tasks from Tasks History:
  test/pylib: Introduce Tasks History:
  test/pylib: Introduce Stop Event
  test/pylib: Introduce Start-Stop Lock:
2024-05-16 17:42:00 +02:00
Kefu Chai
759156b56d test: perf: alternator: mark format string as constexpr
before this change, we use `update_item_suffix` as a format string
fed to `format(...)`, which is resolved to `seastar::format()`.
but with a patch which migrates the `seastar::format()` to the backend
with compile-time format check, the caller sites using `format()` would
fail to build, because `update_item_suffix` is not a `constexpr`:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o -MF test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o.d -o test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o -c /home/kefu/dev/scylladb/test/perf/perf_alternator.cc
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:249:69: error: call to consteval function 'fmt::basic_format_string<char, const char (&)[1]>::basic_format_string<const char *, 0>' is not a constant expression
  249 |     return make_request(cli, "UpdateItem", prefix + seastar::format(update_item_suffix, ""));
      |                                                                     ^
/usr/include/fmt/core.h:2776:67: note: read of non-constexpr variable 'update_item_suffix' is not allowed in a constant expression
 2776 |   FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
      |                                                                   ^
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:249:69: note: in call to 'basic_format_string<const char *, 0>(update_item_suffix)'
  249 |     return make_request(cli, "UpdateItem", prefix + seastar::format(update_item_suffix, ""));
      |                                                                     ^~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:198:6: note: declared here
  198 | auto update_item_suffix = R"(
      |      ^
```

so, to prepare the change switching to compile-time format checking,
let's mark this variable `static constexpr`. this is also more correct,
as this variable is

* a compile time constant, and
* is not shared across different compilation units.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18685
2024-05-16 15:18:42 +03:00
Avi Kivity
6982de6dde Merge 'Fix stalls in forward_service::dispatch() with large tablet count' from Raphael "Raph" Carvalho
With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint.

`    Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f
`

Also there are inefficient copies that are being removed. partition_range_vector for a single endpoint can grow beyond 1M.

Closes scylladb/scylladb#18695

* github.com:scylladb/scylladb:
  service: fix indentation in dispatch()
  service: fix reactor stall with large tablet count
  service: avoid potential expensive copies in forward_service::dispatch()
  service: coroutinize forward_service::dispatch()
2024-05-16 15:17:43 +03:00
Kefu Chai
617e532859 db: config: drop operator<<() for error_injection_at_startup
it is not used anymore, so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18701
2024-05-16 15:10:57 +03:00
Pavel Emelyanov
dffd985401 data_dictionary: Resurrect formatter for keyspace_metadata
It was commented out by the a439ebcfce (treewide: include fmt/ranges.h
and/or fmt/std.h) , probably by mistake

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18665
2024-05-16 15:09:45 +03:00
Pavel Emelyanov
31d05925cc api,database: Move auto-compaction toggle guard
Toggling per-table auto-compaction enabling bit is guarded with
on-database boolean and raii guard. It's only used by a single
api/column_family.cc file, so it can live there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:51 +03:00
Pavel Emelyanov
a43b178f72 api: Move some table manipulation helpers from storage_service
Continuation of the previous patch -- helpers toggling tombstone_gc and
auto_compaction on tables should live in the same file that uses them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
862fcd7bc7 api: Move table-related calls from storage_service domain
The storage_service/(enable|disable)_(tombstone_gc|auto_compaction)
endpoints are not handled by storage_service _service_ and should rather
live in the column_family/ domain which is handler by replica::database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
ba53283d21 api: Reimplement some endpoints using existing helpers
The (enable|disable)_(tombstone_gc|auto_compaction) endpoints living in
column_family domain can benefit from the helpers that do the same in
the storage_service domain. The "difference" is that c.f. endpoints do
it per-table, while s.s. ones operate on a vector of tables, so the
former is a corner case of the latter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Pavel Emelyanov
231ffa623c api: Lost unset of tombstone-gc endpoints
On stop all endpoints must be unregistered, these three are lost

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-16 14:42:50 +03:00
Michał Jadwiszczak
b3e6a39604 test/cql-pytest/test_describe: add test for UDTs ordering 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
f29820fb27 cql3/statements/describe_statement: UDTs topological sorting
User-defined types can depend on each other, creating directed acyclic
graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
7be938192b cql3/statements/describe_statement: allow to skip alphabetical sorting
In a next commit, we are going to introduce topological sorting of
user-defined types, so alphabetical sorting must be skipped not to
interfere.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
8157d260f2 types: add a method to get all referenced user types
The method allows to collect all UDTs used to create a type.
This is required to sort UDTs in a topological order.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
573e13e3f1 db/cql_type_parser: use generic topological sorting 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
3830f3bd23 db/cql_type_parses: futurize raw_builder::build()
In order to use generic topological sort,
build() method needs to return future.
2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
7f04c88395 test/boost: add test for topological sorting 2024-05-16 13:30:03 +02:00
Michał Jadwiszczak
aa08e586fd utils: introduce generic topological sorting algorithm
Until now, we have implemented topological sorting in
db/cql_type_parser.cc but it is specific to its usage.

Now we want to use topological sorting in another place,
so generic sorting algoritm provides one implementation
to be reused in several places.
2024-05-16 13:30:03 +02:00
Nadav Har'El
27ab560abd cql: fix hang during certain SELECT statements
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.

As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.

This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by
Michał Chojnowski.

Fixes #18688
Refs #18690

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18694
2024-05-16 11:25:44 +03:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Botond Dénes
af9e173c99 Merge 'repair: Don't get topology via database' from Pavel Emelyanov
Database has token-metadata onboard and other services use it to get topology from. Repair code has simpler and cleaner ways to get access to topology.

Closes scylladb/scylladb#18677

* github.com:scylladb/scylladb:
  repair: Get topology via replication map
  repair: Use repair_service::my_address() in handlers
  repair: Remove repair_meta::_myip
  repair: Use repair_meta::myip() everywhere
  repair: Add repair_service::my_address() method
2024-05-16 08:28:14 +03:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.

This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.

A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a179f37780 test: tablets_test: Test sharding during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
5f32d2ddb6 test: tablets_test: Check sharding also on the pending host 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
6d809c75fb test: py: tablets: Test writes concurrent with migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
ad02d85c16 test: py: tablets: Test crash during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
679baff25a dht, replica: Remove deprecated sharder APIs 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
32a191384a test: Avoid using deprecated sharded API
There is not tablet migration in unit tests, so shard_of() can be
safely replaced with shard_for_reads(). Even if it's used for writes.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
539460dd71 db: do_apply_many() avoid deprecated sharded API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0f50504c39 replica: mutation_dump: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7bf5733fa5 repair: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c03646f99 table: Remove optimization which returns empty reader when key is not owned by the shard
This check would lead to correctness issues with intra-node migration
because the shard may switch during read, from "read old" to "read
new". If the coordinator used "read old" for shard routing, but table
on the old shard is already using "read new" erm, such a read would
observe empty result, which is wrong.

Drop the optimization. In the scenario above, read will observe all
past writes because:

  1) writes are still using "write both"

  2) writes are switched to "write new" only after all requests which
  might be using "read old" are done

Replica-side coordinators should already route single-key requests to
the correct shard, so it's not important as an optimization.

This issue shows how assumptions about static sharding are embedded in
the current code base and how intra-node migration, by violating those
assumptions, can lead to correctness issues.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
26f2e6aa8e dht: is_single_shard: Avoid deprecated sharder API
All current uses are used in the read path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9e6b4dca7 dht: split_range_to_single_shard: Work with static_sharder only
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.

This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c380aecf64 dht: ring_position_range_sharder: Avoid deprecated sharder APIs
In preparation for tablet intra-node migration.

Existing uses are for reads, so it's safe to use shard_for_reads():
  - in multishard reader
  - in forward_service

The ring_position_range_vector_sharder is used when computing sstable
shards, which for intra-node migration should use the view for
reads. If we haven't completed streaming, sstables should be attached
to the old shard (used by reads). When in write-both-read-new stage,
streaming is complete, reads are using the new shard, and we should
attach sstables to the new shard.

When not in intra-node migration, the view for reads on the pending
node will return the pending shard even if read selector is "read old".
So if pending node restarts during streaming, we will attach to sstables
to the shard which is used by writes even though we're using the selector
for reads.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1aac409bf dht: token: Avoid use of deprecated sharder API by switching to static_sharder
The touched APIs are used only with static_sharder.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dd4a086b87 selective_token_sharder: Avoid use of deprecated sharder API
I analyzed all the uses and all except the alternator/ttl.cc seem to
be interested in the result for the purpose of reading.

Alternator is not supported with tablets yet, so the use was annotated
with a relevant issue.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
eb3a22d5a8 docs: Document tablet sharding vs tablet replica placement 2024-05-16 00:28:47 +02:00
Botond Dénes
635aba435b readers/multishard.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Botond Dénes
bc779ed00c multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
3b7d7088d1 storage_proxy: Extract common code to apply mutations on many shards according to sharder 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
660b3d1765 storage_proxy: Prepare per-partition rate-limiting for intra-node migration
Note: there is a potential problem with rate-limit count going out of sync
during intra-node migration between old and the new shard.

Before this patch, when coordinator accounted and admitted the
request, so the rate_limit_info passed to apply_locally() is
account_only, it was converted to std::monostate for requests to the
local replia. This makes sense because the request was already
accounted by the coordinator.

However, during intra-node migration when we do double writes to two
shards locally, that means that the new shard will not account the
write, it will have lower count than the limiter on the old
shard. This means that the new shard may accept writes which will end
up being rejected. This is not desirable, but not the end of the world
since it's temporary, and the new shard will still protect itself from
overload based on its own rate limiter.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c3291b5ea storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
Cunters are not supported with tablets, so we should not reach this path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
db2809317d storage_proxy: Prepare mutate_hint() for intra-node tablet migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
feafe0f6a7 commitlog_replayer: Avoid deprecated sharder::shard_of()
shard_for_writes() is appropriate, because we're writing.  It can
happen that the tablet was migrated away and no shard is the owner. In
that case the mutation is dropped, as it should be, because "shards"
is empty.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9294b1642 lwt: Avoid deprecated sharder::shard_of()
Instead, use shard_for_reads(). The justification is that:

 1) In cas_shard(), we need to pick a single request coordinator.
    shard_for_reads() gives that, which is equivalent to shard_of()
    if there is no intra-node migration.

 2) In paxos handler for prepare(), the shard we execute it on is
    the shard from which we read, so shard_for_reads() is the one.

 3) Updates of paxos state are separate CQL requests, and use their
    own sharding.

 4) Handler for learn is executing updates using calls to
    storage_proxy::mutate_locally() which will use the right sharder for writes

However, the code is still not prepared for intra-node migration, and
possibly regular migration too in case of abandoned requests, because
the locking of paxos state assumes that the shard is static. That
would have to be fixed separately, e.g. by locking both shards
(shard_for_writes()) during migration, so that the set of locked
shards always intersects during migration and local serialization of
paxos state updates is achieved. I left FIXMEs for that.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
1631bab658 compaction: Avoid deprecated sharder::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
9da3bd84c7 dht: Extract dht::static_sharder
Before the patch, dht::sharder could be instantiated and it would
behave like a static sharder. This is not safe with regards to
extensions of the API because if a derived implementation forgets to
override some method, it would incorrectly default to the
implementation from static sharder. Better to fail the compilation in
this case, so extract static sharder logic to dht::static_sharder
class and make all methods in dht::sharder pure virtual.

This also allows us to have algorithms indicate that they only work
with static sharder by accepting the type, and have compile-time
safety for this requirement.

schema::get_sharder() is changed to return the static_sharder&.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dbca598e99 replica: Deprecate table::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1bee16ee9 locator: Deprecate effective_replication_map::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
10a4903d0c dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
Require users to specify whether we want shard for reads or for writes
by switching to appropriate non-deprecated variant.

For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().

The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
b3cdf9a379 tests: tablets: py: Add intra-node migration test 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
d26cd97633 tests: tablets: Test that drained nodes are not balanced internally
It would be a waste of effort to do so, since we migrate tablets away
anyway.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
04f0088679 tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c76ba52c70 tests: tablets: Verify that disabling balancing results in no intra-node migrations 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0addca88b9 tests: tablets: Check that nodes are internally balanced
Existing tests are augmented with a check which verifies that
all nodes are internally balanced.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0e2617336a tests: tablets: Improve debuggability by showing which rows are missing 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
329342bfb2 tablets, storage_service: Support intra-node migration in move_tablet() API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
db9d3f0128 tablet_allocator: Generate intra-node migration plan
Intra-node migrations are scheduled for each node independently with
the aim to equalize per-shard tablet count on each node.

This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if
RF=N then there is not even a possibility of moving tablets around to
fix the imbalance.  The only way to bring the system to balance is to
move tablets within the nodes.

After scheduling inter-node migrations, the algorithm schedules
intra-node migrations. This means that across-node migrations can
proceed in parallel with intra-node migrations if there is free
capacity to carry them out, but across-node migrations have higher
priority.

Fixes #16594
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
793af3d6e1 tablet_allocator: Extract make_internode_plan()
Currently the load balancer is only generting an inter-node plan, and
the algorithm is embedded in make_plan(). The method will become even
harder to follow once we add more kinds of plan generating steps,
e.g. inter-node plan. Extract the inter-node plan to make it easier to
add other plans and see the grand flow.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
f95a0f0182 tablet_allocator: Maintain candidate list and shard tablet count for
target nodes

The node_load datastructure was not updated to reflect migration
decisions on the target node. This is not needed for inter-node
migration because target nodes are not considered as sources. But we
want it to reflect migration decisions so that later inter-node
migration sees an accurate picture with earlier migrations reflected
in node_load.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
c86f659421 tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
Will be needed by member methods which generate migration plans.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
fdcaaea91a tablets, streaming: Implement tablet streaming for intra-node migration 2024-05-16 00:28:46 +02:00
Tomasz Grabiec
aafeacc8d9 dht, auto_refreshing_sharder: Allow overriding write selector
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
dfed4efcc5 multishard_writer: Handle intra-node migration
This writer is used by streaming, on tablet migration and
load-and-stream.

The caller of distribute_reader_and_consume_on_shards(), which provides
a sharder, is supposed to ensure that effective_replication_map is kept
alive around it, in order for topology coordinator to wait for any writes
which may be in flight to reach their shards before tablet replica starts
another migration. This is already the case:

  1) repair and load-and-stream keep the erm around writing.

  2) tablet migration uses autorefreshing_sharder, so it does not, but
     it keeps the topology_guard around the operation in the consumer,
     which serves the same purpose.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
4df818db98 storage_proxy: Handle intra-node tablet migration for writes
When sharder says that the write should go to multiple shards,
we need to consider the write as applied only if it was applied
to all those shards.

This can happen during intra-node tablet migration. During such migration,
the request coordinator on storage_proxy side is coordinating to hosts
as if no migration was in progress. The replica-side coordinator coordinates
to shards based on sharder response.

One way to think about it is that
effective_replication_map::get_natural_endpoints()/get_pending_endpoints()
tells how to coordinate between nodes, and sharder tells how to
coordinate between shards. Both work with some snapshot of tablet
metadata, which should be kept alive around the operation. Sharder is
associated with its own effective_replication_map, which marks the
topology version as used and allows barriers to synchronize with
replica-side operations.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6c6ce2d928 tablets: Get rid of tablet_map::get_shard()
Its semantics do not fit well with intra-node migration which allow
two owning shards. Replace uses with the new has_replica() API.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
d000ad0325 tablets: Avoid tablet_map::get_shard in cleanup
In preparation for intra-node migration for which get_shard() is not
prepared.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
daaceda963 tablets: test: Use sharder instead of tablet_map::get_shard()
tablet_map::get_shard() will go away as it is not prepared for
intra-node migration.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
d47dfceb34 tablets: tablet_sharder: Allow working with non-local host
Will be used in tests.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6946ad2a45 sharding: Prepare for intra-node-migration
Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
b5bb46357b docs: Document sharder use for tablets 2024-05-16 00:28:46 +02:00
Tomasz Grabiec
82b34d34d8 tablets: Introduce tablet transition kind for intra-node migration
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.

The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
942ea39bf0 tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
balance_tablets() is invoked in a loop, so only the first call will
see non-empty skiplist.

This bug starts to manifest after adding intra-node migration plan,
causing failures of the test_load_balancing_with_skiplist test
case. The reason is that rebalancing will now require multiple passes
before convergence is reached, due to intra-node migrations, and later
calls will not see the skiplist and try to balance skipped nodes,
vioating test's assertions.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
4d84451cf1 sstables, gdb: Track readers in a linked list
For the purpose of scylla-gdb.py command "scylla
active-sstables". Before the patch, readers were located by scanning
the heap for live objects with vtable pointers corresponding to
readers. It was observed that the test scylla_gdb/test_misc.py::test_active_sstables started failing like this:

  gdb.error: Error occurred in Python: Cannot access memory at address 0x300000000000000

This could be explained by there being a live object on the heap which
used to be a reader but now is a different object, and the _sst field
contains some other data which is not a pointer.

To fix, track readers explicitly in a linked list so that the gdb
script can reliably walk readers.

Fixes #18618.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
fad6c41cee raft topology: Fix global token metadata barrier to not fence ahead of what is drained
Topology version may be updated, for example, by executing a RESTful
API call to move a tablet. If that is done concurrently with an
ongoing token metadata barrier executed by topology coordinator
(because there is active tablet migration, for example), then some
requests may fail due to being fenced out unnecessarily.

The problem is that barrier function assumes no concurrent topology
updates so it sets the fence version to the one which is current after
other nodes are drained. This patch changes it to set the fence to the
version which was current before other nodes were drained. Semantics
of the barrier are preserved because it only guarantees that topology
state from before the invocation of barrier is propagated.

Fixes #18699
2024-05-16 00:28:46 +02:00
Benny Halevy
3c4c81c2d9 utils: chunked_vector: optimize for trivially_copyable types
Use std::uninitialized_{copy,move} and std::destroy
that have optimizations for trivially copyable and
trivially moveable types.
In those cases, memory can be copied onto the uninitialized
memory, rather than invoking the respective copy/move constructors,
one item at a time.

perf-simple-query results:
```
base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 97530.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42331 insns/op,        0 errors)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#18609
2024-05-15 22:32:45 +03:00
Raphael S. Carvalho
012ba25b5b service: fix indentation in dispatch()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-15 16:30:06 -03:00
Raphael S. Carvalho
0a9e073154 service: fix reactor stall with large tablet count
with a large tablet count, e.g. 128k, forward_service::dispatch() can
potentially stall when grouping ranges per endpoint.

Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-15 16:30:06 -03:00
Raphael S. Carvalho
f7659b357c service: avoid potential expensive copies in forward_service::dispatch()
each partition_range_vector might grow to ~9600 elements, assuming
96-shard nodes, each with 100 tablets.

~9600 elements, where each is 120 bytes (sizeof(partition_range))
can result in vector with capacity of ~2M due to growth factor of
2.

we're copying each range 3x in dispatch(), and we can easily avoid
it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-15 16:30:06 -03:00
Raphael S. Carvalho
f9d2b9a83b service: coroutinize forward_service::dispatch()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-15 16:30:06 -03:00
Pavel Emelyanov
16db2f650e functions: Do not crash when schema is missing
Getting token() function first tries to find a schema for underlying
table and continues with nullptr if there's no one. Later, when creating
token_fct, the schema is passed as is and referenced. If it's null crash
happens.

It used to throw before 5983e9e7b2 (cql3: test_assignment: pass optional
schema everywhere) on missing schema, but this commit changed the way
schema is looked up, so nullptr is now possible.

fixes: #18637

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18639
2024-05-15 17:20:40 +03:00
Pavel Emelyanov
d267fbd894 repair: Get topology via replication map
When row_level_repair is constructed it sorts provided list of enpoints.
For that it needs to get topology from somewhere and it goes the
database->token_metadata->topology chain. Patch this palce to get
topology from erm instead. It's consistent with how other code from
row_level_repair gets it and removes one more place that uses database
to token metadata "provider".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-15 17:07:45 +03:00
Pavel Emelyanov
2706f27cd9 repair: Use repair_service::my_address() in handlers
Some handlers want to print local node address in logs. Now the
repair_service has a method to get one, so those places can stop getting
it via database->token_metadata dependency chain.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-15 17:07:45 +03:00
Pavel Emelyanov
7fb405ba65 repair: Remove repair_meta::_myip
In favor of recently introduced my_address() one.

One nice side effect of this change is minus one place that gets token
metadata from database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-15 17:07:45 +03:00
Pavel Emelyanov
017f650955 repair: Use repair_meta::myip() everywhere
The method returns _myip and some places in this class use _myip
directly. Next patch is going to remove _myip, so prepare for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-15 17:07:45 +03:00
Pavel Emelyanov
6899bf83ec repair: Add repair_service::my_address() method
To be used in next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-15 17:07:45 +03:00
Avi Kivity
a5fea84d82 Merge 'scylla-nodetool: add tablet support for ring command' from Botond Dénes
Currently, invoking `nodetool ring` on a tablet keyspace fails with an error, because it doesn't pass the required table parameter to `/storage_service/ownership/{keyspace}`. Further to this, the command will currently always output the vnode ring, regardless of the keyspace and table parameter. This series fixes this, adding tablet support to `/storage_service/tokens_endpoint`, which will now return the tablet ring (tablet token -> tablet primary replica mapping) if the new keyspace and table parameters are provided.
`nodetool status` also gets a touch-up, to provide the tablet ring's token count (the tablet count) when invoked with a tablet keyspace and table.

Fixes: #17889
Fixes: #18474

- [x] ** native-nodetool is new functionality, no backport is needed **

Closes scylladb/scylladb#18608

* github.com:scylladb/scylladb:
  test/nodetool: make test pass with cassandra nodetool
  tools/scylla-nodetool: status: fix token count for tablets
  tools/scylla-nodetool: add tablet support to ring command
  api/storage_service: add tablet support for /storage_service/tokens_endpoint
  service/storage_service: introduce get_tablet_to_endpoint_map()
  locator/tablets: introduce the primary replica concept
2024-05-15 16:05:10 +03:00
Artsiom Mishuta
d659d9338b test/pylib: Introduce ManagerClient.test_finished_event
introduce ManagerClient.test_finished_event
to block access to REST client object from the test if
ManagerClient.after_test method was called
(test teardown started)
2024-05-15 11:33:45 +02:00
Botond Dénes
7b41bb601c Merge 'Simplify access to topology::my_address()' from Pavel Emelyanov
Recent commit 12f160045b (Get rid of fb_utilities) replaced the usage of global fb_utilities and made all services use topology::my_address() in order to get local node broadcast address. Some places resulted in long dependency chains dereferences. to get to topology This PR fixes some of them.

Closes scylladb/scylladb#18672

* github.com:scylladb/scylladb:
  service_level_controller_test: Use topology::is_me() helper
  service_level_controller: Add dependency on shared_token_metadata
  tracing: Get my_address() via proxy
  storage_proxy: Get token metadata via local member, not database
2024-05-15 11:23:16 +03:00
Wojciech Mitros
5154429713 mv gossip: check errno instead of value returned by strtoull
Currently, when a view update backlog is changed and sent
using gossip, we check whether the strtoll/strtoull
function used for reading the backlog returned
LLONG_MAX/ULLONG_MAX, signaling an error of a value
exceeding the type's limit, and if so, we do not store
it as the new value for the node.

However, the ULLONG_MAX value can also be used as the max
backlog size when sending empty backlogs that were never
updated. In theory, we could avoid sending the default
backlog because each node has its real backlog (based on
the node's memory, different than the ULLONG_MAX used in
the default backlog). In practice, if the node's
backlog changed to 0, the backlog sent by it will be
likely the default backlog, because when selecting
the biggest backlog across node's shards, we use the
operator<=>(), which treats the default backlog as
equal to an empty backlog and we may get the default
backlog during comparison if the backlog of some shard
was never changed (also it's the initial max value
we compare shard's backlogs against).

This patch removes the (U)LLONG_MAX check and replaces
it with the errno check, which is also set to ERANGE during
the strtoll error, and which won't prevent empty backlogs
from being read

Fixes: #18462

Closes scylladb/scylladb#18560
2024-05-15 07:14:36 +02:00
Pavel Emelyanov
59aec1f300 database: Don't break namespace withexternal alias
The namespace replica is broken in the middle with sstable_list alias,
while the latter can be declared earlier

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18664
2024-05-14 16:45:20 +03:00
Piotr Dulikowski
9ab57b12bb Merge 'cql/describe: hide cdc log tables' from Michał Jadwiszczak
Currently all tables are printed in statements like `DESC TABLES`, `DESC KEYSPACE ks` or `DESC SCHEMA`.
But when we create a table with cdc enabled, additional table with `_scylla_cdc_log` suffix is created.
Those tables shouldn't be recreated manually but created automatically when the base table is created.

This patch hides tables with `_scylla_cdc_log` suffix in all describe statements.
To preserve properties values of those tables, `ALTER TABLE` statement with all properties and their current values for log cdc table is added to description of the base table.

Fixes #18459

Closes scylladb/scylladb#18467

* github.com:scylladb/scylladb:
  test/cql-pytest/test_describe: add test for hiding cdc tables
  cql3/statements/describe_statement: hide cdc tables
  schema: add a method to generate ALTER statement with all properties
  schema: extract schema's properties generation
2024-05-14 15:02:29 +02:00
Pavel Emelyanov
a30337e719 service_level_controller_test: Use topology::is_me() helper
The on_leave_cluster() callback needs to check if the leaving node is
the local one. It currently compares endpoint with the my_address()
obtained via pretty long dependency chain of

  auth_service->query_processor->storage_proxy->database->token_metadata

This patch makes the whole thing _much_ shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:47:12 +03:00
Pavel Emelyanov
634c066c43 service_level_controller: Add dependency on shared_token_metadata
The controller needs to access topology, so it needs the token metadata
at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:43:01 +03:00
Pavel Emelyanov
f9c34f7bd5 tracing: Get my_address() via proxy
The my_address() helper method gets the address via a long
qp->proxy->database->token_metadata->topology chain. That's quite an
overkill, storage_proxy has public my_address() method. The latter also
accesses topology, but without the help of the database. Also this
change makes tracing code a bit shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:41:04 +03:00
Pavel Emelyanov
75d5eb96f2 storage_proxy: Get token metadata via local member, not database
The my_address() method eventually needs to access topology and goes
long way via sharded<database>. No need in that, shared token metadata
is available on proxy itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:40:10 +03:00
Artsiom Mishuta
fb6b572b9e test/topology: make ManagerClient object function scope
move ManagerClient object creation/clear
to functions scope instead of session scope

to prevent test cases affect each other
by stopping sharing connections to cluster between tests
2024-05-14 14:31:10 +02:00
Artsiom Mishuta
efb079ec15 test/pylib: Introduce Manager broken state:
Waiting for all tasks does not guarantee
that test will not spawn new tasks while we wait

Manager broken state prevents all future put requests in case of
1) fail during task waiting
2) Test continue to create tasks in test_after stage
2024-05-14 14:24:03 +02:00
Artsiom Mishuta
a8bab03c15 test/pylib: Wait for tasks from Tasks History:
To ensure the atomicity of tests and recycle clusters without any issues, it is crucial
that all active requests in ScyllaClusterManager are completed before proceeding further.
2024-05-14 14:24:03 +02:00
Artsiom Mishuta
2ee063c90c test/pylib: Introduce Tasks History:
Topology tests might spawn asynchronous tasks in parallel in ScyllaClusterManager.
Tasks history is introduced to be able log and analyze all actions
against cluster in case of failures
2024-05-14 14:24:03 +02:00
Artsiom Mishuta
38125a0049 test/pylib: Introduce Stop Event
indrodce stop event that
interrupt start node on state "wait for node started" if someone wants to stop it
2024-05-14 14:24:03 +02:00
Artsiom Mishuta
4c2527efce test/pylib: Introduce Start-Stop Lock:
The methods stop, stop_gracefully, and start in ScyllaServer
are not designed for parallel execution.
To circumvent issues arising from concurrent calls,
a start_stop_lock has been introduced.
This lock ensures that these methods are executed sequentially.
2024-05-14 14:24:03 +02:00
Botond Dénes
a15a9c3e8d Merge 'utils: chunked_vector: fill ctor: make exception safe' from Benny Halevy
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

- [x] Although the fixes is for a rare bug, it has very low risk and so it's worth backporting to all live versions

Closes scylladb/scylladb#18636

* github.com:scylladb/scylladb:
  chunked_vector_test: add more exception safety tests
  chunked_vector_test: exception_safe_class: count also moved objects
  utils: chunked_vector: fill ctor: make exception safe
2024-05-14 13:35:02 +03:00
Botond Dénes
78afb3644c test/boost/mutation_fragment_test.cc: add test for validator validation levels
To make sure that the validator doesn't validate what the validation
level doesn't include.
2024-05-14 06:03:20 -04:00
Botond Dénes
e7b07692b6 mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.
2024-05-14 06:02:10 -04:00
Botond Dénes
f6511ca1b0 mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.
2024-05-14 05:59:40 -04:00
Piotr Dulikowski
448f651049 Merge 'hinted handoff: Prevent segmentation fault when initializing endpoint managers ' from Dawid Mędrek
We don't attempt to create an endpoint manager for a hint directory if there is no mapping host ID–IP corresponding to the directory's name, an IP address. That prevents a segmentation fault.

Fixes scylladb/scylladb#18649

Closes scylladb/scylladb#18650

* github.com:scylladb/scylladb:
  db/hints: Remove an unused header
  db/hints: Remove migrating flag before initializing endpoint managers
  db/hints: Prevent segmentation fault when initializing endpoint managers
2024-05-14 07:34:16 +02:00
Amnon Heiman
0c84692c97 replica/table.cc: Add metrics per-table-per-node
This patch adds metrics that will be reported per-table per-node.
The added metrics (that are part of the per-table per-shard metrics)
are:
scylla_column_family_cache_hit_rate
scylla_column_family_read_latency
scylla_column_family_write_latency
scylla_column_family_live_disk_space

Fixes #18642

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#18645
2024-05-14 07:54:34 +03:00
Raphael S. Carvalho
0b2ec3063c sstables: Fix incremental_reader_selector (for range reads) with tablets
incremental_reader_selector is the mechanism for incremental comsumption
of disjoint sstables on range reads.

tablet_sstable_set was implemented, such that selector is efficient with
tablets.

The problem is selector is vnode addicted and will only consider a given
set exhausted when maximum token is reached.

With tablets, that means a range read on first tablet of a given shard
will also consume other tablets living in the same shard. That results
in combined reader having to work with empty sstable readers of tablets
that don't intersect with the range of the read. It won't cause extra
I/O because the underlying sstables don't intersect with the range of
the read. It's only unnecessary CPU work, as it involves creating
readers (= allocation), feeding them into combined reader, which will
in turn invoke the sstable readers only to realize they don't have any
data for that range.

With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables
per tablet, there will be this amount of readers (empty or not):
  (100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billions.

~5000 times more readers, it can be quite significant additional cpu
work, even though I/O dominates the most in scans. It's an inefficiency
that we rather get rid of.

The behavior can be observed from logs (there's 1 sstable for each of
4 tablets, but note how readers are created for every single one of
them when reading only 1 tablet range):
```
table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}]
```

Fix is about making sure the tablet set won't select past the
supplied range of the read.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18556
2024-05-14 07:43:22 +03:00
Wojciech Mitros
485eb7a64c test: add test for bad_allocs during large mv queries
This patch adds a test for reproducing issue #12379, which is
being fixed in #16819.
The test case works by creating a table with a materialized
view, and then performing a partition delete query on it.
At the same time, it uses injections to limit the memory
to a level lower than usual, in order to increase the
consistency of the test, and to limit its runtime.
Before #16819, the test would exceed the limit and fail,
and now the next allocation is throttled using a sleep.
2024-05-13 18:16:39 +02:00
Jan Ciolek
e0442d7bfa mv: throttle view update generation for large queries
For every mutation applied to the base table we have to
generate the corresponding materialized view table updates.

In case of simple requests, like INSERT or UPDATE, the number
of view updates generated per base table mutation is limited
to at most a few view table updates per base table update.

The situation is different for DELETE queries, which can delete
the whole partitions or clustering ranges. Range deletions are
fast on the base table, but for the view table the situation
is different. Deleting a single partition in the base table
will generate as many singular view updates as there are rows
in the deleted partition, which could potentially be in the millions.

To prevent OOM view updates are generated in batches of at most 100 rows.
There is a loop which generates the next batch of updates, spawns tasks
to send them to remote nodes, generates another batch and so on.

The problem is that there is no concurrency control - each batch is scheduled
to be sent in the background, but the following batch is generated without
waiting for the previously generated updates to be sent. This can lead to
unbounded concurrency and OOM.

To protect against this view update generation should be limited somehow.

There is an existing mechanism for limiting view updates - throttling.
We keep track of how many pending view updates there are, in the view backlog,
and delay responses to the client based on this backlog's fullness.
For a well behaved client with limited concurrency this will slow down
the amount of incoming requests until it reaches an optimal point.

This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything
for range DELETEs. A DELETE is a single request that generates millions of view
updates, delaying client response doesn't help.

The throttling mechanism could be extend to cover this case - we could treat the
DELETE request like any other client and force it to wait before sending more updates.

This commit implements this approach - before sending the next batch of updates
the generator is forced to sleep for a bit of time, calculated using the exisiting
throttling equation.
The more full the backlog gets the more the generator will have to sleep for,
and hopefully this will prevent overloading the system with view updates.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-13 18:16:23 +02:00
Jan Ciolek
cd62697605 exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
The `request_timeout_exception` is thrown when a client request can't be completed in time.
Previously this class included some fields specific to read/write timeouts:
```
db::consistency_level consistency;
int32_t received;
int32_t block_for;
```

The problem is that a request can timeout for reasons other than read/write timeout,
for example the request might timeout due to materialized view update generation taking
too long.

In such cases of non read/write timeouts we would like to be able use request_timeout_exception,
but it contains fields that aren't releveant in these cases.

To deal with this let's create read_write_timeout_exception, which inherits
from request_timeout_exception. read_write_timout_exception will contain all
of these fields that are specific to read/write timeouts. request_timeout_exception
will become the base class that doesn't have any fields, the other case-specific
exceptions will derive from it and add the desired fields.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-13 18:16:09 +02:00
Jan Ciolek
ae28b8bdb7 db/view: extract view throttling delay calculation to a global function
In order to prevent overload caused by too many view updates,
their number is limited by delaying client responses.
The amount of time to delay for is calculated based on the
fullness of the view update backlog.

Currently this is done in the function calculate_delay,
used by abstract_write_response_handler.

In the following commits I will introduce another throttling
mechanism that uses the same equation to calculate wait time,
so it would be good to reuse the exsiting function.

Let's make the function globally accessible.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-13 18:14:56 +02:00
Pavel Emelyanov
bb1696910c Merge 'scylla-nodetool: make documentation links product and version dependant' from Botond Dénes
Currently, all documentation links that feature anywhere in the help output of scylla-nodetool, are hard-coded to point to the documentation of the latest stable release. As our documentation is version and product (open-source or enterprise) specific, this is not correct. This PR addresses this, by generating documentation links such that they point to the documentation appropriate for the product and version of the scylladb release.

Fixes: https://github.com/scylladb/scylladb/issues/18276

- [x] the native nodetool is a new feature, no backport needed

Closes scylladb/scylladb#18476

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: make doc link version-specific
  release: introduce doc_link()
  build: pass scylla product to release.cc
2024-05-13 18:03:45 +03:00
Botond Dénes
d82a31f15f service/storage_proxy: add useful version of base write throttle metrics
There are two metrics to help observe base-write throttling:
* current_throttled_base_writes
* last_mv_flow_control_delay

Both show a snapshot of what is happening right at the time of querying
these metrincs. This doesn't work well when one wants to investigate the
role throttling is playing in occasional write timeouts.s Prometheus
scrapes metrics in multi-second intervals, and the probability of that
instant catching the throttling at play is very small (almost zero).
Add two new metrics:
* throttled_base_writes_total
* mv_flow_control_delay_total

These accumulate all values, allowing graphana to derive the values and
extract information about throttle events that happened in the past
(but not necessarily at the instant of the scrape).
Note that dividing the two values, will yield the average delay for a
throttle, which is also useful.

Closes scylladb/scylladb#18435
2024-05-13 18:02:06 +03:00
Dawid Medrek
ef8f14d44b db/hints: Remove an unused header 2024-05-13 16:40:47 +02:00
Dawid Medrek
c9bbb92b1a db/hints: Remove migrating flag before initializing endpoint managers
Before these changes, if initializing endpoint
managers after the migration of hinted handoff
to host ID is done throws an exception, we
don't remove the flag indicating the migration
is still in progress. However, the migration
has, in practice, finished -- all of the
hint directories have been mapped to host IDs
and all of the nodes in the cluster are
host-ID-based. Because of that, it makes sense
to remove the flag early on.
2024-05-13 16:40:47 +02:00
Dawid Medrek
bdcde0c210 db/hints: Prevent segmentation fault when initializing endpoint managers
If hinted handoff is still IP-based and there is
a hint directory representing an IP without
a corresponding mapping to a host ID in
`locator::token_metadata`, an attemp to initialize
its endpoint manager will result in a segmentation
fault. This commit prevents that.
2024-05-13 16:40:47 +02:00
Benny Halevy
4bbb66f805 chunked_vector_test: add more exception safety tests
For insertion, with and without reservation,
and for fill and copy constructors.

Reproduces https://github.com/scylladb/scylladb/issues/18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-13 17:18:38 +03:00
Benny Halevy
88b3173d03 chunked_vector_test: exception_safe_class: count also moved objects
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.

However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-13 17:18:38 +03:00
Benny Halevy
64c51cf32c utils: chunked_vector: fill ctor: make exception safe
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-13 17:18:38 +03:00
Michał Jadwiszczak
3e5c34831c test/cql-pytest/test_describe: add test for hiding cdc tables 2024-05-13 16:14:11 +02:00
Michał Jadwiszczak
f12edbdd95 cql3/statements/describe_statement: hide cdc tables
Tables with `_scylla_cdc_log` suffix are internal tables used by cdc.

We want to hide those tables in all describe statements, as they
shouldn't be created by user but created by Scylla when user creates a
table with cdc enabled.

Instead, we include `ALTER TABLE <cdc log table> WITH <all table properties>`
to the description of cdc base table, so all changes to cdc log table's
properties are preserved in backup.
2024-05-13 16:11:13 +02:00
Michał Jadwiszczak
05a51c9286 schema: add a method to generate ALTER statement with all properties
In the describe statement, we need to generate `ALTER TABLE` statement
with all schema's properties for some tables (cdc log tables).

The method prints valid CQL statement with current values of
the properties.
2024-05-13 16:11:06 +02:00
Michał Jadwiszczak
b62f7a1dd3 schema: extract schema's properties generation
In a later commit, we want to add a method to create
`ALTER TABLE ... WITH` statement including all schema's
properties with current values.
2024-05-13 14:52:32 +02:00
Asias He
952dfc6157 repair: Introduce repair_partition_count_estimation_ratio config option
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard coded
estimation ratio is used.

This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.

It is set to 0.1 by default.

Fixes #18615

Closes scylladb/scylladb#18634
2024-05-13 15:16:55 +03:00
Botond Dénes
afa870a387 Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho
Closes scylladb/scylladb#18616

* github.com:scylladb/scylladb:
  replica: Make it explicit table's sstable set is immutable
  replica: avoid reallocations in tablet_sstable_set
  replica: Avoid compound set if only one sstable set is filled
2024-05-13 14:17:24 +03:00
Botond Dénes
a77796f484 test/nodetool: make test pass with cassandra nodetool
After the recent fixes 4 tests started failing with the java nodetool
implementation. We are about to ditch the java implementation, but until
we actually do, it is valuable to keep the tests passing with both the
native and java implementation.
So in this patch, these tests are fixed to pass with the java
implementation too.
There is one test, test_help.py, which fails only if run together with
all the tests. I couldn't confirm this 100%, but it seems like this is
due to JMX sending a rouge request on some timer, which happens to hit
this test. I don't think this is worth trying to fix.
2024-05-13 07:09:20 -04:00
Botond Dénes
bec4c17db4 tools/scylla-nodetool: status: fix token count for tablets
Currently, the token count column is always based on the vnodes, which
makes no sense for tablet keyspaces. If a tablet keyspace is provided as
the keyspace argument, don't print the vnode token count. If the user
provided a table argument as well, print the tablet count, otherwise
print "?".
2024-05-13 07:09:20 -04:00
Botond Dénes
e82455beab tools/scylla-nodetool: add tablet support to ring command
Add a table parameter. Pass both keyspace and table (when provided) to
the /storage_service/tokens_endpoint API endpoint, so that the returned
(and printed) token ring is that of the table's tablets, not the vnode
ring.
Also pass the table param to the ownership API, which will complain if
this param is missing for a tablet keyspace.
2024-05-13 07:09:20 -04:00
Botond Dénes
fd25bb6f9f api/storage_service: add tablet support for /storage_service/tokens_endpoint
Add a keyspace and cf parameter. When specified, the endpoint will
return token -> primary replica mapping for the table's tablet tokens,
not the vnodes.
2024-05-13 07:09:20 -04:00
Botond Dénes
8690dbf8ad service/storage_service: introduce get_tablet_to_endpoint_map()
The tablet variant of the existing get_token_to_endpoint_map(), which
returns a list of tablet tokens and the primary replica for each.
2024-05-13 06:57:13 -04:00
Pavel Emelyanov
2ce643d06b table: Directly compare std::optional<shard_id> with shard_id
There's a loop that calculates the number of shard matches over a tablet
map. The check of the given shard against optional<shard> can be made
shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18592
2024-05-13 13:25:05 +03:00
Andrei Chekun
76a766cab0 Migrate alternator tests to PythonTestSuite
As part of the unification process, alternator tests are migrated to the PythonTestSuite instead of using the RunTestSuite. The main idea is to have one suite, so there will be easier to maintain and introduce new features.
Introduce the prepare_sql option for suite.yaml to add possibility to run cql statements as precondition for the test suite.
Related: https://github.com/scylladb/scylladb/issues/18188

Closes scylladb/scylladb#18442
2024-05-13 13:23:29 +03:00
Avi Kivity
51d09e6a2a cql3: castas_fcts: do not rely on boost casting large multiprecision integers to floats behavior
In [1] a bug casting large multiprecision integers to floats is documented (note that it
received two fixes, the most recent and relevant is [2]). Even with the fix, boost now
returns NaN instead of ±∞ as it did before [3].

Since we cannot rely on boost, detect the conditions that trigger the bug and return
the expected result.

The unit test is extended to cover large negative numbers.

Boost version behavior:
 - 1.78 - returns ±∞
 - 1.79 - terminates
 - 1.79 + fix - returns NaN

Fixes https://github.com/scylladb/scylladb/issues/18508

[1] https://github.com/boostorg/multiprecision/issues/553
[2] ea786494db
[3] https://github.com/boostorg/math/issues/1132

Closes scylladb/scylladb#18532
2024-05-13 13:18:28 +03:00
Yaniv Michael Kaul
4639ca1bf5 compaction_strategy.cc: typo -> "performanceimproves" -> "performance improves"
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#18629
2024-05-13 08:43:38 +03:00
Patryk Wrobel
ec820e214c scylla_io_setup: ensure correct RLIMIT_NOFILE for iotune
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with large number of cores.

In such case iotune reports failure due to
unability to create files or to set up seastar
framework.

This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.

The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#18546
2024-05-13 08:35:52 +03:00
Botond Dénes
32a0867b38 locator/tablets: introduce the primary replica concept
The primary replica is an arbitrary replica of the tablet's, which is
considered to tbe the "main" owner of the tablet, similar to how
replicas own tokens in the vnode world.
To avoid aliasing the primary replicas with a certain DC or rack,
primary replicas are rotated among the tablet's replicas, selecting
tablet_id % replica_count as the primary replica.
2024-05-13 01:35:05 -04:00
Avi Kivity
cc8b4e0630 batchlog_manager, test: initialize delay configuration
In b4e66ddf1d (4.0) we added a new batchlog_manager configuration
named delay, but forgot to initialize it in cql_test_env. This somehow
worked, but doesn't with clang 18.

Fix it by initializing to 0 (there isn't a good reason to delay it).
Also provide a default to make it safer.

Closes scylladb/scylladb#18572
2024-05-13 07:57:35 +03:00
Israel Fruchter
a1a6bd6798 Update tools/cqlsh submodule to v6.0.18
* tools/cqlsh e5f5eafd...c8158555 (11):
  > cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
  > cqlshlib/sslhandling.py: don't use empty userkey/usercert
  > Dockerfile: noninteractive isn't enough for answering yet on apt-get
  > fix cqlsh version print
  > cqlshlib/sslhandling: change `check_hostname` deafult to False
  > Introduce new ssl configuration for disableing check_hostname
  > set the hostname in ssl_options.server_hostname when SSL is used
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73
  > github actions: update `cibuildwheel==v2.16.5`

Fixes: scylladb/scylladb#18590

Closes scylladb/scylladb#18591
2024-05-13 07:25:10 +03:00
Yaron Kaikov
3eb81915c1 docker: drop jmx and tools-java from installation
Following the work done in dd0779675f,
removing the scylla-jmx and scylla-tools-java from our docker image

Closes scylladb/scylladb#18566
2024-05-13 07:24:23 +03:00
Takuya ASADA
9538af0d95 scylla_kernel_check: fix block device size error on latest mkfs.xfs
On latest mkfs.xfs, it does not allow to format a block device which is
smaller than 300MB.
There are options to ignore this validation but it is unsupported
feature, so it is better to increase the loopback image size to
"supported size" == 300MB.

reference: https://lore.kernel.org/all/164738662491.3191861.15611882856331908607.stgit@magnolia/

Fixes #18568

Closes scylladb/scylladb#18620
2024-05-13 07:23:29 +03:00
Avi Kivity
c8cc47df2d Merge 'replica: allocate storage groups dynamically' from Aleksandra Martyniuk
Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
  shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is moved out of this shard.

Output of `./build/release/scylla perf-simple-query -c 1 --random-seed=2248493992` before change:
```
random-seed=2248493992
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64933.90 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42163 insns/op,        0 errors)
65865.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42155 insns/op,        0 errors)
66649.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42176 insns/op,        0 errors)
67029.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42176 insns/op,        0 errors)
68361.21 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42166 insns/op,        0 errors)

median 66649.36 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42176 insns/op,        0 errors)
median absolute deviation: 784.00
maximum: 68361.21
minimum: 64933.90
```

Output of `./build/release/scylla perf-simple-query -c 1 --random-seed=2248493992` after change:
```
random-seed=2248493992
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
63744.12 tps ( 63.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42153 insns/op,        0 errors)
66613.16 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42153 insns/op,        0 errors)
69667.39 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42184 insns/op,        0 errors)
67824.78 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42180 insns/op,        0 errors)
67244.21 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42174 insns/op,        0 errors)

median 67244.21 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   42174 insns/op,        0 errors)
median absolute deviation: 631.05
maximum: 69667.39
minimum: 63744.12
```

Fixes: #16877.

Closes scylladb/scylladb#17664

* github.com:scylladb/scylladb:
  test: add test for back and forth tablets migration
  replica: allocate storage groups dynamically
  replica: refresh snapshot in compaction_group::cleanup
  replica: add rwlock to storage_group_manager
  replica: handle reads of non-existing tablets gracefully
  service: move to cleanup stage if allow_write_both_read_old fails
  replica: replace table::as_table_state
  compaction: pass compaction group id to reshape_compaction_group
  replica: open code get_compaction_group in perform_cleanup_compaction
  replica: drop single_compaction_group_if_available
2024-05-12 21:22:02 +03:00
Nadav Har'El
9813ec9446 Merge 'test: perf: add end-to-end benchmark for alternator' from Marcin Maliszkiewicz
The code is based on similar idea as perf_simple_query. The main differences are:
  - it starts full scylla process
  - communicates with alternator via http (localhost)
  - uses richer table schema with all dynamoDB types instead of only strings

  Testing code runs in the same process as scylla so we can easily get various perf counters (tps, instr, allocation, etc).

  Results on my machine (with 1 vCPU):
  > ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload read --duration 10 2> /dev/null
  ...
  median 23402.59616090321
  median absolute deviation: 598.77
  maximum: 24014.41
  minimum: 19990.34

  > ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write --duration 10 2> /dev/null
  ...
  median 16089.34211320635
  median absolute deviation: 552.65
  maximum: 16915.95
  minimum: 14781.97

  The above seem more realistic than results from perf_simple_query which are 96k and 49k tps (per core).

Related: https://github.com/scylladb/scylladb/issues/12518

Closes scylladb/scylladb#13121

* github.com:scylladb/scylladb:
  test: perf: alternator: add option to skip data pre-population
  perf-alternator-workloads: add operations-per-shard option
  test: perf: add global secondary indexes write workload for alternator
  test: perf: add option to continue after failed request
  test: perf: add read modify write workload for alternator (lwt)
  test: perf: add scan workload for alternator
  test: perf: add end-to-end benchmark for alternator
  test: perf: extract result aggregation logic to a separate struct
2024-05-12 18:15:29 +03:00
Kefu Chai
fd14b6f26b test/nodetool: do not accept 1 return code when passing --help to nodetool
in 906700d5, we accepted 0 as well as the return code of
"nodetool <command> --help", because we needed to be prepared for
the newer seastar submodule while be compatible with the older
seastar versions. now that in 305f1bd3, we bumped up the seastar
module, and this commit picked up the change to return 0 when
handling "--help" command line option in seastar, we are able to
drop the workaround.

so, in this change, we only use "0" as the expected return code.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18627
2024-05-12 14:30:31 +03:00
Avi Kivity
be76527781 Merge 'build: cmake build dist-unified by default and put tarballs under per-config paths' from Kefu Chai
in the same spirit of d57a82c156, this change adds `dist-unified` as one of the default targets. so that it is built by default. the unified package is required to when redistributing the precompiled packages -- we publish the rpm, deb and tar balls to S3.

- [x] cmake related change, no need to backport

Closes scylladb/scylladb#18621

* github.com:scylladb/scylladb:
  build: cmake: use paths to be compatible with CI
  build: cmake build dist-unified by default
2024-05-12 11:16:03 +03:00
Benny Halevy
796ca367d1 gossiper: rename topo_sm member to _topo_sm
Follow scylla convention for class member naming.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#18528
2024-05-12 11:02:35 +03:00
Avi Kivity
2ad13e5d76 auth: complete coroutinization of password_authenticator::create_default_if_missing
password_authenticator::create_default_if_missing() is a confusing mix of
coroutines and continuations, simplify it to a normal coroutine.

Closes scylladb/scylladb#18571
2024-05-11 17:04:20 +03:00
Kefu Chai
1186ddef16 build: cmake: use paths to be compatible with CI
our CI workflow for publishing the packages expects the tar balls
to be located under `build/$buildMode/dist/tar`, where `$buildMode`
is "release" or "debug".

before this change, the CMake building system puts the tar balls
under "build/dist" when the multi-config generator is used. and
`configure.py` uses multi-config generator.

in this change, we put the tar balls for redistribution under
`build/$<CONFIG>/dist/tar`, where `$<CONFIG>` is "RelWithDebInfo"
or "Debug", this works better with the CI workflow -- we just need
to map "release" and "debug" to "RelWithDebInfo" and "Debug" respectively.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-05-11 21:56:50 +08:00
Kefu Chai
0f85255c74 build: cmake build dist-unified by default
in the same spirit of d57a82c156, this change adds `dist-unified`
as one of the default targets. so that it is built by default.
the unified package is required to when redistributing the precompiled
packages -- we publish the rpm, deb and tar balls to S3.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-05-11 18:44:11 +08:00
Raphael S. Carvalho
7faba69f28 replica: Make it explicit table's sstable set is immutable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 11:58:08 -03:00
Raphael S. Carvalho
55c0272b68 replica: avoid reallocations in tablet_sstable_set
reserve upfront wherever possible to avoid reallocations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 10:44:39 -03:00
Raphael S. Carvalho
35a0d47408 replica: Avoid compound set if only one sstable set is filled
Most of the time only main set is filled, so we can avoid one layer
of indirection (= compound set) when maintenance set is empty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk
51fdda4199 test: add test for back and forth tablets migration 2024-05-10 15:08:56 +02:00
Aleksandra Martyniuk
b4371a0ea0 replica: allocate storage groups dynamically
Currently empty storage_groups are allocated for tablets that are
not on this shard.

Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
  shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.

Stop compaction group before it's deallocated.

Add a flag to table::cleanup_tablet deciding whether to deallocate
sgs and use it in commitlog tests.
2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk
6e1e082e8c replica: refresh snapshot in compaction_group::cleanup
During compaction_group::cleanup sstables set is updated, but
row_cache::_underlaying still keeps a shared ptr to the old set.
Due to that descriptors to deleted sstables aren't closed.

Refresh snapshot in order to store new sstables set in _underlying
mutation source.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
c283746b32 replica: add rwlock to storage_group_manager
Add rwlock which prevents storage groups from being added/deleted
while some other layers itereates over them (or their compaction
groups).

Add methods to iterate over storage groups with the lock held.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
54fcb7be53 replica: handle reads of non-existing tablets gracefully
In the following patches, storage groups (and so also sstables sets)
will be allocated only for tablets that are located on this shard.
Some layers may try to read non-existing sstable sets.

Handle this case as if the sstables set was empty instead of calling
on_internal_error.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
561fb1dd09 service: move to cleanup stage if allow_write_both_read_old fails
If allow_write_both_read_old tablet transition stage fails, move
to cleanup_target stage before reverting migration.

It's a preparation for further patches which deallocate storage
group of a tablet during cleanup.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
532653f118 replica: replace table::as_table_state
Replace table::as_table_state with table::try_get_table_state_with_static_sharding
which throws if a table does not use static sharding.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
cf9913b0b7 compaction: pass compaction group id to reshape_compaction_group
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
90d618d8c9 replica: open code get_compaction_group in perform_cleanup_compaction
Open code get_compaction_group in table::perform_cleanup_compaction
as its definition won't be relevant once storage groups are allocated
dynamically.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
8505389963 replica: drop single_compaction_group_if_available
Drop single_compaction_group_if_available as it's unused.
2024-05-10 14:56:38 +02:00
Lakshmi Narayanan Sreethar
d39adf6438 compaction: improve partition estimates for garbage collected sstables
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.

Fixes #18283

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18465
2024-05-10 13:02:34 +03:00
Botond Dénes
3286a6fa14 Merge 'Reload reclaimed bloom filters when memory is available' from Lakshmi Narayanan Sreethar
PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).

The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.

Closes scylladb/scylladb#18186

* github.com:scylladb/scylladb:
  sstable_datafile_test: add testcase to test reclaim during reload
  sstable_datafile_test: add test to verify auto reload of reclaimed components
  sstables_manager: reload previously reclaimed components when memory is available
  sstables_manager: start a fiber to reload components
  sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
  sstable_datafile_test: add test to verify reclaimed components reload
  sstables: support reloading reclaimed components
  sstables_manager: add new intrusive set to track the reclaimed sstables
  sstable: add link and comparator class to support new instrusive set
  sstable: renamed intrusive list link type
  sstable: track memory reclaimed from components per sstable
  sstable: rename local variable in sstable::total_reclaimable_memory_size
2024-05-10 13:01:01 +03:00
Kefu Chai
305f1bd382 Update seastar submodule
* seastar b73e5e7d...42f15a5f (27):
  > prometheus: revert the condition for enabling aggregation
  > tests/unit: add a unit test for json2code
  > seastar-json2code: fix the path param handling
  > github/workflow: do not override <clang++,23,release>
  > github/workflow: add a github workflow for running tests
  > prometheus: support disabling aggregation at query time
  > apps/httpd: free allocated http_server_control
  > rpc: cast rpc::tuple to std::tuple when passing it to std::apply
  > stall-analyser: move `args` into main()
  > stall-analyser: move print_command_line_options() out of Graph
  > stall-analyser: pass branch_threshold via parameter
  > stall-analyser: move process_graph() into Graph class
  > scripts: addr2line: cache the results of resolve_address()
  > stall-analyser: document the parser of log lines
  > stall-analyser: move resolver into main()
  > stall-analyser: extract get_command_line_parser() out
  > stall-analyser: move graph into main()
  > stall-analyser: extract main() out
  > stall-analyser: extract print_command_line_options() out
  > stall-analyser: add more typing annotatins
  > stall-analyser: surround top-level function with two empty lines
  > core/app_template: return status code 0 for --help
  > iotune: Print file alignments too
  > seastar-json2code: extract Parameter class
  > seastar-json2code: use f-string when appropriate
  > seastar-json2code: use nickname in place of oper['nickname']
  > seastar-json2code: use dict.get() when checking allowMultiple

Closes scylladb/scylladb#18598
2024-05-10 12:50:16 +03:00
Patryk Jędrzejczak
a04ea7b997 topology_coordinator: send barrier to a decommissioning node
The code in `global_token_metadata_barrier` allows drain to fail.
Then, it relies on fencing. However, we don't send the barrier
command to a decommissioning node, which may still receive requests.
The node may accept a write with a stale topology version. It makes
fencing ineffective.

Fix this issue by sending the barrier command to a decommissioning
node.

The raft-based topology is moved out of experimental in 6.0, no need
to backport the patch.

Fixes scylladb/scylladb#17108

Closes scylladb/scylladb#18599
2024-05-10 10:53:16 +02:00
Botond Dénes
c35031dda5 Merge 'repair: tablet_repair: make best effort in spite of errors' from Benny Halevy
Currently if any shard repair task fails,
`tablet_repair_task_impl` per-shard loop
breaks, since it doesn't handle the expection.
Although repair does return an error, which
is as expected, we change vnode-based repair
to make a best effort and try to repair
as much as it can, even if any of the ranges
failed.

This causes the `test_repair_with_down_nodes_2b`
dtest to fail with tablets, as seen in, e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/52/testReport/repair_additional_test/TestRepairAdditional/FullDtest___full_split002___test_repair_with_down_nodes_2b/
```
AssertionError: assert 1765 == 2000
```

- [x] ** Backport reason (please explain below if this patch should be backported or not) **
Tablet repair code will be introduced in 6.0, no need to backport to earlier versions.

Closes scylladb/scylladb#18518

* github.com:scylladb/scylladb:
  repair: tablet_repair_task_impl: modernize table lookup
  repair: tablet_repair: make best effort in spite of errors
2024-05-10 10:51:09 +03:00
Piotr Dulikowski
a3070089de main: initialize scheduling group keys before service levels
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.

Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.

Refs: scylladb/seastar#2231

Closes scylladb/scylladb#18581
2024-05-10 10:35:05 +03:00
Kefu Chai
28791aa2c1 build: cmake: link thrift against absl::header
this change is a leftover of 0b0e661a85.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18596
2024-05-09 18:43:23 +03:00
Avi Kivity
37d32a5f8b Merge 'Cleanup inactive reads on tablet migration' from Botond Dénes
When a tablet is migrated away, any inactive read which might be reading from said tablet, has to be dropped. Otherwise these inactive reads can prevent sstables from being removed and these sstables can potentially survive until the tablet is migrated back and resurrect data.
This series introduces the fix as well as a reproducer test.

Fixes: https://github.com/scylladb/scylladb/issues/18110

Closes scylladb/scylladb#18179

* github.com:scylladb/scylladb:
  test: add test for cleaning up cached querier on tablet migration
  querier: allow injecting cache entry ttl by error injector
  replica/table: cleanup_tablet(): clear inactive reads for the tablet
  replica/database: introduce clear_inactive_reads_for_tablet()
  replica/database: introduce foreach_reader_concurrency_semaphore
  reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table()
  reader_concurrency_semaphore: allow storing a range with the inactive reader
  reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()
2024-05-09 17:34:49 +03:00
Lakshmi Narayanan Sreethar
4d22c4b68b sstable_datafile_test: add testcase to test reclaim during reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 19:57:40 +05:30
Pavel Emelyanov
5497bb5a3d loading_shared_values: Replace static-assert with concept
The templatized get_or_load() accepts Loader template parameter and
static-asserts on its signature. Concept is more suitable here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18582
2024-05-09 16:29:49 +03:00
Patryk Jędrzejczak
332bd8ea98 raft: raft_group_registry: start_server_for_group: catch and rethrow abort_requested_exception
If we initiate the shutdown while starting the group 0 server,
we could catch `abort_requested_exception` in `start_server_for_group`
and call `on_internal_error`. Then, Scylla aborts with a coredump.
It causes problems in tests that shut down bootstrapping nodes.

The `abort_requested_exception` can be thrown from
`gossiper::lock_endpoint` called in
`storage_service::topology_state_load`. So, the issue is new and
applies only to the raft-based topology. Hence, there is no need
to backport the patch.

Fixes scylladb/scylladb#17794
Fixes scylladb/scylladb#18197

Closes scylladb/scylladb#18569
2024-05-09 14:55:11 +02:00
Benny Halevy
073680768f repair: tablet_repair_task_impl: modernize table lookup
Currently, the loop that goes over all repair metas
checks for the table's existance using `find_column_family()`.
Although this is correct, it might cause an exception storm
if a table o keyspace are dropped during repair.

This can be avoided by using the more modern interface,
`get_table_if_exists` in the database `tables_metadata`
that returns a `lw_shared_ptr<replica::table>`, exactly
as we need, that has value iff the table still exists
without throwing any exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-09 15:43:00 +03:00
Benny Halevy
c55aa4b121 repair: tablet_repair: make best effort in spite of errors
Currently if any shard repair task fails,
`tablet_repair_task_impl` per-shard loop
breaks, since it doesn't handle the expection.
Although repair does return an error, which
is as expected, we change vnode-based repair
to make a best effort and try to repair
as much as it can, even if any of the ranges
failed.

This causes the `test_repair_with_down_nodes_2b`
dtest to fail with tablets, as seen in, e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/52/testReport/repair_additional_test/TestRepairAdditional/FullDtest___full_split002___test_repair_with_down_nodes_2b/
```
AssertionError: assert 1765 == 2000
```

This change adds a check for the keyspace and table presence
whenever an individual repair task fails, instead of the
global check at the end, so that failures due to dropping
of the keyspace or the table are logged as warnings, but
ignored for the purpose of failing the overall repair status.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-09 15:42:59 +03:00
Lakshmi Narayanan Sreethar
a080daaa94 sstable_datafile_test: add test to verify auto reload of reclaimed components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:49:22 +05:30
Lakshmi Narayanan Sreethar
0b061194a7 sstables_manager: reload previously reclaimed components when memory is available
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:49:22 +05:30
Lakshmi Narayanan Sreethar
f758d7b114 sstables_manager: start a fiber to reload components
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:49:22 +05:30
Lakshmi Narayanan Sreethar
24064064e9 sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.

The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
69b2a127b0 sstable_datafile_test: add test to verify reclaimed components reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
54bb03cff8 sstables: support reloading reclaimed components
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
2340ab63c6 sstables_manager: add new intrusive set to track the reclaimed sstables
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
140d8871e1 sstable: add link and comparator class to support new instrusive set
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
3ef2f79d14 sstable: renamed intrusive list link type
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
02d272fdb3 sstable: track memory reclaimed from components per sstable
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Lakshmi Narayanan Sreethar
a53af1f878 sstable: rename local variable in sstable::total_reclaimable_memory_size
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-09 17:48:58 +05:30
Marcin Maliszkiewicz
a1099791c4 test: perf: alternator: add option to skip data pre-population 2024-05-09 13:59:17 +02:00
Marcin Maliszkiewicz
fd416fac3b perf-alternator-workloads: add operations-per-shard option 2024-05-09 13:59:13 +02:00
Marcin Maliszkiewicz
5b8acf182a test: perf: add global secondary indexes write workload for alternator 2024-05-09 13:59:08 +02:00
Marcin Maliszkiewicz
43a64ac558 test: perf: add option to continue after failed request 2024-05-09 13:59:03 +02:00
Marcin Maliszkiewicz
70b5b5024b test: perf: add read modify write workload for alternator (lwt) 2024-05-09 13:58:58 +02:00
Marcin Maliszkiewicz
5b8e554431 test: perf: add scan workload for alternator 2024-05-09 13:58:54 +02:00
Marcin Maliszkiewicz
55030b1550 test: perf: add end-to-end benchmark for alternator
The code is based on similar idea as perf_simple_query. The main differences are:
- it starts full scylla process
- communicates with alternator via http (localhost)
- uses richer table schema with all dynamoDB types instead of only strings

Testing code runs in the same process as scylla so we can easily get various perf counters (tps, instr, allocation, etc).

Results on my machine (with 1 vCPU):
> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload read --duration 10 2> /dev/null
...
median 23402.59616090321
median absolute deviation: 598.77
maximum: 24014.41
minimum: 19990.34

> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write --duration 10 2> /dev/null
...
median 16089.34211320635
median absolute deviation: 552.65
maximum: 16915.95
minimum: 14781.97

The above seem more realistic than results from perf_simple_query which are 96k and 49k tps (per core).
2024-05-09 13:58:40 +02:00
Marcin Maliszkiewicz
6152223890 test: perf: extract result aggregation logic to a separate struct
It will be reused later by a new tool.
2024-05-09 13:58:29 +02:00
Gleb Natapov
3b40d450e5 gossiper: try to locate an endpoint by the host id when applying state if search by IP fails
Even if there is no endpoint for the given IP the state can still belong to existing endpoint that
was restarted with different IP, so lets try to locate the endpoint by host id as well. Do it in raft
topology mode only to not have impact on gossiper mode.

Also make the test more robust in detecting wrong amount of entries in
the peers table. Today it may miss that there is a wrong entry there
because the map will squash two entries for the same host id into one.

Fixes: scylladb/scylladb#18419
Fixes: scylladb/scylladb#18457
2024-05-09 13:14:54 +02:00
Patrik
b0fbe71eaf Update launch-on-gcp.rst
Closes scylladb/scylladb#18512
2024-05-09 10:12:31 +03:00
Avi Kivity
b7055b5f2f storage_service: don't rely on optional<> formatting for removed node error
std::optional formatting changed while moving from the home-grown formatter to
the fmt provided formatter; don't rely on it for user visible messages.

Here, the optional formatted is known to be engaged, so just print it.

Closes scylladb/scylladb#18534
2024-05-09 10:03:23 +03:00
Kefu Chai
906700d523 test/nodetool: accept -1 returncode also when --help is invoked
in newer seastar, 0 is returned as the returncode of the application
when handling `--help`. to prepare for this behavior, let's
accept it before updating the seastar submodule.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18574
2024-05-09 08:26:44 +03:00
Kefu Chai
6047b3b6aa build: cmake: build async_utils.cc
async_utils.cc was introduced in e1411f39, so let's
update the cmake building system to build it. without
which, we'd run into link failure like:

```
ld.lld: error: undefined symbol: to_mutation_gently(canonical_mutation const&, seastar::lw_shared_ptr<schema const>)
>>> referenced by storage_service.cc
>>>               storage_service.cc.o:(service::storage_service::merge_topology_snapshot(service::raft_snapshot)) in archive service/Dev/libservice.a
>>> referenced by group0_state_machine.cc
>>>               group0_state_machine.cc.o:(service::write_mutations_to_database(service::storage_proxy&, gms::inet_address, std::vector<canonical_mutation, std::allocator<canonical_mutation>>)) inarchive service/Dev/libservice.a
>>> referenced by group0_state_machine.cc
>>>               group0_state_machine.cc.o:(service::write_mutations_to_database(service::storage_proxy&, gms::inet_address, std::vector<canonical_mutation, std::allocator<canonical_mutation>>) (.resume)) in archive service/Dev/libservice.a
>>> referenced 1 more times
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18524
2024-05-09 08:26:44 +03:00
Kefu Chai
c336904722 build: cmake: mark abseil include SYSTEM
this change is a followup of 0b0e661a. it helps to ensure that the header files in
abseil submodule have higher priority when the compiler includes abseil headers
when building with CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18523
2024-05-09 08:26:44 +03:00
Kefu Chai
2a9a874e19 db,service: fix typos in comments
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18567
2024-05-09 08:26:44 +03:00
Anna Stuchlik
65c8b81051 doc: add OS support in version 6.0
This commit adds OS support in version 6.0.
In addition, it removes the information about version 5.2, as this version is no longer supported, according to our policy.

Closes scylladb/scylladb#18562
2024-05-09 08:26:44 +03:00
Anna Stuchlik
74fb9808ed doc: update Consistent Topology with Raft
This PR:
- Removes the `.. only:: opensource` directive from Consistent Topology with Raft.
  This feature is no longer an Open Source-only experimental feature.
- Removes redundant version-specific information.
- Moves the necessary version-specific information to a separate file.

This is a follow-up to 55b011902e.

Refs https://github.com/scylladb/scylladb/pull/18285/

Closes scylladb/scylladb#18553
2024-05-09 08:26:44 +03:00
Calle Wilund
79d56ccaad commitlog: Fix request_controller semaphore accounting.
Fixes #18488

Due to the discrepancy between bytes added to CL and bytes written to disk
(due to CRC sector overhead), we fail to account for the proper byte count
when issuing account_memory_usage in allocate (using bytes added) and in
cycle:s notify_memory_written (disk bytes written).

This leads us to slowly, but surely, add to the semaphore all the time.
Eventually rendering it useless.

Also, terminate call would _not_ take any of this into account,
and the chunk overhead there would cause a (smaller) discrepancy
as well.

Fix by simply ensuring that buffer alloc handles its byte usage,
then accounting based on buffer position, not input byte size.

Closes scylladb/scylladb#18489
2024-05-09 08:26:44 +03:00
Botond Dénes
155332ebf8 Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov
Some time ago #16558 was merged that moved view builder drain into generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found. View builder drain was moved from "before stopping messaging service to "after" it, and view update write handlers in proxy hanged for hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes and kill scylla, then complain about it and fail.

This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.

Closes scylladb/scylladb#18408

* github.com:scylladb/scylladb:
  Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
  view: Abort pending view updates when draining
2024-05-09 08:26:44 +03:00
Aleksandra Martyniuk
67bbaad62e tasks: use default task_ttl in scylla.yaml
Currently default task_ttl_in_seconds is 0, but scylla.yaml changes
the value to 10.

Change task_ttl_in_seconds in scylla.yaml to 0, so that there are
consistent defaults. Comment it out.

Fixes: #16714.

Closes scylladb/scylladb#18495
2024-05-09 08:26:44 +03:00
Botond Dénes
0438febdc9 Merge 'alternator: fix REST API access to an Alternator LSI' from Nadav Har'El
The name of the Scylla table backing an Alternator LSI looks like `basename:!lsiname`. Some REST API clients (including Scylla Manager) when they send a "!" character in the REST API request path may decide to "URL encode" it - convert it to `%21`.

Because of a Seastar bug (https://github.com/scylladb/seastar/issues/725) Scylla's REST API server forgets to do the URL decoding on the path part of the request, which leads to the REST API request failing to address the LSI table.

The first patch in this PR fixes the bug by using a new Seastar API introduced in https://github.com/scylladb/seastar/pull/2125 that does the URL decoding as appropriate. The second patch in the PR is a new test for this bug, which fails without the fix, and passes afterwards.

Fixes #5883.

Closes scylladb/scylladb#18286

* github.com:scylladb/scylladb:
  test/alternator: test addressing LSI using REST API
  REST API: stop using deprecated, buggy, path parameter
2024-05-09 08:26:43 +03:00
Yaniv Michael Kaul
124064844f docs/dev/object_stroage.md: convert example AWS keys to be more innocent
Someone thought that they actually represent real keys (the 'EXAMPLE' in their name was not enough).
Converted them to be as clear as can be, example data.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#18565
2024-05-09 08:26:43 +03:00
Asias He
46269a99d8 repair: Add ranges_parallelism option support for tablet
The ranges_parallelism option is introduced in commit 9b3fd9407b.
Currently, this option works for vnode table repair only.

This patch enables it for tablet repair, since it is useful for
tablet repair too.

Fixes #18383

Closes scylladb/scylladb#18385
2024-05-09 08:26:43 +03:00
Benny Halevy
0156e97560 storage_proxy: cas: reject for tablets-enabled tables
Currently, LWT is not supported with tablets.
In particular the interaction between paxos and tablet
migration is not handled yet.

Therefore, it is better to outright reject LWT queries
for tablets-enabled tables rather than support them
in a flaky way.

This commit also marks tests that depend on LWT
as expeced to fail.

Fixes scylladb/scylladb#18066

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#18103
2024-05-09 08:26:43 +03:00
Patryk Jędrzejczak
053a2893cf raft topology: join_token_ring: prevent shutdown hangs
Shutdown of a bootstrapping node could hang on
`_topology_state_machine.event.when()` in
`wait_for_topology_request_completion`. It caused
scylladb/scylladb#17246 and scylladb/scylladb#17608.

On a normal node, `wait_for_group0_stop` would prevent it, but this
function won't be called before we join group 0. Solve it by adding
a new subscriber to `_abort_source`.

Additionally, trigger `_group0_as` to prevent other hang scenarios.

Note that if both the new subscriber and `wait_for_group0_stop` are
called, nothing will break. `abort_source::request_abort` and
`conditional_variable::broken` can be called multiple times.

The raft-based topology is moved out of experimental in 6.0, no need
to backport the patch.

Fixes scylladb/scylladb#17246
Fixes scylladb/scylladb#17608

Closes scylladb/scylladb#18549
2024-05-09 08:26:43 +03:00
Botond Dénes
96a7ed7efb Merge 'sstables: add dead row count when issuing warning to system.large_partitions' from Ferenc Szili
This is the second half of the fix for issue #13968. The first half is already merged with PR #18346

Scylla issues warnings for partitions containing more rows than a configured threshold. The warning is issued by inserting a row into the `system.large_partitions` table. This row contains the information about the partition for which the warning is issued: keyspace, table, sstable, partition key and size, compaction time and the number of rows in the partition. A previous PR #18346 also added range tombstone count to this row.

This change adds a new counter for dead rows to the large_partitions table.

This change also adds cluster feature protection for writing into these new counters. This is needed in case a cluster is in the process of being upgraded to this new version, after which an upgraded node writes data with the new schema into `system.large_partitions`, and finally a node is then rolled back to an old version. This node will then revert the schema to the old version, but the written sstables will still contain data with the new counters, causing any readers of this table to throw errors when they encounter these cells.

This is an enhancement, and backporting is not needed.

Fixes #13968

Closes scylladb/scylladb#18458

* github.com:scylladb/scylladb:
  sstable: added test for counting dead rows
  sstable: added docs for system.large_partitions.dead_rows
  sstable: added cluster feature for dead rows and range tombstones
  sstable: write dead_rows count to system.large_partitions
  sstable: added counter for dead rows
2024-05-09 08:26:43 +03:00
David Garcia
d63d418ae3 docs: change "create an issue" github label to "type/documentation"
Closes scylladb/scylladb#18550
2024-05-09 08:26:43 +03:00
Kefu Chai
02be1e9309 .github: add clang-tidy workflow
clang-tidy is a tool provided by Clang to perform static analysis on
C++ source files. here, we are mostly intersted in using its
https://clang.llvm.org/extra/clang-tidy/checks/bugprone/use-after-move.html
check to reveal the potential issues.

this workflow is added to run clang-tidy when building the tree, so
that the warnings from clang-tidy can be noticed by developers.

a dedicated action is added so other github workflow can reuse it to
setup the building environment in an ubuntu:jammy runner.

clang-tidy-matcher.json is added to annotate the change, so that the
warnings are more visible with github webpage.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18342
2024-05-09 08:26:43 +03:00
David Garcia
4a1b109641 docs: add swagger ui extension
Renders the API Reference from api/api-doc using Swagger UI 2.2.10.

address comments

Closes scylladb/scylladb#18253
2024-05-09 08:26:43 +03:00
Botond Dénes
c7c4964b1c tools/scylla-nodetool: make doc link version-specific
Generate documentation link, such that they point to the documentation
page, which is appropriate to the current product (open-source or
enterprise) and version. The documentation links are generated by a new
function and the documentation links are injected into the description
of nodetool command via fmt::format().
2024-05-08 09:41:18 -04:00
Botond Dénes
2d1e938849 release: introduce doc_link()
Allows generating documentation links that are appropriate for the
current product (open-source or enterprise) and version.
To be used in the next patch to make scylla-nodetool's documentation
links product and version appropriate.
2024-05-08 09:41:17 -04:00
Botond Dénes
9d2156bd8a build: pass scylla product to release.cc
In the form of -DSCYLLA_PRODUCT. To be used in the next patch.
2024-05-08 09:40:24 -04:00
Kamil Braun
4dcae66380 Merge 'test: {auth,topology}: use manager.rolling_restart' from Piotr Dulikowski
Instead of performing a rolling restart by calling `restart` in a loop over every node in the cluster, use the dedicated
`manager.rolling_restart` function. This method waits until all other nodes see the currently processed node as up or down before proceeding to the next step. Not doing so may lead to surprising behavior.

In particular, in scylladb/scylladb#18369, a test failed shortly after restarting three nodes. Because nodes were restarted one after another too fast, when the third node was restarted it didn't send a notification to the second node because it still didn't know that the second node was alive. This led the second node to notice that the third node restarted by observing that it incremented its generation in gossip (it restarted too fast to be marked as down by the failure detector). In turn, this caused the second node to send "third node down" and "third node up" notifications to the driver in a quick succession, causing it to drop and reestablish all connections to that node. However, this happened _after_ rolling upgrade finished and _after_ the test logic confirmed that all nodes were alive. When the notifications were sent to the driver, the test was executing some statements necessary for the test to pass - as they broke, the test failed.

Fixes: scylladb/scylladb#18369

Closes scylladb/scylladb#18379

* github.com:scylladb/scylladb:
  test: get rid of server-side server_restart
  test: util: get rid of the `restart` helper
  test: {auth,topology}: use manager.rolling_restart
2024-05-08 09:45:08 +02:00
Piotr Dulikowski
180cb7a2b9 storage_service: notify lifecycle subs only after token metadata update
Currently, in raft mode, when raft topology is reloaded from disk or a
notification is received from gossip about an endpoint change, token
metadata is updated accordingly. While updating token metadata we detect
whether some nodes are joining or are leaving and we notify endpoint
lifecycle subscribers if such an event occurs. These notifications are
fired _before_ we finish updating token metadata and before the updated
version is globally available.

This behavior, for "node leaving" notifications specifically, was not
present in legacy topology mode. Hinted handoff depends on token
metadata being updated before it is notified about a leaving node (we
had a similar issue before: scylladb/scylladb#5087, and we fixed it by
enforcing this property). Because this is not true right now for raft
mode, this causes the hint draining logic not to work properly - when a
node leaves the cluster, there should be an attempt to send out hints
for that node, but instead hints are not sent out and are kept on disk.

In order to fix the issue with hints, postpone notifying endpoint
lifecycle subscribers about joined and left nodes only after the final
token metadata is computed and replicated to all shards.

Fixes: scylladb/scylladb#17023

Closes scylladb/scylladb#18377
2024-05-08 09:40:44 +02:00
Kamil Braun
03818c4aa9 direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.

With the change, the phenomenon no longer appears.

Closes scylladb/scylladb#18443
2024-05-07 23:40:23 +02:00
Anna Stuchlik
98367cb6a1 doc: Snitch switch is not supported with tablets
This commit adds the tablets-related limitation:
if you use tablets, then changing snitch is not supported

Refs:https://github.com/scylladb/scylladb/issues/17513
See: https://github.com/scylladb/scylladb/issues/17513#issuecomment-2022552677

Closes scylladb/scylladb#18548
2024-05-07 17:26:05 +02:00
Pavel Emelyanov
677e80a4d5 table: Coroutinize table::delete_sstables_atomically()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18499
2024-05-07 17:10:28 +02:00
Kamil Braun
53443f566a Merge 'Coroutinize generic_server's listen() method' from Pavel Emelyanov
It needs some local naming cleanup, but otherwise it's pretty simple

Closes scylladb/scylladb#18510

* github.com:scylladb/scylladb:
  generic_server: Fix indentation after previous patch
  generic_server: Coroutinize listen() method
  generic_server: Rename creds argument to builder
2024-05-07 17:08:59 +02:00
Ferenc Szili
60bf846f68 sstable: added test for counting dead rows 2024-05-07 15:44:33 +02:00
Ferenc Szili
8e9771d010 sstable: added docs for system.large_partitions.dead_rows 2024-05-07 15:44:33 +02:00
Avi Kivity
9b8dfb2b19 compaction: compaction_strategy validation: don't rely on optional<> formatting
std::optional formatting changed while moving from the home-grown formatter to
the fmt provided formatter; don't rely on it for user visible messages.

Here, the optional formatted is known to be engaged, so just print it.

Closes scylladb/scylladb#18533
2024-05-07 12:02:33 +03:00
Kefu Chai
7e578ae964 message: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18527
2024-05-07 11:59:36 +03:00
Raphael S. Carvalho
570e3f8df0 compaction: exclude expired sstables from calculation of base timestamps
base timestamps are feeded into the sstable writer for calculating
delta, used by varints. given that expired ssts are bypassed, we
don't have to account them. so if we compacting fully expired and
new sstable together, we can save a bit by having a base ts closer
to the data actually written into output. also I wanted to move
the calculation into the loop in setup(), to avoid two iterations
over input set that can have even more than 1k elements.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18504
2024-05-07 08:43:50 +03:00
Raphael S. Carvalho
2d9142250e Fix flakiness in test_tablet_load_and_stream due to premature gossiper abort on shutdown
Until https://github.com/scylladb/scylladb/issues/15356 is fixed, this
will be handled by explicitly closing the connection, so if scylla fails
to update gossiper state due to premature abort on shutdown, then we
won't be stuck in an endless reconnection attempt (later through
heartbeats (30s interval)), causing the test to timeout.

Manifests in scylla logs like this:
gossip - failure_detector_loop: Got error in the loop, live_nodes={127.147.5.10, 127.147.5.16}: seastar::sleep_aborted (Sleep is aborted)
gossip - failure_detector_loop: Finished main loop
migration_manager - stopping migration service
storage_service - Shutting down native transport server
gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
cql_server_controller - CQL server stopped
...
gossip - My status = NORMAL
gossip - Announcing shutdown
gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
gossip - Sending a GossipShutdown to 127.147.5.10 with generation 1714449924
gossip - Sending a GossipShutdown to 127.147.5.16 with generation 1714449924
gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)

Refs #14746.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18484
2024-05-07 02:31:02 +02:00
Piotr Dulikowski
5459cfed6a Merge 'auth: don't run legacy migrations in auth-v2 mode' from Marcin Maliszkiewicz
We won't run:
- old pre auth-v1 migration code
- code creating auth-v1 tables

We will keep running:
- code creating default rows
- code creating auth-v1 keyspace (needed due to cqlsh legacy hack,
  it errors when executing `list roles` or `list users` if
  there is no system_auth keyspace, it does support case when
  there is no expected tables)

Fixes https://github.com/scylladb/scylladb/issues/17737

Closes scylladb/scylladb#17939

* github.com:scylladb/scylladb:
  auth: don't run legacy migrations on auth-v2 startup
  auth: fix indent in password_authenticator::start
  auth: remove unused service::has_existing_legacy_users func
2024-05-06 19:53:35 +02:00
Wojciech Mitros
8472c46c8a service_level_controller: coroutinize notify_service_level_removed
To avoid conflicts arising from the discrepancy between different
versions of the repository, use coroutines instead of continuations
in service_level_controller::notify_service_level_removed().

Closes scylladb/scylladb#18525
2024-05-06 14:20:49 +03:00
Piotr Dulikowski
92e5018ddb test: get rid of server-side server_restart
Restarting a node amounts to just shutting it down and then starting
again. There is no good reason to have a dedicated endpoint in the
ScyllaClusterManager for restarting when it can be implemented by
calling two endpoints in a sequence: stop and start - it's just code
duplication.

Remove the server_restart endpoint in ScyllaClusterManager and
reimplement it as two endpoint calls in the ManagerClient.
2024-05-06 12:54:53 +02:00
Piotr Dulikowski
8de2bda7ae test: util: get rid of the restart helper
We already have `ManagerClient.server_restart`, which can be used in its
place.
2024-05-06 12:24:40 +02:00
Piotr Dulikowski
897e603bf0 test: {auth,topology}: use manager.rolling_restart
Instead of performing a rolling restart by calling `restart` in a loop
over every node in the cluster, use the dedicated
`manager.rolling_restart` function. This method waits until all other
nodes see the currently processed node as up or down before proceeding
to the next step. Not doing so may lead to surprising behavior.

In particular, in scylladb/scylladb#18369, a test failed shortly after
restarting three nodes. Because nodes were restarted one after another
too fast, when the third node was restarted it didn't send a
notification to the second node because it still didn't know that the
second node was alive. This led the second node to notice that the third
node restarted by observing that it incremented its generation in gossip
(it restarted too fast to be marked as down by the failure detector). In
turn, this caused the second node to send "third node down" and "third
node up" notifications to the driver in a quick succession, causing it
to drop and reestablish all connections to that node. However, this
happened _after_ rolling upgrade finished and _after_ the test logic
confirmed that all nodes were alive. When the notifications were sent to
the driver, the test was executing some statements necessary for the
test to pass - as they broke, the test failed.

Fixes: scylladb/scylladb#18369
2024-05-06 12:24:40 +02:00
Kamil Braun
ccbb9f5343 Merge 'topology_coordinator: clear obsolete generations earlier' from Patryk Jędrzejczak
We want to clear CDC generations that are no longer needed
(because all writes are already using a new generation) so they
don't take space and are not sent during snapshot transfers
(see e.g. https://github.com/scylladb/scylladb/issues/17545).

The condition used previously was that we clear generations which
were closed (i.e., a new generation started at this time) more than
24h ago. This is a safe choice, but too conservative: we could
easily end up with a large number of obsolete generations if we
boot multiple nodes during 24h (which is especially easy to do
with tablets.)

Change this bound from 24h to `5s + ring_delay`. The choice is
explained in a comment in the code.

Additionally, improve `test_raft_snapshot_request` that would
become flaky after the change so it's not sensitive to changes
anymore.

The raft-based topology was experimental before 6.0, no need
to backport.

Ref: scylladb/scylladb#17545

Closes scylladb/scylladb#18497

* github.com:scylladb/scylladb:
  topology_coordinator: clear obsolete generations earlier
  test: test_raft_snapshot_request: improve the last assertion
  test: test_raft_snapshot_request: find raft leader after restart
  test: test_raft_shanpshot_request: simplify appended_command
2024-05-06 12:03:33 +02:00
Kamil Braun
1a50a524e7 Merge 'topology_coordinator: compute cluster size correctly during upgrade' from Piotr Dulikowski
During upgrade to raft topology, information about service levels is copied from the legacy tables in system_distributed to the raft-managed tables of group 0. system_distributed has RF=3, so if the cluster has only one or two nodes we should use lower consistency level than ALL - and the current procedure does exactly that, it selects QUORUM in case of two nodes and ONE in case of only one node. The cluster size is determined based on the call to _gossiper.num_endpoints().

Despite its name, gossiper::num_endpoints() does not necessarily return the number of nodes in the cluster but rather the number of endpoint states in gossiper (this behavior is documented in a comment near the declaration of this function). In some cases, e.g. after gossiper-based nodetool remove, the state might be kept for some time after removal (3 days in this case).

The consequence of this is that gossiper::num_endpoints() might return more than the current number of nodes during upgrade, and that in turn might cause migration of data from one table to another to fail - causing the upgrade procedure to get stuck if there is only 1 or two nodes in the cluster.

In order to fix this, use token_metadata::get_all_endpoints() as a measure of the cluster size.

Fixes: scylladb/scylladb#18198

Closes scylladb/scylladb#18261

* github.com:scylladb/scylladb:
  test: topology: test that upgrade succeeds after recent removal
  topology_coordinator: compute cluster size correctly during upgrade
2024-05-06 11:06:09 +02:00
Piotr Dulikowski
64ba620dc2 Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek
This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.

The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost.

Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278

Closes scylladb/scylladb#15567

* github.com:scylladb/scylladb:
  docs: Update Hinted Handoff documentation
  db/hints: Add endpoint_downtime_not_bigger_than()
  db/hints: Migrate hinted handoff when cluster feature is enabled
  db/hints: Handle arbitrary directories in resource manager
  db/hints: Start using hint_directory_manager
  db/hints: Enforce providing IP in get_ep_manager()
  db/hints: Introduce hint_directory_manager
  db/hints/resource_manager: Update function description
  db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
  db/hints: Expose update lock of space watchdog
  db/hints: Add function for migrating hint directories to host ID
  db/hints: Take both IP and host ID when storing hints
  db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
  db/hints: Migrate to locator::host_id
  db/hints: Remove noexcept in do_send_one_mutation()
  service: Add locator::host_id to on_leave_cluster
  service: Fix indentation
  db/hints: Fix indentation
2024-05-06 09:58:18 +02:00
Patryk Jędrzejczak
628d7e709e cdc: generation: fix retrieve_generation_data_v2
`system_keyspace::read_cdc_generation_opt` queries
`system.cdc_generations_v3`, which stores ids of CDC generations
as timeuuids. This function shouldn't be called with a normal uuid
(used by `system.cdc_generations_v2` to store generation ids).
Such a call would end with a marshaling error.

Before this patch,`retrieve_generation_data_v2` could call
`system_keyspace::read_cdc_generation_opt` with a normal uuid if
the generation wasn't present in `system.cdc_generations_v2`.
This logic caused a marshaling error while handling the
`check_and_repair_cdc_streams` request in the
`cdc_test.TestCdc.test_check_and_repair_cdc_streams_liveness` dtest.

This patch fixes the code being added in 6.0, no need to backport it.

Fixes scylladb/scylladb#18473

Closes scylladb/scylladb#18483
2024-05-06 09:12:47 +02:00
Kamil Braun
16846bf5ce Merge 'Do not serialize removenode operation with api lock if topology over raft is enabled' from Gleb
With topology over raft all operation are already serialized by the
coordinator anyway, so no need to synchronize removenode using api lock.
All others are still synchronized since there cannot be executed in
parallel for the same node anyway.

* 'gleb/17681-fix' of github.com:scylladb/scylla-dev:
  storage_service: do not take API lock for removenode operation if topology coordinator is enabled
  test: return file mark from wait_for that points after the found string
2024-05-06 09:03:03 +02:00
Benny Halevy
ebff5f5d70 everywhere: include seastar headers using angle brackets
seastar is an external library therefore it should
use the system-include syntax.

Closes scylladb/scylladb#18513
2024-05-06 10:00:31 +03:00
Kefu Chai
5ca9a46a91 test/lib: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18515
2024-05-05 23:31:48 +03:00
Kefu Chai
0b0e661a85 build: bring abseil submodule back
because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689,
the rebuilt abseil package provided by fedora has different settings
than the ones if the tree is built with the sanitizer enabled. this
inconsistency leads to a crash.

to address this problem, we have to reinstate the abseil submodule, so
we can built it with the same compiler options with which we build the
tree.

in this change

* Revert "build: drop abseil submodule, replace with distribution abseil"
* update CMake building system with abseil header include settings
* bump up the abseil submodule to the latest LTS branch of abseil:
  lts_2024_01_16
* update scylla-gdb.py to adapt to the new structure of
  flat_hash_map

This reverts commit 8635d24424.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18511
2024-05-05 23:31:09 +03:00
Kefu Chai
ea791919cf service/storage_proxy: drop unused operator<<
operator<<(ostream, paxos_response_handler) is not used anymore,
so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18520
2024-05-05 16:33:29 +03:00
Nadav Har'El
21557cfaa6 cql3: Fix invalid JSON parsing for JSON object with different key types
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:

  * blob
  * date
  * inet
  * time
  * timestamp
  * timeuuid
  * uuid

Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happwn to work in JSON, this didn't fail.

The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.

This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even befoe this patch) work with this patch.

Fixes #18477.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18482
2024-05-05 15:42:43 +03:00
Kefu Chai
f2b1c47dfc test/boost: s/boost::range::random_shuffle/std::ranges::shuffle/
`boost::range::random_shuffle()` uses the deprecated
`std::random_shuffle()` under the hood, so let's use
`std::ranges::shuffle()` which is available since C++20.

this change should address the warning like:

```
[312/753] CXX build/debug/test/boost/counter_test.o                                                                                                                                                                 In file included from test/boost/counter_test.cc:17:
/usr/include/boost/range/algorithm/random_shuffle.hpp:106:13: warning: 'random_shuffle<__gnu_cxx::__normal_iterator<counter_shard *, std::vector<counter_shard>>>' is deprecated: use 'std::shuffle' instead [-Wdepr
ecated-declarations]
  106 |     detail::random_shuffle(boost::begin(rng), boost::end(rng));
      |             ^
test/boost/counter_test.cc:507:27: note: in instantiation of function template specialization 'boost::range::random_shuffle<std::vector<counter_shard>>' requested here
  507 |             boost::range::random_shuffle(shards);
      |                           ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_algo.h:4489:5: note: 'random_shuffle<__gnu_cxx::__normal_iterator<counter_shard *, std::vector<counter_shard>>>' has been explicitly marked
deprecated here
 4489 |     _GLIBCXX14_DEPRECATED_SUGGEST("std::shuffle")
      |     ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/x86_64-redhat-linux/bits/c++config.h:1957:45: note: expanded from macro '_GLIBCXX14_DEPRECATED_SUGGEST'
 1957 | # define _GLIBCXX14_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT)
      |                                             ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/x86_64-redhat-linux/bits/c++config.h:1941:19: note: expanded from macro '_GLIBCXX_DEPRECATED_SUGGEST'
 1941 |   __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
      |                   ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18517
2024-05-05 15:39:57 +03:00
Pavel Emelyanov
99f9807f15 sstables: Remove operator<<(std::ostream&, const deletion_time&)
It's completely unused, likely in favor of recently added formatter
for the type in question.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18502
2024-05-05 14:43:27 +03:00
Pavel Emelyanov
ddd2623418 generic_server: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-03 12:29:08 +03:00
Pavel Emelyanov
a1daa7093e generic_server: Coroutinize listen() method
Straightforward. Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-03 12:28:42 +03:00
Pavel Emelyanov
030f1ef81c generic_server: Rename creds argument to builder
So that it doesn't clash with local creds variable that will appear in
this method after its coroutinization.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-03 12:27:37 +03:00
Kefu Chai
53b98a8610 test: string_format_test: disable test if {fmt} >= 10.0.0
{fmt} v10.0.0 introduces formatter for `std::optional`, so there
is no need to test it. furthermore the behavior of this formatter
is different from our homebrew one. so let's skip this test if
{fmt} v10.0.0 or up is used.

Refs #18508

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18509
2024-05-03 11:34:23 +03:00
Kefu Chai
3421e6dcc1 tools/scylla-nodetool: add formatter for char*
in {fmt} version 10.0.0, it has a regression, which dropped the
formatter for `char *`, even it does format `const char*`, as the
latter is convertible to
`fmt::stirng_view`.

and this issue was addressed in 10.1.0 using 616a4937, which adds
the formatter for `Char *` back, where `Char` is a template parameter.

but we do need to print `vector<char*>`, so, to address the build
failure with {fmt} version 10.0.0, which is shipped along with
fedora 39. let's backport this formatter.

Fixes #18503
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18505
2024-05-02 23:25:24 +03:00
Avi Kivity
8de81f8f91 Merge 'Unstall merge topology snapshot' from Benny Halevy
This series adds facilities to gently convert canonical mutations back to mutations
and to gently make canonical mutations or freeze mutations in a seastar thread.

Those are used in storage_service::merge_topology_snapshot to prevent reactor stalls
due to large mutation, as seed in the test_add_many_nodes_under_load dtest.

Also, migration_manager migration_request was converted to use a seastar thread
to use the above facilities to prevent reactor stalls with large schema mutations,
e,g, with a large number of tables, and/or when reading tablets mutations with
a large number of tablets in a table.

perf-simple-query --write results:
Before:
```
median 79151.53 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53289 insns/op,        0 errors)
```
After:
```
median 79716.73 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53314 insns/op,        0 errors)
```

Closes scylladb/scylladb#18290

* github.com:scylladb/scylladb:
  storage_proxy: add mutate_locally(vector<frozen_mutation_and_schema>) method
  raft: group0_state_machine: write_mutations_to_database: freeze mutations gently
  database: apply_in_memory: unfreeze_gently large mutations
  storage_service: get_system_mutations: make_canonical_mutation_gently
  tablets: read_tablet_mutations: make_canonical_mutation_gently
  schema_tables: convert_schema_to_mutations: make_canonical_mutation_gently
  schema_tables: redact_columns_for_missing_features: get input mutation using rvalue reference
  storage_service: merge_topology_snapshot: freeze_gently
  canonical_mutation: add make_canonical_mutation_gently
  frozen_mutation: move unfreeze_gently to async_utils
  mutation: add freeze_gently
  idl-compiler: generate async serialization functions for stub members
  raft: group0_state_machine: write_mutations_to_database: use to_mutation_gently
  storage_service: merge_topology_snapshot: co_await to_mutation_gently
  canonical_mutation: add to_mutation_gently
  idl-compiler: emit include directive in generated impl header file
  mutation_partition: add apply_gently
  collection_mutation: improve collection_mutation_view formatting
  mutation_partition: apply_monotonically: do not support schema upgrade
  test/perf: report also log_allocations/op
2024-05-02 23:24:38 +03:00
Nadav Har'El
f604269f0a cql3, secondary index: consistently choose index to use in a query
When a table has secondary indexes on *multiple* columns, and several
such columns are used for filtering in a query, Scylla chooses one
of these indexes as the main driver of the query, and the second
column's restriction is implemented as filtering.

Before this patch, the index to use was chosen fairly randomly, based on
the order of the indexes in the schema. This order may be different in
different coordinators, and may even change across restarts on the same
coordinators. This is not only inconsistent, it can cause outright wrong
results when using *paging* and switching (or restarting) coordinates
in the middle of a paged scan... One coordinator saves one index's key
in the paging state, and then the other coordinator gets this paging
state and wrongly believes it is supposed to be a key of a *different*
index.

The fix in this patch is to pick the index suitable for the first
indexed column mentioned in the query. This has two benefits over
the situation before the patch:

1. The decision of which index to use no longer changes between
   coordinators or across restarts - it just depends on the schema
   and the specific query.

2. Different indexes can have different "specificity" so using one
   or the other can change the query's performance. After this patch,
   the user is in control over which index is used by changing the
   order of terms in the query. A curious user can use tracing to
   check which index was used to implement a particular query.

An xfailing test we had for this issue no longer fails, so the "xfail"
marker is removed.

Fixes #7969

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#14450
2024-05-02 19:52:42 +02:00
Benny Halevy
890b890e36 storage_proxy: add mutate_locally(vector<frozen_mutation_and_schema>) method
Generalizing the ad-hoc implementation out of
group0_state_machine.write_mutations_to_database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:42:58 +03:00
Benny Halevy
4ae5bbb058 raft: group0_state_machine: write_mutations_to_database: freeze mutations gently
write_mutations_to_database might need to handle
large mutations from system tables, so to prevent
reactor stalls, freeze the mutations gently
and call proxy.mutate_locally in parallel on
the individual frozen mutations, rather than
calling the vector<mutation> based entry point
that eventually freezes each mutation synchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:06 +03:00
Benny Halevy
a9f157b648 database: apply_in_memory: unfreeze_gently large mutations
Prevent stalls coming from applying large
mutations in memory synchronously,
like the ones seen with the test_add_many_nodes_under_load
dtest:
```
  | | |   ++[5#2/2 44%] addr=0x1498efb total=256 count=3 avg=85:
  | | |   |             replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}::operator() at ./replica/memtable.cc:804
  | | |   |             (inlined by) logalloc::allocating_section::with_reclaiming_disabled<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}&> at ././utils/logalloc.hh:500
  | | |   |             (inlined by) logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}>(logalloc::region&, replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}&&)::{lambda()#1}::operator() at ././utils/logalloc.hh:527
  | | |   |             (inlined by) logalloc::allocating_section::with_reserve<logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}>(logalloc::region&, replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}&&)::{lambda()#1}> at ././utils/logalloc.hh:471
  | | |   |             (inlined by) logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator()() const::{lambda()#1}> at ././utils/logalloc.hh:526
  | | |   |             (inlined by) replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0::operator() at ./replica/memtable.cc:800
  | | |   |             (inlined by) with_allocator<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_0> at ././utils/allocation_strategy.hh:318
  | | |   |             (inlined by) replica::memtable::apply at ./replica/memtable.cc:799
  | | |     ++[6#1/1 100%] addr=0x145047b total=1731 count=21 avg=82:
  | | |     |              replica::table::do_apply<frozen_mutation const&, seastar::lw_shared_ptr<schema const>&> at ./replica/table.cc:2896
  | | |       ++[7#1/1 100%] addr=0x13ddccb total=2852 count=32 avg=89:
  | | |       |              replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_0::operator() at ./replica/table.cc:2924
  | | |       |              (inlined by) seastar::futurize<void>::invoke<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_0&> at ././seastar/include/seastar/core/future.hh:2032
  | | |       |              (inlined by) seastar::futurize_invoke<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_0&> at ././seastar/include/seastar/core/future.hh:2066
  | | |       |              (inlined by) replica::dirty_memory_manager_logalloc::region_group::run_when_memory_available<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_0> at ./replica/dirty_memory_manager.hh:572
  | | |       |              (inlined by) replica::table::apply at ./replica/table.cc:2923
  | | |       ++           - addr=0x1330ba1:
  | | |       |              replica::database::apply_in_memory at ./replica/database.cc:1812
  | | |       ++           - addr=0x1360054:
  | | |       |              replica::database::do_apply at ./replica/database.cc:2032
```

This change has virtually no effect on small mutations
(up to 128KB in size).
build/release/scylla perf-simple-query --write --default-log-level=error --random-seed=1 -c 1
Before:
median 80092.06 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53291 insns/op,        0 errors)
After:
median 78780.86 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53311 insns/op,        0 errors)

To estimate the performance ramifications on
large mutations, I measured perf-simple-query --write
calling unfreeze_gently in all cases:
median 77411.26 tps ( 71.3 allocs/op,   8.0 logallocs/op,  14.3 tasks/op,   53280 insns/op,        0 errors)

Showing the allocations that moved out of logalloc
(in memtable::apply of frozen_mutation) into seastar
allocations (in unfreeze_gently) and <1% cpu overhead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:06 +03:00
Benny Halevy
7dd6a81026 storage_service: get_system_mutations: make_canonical_mutation_gently
and also unfreeze_gently the result frozen_mutation:s
to prevent the following stalls that were seen
with the test_add_many_nodes_under_load dtest:
```
  ++[1#1/58 5%] addr=0x16330e9 total=321 count=4 avg=80:
  |             utils::uleb64_express_encode_impl at ././utils/vle.hh:73
  |             (inlined by) utils::uleb64_express_encode<void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)> at ././utils/vle.hh:82
  |             (inlined by) logalloc::region_impl::object_descriptor::encode at ./utils/logalloc.cc:1658
  |             (inlined by) logalloc::region_impl::alloc_small at ./utils/logalloc.cc:1743
  ++          - addr=0x1634cff:
  |             logalloc::region_impl::alloc at ./utils/logalloc.cc:2104
  | ++[2#1/2 83%] addr=0x116e22c total=321 count=4 avg=80:
  | |             managed_bytes::managed_bytes at ././utils/managed_bytes.hh:552
  | | ++[3#1/3 51%] addr=0x1551288 total=198 count=3 avg=66:
  | | |             compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>::compound_wrapper at ././keys.hh:149
  | | |             (inlined by) prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>::prefix_compound_wrapper at ././keys.hh:574
  | | |             (inlined by) clustering_key_prefix::clustering_key_prefix at ././keys.hh:865
  | | |             (inlined by) rows_entry::rows_entry at ./mutation/mutation_partition.hh:957
  | | ++          - addr=0x153f09f:
  | | |             allocation_strategy::construct<rows_entry, schema const&, position_in_partition_view&, seastar::bool_class<dummy_tag>&, seastar::bool_class<continuous_tag>&> at ././utils/allocation_strategy.hh:160
  | | ++          - addr=0x151409a:
  | | |             mutation_partition::append_clustered_row at ./mutation/mutation_partition.cc:719
  | | ++          - addr=0x14ab38f:
  | | |             partition_builder::accept_row at ././partition_builder.hh:57
  | | | ++[4#1/1 100%] addr=0x1579766 total=577 count=7 avg=82:
  | | | |              mutation_partition_view::do_accept<partition_builder> at ./mutation/mutation_partition_view.cc:212
  | | |   ++[5#1/2 56%] addr=0x14e737c total=321 count=4 avg=80:
  | | |   |             frozen_mutation::unfreeze at ./mutation/frozen_mutation.cc:116
  | | |   | ++[6#1/1 100%] addr=0x24fb47e total=1476 count=18 avg=82:
  | | |   | |              service::storage_service::get_system_mutations at ./service/storage_service.cc:6401
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:06 +03:00
Benny Halevy
3143f575e5 tablets: read_tablet_mutations: make_canonical_mutation_gently
To prevent reactor stalls due to large tablets
mutations (that can contain over 100,000 rows).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:06 +03:00
Benny Halevy
7f372dd9ae schema_tables: convert_schema_to_mutations: make_canonical_mutation_gently
To prevent stalls due to large schema mutations.
While at it, reserve the result canonical_mutation vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:05 +03:00
Benny Halevy
61dea98185 schema_tables: redact_columns_for_missing_features: get input mutation using rvalue reference
The function upgrades the input mutation
only in certain cases.  Currently it accepts
the input mutation by value, which may cause
and extraneous copy if the caller doesn't move
the mutation, as done in
`adjust_schema_for_schema_features`.

Getting an rvalue reference instead makes the
interface clearer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:05 +03:00
Benny Halevy
bc1985b8ce storage_service: merge_topology_snapshot: freeze_gently
Freezing large mutations synchronously may cause
reactor stalls, as seen in the test_add_many_nodes_under_load
dtest:
```
  ++[1#1/37 5%] addr=0x15b0bf total=99 count=2 avg=50: ?? ??:0
  | ++[2#1/2 67%] addr=0x15a331f total=66 count=1 avg=66:
  | |             bytes_ostream::write at ././bytes_ostream.hh:248
  | |             (inlined by) bytes_ostream::write at ././bytes_ostream.hh:263
  | |             (inlined by) ser::serialize_integral<unsigned int, bytes_ostream> at ././serializer.hh:203
  | |             (inlined by) ser::integral_serializer<unsigned int>::write<bytes_ostream> at ././serializer.hh:217
  | |             (inlined by) ser::serialize<unsigned int, bytes_ostream> at ././serializer.hh:254
  | |             (inlined by) ser::writer_of_column<bytes_ostream>::write_id at ./build/dev/gen/idl/mutation.dist.impl.hh:4680
  | | ++[3#1/1 100%] addr=0x159df71 total=132 count=2 avg=66:
  | | |              (anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}::operator() at ./mutation/mutation_partition_serializer.cc:99
  | | |              (inlined by) row::maybe_invoke_with_hash<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1} const, cell_and_hash const> at ./mutation/mutation_partition.hh:133
  | | |              (inlined by) row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}::operator() at ./mutation/mutation_partition.hh:152
  | | |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true>::operator() at ././utils/compact-radix-tree.hh:1888
  | | |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::visit_slot<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true>&> at ././utils/compact-radix-tree.hh:1560
  | | ++           - addr=0x159d84d:
  | | |              compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>::visit<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true>&> at ././utils/compact-radix-tree.hh:1364
  | | |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::visit<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true>&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > at ././utils/compact-radix-tree.hh:799
  | | |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::visit<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true>&> at ././utils/compact-radix-tree.hh:807
  | |   ++[4#1/1 100%] addr=0x1596f4a total=329 count=5 avg=66:
  | |   |              compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head::visit<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true> > at ././utils/compact-radix-tree.hh:473
  | |   |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::visit<compact_radix_tree::tree<cell_and_hash, unsigned int>::walking_visitor<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}, true> > at ././utils/compact-radix-tree.hh:1626
  | |   |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::walk<row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}>(ser::deletable_row__cells<bytes_ostream>&&) const::{lambda(unsigned int, cell_and_hash const&)#1}> at ././utils/compact-radix-tree.hh:1909
  | |   |              (inlined by) row::for_each_cell<(anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> >(ser::deletable_row__cells<bytes_ostream>&&, row const&, schema const&, column_kind)::{lambda(unsigned int, atomic_cell_or_collection const&)#1}> at ./mutation/mutation_partition.hh:151
  | |   |              (inlined by) (anonymous namespace)::write_row_cells<ser::deletable_row__cells<bytes_ostream> > at ./mutation/mutation_partition_serializer.cc:97
  | |   |              (inlined by) write_row<ser::writer_of_deletable_row<bytes_ostream> > at ./mutation/mutation_partition_serializer.cc:168
  | |     ++[5#1/2 80%] addr=0x15a310c total=263 count=4 avg=66:
  | |     |             mutation_partition_serializer::write_serialized<ser::writer_of_mutation_partition<bytes_ostream> > at ./mutation/mutation_partition_serializer.cc:180
  | |     | ++[6#1/2 62%] addr=0x14eb60a total=428 count=7 avg=61:
  | |     | |             frozen_mutation::frozen_mutation(mutation const&)::$_0::operator()<ser::writer_of_mutation_partition<bytes_ostream> > at ./mutation/frozen_mutation.cc:85
  | |     | |             (inlined by) ser::after_mutation__key<bytes_ostream>::partition<frozen_mutation::frozen_mutation(mutation const&)::$_0> at ./build/dev/gen/idl/mutation.dist.impl.hh:7058
  | |     | |             (inlined by) frozen_mutation::frozen_mutation at ./mutation/frozen_mutation.cc:84
  | |     | | ++[7#1/1 100%] addr=0x14ed388 total=532 count=9 avg=59:
  | |     | | |              freeze at ./mutation/frozen_mutation.cc:143
  | |     | |   ++[8#1/2 74%] addr=0x252cf55 total=394 count=6 avg=66:
  | |     | |   |             service::storage_service::merge_topology_snapshot at ./service/storage_service.cc:763
```

This change uses freeze_gently to freeze
the cdc_generations_v2 mutations one at a time
to prevent the stalls reported above.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:05 +03:00
Benny Halevy
a016e1d05d canonical_mutation: add make_canonical_mutation_gently
Make a canonical mutation gently using an
async serialization function.
Similar to freeze_gently, yielding is considered
only in-between range tombstones and rows.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:04 +03:00
Benny Halevy
a126160d7e frozen_mutation: move unfreeze_gently to async_utils
Unfreeze_gently doesn't have to be a method of
frozen_mutation.  It might as well be implemented as
a free function reading from a frozen_mutation
and preparing a mutation gently.

The logic will be used in a later patch
to make a canonical mutation directly from
a frozen_mutation instead of unfreezing it
and then converting it to a canonical_mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
aa27ef8811 mutation: add freeze_gently
Allow yielding in between serializing of
range tombstones and rows to prevent reactor
stalls due to large mutations with many
rows or range tombstones.

mutations that have many cells might still
stall but those are considered infrequent enough
to ignore for now.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
0da2940c72 idl-compiler: generate async serialization functions for stub members
To be used in a following patch for e.g.
mutation::freeze_gently.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
504a9ab897 raft: group0_state_machine: write_mutations_to_database: use to_mutation_gently
Prevent stalls coming from writing large mutations
like the ones seen with the test_add_many_nodes_under_load
dtest:
```
  ++[1#11/11 6%] addr=0x15408f6 total=33 count=1 avg=33:
  |              managed_bytes::managed_bytes at ././utils/managed_bytes.hh:284
  |              (inlined by) atomic_cell_or_collection::atomic_cell_or_collection at ./mutation/atomic_cell_or_collection.hh:25
  |              (inlined by) cell_and_hash::cell_and_hash at ./mutation/mutation_partition.hh:73
  |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::emplace<atomic_cell_or_collection, seastar::optimized_optional<cell_hash> > at ././utils/compact-radix-tree.hh:1809
  ++           - addr=0x1518bae:
  |              row::append_cell at ./mutation/mutation_partition.cc:1344
  ++           - addr=0x14acb23:
  |              partition_builder::accept_row_cell at ././partition_builder.hh:70
  ++           - addr=0x157a6a6:
  |              mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor::accept_atomic_cell at ./mutation/mutation_partition_view.cc:218
  |              (inlined by) (anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor::operator() at ./mutation/mutation_partition_view.cc:138
  |              (inlined by) boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>::internal_visit<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>&> at /usr/include/boost/variant/variant.hpp:1028
  |              (inlined by) boost::detail::variant::visitation_impl_invoke_impl<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type> > at /usr/include/boost/variant/detail/visitation_impl.hpp:117
  |              (inlined by) boost::detail::variant::visitation_impl_invoke<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::has_fallback_type_> at /usr/include/boost/variant/detail/visitation_impl.hpp:157
  |              (inlined by) boost::detail::variant::visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<3l>, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, boost::mpl::l_item<mpl_::long_<2l>, ser::collection_cell_view, boost::mpl::l_item<mpl_::long_<1l>, ser::unknown_variant_type, boost::mpl::l_end> > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::has_fallback_type_> at /usr/include/boost/variant/detail/visitation_impl.hpp:238
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::internal_apply_visitor_impl<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*> at /usr/include/boost/variant/variant.hpp:2337
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::internal_apply_visitor<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false> > at /usr/include/boost/variant/variant.hpp:2349
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::apply_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const> at /usr/include/boost/variant/variant.hpp:2393
  |              (inlined by) boost::apply_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>&> at /usr/include/boost/variant/detail/apply_visitor_unary.hpp:68
  |              (inlined by) (anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor> at ./mutation/mutation_partition_view.cc:158
  |              (inlined by) mutation_partition_view::do_accept<partition_builder> at ./mutation/mutation_partition_view.cc:224
  ++           - addr=0x151234a:
  |              mutation_partition::apply at ./mutation/mutation_partition.cc:476
  ++           - addr=0x14e1103:
  |              canonical_mutation::to_mutation at ./mutation/canonical_mutation.cc:76
  ++           - addr=0x283f9ee:
  |              service::write_mutations_to_database at ./service/raft/group0_state_machine.cc:124
  ++           - addr=0x283f36c:
  |              service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2::operator() at ./service/raft/group0_state_machine.cc:165
  ++           - addr=0x28395e3:
  |              std::__invoke_impl<seastar::future<void>, seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, service::topology_change&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
  |              (inlined by) std::__invoke<seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, service::topology_change&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:96
  |              (inlined by) std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<seastar::future<void> > (*)(seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>&&, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&)>, std::integer_sequence<unsigned long, 2ul> >::__visit_invoke at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1032
  |              (inlined by) std::__do_visit<std::__detail::__variant::__deduce_visit_result<seastar::future<void> >, seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1793
  |              (inlined by) std::visit<seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1854
  |              (inlined by) service::group0_state_machine::merge_and_apply at ./service/raft/group0_state_machine.cc:156
  ++           - addr=0x284781e:
  |              service::group0_state_machine::apply at ./service/raft/group0_state_machine.cc:220
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
574cb7d977 storage_service: merge_topology_snapshot: co_await to_mutation_gently
Perevent stalls from "unpacking" of large
canonical mutations seen with test_add_many_nodes_under_load
when called from `group0_state_machine::transfer_snapshot`:

```
  ++[1#1/44 14%] addr=0x395b2f total=569 count=6 avg=95: ?? ??:0
  | ++[2#1/2 56%] addr=0x3991e3 total=321 count=4 avg=80: ?? ??:0
  | ++          - addr=0x1587159:
  | |             std::__new_allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> >::allocate at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:147
  | |             (inlined by) std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> >::allocate at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198
  | |             (inlined by) std::allocator_traits<std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> > >::allocate at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482
  | |             (inlined by) std::_Vector_base<seastar::basic_sstring<signed char, unsigned int, 31u, false>, std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> > >::_M_allocate at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:378
  | |             (inlined by) std::vector<seastar::basic_sstring<signed char, unsigned int, 31u, false>, std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> > >::reserve at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/vector.tcc:79
  | |             (inlined by) ser::idl::serializers::internal::vector_serializer<std::vector<seastar::basic_sstring<signed char, unsigned int, 31u, false>, std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> > > >::read<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> > at ././serializer_impl.hh:226
  | |             (inlined by) ser::deserialize<std::vector<seastar::basic_sstring<signed char, unsigned int, 31u, false>, std::allocator<seastar::basic_sstring<signed char, unsigned int, 31u, false> > >, seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> > at ././serializer.hh:264
  | |             (inlined by) ser::serializer<clustering_key_prefix>::read<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> >(seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator>&)::{lambda(auto:1&)#1}::operator()<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> > at ./build/dev/gen/idl/keys.dist.impl.hh:31
  | ++          - addr=0x1587085:
  | |             seastar::with_serialized_stream<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator>, ser::serializer<clustering_key_prefix>::read<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> >(seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator>&)::{lambda(auto:1&)#1}, void, void> at ././seastar/include/seastar/core/simple-stream.hh:646
  | |             (inlined by) ser::serializer<clustering_key_prefix>::read<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> > at ./build/dev/gen/idl/keys.dist.impl.hh:28
  | |             (inlined by) ser::deserialize<clustering_key_prefix, seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> > at ././serializer.hh:264
  | |             (inlined by) ser::deletable_row_view::key() const::{lambda(auto:1&)#1}::operator()<seastar::fragmented_memory_input_stream<bytes_ostream::fragment_iterator> const> at ./build/dev/gen/idl/mutation.dist.impl.hh:1268
  | | ++[3#1/1 100%] addr=0x15865a3 total=577 count=7 avg=82:
  | | |              seastar::memory_input_stream<bytes_ostream::fragment_iterator>::with_stream<ser::deletable_row_view::key() const::{lambda(auto:1&)#1}> at ././seastar/include/seastar/core/simple-stream.hh:491
  | | |              (inlined by) seastar::with_serialized_stream<seastar::memory_input_stream<bytes_ostream::fragment_iterator> const, ser::deletable_row_view::key() const::{lambda(auto:1&)#1}, void> at ././seastar/include/seastar/core/simple-stream.hh:639
  | | |              (inlined by) ser::deletable_row_view::key at ./build/dev/gen/idl/mutation.dist.impl.hh:1264
  | |   ++[4#1/1 100%] addr=0x157cf27 total=643 count=8 avg=80:
  | |   |              mutation_partition_view::do_accept<partition_builder> at ./mutation/mutation_partition_view.cc:212
  | |   ++           - addr=0x1516cac:
  | |   |              mutation_partition::apply at ./mutation/mutation_partition.cc:497
  | |     ++[5#1/1 100%] addr=0x14e4433 total=1765 count=22 avg=80:
  | |     |              canonical_mutation::to_mutation at ./mutation/canonical_mutation.cc:60
  | |       ++[6#1/2 98%] addr=0x2452a60 total=1732 count=21 avg=82:
  | |       |             service::storage_service::merge_topology_snapshot at ./service/storage_service.cc:761
  | |       ++          - addr=0x2858782:
  | |       |             service::group0_state_machine::transfer_snapshot at ./service/raft/group0_state_machine.cc:303
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Benny Halevy
c485ed6287 canonical_mutation: add to_mutation_gently
to_mutation_gently generates mutation from canonical_mutation
asynchronously using the newly introduced mutation_partition
accept_gently method.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:54 +03:00
Benny Halevy
7f7e4616ab idl-compiler: emit include directive in generated impl header file
The generated implementation header file depends
on the generated header file for the types it uses.
Generate a respective #include directive to make it self-sufficient.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:50:16 +03:00
Benny Halevy
e1411f3911 mutation_partition: add apply_gently
To be used for freezing mutations or
making canonical mutations gently.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:45:24 +03:00
Benny Halevy
f625cd76a9 collection_mutation: improve collection_mutation_view formatting
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:42:41 +03:00
Benny Halevy
15e8ecb670 mutation_partition: apply_monotonically: do not support schema upgrade
Currently, if the input mutation_partition requires
schema upgrade, apply_monotonically always silently reverts to
being non-preemptible, even if the caller passed is_preemptible::yes.

To prevent that from happening, put the burden of upgrading
the mutation_partition schem on the caller, which is
today the apply() methods, which are synchronous anyhow.

With that, we reduce the proliferation of the
`apply_monotonically` overloads and keep only the
low level one (which could potentially be private as well,
as it's called only from within the mutation/ source files
and from tests)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:42:41 +03:00
Benny Halevy
e5ca65f78b test/perf: report also log_allocations/op
Currently perf-simple-query --write ignores
log allocations that happen on the memtable
apply path.

This change adds tracking and accounting
of the number of log allocation,
and reporting of thereof.

For reference, here's the output of
build/release/scylla perf-simple-query --write --default-log-level=error --random-seed=1 -c 1
```
random-seed=1
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
78073.55 tps ( 59.4 allocs/op,  16.3 logallocs/op,  14.3 tasks/op,   52991 insns/op,        0 errors)
77263.59 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53282 insns/op,        0 errors)
79913.07 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53295 insns/op,        0 errors)
79554.32 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53284 insns/op,        0 errors)
79151.53 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53289 insns/op,        0 errors)

median 79151.53 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   53289 insns/op,        0 errors)
median absolute deviation: 761.54
maximum: 79913.07
minimum: 77263.59
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 18:42:41 +03:00
Avi Kivity
e0d597348b Merge 'Remove sstable_directory::_sstable_dir member' from Pavel Emelyanov
Different sstable storage backends use slightly different notion of what sstable location is. Filesystem storage knows it's `/var/lib/data/ks/cf-uuid/state` path, while s3 storage keeps only this path's part without state (and even that's not very accurate, because bucket prefix is missing as well as "/var/lib/data" prefix is not needed and eventually should be omitted). Nonetheless, the sstable_directory still keeps the filsystem-like path, while it's really only needed by the filesystem lister. This PR removes it.

Closes scylladb/scylladb#18496

* github.com:scylladb/scylladb:
  sstable_directory: Remove _sstable_dir member
  sstable_directory: Create sstable path with make_path() when logging
  sstable_directory: Use make_path to construct filesystem lister
  sstable_directory: Move some logging around
2024-05-02 17:52:21 +03:00
Patryk Jędrzejczak
b8e3bf4b09 topology_coordinator: clear obsolete generations earlier
We want to clear CDC generations that are no longer needed
(because all writes are already using a new generation) so they
don't take space and are not sent during snapshot transfers
(see e.g. scylladb/scylladb#17545).

The condition used previously was that we clear generations which
were closed (i.e., a new generation started at this time) more than
24h ago. This is a safe choice, but too conservative: we could
easily end up with a large number of obsolete generations if we
boot multiple nodes during 24h (which is especially easy to do
with tablets.)

Change this bound from 24h to `5s + ring_delay`. The choice is
explained in a comment in the code.

Also, prevent `test_cdc_generation_clearing` from being flaky by
firing the `increase_cdc_generation_leeway` error injection on
the server being the topology coordinator.

Ref: scylladb/scylladb#17545
2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak
f61c50baa4 test: test_raft_snapshot_request: improve the last assertion
The last assertion in the test is very sensitive to changes. The
constant has already been increased from 0 to 1 due to flakiness.
The old comment explains it.

In the following patch, we change the CDC generation publisher so
that it clears the obsolete CDC generations earlier. This change
would make this assertion flaky again. After restarting the servers,
the new topology coordinator could remove the first generation if it
became obsolete. This operation appends a new entry to the log. If
it happened after triggering snapshot, the assertion could fail
with `2 <= 1`.

We could increase the constant again to unflake the test, but we
better improve it once and for all. We change the assertion so
that it's not sensitive to changes in the code based on Raft. The
explanation is in the new comment.
2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak
44791a849e test: test_raft_snapshot_request: find raft leader after restart
Finding the new Raft leader after restart simplifies the test
and makes it easier to reason about. There are two improvements:
- we only need to wait until the leader appends a command, so
  the read barrier becomes unnecessary,
- we only need to trigger snapshot on the leader.

We also use the knowledge about the leader in the following patch.
2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak
41198998c5 test: test_raft_shanpshot_request: simplify appended_command
We shorten the code and remove the unused `log_size` variable.
2024-05-02 12:46:31 +02:00
Yaron Kaikov
2cf7cc1ea5 scylla_setup: Remove jmx and tools packages from being verified
Following
b8634fb244
machine image started to fail with the following error:
```
10:44:59  ␛[0;32m    googlecompute.gce: scylla-jmx package is not installed.␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce: Traceback (most recent call last):␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce:   File "/home/ubuntu/scylla_install_image", line 135, in <module>␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce:     run('/opt/scylladb/scripts/scylla_setup --no-coredump-setup --no-sysconfig-setup --no-raid-setup --no-io-setup --no-ec2-check --no-swap-setup --no-cpuscaling-setup --no-ntp-setup', shell=True, check=True)␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce:   File "/usr/lib/python3.10/subprocess.py", line 526, in run␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce:     raise CalledProcessError(retcode, process.args,␛[0m
10:44:59  ␛[1;31m==> googlecompute.gce: subprocess.CalledProcessError: Command '/opt/scylladb/scripts/scylla_setup --no-coredump-setup --no-sysconfig-setup --no-raid-setup --no-io-setup --no-ec2-check --no-swap-setup --no-cpuscaling-setup --no-ntp-setup' returned non-zero exit status 1.␛[0m
```

It seems we no longer need to verify that jmx and tools-java packages are installed.

Closes scylladb/scylladb#18494
2024-05-02 13:30:50 +03:00
Pavel Emelyanov
b8f9eeb82b sstable_directory: Remove _sstable_dir member
It's no longer in use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 13:12:59 +03:00
Pavel Emelyanov
608762adda sstable_directory: Create sstable path with make_path() when logging
The sstable_directory::sstable_filename() should generate a name of an
sstable for log messages. It's not accurate, because it silently assumes
that the filename is on local storage, which might not be the case.
Fixing it is large chage, so for now replace _sstable_dir with explicit
call to make_path(). The change is idempotent, as _sstable_dir is
initialized with the result of make_path() call in constructor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 13:12:59 +03:00
Pavel Emelyanov
07c1df575e sstable_directory: Use make_path to construct filesystem lister
The _sstable_dir is used currently, but it's initialized with
make_path() result anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 13:12:59 +03:00
Pavel Emelyanov
ef98777b27 sstable_directory: Move some logging around
At the beginning of .process() method there's a log message which path
and which storage is being processed. That's not really nice, because,
e.g. filesystem lister may skip processing quarantine directory. Also,
the registry lister doesn't list entries by their _sstable_dir, but
rather by its _location (spoiler: dir = location / state).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 13:08:28 +03:00
Ferenc Szili
90634b419c sstable: added cluster feature for dead rows and range tombstones
Previously, writing into system.large_partitions was done by calling
record_large_partition(). In order to write different data based on
the cluster feature flag, another level of indirection was added by
calling _record_large_partitions which is initialized to a lambda
which calls internal_record_large_partitions(). This function does
not record the values of the two new columns (dead_rows and
range_tombstones). After the cluster feature flag becomes true,
_record_large_partitions is set to a lambda which calls
internal_record_large_partitions_all_data() which record the values
of the two new columns.
2024-05-02 11:49:46 +02:00
Ferenc Szili
b06af5b2b9 sstable: write dead_rows count to system.large_partitions 2024-05-02 11:49:10 +02:00
Ferenc Szili
63e724c974 sstable: added counter for dead rows 2024-05-02 11:49:10 +02:00
Nadav Har'El
5558143014 test/alternator: test addressing LSI using REST API
The name of the Scylla table backing an Alternator LSI looks like
basename:!lsiname. Some REST API clients (including Scylla Manager)
when they send a "!" character in the REST API request may decide
to "URL encode" it - convert it to %21.

Because of a Seastar bug (https://github.com/scylladb/seastar/issues/725)
Scylla's REST API server forgets to do the URL decoding, which leads
to the REST API request failing to address the LSI table.

This patch introduces a test for this bug, which fails without the
Seastar issue being fixed, and passes afterwards (i.e., after the
previous patch that starts to use the new, fixed, Seastar API).

The test creates an LSI, uses the REST API to find its name and then
tries to call some REST API ("compaction_strategy") on this table name,
after deliberately URL-encoding it.

Refs #5883.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-05-02 12:33:54 +03:00
Nadav Har'El
1aacfdf460 REST API: stop using deprecated, buggy, path parameter
The API req->param["name"] to access parameters in the path part of the
URL was buggy - it forgot to do URL decoding and the result of our use
of it in Scylla was bugs like #5883 - where special characters in certain
REST API requests got botched up (encoded by the client, then not
decoded by the server).

The solution is to replace all uses of req->param["name"] by the new
req->get_path_param("name"), which does the decoding correctly.

Unfortunately we needed to change 104 (!) callers in this patch, but the
transformation is mostly mechanical and there is no functional changes in
this patch. Another set of changes was to bring req, not req->param, to
a few functions that want to get the path param.

This patch avoids the numerous deprecation warnings we had before, and
more importantly, it fixes #5883.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-05-02 12:33:46 +03:00
Jan Ciolek
59b7920b0b view_update_generator: add get_storage_proxy()
During view generation we would like to be able
to access information about the current state
of view update backlogs, but this information
is kept inside storage_proxy.

A reference to storage_proxy is kept inside view_update_generator,
so the easiest way to get access to it from the view update code
is by adding a public getter there.

There's already a similar getter for replica::database: get_db(),
so it's in line with the rest of the code.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-02 10:59:55 +02:00
Jan Ciolek
4c5cfc7683 storage_proxy: make view backlog getters public
Storage proxy maintains information about both local
and remote view update backlogs.

This information might also be useful outside of storage_proxy,
so let's expose the functions that allow to acces backlog information.

There aren't any implementation quirks that would make
it unsafe to make the functions public, the worst that
can happen is that someone causes a lot of atomic operations
by repeatedly calling get_view_update_backlog().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-02 10:59:55 +02:00
Pavel Emelyanov
67736b5cd3 Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 9c2a836607.
2024-05-02 08:16:14 +03:00
Pavel Emelyanov
d47053266b view: Abort pending view updates when draining
When view builder is drained (it now happens very early, but next patch
moves this into regular drain) it waits for all on-going view build
steps to complete. This includes waiting for any outstanding proxy view
writes to complete as well.

View writes in proxy have very high timeout of 5 minutes but they are
cancellable. However, canecelling of such writes happens in proxy's
drain_on_shutdown() call which, in turn, happens pretty late on
shutdown. Effectively, by the time it happens all view writes mush have
completed already, so stop-time cancelling doesn't really work nowadays.

Next patch makes view builder drain happen a bit later during shutdown,
namely -- _after_ shutting down messaging service. When it happen that
late, non-working view writes cancellation becomes critical, as view
builder drain hangs for aforementioned 5 minutes. This patch explicitly
cancels all view writes when view builder stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 08:16:12 +03:00
Kefu Chai
f183f5aa80 Update seastar submodule
* seastar 2b43417d...b73e5e7d (11):
  > treewide: inherit from formatter<string_view> not formatter<std::string_view>
  > CMakeLists.txt: Apply CXX deprecated flags conditionally
  > tls: add assignment operator for gnutls_datum
  > tls: s/get0()/get()/
  > io_queue: do not reference moved variable
  > TLS: use helper function in get_distinguished_name & get_alt_name_information
  > TLS: Add support for TLS1.3 session tickets
  > iotune: ignore shards with id above max_iodepth
  > core/future: remove a template parameter from set_callback()
  > util: with_file_input_stream: always close file
  > core/sleep: Use more raii-sh aproach to maintain sleeper

Fixes #5181

Closes scylladb/scylladb#18491
2024-05-02 07:35:42 +03:00
Takuya ASADA
b8634fb244 dist: stop installing scylla-tools, scylla-jmx by default
Since we added native nodetool, we no longer need to install scylla-tools
and scylla-jmx, drop them from scylla metapackage and make it optional
package.

Closes #18472

Closes scylladb/scylladb#18487
2024-05-01 22:15:40 +03:00
Kefu Chai
af5674211d redis/server.hh: suppress -Wimplicit-fallthrough from protocol_parser.hh
when compiling the tree with clang-18 and ragel 6.10, the compiler
warns like:

```
/usr/local/bin/cmake -E __run_co_compile --tidy="clang-tidy-18;--checks=-*,bugprone-use-after-move;--extra-arg-before=--driver-mode=g++" --source=/home/runner/work/scylladb/scylladb/redis/controller.cc -- /usr/bin/clang++-18 -DBOOST_NO_CXX98_FUNCTION_BASE -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -I/home/runner/work/scylladb/scylladb -I/home/runner/work/scylladb/scylladb/build/gen -I/home/runner/work/scylladb/scylladb/seastar/include -I/home/runner/work/scylladb/scylladb/build/seastar/gen/include -I/home/runner/work/scylladb/scylladb/build/seastar/gen/src -isystem /home/runner/work/scylladb/scylladb/cooking/include -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/runner/work/scylladb/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT redis/CMakeFiles/redis.dir/controller.cc.o -MF redis/CMakeFiles/redis.dir/controller.cc.o.d -o redis/CMakeFiles/redis.dir/controller.cc.o -c /home/runner/work/scylladb/scylladb/redis/controller.cc
error: too many errors emitted, stopping now [clang-diagnostic-error]
Error: /home/runner/work/scylladb/scylladb/build/gen/redis/protocol_parser.hh:110:1: error: unannotated fall-through between switch labels [clang-diagnostic-implicit-fallthrough]
  110 | case 1:
      | ^
/home/runner/work/scylladb/scylladb/build/gen/redis/protocol_parser.hh:110:1: note: insert 'FMT_FALLTHROUGH;' to silence this warning
  110 | case 1:
      | ^
      | FMT_FALLTHROUGH;
```

since we have `-Werror`, the warnings like this are considered as error,
hence the build fails. in order to address this failure, let's silence
this warning when including this generated header file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18447
2024-05-01 18:47:24 +03:00
Kefu Chai
08d1362f80 utils/chunked_vector: fix some typos in comment
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18486
2024-05-01 16:38:43 +03:00
Nadav Har'El
4e78e2d506 test/cql-pytest, cdc: add test for what happens when log name is taken
In our CDC implementation, the CDC log table for table "xyz" is always
called "xyz_scylla_cdc_log". If this table name is taken, and the user
tries to create a table "xyz" with CDC enabled - or enable CDC on the
table "xyz", the creation/enabling should fail gracefully, with a clear
error message. This test verifies this.

The new test passes - the code is already correct. I just wanted to
verify that it is (and to prevent future regressions).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18485
2024-05-01 14:46:19 +03:00
Pavel Emelyanov
5d992a4f01 proxy: Remove declaration of nonexisting view_update_write_response_handler class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18417
2024-05-01 10:15:41 +03:00
Botond Dénes
65a385f5d0 Merge 'Relax the way view builder code checks if a table exists' from Pavel Emelyanov
There are two places that workaround db.column_family_exists() call with some fancy exceptions-catching lambda.
This PR makes things simpler.

Closes scylladb/scylladb#18441

* github.com:scylladb/scylladb:
  view: Open-code one line lambda checking if table exists
  view: Use non-throwoing check if a table exists
2024-05-01 10:14:58 +03:00
Kefu Chai
94ac0799d9 build: cmake: link scylla_tracing against scylla-main
because tracing/trace_keyspace_helper.cc references symbols
defined by table_helper, which is in turn provided by scylla-main,
we should link tracing_tracing against scylla-main.

otherwise we could have following link failure:

```
./build/./tracing/trace_keyspace_helper.cc:214: error: undefined reference to 'table_helper::setup_keyspace(cql3::query_processor&, service::migration_manager&, std::basic_string_view<char, std::char_traits<char> >, seastar::basic_sstring<char, unsigned int, 15u, true>, service::query_state&, std::vector<table_helper*, std::allocator<table_helper*> >)'
./build/./tracing/trace_keyspace_helper.cc:396: error: undefined reference to 'table_helper::cache_table_info(cql3::query_processor&, service::migration_manager&, service::query_state&)'
./table_helper.hh:92: error: undefined reference to 'table_helper::insert(cql3::query_processor&, service::migration_manager&, service::query_state&, seastar::noncopyable_function<cql3::query_options ()>)'
./table_helper.hh:92: error: undefined reference to 'table_helper::insert(cql3::query_processor&, service::migration_manager&, service::query_state&, seastar::noncopyable_function<cql3::query_options ()>)'
./table_helper.hh:92: error: undefined reference to 'table_helper::insert(cql3::query_processor&, service::migration_manager&, service::query_state&, seastar::noncopyable_function<cql3::query_options ()>)'
./table_helper.hh:92: error: undefined reference to 'table_helper::insert(cql3::query_processor&, service::migration_manager&, service::query_state&, seastar::noncopyable_function<cql3::query_options ()>)'
clang++-18: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18455
2024-05-01 10:08:11 +03:00
Kefu Chai
f0d12df7fc reloc: create $BUILDDIR for getting its path
when building with CMake, there is a use case where the $BUILDIR
is not created yet, when `reloc/build_rpm.sh` is launched. in order
to enable us to run this script without creating $BUILDIR first, let's
create this directory first.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18464
2024-05-01 09:52:17 +03:00
Kefu Chai
8168f02550 raft_group_registry: do not use moved variable
clang-tidy warns like:
```
[628/713] Building CXX object service/CMakeFiles/service.dir/raft/raft_group_registry.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:543:66: warning: 'id' used after it was moved [bugprone-use-after-move]
  543 |             auto& rate_limit = _rate_limits.try_get_recent_entry(id, std::chrono::minutes(5));
      |                                                                  ^
/home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:539:19: note: move occurred here
  539 |     auto dst_id = raft::server_id{std::move(id)};
      |                   ^
```

this is a false alarm. as the type of `id` is actually `utils::UUID`
which is a struct enclosing two `int64_t` variables. and we don't
define a move constructor for `utils::UUID`. so the value of of `id`
is intact after being moved away. but it is still confusing at
the first glance, as we are indeed referencing a moved-away variable.

so in order to reduce the confusion and to silence the warning, let's
just do not `std::move(id)`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18449
2024-05-01 09:45:12 +03:00
Kefu Chai
bd0d246b57 tools/scylla-nodetool: implement the resetlocalschema command
Fixes #18468
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18470
2024-05-01 08:49:11 +03:00
Raphael S. Carvalho
b980634ff2 test: Verify tablet cleanup is properly retried on failure
Doesn't test only coordinator ability to retry on failure, but also
that replica will be able to properly continue cleanup of a storage
group from where it left off (when failure happened), not leave any
sstables behind.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18426
2024-04-30 19:27:17 +02:00
Raphael S. Carvalho
62b1cfa89c topology_coordinator: Fix synchronization of tablet split with other concurrent ops
Finalization of tablet split was only synchronizing with migrations, but
that's not enough as we want to make sure that all processes like repair
completes first as they might hold erm and therefore will be working
with a "stale" version of token metadata.

For synchronization to work properly, handling of tablet split finalize
will now take over the state machine, when possible, and execute a
global token metadata barrier to guarantee that update in topology by
split won't cause problems. Repair for example could be writing a
sstable with stale metadata, and therefore, could generate a sstable
that spans multiple tablets. We don't want that to happen, therefore
we need the barrier.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18380
2024-04-30 19:23:28 +02:00
Botond Dénes
525553aa41 SCYLLA-VERSION-GEN: warn against using - or _ in custom version names
Doing so is a pitfall that will make one waste a lot of time rebuilding
the packages, just because at the end it turns out that the version has
illegal characters in it. The author of this patch has certainly fallen
into this pitfall a lot of times.

Closes scylladb/scylladb#18429
2024-04-30 18:14:51 +03:00
Avi Kivity
ea15ddc7dc Merge 'Fix population of non-normal sstables from registry' from Pavel Emelyanov
On boot sstables are populated from normal location as well as from quarantine and staging. It turned out that sstables listed in registry (S3-backed ones) are not populated from non-normal states.

Closes scylladb/scylladb#18439

* github.com:scylladb/scylladb:
  test: Add test for how quarantined sstables registry entries are loaded
  sstable_directory: Use sstable location to initialize registry lister
2024-04-30 18:10:11 +03:00
Avi Kivity
329b135b5e Merge 'chunked_vector: fix use after free in emplace back' from Benny Halevy
Currently, push_back or emplace_back reallocate the last chunk
before constructing the new element.

If the arg passed to push_back/emplace_back is a reference to an
existing element in the vector, reallocating the last chunk will
invalidate the arg reference before it is used.

This patch changes the order when reallocating
the last chunk in reserve_for_emplace_back:
First, a new chunk_ptr is allocated.
Then, the back_element is emplaced in the
newly allocated array.
And only then, existing elements in the current
last chunk are migrated to the new chunk.
Eventually, the new chunk replaces the existing chunk.

If no reservation is requried, the back element
is emplaced "in place" in the current last chunk.

Fixes scylladb/scylladb#18072

Closes scylladb/scylladb#18073

* github.com:scylladb/scylladb:
  test: chunked_managed_vector_test: add test_push_back_using_existing_element
  utils: chunked_vector: reserve_for_emplace_back: emplace before migrating existing elements
  utils: chunked_vector: push_back: call emplace_back
  utils: chunked_vector: define min_chunk_capacity
  utils: chunked*vector: use std::clamp
2024-04-30 18:09:04 +03:00
David Garcia
f62197ee1e docs: enable concurrent downloads
Downloads chunks of 10 CSV concurrently to speed up doc builds.

Closes scylladb/scylladb#18469
2024-04-30 16:13:40 +03:00
Raphael S. Carvalho
d7a01598ce tools: Make sstable shard-of efficient by loading minimum to compute owners
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18440
2024-04-30 16:10:58 +03:00
Gleb Natapov
f2b0a5e9e1 storage_service: do not take API lock for removenode operation if topology coordinator is enabled
Topology coordinator serialize operations internally, so there is no
need to have an external lock.

Fixes: scylladb/scylladb#17681
2024-04-30 15:13:50 +03:00
Gleb Natapov
0a7101923c test: return file mark from wait_for that points after the found string
Returning file mark allows to start searching from the point where the
previous string was found.
2024-04-30 15:06:32 +03:00
Kefu Chai
3a1ceb96d7 utils: UUID_gen: include <atomic>
in UUID_gen.cc, we are using `std::atomic<int64_t>` in
`make_thread_local_node()`, but this template is not defined by
any of the included headers. but  we should include used headers
to be self-contained.

when compiling on ubuntu:jammy with libstdc++-13, we have following
error:
```
/usr/local/bin/cmake -E __run_co_compile --tidy="clang-tidy-18;--checks=-*,bugprone-use-after-move;--extra-arg-before=--driver-mode=g++" --source=/home/runner/work/scylladb/scylladb/utils/UUID_gen.cc -- /usr/bin/clang++-18 -DBOOST_ALL_NO_LIB -DBOOST_NO_CXX98_FUNCTION_BASE -DBOOST_REGEX_DYN_LINK -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -I/home/runner/work/scylladb/scylladb -I/home/runner/work/scylladb/scylladb/seastar/include -I/home/runner/work/scylladb/scylladb/build/seastar/gen/include -I/home/runner/work/scylladb/scylladb/build/seastar/gen/src -isystem /home/runner/work/scylladb/scylladb/cooking/include -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overl
Error: /home/runner/work/scylladb/scylladb/utils/UUID_gen.cc:29:33: error: implicit instantiation of undefined template 'std::atomic<long>' [clang-diagnostic-error]
   29 |     static std::atomic<int64_t> thread_id_counter;
      |                                 ^
/usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/shared_ptr_atomic.h:361:11: note: template is declared here
  361 |     class atomic;
      |           ^
```
so, in this change, we include `<atomic>` to address this
build failure.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18387
2024-04-30 09:07:22 +03:00
Kefu Chai
6a73c911e3 tools: lua_sstable_consumer.cc: be compatible with Lua 5.3's lua_resume()
in Lua 5.3, lua_resume() only accepts three parameters, while in Lua 5.4,
this function accepts four parameters. so in order to be compatible with
Lua 5.3, we should not pass the 4th parameter to this function.
a macro is defined to conditionally pass this parameter based on the
Lua's version.

see https://www.lua.org/manual/5.3/manual.html#lua_resume

Refs 5b5b8b3264
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18450
2024-04-30 09:06:25 +03:00
Botond Dénes
0ace90ad04 test: add test for cleaning up cached querier on tablet migration
Check that a cached querier, which exists prior to a migration, will be
cleaned up afterwards. This reproduces #18110.
The test fails before the fix for the above and passes afterwards.
2024-04-30 01:47:16 -04:00
Botond Dénes
64c817462e querier: allow injecting cache entry ttl by error injector
To allow making tests more robust by setting TTL to a very large value,
whent the test relies on entries being present for a given time.
2024-04-30 01:47:16 -04:00
Botond Dénes
03995d9397 replica/table: cleanup_tablet(): clear inactive reads for the tablet
To avoid any resource surviving the cleanup, via some inactive read
pinning it. This can cause data resurrection if the tablet is later
migrated back and the pinned data source is added back to the tablet.
2024-04-30 01:47:16 -04:00
Botond Dénes
a062e3f650 replica/database: introduce clear_inactive_reads_for_tablet()
To be used on the tablet cleanup path, to clear any inactive read which
might be related to the cleaned-up tablet.
2024-04-30 01:44:03 -04:00
Botond Dénes
338af5055c replica/database: introduce foreach_reader_concurrency_semaphore
Currently we have a single method -- detach_column_family() -- which
does something with each semaphore. Soon there will be another one.
Introduce a method to do something with all semaphores, to make this
smoother. Enterprise has a different set of semaphores, and this will
reduce friction.
2024-04-30 01:43:56 -04:00
Botond Dénes
3c813fbb99 reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table()
When the new optional parameter has a value, evict only inactive reads,
whose ranges overlap with the provided range. The range for the inactive
read is provided in `register_inactive_read()`. If the inactive read has
no range, ovarlap is assumed and the read is evicted.
This will be used to evict all inactive reads that could potentially use
a cleaned-up tablet.
2024-04-30 01:31:08 -04:00
Botond Dénes
9e7a957ffb reader_concurrency_semaphore: allow storing a range with the inactive reader
This allows specifying the range the inactive read is reading from. To
be used in the next patch to selectively evict inactive reads whose
range overlaps with a certain (tablet) range.
2024-04-30 01:31:08 -04:00
Botond Dénes
67684308d1 reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()
inactive_read_handle::abandon() evicts and destroyes the inactive-read,
so it is not left behind. Currently, while doing so, it triggers the
inactive_read's own version of abandon(): detach(). The two has bad
interaction when the inactive_read_handle stores the last permit
instance, causing (so far benign) use-after-free. Prevent triggering
detach() to avoid this bad interaction altogether.
2024-04-30 01:31:08 -04:00
Piotr Dulikowski
35f456c483 Merge 'Extend ALTER TABLE ... DROP to allow specifying timestamp of column drop' from Michał Jadwiszczak
In order to correctly restore schema from `DESC SCHEMA WITH INTERNALS`, we need a way to drop a column with a timestamp in the past.

Example:
- table t(a int pk, b int)
- insert some data1
- drop column b
- add column b int
- insert some data2

If the sstables weren't compacted, after restoring the schema from description:
- we will loss column b in data2 if we simply do `ALTER TABLE t DROP b` and `ALTER TABLE t ADD b int`
- we will resurrect column b in data1 if we skip dropping and re-adding the column

Test for this: https://github.com/scylladb/scylla-dtest/pull/4122

Fixes #16482

Closes scylladb/scylladb#18115

* github.com:scylladb/scylladb:
  docs/cql: update ALTER TABLE docs
  test/cqlpytest: add test for prepared `ALTER TABLE ... DROP ... USING TIMESTAMP ?`
  test/cql-pytest: remove `xfail` from alter table with timestamp tests
  cql3/statements: extend `ALTER TABLE ... DROP` to allow specifying timestamp of column drop
  cql3/statements: pass `query_options` to `prepare_schema_mutations()`
  cql3/statements: add bound terms to alter table statement
  cql3/statements: split alter_table_statement into raw and prepared
  schema: allow to specify timestamp of dropped column
2024-04-29 14:05:05 +02:00
Piotr Dulikowski
dec652de9e test: topology: test that upgrade succeeds after recent removal
Adds a regression test for scylladb/scylladb#18198 - start a two node
cluster in legacy topology mode, use nodetool removenode on one of the
nodes, upgrade the remaining 1-node cluster and observe that it
succeeds.
2024-04-29 13:33:40 +02:00
Piotr Dulikowski
cb4a4f2caf topology_coordinator: compute cluster size correctly during upgrade
During upgrade to raft topology, information about service levels is
copied from the legacy tables in system_distributed to the raft-managed
tables of group 0. system_distributed has RF=3, so if the cluster has
only one or two nodes we should use lower consistency level than ALL -
and the current procedure does exactly that, it selects QUORUM in case
of two nodes and ONE in case of only one node. The cluster size is
determined based on the call to _gossiper.num_endpoints().

Despite its name, gossiper::num_endpoints() does not necessarily return
the number of nodes in the cluster but rather the number of endpoint
states in gossiper (this behavior is documented in a comment near the
declaration of this function). In some cases, e.g. after gossiper-based
nodetool remove, the state might be kept for some time after removal (3
days in this case).

The consequence of this is that gossiper::num_endpoints() might return
more than the current number of nodes during upgrade, and that in turn
might cause migration of data from one table to another to fail -
causing the upgrade procedure to get stuck if there is only 1 or two
nodes in the cluster.

In order to fix this, use token_metadata::get_all_endpoints() as a
measure of the cluster size.

Fixes: scylladb/scylladb#18198
2024-04-29 13:26:29 +02:00
Takuya ASADA
af0c0ee8af configure.py: revert changing builddir as absolute path
On be3776ec2a, we changed outdir to
absolute path.
This causes "unknown target" error when we build Scylla using the relative
path something like "ninja build/dev/scylla", since the target name
become absolte path.
Revert the change to able to build with the relative path.

Also, change optimized_clang.sh to use relative path for --builddir,
since we reference "../../$builddir/SCYLLA-*-FILE" when we build
submodule, it won't work with absolute path.

Fixes #18321

Closes scylladb/scylladb#18338
2024-04-29 09:35:21 +03:00
Kefu Chai
4433d2e10e build: cmake: let iotune depends on config specific file
before this change, in order to build `${iotune_path}`, we use
the rule to build `app_iotune` but this target is built using
the default build type, see
https://cmake.org/cmake/help/latest/variable/CMAKE_DEFAULT_BUILD_TYPE.html#variable:CMAKE_DEFAULT_BUILD_TYPE
so, if we want to build `${iotune_path}` for the configuration
which is not listed as the first item in `CMAKE_CONFIGURATION_TYPES`,
we would end up with copying an nonexistent file.

to address this issue, we override the this behavior using
the `$<OUTPUT_CONFIG:...>` generator-expression. so that we
can depend on non-unique path. and the file-level dependency
between ${iotune_path} and $<CONFIG>/iotune can be established.

see also
https://cmake.org/cmake/help/latest/generator/Ninja%20Multi-Config.html#custom-commands

Refs #2717

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18395
2024-04-29 09:06:39 +03:00
Kefu Chai
f03f69ad4f partition_version: move the base class in move ctor
before this change, `partition_version` uses a hand-crafted move
constructor. but it suffers from the warning from clang-tidy, which
believe there is a use-after-move issue, as the inner instance of
it's parent class is constructed using
`anchorless_list_base_hook(std::move(pv))`, and its other member
variables are initialized like `_partition(std::move(pv._partition))`

`std::move(pv)` does not do anything, but *indicates* `pv` maybe
moved from. and what is moved away is but the part belong to its
parent class. so this issue is benign.

but, it's still annoying. as we need to tell the genuine issues
reported by clang-tidy from the false alarms. so we have at least
two options:

- stop using clang-tidy
- ignore this warning
- silence this warning using LINT direction in a comment
- use another way to implement the move constructor

in this change, we just cast the moved instance to its
base class and move it instead, this should applease
clang-tidy.

Fixes #18354
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18359
2024-04-28 18:34:45 +02:00
Dawid Medrek
bf802e99eb docs: Update Hinted Handoff documentation
We briefly explain the process of migration
of Hinted Handoff to host IDs, the rationale
for it, consequences, and possible side effects.
2024-04-28 01:22:59 +02:00
Dawid Medrek
46ab22f805 db/hints: Add endpoint_downtime_not_bigger_than()
We add an auxiliary function checking if a node
hasn't been down for too long. Although
`gms::gossiper` provides already exposes a function
responsible for that, it requires that its
argument be an IP address. That's the reason
we add a new function.
2024-04-28 01:22:59 +02:00
Dawid Medrek
0ef8d67d32 db/hints: Migrate hinted handoff when cluster feature is enabled
These changes migrate hinted handoff to using
host ID as soon as the corresponding cluster
feature is enabled.

When a node starts, it defaults to creating
directories naming them after IP addresses.
When the whole cluster has upgraded
to a version of Scylla that can handle
directories representing host IDs,
we perform a migration of the IP folders,
i.e. we try to rename them to host IDs.
Invalid directories, i.e. those that
represent neither an IP address, nor a host
ID, are removed.

During the migration, hinted handoff is
disabled. It is necessary because we have
to modify the disk's contents, so new hints
cannot be saved until the migration finishes.
2024-04-28 01:22:57 +02:00
Dawid Medrek
58784cd8db db/hints: Handle arbitrary directories in resource manager
Before these changes, resource manager only handled
the case when directories it browsed represented
valid host IDs. However, since before migrating
hinted handoff to using host IDs we still name
directories after IP addresses, that would lead
to exceptins that shouldn't happen.

We make resource manager handle directories
of arbitrary names correctly.
2024-04-27 22:31:07 +02:00
Dawid Medrek
ee84e810ca db/hints: Start using hint_directory_manager
We start keeping track of mappings IP - host ID.
The mappings are between endpoint managers
(identified by host IDs) and the hint directories
managed by them (represented by IP addresses).

This is a prelude to handling IP directories
by the hint shard manager.

The structure should only be used by the hint
manager before it's migrated to using host IDs.
The reason for that is that we rely on the
information obtained from the structure, but
it might not make sense later on.

When we start creating directories named after
host IDs and there are no longer directories
representing IP addresses, there is no relation
between host IDs and IPs -- just because
the structure is supposed to keep track between
endpoint managers and hint directories that
represent IP addresses. If they represent
host IDs, the connection between the two
is lost.

Still using the data structure could lead
to bugs, e.g. if we tried to associate
a given endpoint manager's host ID with its
corresponding IP address from
locator::token_metadata, it could happen that
two different host IDs would be bound to
the same IP address by the data structure:
node A has IP I1, node A changes its IP to I2,
node B changes its IP to I1. Though nodes
A and B have different host IDs (because they
are unique), the code would try to save hints
towards node B in node A's hint directory,
which should NOT happen.

Relying on the data structure is thus only
safe before migrating hinted handoff to using
host IDs. It may happen that we save a hint
in the hint directory of the wrong node indeed,
but since migration to using host IDs is
a process that only happens once, it's a price
we are ready to pay. It's only imperative to
prevent it from happening in normal
circumstances.
2024-04-27 22:31:07 +02:00
Dawid Medrek
aa4b06a895 db/hints: Enforce providing IP in get_ep_manager()
We drop the default argument in the function's signature.
Also, we adjust the code of change_host_filter() to
be able to perform calls to get_ep_manager().
2024-04-27 22:31:07 +02:00
Dawid Medrek
d0f58736c8 db/hints: Introduce hint_directory_manager
This commit introduces a new class responsible
for keeping track of mappings IP-host ID.
Before hinted handoff is migrated to using
host IDs, hint directories still have to
represent IP addresses. However, since
we identify endpoint managers by host IDs
already, we need to be able to associate
them with the directories they manage.
This class serves this purpose.
2024-04-27 22:31:07 +02:00
Dawid Medrek
f9af01852d db/hints/resource_manager: Update function description
The current description of the function
`space_watchdog::scan_one_ep_dir` is
not up-to-date with the function's
signature. This commit updates it.
2024-04-27 22:31:07 +02:00
Dawid Medrek
59d49c5219 db/hints: Coroutinize space_watchdog::scan_one_ep_dir() 2024-04-27 22:31:07 +02:00
Dawid Medrek
8fd9c80387 db/hints: Expose update lock of space watchdog
We expose the update lock of space watchdog
to be able to prevent it from scanning
hint directories. It will be necessary in an
upcoming commit when we will be renaming hint
directories and possibly removing some of them.
Race conditions are unacceptable, so resource
manager cannot be able to access the directory
during that time.
2024-04-27 22:31:07 +02:00
Dawid Medrek
934e4bb45e db/hints: Add function for migrating hint directories to host ID
We add a function that will be used while
migrating hinted handoff to using host IDs.
It iterates over existing hint directories
and tries to rename them to the corresponding
host IDs. In case of a failure, we remove
it so that at the end of its execution
the only remaining directories are those
that represent host IDs.
2024-04-27 22:31:04 +02:00
Dawid Medrek
e36f853f9b db/hints: Take both IP and host ID when storing hints
The store_hint() method starts taking both an IP
and a host ID as its arguments. The rationale
for the change is depending on the stage of
the cluster (before an upgrade to the
host-ID-based hinted handdof and after it),
we might need to create a directory representing
either an IP address, or a host ID.

Because locator::topology can change in the
before obtaining the host ID we pass
and when the function is being executed,
we need to pass both parameters explicitly
to ensure the consistency between them.
2024-04-27 20:35:58 +02:00
Dawid Medrek
063d4d5e91 db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
We extract the initialization of endpoint managers
from the start method of the hint manager
to a separate function and make it handle directories
that represent either IP addresses, or host IDs;
other directories are ignored.

It's necessary because before Scylla is upgraded
to a version that uses host-ID-based hinted handoff,
we need to continue only managing IP directories.
When Scylla has been upgraded, we will need to handle
host ID directories.

It may also happen that after an upgrade (but not
before it), Scylla fails while renaming
the directories, so we end up with some of them
representing IP address, and some representing
host IDs. After these changes, the code handles
that scenario as well.
2024-04-27 20:35:53 +02:00
Dawid Medrek
cfd03fe273 db/hints: Migrate to locator::host_id
We change the type of node identifiers
used within the module and fix compilation.
Directories storing hints to specific nodes
are now represented by host IDs instead of
IPs.
2024-04-26 22:44:04 +02:00
Dawid Medrek
1af7fa74e8 db/hints: Remove noexcept in do_send_one_mutation()
While the function is marked as noexcept, the returned
future can in fact store an exception. We remove the
specifier to reflect the actual behavior of the
function.
2024-04-26 22:44:04 +02:00
Dawid Medrek
54ae9797b9 service: Add locator::host_id to on_leave_cluster
We extend the function
endpoint_lifecycle_subscriber::on_leave_cluster
by another argument -- locator::host_id.
It's more convenient to have a consistent
pair of IP and host ID.
2024-04-26 22:44:03 +02:00
Dawid Medrek
a36387d942 service: Fix indentation 2024-04-26 22:44:03 +02:00
Dawid Medrek
c585444c60 db/hints: Fix indentation 2024-04-26 22:44:03 +02:00
Pavel Emelyanov
7f2742893e view: Open-code one line lambda checking if table exists
Continuation of the previous patch. The lambda in question used to be a
heavyweight(y) code, but now it's one-liner. And it's only called once,
so no more point in keeping it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-26 20:19:38 +03:00
Pavel Emelyanov
a3e76f9c93 view: Use non-throwoing check if a table exists
Two places in view code check if a table exists by finding its schema ID
and catching no_such_column_family exception. That's a bit heavyweight,
database has column_family_exists() method for such cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-26 20:17:35 +03:00
Pavel Emelyanov
5e23493d25 test: Add test for how quarantined sstables registry entries are loaded
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-26 16:54:43 +03:00
Pavel Emelyanov
ba512c52a5 sstable_directory: Use sstable location to initialize registry lister
When populating sstables on boot a bunch of sstable_directory objects is
created. For each sstable there come three -- one for normal, quarantine
and staging state. Each is initialized with sstable location (which is
now a datadir/ks_name/cf_name-and-uuid) and the desired state (a enum
class). When created, the directory object wires up component lister,
depending on which storage options are provided. For local sstables a
legacy filesystem lister is created and it's initialized with a path
where to search files for -- location + / + string(state). But for s3
sstables, that keep their entries in registry, the lister is
errorneously initialized with the same location + / + string(state)
value. The mistake is that sstables in registry keep location and state
in different columns, so for any state lister should query registry with
the same location value (then it filters entries by state on its own).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-26 16:36:47 +03:00
Kamil Braun
d8313dda43 Merge 'db: config: move consistent-topology-changes out of experimental and make it the default for new clusters' from Patryk Jędrzejczak
We move consistent cluster management out of experimental and
make it the default for new clusters in 6.0. In code, we make the
`consistent-topology-changes` flag unused and assumed to be true.

In 6.0, the topology upgrade procedure will be manual and
voluntary, so some clusters will still be using the gossip-based
topology even though they support the raft-based topology.
Therefore, we need to continue testing the gossip-based topology.
This is possible by using the `force-gossip-topology-changes` flag
introduced in scylladb/scylladb#18284.

Ref scylladb/scylladb#17802

Closes scylladb/scylladb#18285

* github.com:scylladb/scylladb:
  docs: raft.rst: update after removing consistent-topology-changes
  treewide: fix indentation after the previous patch
  db: config: make consistent-topology-changes unused
  test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
  test: test_read_required_hosts: run with force-gossip-topology-changes
  storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
  storage_service: join_token_ring: fix finish_setup_after_join calls
2024-04-26 14:45:29 +02:00
Botond Dénes
b96f28356a Merge 'api/storage_service: convert runtime_error from repair to http error ' from Kefu Chai
in `set_repair()`, despite that the repair is performed asynchronously,
we check the options specified by client immediately, and throw
`std::runtime_error`, if any of them is not supported.

before this change, these unhandled exceptions are translated to HTTP
500 error but the underlying HTTP router. but this is misleading, as
these errors are caused by client, not server.

in this change, we handle the `runtime_error`, and translate them
into `httpd::bad_param_exception`, so that the client can have
HTTP 400 (Bad Request) instead of HTTP 500 (Internal Server Error),
and with informative error message.

for instance, if we apply repair with "small_table_optimization" enabled
on a keyspace with tablets enabled. we should have an HTTP error 400
with "The small_table_optimization option is not supported for tablet repair"
as the body of the error. this would much more helpful.

Closes scylladb/scylladb#18389

* github.com:scylladb/scylladb:
  api/storage_service: convert runtime_error from repair to http error
  repair: change runtime_error to invalid_argument in do_repair_start()
  api/storage_service: coroutinize set_repair()
2024-04-26 13:27:51 +03:00
Patryk Jędrzejczak
3a100cd16c test: test_raft_recovery_stuck: ensure raft upgrade procedure failed
We have log browsing in test.py now, so we can fix this TODO easily.

Closes scylladb/scylladb#18425
2024-04-26 10:16:49 +02:00
Asias He
62a9ecff51 repair: Cleanup repair history status entry for tablet
The entry in the repair history map that is used to track repair status
internally for each repair job should be removed after the repair job is
done. We do the same for vnode repairs.

This patch adds the missing automatic history cleanup code which is
missed in the initial tablet repair support in commit 54239514af,
which does not support repair history update back then.

Refs #17046

Closes scylladb/scylladb#18434
2024-04-26 10:56:45 +03:00
Botond Dénes
044fd7a3ec Merge 'Move some view updating methods from table to view_update_generator' from Pavel Emelyanov
The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself.

Moving them there lets removing the view concurrency semaphore from keyspace and table, thus reducing the cross-components dependencies.

Closes scylladb/scylladb#18421

* github.com:scylladb/scylladb:
  replica: Do not carry view concurrency semaphore pointer around
  view: Get concurrency semaphore via database, not table
  view_update_generator: Mark mutate_MV() private
  view: Move view_update_generator methods' code
  view: Move table::generate_and_propagate_view_updates into view code
  view: Move table::populate_views() into view_update_generator class
2024-04-26 10:55:38 +03:00
Botond Dénes
d566eec89a Merge 'treewide: remove {dclocal_,}read_repair_chance options' from Kefu Chai
dclocal_read_repair_chance and read_repair_chance have been removed in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910. if we expose these properties via DDL, Cassandra would fail to consume the CQL statement creating the table when performing migration from Scylla to Cassandra 4.x, as the latter does not understand these properties anymore.

currently the default values of `dc_local_read_repair_chance` and `read_repair_chance` are both "0". so they are practically disabled, unless user deliberately set them to a value greater than 0.

also, as a side effect, Cassandra 4.x has better support of Python3. the cqlsh shipped along with Cassandra 3.11.16 only supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py it errors out if the system only provides python3 with the error of
```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.

so, in this change, we deprecate these two options.

Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18087

* github.com:scylladb/scylladb:
  docs: drop documents related to {,dclocal_}read_repair_chance
  treewide: remove {dclocal_,}read_repair_chance options
2024-04-26 10:48:47 +03:00
Michał Chojnowski
c1146314a1 docs: clarify that DELETE can be used with USING TIMEOUT
The current text seems to suggest that `USING TIMEOUT` doesn't work with `DELETE` and `BATCH`. But that's wrong.

Closes scylladb/scylladb#18424
2024-04-26 10:48:17 +03:00
Pavel Emelyanov
4ac30e5337 view-builder: Print correct exception in built ste exception handler
Inside .handle_exception() continuation std::current_exception() doesn't
work, there's std::exception ex argument to handler's lambda instead

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349
2024-04-26 09:58:45 +03:00
Kefu Chai
0bbaded4ce api/storage_service: convert runtime_error from repair to http error
in `set_repair()`, despite that the repair is performed asynchronously,
we check the options specified by client immediately, and throw
`std::runtime_error`, if any of them is not supported.

before this change, these unhandled exceptions are translated to HTTP
500 error but the underlying HTTP router. but this is misleading, as
these errors are caused by client, not server. and the error message
is missing in the HTTP error message when performing the translation.

in this change, we handle the `runtime_error`, and translate them
into `httpd::bad_param_exception`, so that the client can have
HTTP 400 (Bad Request) instead of HTTP 500 (Internal Server Error),
and with informative error message.

for instance, if we apply repair with "small_table_optimization" enabled
on a keyspace with tablets enabled. we should have an HTTP error 400
with "The small_table_optimization option is not supported for tablet repair"
as the body of the error. this would much more helpful.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-26 14:25:15 +08:00
Kefu Chai
9de9f401a1 repair: change runtime_error to invalid_argument in do_repair_start()
if an error is caused by the option provided by user, would be better
to throw an `std::invalid_argument` instead of `std::runtime_error`,
so that the caller can make a better decision when handling the
thrown exceptions.

so, in this change, we change the exceptions raise directly in
`repair_service::do_repair_start()` from `std::runtime_error` to
`std::invalid_argument`. please note, in the lambda named `host2ip`,
since the hostname is not provided by user, so we are not changing
the exception type in that lambda.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-26 14:24:45 +08:00
Kefu Chai
d737ba1ab2 api/storage_service: coroutinize set_repair()
before this change, `set_repair()` uses a lambda for handling
the client-side requests. and this works great. but the underlying
`repair_start()` throws if any of the given options is not sane.
and we don't handle any of these throw exceptions in `set_repair()`,
from client's point of view, it would get an HTTP 500 error code,
which implies an "Internal Server Error". but actually, we should
blame the client for the error, not the server.

so, to prepare the error handling, let's take the opportunity to
coroutinize the lambda handling the request, so that we can handle
the exception in a more elegant way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-26 14:24:03 +08:00
Michał Jadwiszczak
7f839f727e docs/cql: update ALTER TABLE docs 2024-04-26 07:01:08 +02:00
Michał Jadwiszczak
7cbce78480 test/cqlpytest: add test for prepared ALTER TABLE ... DROP ... USING TIMESTAMP ? 2024-04-26 07:01:02 +02:00
Botond Dénes
7cbe5c78b4 install.sh: use the native nodetool directly
* tools/java b810e8b00e...4ee15fd9ea (1):
  > install.sh: don't install nodetool into /usr/bin

Add a bin/nodetool and install it to bin/ in install.sh. This script
simply forwards to scylla nodetool and it is the replacement for the
Java nodetool, which is dropped from the java-tools's install.sh, in the
submodule update also included in this patch.
With this change, we now hardwire the usage of the native nodetool, as
*the* nodetool, with the intermediary nodetool wrapper script removed
from the picture.
Bash completion was copied from the java tools repository and it is now
installed by the scylla package, together with nodetool.

The Java nodetool is still available as as a fall-back, in case the
native nodetool has problems, at the path of
/opt/scylladb/share/cassandra/bin/nodetool.

Testing

I tested upgrades on a DEB and RPM distro: Ubuntu and Fedora.
First I installed scylla-5.4, then I installed the packages for this PR.
On Ubuntu, I had to use dpkg -i --auto-deconfigure, otherwise, dpkg would
refuse to install the new packages because they break the old ones. No
extra flags were required on Fedora.
In both cases, /usr/bin/nodetool was changed from a thunk calling the
Java nodetool (from 5.4) to the native launcher script from this PR.
/opt/scylladb/share/cassandra/bin/nodetool remained in place and still
works after the upgrade.

I also verified that --nonroot installs also work. Nodetool works both
when called with an absolute path, or when ~/scylladb/bin is added to
$PATH.

Fixes: #18226
Fixes: #17412

Closes scylladb/scylladb#18255

[avi: reset submodule to actual hash we ended up with]
2024-04-25 22:52:00 +03:00
Michał Jadwiszczak
27a4331dcd test/cql-pytest: remove xfail from alter table with timestamp tests
Previous patch introduced `ALTER TABLE ... DROP .. USING TIMESTAM ...`
so those test should no longer fail.

Refs #9929
2024-04-25 21:27:40 +02:00
Michał Jadwiszczak
80f0357436 cql3/statements: extend ALTER TABLE ... DROP to allow specifying timestamp of column drop 2024-04-25 21:27:40 +02:00
Michał Jadwiszczak
7dc0d068c0 cql3/statements: pass query_options to prepare_schema_mutations()
The object is needed to get timestamp from attributes (in a case when
the statement was prepared with parameter marker).
2024-04-25 21:27:40 +02:00
Michał Jadwiszczak
998a65a4f6 cql3/statements: add bound terms to alter table statement
Until now, alter table couldn't take any parameter marker, so the bound
terms were always 0.
Adding `USING TIMESTAMP` to `ALTER TABLE ... DROP` also adds possibility
to prepare a alter table statement with a paramenter marker.
2024-04-25 21:27:40 +02:00
Michał Jadwiszczak
d268641c27 cql3/statements: split alter_table_statement into raw and prepared
Currently alter table doesn't prepare any parameters so raw statement
and prepared one could be the same class.
Later commit will add attributes to the statement, which needs to be
prepared, that's why I'm splitting.
2024-04-25 21:27:40 +02:00
Michał Jadwiszczak
1c5563ba44 schema: allow to specify timestamp of dropped column
In order to drop a column with specified timestamp, we need to
allow it in out schema class.
2024-04-25 21:27:40 +02:00
Avi Kivity
c2b8ca7d71 Merge 'cql3: statements: change default tombstone_gc mode for tablets' from Aleksandra Martyniuk
Repair may miss some tablets that migrated across nodes.
So if tombstones expire after some timeout, then we can
have data resurrection.

Set default tombstone_gc mode to "repair" for tables which
use tablets (if repair is required).

Fixes: #16627.

Closes scylladb/scylladb#18013

* github.com:scylladb/scylladb:
  test: check default value of tombstone_gc
  test: topology: move some functions to util.py
  cql3: statements: change default tombstone_gc mode for tablets
2024-04-25 19:18:37 +03:00
Lakshmi Narayanan Sreethar
6af2659b57 sstables: reclaim_memory_from_components: do not update _recognised_components
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.

Fixes #18398

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18400
2024-04-25 19:15:59 +03:00
Raphael S. Carvalho
4a5fdc5814 table: Remove outdated FIXME about sstable spanning multiple tablets
The FIXME was added back then because we thought the interface of
compaction_group_for_sstable might have to be adjusted if a sstable
were allowed to temporarily span multiple tablets until it's split,
but we have gone a different path.
If a sstable's key range incorrectly spans more than one tablet,
that will be considered a bug and an exception is thrown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18410
2024-04-25 17:21:11 +03:00
Marcin Maliszkiewicz
7085339f72 cql3: test: include get_mutations_internal log in test.py
We have a concurrent modification conflict in tests and suspect
duplicated requests but since we don't log successful requests
we have no way to verify if that's the case. get_mutations_internal log
will help to tell wchich nodes are trying to push auth or
service levels mutations into raft.

Refs scylladb/scylladb#18319

Closes scylladb/scylladb#18413
2024-04-25 17:17:53 +03:00
Botond Dénes
0234b4542a Merge '[github] add PR template and action to verify PR tasks was completed' from Yaron Kaikov
Today with the backport automation, the developer added the relevant backport label, but without any explanation of why

Adding the PR template with a placeholder for the developer to add his decision about backport yes or no

The placeholder is marked as a task, so once the explanation is added, the task must be checked as completed

Also adding another check to the PR summary will make it clear to the maintainer/reviewer if the developer explained about backport

Closes scylladb/scylladb#18275

* github.com:scylladb/scylladb:
  [github] add action to verify PR tasks was completed
  [github] add PR template
2024-04-25 17:14:50 +03:00
Pavel Emelyanov
18cc2cfa31 replica: Generalize snapshot details for single table/snapshot dir
There are two places that get total:live stats for a table snapshot --
database::get_snapshot_details() and table::get_snapshot_details(). Both
do pretty similar thing -- walk the table/snapshots/ directory, then
each of the found sub-directory and accumulate the found files' sizes
into snapshot details structure.

Both try to tell total from live sizes by checking whether an sstable
component found in snapshots is present in the table datadir. The
database code does it in a more correct way -- not just checks the file
presense, but also compares if it's a hardlink on the snapshot file,
while the table code just checks if the file of the same name exists.

This patch does both -- makes both database and table call the same
helper method for a single snapshot details, and makes the generalized
version use more elaborated collision check, thus fixing the per-table
details getting behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18347
2024-04-25 17:12:42 +03:00
Asias He
1ca779d287 streaming: Fix use after move in fire_stream_event
The event is used in a loop.

Found by clang-tidy:

```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
        listener->handle_stream_event(std::move(event));
                                                ^
streaming/stream_result_future.cc:80:39: note: move occurred here
        listener->handle_stream_event(std::move(event));
                                      ^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
        listener->handle_stream_event(std::move(event));
                                                ^
```

Fixes #18332

Closes scylladb/scylladb#18333
2024-04-25 16:48:54 +03:00
Botond Dénes
2c8bd99cd4 Merge 'Coroutinize view_builder::stop()' from Pavel Emelyanov
It's pretty straightforward, but prior to that, exception handling needs some care

Closes scylladb/scylladb#18378

* github.com:scylladb/scylladb:
  view-builder: Coroutinize stop()
  view_builder: Do not try to handle step join exceptions on stop
2024-04-25 16:48:25 +03:00
Kefu Chai
014a069ed2 build: cmake: require {fmt} >= 9.0.0
we are using `fmt::ostream_formatter` which was introduced in
{fmt} v9.0.0, see https://github.com/fmtlib/fmt/releases/tag/9.0.0 .

before this change, we depend on Seastar to find {fmt}. but
the minimal version of {fmt} required by Seastar is 5.0.0, which
cannot fulfill the needs to build scylladb.

in this change, we find {fmt} package in scylla, and specify the
minimal required version of 9.0.0, so the build can fail at the
configuration time. {fmt} v8 could be still being used by users.
for instance, ubuntu:jammy comes with libfmt-dev 8.1.1. and
ubuntu:jammy is EOL in Apr 2027, see
https://ubuntu.com/about/release-cycle .

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18386
2024-04-25 16:35:08 +03:00
Amnon Heiman
dfea50a7e9 db/config.cc add metric family config from file
Metric family config lets a user configure the metric family aggregate labels.
This patch modifies the existing relable-config from file to accept
metric family config.

Similar to the existing relable_config, it adds a metric_family_configs
section.  For example, the following configuration demonstrates changing
aggregate labels by name and regular expression.

```
metric_family_configs:
 - name: storage_service
   aggregate_labels: [shard]
 - regex: (storage_proxy.*)
   aggregate_labels: [shard, scheduling_group_name]
```

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#18339
2024-04-25 16:03:39 +03:00
Kefu Chai
e9b31cb4c1 test: locator_topology: s/get0()/get()/
this change addresses the leftover of 9e8805bb49

Refs 9e8805bb49

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18390
2024-04-25 16:03:01 +03:00
Patryk Jędrzejczak
55b011902e docs: raft.rst: update after removing consistent-topology-changes 2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
0d428a3857 treewide: fix indentation after the previous patch 2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
3a34bb18cd db: config: make consistent-topology-changes unused
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
77342ffb34 test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
In the following commit, we make the `consistent-topology-changes`
experimental feature unused. Then, all unit tests in the boost suite
will start using the raft-based topology by default. Unfortunately,
tests with multiple `single_node_cql_env::run_in_thread` calls
(usually coming from the `do_with_cql_env_thread` calls) would fail.

In a noninitial `run_in_thread` call, a node is started as if it
booted for the first time. On the other hand, it has its persistent
state from previous boots. Hence, the node can behave strangely and
unexpectedly. In particular, `SYSTEM.TOPOLOGY` is not empty and the
assertion that expects it to be empty when we boot for the first
time fails.

We fix this issue by making noninitial `run_in_thread` calls
behave as normal restarts.

After this change,
`test_schema_digest_does_not_change_with_disabled_features` starts
failing. This test copies the data directory before booting for the
first time, so the new
`_sys_ks.local().build_bootstrap_info().get();` makes the node
incorrectly think it restarts. Then, after noticing it is not a part
of group 0, the node would start the raft upgrade procedure if we
didn't run it in the raft RECOVERY mode. This procedure would get
stuck because it depends on messaging being enabled even if the node
communicates only with itself and messaging is disabled in boost tests.
2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
88038d958a test: test_read_required_hosts: run with force-gossip-topology-changes
In one of the following commits, we make the
`consistent-topology-changes` experimental feature unused. Then,
all unit tests in the boost suite will start using the raft-based
topology by default. Unfortunately, some tests would start failing
and `test_read_required_hosts` is one of them.

`tablet_cql_test_config` in `tablets_test.cc` doesn't use
`consistent-topology-changes`, so all test cases in this file
run incorrectly wit the gossip-based topology changes. With
`consistent-topology-changes`, only `test_read_required_hosts`
fails. The failure happens on `auto table2 = add_table(e).get();`:
```
ERROR 2024-04-17 11:14:16,083 [shard 0:main] load_balancer -
Replica 9b94d710-fbfb-11ee-9c4f-448617b47e11:0 of tablet
9b94d713-fbfb-11ee-9c4f-448617b47e11:0 not found in topology
```
This test case needs to be investigated and rewritten so that
it passes with the raft-based topology. However, we don't want
this issue to block the process of making the
`consistent-topology-changes` experimental feature unused. We
leave a FIXME and we will open a new issue to track it.
2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
213f2f6882 storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
The `force_gossip_based_join` error injection does exactly what we
expect from `force-gossip-topology-changes` so we can do a simple
replacement.

We prefer a flag over an error injection because we will use it
a lot in CI jobs' configurations, some tests, manual testing etc.
It's much more convenient.

Moreover, the flag can be used in the release mode, so we re-enable
all tests that were disabled in release mode only because of using
the `force_gossip_based_join` error injection.

The name of the `force-gossip-topology-changes` flag suggests that
using it should always succesfully force the gossip-based topology
or, if forcing is not possible, the booting should fail. We don't
want a node with `force-gossip-topology-changes=true` that silently
boots in the raft-topology mode. We achieve it by throwing a
runtime error from `join_cluster` in two cases:
- the node is restarting in the cluster that is using raft topology
- the node is joining the cluster that is using raft topology
2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
d6ee540efc storage_service: join_token_ring: fix finish_setup_after_join calls
The `topology_change_enabled` parameter of `finish_setup_after_join`
is used underneath to enable pulling raft topology snapshots in two
cases:
- when the node joins the cluster that uses the raft-based topology,
- when the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES feature is enabled.
The first case happens in the first changed call.
`_raft_experimental_topology` always equals true there. The second
call was incorrect as it could enable pulling snapshots before
SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES was enabled. It could cause
problems during rolling upgrade to 6.0. For more information see
07aba3abc4.
2024-04-25 14:33:21 +02:00
Yaron Kaikov
5e63f74984 [github] add action to verify PR tasks was completed
Adding another check to the PR summary will make it clear to the maintainer/reviewer if the developer explained about backport
2024-04-25 15:24:22 +03:00
Botond Dénes
aaa76d4c0e Merge 'Getting per-table snapshot size is racy wrt creating new snapshots' from Pavel Emelyanov
The API endpoint in question calls table::get_snapshot_detail() which just walks table/snapshots/ directory. This can clash with creating a new snapshot. Database-wide walk is guarded with snapshot-ctl's locking, so should the per-table API do

Closes scylladb/scylladb#18414

* github.com:scylladb/scylladb:
  snapshot: Get per-table snapshot size under snapshot lock
  snapshot: Move per-table snap API to other snapshot endpoints
2024-04-25 14:57:52 +03:00
Kefu Chai
e5b30ae2ad partition_version: do not rereference moved variable
in `partition_entry::apply_to_incomplete()`, we pass `*dst_snp` and
`std::move(dst_snp)` to build the capture variable list of a lambda,
but the order of evaluation of these variables are unspecified.
fortunately, we haven't run into any issues at this moment. but this
is not future-proof. so, let's avoid this by storing a reference
of the dereferenced smart pointer, and use it later on.

this issue is identified by clang-tidy:

```
/home/kefu/dev/scylladb/mutation/partition_version.cc:500:53: warning: 'dst_snp' used after it was moved [bugprone-use-after-move]
  500 |             cur = partition_snapshot_row_cursor(s, *dst_snp),
      |                                                     ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:502:23: note: move occurred here
  502 |             dst_snp = std::move(dst_snp),
      |                       ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:500:53: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
  500 |             cur = partition_snapshot_row_cursor(s, *dst_snp),
      |                                                     ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:501:57: warning: 'src_snp' used after it was moved [bugprone-use-after-move]
  501 |             src_cur = partition_snapshot_row_cursor(s, *src_snp, can_move),
      |                                                         ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:504:23: note: move occurred here
  504 |             src_snp = std::move(src_snp),
      |                       ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:501:57: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
  501 |             src_cur = partition_snapshot_row_cursor(s, *src_snp, can_move),
      |                                                         ^
```

Fixes #18360
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18361
2024-04-25 14:57:52 +03:00
Pavel Emelyanov
8aaa09ee97 replica: Do not carry view concurrency semaphore pointer around
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:27:43 +03:00
Pavel Emelyanov
2ee7c41139 view: Get concurrency semaphore via database, not table
The _view_update_concurrency_sem field on database propagates itself via
keyspace config down to table config and view_update_generator then
grabs one via table:: helper. That's an overkil, view_update_generator
has direct reference on the database and can get this semaphore from
there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:25:57 +03:00
Pavel Emelyanov
3d8b572d96 view_update_generator: Mark mutate_MV() private
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:25:40 +03:00
Pavel Emelyanov
bc4552740f view: Move view_update_generator methods' code
Now when the two methods belong to another class, move the code itself
to db/view , where the class itself resides.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:24:20 +03:00
Pavel Emelyanov
c2bf6b43b2 view: Move table::generate_and_propagate_view_updates into view code
Similarly to populate_views() method, this one also naturally belongs to
view_update_generator class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:20:06 +03:00
Pavel Emelyanov
670c7c925c view: Move table::populate_views() into view_update_generator class
The method in question has little to do with table, effectively it only
needs stats and consurrency semaphore. And the semaphore in question is
obtained from table indirectly, it really resides on database. On the
other hand, the method carries lots of bits from db::view, e.g. the
view_update_builder class, memory_usage_of() helper and a bit more.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:17:20 +03:00
Kefu Chai
e5bcea6718 docs: drop documents related to {,dclocal_}read_repair_chance
since "read_repair_chance" and "dclocal_read_repair_chance" are
removed, and not supported anymore. let's stop documenting them.

Refs #3502

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-25 17:15:27 +08:00
Kefu Chai
c323c93fa4 treewide: remove {dclocal_,}read_repair_chance options
dclocal_read_repair_chance and read_repair_chance have been removed
in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910.
if we expose the properties via DDL, Cassandra would fails to consume
the CQL statement to creating the table when performing migration
from Scylla to Cassandra 4.x, as the latter does not understand
these properties anymore.

currently the default values of `dc_local_read_repair_chance` and
`read_repair_chance` are both "0". so this is practically disabled,
unless user deliberately set them to a value greater than 0.

also, as a side effect, Cassandra 4.x has better support of
Python3. the cqlsh shipped along with Cassandra 3.11.16 only
supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py
it errors out if the system only provides python3 with the error
of

```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.

so, in this change, we deprecate these two options.

Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-25 17:15:27 +08:00
Botond Dénes
ca26899c36 Merge 'sstable: large data handler needs to count range tombstones as rows' from Ferenc Szili
When issuing warnings about partitions with the number of rows above a configured threshold, the  large partitions handler does not take into consideration the number of range tombstone markers in the total rows count. This fix adds the number of range tombstone markers to the total number of rows and saves this total in system.large_partitions.rows (if it is above the threshold). It also adds a new column range_tombstones to the system.large_partitions table which only contains the number of range tombstone markers for the given partition.

This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.

Closes scylladb/scylladb#18346

* github.com:scylladb/scylladb:
  sstables: add docs changes for system.large_partitions
  sstable: large data handler needs to count range tombstones as rows
2024-04-25 11:38:30 +03:00
Pavel Emelyanov
e97abfc473 tablets: Fix indentation after flat-hash-map patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18364
2024-04-25 11:36:37 +03:00
Kefu Chai
0b5a861961 build: cmake: reference build_mode with ${scylla_build_mode_${CMAKE_BUILD_TYPE}}
before this change, if we generate the building system with plain
`Ninja`, instead of `Ninja Multi-Config` using cmake, the build
fails, because `${scylla_build_mode_${CMAKE_BUILD_TYPE}}` is not
defined. so the profile used for building the rust library would be
"rust-", which does not match any of the profiles defined by
`Cargo.toml`.

in this change, we use `$CMAKE_BUILD_TYPE` instead of "$config". as
the former is defined for non-multi generator. while the latter
is. see https://cmake.org/cmake/help/latest/generator/Ninja%20Multi-Config.html

with this change, we are able to generate the building system properly
with the "Ninja" generator. if we just want to run some static analyzer
against the source tree or just want to build scylladb with a single
configuration, the "Ninja" generator is a good fit.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18353
2024-04-25 10:51:54 +03:00
Pavel Emelyanov
ae4c1c44ec snapshot: Get per-table snapshot size under snapshot lock
Walking per-table snapshot directory without lock is racy. There's
snapshot-ctl locking that's used to get db-wide snapshot details, it
should be used to get per-table snapshot details too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 10:05:51 +03:00
Pavel Emelyanov
186b36165e snapshot: Move per-table snap API to other snapshot endpoints
So that they are collected in one place and to facilitate next patch
that's going to use snapshot-ctl for per-table API too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 10:05:01 +03:00
Anna Stuchlik
b5d256a991 doc: add Scylla Doctor to the docs
This commit adds the description and usage instructions of Scylla Doctor
to the "How to Report a ScyllaDB Problem" page.

Scylla Doctor replaces Health Check Report, so the description of
and references to the latter are removed with this commit.

Fixes https://github.com/scylladb/scylladb/issues/16276

Closes scylladb/scylladb#17617
2024-04-25 09:50:38 +03:00
Asias He
037bba0ca1 repair: Turn on off_strategy_updater for tablet repair
The off_strategy_updater is used during repair to update the automatic
off strategy timer so off_strategy compaction starts automatically only
after repair finishes. We still use off_strategy for tablets. So we
should still turn on the updater.

The update logic is used for vnode tables. We can share the code with
vnode table instead of copying, but since there is a possibility we
could disable off_strategy for tablets. We'd better postpone the code
sharing as follow ups. If later, we decide to disable off_strategy for
tablets, we can remove the updater for tablet.

Fixes #18196

Closes scylladb/scylladb#18266
2024-04-25 09:03:07 +03:00
Kamil Braun
3363f6e1e8 Merge 'Fix write failures during node replace with same IP with topology over raft' from Gleb
Currently a new node is marked as alive too late, after it is already
reported as a pending node. The patch series changes replace procedure
to be the same as what node_ops do: first stop reporting the IP of the
node that is being replaced as a natural replica for writes, then mark
the IP is alive, and only after that report the IP as a pending endpoint.

Fixes: scylladb/scylladb#17421

* 'gleb/17421-fix-v2' of github.com:scylladb/scylla-dev:
  test_replace_reuse_ip: add data plane load
  sync_raft_topology_nodes: make replace procedure similar to nodeops one
  storage_service: topology_coordinator: fix indentation after previous patch
  storage_service: topology coordinator: drop ring check in node_state::replacing state
2024-04-24 17:09:01 +02:00
Petr Gusev
bc98774f83 test_replace_reuse_ip: add data plane load
In this commit we enhance test_replace_reuse_ip
to reproduce #17421. We create a test table and run
insert queries on it while the first node is
being replaced. In this form the test fails
without the fix from the previous commit. Some
insert requests fail with [Unavailable exception]
"Cannot achieve consistency level for cl QUORUM...".
2024-04-24 16:59:24 +03:00
Gleb Natapov
4614fedd22 sync_raft_topology_nodes: make replace procedure similar to nodeops one
In replace-with-same-ip a new node calls gossiper.start_gossiping
from join_token_ring with the 'advertise' parameter set to false.
This means that this node will fail echo RPC-s from other nodes,
making it appear as not alive to them. The node changes this only
in storage_service::join_node_response_handler, when the topology
coordinator notifies it that it's actually allowed to join the
cluster. The node calls _gossiper.advertise_to_nodes({}), and
only from this moment other nodes can see it as alive.

The problem is that topology coordinator sends this notification
in topology::transition_state::join_group0 state. In this state
nodes of the cluster already see the new node as pending,
they react with calling tmpr->add_replacing_endpoint and
update_topology_change_info when they process the corresponding
raft notification in sync_raft_topology_nodes. When the new
token_metadata is published, assure_sufficient_live_nodes
sees the new node in pending_endpoints. All of this happen
before the new node handled successful join notification,
so it's not alive yet. Suppose we had a cluster with three
nodes and we're replacing on them with a fourth node.
For cl=qurum assure_sufficient_live_nodes throws if
live < need + pending, which in our case becomes 2 < 2 + 1.
The end effect is that during replace-with-same-ip
data plane requests can fail with unavailable_exception,
breaking availability.

The patch makes boot procedure more similar to node ops one.
It splits the marking of a node as "being replaced" and adding it to
pending set in to different steps and marks it as alive in the middle.
So when the node is in topology::transition_state::join_group0 state
it marked as "being replaced" which means it will no longer be used for
reads and writes. Then, in the next state, new node is marked as alive and
is added to pending list.

fixes scylladb/scylladb#17421
2024-04-24 16:59:22 +03:00
Kamil Braun
1297b9a322 mutation: mutation_by_size_splitter: skip last mutation if it's empty
Currently, the last mutation emitted by split_mutation could be empty.
It can happen as follows:
- consume range tombstone change at pos `1` with some timestamp
- consume clustering row at pos `2`
- flush: this will create mutation with range tombstone (1, 2) and
  clustering row at 2
- consume range tombstone change at pos `2` with no timestamp (i.e.
  closing rtc)
- end of partition

since the closing rtc has the same position as the clustering row, no
additional range tombstone will be emitted -- the only necessary range
tombstone was already emitted in the previous mutation.

On the other hand, `test_split_mutations` expects all emitted mutations
to be non-empty, which is a sane expectation for this function.

The test catched a case like this with random-seed=629157129.

Fix this by skipping the last mutation if it turns out to be empty.

Fixes: scylladb/scylladb#18042

Closes scylladb/scylladb#18375
2024-04-24 16:25:31 +03:00
Raphael S. Carvalho
71682aebdd storage_service: Fix use-after-move in storage_service::node_ops_cmd_handler
```
service/storage_service.cc:4288:62: warning: 'req' used after it was moved [bugprone-use-after-move]
            node_ops_insert(ops_uuid, coordinator, std::move(req.ignore_nodes), [this, coordinator, req = std::move(req)] () mutable {
                                                             ^
service/storage_service.cc:4288:107: note: move occurred here
            node_ops_insert(ops_uuid, coordinator, std::move(req.ignore_nodes), [this, coordinator, req = std::move(req)] () mutable {
                                                                                                          ^
service/storage_service.cc:4288:62: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
            node_ops_insert(ops_uuid, coordinator, std::move(req.ignore_nodes), [this, coordinator, req = std::move(req)] () mutable {
                                                             ^

```

if evaluation order is right-to-left (GCC), req is moved first, and req.ignore_nodes will be empty,
so nodes that should be ignored will still be considered, potentially resulting in a failure during
replace.

https://godbolt.org/z/jPcM6GEx1

courtesy of clang-tidy.

Fixes #18324.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18366
2024-04-24 15:36:28 +03:00
Aleksandra Martyniuk
06f6aaf2cf test: check default value of tombstone_gc
Add a test which checks whether default tombstone_gc value is properly
set and if it does not override previous setting.
2024-04-24 10:57:51 +02:00
Aleksandra Martyniuk
e0d498716a test: topology: move some functions to util.py
Move functions marked with asynccontextmanager from test/topology/test_mv.py
to test/topology/util.py so that they can be used in other tests.
2024-04-24 10:57:51 +02:00
Aleksandra Martyniuk
58f72f9019 cql3: statements: change default tombstone_gc mode for tablets
Currently, if tombstone_gc mode isn't specified for a table,
then "timeout" is used by default. With tablets, running
"nodetool repair -pr" may miss a tablet if it migrated across
the nodes. Then, if we expire tombstones for ranges that
weren't repaired, we may get data resurrection.

Set default tombstone_gc mode value for DDLs that don't
specify it. It's set to "repair" for tables which use tablets
unless they use local replication strategy or rf = 1.
Otherwise it's set to "timeout".
2024-04-24 10:42:10 +02:00
Kamil Braun
8876b9b0ef test/pylib: random_tables: use IF NOT EXISTS when creating keyspace
Due to Python driver's unexpected behavior, "CREATE KEYSPACE" statement
may sometimes get executed twice (scylladb/python-driver#317), leading
to "Keyspace ... already exists" error in our tests
(scylladb/scylladb#17654). Work around this by using "IF NOT EXISTS".

Fixes: scylladb/scylladb#17654

Closes scylladb/scylladb#18368
2024-04-24 10:09:26 +03:00
Pavel Emelyanov
1b1b86809d view-builder: Coroutinize stop()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-23 20:43:42 +03:00
Pavel Emelyanov
eaf78fca04 view_builder: Do not try to handle step join exceptions on stop
Commit 23c891923e (main: make sure view_builder doesn't propagate
semaphore errors) ignored some exceptions that could pop up from the
_build_step/do_build_step() serialized action, since they are "benign"
on stop.

Later there came b56b10a4bb (view_builder: do_build_step: handle
unexpected exceptions) that plugged any exception from the action in
question, regardless of they happen on stop or run-time.

Apparently, the latter commit supersedes  the former.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-23 20:26:14 +03:00
Anna Stuchlik
c0e4f3e646 doc: include OSS-specific info as separate files
This commit excludes OSS-specific links and content
added in https://github.com/scylladb/scylladb/pull/17624
to separate files and adds the include directive `.. scylladb_include_flag::`
to include these files in the doc source files.

Reason: Adding the link to the Open Source upgrade guide
(/upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology)
breaks the Enterprise documentation because the Enterprise docs don't
contain that upgrade guide.  We must add separate files for OSS and
Enterprise to prevent failing the Enterprise build and breaking the
links.

Closes scylladb/scylladb#18372
2024-04-23 16:59:05 +02:00
Raphael S. Carvalho
fa2dc5aefa sstables: Fix use-after-move in an error path of FS-based sstable writer
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
        remove_file(file_path).get();
                    ^
sstables/storage.cc:145:64: note: move occurred here
    auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));

```

It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.

courtesy of clang-tidy.

Fixes #18323.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18367
2024-04-23 17:19:55 +03:00
Pavel Emelyanov
f5f57dc817 table: No need to open directory in snapshot_exists()
In order to check if a snapshot of a certain name exists the checking
method opens directory. It can be made with more lightweight call.

Also, though not critical, is that it fogets to close it.

Coroutinuze the method while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18365
2024-04-23 17:19:24 +03:00
Botond Dénes
572003c469 Merge 'Cleanup the way snapshot details are propagated via API' from Pavel Emelyanov
There's a database::get_snapshot_details() method that returns collection of all snapshots for all ks.cf out there and there are several *snapshot_details* aux structures around it. This PR keeps only one "details" and cleans up the way it propagates from database up to the respective API calls.

Closes scylladb/scylladb#18317

* github.com:scylladb/scylladb:
  snapshot_ctl: Brush up true_snapshots_size() internals
  snapshot_ctl: Remove unused details struct
  snapshot_ctl: No double recoding of details
  database,snapshots: Move database::snapshot_details into snapshot_ctl
  database,snapshots: Make database::get_snapshot_details() return map, not vector
  table,snapshots: Move table::snapshot_details into snapshot_ctl
2024-04-23 16:28:25 +03:00
Kefu Chai
9e8805bb49 repair, transport: s/get0()/get()/
`future::get0()` was deprecated in favor of `future::get()`. so
let's use the latter instead. this change silences a `-Wdeprecated`
warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18357
2024-04-23 15:48:54 +03:00
Kefu Chai
4fd9b2a791 reader: silence false-positive use-after-move warning
when compiling with clang-tidy, it warngs:
```
[6/9] Building CXX object readers/CMakeFiles/readers.dir/multishard.cc.o
/home/kefu/dev/scylladb/readers/multishard.cc:84:53: warning: 'fut_and_result' used after it was moved [bugprone-use-after-move]
   84 |                 auto result = std::get<1>(std::move(fut_and_result));
      |                                                     ^
/home/kefu/dev/scylladb/readers/multishard.cc:79:34: note: move occurred here
   79 |             _read_ahead_future = std::get<0>(std::move(fut_and_result));
      |                                  ^
```

but this warning is but a false alarm, as we are not really moving away
the *whole* tuple, we are just move away an element from it. but
clang-tidy cannot tell which element we are actually moving. so, silence
both places of `std::move()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18363
2024-04-23 15:47:50 +03:00
Botond Dénes
5a1e3b25d0 Merge 'Sanitize sstables::directory_semaphore usage' from Pavel Emelyanov
The semaphore in question is used to limit parallelism of manipulations with table's sstables. It's currently used in two places -- sstable_directory (mainly on boot) and by table::take_snapshot() to take snapshot. For the latter, there's also a database -> sharded<directory_semaphore> reference.

This PR sanitizes the semaphore usage. The results are
- directory_semaphore no longer needs to friend several classes that mess with its internals
- database no longer references directory_semaphore

Closes scylladb/scylladb#18281

* github.com:scylladb/scylladb:
  database: Keep local directory_semaphore to initialize sstables managers
  database: Don't reference directory_semaphore
  table: Use directory semaphore from sstables manager
  table: Indentation fix after previous patch
  table: Use directory_semaphore for rate-limited snapshot taking
  sstables: Move directory_semaphore::parallel_for_each() to header
  sstables: Move parallel_for_each_restricted to directory_semaphore
  table: Use smp::all_cpus() to iterate over all CPUs locally
2024-04-23 13:54:52 +03:00
Kefu Chai
ab4de1f470 auth: move fmt::formatter<auth::resource_kind> up
before this change, `fmt::formatter<auth::resource_kind>` is located at
line 250 in this file, but it is used at line 130. so, {fmt} is not able
to find it:

```
/usr/include/fmt/core.h:2593:45: error: implicit instantiation of undefined template 'fmt::detail::type_is_unformattable_for<auth::resource_kind, char>'
 2593 |     type_is_unformattable_for<T, char_type> _;
      |                                             ^
/usr/include/fmt/core.h:2656:23: note: in instantiation of function template specialization 'fmt::detail::parse_format_specs<auth::resource_kind, fmt::detail::compile_parse_context<char>>' requested here
 2656 |         parse_funcs_{&parse_format_specs<Args, parse_context_type>...} {}
      |                       ^
/usr/include/fmt/core.h:2787:47: note: in instantiation of member function 'fmt::detail::format_string_checker<char, auth::resource_kind, auth::resource_kind>::format_string_checker' requested here
 2787 |       detail::parse_format_string<true>(str_, checker(s));
      |                                               ^
/home/kefu/dev/scylladb/auth/resource.hh:130:29: note: in instantiation of function template specialization 'fmt::basic_format_string<char, auth::resource_kind &, auth::resource_kind &>::basic_format_string<char[65], 0>' requested here
  130 |             seastar::format("This resource has kind '{}', but was expected to have kind '{}'.", actual, expected)) {
      |                             ^
/usr/include/fmt/core.h:1578:45: note: template is declared here
 1578 | template <typename T, typename Char> struct type_is_unformattable_for;
      |                                             ^
```

in this change, `fmt::formatter<auth::resource_kind>` is moved up to
where `auth::resource_kind` is defined. so that it can be used by its
caller.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18316
2024-04-23 12:11:17 +03:00
Kefu Chai
48048c2f94 utils/to_string: include fmt/std.h if fmt >= v10
in to_string.hh, we define the specialization of
`fmt::formatter<std::optional<T>>`, which is available in {fmt} v10
and up. to avoid conditionally including `utils/to_string.hh` and
`fmt/std.h` in all source files formatting `std::optional<T>` using
{fmt}, let's include `fmt/std.h` if {fmt}'s verison is greater or equal
to 10. in future, we should drop the specialization and use `fmt/std.h`
directly.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18325
2024-04-23 12:09:05 +03:00
Kefu Chai
e2d5054c53 types: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18326
2024-04-23 12:08:23 +03:00
Pavel Emelyanov
4445ee9a55 Merge 'install-dependencies.sh: add more dependencies for debian' from Kefu Chai
in this changeset, we install `libxxhash-dev` and `cargo` for debian, and install cxxbridge for all distros, so that at least debian can be built without further preparations after running `install-dependencies.sh`.

Closes scylladb/scylladb#18335

* github.com:scylladb/scylladb:
  install-dependencies.sh: move cargo out of fedora branch
  install-dependencies: install cargo and wabt for debian
  install-dependencies.sh: add libxxhash-dev for debian
2024-04-23 12:04:47 +03:00
Lakshmi Narayanan Sreethar
de6570e1ec serializer_impl, sstables: fix build failure due to missing includes
When building scylla with cmake, it fails due to missing includes in
serializer_impl.hh and sstables/compress.hh files. Fix that by adding
the appropriate include files.

Fixes #18343

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18344
2024-04-23 12:03:51 +03:00
Kefu Chai
826f413cad thrift: avoid use-after-move in make_non_overlapping_ranges()
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.

in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.

this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.

Refs 822a315dfa
Fixes #18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18358
2024-04-23 12:02:09 +03:00
Kefu Chai
ad2c26824a main: do not reference moved variable
before this change, we dereference `linfo` after moving it away.
and clang-tidy warns us like

```
[19/171] Building CXX object CMakeFiles/scylla.dir/main.cc.o
/home/kefu/dev/scylladb/main.cc:559:12: warning: 'linfo' used after it was moved [bugprone-use-after-move]
  559 |     return linfo.host_id;
      |            ^
/home/kefu/dev/scylladb/main.cc:558:36: note: move occurred here
  558 |     sys_ks.local().save_local_info(std::move(linfo), snitch.local()->get_location(), broadcast_address, broadcast_rpc_address).get();
      |                                    ^
```

the default-generated move constructor of `local_info` uses the
default-generated move constructor of `locator::host_id`, which in turn
use the default-generated move constructor of
`utils::tagged_uuid<struct host_id_tag>`, and then `utils::UUID` 's
move constructor. since `UUID` does not contain any moveable resources,
what it has is but two `int64_t` member variables. so this is a benign
issue. but still, it is distracting.

in this change, we keep the value of `host_id` locally, and return it
instead to silence this warning, and to improve the maintainability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18362
2024-04-23 11:58:58 +03:00
Patryk Jędrzejczak
14911051ee db: config: introduce force-gossip-topology-changes
We are going to make the `consistent-topology-changes` experimental
feature unused in 6.0. However, the topology upgrade procedure will
be manual and voluntary, so some 6.0 clusters will be using the
gossip-based topology. Therefore, we need to continue testing the
gossip-based topology. The solution is introducing a new flag,
`force-gossip-topology-changes`, that will enforce the gossip-based
topology in a fresh cluster.

In this patch, we only introduce the parameter without any effect.
Here is the explanation. Making `consistent-topology-changes` unused
and introducing `force-gossip-topology-changes` requires adjustments
in scylla-dtest. We want to merge changes to scylladb and scylla-dtest
in a way that ensures all tests are run correctly during the whole
process. If we merged all changes to scylladb first, before merging
the scylla-dtest changes, all tests would run with the raft-based
topology and the ones excluded in the raft-based topology would fail.
We also can't merge all changes to scylla-dtest first. However, we
can follow this plan:
1. scylladb: merge this patch
2. scylla-dtest: start using `force-gossip-topology-changes`
   in jobs that run without the raft-based topology
3. scylladb: merge the rest of the changes
4. scylla-dtest: merge the rest of the changes

Ref scylladb/scylladb#17802

Closes scylladb/scylladb#18284
2024-04-23 09:42:46 +02:00
Botond Dénes
275ed9a9bc replica/mutation_dump: create_underlying_mutation_sources(): remove false move
transformed_cr is moved in a loop, in each iteration. This is harmless
because the variable is const and the move has no effect, yet it is
confusing to readers and triggers false positives in clang-tidy
(moved-from object reused). Remove it.

Fixes: #18322

Closes scylladb/scylladb#18348
2024-04-23 01:21:36 +02:00
Kamil Braun
e9285e5c04 Merge 'various fixes for topology coordinator' from Gleb
The series contains fixes for some problems found during scalability
testing and one clean up patch.

Ref: scylladb/scylladb#17545

* 'gleb/topology-fixes-v4' of github.com:scylladb/scylla-dev:
  gossiper: disable status check for endpoints in raft mode
  storage_service: introduce a setter for topology_change_kind
  topology coordinator: drop unused structure
  storage_service: yield in get_system_mutations
2024-04-22 17:37:47 +02:00
Calle Wilund
82d97da3e0 commitlog: Remove (benign) use-after-move
Fixes #18329

named_file::assign call uses old object "known_size" after a move
of the object. While this is wholly ok, since the attribute accessed
will not be modified/destroyed by the move, it causes warnings in
"tidy" runs, and might confuse or cause real errors should impl. change.

Closes scylladb/scylladb#18337
2024-04-22 17:20:19 +03:00
Ferenc Szili
c528597a84 sstables: add docs changes for system.large_partitions
This commit updates the documentation changes for the new column
range_tombstones in system.large_partitions
2024-04-22 15:25:41 +02:00
Ferenc Szili
98bec4e02a sstable: large data handler needs to count range tombstones as rows
When issuing warnings about partitions with the number of rows above a configured threshold,
the  large partitions handler does not take into consideration the number of range tombstone
markers in the total rows count. This fix adds the number of range tombstone markers to the
total number of rows and saves this total in system.large_partitions.rows (if it is above
the threshold). It also adds a new column range_tombstones to the system.large_partitions
table which only contains the number of range tombstone markers for the given partition.

This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
2024-04-22 15:24:18 +02:00
Kefu Chai
ff04375016 main: drop unused namespace alias
`fs` namespace alias was introduced in ff4d8b6e85, but we don't
use it anymore. so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18308
2024-04-22 13:50:28 +03:00
Nadav Har'El
59b40484c8 Update seastar submodule
* seastar 8fabb30a...2b43417d (6):
  > future: deprecate future::get0()
  > build: do not export valgrind with export()
  > http: deprecate buggy path param[]
  > http/request: add get_path_param method
  > http/request: get_query_param refactor
  > http/util: add path_decode method

Refs #5883 (fixes https://github.com/scylladb/seastar/issues/725 and
provides a new API to read the decoded paths).

Closes scylladb/scylladb#18297
2024-04-22 11:12:49 +03:00
Kefu Chai
85406a450c install-dependencies.sh: move cargo out of fedora branch
so that we install cxxbridge-cmd on all distros, and cxxbridge is
available when building scylladb.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-22 15:41:20 +08:00
Kefu Chai
835742af6d install-dependencies: install cargo and wabt for debian
cargo is used for installing cxxbridge-cmd, which is in turn used
when building the cxx bindings for the rust modules. so we need it
on all distros.

in this change, we add cargo for debian. so that we don't have
build failure like:

```
CMake Error at rust/CMakeLists.txt:32 (find_program):
  Could not find CXXBRIDGE using the following names: cxxbridge
```

for similar reason, we also need wabt, which provides wasm2wat,
without which, we'd have

```
CMake Error at test/resource/wasm/CMakeLists.txt:1 (find_program):
  Could not find WASM2WAT using the following names: wasm2wat
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-22 15:41:20 +08:00
Kefu Chai
a70a288627 install-dependencies.sh: add libxxhash-dev for debian
libxxhash is used for building on both fedora and debian. `xxhash-devel`
is already listed in `fedora_packages`, we should have its counterpart
in `debian_base_packages`. otherwise the build on debian and its
derivatives could fail like

```
CMake Error at /usr/local/share/cmake-3.29/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find xxHash (missing: xxhash_LIBRARY xxhash_INCLUDE_DIR) (found
  version "")
Call Stack (most recent call first):
  /usr/local/share/cmake-3.29/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  cmake/FindxxHash.cmake:30 (find_package_handle_standard_args)
  CMakeLists.txt:75 (find_package)
```

if we are using CMake to generate the building system. if we use
`configure.py` to generate `build.ninja`, the build would fails at
build time.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-22 15:22:51 +08:00
Gleb Natapov
0c77e96b0b storage_service: topology_coordinator: fix indentation after previous patch 2024-04-21 18:53:21 +03:00
Gleb Natapov
b8ee8911ca storage_service: topology coordinator: drop ring check in node_state::replacing state
Always modify topology metadata in node_state::replacing state. There is
no dependency on the ring value at all.
2024-04-21 18:53:04 +03:00
Gleb Natapov
06e6ed09ed gossiper: disable status check for endpoints in raft mode
Gossiper automatically removes endpoints that do not have tokens in
normal state and either do not send gossiper updates or are dead for a
long time. We do not need this with topology coordinator mode since in
this mode the coordinator is responsible to manage the set of nodes in
the cluster. In addition the patch disables quarantined endpoint
maintenance in gossiper in raft mode and uses left node list from the
topology coordinator to ignore updates for nodes that are no longer part
of the topology.
2024-04-21 16:36:07 +03:00
Gleb Natapov
0e3f92fa49 storage_service: introduce a setter for topology_change_kind
In the next patch we will extend it to have other side affects.
2024-04-21 16:36:07 +03:00
Gleb Natapov
040c6ca0c1 topology coordinator: drop unused structure 2024-04-21 16:36:07 +03:00
Gleb Natapov
d0a00f3489 storage_service: yield in get_system_mutations
Yield in a loop that converts a result to canonical_mutation. We
observed stalls for very large tables.
2024-04-21 16:36:07 +03:00
Avi Kivity
87b08c957f Merge 'treewide: drop FMT_DEPRECATED_OSTREAM macro and homebrew range formatters' from Kefu Chai
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<.
with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Closes scylladb/scylladb#17968

* github.com:scylladb/scylladb:
  treewide: do not define FMT_DEPRECATED_OSTREAM
  treewide: include fmt/ranges.h and/or fmt/std.h
  utils/managed_bytes: add support for fmt::to_string() to bytes and friends
2024-04-20 22:25:00 +03:00
Mikołaj Grzebieluch
65cfb9b4e0 storage_service: skip wait_for_gossip_to_settle if topology changes are based on raft
Waiting for gossip to settle slows down the bootstrap of the cluster.
It is safe to disable it if the topology is based on Raft.

Fixes scylladb/scylladb#16055

Closes scylladb/scylladb#17960
2024-04-20 17:56:51 +02:00
Pavel Emelyanov
67a408447f snapshot_ctl: Brush up true_snapshots_size() internals
Previous patches broke indentation in this method. Fix it by shortening
the summation loop with the help of std::accumulate()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 21:06:06 +03:00
Pavel Emelyanov
50add3314d snapshot_ctl: Remove unused details struct
Now the details are manipulated via some other structs and this one can
just be removed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 20:04:34 +03:00
Pavel Emelyanov
e8f10be12e snapshot_ctl: No double recoding of details
Currently database::get_snapshot_details() returns a collection of
snapshots. The snapshot_ctl converts this collection into similarly
looking one with slightly different structures inside. The resulting
collection is converted one more time on the API layer into another
similarly looking map.

This patch removes the intermediate conversion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 20:04:32 +03:00
Pavel Emelyanov
8ec3f057a8 database,snapshots: Move database::snapshot_details into snapshot_ctl
Similarly to how it looks like for table::snapshot_details

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 20:04:29 +03:00
Pavel Emelyanov
f6bc283bbb database,snapshots: Make database::get_snapshot_details() return map, not vector
So that it's in-sync with table::get_snapshot_details(). Next patches
will improve this place even further.

Also, there can be many snapshots and vector can grow large, but that's
less of an issue here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 20:04:25 +03:00
Pavel Emelyanov
a36c13beb3 table,snapshots: Move table::snapshot_details into snapshot_ctl
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 19:59:34 +03:00
Kefu Chai
372a4d1b79 treewide: do not define FMT_DEPRECATED_OSTREAM
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.

in this change,

* utils: drop the range formatters in to_string.hh and to_string.c, as
  we don't use them anymore. and the tests for them in
  test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
  we are not able to print the elements using operator<< anymore
  after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
  due to a bug in {fmt} v9, {fmt} fails to format a range whose
  element type is `basic_sstring<uint8_t>`, as it considers it
  as a string-like type, but `basic_sstring<uint8_t>`'s char type
  is signed char, not char. this issue does not exist in {fmt} v10,
  so, in this change, we add a workaround to explicitly specialize
  the type trait to assure that {fmt} format this type using its
  `fmt::formatter` specialization instead of trying to format it
  as a string. also, {fmt}'s generic ranges formatter calls the
  pair formatter's `set_brackets()` and `set_separator()` methods
  when printing the range, but operator<< based formatter does not
  provide these method, we have to include this change in the change
  switching to {fmt}, otherwise the change specializing
  `fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
  for comparing values. but without the operator<< based formatters,
  Boost.Test would not be able to print them. after removing
  the homebrew formatters, we need to use the generic
  `boost_test_print_type()` helper to do this job. so we are
  including `test_utils.hh` in tests so that we can print
  the formattable types.
* treewide: add "#include "utils/to_string.hh" where
  `fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:57:36 +08:00
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Kefu Chai
01f13850cb utils/managed_bytes: add support for fmt::to_string() to bytes and friends
in 3835ebfcdc, `fmt::formatter` were added to `bytes` and friend, but
their `format()` methods were intentionally implemented as plain
methods, which only acccept `fmt::format_context`. it was a decision
decision. the intention was to reduce the usage of template, to speed
up the compilation at the expense of dropping the support of other
appenders, notably the one used by `fmt::to_string()`, where the type
of "format_context" is not a `fmt::format_context`, but a string
appender. but it turns out we still have users in tests using
`fmt::to_string()`, to convert, for instance, `bytes` to `std::string`,

so, to make their life easier, we add the templated `format()` to
these types. an alternative is to change the callers to use something
like `fmt::format("{}", v)`, which is less convenient though.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:13 +08:00
Kefu Chai
5ab527e669 main: do not echo parsed options when calling scylla interactively
in 2f0f53ac, we added logging of parsed command line options so that we
can see how scylla is launched in case it fails to boot. but when scylla
is called interactively in console. this echo is a little bit annoying.
see following console session
```console
$ scylla --help-loggers
Scylla version 5.5.0~dev-0.20240419.3c9651adf297 with build-id 7dd6a110e608535e5c259a03548eda6517ab4bde starting ...
command used: "./RelWithDebInfo/scylla --help-loggers"
pid: 996503
parsed command line options: [help-loggers]
Available loggers:
    BatchStatement
    LeveledManifest
    alter_keyspace
    alter_table
...
```

so in this change, we check if the stdin is associated with a terminal
device, if that the case, we don't print the scylla version, parsed
command line and pid. and the interactive session looks like:

```console
$ scylla --help-loggers
Available loggers:
    BatchStatement
    LeveledManifest
    alter_keyspace
    alter_table
```
no more distracting information printed. the original behavior
can be tested like:

```console
$ : | ./RelWithDebInfo/scylla --help-loggers
```

assuming scylla is always launched with systemd, which connects
stdin to /dev/null. see
https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Logging%20and%20Standard%20Input/Output
. so this behavior is preserved with this change.

Refs #4203

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18309
2024-04-19 15:00:05 +03:00
Raphael S. Carvalho
223214439b compaction: Disconsider active tables in the hourly compaction reevaluation
This hourly reevaluation is there to help tablets that have very low
write activity, which can go a long time without flushing a memtable,
and it's important to reevaluate compaction as data can get expired.
Today it can happen that we reevaluate a table that is being compacted
actively, which is waste of cpu as the reevaluation will happen anyway
when there are changes to sstable set. This waste can be amplified with
a significant tablet count in a given shard.
Eventually, we could make the revaluation time per table based on
expiration histogram, but until we get there, let's avoid this waste
by only reevaluating tables that are compaction idle for more than 1h.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18280
2024-04-19 14:33:40 +03:00
Pavel Emelyanov
ba58b71eea database: Keep local directory_semaphore to initialize sstables managers
Now database is constructed with sharded<directory_semaphore>, but it no
longer needs sharded, local is enough.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
53909da390 database: Don't reference directory_semaphore
It was only used by table taking snapshot code. Now it uses sstables
manager's reference and database can stop carrying it around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
be5bc38cde table: Use directory semaphore from sstables manager
It's natural for a table to itarate over its sstables, get the semaphore
from the manager of sstables, not from database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
7e7dd2649b table: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
2fced3c557 table: Use directory_semaphore for rate-limited snapshot taking
The table::take_snapshot() limits its parallelizm with the help of
direcoty semaphore already, but implements it "by hand". There's already
parallel_for_each() method on the dir.sem. class that does exactly that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
6514c67fae sstables: Move directory_semaphore::parallel_for_each() to header
It's a template and in order to use it in other .cc files it's more
convenient to move it into a header file

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
ad1a9d4c11 sstables: Move parallel_for_each_restricted to directory_semaphore
In order not to make sstable_directory mess with private members of this
class. Next patch will also make use of this new method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
0d2178202d table: Use smp::all_cpus() to iterate over all CPUs locally
Currently it uses irange(0, smp::count0), but seastar provides
convenient helper call for the very same range object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Kefu Chai
a5dae74aee doc: update nodetool setlogginglevel sample output with most recent loggers list
in order to reduce the confusion like:

> I cannot find foobar in the list, is it supported?

also, take this opportunity to use "console" instead of "shell" for
rendering the code block. it's a better fit in this case. since we
are using pygment for syntax highlighting,
see https://pygments.org/docs/lexers/#pygments.lexers.shell.BashSessionLexer
for details on the "console" lexer.

and add a prompt before the command line, so that "console" lexer
can render the command line and output better.

also, add a note explaining that user should refer the output of
`scylla` to see the list of logger classes.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18311
2024-04-19 13:25:39 +03:00
Kefu Chai
c04654e865 storage_service: capture this explicitly
clang-19 complains with `-Wdeprecated-this-capture`:

```
/home/kefu/dev/scylladb/service/storage_service.cc:5837:22: error: implicit capture of 'this' with a capture default of '=' is deprecated [-Werror,-Wdeprecated-this-capture]
 5837 |         auto* node = get_token_metadata().get_topology().find_node(dst.host);
      |                      ^
/home/kefu/dev/scylladb/service/storage_service.cc:5830:44: note: add an explicit capture of 'this' to capture '*this' by reference
 5830 |     co_await transit_tablet(table, token, [=] (const locator::tablet_map& tmap, api::timestamp_type write_timestamp) {
      |                                            ^
      |                                             , this
```

since https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0806r2.html
was approved, see https://eel.is/c++draft/depr.capture.this. and newer
versions of C++ compilers implemented it, so we need to capture `this`
explicitly to be more standard compliant, and to be more future-proof
in this regard.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18306
2024-04-19 10:05:57 +03:00
Kefu Chai
168ade72f8 treewide: replace formatter<std::string_view> with formatter<string_view>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.

this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:

```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
  254 |     return formatter<std::string_view>::format(it->second, ctx);
      |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
 2759 |   FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
      |                      ^      ~~~~~~~~~~~~
```

because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18299
2024-04-19 07:44:07 +03:00
Avi Kivity
6e487a49aa Merge 'toolchain: support building an optimized clang' from Takuya ASADA
This is complete version of #12786, since I take over the issue from @mykaul.
Update from original version are:
- Support ARM64 build (disable BOLT for now since it doesn't functioning)
- Changed toolchain settings to get current scylla able to build (support WASM, etc)
- Stop git clone scylladb repo, create git-archive of current scylla directory and import it
-  Update Clang to 17.0.6
- Save entire clang directory for BUILD mode, not just /usr/bin/clang binary
- Implemented INSTALL_PREBUILT mode to install prebuilt image which built in BUILD mode

Note that this patch drops cross-build support of frozen toolchain, since building clang and scylla multiple time in qemu-user-static will very slow, it's not usable.
Instead, we should build the image for each architecture natively.

----

This is a different way attempting to combine building an optimized clang (using LTO, PGO and BOLT, based on compiling ScyllaDB) to dbuild. Per Avi's request, there are 3 options: skip this phase (which is the current default), build it and build + install it to the default path.

Fixes: #10985
Fixes: scylladb/scylla-enterprise#2539

Closes scylladb/scylladb#17196

* github.com:scylladb/scylladb:
  toolchain: support building an optimized clang
  configure.py: add --build-dir option
2024-04-18 19:20:23 +00:00
Anna Stuchlik
a3481a4566 doc: document the system_auth_v2 feature
This commit includes updates related to replacing system_auth with system_auth_v2.

- The keyspace name system_auth is renamed to system_auth_v2.
- The procedures are updated to account for system_auth_v2.
- No longer required system_auth RF changes are removed from procedures.
- The information is added that if the consistent topology updates feature
  was not enabled upon upgrade from 5.4, there are limitations or additional
  steps to do (depending on the procedure).
  The files with that kind of information are to be found in _common folders
  and included as needed.
- The upgrade guide has been updated to reflect system_auth_v2 and related impacts.

Closes scylladb/scylladb#18077
2024-04-18 18:33:49 +02:00
Kefu Chai
21b03d2ce3 topology_coordinator: remove unused variable
when compiling the tree with clang-19, it complains:

```
/home/kefu/dev/scylladb/service/topology_coordinator.cc:1968:31: error: variable 'reject' set but not used [-Werror,-Wunused-but-set-variable]
 1968 |                     if (auto* reject = std::get_if<join_node_response_params::rejected>(&validation_result)) {
      |                               ^
1 error generated.
```
so, despite that we evaluate the assignment statement to see it
evaluates to true or false, the compiler still believes that the
variable is not used. probably, the value of the statement is not
dependent on the value of the value being assigned. either way,
let's use `std::holds_alternative<..>` instead of `std::get_if<..>`,
to silence this warning, and the code is a little bit more compacted,
in the sense of less tokens in the `if` statement.

in order to be self-contained, we take the opportunity to
include `<variant>` in this source file, as a function declared
in this header is used.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18291
2024-04-18 18:04:56 +03:00
Amnon Heiman
e8410848a8 Update seastar submodule
This patch updates the seastar submodule to get the latest safety patch for the metric layer.
The latest patch allows manipulating metric_families early in the
start-up process and is safer if someone chooses to aggregate summaries.

* seastar f3058414...8fabb30a (4):
  > stall-analyser: improve stall pattern matching
  > TLS: Move background BYE handshake to engine::run_in_background
  > metrics.cc: Safer set_metric_family_configs
  > src/core/metrics.cc: handle SUMMARY add operator

Closes scylladb/scylladb#18293
2024-04-18 18:02:28 +03:00
Tomasz Grabiec
393cb54c01 Merge 'Generalize tablet transition API calls' from Pavel Emelyanov
Recently there had been added add_tablet_replica and del_tablet_replica API calls that copy big portion of the existing move_tablet API call's logic. This PR generalizes the common parts

Closes scylladb/scylladb#18272

* github.com:scylladb/scylladb:
  tablets: Generalize transition mutations preparation
  tablets: Generalize tablet-already-in-transition check
  tablets: Generalize raft communications for tablet transition API calls
  tablets: Drop src vs dst equality check from move_tablet()
2024-04-18 14:42:10 +02:00
Anna Stuchlik
ad81f9f56a doc: replace Scylla with ScyllaDB in Glossary
This commit replaces "Scylla" with "ScyllaDB" on the Glossary page.

The product has been rebranded as "ScyllaDB".

Closes scylladb/scylladb#18296
2024-04-18 14:59:23 +03:00
Kamil Braun
9c2a836607 Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 298a7fcbf2, reversing
changes made to 5cf53e670d.

The change made CI flaky.

Fixes: scylladb/scylladb#18278
2024-04-18 11:50:41 +02:00
Yaron Kaikov
44d1ffe86b [github] add PR template
Today with the backport automation, the developer added the relevant backport label, but without any explanation of why

Adding the PR template with a placeholder for the developer to add his decision about backport yes or no

The placeholder is marked as a task, so once the explanation is added, the task must be checked as completed
2024-04-17 15:40:32 +03:00
Pavel Emelyanov
1b2cd56bcc tablets: Generalize transition mutations preparation
Tablet transition handlers prepare two mutations -- one for tablets
table, that sets transition state, transition mode and few others; and
another one for topology table that "activates" the tablet_migration
state for topology coordinator.

The latter is common to all three handlers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 12:01:51 +03:00
Pavel Emelyanov
3beccb8165 tablets: Generalize tablet-already-in-transition check
Continuation of the previous patch -- there's a common sanity check of
tablet transition API handlers, namely that this tablet is not in
transition already.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 12:01:02 +03:00
Pavel Emelyanov
14923812ad tablets: Generalize raft communications for tablet transition API calls
There are three transition calls -- move, add replica and del replica --
and all three work similarly. In a loop they try to get guard for raft
operation, then perform sanity checks on topology state, then prepare
mutations and then try to apply them to raft. After the loop finishes
all three wait for transition for the given tablet to complete.

This patch generalizes the raft kicking loop and the transition
completion waiting code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 11:59:03 +03:00
Pavel Emelyanov
c4d538320e tablets: Drop src vs dst equality check from move_tablet()
The code here looks like this

    if src.host == dst.host
        throw "Local migration not possible"

    if src == dst
        co_return;

The 2nd check is apparently never satisfied -- if src == dst this means
that src.host == dst.host and it should have thrown already

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 11:57:10 +03:00
Marcin Maliszkiewicz
7e749cd848 auth: don't run legacy migrations on auth-v2 startup
We won't run:
- old pre auth-v1 migration code
- code creating auth-v1 tables

We will keep running:
- code creating default rows
- code creating auth-v1 keyspace (needed due to cqlsh legacy hack,
  it errors when executing `list roles` or `list users` if
  there is no system_auth keyspace, it does support case when
  there is no expected tables)
2024-04-15 12:09:39 +02:00
Marcin Maliszkiewicz
d40ff81c5b auth: fix indent in password_authenticator::start 2024-04-15 12:09:32 +02:00
Marcin Maliszkiewicz
3e8cf20b98 auth: remove unused service::has_existing_legacy_users func 2024-04-15 12:09:32 +02:00
Benny Halevy
a7c5fccab9 test: chunked_managed_vector_test: add test_push_back_using_existing_element
chunked_managed_vector isn't susceptible to #18072
since the elements it keeps are managed_ref<T> and
those must be constructed by the caller, before reallocation
takes place, so it's safer with that respect.

The unit test is added to verify that and prevent
regressions in the future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-11 14:34:50 +03:00
Benny Halevy
2afc584f08 utils: chunked_vector: reserve_for_emplace_back: emplace before migrating existing elements
Currently, push_back or emplace_back reallocate the last chunk
before constructing the new element.

If the arg passed to push_back/emplace_back is a reference to an
existing element in the vector, reallocating the last chunk will
invalidate the arg reference before it is used.

This patch changes the order when reallocating
the last chunk in reserve_for_emplace_back:
First, a new chunk_ptr is allocated.
Then, the back_element is emplaced in the
newly allocated array.
And only then, existing elements in the current
last chunk are migrated to the new chunk.
Eventually, the new chunk replaces the existing chunk.

If no reservation is requried, the back element
is emplaced "in place" in the current last chunk.

Fixes scylladb/scylladb#18072

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-11 14:34:48 +03:00
Benny Halevy
2c0e40a21f utils: chunked_vector: push_back: call emplace_back
When pushing an element with a value referencing
an exisiting element in the vector, we currently
risking use-after-free when that element gets moved
to a reallocated chunk, if capacity needs to be reserved,
by that, invaliding the refernce to the existing element
before it is used.

This patch prepares for fixing that in the emplace path
by converging to a single code path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-11 14:33:43 +03:00
Benny Halevy
882bb21903 utils: chunked_vector: define min_chunk_capacity
Expose the number of items in the first allocated chunk.
This will be used by a unit test in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-11 14:33:43 +03:00
Benny Halevy
e066f81cb3 utils: chunked*vector: use std::clamp
It is available in the std library since C++17.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-11 14:33:43 +03:00
Yaniv Kaul
bd34f2fe46 toolchain: support building an optimized clang
This is a different way attempting to combine building an optimized clang (using LTO, PGO and BOLT, based on compiling ScyllaDB) to dbuild. Per Avi's request, there are 3 options: skip this phase (which is the current default), build it and build + install it to the default path.

Fixes: #10985
Fixes: scylladb/scylla-enterprise#2539
2024-04-08 22:53:59 +09:00
Takuya ASADA
be3776ec2a configure.py: add --build-dir option
Add --build-dir option to specify build directory.
This is needed for optimized clang support, since it requires to build
Scylla in tools/toolchain/prepare, w/o deleting current build/
directory.
2024-04-01 18:35:42 +09:00
886 changed files with 24034 additions and 7959 deletions

84
.github/actions/setup-build/action.yaml vendored Normal file
View File

@@ -0,0 +1,84 @@
name: setup-build-env
description: Setup Building Environment
inputs:
install_clang_tool:
description: 'install clang-tool'
required: false
default: false
type: boolean
install_clang_tidy:
description: 'install clang-tidy'
required: false
default: false
type: boolean
# use the stable branch
# should be the same as the one used by the compositing workflow
env:
CLANG_VERSION: 18
runs:
using: 'composite'
steps:
- name: Add scylla-ppa repo
shell: bash
run: |
sudo add-apt-repository ppa:scylladb/ppa
- name: Add clang apt repo
if: ${{ inputs.install_clang_tool || inputs.install_clang_tidy }}
shell: bash
run: |
sudo apt-get install -y curl
curl -fsSL https://apt.llvm.org/llvm-snapshot.gpg.key | sudo tee /etc/apt/trusted.gpg.d/apt.llvm.org.asc >/dev/null
repo_component=llvm-toolchain-jammy
# use the development branch if $CLANG_VERSION is empty
if [ -n "$CLANG_VERSION" ]; then
repo_component+=-$CLANG_VERSION
fi
echo "deb http://apt.llvm.org/jammy/ $repo_component main" | sudo tee -a /etc/apt/sources.list.d/llvm.list
sudo apt-get update
- name: Install clang-tools
if: ${{ inputs.install_clang_tools }}
shell: bash
run: |
sudo apt-get install -y clang-tools-$CLANG_VERSION
- name: Install clang-tidy
if: ${{ inputs.install_clang_tidy }}
shell: bash
run: |
sudo apt-get install -y clang-tidy-$CLANG_VERSION
- name: Install GCC-12
# ubuntu:jammy comes with GCC-11. and libstdc++-11 fails to compile
# scylla which defines value type of std::unordered_map in .cc
shell: bash
run: |
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/ppa
sudo apt-get install -y libstdc++-12-dev
- name: Install more build dependencies
shell: bash
run: |
# - do not install java dependencies, which is not only not necessary,
# and they include "python", which is not EOL and not available.
# - replace "scylla-libthrift010" with "libthrift-dev". because
# scylla-libthrift010 : Depends: libssl1.0.0 (>= 1.0.1) but it is not installable
# - we don't perform tests, so minio is not necessary.
sed -i.orig \
-e '/tools\/.*\/install-dependencies.sh/d' \
-e 's/scylla-libthrift010-dev/libthrift-dev/' \
-e 's/(minio_download_jobs)/(true)/' \
./install-dependencies.sh
sudo ./install-dependencies.sh
mv ./install-dependencies.sh{.orig,}
# for ld.lld
sudo apt-get install -y lld-18
- name: Install {fmt} using cooking.sh
shell: bash
run: |
sudo apt-get remove -y libfmt-dev
seastar/cooking.sh -d build-fmt -p cooking -i fmt

18
.github/clang-tidy-matcher.json vendored Normal file
View File

@@ -0,0 +1,18 @@
{
"problemMatcher": [
{
"owner": "clang-tidy",
"pattern": [
{
"regexp": "^([^:]+):(\\d+):(\\d+):\\s+(warning|error):\\s+(.*?)\\s+\\[(.*?)\\]$",
"file": 1,
"line": 2,
"column": 3,
"severity": 4,
"message": 5,
"code": 6
}
]
}
]
}

View File

@@ -1,9 +1,9 @@
import requests
from github import Github
import argparse
import re
import sys
import os
from github import Github
from github.GithubException import UnknownObjectException
try:
github_token = os.environ["GITHUB_TOKEN"]
@@ -23,36 +23,68 @@ def parser():
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--label', type=str, required=True, help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
def add_comment_and_close_pr(pr, comment):
if pr.state == 'open':
pr.create_issue_comment(comment)
pr.edit(state="closed")
def mark_backport_done(repo, ref_pr_number, branch):
pr = repo.get_pull(int(ref_pr_number))
label_to_remove = f'backport/{branch}'
label_to_add = f'{label_to_remove}-done'
current_labels = [label.name for label in pr.get_labels()]
if label_to_remove in current_labels:
pr.remove_from_labels(label_to_remove)
if label_to_add not in current_labels:
pr.add_to_labels(label_to_add)
def main():
# This script is triggered by a push event to either the master branch or a branch named branch-x.y (where x and y represent version numbers). Based on the pushed branch, the script performs the following actions:
# - When ref branch is `master`, it will add the `promoted-to-master` label, which we need later for the auto backport process
# - When ref branch is `branch-x.y` (which means we backported a patch), it will replace in the original PR the `backport/x.y` label with `backport/x.y-done` and will close the backport PR (Since GitHub close only the one referring to default branch)
args = parser()
pr_pattern = re.compile(r'Closes .*#([0-9]+)')
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
processed_prs = set()
# Print commit information
for commit in commits.commits:
print(commit.sha)
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
pr_number = match.group(1)
url = f'https://api.github.com/repos/{args.repository}/issues/{pr_number}/labels'
data = {
"labels": [f'{args.label}']
}
headers = {
"Authorization": f"token {github_token}",
"Accept": "application/vnd.github.v3+json"
}
response = requests.post(url, headers=headers, json=data)
if response.ok:
print(f"Label added successfully to {url}")
pr_number = int(match.group(1))
if pr_number in processed_prs:
continue
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merge the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit,
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'
add_comment_and_close_pr(pr, comment)
else:
print(f"No label was added to {url}")
try:
pr = repo.get_pull(pr_number)
pr.add_to_labels('promoted-to-master')
print(f'master branch, pr number is: {pr_number}')
except UnknownObjectException:
print(f'{pr_number} is not a PR but an issue, no need to add label')
processed_prs.add(pr_number)
if __name__ == "__main__":

View File

@@ -4,6 +4,10 @@ on:
push:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
jobs:
check-commit:
@@ -15,6 +19,8 @@ jobs:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Install dependencies
@@ -23,4 +29,4 @@ jobs:
- name: Run python script
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --label promoted-to-master
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}

63
.github/workflows/clang-tidy.yaml vendored Normal file
View File

@@ -0,0 +1,63 @@
name: clang-tidy
on:
pull_request:
branches:
- master
paths-ignore:
- '**/*.rst'
- '**/*.md'
- 'docs/**'
- '.github/**'
workflow_dispatch:
schedule:
# only at 5AM Saturday
- cron: '0 5 * * SAT'
env:
# use the stable branch
CLANG_VERSION: 18
BUILD_TYPE: RelWithDebInfo
BUILD_DIR: build
CLANG_TIDY_CHECKS: '-*,bugprone-use-after-move'
permissions: {}
# cancel the in-progress run upon a repush
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
clang-tidy:
name: Run clang-tidy
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
submodules: true
- uses: ./.github/actions/setup-build
with:
install_clang_tidy: true
- name: Generate the building system
run: |
cmake \
-DCMAKE_BUILD_TYPE=$BUILD_TYPE \
-DCMAKE_C_COMPILER=clang-$CLANG_VERSION \
-DScylla_USE_LINKER=ld.lld-$CLANG_VERSION \
-DCMAKE_CXX_COMPILER=clang++-$CLANG_VERSION \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_CXX_CLANG_TIDY="clang-tidy-$CLANG_VERSION;--checks=$CLANG_TIDY_CHECKS" \
-DCMAKE_CXX_FLAGS=-DFMT_HEADER_ONLY \
-DCMAKE_PREFIX_PATH=$PWD/cooking \
-G Ninja \
-B $BUILD_DIR \
-S .
# see https://github.com/actions/toolkit/blob/main/docs/problem-matchers.md
- run: |
echo "::add-matcher::.github/clang-tidy-matcher.json"
- name: Build with clang-tidy enabled
run: |
cmake --build $BUILD_DIR --target scylla
- run: |
echo "::remove-matcher owner=clang-tidy::"

3
.gitignore vendored
View File

@@ -18,7 +18,7 @@ CMakeLists.txt.user
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources
/resources
.pytest_cache
/expressions.tokens
tags
@@ -30,3 +30,4 @@ compile_commands.json
.ccls-cache/
.mypy_cache
.envrc
clang_build

3
.gitmodules vendored
View File

@@ -6,6 +6,9 @@
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = tools/jmx
url = ../scylla-jmx

View File

@@ -42,8 +42,6 @@ else()
COMMENT "List configured modes")
endif()
add_compile_definitions(
FMT_DEPRECATED_OSTREAM)
include(limit_jobs)
# Configure Seastar compile options to align with Scylla
set(CMAKE_CXX_STANDARD "20" CACHE INTERNAL "")
@@ -57,6 +55,33 @@ set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
find_package(Sanitizers QUIET)
set(sanitizer_cxx_flags
$<$<IN_LIST:$<CONFIG>,Debug;Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set(ABSL_GCC_FLAGS ${sanitizer_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
set(ABSL_LLVM_FLAGS ${sanitizer_cxx_flags})
endif()
set(ABSL_DEFAULT_LINKOPTS
$<$<IN_LIST:$<CONFIG>,Debug;Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_LINK_LIBRARIES>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_LINK_LIBRARIES>>)
add_subdirectory(abseil)
add_library(absl-headers INTERFACE)
target_include_directories(absl-headers SYSTEM INTERFACE
"${PROJECT_SOURCE_DIR}/abseil")
add_library(absl::headers ALIAS absl-headers)
# Exclude absl::strerror from the default "all" target since it's not
# used in Scylla build and, moreover, makes use of deprecated glibc APIs,
# such as sys_nerr, which are not exposed from "stdio.h" since glibc 2.32,
# which happens to be the case for recent Fedora distribution versions.
#
# Need to use the internal "absl_strerror" target name instead of namespaced
# variant because `set_target_properties` does not understand the latter form,
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# System libraries dependencies
find_package(Boost REQUIRED
@@ -68,7 +93,7 @@ target_link_libraries(Boost::regex
find_package(Lua REQUIRED)
find_package(ZLIB REQUIRED)
find_package(ICU COMPONENTS uc i18n REQUIRED)
find_package(absl COMPONENTS hash raw_hash_set REQUIRED)
find_package(fmt 9.0.0 REQUIRED)
find_package(libdeflate REQUIRED)
find_package(libxcrypt REQUIRED)
find_package(Snappy REQUIRED)
@@ -125,6 +150,8 @@ target_sources(scylla-main
target_link_libraries(scylla-main
PRIVATE
db
absl::headers
absl::btree
absl::hash
absl::raw_hash_set
Seastar::seastar
@@ -237,6 +264,7 @@ target_link_libraries(scylla PRIVATE
target_link_libraries(scylla PRIVATE
seastar
absl::headers
Boost::program_options)
target_include_directories(scylla PRIVATE

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.5.0-dev
VERSION=6.0.5
if test -f version
then
@@ -88,10 +88,13 @@ else
SCYLLA_VERSION=$VERSION
if [ -z "$SCYLLA_RELEASE" ]; then
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1 --abbrev=12)
# For custom package builds, replace "0" with "counter.your_name",
# For custom package builds, replace "0" with "counter.yourname",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
# Do not use any special characters like - or _ in the name above!
# These characters either have special meaning or are illegal in
# version strings.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
elif [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then

1
abseil Submodule

Submodule abseil added at d7aaad83b4

View File

@@ -27,7 +27,8 @@ target_link_libraries(alternator
cql3
idl
Seastar::seastar
xxHash::xxhash)
xxHash::xxhash
absl::headers)
check_headers(check-headers alternator
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -6,8 +6,10 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <fmt/ranges.h>
#include <seastar/core/sleep.hh>
#include "alternator/executor.hh"
#include "cdc/log.hh"
#include "db/config.hh"
#include "log.hh"
#include "schema/schema_builder.hh"
@@ -4438,8 +4440,10 @@ future<executor::request_return_type> executor::list_tables(client_state& client
auto tables = _proxy.data_dictionary().get_tables(); // hold on to temporary, table_names isn't a container, it's a view
auto table_names = tables
| boost::adaptors::filtered([] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 && !t.schema()->is_view();
| boost::adaptors::filtered([this] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 &&
!t.schema()->is_view() &&
!cdc::is_log_for_some_table(_proxy.local_db(), t.schema()->ks_name(), t.schema()->cf_name());
})
| boost::adaptors::transformed([] (data_dictionary::table t) {
return t.schema()->cf_name();
@@ -4575,7 +4579,7 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
// used by default on new Alternator tables. Change this initialization
// to 0 enable tablets by default, with automatic number of tablets.
std::optional<unsigned> initial_tablets;
if (sp.get_db().local().get_config().check_experimental(db::experimental_features_t::feature::TABLETS)) {
if (sp.get_db().local().get_config().enable_tablets()) {
auto it = tags_map.find(INITIAL_TABLETS_TAG_KEY);
if (it != tags_map.end()) {
// Tag set. If it's a valid number, use it. If not - e.g., it's

View File

@@ -256,6 +256,6 @@ public:
} // namespace parsed
} // namespace alternator
template <> struct fmt::formatter<alternator::parsed::path> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<alternator::parsed::path> : fmt::formatter<string_view> {
auto format(const alternator::parsed::path&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -8,6 +8,7 @@
#include "alternator/server.hh"
#include "log.hh"
#include <fmt/ranges.h>
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
@@ -210,8 +211,11 @@ protected:
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (_gossiper.is_alive(ip)) {
rjson::push_back(results, rjson::from_string(ip.to_sstring()));
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We need the node to be in normal state. See #19694.
if (_gossiper.is_normal(ip)) {
rjson::push_back(results, rjson::from_string(fmt::to_string(ip)));
}
}
rep->set_status(reply::status_type::ok);

View File

@@ -1057,9 +1057,6 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
if (stream_enabled->GetBool()) {
auto db = sp.data_dictionary();
if (!db.features().cdc) {
throw api_error::validation("StreamSpecification: streams (CDC) feature not enabled in cluster.");
}
if (!db.features().alternator_streams) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}

View File

@@ -26,6 +26,7 @@
#include "log.hh"
#include "gc_clock.hh"
#include "replica/database.hh"
#include "service/client_state.hh"
#include "service_permit.hh"
#include "timestamp.hh"
#include "service/storage_proxy.hh"
@@ -383,6 +384,9 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
// the chances of covering all ranges during a scan when restarts occur.
// A more deterministic way would be to regularly persist the scanning state,
// but that incurs overhead that we want to avoid if not needed.
//
// FIXME: Check if this algorithm is safe with tablet migration.
// https://github.com/scylladb/scylladb/issues/16567
enum primary_or_secondary_t {primary, secondary};
template<primary_or_secondary_t primary_or_secondary>
class token_ranges_owned_by_this_shard {
@@ -495,6 +499,7 @@ struct scan_ranges_context {
bytes column_name;
std::optional<std::string> member;
service::client_state internal_client_state;
::shared_ptr<cql3::selection::selection> selection;
std::unique_ptr<service::query_state> query_state_ptr;
std::unique_ptr<cql3::query_options> query_options;
@@ -504,6 +509,7 @@ struct scan_ranges_context {
: s(s)
, column_name(column_name)
, member(member)
, internal_client_state(service::client_state::internal_tag())
{
// FIXME: don't read the entire items - read only parts of it.
// We must read the key columns (to be able to delete) and also
@@ -522,10 +528,9 @@ struct scan_ranges_context {
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, empty_service_permit());
query_state_ptr = std::make_unique<service::query_state>(internal_client_state, trace_state, empty_service_permit());
// FIXME: What should we do on multi-DC? Will we run the expiration on the same ranges on all
// DCs or only once for each range? If the latter, we need to change the CLs in the
// scanner and deleter.

View File

@@ -72,7 +72,8 @@ target_link_libraries(api
idl
wasmtime_bindings
Seastar::seastar
xxHash::xxhash)
xxHash::xxhash
absl::headers)
check_headers(check-headers api
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -63,6 +63,28 @@
"paramType":"path"
}
]
},
{
"method":"GET",
"summary":"Read the state of an injection from all shards",
"type":"array",
"items":{
"type":"error_injection_info"
},
"nickname":"read_injection",
"produces":[
"application/json"
],
"parameters":[
{
"name":"injection",
"description":"injection name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
@@ -152,5 +174,39 @@
}
}
}
},
"models":{
"mapper":{
"id":"mapper",
"description":"A key value mapping",
"properties":{
"key":{
"type":"string",
"description":"The key"
},
"value":{
"type":"string",
"description":"The value"
}
}
},
"error_injection_info":{
"id":"error_injection_info",
"description":"Information about an error injection",
"properties":{
"enabled":{
"type":"boolean",
"description":"Is the error injection enabled"
},
"parameters":{
"type":"array",
"items":{
"type":"mapper"
},
"description":"The parameter values"
}
},
"required":["enabled"]
}
}
}

View File

@@ -90,7 +90,7 @@
"operations":[
{
"method":"GET",
"summary":"Returns a list of the tokens endpoint mapping",
"summary":"Returns a list of the tokens endpoint mapping, provide keyspace and cf param to get tablet mapping",
"type":"array",
"items":{
"type":"mapper"
@@ -100,6 +100,22 @@
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace to provide the tablet mapping for",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"The table to provide the tablet mapping for",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
@@ -1897,6 +1913,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"force",
"description":"Enforce the source_dc option, even if it unsafe to use for rebuild",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -2726,6 +2750,22 @@
}
]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[
{
"nickname":"quiesce_topology",
"method":"POST",
"summary":"Waits until there are no ongoing topology operations. Guarantees that topology operations which started before the call are finished after the call. This doesn't consider requested but not started operations. Such operations may start after the call succeeds.",
"type":"void",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/metrics/total_hints",
"operations":[

View File

@@ -194,6 +194,21 @@
"parameters":[]
}
]
},
{
"path":"/system/highest_supported_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get highest supported sstable version",
"type":"string",
"nickname":"get_highest_supported_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -346,7 +346,7 @@ void req_params::process(const request& req) {
continue;
}
try {
ent.value = req.param[name];
ent.value = req.get_path_param(name);
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));
}

View File

@@ -54,7 +54,7 @@ static const char* str_to_regex(const sstring& v) {
void set_collectd(http_context& ctx, routes& r) {
cd::get_collectd.set(r, [](std::unique_ptr<request> req) {
auto id = ::make_shared<scollectd::type_instance_id>(req->param["pluginid"],
auto id = ::make_shared<scollectd::type_instance_id>(req->get_path_param("pluginid"),
req->get_query_param("instance"), req->get_query_param("type"),
req->get_query_param("type_instance"));
@@ -91,7 +91,7 @@ void set_collectd(http_context& ctx, routes& r) {
});
cd::enable_collectd.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
std::regex plugin(req->param["pluginid"].c_str());
std::regex plugin(req->get_path_param("pluginid").c_str());
std::regex instance(str_to_regex(req->get_query_param("instance")));
std::regex type(str_to_regex(req->get_query_param("type")));
std::regex type_instance(str_to_regex(req->get_query_param("type_instance")));

View File

@@ -6,9 +6,11 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <fmt/ranges.h>
#include "column_family.hh"
#include "api/api.hh"
#include "api/api-doc/column_family.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include <vector>
#include <seastar/http/exception.hh>
#include "sstables/sstables.hh"
@@ -28,6 +30,7 @@ using namespace httpd;
using namespace json;
namespace cf = httpd::column_family_json;
namespace ss = httpd::storage_service_json;
std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {
auto pos = name.find("%3A");
@@ -79,6 +82,65 @@ future<json::json_return_type> get_cf_stats(http_context& ctx,
}, std::plus<int64_t>());
}
static future<json::json_return_type> set_tables(http_context& ctx, const sstring& keyspace, std::vector<sstring> tables, std::function<future<>(replica::table&)> set) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return do_with(keyspace, std::move(tables), [&ctx, set] (const sstring& keyspace, const std::vector<sstring>& tables) {
return ctx.db.invoke_on_all([&keyspace, &tables, set] (replica::database& db) {
return parallel_for_each(tables, [&db, &keyspace, set] (const sstring& table) {
replica::table& t = db.find_column_family(keyspace, table);
return set(t);
});
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
class autocompaction_toggle_guard {
replica::database& _db;
public:
autocompaction_toggle_guard(replica::database& db) : _db(db) {
assert(this_shard_id() == 0);
if (!_db._enable_autocompaction_toggle) {
throw std::runtime_error("Autocompaction toggle is busy");
}
_db._enable_autocompaction_toggle = false;
}
autocompaction_toggle_guard(const autocompaction_toggle_guard&) = delete;
autocompaction_toggle_guard(autocompaction_toggle_guard&&) = default;
~autocompaction_toggle_guard() {
assert(this_shard_id() == 0);
_db._enable_autocompaction_toggle = true;
}
};
static future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
apilog.info("set_tables_autocompaction: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
return ctx.db.invoke_on(0, [&ctx, keyspace, tables = std::move(tables), enabled] (replica::database& db) {
auto g = autocompaction_toggle_guard(db);
return set_tables(ctx, keyspace, tables, [enabled] (replica::table& cf) {
if (enabled) {
cf.enable_auto_compaction();
} else {
return cf.disable_auto_compaction();
}
return make_ready_future<>();
}).finally([g = std::move(g)] {});
});
}
static future<json::json_return_type> set_tables_tombstone_gc(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
apilog.info("set_tables_tombstone_gc: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
return set_tables(ctx, keyspace, std::move(tables), [enabled] (replica::table& t) {
t.set_tombstone_gc_enabled(enabled);
return make_ready_future<>();
});
}
static future<json::json_return_type> get_cf_stats_count(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {
@@ -304,6 +366,14 @@ ratio_holder filter_recent_false_positive_as_ratio_holder(const sstables::shared
return ratio_holder(f + sst->filter_get_recent_true_positive(), f);
}
uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<uint64_t(replica::memtable& mt)> action) {
uint64_t ret = 0;
t.for_each_active_memtable([&] (replica::memtable& mt) {
ret += action(mt);
});
return ret;
}
void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace>& sys_ks) {
cf::get_column_family_name.set(r, [&ctx] (const_req req){
std::vector<sstring> res;
@@ -338,14 +408,14 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t{0}, [](replica::column_family& cf) {
return accumulate_on_active_memtables(cf, std::mem_fn(&replica::memtable::partition_count));
}, std::plus<>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, uint64_t{0}, [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));
return accumulate_on_active_memtables(cf, std::mem_fn(&replica::memtable::partition_count));
}, std::plus<>());
});
@@ -358,34 +428,34 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().total_space();
}), uint64_t(0));
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return accumulate_on_active_memtables(cf, [] (replica::memtable& active_memtable) {
return active_memtable.region().occupancy().total_space();
});
}, std::plus<int64_t>());
});
cf::get_all_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().total_space();
}), uint64_t(0));
return accumulate_on_active_memtables(cf, [] (replica::memtable& active_memtable) {
return active_memtable.region().occupancy().total_space();
});
}, std::plus<int64_t>());
});
cf::get_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return accumulate_on_active_memtables(cf, [] (replica::memtable& active_memtable) {
return active_memtable.region().occupancy().used_space();
});
}, std::plus<int64_t>());
});
cf::get_all_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
return accumulate_on_active_memtables(cf, [] (replica::memtable& active_memtable) {
return active_memtable.region().occupancy().used_space();
});
}, std::plus<int64_t>());
});
@@ -399,7 +469,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().total_space();
}, std::plus<int64_t>());
});
@@ -415,7 +485,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().used_space();
}, std::plus<int64_t>());
});
@@ -423,14 +493,14 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_all_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
return accumulate_on_active_memtables(cf, [] (replica::memtable& active_memtable) {
return active_memtable.region().occupancy().used_space();
});
}, std::plus<int64_t>());
});
cf::get_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::memtable_switch_count);
return get_cf_stats(ctx,req->get_path_param("name") ,&replica::column_family_stats::memtable_switch_count);
});
cf::get_all_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -439,7 +509,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_partition_size);
@@ -451,7 +521,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
uint64_t res = 0;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res += i->get_stats_metadata().estimated_partition_size.count();
@@ -462,7 +532,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_cells_count);
@@ -479,7 +549,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_pending_flushes.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::pending_flushes);
return get_cf_stats(ctx,req->get_path_param("name") ,&replica::column_family_stats::pending_flushes);
});
cf::get_all_pending_flushes.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -487,7 +557,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_count(ctx,req->param["name"] ,&replica::column_family_stats::reads);
return get_cf_stats_count(ctx,req->get_path_param("name") ,&replica::column_family_stats::reads);
});
cf::get_all_read.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -495,7 +565,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_write.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_count(ctx, req->param["name"] ,&replica::column_family_stats::writes);
return get_cf_stats_count(ctx, req->get_path_param("name") ,&replica::column_family_stats::writes);
});
cf::get_all_write.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -503,19 +573,19 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::reads);
});
cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
return get_cf_rate_and_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::reads);
});
cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_sum(ctx,req->param["name"] ,&replica::column_family_stats::reads);
return get_cf_stats_sum(ctx,req->get_path_param("name") ,&replica::column_family_stats::reads);
});
cf::get_write_latency.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_sum(ctx, req->param["name"] ,&replica::column_family_stats::writes);
return get_cf_stats_sum(ctx, req->get_path_param("name") ,&replica::column_family_stats::writes);
});
cf::get_all_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -527,11 +597,11 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::writes);
});
cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
return get_cf_rate_and_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::writes);
});
cf::get_all_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -543,7 +613,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.estimate_pending_compactions();
}, std::plus<int64_t>());
});
@@ -555,7 +625,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx, req->param["name"], &replica::column_family_stats::live_sstable_count);
return get_cf_stats(ctx, req->get_path_param("name"), &replica::column_family_stats::live_sstable_count);
});
cf::get_all_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -563,11 +633,11 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_unleveled_sstables.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_unleveled_sstables(ctx, req->param["name"]);
return get_cf_unleveled_sstables(ctx, req->get_path_param("name"));
});
cf::get_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return sum_sstable(ctx, req->param["name"], false);
return sum_sstable(ctx, req->get_path_param("name"), false);
});
cf::get_all_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -575,7 +645,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return sum_sstable(ctx, req->param["name"], true);
return sum_sstable(ctx, req->get_path_param("name"), true);
});
cf::get_all_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -584,7 +654,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_min_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], INT64_MAX, min_partition_size, min_int64);
return map_reduce_cf(ctx, req->get_path_param("name"), INT64_MAX, min_partition_size, min_int64);
});
// FIXME: this refers to partitions, not rows.
@@ -594,7 +664,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_max_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), max_partition_size, max_int64);
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), max_partition_size, max_int64);
});
// FIXME: this refers to partitions, not rows.
@@ -605,7 +675,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_partition_size, std::plus<integral_ratio_holder>());
return map_reduce_cf(ctx, req->get_path_param("name"), integral_ratio_holder(), mean_partition_size, std::plus<integral_ratio_holder>());
});
// FIXME: this refers to partitions, not rows.
@@ -615,7 +685,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
@@ -633,7 +703,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
@@ -651,7 +721,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
@@ -663,7 +733,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
@@ -675,7 +745,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
@@ -693,7 +763,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
@@ -711,7 +781,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
@@ -734,7 +804,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// We are missing the off heap memory calculation
// Return 0 is the wrong value. It's a work around
// until the memory calculation will be available
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -747,7 +817,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_speculative_retries.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -760,32 +830,14 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_key_cache_hit_rate.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
cf::get_true_snapshots_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.local().find_column_family(uuid).get_snapshot_details().then([](
const std::unordered_map<sstring, replica::column_family::snapshot_details>& sd) {
int64_t res = 0;
for (auto i : sd) {
res += i.second.total;
}
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_all_true_snapshots_size.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
cf::get_row_cache_hit_out_of_range.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -796,7 +848,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_row_cache_hit.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -812,7 +864,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_row_cache_miss.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().misses.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -829,102 +881,120 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_prepare.histogram();
});
});
cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_accept.histogram();
});
});
cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_learn.histogram();
});
});
cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
return cf.get_stats().estimated_sstable_per_read;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::tombstone_scanned);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::tombstone_scanned);
});
cf::get_live_scanned_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::live_scanned);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::live_scanned);
});
cf::get_col_update_time_delta_histogram.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
std::vector<double> res;
return make_ready_future<json::json_return_type>(res);
});
cf::get_auto_compaction.set(r, [&ctx] (const_req req) {
auto uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.get_path_param("name"), ctx.db.local());
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
return !cf.is_auto_compaction_disabled_by_user();
});
cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_auto_compaction: name={}", req->param["name"]);
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
cf.enable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
});
});
apilog.info("column_family/enable_auto_compaction: name={}", req->get_path_param("name"));
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
validate_table(ctx, ks, cf);
return set_tables_autocompaction(ctx, ks, {std::move(cf)}, true);
});
cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_auto_compaction: name={}", req->param["name"]);
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
return cf.disable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
});
});
apilog.info("column_family/disable_auto_compaction: name={}", req->get_path_param("name"));
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
validate_table(ctx, ks, cf);
return set_tables_autocompaction(ctx, ks, {std::move(cf)}, false);
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
cf::get_tombstone_gc.set(r, [&ctx] (const_req req) {
auto uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.get_path_param("name"), ctx.db.local());
replica::table& t = ctx.db.local().find_column_family(uuid);
return t.tombstone_gc_enabled();
});
cf::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
t.set_tombstone_gc_enabled(true);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
apilog.info("column_family/enable_tombstone_gc: name={}", req->get_path_param("name"));
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
validate_table(ctx, ks, cf);
return set_tables_tombstone_gc(ctx, ks, {std::move(cf)}, true);
});
cf::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
t.set_tombstone_gc_enabled(false);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
apilog.info("column_family/disable_tombstone_gc: name={}", req->get_path_param("name"));
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
validate_table(ctx, ks, cf);
return set_tables_tombstone_gc(ctx, ks, {std::move(cf)}, false);
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, true);
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, false);
});
cf::get_built_indexes.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {
auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);
auto ks_cf = parse_fully_qualified_cf_name(req->get_path_param("name"));
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
return sys_ks.local().load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace_view_build_progress>& vb) mutable {
@@ -962,7 +1032,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_compression_ratio.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto uuid = get_uuid(req->param["name"], ctx.db.local());
auto uuid = get_uuid(req->get_path_param("name"), ctx.db.local());
return ctx.db.map_reduce(sum_ratio<double>(), [uuid](replica::database& db) {
replica::column_family& cf = db.find_column_family(uuid);
@@ -973,21 +1043,21 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().reads.histogram();
});
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().writes.histogram();
});
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<http::request> req) {
sstring strategy = req->get_query_param("class_name");
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->param["name"], strategy);
return foreach_column_family(ctx, req->param["name"], [strategy](replica::column_family& cf) {
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->get_path_param("name"), strategy);
return foreach_column_family(ctx, req->get_path_param("name"), [strategy](replica::column_family& cf) {
cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -995,7 +1065,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_compaction_strategy_class.set(r, [&ctx](const_req req) {
return ctx.db.local().find_column_family(get_uuid(req.param["name"], ctx.db.local())).get_compaction_strategy().name();
return ctx.db.local().find_column_family(get_uuid(req.get_path_param("name"), ctx.db.local())).get_compaction_strategy().name();
});
cf::set_compression_parameters.set(r, [](std::unique_ptr<http::request> req) {
@@ -1011,7 +1081,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), std::vector<uint64_t>(), [](const replica::column_family& cf) {
return cf.sstable_count_per_level();
}, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {
return make_ready_future<json::json_return_type>(res);
@@ -1020,7 +1090,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_sstables_for_key.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
auto uuid = get_uuid(req->get_path_param("name"), ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
@@ -1036,7 +1106,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<http::request> req) {
auto name = req->param["name"];
auto name = req->get_path_param("name");
auto [ks, cf] = parse_fully_qualified_cf_name(name);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
@@ -1063,7 +1133,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
}
auto [ks, cf] = parse_fully_qualified_cf_name(*params.get("name"));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
apilog.info("column_family/force_major_compaction: name={} flush={}", req->param["name"], flush);
apilog.info("column_family/force_major_compaction: name={} flush={}", req->get_path_param("name"), flush);
auto keyspace = validate_keyspace(ctx, ks);
std::vector<table_info> table_infos = {table_info{
@@ -1156,8 +1226,6 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_speculative_retries.unset(r);
cf::get_all_speculative_retries.unset(r);
cf::get_key_cache_hit_rate.unset(r);
cf::get_true_snapshots_size.unset(r);
cf::get_all_true_snapshots_size.unset(r);
cf::get_row_cache_hit_out_of_range.unset(r);
cf::get_all_row_cache_hit_out_of_range.unset(r);
cf::get_row_cache_hit.unset(r);
@@ -1174,6 +1242,13 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_auto_compaction.unset(r);
cf::enable_auto_compaction.unset(r);
cf::disable_auto_compaction.unset(r);
ss::enable_auto_compaction.unset(r);
ss::disable_auto_compaction.unset(r);
cf::get_tombstone_gc.unset(r);
cf::enable_tombstone_gc.unset(r);
cf::disable_tombstone_gc.unset(r);
ss::enable_tombstone_gc.unset(r);
ss::disable_tombstone_gc.unset(r);
cf::get_built_indexes.unset(r);
cf::get_compression_metadata_off_heap_memory_used.unset(r);
cf::get_compression_parameters.unset(r);

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include "compaction_manager.hh"
#include "compaction/compaction_manager.hh"
@@ -110,7 +111,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req->param);
auto ks_name = validate_keyspace(ctx, req);
auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");
if (table_names.empty()) {
table_names = map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
@@ -153,10 +154,13 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::get_compaction_history.set(r, [&ctx] (std::unique_ptr<http::request> req) {
std::function<future<>(output_stream<char>&&)> f = [&ctx](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [&ctx] (output_stream<char>& s, bool& first){
return s.write("[").then([&ctx, &s, &first] {
return ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable {
std::function<future<>(output_stream<char>&&)> f = [&ctx] (output_stream<char>&& out) -> future<> {
auto s = std::move(out);
bool first = true;
std::exception_ptr ex;
try {
co_await s.write("[");
co_await ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
cm::history h;
h.id = fmt::to_string(entry.id);
h.ks = std::move(entry.ks);
@@ -170,18 +174,21 @@ void set_compaction_manager(http_context& ctx, routes& r) {
e.value = it.second;
h.rows_merged.push(std::move(e));
}
auto fut = first ? make_ready_future<>() : s.write(", ");
if (!first) {
co_await s.write(", ");
}
first = false;
return fut.then([&s, h = std::move(h)] {
return formatter::write(s, h);
});
}).then([&s] {
return s.write("]").then([&s] {
return s.close();
});
co_await formatter::write(s, h);
});
});
});
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
return make_ready_future<json::json_return_type>(std::move(f));
});

View File

@@ -92,7 +92,7 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
});
cs::find_config_id.set(r, [&cfg] (const_req r) {
auto id = r.param["id"];
auto id = r.get_path_param("id");
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
if (id == cfg.name()) {

View File

@@ -24,7 +24,7 @@ namespace hf = httpd::error_injection_json;
void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
bool one_shot = req->get_query_param("one_shot") == "True";
auto params = req->content;
@@ -56,7 +56,7 @@ void set_error_injection(http_context& ctx, routes& r) {
});
hf::disable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
auto& errinj = utils::get_local_injector();
return errinj.disable_on_all(injection).then([] {
@@ -64,6 +64,32 @@ void set_error_injection(http_context& ctx, routes& r) {
});
});
hf::read_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
const sstring injection = req->get_path_param("injection");
std::vector<error_injection_json::error_injection_info> error_injection_infos(smp::count, error_injection_json::error_injection_info{});
co_await smp::invoke_on_all([&] {
auto& info = error_injection_infos[this_shard_id()];
auto& errinj = utils::get_local_injector();
const auto enabled = errinj.is_enabled(injection);
info.enabled = enabled;
if (!enabled) {
return;
}
std::vector<error_injection_json::mapper> parameters;
for (const auto& p : errinj.get_injection_parameters(injection)) {
error_injection_json::mapper param;
param.key = p.first;
param.value = p.second;
parameters.push_back(std::move(param));
}
info.parameters = std::move(parameters);
});
co_return json::json_return_type(error_injection_infos);
});
hf::disable_on_all.set(r, [](std::unique_ptr<request> req) {
auto& errinj = utils::get_local_injector();
return errinj.disable_on_all().then([] {
@@ -72,7 +98,7 @@ void set_error_injection(http_context& ctx, routes& r) {
});
hf::message_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
auto& errinj = utils::get_local_injector();
return errinj.receive_message_on_all(injection).then([] {
return make_ready_future<json::json_return_type>(json::json_void());

View File

@@ -66,7 +66,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::map<sstring, sstring> nodes_status;
g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {
nodes_status.emplace(node.to_sstring(), g.is_alive(node) ? "UP" : "DOWN");
nodes_status.emplace(fmt::to_string(node), g.is_alive(node) ? "UP" : "DOWN");
});
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
@@ -81,9 +81,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->param["addr"]));
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);

View File

@@ -32,21 +32,21 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_generation_number(ep).then([] (gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
@@ -54,17 +54,17 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
if (req->get_query_param("unsafe") != "True") {
return g.assassinate_endpoint(req->param["addr"]).then([] {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
return g.unsafe_assassinate_endpoint(req->param["addr"]).then([] {
return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {
return make_ready_future<json::json_return_type>(json_void());
});

View File

@@ -146,7 +146,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
});
hf::inject_disconnect.set(r, [&ms] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto ip = msg_addr(req->param["ip"]);
auto ip = msg_addr(req->get_path_param("ip"));
co_await ms.invoke_on_all([ip] (netw::messaging_service& ms) {
ms.remove_rpc_client(ip);
});

View File

@@ -24,7 +24,7 @@ using namespace json;
void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr) {
r::trigger_snapshot.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
raft::group_id gid{utils::UUID{req->param["group_id"]}};
raft::group_id gid{utils::UUID{req->get_path_param("group_id")}};
auto timeout_dur = std::invoke([timeout_str = req->get_query_param("timeout")] {
if (timeout_str.empty()) {
return std::chrono::seconds{60};
@@ -61,17 +61,31 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
co_return json_void{};
});
r::get_leader_host.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
return smp::submit_to(0, [&] {
auto& srv = std::invoke([&] () -> raft::server& {
if (req->query_parameters.contains("group_id")) {
raft::group_id id{utils::UUID{req->get_query_param("group_id")}};
return raft_gr.local().get_server(id);
} else {
return raft_gr.local().group0();
}
if (!req->query_parameters.contains("group_id")) {
const auto leader_id = co_await raft_gr.invoke_on(0, [] (service::raft_group_registry& raft_gr) {
auto& srv = raft_gr.group0();
return srv.current_leader();
});
return json_return_type(srv.current_leader().to_sstring());
co_return json_return_type{leader_id.to_sstring()};
}
const raft::group_id gid{utils::UUID{req->get_query_param("group_id")}};
std::atomic<bool> found_srv{false};
std::atomic<raft::server_id> leader_id = raft::server_id::create_null_id();
co_await raft_gr.invoke_on_all([gid, &found_srv, &leader_id] (service::raft_group_registry& raft_gr) {
if (raft_gr.find_server(gid)) {
found_srv = true;
leader_id = raft_gr.get_server(gid).current_leader();
}
return make_ready_future<>();
});
if (!found_srv) {
throw bad_param_exception{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_return_type(leader_id.load().to_sstring());
});
}

View File

@@ -26,6 +26,7 @@
#include <boost/algorithm/string/trim_all.hpp>
#include <boost/algorithm/string/case_conv.hpp>
#include <boost/functional/hash.hpp>
#include <fmt/ranges.h>
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "service/load_meter.hh"
@@ -35,6 +36,7 @@
#include <seastar/http/exception.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/exception.hh>
#include "repair/row_level.hh"
#include "locator/snitch_base.hh"
#include "column_family.hh"
@@ -53,6 +55,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "sstables_loader.hh"
#include "db/view/view_builder.hh"
#include "utils/user_provided_param.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
@@ -63,6 +66,7 @@ namespace api {
namespace ss = httpd::storage_service_json;
namespace sp = httpd::storage_proxy_json;
namespace cf = httpd::column_family_json;
using namespace json;
sstring validate_keyspace(const http_context& ctx, sstring ks_name) {
@@ -72,11 +76,15 @@ sstring validate_keyspace(const http_context& ctx, sstring ks_name) {
throw bad_param_exception(replica::no_such_keyspace(ks_name).what());
}
sstring validate_keyspace(const http_context& ctx, const parameters& param) {
return validate_keyspace(ctx, param["keyspace"]);
sstring validate_keyspace(const http_context& ctx, const std::unique_ptr<http::request>& req) {
return validate_keyspace(ctx, req->get_path_param("keyspace"));
}
static void validate_table(const http_context& ctx, sstring ks_name, sstring table_name) {
sstring validate_keyspace(const http_context& ctx, const http::request& req) {
return validate_keyspace(ctx, req.get_path_param("keyspace"));
}
void validate_table(const http_context& ctx, sstring ks_name, sstring table_name) {
auto& db = ctx.db.local();
try {
db.find_column_family(ks_name, table_name);
@@ -199,14 +207,13 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
namespace cf = httpd::column_family_json;
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {
@@ -236,47 +243,6 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition
});
}
static future<json::json_return_type> set_tables(http_context& ctx, const sstring& keyspace, std::vector<sstring> tables, std::function<future<>(replica::table&)> set) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return do_with(keyspace, std::move(tables), [&ctx, set] (const sstring& keyspace, const std::vector<sstring>& tables) {
return ctx.db.invoke_on_all([&keyspace, &tables, set] (replica::database& db) {
return parallel_for_each(tables, [&db, &keyspace, set] (const sstring& table) {
replica::table& t = db.find_column_family(keyspace, table);
return set(t);
});
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
apilog.info("set_tables_autocompaction: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
return ctx.db.invoke_on(0, [&ctx, keyspace, tables = std::move(tables), enabled] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return set_tables(ctx, keyspace, tables, [enabled] (replica::table& cf) {
if (enabled) {
cf.enable_auto_compaction();
} else {
return cf.disable_auto_compaction();
}
return make_ready_future<>();
}).finally([g = std::move(g)] {});
});
}
future<json::json_return_type> set_tables_tombstone_gc(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
apilog.info("set_tables_tombstone_gc: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
return set_tables(ctx, keyspace, std::move(tables), [enabled] (replica::table& t) {
t.set_tombstone_gc_enabled(enabled);
return make_ready_future<>();
});
}
future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req) {
scrub_info info;
auto rp = req_params({
@@ -414,7 +380,7 @@ void unset_rpc_controller(http_context& ctx, routes& r) {
}
void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {
ss::repair_async.set(r, [&ctx, &repair](std::unique_ptr<http::request> req) {
ss::repair_async.set(r, [&ctx, &repair](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
static std::unordered_set<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
"startToken", "endToken", "ranges_parallelism", "small_table_optimization"};
@@ -427,8 +393,7 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {
continue;
}
if (!options.contains(x.first)) {
return make_exception_future<json::json_return_type>(
httpd::bad_param_exception(format("option {} is not supported", x.first)));
throw httpd::bad_param_exception(format("option {} is not supported", x.first));
}
}
std::unordered_map<sstring, sstring> options_map;
@@ -443,10 +408,14 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {
// returns immediately, not waiting for the repair to finish. The user
// then has other mechanisms to track the ongoing repair's progress,
// or stop it.
return repair_start(repair, validate_keyspace(ctx, req->param),
options_map).then([] (int i) {
return make_ready_future<json::json_return_type>(i);
});
try {
int res = co_await repair_start(repair, validate_keyspace(ctx, req), options_map);
co_return json::json_return_type(res);
} catch (const std::invalid_argument& e) {
// if the option is not sane, repair_start() throws immediately, so
// convert the exception to an HTTP error
throw httpd::bad_param_exception(e.what());
}
});
ss::get_active_repair_async.set(r, [&repair] (std::unique_ptr<http::request> req) {
@@ -526,7 +495,7 @@ void unset_repair(http_context& ctx, routes& r) {
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader) {
ss::load_new_ss_tables.set(r, [&ctx, &sst_loader](std::unique_ptr<http::request> req) {
auto ks = validate_keyspace(ctx, req->param);
auto ks = validate_keyspace(ctx, req);
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
@@ -557,8 +526,8 @@ void unset_sstables_loader(http_context& ctx, routes& r) {
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb) {
ss::view_build_statuses.set(r, [&ctx, &vb] (std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto view = req->param["view"];
auto keyspace = validate_keyspace(ctx, req);
auto view = req->get_path_param("view");
return vb.local().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
@@ -584,8 +553,24 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return ctx.db.local().commitlog()->active_config().commit_log_location;
});
ss::get_token_endpoint.set(r, [&ss] (std::unique_ptr<http::request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_token_to_endpoint_map(), [](const auto& i) {
ss::get_token_endpoint.set(r, [&ctx, &ss] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
const auto keyspace_name = req->get_query_param("keyspace");
const auto table_name = req->get_query_param("cf");
std::map<dht::token, gms::inet_address> token_endpoints;
if (keyspace_name.empty() && table_name.empty()) {
token_endpoints = ss.local().get_token_to_endpoint_map();
} else if (!keyspace_name.empty() && !table_name.empty()) {
auto& db = ctx.db.local();
if (!db.has_schema(keyspace_name, table_name)) {
throw bad_param_exception(fmt::format("Failed to find table {}.{}", keyspace_name, table_name));
}
token_endpoints = co_await ss.local().get_tablet_to_endpoint_map(db.find_schema(keyspace_name, table_name)->id());
} else {
throw bad_param_exception("Either provide both keyspace and table (for tablet table) or neither (for vnodes)");
}
co_return json::json_return_type(stream_range_as_array(token_endpoints, [](const auto& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
@@ -663,7 +648,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table = req->get_query_param("cf");
auto erm = std::invoke([&]() -> locator::effective_replication_map_ptr {
@@ -694,7 +679,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
m.key.push("");
}
for (const gms::inet_address& address : entry.second) {
m.value.push(address.to_sstring());
m.value.push(fmt::to_string(address));
}
return m;
});
@@ -703,7 +688,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_pending_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
std::vector<ss::maplist_mapper> res;
return make_ready_future<json::json_return_type>(res);
});
@@ -712,13 +697,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
if (!req->param.exists("keyspace")) {
throw bad_param_exception("The keyspace param is not provided");
}
auto keyspace = req->param["keyspace"];
auto keyspace = req->get_path_param("keyspace");
auto table = req->get_query_param("table");
if (!table.empty()) {
validate_table(ctx, keyspace, table);
return describe_ring_as_json_for_table(ss, keyspace, table);
}
return describe_ring_as_json(ss, validate_keyspace(ctx, req->param));
return describe_ring_as_json(ss, validate_keyspace(ctx, req));
});
ss::get_load.set(r, [&ctx](std::unique_ptr<http::request> req) {
@@ -746,7 +731,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_natural_endpoints.set(r, [&ctx, &ss](const_req req) {
auto keyspace = validate_keyspace(ctx, req.param);
auto keyspace = validate_keyspace(ctx, req);
return container_to_vec(ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"),
req.get_query_param("key")));
});
@@ -815,7 +800,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.get_type() == locator::replication_strategy_type::local || !rs.is_vnode_based()) {
@@ -910,7 +895,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto& db = ctx.db;
@@ -1019,7 +1004,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::truncate.set(r, [&ctx](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto column_family = req->get_query_param("cf");
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1153,7 +1138,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::rebuild.set(r, [&ss](std::unique_ptr<http::request> req) {
auto source_dc = req->get_query_param("source_dc");
utils::optional_param source_dc;
if (auto source_dc_str = req->get_query_param("source_dc"); !source_dc_str.empty()) {
source_dc.emplace(std::move(source_dc_str)).set_user_provided();
}
if (auto force_str = req->get_query_param("force"); !force_str.empty() && service::loosen_constraints(validate_bool(force_str))) {
if (!source_dc) {
throw bad_param_exception("The `source_dc` option must be provided for using the `force` option");
}
source_dc.set_force();
}
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -1163,14 +1157,14 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::bulk_load.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto path = req->param["path"];
auto path = req->get_path_param("path");
return make_ready_future<json::json_return_type>(json_void());
});
ss::bulk_load_async.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto path = req->param["path"];
auto path = req->get_path_param("path");
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1257,38 +1251,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, true);
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, false);
});
ss::deliver_hints.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
@@ -1382,7 +1344,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_effective_ownership.set(r, [&ctx, &ss] (std::unique_ptr<http::request> req) {
auto keyspace_name = req->param["keyspace"] == "null" ? "" : validate_keyspace(ctx, req->param);
auto keyspace_name = req->get_path_param("keyspace") == "null" ? "" : validate_keyspace(ctx, req);
auto table_name = req->get_query_param("cf");
if (!keyspace_name.empty()) {
@@ -1622,6 +1584,11 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
co_return json_void();
});
ss::quiesce_topology.set(r, [&ss] (std::unique_ptr<http::request> req) -> future<json_return_type> {
co_await ss.local().await_topology_quiesced();
co_return json_void();
});
sp::get_schema_versions.set(r, [&ss](std::unique_ptr<http::request> req) {
return ss.local().describe_schema_versions().then([] (auto result) {
std::vector<sp::mapper_list> res;
@@ -1697,10 +1664,6 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_trace_probability.unset(r);
ss::get_slow_query_info.unset(r);
ss::set_slow_query.unset(r);
ss::enable_auto_compaction.unset(r);
ss::disable_auto_compaction.unset(r);
ss::enable_tombstone_gc.unset(r);
ss::disable_tombstone_gc.unset(r);
ss::deliver_hints.unset(r);
ss::get_cluster_name.unset(r);
ss::get_partitioner_name.unset(r);
@@ -1725,6 +1688,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::add_tablet_replica.unset(r);
ss::del_tablet_replica.unset(r);
ss::tablet_balancing_enable.unset(r);
ss::quiesce_topology.unset(r);
sp::get_schema_versions.unset(r);
}
@@ -1732,32 +1696,41 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
ss::get_snapshot_details.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto result = co_await snap_ctl.local().get_snapshot_details();
co_return std::function([res = std::move(result)] (output_stream<char>&& o) -> future<> {
auto result = std::move(res);
std::exception_ptr ex;
output_stream<char> out = std::move(o);
bool first = true;
try {
auto result = std::move(res);
bool first = true;
co_await out.write("[");
for (auto&& map : result) {
if (!first) {
co_await out.write(", ");
co_await out.write("[");
for (auto& [name, details] : result) {
if (!first) {
co_await out.write(", ");
}
std::vector<ss::snapshot> snapshot;
for (auto& cf : details) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.details.live;
snp.total = cf.details.total;
snapshot.push_back(std::move(snp));
}
ss::snapshots all_snapshots;
all_snapshots.key = name;
all_snapshots.value = std::move(snapshot);
co_await all_snapshots.write(out);
first = false;
}
std::vector<ss::snapshot> snapshot;
for (auto& cf : std::get<1>(map)) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.live;
snp.total = cf.total;
snapshot.push_back(std::move(snp));
}
ss::snapshots all_snapshots;
all_snapshots.key = std::get<0>(map);
all_snapshots.value = std::move(snapshot);
co_await all_snapshots.write(out);
first = false;
co_await out.write("]");
co_await out.flush();
} catch (...) {
ex = std::current_exception();
}
co_await out.write("]");
co_await out.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
});
});
@@ -1836,6 +1809,20 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
co_return json::json_return_type(static_cast<int>(scrub_status::successful));
});
cf::get_true_snapshots_size.set(r, [&snap_ctl] (std::unique_ptr<http::request> req) {
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
return snap_ctl.local().true_snapshots_size(std::move(ks), std::move(cf)).then([] (int64_t res) {
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_all_true_snapshots_size.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(0);
});
}
void unset_snapshot(http_context& ctx, routes& r) {
@@ -1844,6 +1831,8 @@ void unset_snapshot(http_context& ctx, routes& r) {
ss::del_snapshot.unset(r);
ss::true_snapshots_size.unset(r);
ss::scrub.unset(r);
cf::get_true_snapshots_size.unset(r);
cf::get_all_true_snapshots_size.unset(r);
}
}

View File

@@ -40,7 +40,11 @@ sstring validate_keyspace(const http_context& ctx, sstring ks_name);
// verify that the keyspace parameter is found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective keyspace error.
sstring validate_keyspace(const http_context& ctx, const httpd::parameters& param);
sstring validate_keyspace(const http_context& ctx, const std::unique_ptr<http::request>& req);
// verify that the table parameter is found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective table error.
void validate_table(const http_context& ctx, sstring ks_name, sstring table_name);
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown

View File

@@ -108,7 +108,7 @@ void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_
});
hs::get_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
gms::inet_address peer(req->get_path_param("peer"));
return sm.map_reduce0([peer](streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_received;
@@ -129,7 +129,7 @@ void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_
});
hs::get_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
gms::inet_address peer(req->get_path_param("peer"));
return sm.map_reduce0([peer] (streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_sent;

View File

@@ -10,6 +10,7 @@
#include "api/api-doc/system.json.hh"
#include "api/api-doc/metrics.json.hh"
#include "replica/database.hh"
#include "sstables/sstables_manager.hh"
#include <rapidjson/document.h>
#include <seastar/core/reactor.hh>
@@ -122,9 +123,9 @@ void set_system(http_context& ctx, routes& r) {
hs::get_logger_level.set(r, [](const_req req) {
try {
return logging::level_name(logging::logger_registry().get_logger_level(req.param["name"]));
return logging::level_name(logging::logger_registry().get_logger_level(req.get_path_param("name")));
} catch (std::out_of_range& e) {
throw bad_param_exception("Unknown logger name " + req.param["name"]);
throw bad_param_exception("Unknown logger name " + req.get_path_param("name"));
}
// just to keep the compiler happy
return sstring();
@@ -133,9 +134,9 @@ void set_system(http_context& ctx, routes& r) {
hs::set_logger_level.set(r, [](const_req req) {
try {
logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
logging::logger_registry().set_logger_level(req.param["name"], level);
logging::logger_registry().set_logger_level(req.get_path_param("name"), level);
} catch (std::out_of_range& e) {
throw bad_param_exception("Unknown logger name " + req.param["name"]);
throw bad_param_exception("Unknown logger name " + req.get_path_param("name"));
} catch (boost::bad_lexical_cast& e) {
throw bad_param_exception("Unknown logging level " + req.get_query_param("level"));
}
@@ -182,6 +183,11 @@ void set_system(http_context& ctx, routes& r) {
apilog.info("Profile dumped to {}", profile_dest);
return make_ready_future<json::json_return_type>(json::json_return_type(json::json_void()));
}) ;
hs::get_highest_supported_sstable_version.set(r, [&ctx] (const_req req) {
auto& table = ctx.db.local().find_column_family("system", "local");
return seastar::to_sstring(table.get_sstables_manager().get_highest_supported_format());
});
}
}

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/http/exception.hh>
#include "task_manager.hh"
@@ -23,6 +24,8 @@ namespace tm = httpd::task_manager_json;
using namespace json;
using namespace seastar::httpd;
using task_variant = std::variant<tasks::task_manager::foreign_task_ptr, tasks::task_manager::task::task_essentials>;
inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {
return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&
(!query_params.contains("table") || query_params["table"] == task->get_status().table);
@@ -102,13 +105,14 @@ future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task
s.module = task->get_module_name();
s.progress.completed = progress.completed;
s.progress.total = progress.total;
std::vector<std::string> ct{task->get_children().size()};
boost::transform(task->get_children(), ct.begin(), [] (const auto& child) {
std::vector<std::string> ct = co_await task->get_children().map_each_task<std::string>([] (const tasks::task_manager::foreign_task_ptr& child) {
return child->id().to_sstring();
}, [] (const tasks::task_manager::task::task_essentials& child) {
return child.task_status.id.to_sstring();
});
s.children_ids = std::move(ct);
co_return s;
}
};
void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg) {
tm::get_modules.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -123,7 +127,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
chunked_stats local_res;
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->param["module"]);
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
@@ -138,25 +142,34 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
std::exception_ptr ex;
try {
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
}
}
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.write("]");
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
co_return std::move(f);
});
tm::get_task_status.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
@@ -173,13 +186,13 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
});
tm::abort_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
task->abort();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
@@ -188,12 +201,11 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
});
tm::wait_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
@@ -209,8 +221,8 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
tm::get_task_status_recursively.set(r, [&_tm = tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& tm = _tm;
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
std::queue<tasks::task_manager::foreign_task_ptr> q;
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
std::queue<task_variant> q;
utils::chunked_vector<full_task_status> res;
tasks::task_manager::foreign_task_ptr task;
@@ -230,10 +242,33 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
q.push(co_await task.copy()); // Task cannot be moved since we need it to be alive during whole loop execution.
while (!q.empty()) {
auto& current = q.front();
res.push_back(co_await retrieve_status(current));
for (auto& child: current->get_children()) {
q.push(co_await child.copy());
}
co_await std::visit(overloaded_functor {
[&] (const tasks::task_manager::foreign_task_ptr& task) -> future<> {
res.push_back(co_await retrieve_status(task));
co_await task->get_children().for_each_task([&q] (const tasks::task_manager::foreign_task_ptr& child) -> future<> {
q.push(co_await child.copy());
}, [&] (const tasks::task_manager::task::task_essentials& child) {
q.push(child);
return make_ready_future();
});
},
[&] (const tasks::task_manager::task::task_essentials& task) -> future<> {
res.push_back(full_task_status{
.task_status = task.task_status,
.type = task.type,
.progress = task.task_progress,
.parent_id = task.parent_id,
.abortable = task.abortable,
.children_ids = boost::copy_range<std::vector<std::string>>(task.failed_children | boost::adaptors::transformed([] (auto& child) {
return child.task_status.id.to_sstring();
}))
});
for (auto& child: task.failed_children) {
q.push(child);
}
return make_ready_future();
}
}, current);
q.pop();
}

View File

@@ -83,20 +83,19 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man
});
tmt::finish_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
auto it = req->query_parameters.find("error");
bool fail = it != req->query_parameters.end();
std::string error = fail ? it->second : "";
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
co_await test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
co_await test_task.finish();
}
return make_ready_future<>();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <fmt/ranges.h>
#include "api/api.hh"
#include "api/storage_service.hh"
@@ -29,7 +30,7 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
@@ -61,7 +62,7 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

View File

@@ -31,7 +31,7 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
});
ss::get_node_tokens.set(r, [&tm] (std::unique_ptr<http::request> req) {
gms::inet_address addr(req->param["endpoint"]);
gms::inet_address addr(req->get_path_param("endpoint"));
auto& local_tm = *tm.local().get();
const auto host_id = local_tm.get_host_id_if_known(addr);
return make_ready_future<json::json_return_type>(stream_range_as_array(host_id ? local_tm.get_tokens(*host_id): std::vector<dht::token>{}, [](const dht::token& i) {

View File

@@ -30,6 +30,7 @@ target_link_libraries(scylla_auth
Seastar::seastar
xxHash::xxhash
PRIVATE
absl::headers
cql3
idl
wasmtime_bindings

View File

@@ -50,7 +50,7 @@ inline bool is_anonymous(const authenticated_user& u) noexcept {
/// The user name, or "anonymous".
///
template <>
struct fmt::formatter<auth::authenticated_user> : fmt::formatter<std::string_view> {
struct fmt::formatter<auth::authenticated_user> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const auth::authenticated_user& u, FormatContext& ctx) const {
if (u.name) {

View File

@@ -48,15 +48,15 @@ public:
}
template <>
struct fmt::formatter<auth::authentication_option> : fmt::formatter<std::string_view> {
struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const auth::authentication_option a, FormatContext& ctx) const {
using enum auth::authentication_option;
switch (a) {
case password:
return formatter<std::string_view>::format("PASSWORD", ctx);
return formatter<string_view>::format("PASSWORD", ctx);
case options:
return formatter<std::string_view>::format("OPTIONS", ctx);
return formatter<string_view>::format("OPTIONS", ctx);
}
std::abort();
}

View File

@@ -10,8 +10,10 @@
#include "auth/certificate_authenticator.hh"
#include <regex>
#include <fmt/ranges.h>
#include "utils/class_registrator.hh"
#include "utils/to_string.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
@@ -74,7 +76,7 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
continue;
} catch (std::out_of_range&) {
// just fallthrough
} catch (std::regex_error&) {
} catch (boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

View File

@@ -24,7 +24,6 @@
#include "service/raft/group0_state_machine.hh"
#include "timeout_config.hh"
#include "db/config.hh"
#include "db/system_auth_keyspace.hh"
#include "utils/error_injection.hh"
namespace auth {
@@ -41,14 +40,14 @@ constinit const std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.")
static logging::logger auth_log("auth");
bool legacy_mode(cql3::query_processor& qp) {
return qp.auth_version < db::system_auth_keyspace::version_t::v2;
return qp.auth_version < db::system_keyspace::auth_version_t::v2;
}
std::string_view get_auth_ks_name(cql3::query_processor& qp) {
if (legacy_mode(qp)) {
return meta::legacy::AUTH_KS;
}
return db::system_auth_keyspace::NAME;
return db::system_keyspace::NAME;
}
// Func must support being invoked more than once.
@@ -65,7 +64,7 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
}).discard_result();
}
static future<> create_metadata_table_if_missing_impl(
static future<> create_legacy_metadata_table_if_missing_impl(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
@@ -73,7 +72,7 @@ static future<> create_metadata_table_if_missing_impl(
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql);
auto parsed_statement = cql3::query_processor::parse_statement(cql, cql3::dialect{});
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
parsed_cf_statement.prepare_keyspace(meta::legacy::AUTH_KS);
@@ -98,12 +97,12 @@ static future<> create_metadata_table_if_missing_impl(
}
}
future<> create_metadata_table_if_missing(
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) noexcept {
return futurize_invoke(create_metadata_table_if_missing_impl, table_name, qp, cql, mm);
return futurize_invoke(create_legacy_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
::service::query_state& internal_distributed_query_state() noexcept {
@@ -123,7 +122,7 @@ static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,
::service::group0_guard group0_guard,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_cmd = group0_client.prepare_command(
::service::write_mutations{
@@ -139,7 +138,7 @@ future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<mutations_generator(api::timestamp_type& t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for command's overhead, it's better to use smaller threshold than constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
@@ -190,7 +189,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_guard = co_await group0_client.start_operation(as, timeout);
auto timestamp = group0_guard.write_timestamp();

View File

@@ -70,7 +70,7 @@ future<> once_among_shards(Task&& f) {
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_metadata_table_if_missing(
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor&,
std::string_view cql,
@@ -84,7 +84,7 @@ future<> create_metadata_table_if_missing(
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when need to perform read before write on a single guard or if
// you have more than one mutation and potentially exceed single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source*)>;
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
using mutations_generator = coroutine::experimental::generator<mutation>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
@@ -93,7 +93,7 @@ future<> announce_mutations_with_batching(
// function here
start_operation_func_t start_operation_func,
std::function<mutations_generator(api::timestamp_type& t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
@@ -102,7 +102,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
}

View File

@@ -9,7 +9,7 @@
*/
#include "auth/default_authorizer.hh"
#include "db/system_auth_keyspace.hh"
#include "db/system_keyspace.hh"
extern "C" {
#include <crypt.h>
@@ -103,7 +103,7 @@ future<> default_authorizer::migrate_legacy_metadata() {
});
}
future<> default_authorizer::start() {
future<> default_authorizer::start_legacy() {
static const sstring create_table = fmt::format(
"CREATE TABLE {}.{} ("
"{} text,"
@@ -121,7 +121,7 @@ future<> default_authorizer::start() {
90 * 24 * 60 * 60); // 3 months.
return once_among_shards([this] {
return create_metadata_table_if_missing(
return create_legacy_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
@@ -144,6 +144,13 @@ future<> default_authorizer::start() {
});
}
future<> default_authorizer::start() {
if (legacy_mode(_qp)) {
return start_legacy();
}
return make_ready_future<>();
}
future<> default_authorizer::stop() {
_as.request_abort();
return _finished.handle_exception_type([](const sleep_aborted&) {}).handle_exception_type([](const abort_requested_exception&) {});
@@ -196,7 +203,7 @@ default_authorizer::modify(
cql3::query_processor::cache_internal::no).discard_result();
}
co_return co_await announce_mutations(_qp, _group0_client, query,
{permissions::to_strings(set), sstring(role_name), resource.name()}, &_as, ::service::raft_timeout{});
{permissions::to_strings(set), sstring(role_name), resource.name()}, _as, ::service::raft_timeout{});
}
@@ -249,7 +256,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
@@ -339,9 +346,9 @@ future<> default_authorizer::revoke_all(const resource& resource) {
const auto timeout = ::service::raft_timeout{};
co_await announce_mutations_with_batching(
_group0_client,
[this, timeout](abort_source* as) { return _group0_client.start_operation(as, timeout); },
[this, timeout](abort_source& as) { return _group0_client.start_operation(as, timeout); },
std::move(gen),
&_as,
_as,
timeout);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", name, e);

View File

@@ -60,6 +60,8 @@ public:
virtual const resource_set& protected_resources() const override;
private:
future<> start_legacy();
bool legacy_metadata_exists() const;
future<> revoke_all_legacy(const resource&);

View File

@@ -10,7 +10,7 @@
#include "auth/resource.hh"
#include "auth/role_manager.hh"
#include "seastar/core/future.hh"
#include <seastar/core/future.hh>
namespace cql3 {
class query_processor;

View File

@@ -132,48 +132,48 @@ future<> password_authenticator::create_default_if_missing() {
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
cql3::query_processor::cache_internal::no);
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, &_as, ::service::raft_timeout{}).then([]() {
plogger.info("Created default superuser authentication record.");
});
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
}
}
future<> password_authenticator::start() {
return once_among_shards([this] {
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager);
return once_among_shards([this] {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash, _superuser).get()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash, _superuser).get()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
return;
}
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
}
create_default_if_missing().get();
});
});
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
create_default_if_missing().get();
});
});
return f;
});
if (legacy_mode(_qp)) {
return create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager);
}
return make_ready_future<>();
});
}
future<> password_authenticator::stop() {
@@ -271,7 +271,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, &_as, ::service::raft_timeout{});
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, _as, ::service::raft_timeout{});
}
}
@@ -294,7 +294,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, &_as, ::service::raft_timeout{});
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, _as, ::service::raft_timeout{});
}
}
@@ -311,7 +311,7 @@ future<> password_authenticator::drop(std::string_view name) {
{sstring(name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(name)}, _as, ::service::raft_timeout{});
}
}

View File

@@ -8,6 +8,7 @@
#include "auth/permissions_cache.hh"
#include <fmt/ranges.h>
#include "auth/authorizer.hh"
#include "auth/service.hh"

View File

@@ -41,6 +41,28 @@ enum class resource_kind {
data, role, service_level, functions
};
}
template <>
struct fmt::formatter<auth::resource_kind> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const auth::resource_kind kind, FormatContext& ctx) const {
using enum auth::resource_kind;
switch (kind) {
case data:
return formatter<string_view>::format("data", ctx);
case role:
return formatter<string_view>::format("role", ctx);
case service_level:
return formatter<string_view>::format("service_level", ctx);
case functions:
return formatter<string_view>::format("functions", ctx);
}
std::abort();
}
};
namespace auth {
///
/// Type tag for constructing data resources.
///
@@ -246,25 +268,6 @@ std::pair<sstring, std::vector<data_type>> decode_signature(std::string_view enc
}
template <>
struct fmt::formatter<auth::resource_kind> : fmt::formatter<std::string_view> {
template <typename FormatContext>
auto format(const auth::resource_kind kind, FormatContext& ctx) const {
using enum auth::resource_kind;
switch (kind) {
case data:
return formatter<std::string_view>::format("data", ctx);
case role:
return formatter<std::string_view>::format("role", ctx);
case service_level:
return formatter<std::string_view>::format("service_level", ctx);
case functions:
return formatter<std::string_view>::format("functions", ctx);
}
std::abort();
}
};
namespace std {
template <>

View File

@@ -28,10 +28,9 @@
#include "db/config.hh"
#include "db/consistency_level_type.hh"
#include "db/functions/function_name.hh"
#include "db/system_auth_keyspace.hh"
#include "log.hh"
#include "schema/schema_fwd.hh"
#include "seastar/core/future.hh"
#include <seastar/core/future.hh>
#include "service/migration_manager.hh"
#include "timestamp.hh"
#include "utils/class_registrator.hh"
@@ -172,7 +171,7 @@ service::service(
used_by_maintenance_socket) {
}
future<> service::create_keyspace_if_missing(::service::migration_manager& mm) const {
future<> service::create_legacy_keyspace_if_missing(::service::migration_manager& mm) const {
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = _qp.db();
@@ -204,8 +203,12 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
// version is set in query processor to be easily available in various places we call auth::legacy_mode check.
_qp.auth_version = auth_version;
if (!_used_by_maintenance_socket) {
// this legacy keyspace is only used by cqlsh
// it's needed when executing `list roles` or `list users`
// it doesn't affect anything except that cqlsh fails if keyspace
// is not found
co_await once_among_shards([this, &mm] {
return create_keyspace_if_missing(mm);
return create_legacy_keyspace_if_missing(mm);
});
}
co_await _role_manager->start();
@@ -248,51 +251,6 @@ void service::reset_authorization_cache() {
_qp.reset_cache();
}
future<bool> service::has_existing_legacy_users() const {
if (!_qp.db().has_schema(meta::legacy::AUTH_KS, meta::legacy::USERS_CF)) {
return make_ready_future<bool>(false);
}
static const sstring default_user_query = format("SELECT * FROM {}.{} WHERE {} = ?",
meta::legacy::AUTH_KS,
meta::legacy::USERS_CF,
meta::user_name_col_name);
static const sstring all_users_query = format("SELECT * FROM {}.{} LIMIT 1",
meta::legacy::AUTH_KS,
meta::legacy::USERS_CF);
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.execute_internal(
default_user_query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.execute_internal(
default_user_query,
db::consistency_level::QUORUM,
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM,
cql3::query_processor::cache_internal::no).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}
future<permission_set>
service::get_uncached_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (is_anonymous(maybe_role)) {
@@ -685,7 +643,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
}
auto muts = co_await qp.get_mutations_internal(
format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_auth_keyspace::NAME,
db::system_keyspace::NAME,
cf_name,
col_names_str,
val_binders_str),
@@ -700,12 +658,12 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,
db::system_auth_keyspace::version_t::v2);
db::system_keyspace::auth_version_t::v2);
};
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
&as,
as,
std::nullopt);
}

View File

@@ -175,9 +175,7 @@ public:
}
private:
future<bool> has_existing_legacy_users() const;
future<> create_keyspace_if_missing(::service::migration_manager& mm) const;
future<> create_legacy_keyspace_if_missing(::service::migration_manager& mm) const;
};
future<bool> has_superuser(const service&, const authenticated_user&);

View File

@@ -143,7 +143,7 @@ const resource_set& standard_role_manager::protected_resources() const {
return resources;
}
future<> standard_role_manager::create_metadata_tables_if_missing() const {
future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const {
static const sstring create_role_members_query = fmt::format(
"CREATE TABLE {}.{} ("
" role text,"
@@ -155,17 +155,17 @@ future<> standard_role_manager::create_metadata_tables_if_missing() const {
return when_all_succeed(
create_metadata_table_if_missing(
create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager),
create_metadata_table_if_missing(
create_legacy_metadata_table_if_missing(
meta::role_members_table::name,
_qp,
create_role_members_query,
_migration_manager),
create_metadata_table_if_missing(
create_legacy_metadata_table_if_missing(
meta::role_attributes_table::name,
_qp,
meta::role_attributes_table::creation_query(),
@@ -190,7 +190,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
@@ -236,24 +236,30 @@ future<> standard_role_manager::migrate_legacy_metadata() {
future<> standard_role_manager::start() {
return once_among_shards([this] {
return this->create_metadata_tables_if_missing().then([this] {
return futurize_invoke([this] () {
if (legacy_mode(_qp)) {
return this->create_legacy_metadata_tables_if_missing();
}
return make_ready_future<>();
}).then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (legacy_mode(_qp)) {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get()) {
if (this->legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get()) {
if (this->legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
return;
}
return;
if (this->legacy_metadata_exists()) {
this->migrate_legacy_metadata().get();
return;
}
}
if (this->legacy_metadata_exists()) {
this->migrate_legacy_metadata().get();
return;
}
create_default_role_if_missing().get();
});
});
@@ -279,7 +285,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
{sstring(role_name), c.is_superuser, c.can_login},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name), c.is_superuser, c.can_login}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name), c.is_superuser, c.can_login}, _as, ::service::raft_timeout{});
}
}
@@ -327,7 +333,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
return announce_mutations(_qp, _group0_client, std::move(query), {sstring(role_name)}, &_as, ::service::raft_timeout{});
return announce_mutations(_qp, _group0_client, std::move(query), {sstring(role_name)}, _as, ::service::raft_timeout{});
}
});
}
@@ -377,7 +383,7 @@ future<> standard_role_manager::drop(std::string_view role_name) {
co_await _qp.execute_internal(query, {sstring(role_name)},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
};
// Finally, delete the role itself.
@@ -395,7 +401,7 @@ future<> standard_role_manager::drop(std::string_view role_name) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
};
@@ -428,7 +434,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, std::move(query),
{role_set{sstring(role_name)}, sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{role_set{sstring(role_name)}, sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
};
@@ -447,7 +453,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_return co_await announce_mutations(_qp, _group0_client, insert_query,
{sstring(role_name), sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
}
@@ -464,7 +470,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_return co_await announce_mutations(_qp, _group0_client, delete_query,
{sstring(role_name), sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
}
}
@@ -638,7 +644,7 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, _as, ::service::raft_timeout{});
}
}
@@ -653,7 +659,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{sstring(role_name), sstring(attribute_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(attribute_name)}, _as, ::service::raft_timeout{});
}
}
}

View File

@@ -79,7 +79,7 @@ public:
private:
enum class membership_change { add, remove };
future<> create_metadata_tables_if_missing() const;
future<> create_legacy_metadata_tables_if_missing() const;
bool legacy_metadata_exists();

14
bin/nodetool Executable file
View File

@@ -0,0 +1,14 @@
#!/bin/bash
#
# Copyright (C) 2024-present ScyllaDB
#
#
# SPDX-License-Identifier: AGPL-3.0-or-later
#
SCRIPT_PATH=$(dirname $(realpath "$0"))
INSTALLED_SCYLLA_PATH="${SCRIPT_PATH}/scylla"
# Allow plugging scylla path for local testing
exec ${SCYLLA:-${INSTALLED_SCYLLA_PATH}} nodetool $@

View File

@@ -11,6 +11,7 @@ target_include_directories(cdc
${CMAKE_SOURCE_DIR})
target_link_libraries(cdc
PUBLIC
absl::headers
Seastar::seastar
xxHash::xxhash
PRIVATE

View File

@@ -66,10 +66,10 @@ public:
} // namespace cdc
template <> struct fmt::formatter<cdc::image_mode> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<cdc::image_mode> : fmt::formatter<string_view> {
auto format(cdc::image_mode, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <> struct fmt::formatter<cdc::delta_mode> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<cdc::delta_mode> : fmt::formatter<string_view> {
auto format(cdc::delta_mode, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -28,6 +28,7 @@
#include "gms/feature_service.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
#include "utils/to_string.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
@@ -183,7 +184,7 @@ static std::vector<stream_id> create_stream_ids(
size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
std::vector<stream_id> result;
result.reserve(shard_count);
dht::sharder sharder(shard_count, ignore_msb);
dht::static_sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
@@ -472,16 +473,21 @@ static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_a
static future<std::optional<cdc::topology_description>> retrieve_generation_data_v2(
cdc::generation_id_v2 id,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks) {
auto cdc_gen = co_await sys_dist_ks.read_cdc_generation(id.id);
if (raft_experimental_topology && !cdc_gen) {
if (!cdc_gen && id.id.is_timestamp()) {
// If we entered legacy mode due to recovery, we (or some other node)
// might gossip about a generation that was previously propagated
// through raft. If that's the case, it will sit in
// the system.cdc_generations_v3 table.
//
// If the provided id is not a timeuuid, we don't want to query
// the system.cdc_generations_v3 table. This table stores generation
// ids as timeuuids. If the provided id is not a timeuuid, the
// generation cannot be in system.cdc_generations_v3. Also, the query
// would fail with a marshaling error.
cdc_gen = co_await sys_ks.read_cdc_generation_opt(id.id);
}
@@ -490,7 +496,6 @@ static future<std::optional<cdc::topology_description>> retrieve_generation_data
static future<std::optional<cdc::topology_description>> retrieve_generation_data(
cdc::generation_id gen_id,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
@@ -499,14 +504,13 @@ static future<std::optional<cdc::topology_description>> retrieve_generation_data
return sys_dist_ks.read_cdc_topology_description(id, ctx);
},
[&] (const cdc::generation_id_v2& id) {
return retrieve_generation_data_v2(id, raft_experimental_topology, sys_ks, sys_dist_ks);
return retrieve_generation_data_v2(id, sys_ks, sys_dist_ks);
}
), gen_id);
}
static future<> do_update_streams_description(
cdc::generation_id gen_id,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
@@ -517,7 +521,7 @@ static future<> do_update_streams_description(
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = co_await retrieve_generation_data(gen_id, raft_experimental_topology, sys_ks, sys_dist_ks, ctx);
auto topo = co_await retrieve_generation_data(gen_id, sys_ks, sys_dist_ks, ctx);
if (!topo) {
throw no_generation_data_exception(gen_id);
}
@@ -536,13 +540,12 @@ static future<> do_update_streams_description(
*/
static future<> update_streams_description(
cdc::generation_id gen_id,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
co_await do_update_streams_description(gen_id, raft_experimental_topology, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
@@ -550,7 +553,6 @@ static future<> update_streams_description(
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)(([] (cdc::generation_id gen_id,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
@@ -563,7 +565,7 @@ static future<> update_streams_description(
co_return;
}
try {
co_await do_update_streams_description(gen_id, raft_experimental_topology, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_return;
} catch (...) {
cdc_log.warn(
@@ -571,7 +573,7 @@ static future<> update_streams_description(
gen_id, std::current_exception());
}
}
})(gen_id, raft_experimental_topology, sys_ks, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
})(gen_id, sys_ks, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
}
}
@@ -609,7 +611,6 @@ struct time_and_ttl {
*/
static future<std::optional<cdc::generation_id_v1>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
bool raft_experimental_topology,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
@@ -654,7 +655,7 @@ static future<std::optional<cdc::generation_id_v1>> rewrite_streams_descriptions
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(cdc::generation_id_v1{ts}, raft_experimental_topology, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_return co_await do_update_streams_description(cdc::generation_id_v1{ts}, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
@@ -730,7 +731,6 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
_cfg.raft_experimental_topology,
_sys_ks.local(),
_sys_dist_ks.local_shared(),
std::move(get_num_token_owners),
@@ -906,7 +906,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
std::optional<topology_description> gen;
try {
gen = co_await retrieve_generation_data(*latest, _cfg.raft_experimental_topology, _sys_ks.local(), *sys_dist_ks, { tmptr->count_normal_token_owners() });
gen = co_await retrieve_generation_data(*latest, _sys_ks.local(), *sys_dist_ks, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
@@ -1022,7 +1022,7 @@ future<> generation_service::legacy_handle_cdc_generation(std::optional<cdc::gen
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *gen_id);
co_await update_streams_description(*gen_id, _cfg.raft_experimental_topology, _sys_ks.local(), get_sys_dist_ks(),
co_await update_streams_description(*gen_id, _sys_ks.local(), get_sys_dist_ks(),
[tmptr = _token_metadata.get()] { return tmptr->count_normal_token_owners(); },
_abort_src);
}
@@ -1039,7 +1039,7 @@ void generation_service::legacy_async_handle_cdc_generation(cdc::generation_id g
bool using_this_gen = co_await svc->legacy_do_handle_cdc_generation_intercept_nonfatal_errors(gen_id);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", gen_id);
co_await update_streams_description(gen_id, svc->_cfg.raft_experimental_topology, svc->_sys_ks.local(), svc->get_sys_dist_ks(),
co_await update_streams_description(gen_id, svc->_sys_ks.local(), svc->get_sys_dist_ks(),
[tmptr = svc->_token_metadata.get()] { return tmptr->count_normal_token_owners(); },
svc->_abort_src);
}
@@ -1108,7 +1108,7 @@ future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _cfg.raft_experimental_topology, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"

View File

@@ -189,7 +189,7 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
#if FMT_VERSION < 100000
// fmt v10 introduced formatter for std::exception
template <>
struct fmt::formatter<cdc::no_generation_data_exception> : fmt::formatter<std::string_view> {
struct fmt::formatter<cdc::no_generation_data_exception> : fmt::formatter<string_view> {
auto format(const cdc::no_generation_data_exception& e, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", e.what());
}

View File

@@ -41,7 +41,6 @@ public:
unsigned ignore_msb_bits;
std::chrono::milliseconds ring_delay;
bool dont_rewrite_streams = false;
bool raft_experimental_topology = false;
};
private:

View File

@@ -155,10 +155,10 @@ public:
friend fmt::formatter<bound_view>;
};
template <> struct fmt::formatter<bound_kind> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<bound_kind> : fmt::formatter<string_view> {
auto format(bound_kind, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <> struct fmt::formatter<bound_view> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<bound_view> : fmt::formatter<string_view> {
auto format(const bound_view& b, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{{bound: prefix={},kind={}}}", b._prefix.get(), b._kind);
}

View File

@@ -124,7 +124,7 @@ public:
position_range_iterator end() const { return {_set.end()}; }
};
template <> struct fmt::formatter<clustering_interval_set> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<clustering_interval_set> : fmt::formatter<string_view> {
auto format(const clustering_interval_set& set, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{{{}}}", fmt::join(set, ",\n "));
}

View File

@@ -69,6 +69,7 @@ function(add_version_library name source)
scylla-version-gen)
target_compile_definitions(${name}
PRIVATE
SCYLLA_PRODUCT=\"${Scylla_PRODUCT}\"
SCYLLA_VERSION=\"${Scylla_VERSION}\"
SCYLLA_RELEASE=\"${Scylla_RELEASE}\")
target_link_libraries(${name}

View File

@@ -41,7 +41,7 @@ macro(dist_submodule name dir pkgs)
endif()
set(pkg_name "${Scylla_PRODUCT}-${name}-${Scylla_VERSION}-${Scylla_RELEASE}.${arch}.tar.gz")
set(reloc_pkg "${CMAKE_SOURCE_DIR}/tools/${dir}/build/${pkg_name}")
set(dist_pkg "${CMAKE_CURRENT_BINARY_DIR}/${pkg_name}")
set(dist_pkg "${CMAKE_BINARY_DIR}/$<CONFIG>/dist/tar/${pkg_name}")
add_custom_command(
OUTPUT ${dist_pkg}
COMMAND ${CMAKE_COMMAND} -E copy ${reloc_pkg} ${dist_pkg}

View File

@@ -106,6 +106,7 @@ auto fmt::formatter<collection_mutation_view::printer>::format(const collection_
[&] (const collection_type_impl& ctype) {
auto&& key_type = ctype.name_comparator();
auto&& value_type = ctype.value_comparator();
out = fmt::format_to(out, " collection cells {{");
for (auto&& [key, value] : cmvd.cells) {
if (!first) {
out = fmt::format_to(out, ", ");
@@ -113,20 +114,25 @@ auto fmt::formatter<collection_mutation_view::printer>::format(const collection_
fmt::format_to(out, "{}: {}", key_type->to_string(key), atomic_cell_view::printer(*value_type, value));
first = false;
}
out = fmt::format_to(out, "}}");
},
[&] (const user_type_impl& utype) {
out = fmt::format_to(out, " user-type cells {{");
for (auto&& [raw_idx, value] : cmvd.cells) {
if (!first) {
if (first) {
out = fmt::format_to(out, " ");
} else {
out = fmt::format_to(out, ", ");
}
auto idx = deserialize_field_index(raw_idx);
out = fmt::format_to(out, "{}: {}", utype.field_name_as_string(idx), atomic_cell_view::printer(*utype.type(idx), value));
first = false;
}
out = fmt::format_to(out, "}}");
},
[&] (const abstract_type& o) {
// Not throwing exception in this likely-to-be debug context
out = fmt::format_to(out, "attempted to pretty-print collection_mutation_view_description with type {}", o.name());
out = fmt::format_to(out, " attempted to pretty-print collection_mutation_view_description with type {}", o.name());
}
));
});

View File

@@ -132,7 +132,7 @@ collection_mutation difference(const abstract_type&, collection_mutation_view, c
bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view);
template <>
struct fmt::formatter<collection_mutation_view::printer> : fmt::formatter<std::string_view> {
struct fmt::formatter<collection_mutation_view::printer> : fmt::formatter<string_view> {
auto format(const collection_mutation_view::printer&, fmt::format_context& ctx) const
-> decltype(ctx.out());
};

View File

@@ -171,7 +171,8 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.main_sstable_set().all());
auto sstable_set = table_s.sstable_set_for_tombstone_gc();
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*sstable_set->all());
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
all_sstables.insert(all_sstables.end(), compacted_undeleted.begin(), compacted_undeleted.end());
boost::sort(all_sstables, [] (const shared_sstable& x, const shared_sstable& y) {
@@ -451,6 +452,7 @@ protected:
uint64_t _compacting_data_file_size = 0;
api::timestamp_type _compacting_max_timestamp = api::min_timestamp;
uint64_t _estimated_partitions = 0;
double _estimated_droppable_tombstone_ratio = 0;
uint64_t _bloom_filter_checks = 0;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
@@ -524,9 +526,6 @@ protected:
, _owned_ranges_checker(_owned_ranges ? std::optional<dht::incremental_owned_ranges_checker>(*_owned_ranges) : std::nullopt)
, _progress_monitor(progress_monitor)
{
for (auto& sst : _sstables) {
_stats_collector.update(sst->get_encoding_stats_for_compaction());
}
std::unordered_set<run_id> ssts_run_ids;
_contains_multi_fragment_runs = std::any_of(_sstables.begin(), _sstables.end(), [&ssts_run_ids] (shared_sstable& sst) {
return !ssts_run_ids.insert(sst->run_identifier()).second;
@@ -609,7 +608,8 @@ protected:
sstable_writer_config cfg = _table_s.configure_writer("garbage_collection");
cfg.run_identifier = gc_run;
cfg.monitor = monitor.get();
auto writer = sst->get_writer(*schema(), partitions_per_sstable(), cfg, get_encoding_stats());
uint64_t estimated_partitions = std::max(1UL, uint64_t(ceil(partitions_per_sstable() * _estimated_droppable_tombstone_ratio)));
auto writer = sst->get_writer(*schema(), estimated_partitions, cfg, get_encoding_stats());
return compaction_writer(std::move(monitor), std::move(writer), std::move(sst));
}
@@ -727,6 +727,7 @@ private:
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
min_max_tracker<api::timestamp_type> timestamp_tracker;
double sum_of_estimated_droppable_tombstone_ratio = 0;
_input_sstable_generations.reserve(_sstables.size());
for (auto& sst : _sstables) {
co_await coroutine::maybe_yield();
@@ -745,6 +746,7 @@ private:
log_debug("Fully expired sstable {} will be dropped on compaction completion", sst->get_filename());
continue;
}
_stats_collector.update(sst->get_encoding_stats_for_compaction());
_cdata.compaction_size += sst->data_size();
// We also capture the sstable, so we keep it alive while the read isn't done
@@ -753,6 +755,7 @@ private:
// for a better estimate for the number of partitions in the merged
// sstable than just adding up the lengths of individual sstables.
_estimated_partitions += sst->get_estimated_key_count();
sum_of_estimated_droppable_tombstone_ratio += sst->estimate_droppable_tombstone_ratio(gc_clock::now(), _table_s.get_tombstone_gc_state(), _schema);
_compacting_data_file_size += sst->ondisk_data_size();
_compacting_max_timestamp = std::max(_compacting_max_timestamp, sst->get_stats_metadata().max_timestamp);
if (sst->originated_on_this_node().value_or(false) && sst_stats.position.shard_id() == this_shard_id()) {
@@ -764,6 +767,8 @@ private:
log_debug("{} out of {} input sstables are fully expired sstables that will not be actually compacted",
_sstables.size() - ssts->size(), _sstables.size());
}
// _estimated_droppable_tombstone_ratio could exceed 1.0 in certain cases, so limit it to 1.0.
_estimated_droppable_tombstone_ratio = std::min(1.0, sum_of_estimated_droppable_tombstone_ratio / ssts->size());
_compacting = std::move(ssts);
@@ -1701,7 +1706,12 @@ public:
}
compaction_writer create_compaction_writer(const dht::decorated_key& dk) override {
auto shard = _sharder->shard_of(dk.token());
auto shards = _sharder->shard_for_writes(dk.token());
if (shards.size() != 1) {
// Resharding is not supposed to run on tablets, so this case does not have to be supported.
on_internal_error(clogger, fmt::format("Got {} shards for token {} in table {}.{}", shards.size(), dk.token(), _schema->ks_name(), _schema->cf_name()));
}
auto shard = shards[0];
auto sst = _sstable_creator(shard);
setup_new_sstable(sst);

View File

@@ -213,14 +213,14 @@ struct compaction_descriptor {
}
template <>
struct fmt::formatter<sstables::compaction_type> : fmt::formatter<std::string_view> {
struct fmt::formatter<sstables::compaction_type> : fmt::formatter<string_view> {
auto format(sstables::compaction_type, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <>
struct fmt::formatter<sstables::compaction_type_options::scrub::mode> : fmt::formatter<std::string_view> {
struct fmt::formatter<sstables::compaction_type_options::scrub::mode> : fmt::formatter<string_view> {
auto format(sstables::compaction_type_options::scrub::mode, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <>
struct fmt::formatter<sstables::compaction_type_options::scrub::quarantine_mode> : fmt::formatter<std::string_view> {
struct fmt::formatter<sstables::compaction_type_options::scrub::quarantine_mode> : fmt::formatter<string_view> {
auto format(sstables::compaction_type_options::scrub::quarantine_mode, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -14,6 +14,7 @@
#include "sstables/sstables.hh"
#include "sstables/sstables_manager.hh"
#include <memory>
#include <fmt/ranges.h>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/switch_to.hh>
@@ -386,11 +387,26 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables_a
co_return res;
}
future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombstone_gc(table_state& t) {
auto compound_set = t.sstable_set_for_tombstone_gc();
// Compound set will be linearized into a single set, since compaction might add or remove sstables
// to it for incremental compaction to work.
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);
co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {
auto inserted = new_set.insert(sst);
if (!inserted) {
on_internal_error(cmlog, format("Unable to insert SSTable {} into set used for tombstone GC", sst->get_filename()));
}
});
co_return std::move(new_set);
}
future<sstables::compaction_result> compaction_task_executor::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, compaction_manager::can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
descriptor.enable_garbage_collection(co_await sstable_set_for_tombstone_gc(t));
}
descriptor.creator = [&t] (shard_id dummy) {
auto sst = t.make_sstable();
@@ -488,7 +504,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -513,7 +529,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -550,6 +566,14 @@ protected:
// the exclusive lock can be freed to let regular compaction run in parallel to major
lock_holder.return_all();
co_await utils::get_local_injector().inject("major_compaction_wait", [this] (auto& handler) -> future<> {
cmlog.info("major_compaction_wait: waiting");
while (!handler.poll_for_message() && !_compaction_data.is_stop_requested()) {
co_await sleep(std::chrono::milliseconds(5));
}
cmlog.info("major_compaction_wait: released");
});
co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace);
finish_compaction();
@@ -628,7 +652,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -854,12 +878,11 @@ void compaction_task_executor::finish_compaction(state finish_state) noexcept {
_compaction_state.compaction_done.signal();
}
future<> compaction_task_executor::abort(abort_source& as) noexcept {
void compaction_task_executor::abort(abort_source& as) noexcept {
if (!as.abort_requested()) {
as.request_abort();
stop_compaction("user requested abort");
}
return make_ready_future();
}
void compaction_task_executor::stop_compaction(sstring reason) noexcept {
@@ -976,8 +999,11 @@ void compaction_manager::enable() {
std::function<void()> compaction_manager::compaction_submission_callback() {
return [this] () mutable {
for (auto& e: _compaction_state) {
postpone_compaction_for_table(e.first);
auto now = gc_clock::now();
for (auto& [table, state] : _compaction_state) {
if (now - state.last_regular_compaction > periodic_compaction_submission_interval()) {
postpone_compaction_for_table(table);
}
}
reevaluate_postponed_compactions();
};
@@ -1177,7 +1203,7 @@ public:
, regular_compaction_task_impl(mgr._task_manager_module, tasks::task_id::create_random_id(), mgr._task_manager_module->new_sequence_number(), t.schema()->ks_name(), t.schema()->cf_name(), "", tasks::task_id::create_null_id())
{}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -1227,6 +1253,7 @@ protected:
fmt::ptr(this), descriptor.sstables.size(), weight, t);
setup_new_compaction(descriptor.run_identifier);
_compaction_state.last_regular_compaction = gc_clock::now();
std::exception_ptr ex;
try {
@@ -1347,7 +1374,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -1374,13 +1401,20 @@ private:
}));
};
auto get_next_job = [&] () -> std::optional<sstables::compaction_descriptor> {
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), sstables::reshape_mode::strict);
return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
auto get_next_job = [&] () -> future<std::optional<sstables::compaction_descriptor>> {
auto candidates = get_reshape_candidates();
if (candidates.empty()) {
co_return std::nullopt;
}
// all sstables added to maintenance set share the same underlying storage.
auto& storage = candidates.front()->get_storage();
sstables::reshape_config cfg = co_await sstables::make_reshape_config(storage, sstables::reshape_mode::strict);
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), cfg);
co_return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
};
std::exception_ptr err;
while (auto desc = get_next_job()) {
while (auto desc = co_await get_next_job()) {
auto compacting = compacting_sstable_registration(_cm, _cm.get_compaction_state(&t), desc->sstables);
auto on_replace = compacting.update_on_sstable_replacement();
@@ -1512,11 +1546,16 @@ protected:
co_return stats;
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options& opt, owned_ranges_ptr owned_ranges = {}) {
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
return sstables::compaction_descriptor({ sst },
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, _options, _owned_ranges_ptr);
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, opt, owned_ranges);
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
return make_descriptor(sst, _options, _owned_ranges_ptr);
}
virtual future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) {
@@ -1569,19 +1608,30 @@ public:
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::yes)
, _opt(options.as<sstables::compaction_type_options::split>())
{
if (utils::get_local_injector().is_enabled("split_sstable_rewrite")) {
_do_throw_if_stopping = throw_if_stopping::yes;
}
}
static bool sstable_needs_split(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& opt) {
return opt.classifier(sst->get_first_decorated_key().token()) != opt.classifier(sst->get_last_decorated_key().token());
}
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& split_opt) {
auto opt = sstables::compaction_type_options::make_split(split_opt.classifier);
return rewrite_sstables_compaction_task_executor::make_descriptor(sst, std::move(opt));
}
private:
bool sstable_needs_split(const sstables::shared_sstable& sst) const {
return _opt.classifier(sst->get_first_decorated_key().token()) != _opt.classifier(sst->get_last_decorated_key().token());
return sstable_needs_split(sst, _opt);
}
protected:
sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const override {
auto desc = rewrite_sstables_compaction_task_executor::make_descriptor(sst);
desc.options = sstables::compaction_type_options::make_split(_opt.classifier);
return desc;
return make_descriptor(sst, _opt);
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
future<sstables::compaction_result> do_rewrite_sstable(const sstables::shared_sstable sst) {
if (sstable_needs_split(sst)) {
return rewrite_sstables_compaction_task_executor::rewrite_sstable(std::move(sst));
}
@@ -1594,6 +1644,20 @@ protected:
return sstables::compaction_result{};
});
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
co_await utils::get_local_injector().inject("split_sstable_rewrite", [this] (auto& handler) -> future<> {
cmlog.info("split_sstable_rewrite: waiting");
while (!handler.poll_for_message() && !_compaction_data.is_stop_requested()) {
co_await sleep(std::chrono::milliseconds(5));
}
cmlog.info("split_sstable_rewrite: released");
if (_compaction_data.is_stop_requested()) {
throw make_compaction_stopped_exception();
}
}, false);
co_return co_await do_rewrite_sstable(std::move(sst));
}
};
}
@@ -1750,7 +1814,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -2015,6 +2079,31 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>(info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables));
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
co_return std::vector<sstables::shared_sstable>{sst};
}
std::vector<sstables::shared_sstable> ret;
// FIXME: indentation.
auto gate = get_compaction_state(&t).gate.hold();
sstables::compaction_progress_monitor monitor;
sstables::compaction_data info = create_compaction_data();
sstables::compaction_descriptor desc = split_compaction_task_executor::make_descriptor(sst, opt);
desc.creator = [&t] (shard_id _) {
return t.make_sstable();
};
desc.replacer = [&] (sstables::compaction_completion_desc d) {
std::move(d.new_sstables.begin(), d.new_sstables.end(), std::back_inserter(ret));
};
co_await sstables::compact_sstables(std::move(desc), info, t, monitor);
co_await sst->unlink();
co_return ret;
}
// Submit a table to be scrubbed and wait for its termination.
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(table_state& t, sstables::compaction_type_options::scrub opts, std::optional<tasks::task_info> info) {
auto scrub_mode = opts.operation_mode;

View File

@@ -350,6 +350,11 @@ public:
// or user aborted splitting using stop API.
future<compaction_stats_opt> perform_split_compaction(compaction::table_state& t, sstables::compaction_type_options::split opt, std::optional<tasks::task_info> info = std::nullopt);
// Splits a single SSTable by segregating all its data according to the classifier.
// If SSTable doesn't need split, the same input SSTable is returned as output.
// If SSTable needs split, then output SSTables are returned and the input SSTable is deleted.
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt);
// Run a custom job for a given table, defined by a function
// it completes when future returned by job is ready or returns immediately
// if manager was asked to stop.
@@ -589,12 +594,14 @@ private:
future<compaction_manager::compaction_stats_opt> compaction_done() noexcept {
return _compaction_done.get_future();
}
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::table_state& t);
public:
bool stopping() const noexcept {
return _compaction_data.abort.abort_requested();
}
future<> abort(abort_source& as) noexcept;
void abort(abort_source& as) noexcept;
void stop_compaction(sstring reason) noexcept;

View File

@@ -15,6 +15,7 @@
#include "compaction/compaction_fwd.hh"
#include "compaction/compaction_backlog_manager.hh"
#include "gc_clock.hh"
namespace compaction {
@@ -37,6 +38,8 @@ struct compaction_state {
std::unordered_set<sstables::shared_sstable> sstables_requiring_cleanup;
compaction::owned_ranges_ptr owned_ranges_ptr;
gc_clock::time_point last_regular_compaction;
explicit compaction_state(table_state& t);
compaction_state(compaction_state&&) = delete;
~compaction_state();

View File

@@ -11,8 +11,9 @@
#include <vector>
#include <chrono>
#include <fmt/ranges.h>
#include <seastar/core/shared_ptr.hh>
#include "seastar/core/on_internal_error.hh"
#include <seastar/core/on_internal_error.hh>
#include "sstables/shared_sstable.hh"
#include "sstables/sstables.hh"
#include "compaction_strategy.hh"
@@ -31,6 +32,7 @@
#include "compaction_backlog_manager.hh"
#include "size_tiered_backlog_tracker.hh"
#include "leveled_manifest.hh"
#include "utils/to_string.hh"
logging::logger leveled_manifest::logger("LeveledManifest");
@@ -81,7 +83,7 @@ reader_consumer_v2 compaction_strategy_impl::make_interposer_consumer(const muta
}
compaction_descriptor
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return compaction_descriptor();
}
@@ -148,7 +150,7 @@ static bool validate_unchecked_tombstone_compaction(const std::map<sstring, sstr
auto tmp_value = compaction_strategy_impl::get_value(options, compaction_strategy_impl::UNCHECKED_TOMBSTONE_COMPACTION_OPTION);
if (tmp_value.has_value()) {
if (tmp_value != "true" && tmp_value != "false") {
throw exceptions::configuration_exception(fmt::format("{} value ({}) must be \"true\" or \"false\"", compaction_strategy_impl::UNCHECKED_TOMBSTONE_COMPACTION_OPTION, tmp_value));
throw exceptions::configuration_exception(fmt::format("{} value ({}) must be \"true\" or \"false\"", compaction_strategy_impl::UNCHECKED_TOMBSTONE_COMPACTION_OPTION, *tmp_value));
}
unchecked_tombstone_compaction = tmp_value == "true";
}
@@ -626,7 +628,7 @@ leveled_compaction_strategy::calculate_max_sstable_size_in_mb(std::optional<sstr
max_size);
} else if (max_size < 50) {
leveled_manifest::logger.warn("Max sstable size of {}MB is configured. Testing done for CASSANDRA-5727 indicates that performance" \
"improves up to 160MB", max_size);
" improves up to 160MB", max_size);
}
return max_size;
}
@@ -726,8 +728,8 @@ compaction_backlog_tracker compaction_strategy::make_backlog_tracker() const {
}
sstables::compaction_descriptor
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, mode);
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, cfg);
}
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) const {
@@ -765,6 +767,13 @@ compaction_strategy make_compaction_strategy(compaction_strategy_type strategy,
return compaction_strategy(std::move(impl));
}
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode) {
co_return sstables::reshape_config{
.mode = mode,
.free_storage_space = co_await storage.free_space() / smp::count,
};
}
}
namespace compaction {

View File

@@ -30,6 +30,7 @@ class compaction_strategy_impl;
class sstable;
class sstable_set;
struct compaction_descriptor;
class storage;
class compaction_strategy {
::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;
@@ -121,11 +122,13 @@ public:
//
// The caller should also pass a maximum number of SSTables which is the maximum amount of
// SSTables that can be added into a single job.
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
// Creates a compaction_strategy object from one of the strategies available.
compaction_strategy make_compaction_strategy(compaction_strategy_type strategy, const std::map<sstring, sstring>& options);
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode);
}

View File

@@ -76,6 +76,6 @@ public:
return false;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
}

View File

@@ -8,6 +8,8 @@
#pragma once
#include <cstdint>
namespace sstables {
enum class compaction_strategy_type {
@@ -18,4 +20,10 @@ enum class compaction_strategy_type {
};
enum class reshape_mode { strict, relaxed };
struct reshape_config {
reshape_mode mode;
const uint64_t free_storage_space;
};
}

View File

@@ -146,7 +146,8 @@ int64_t leveled_compaction_strategy::estimated_pending_compactions(table_state&
}
compaction_descriptor
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::array<std::vector<shared_sstable>, leveled_manifest::MAX_LEVELS> level_info;
auto is_disjoint = [schema] (const std::vector<shared_sstable>& sstables, unsigned tolerance) -> std::tuple<bool, unsigned> {
@@ -203,7 +204,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
if (level_info[0].size() > offstrategy_threshold) {
size_tiered_compaction_strategy stcs(_stcs_options);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, mode);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, cfg);
}
for (unsigned level = leveled_manifest::MAX_LEVELS - 1; level > 0; --level) {

View File

@@ -74,7 +74,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}

View File

@@ -298,8 +298,9 @@ size_tiered_compaction_strategy::most_interesting_bucket(const std::vector<sstab
}
compaction_descriptor
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const
{
auto mode = cfg.mode;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));

View File

@@ -96,7 +96,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
friend class ::size_tiered_backlog_tracker;
};

View File

@@ -39,6 +39,7 @@ public:
virtual bool compaction_enforce_min_threshold() const noexcept = 0;
virtual const sstables::sstable_set& main_sstable_set() const = 0;
virtual const sstables::sstable_set& maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual sstables::compaction_strategy& get_compaction_strategy() const noexcept = 0;
@@ -63,7 +64,7 @@ public:
namespace fmt {
template <>
struct formatter<compaction::table_state> : formatter<std::string_view> {
struct formatter<compaction::table_state> : formatter<string_view> {
template <typename FormatContext>
auto format(const compaction::table_state& t, FormatContext& ctx) const {
auto s = t.schema();

View File

@@ -159,9 +159,9 @@ future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::
// There is a semaphore inside the compaction manager in run_resharding_jobs. So we
// parallel_for_each so the statistics about pending jobs are updated to reflect all
// jobs. But only one will run in parallel at a time
auto& t = table.as_table_state();
auto& t = table.try_get_table_state_with_static_sharding();
co_await coroutine::parallel_for_each(buckets, [&] (std::vector<sstables::shared_sstable>& sstlist) mutable {
return table.get_compaction_manager().run_custom_job(table.as_table_state(), sstables::compaction_type::Reshard, "Reshard compaction", [&] (sstables::compaction_data& info, sstables::compaction_progress_monitor& progress_monitor) -> future<> {
return table.get_compaction_manager().run_custom_job(t, sstables::compaction_type::Reshard, "Reshard compaction", [&] (sstables::compaction_data& info, sstables::compaction_progress_monitor& progress_monitor) -> future<> {
auto erm = table.get_effective_replication_map(); // keep alive around compaction.
sstables::compaction_descriptor desc(sstlist);
@@ -595,28 +595,35 @@ future<> table_reshaping_compaction_task_impl::run() {
future<> shard_reshaping_compaction_task_impl::run() {
auto& table = _db.local().find_column_family(_status.keyspace, _status.table);
auto holder = table.async_gate().hold();
tasks::task_info info{_status.id, _status.shard};
std::unordered_map<size_t, std::unordered_set<sstables::shared_sstable>> sstables_grouped_by_compaction_group;
std::unordered_map<compaction::table_state*, std::unordered_set<sstables::shared_sstable>> sstables_grouped_by_compaction_group;
for (auto& sstable : _dir.get_unshared_local_sstables()) {
auto compaction_group_id = table.get_compaction_group_id_for_sstable(sstable);
sstables_grouped_by_compaction_group[compaction_group_id].insert(sstable);
auto& t = table.table_state_for_sstable(sstable);
sstables_grouped_by_compaction_group[&t].insert(sstable);
}
// reshape sstables individually within the compaction groups
for (auto& sstables_in_cg : sstables_grouped_by_compaction_group | boost::adaptors::map_values) {
co_await reshape_compaction_group(sstables_in_cg, table, info);
for (auto& sstables_in_cg : sstables_grouped_by_compaction_group) {
co_await reshape_compaction_group(*sstables_in_cg.first, sstables_in_cg.second, table, info);
}
}
future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info) {
future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(compaction::table_state& t, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info) {
while (true) {
auto reshape_candidates = boost::copy_range<std::vector<sstables::shared_sstable>>(sstables_in_cg
| boost::adaptors::filtered([&filter = _filter] (const auto& sst) {
return filter(sst);
}));
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), _mode);
if (reshape_candidates.empty()) {
break;
}
// all sstables were found in the same sstable_directory instance, so they share the same underlying storage.
auto& storage = reshape_candidates.front()->get_storage();
auto cfg = co_await sstables::make_reshape_config(storage, _mode);
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), cfg);
if (desc.sstables.empty()) {
break;
}
@@ -635,8 +642,8 @@ future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(std::uno
desc.creator = _creator;
try {
co_await table.get_compaction_manager().run_custom_job(table.as_table_state(), sstables::compaction_type::Reshape, "Reshape compaction", [&dir = _dir, &table, sstlist = std::move(sstlist), desc = std::move(desc), &sstables_in_cg] (sstables::compaction_data& info, sstables::compaction_progress_monitor& progress_monitor) mutable -> future<> {
sstables::compaction_result result = co_await sstables::compact_sstables(std::move(desc), info, table.as_table_state(), progress_monitor);
co_await table.get_compaction_manager().run_custom_job(t, sstables::compaction_type::Reshape, "Reshape compaction", [&dir = _dir, sstlist = std::move(sstlist), desc = std::move(desc), &sstables_in_cg, &t] (sstables::compaction_data& info, sstables::compaction_progress_monitor& progress_monitor) mutable -> future<> {
sstables::compaction_result result = co_await sstables::compact_sstables(std::move(desc), info, t, progress_monitor);
// update the sstables_in_cg set with new sstables and remove the reshaped ones
for (auto& sst : sstlist) {
sstables_in_cg.erase(sst);

View File

@@ -606,7 +606,7 @@ private:
std::function<bool (const sstables::shared_sstable&)> _filter;
uint64_t& _total_shard_size;
future<> reshape_compaction_group(std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info);
future<> reshape_compaction_group(compaction::table_state& t, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info);
public:
shard_reshaping_compaction_task_impl(tasks::task_manager::module_ptr module,
std::string keyspace,

View File

@@ -226,12 +226,14 @@ reader_consumer_v2 time_window_compaction_strategy::make_interposer_consumer(con
}
compaction_descriptor
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::vector<shared_sstable> single_window;
std::vector<shared_sstable> multi_window;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
const uint64_t target_job_size = cfg.free_storage_space * reshape_target_space_overhead;
if (mode == reshape_mode::relaxed) {
offstrategy_threshold = max_sstables;
@@ -263,22 +265,41 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
multi_window.size(), !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0,
single_window.size(), !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0);
auto need_trimming = [max_sstables, schema, &is_disjoint] (const std::vector<shared_sstable>& ssts) {
// All sstables can be compacted at once if they're disjoint, given that partitioned set
// will incrementally open sstables which translates into bounded memory usage.
return ssts.size() > max_sstables && !is_disjoint(ssts);
auto get_job_size = [] (const std::vector<shared_sstable>& ssts) {
return boost::accumulate(ssts | boost::adaptors::transformed(std::mem_fn(&sstable::bytes_on_disk)), uint64_t(0));
};
// Targets a space overhead of 10%. All disjoint sstables can be compacted together as long as they won't
// cause an overhead above target. Otherwise, the job targets a maximum of #max_threshold sstables.
auto need_trimming = [&] (const std::vector<shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
const size_t min_sstables = 2;
auto is_above_target_size = job_size > target_job_size;
return (ssts.size() > max_sstables && !is_disjoint) ||
(ssts.size() > min_sstables && is_above_target_size);
};
auto maybe_trim_job = [&need_trimming] (std::vector<shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
while (need_trimming(ssts, job_size, is_disjoint)) {
auto sst = ssts.back();
ssts.pop_back();
job_size -= sst->bytes_on_disk();
}
};
if (!multi_window.empty()) {
auto disjoint = is_disjoint(multi_window);
auto job_size = get_job_size(multi_window);
// Everything that spans multiple windows will need reshaping
if (need_trimming(multi_window)) {
if (need_trimming(multi_window, job_size, disjoint)) {
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
auto sort_size = std::min(max_sstables, multi_window.size());
boost::partial_sort(multi_window, multi_window.begin() + sort_size, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
multi_window.resize(max_sstables);
maybe_trim_job(multi_window, job_size, disjoint);
}
compaction_descriptor desc(std::move(multi_window));
desc.options = compaction_type_options::make_reshape();
@@ -297,15 +318,17 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
std::copy(ssts.begin(), ssts.end(), std::back_inserter(single_window));
continue;
}
// reuse STCS reshape logic which will only compact similar-sized files, to increase overall efficiency
// when reshaping time buckets containing a huge amount of files
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, mode);
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, cfg);
if (!desc.sstables.empty()) {
return desc;
}
}
}
if (!single_window.empty()) {
maybe_trim_job(single_window, get_job_size(single_window), all_disjoint);
compaction_descriptor desc(std::move(single_window));
desc.options = compaction_type_options::make_reshape();
return desc;

View File

@@ -76,6 +76,7 @@ public:
// To prevent an explosion in the number of sstables we cap it.
// Better co-locate some windows into the same sstables than OOM.
static constexpr uint64_t max_data_segregation_window_count = 100;
static constexpr float reshape_target_space_overhead = 0.1f;
using bucket_t = std::vector<shared_sstable>;
enum class bucket_compaction_mode { none, size_tiered, major };
@@ -168,7 +169,7 @@ public:
return true;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}

View File

@@ -524,7 +524,7 @@ public:
};
template <typename Component>
struct fmt::formatter<std::pair<Component, composite::eoc>> : fmt::formatter<std::string_view> {
struct fmt::formatter<std::pair<Component, composite::eoc>> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const std::pair<Component, composite::eoc>& c, FormatContext& ctx) const {
if constexpr (std::same_as<Component, bytes_view>) {
@@ -636,7 +636,7 @@ public:
};
template <>
struct fmt::formatter<composite_view> : fmt::formatter<std::string_view> {
struct fmt::formatter<composite_view> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const composite_view& v, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{{{}, compound={}, static={}}}",
@@ -650,7 +650,7 @@ composite::composite(const composite_view& v)
{ }
template <>
struct fmt::formatter<composite> : fmt::formatter<std::string_view> {
struct fmt::formatter<composite> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const composite& v, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{}", composite_view(v));

View File

@@ -278,10 +278,8 @@ batch_size_fail_threshold_in_kb: 1024
# experimental_features:
# - udf
# - alternator-streams
# - consistent-topology-changes
# - broadcast-tables
# - keyspace-storage-options
# - tablets
# The directory where hints files are stored if hinted handoff is enabled.
# hints_directory: /var/lib/scylla/hints
@@ -560,7 +558,7 @@ murmur3_partitioner_ignore_msb_bits: 12
# enable_parallelized_aggregation: true
# Time for which task manager task is kept in memory after it completes.
task_ttl_in_seconds: 10
# task_ttl_in_seconds: 0
# In materialized views, restrictions are allowed only on the view's primary key columns.
# In old versions Scylla mistakenly allowed IS NOT NULL restrictions on columns which were not part
@@ -619,3 +617,17 @@ maintenance_socket: ignore
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
# Enables the tablets feature.
# When enabled, newly created keyspaces will have tablets enabled by default.
# That can be explicitly disabled in the CREATE KEYSPACE query
# by using the `tablets = {'enabled': false}` replication option.
#
# When the tablets feature is disabled, there is no way to enable tablets
# per keyspace.
#
# Note that creating keyspaces with tablets enabled is irreversible.
# Disabling the tablets feature may impact existing keyspaces that were created with tablets.
# For example, the tablets map would remain "frozen" and will not respond to topology changes
# like adding, removing, or replacing nodes, or to replication factor changes.
enable_tablets: true

View File

@@ -20,9 +20,6 @@ import textwrap
from shutil import which
from typing import NamedTuple
outdir = 'build'
tempfile.tempdir = f"{outdir}/tmp"
configure_args = str.join(' ', [shlex.quote(x) for x in sys.argv[1:] if not x.startswith('--out=')])
@@ -285,9 +282,9 @@ def generate_compdb(compdb, ninja, buildfile, modes):
# build mode-specific compdbs
for mode in modes:
mode_out = outdir + '/' + mode
submodule_compdbs = [mode_out + '/' + submodule + '/' + compdb for submodule in ['seastar']]
submodule_compdbs = [mode_out + '/' + submodule + '/' + compdb for submodule in ['seastar', 'abseil']]
with open(mode_out + '/' + compdb, 'w+b') as combined_mode_specific_compdb:
subprocess.run(['./scripts/merge-compdb.py', 'build/' + mode,
subprocess.run(['./scripts/merge-compdb.py', outdir + '/' + mode,
ninja_compdb.name] + submodule_compdbs, stdout=combined_mode_specific_compdb)
# sort modes by supposed indexing speed
@@ -459,8 +456,6 @@ modes = {
scylla_tests = set([
'test/boost/UUID_test',
'test/boost/pretty_printers_test',
'test/boost/cdc_generation_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_unit_test',
@@ -470,7 +465,9 @@ scylla_tests = set([
'test/boost/auth_test',
'test/boost/batchlog_manager_test',
'test/boost/big_decimal_test',
'test/boost/bptree_test',
'test/boost/broken_sstable_test',
'test/boost/btree_test',
'test/boost/bytes_ostream_test',
'test/boost/cache_algorithm_test',
'test/boost/cache_flat_mutation_reader_test',
@@ -479,13 +476,15 @@ scylla_tests = set([
'test/boost/canonical_mutation_test',
'test/boost/cartesian_product_test',
'test/boost/castas_fcts_test',
'test/boost/cdc_generation_test',
'test/boost/cdc_test',
'test/boost/cell_locker_test',
'test/boost/checksum_utils_test',
'test/boost/chunked_vector_test',
'test/boost/chunked_managed_vector_test',
'test/boost/chunked_vector_test',
'test/boost/clustering_ranges_walker_test',
'test/boost/column_mapping_test',
'test/boost/commitlog_cleanup_test',
'test/boost/commitlog_test',
'test/boost/compaction_group_test',
'test/boost/compound_test',
@@ -495,102 +494,124 @@ scylla_tests = set([
'test/boost/counter_test',
'test/boost/cql_auth_query_test',
'test/boost/cql_auth_syntax_test',
'test/boost/cql_query_test',
'test/boost/cql_functions_test',
'test/boost/cql_query_group_test',
'test/boost/cql_query_large_test',
'test/boost/cql_query_like_test',
'test/boost/cql_query_group_test',
'test/boost/cql_functions_test',
'test/boost/cql_query_test',
'test/boost/crc_test',
'test/boost/data_listeners_test',
'test/boost/database_test',
'test/boost/commitlog_cleanup_test',
'test/boost/dirty_memory_manager_test',
'test/boost/double_decker_test',
'test/boost/duration_test',
'test/boost/dynamic_bitset_test',
'test/boost/enum_option_test',
'test/boost/enum_set_test',
'test/boost/extensions_test',
'test/boost/error_injection_test',
'test/boost/estimated_histogram_test',
'test/boost/exception_container_test',
'test/boost/exceptions_fallback_test',
'test/boost/exceptions_optimized_test',
'test/boost/expr_test',
'test/boost/extensions_test',
'test/boost/filtering_test',
'test/boost/flat_mutation_reader_test',
'test/boost/flush_queue_test',
'test/boost/fragmented_temporary_buffer_test',
'test/boost/frozen_mutation_test',
'test/boost/generic_server_test',
'test/boost/gossiping_property_file_snitch_test',
'test/boost/group0_cmd_merge_test',
'test/boost/group0_test',
'test/boost/hash_test',
'test/boost/hashers_test',
'test/boost/hint_test',
'test/boost/idl_test',
'test/boost/index_with_paging_test',
'test/boost/input_stream_test',
'test/boost/intrusive_array_test',
'test/boost/json_cql_query_test',
'test/boost/json_test',
'test/boost/keys_test',
'test/boost/large_paging_state_test',
'test/boost/recent_entries_map_test',
'test/boost/like_matcher_test',
'test/boost/limiting_data_source_test',
'test/boost/linearizing_input_stream_test',
'test/boost/lister_test',
'test/boost/loading_cache_test',
'test/boost/locator_topology_test',
'test/boost/log_heap_test',
'test/boost/estimated_histogram_test',
'test/boost/summary_test',
'test/boost/logalloc_test',
'test/boost/logalloc_standard_allocator_segment_pool_backend_test',
'test/boost/managed_vector_test',
'test/boost/logalloc_test',
'test/boost/managed_bytes_test',
'test/boost/intrusive_array_test',
'test/boost/managed_vector_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/multishard_combining_reader_as_mutation_source_test',
'test/boost/multishard_mutation_query_test',
'test/boost/murmur_hash_test',
'test/boost/mutation_fragment_test',
'test/boost/mutation_query_test',
'test/boost/mutation_reader_test',
'test/boost/multishard_combining_reader_as_mutation_source_test',
'test/boost/mutation_test',
'test/boost/mutation_writer_test',
'test/boost/mvcc_test',
'test/boost/network_topology_strategy_test',
'test/boost/token_metadata_test',
'test/boost/tablets_test',
'test/boost/sessions_test',
'test/boost/nonwrapping_interval_test',
'test/boost/observable_test',
'test/boost/partitioner_test',
'test/boost/per_partition_rate_limit_test',
'test/boost/pretty_printers_test',
'test/boost/querier_cache_test',
'test/boost/query_processor_test',
'test/boost/wrapping_interval_test',
'test/boost/radix_tree_test',
'test/boost/range_tombstone_list_test',
'test/boost/reusable_buffer_test',
'test/boost/restrictions_test',
'test/boost/rate_limiter_test',
'test/boost/reader_concurrency_semaphore_test',
'test/boost/recent_entries_map_test',
'test/boost/repair_test',
'test/boost/restrictions_test',
'test/boost/result_utils_test',
'test/boost/reusable_buffer_test',
'test/boost/role_manager_test',
'test/boost/row_cache_test',
'test/boost/rust_test',
'test/boost/s3_test',
'test/boost/schema_change_test',
'test/boost/schema_changes_test',
'test/boost/schema_loader_test',
'test/boost/schema_registry_test',
'test/boost/secondary_index_test',
'test/boost/tracing_test',
'test/boost/index_with_paging_test',
'test/boost/serialization_test',
'test/boost/serialized_action_test',
'test/boost/service_level_controller_test',
'test/boost/sessions_test',
'test/boost/small_vector_test',
'test/boost/snitch_reset_test',
'test/boost/sorting_test',
'test/boost/sstable_3_x_test',
'test/boost/sstable_compaction_test',
'test/boost/sstable_conforms_to_mutation_source_test',
'test/boost/sstable_datafile_test',
'test/boost/sstable_directory_test',
'test/boost/sstable_generation_test',
'test/boost/sstable_move_test',
'test/boost/sstable_mutation_test',
'test/boost/sstable_partition_index_cache_test',
'test/boost/schema_changes_test',
'test/boost/sstable_conforms_to_mutation_source_test',
'test/boost/sstable_compaction_test',
'test/boost/sstable_resharding_test',
'test/boost/sstable_directory_test',
'test/boost/sstable_set_test',
'test/boost/sstable_test',
'test/boost/sstable_move_test',
'test/boost/stall_free_test',
'test/boost/statement_restrictions_test',
'test/boost/storage_proxy_test',
'test/boost/string_format_test',
'test/boost/summary_test',
'test/boost/tablets_test',
'test/boost/tagged_integer_test',
'test/boost/token_metadata_test',
'test/boost/top_k_test',
'test/boost/tracing_test',
'test/boost/transport_test',
'test/boost/types_test',
'test/boost/user_function_test',
@@ -598,38 +619,16 @@ scylla_tests = set([
'test/boost/utf8_test',
'test/boost/view_build_test',
'test/boost/view_complex_test',
'test/boost/view_schema_test',
'test/boost/view_schema_pkey_test',
'test/boost/view_schema_ckey_test',
'test/boost/view_schema_pkey_test',
'test/boost/view_schema_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/virtual_table_mutation_source_test',
'test/boost/virtual_table_test',
'test/boost/wasm_test',
'test/boost/wasm_alloc_test',
'test/boost/bptree_test',
'test/boost/btree_test',
'test/boost/radix_tree_test',
'test/boost/double_decker_test',
'test/boost/stall_free_test',
'test/boost/sstable_set_test',
'test/boost/reader_concurrency_semaphore_test',
'test/boost/service_level_controller_test',
'test/boost/schema_loader_test',
'test/boost/lister_test',
'test/boost/group0_test',
'test/boost/exception_container_test',
'test/boost/result_utils_test',
'test/boost/rate_limiter_test',
'test/boost/per_partition_rate_limit_test',
'test/boost/expr_test',
'test/boost/exceptions_optimized_test',
'test/boost/exceptions_fallback_test',
'test/boost/s3_test',
'test/boost/locator_topology_test',
'test/boost/string_format_test',
'test/boost/tagged_integer_test',
'test/boost/group0_cmd_merge_test',
'test/boost/wasm_test',
'test/boost/wrapping_interval_test',
'test/manual/ec2_snitch_test',
'test/manual/enormous_table_scan_test',
'test/manual/gce_snitch_test',
@@ -780,10 +779,16 @@ arg_parser.add_argument('--date-stamp', dest='date_stamp', type=str,
help='Set datestamp for SCYLLA-VERSION-GEN')
arg_parser.add_argument('--use-cmake', action='store_true', help='Use CMake as the build system')
arg_parser.add_argument('--coverage', action = 'store_true', help = 'Compile scylla with coverage instrumentation')
arg_parser.add_argument('--build-dir', action='store', default='build',
help='Build directory path')
args = arg_parser.parse_args()
PROFILES_LIST_FILE_NAME = "coverage_sources.list"
outdir = args.build_dir
tempfile.tempdir = f"{outdir}/tmp"
if args.list_artifacts:
for artifact in sorted(all_artifacts):
print(artifact)
@@ -791,7 +796,6 @@ if args.list_artifacts:
defines = ['XXH_PRIVATE_API',
'SEASTAR_TESTING_MAIN',
'FMT_DEPRECATED_OSTREAM',
]
scylla_raft_core = [
@@ -824,6 +828,7 @@ scylla_core = (['message/messaging_service.cc',
'mutation/partition_version.cc',
'mutation/range_tombstone.cc',
'mutation/range_tombstone_list.cc',
'mutation/async_utils.cc',
'absl-flat_hash_map.cc',
'collection_mutation.cc',
'client_data.cc',
@@ -1011,7 +1016,6 @@ scylla_core = (['message/messaging_service.cc',
'cql3/result_set.cc',
'cql3/prepare_context.cc',
'db/consistency_level.cc',
'db/system_auth_keyspace.cc',
'db/system_keyspace.cc',
'db/virtual_table.cc',
'db/virtual_tables.cc',
@@ -1348,11 +1352,13 @@ scylla_tests_dependencies = scylla_core + alternator + idls + scylla_tests_gener
scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc', 'utils/error_injection.cc', 'utils/exceptions.cc']
scylla_tools = ['tools/scylla-types.cc', 'tools/scylla-sstable.cc', 'tools/scylla-nodetool.cc', 'tools/schema_loader.cc', 'tools/utils.cc', 'tools/lua_sstable_consumer.cc']
scylla_perfs = ['test/perf/perf_fast_forward.cc',
scylla_perfs = ['test/perf/perf_alternator.cc',
'test/perf/perf_fast_forward.cc',
'test/perf/perf_row_cache_update.cc',
'test/perf/perf_simple_query.cc',
'test/perf/perf_sstable.cc',
'test/perf/perf_tablets.cc',
'test/perf/tablet_load_balancing.cc',
'test/perf/perf.cc',
'test/lib/alternator_test_env.cc',
'test/lib/cql_test_env.cc',
@@ -1613,7 +1619,7 @@ def generate_version(date_stamp):
date_stamp_opt = ''
if date_stamp:
date_stamp_opt = f'--date-stamp {date_stamp}'
status = subprocess.call(f"./SCYLLA-VERSION-GEN {date_stamp_opt}", shell=True)
status = subprocess.call(f"./SCYLLA-VERSION-GEN --output-dir {outdir} {date_stamp_opt}", shell=True)
if status != 0:
print('Version file generation failed')
sys.exit(1)
@@ -1747,6 +1753,63 @@ def configure_seastar(build_dir, mode, mode_config):
subprocess.check_call(seastar_cmd, shell=False, cwd=cmake_dir)
def configure_abseil(build_dir, mode, mode_config):
abseil_cflags = mode_config['lib_cflags']
cxx_flags = mode_config['cxxflags']
if '-DSANITIZE' in cxx_flags:
abseil_cflags += ' -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr'
# We want to "undo" coverage for abseil if we have it enabled, as we are not
# interested in the coverage of the abseil library. these flags were previously
# added to cxx_ld_flags
if args.coverage:
for flag in COVERAGE_INST_FLAGS:
cxx_flags = cxx_flags.replace(f' {flag}', '')
cxx_flags += ' ' + abseil_cflags.strip()
cmake_mode = mode_config['cmake_build_type']
abseil_cmake_args = [
'-DCMAKE_BUILD_TYPE={}'.format(cmake_mode),
'-DCMAKE_INSTALL_PREFIX={}'.format(build_dir + '/inst'), # just to avoid a warning from absl
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_CXX_FLAGS_{}={}'.format(cmake_mode.upper(), cxx_flags),
'-DCMAKE_EXPORT_COMPILE_COMMANDS=ON',
'-DCMAKE_CXX_STANDARD=20',
'-DABSL_PROPAGATE_CXX_STD=ON',
]
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
abseil_cmd = ['cmake', '-G', 'Ninja', real_relpath('abseil', abseil_build_dir)] + abseil_cmake_args
os.makedirs(abseil_build_dir, exist_ok=True)
subprocess.check_call(abseil_cmd, shell=False, cwd=abseil_build_dir)
abseil_libs = ['absl/' + lib for lib in [
'container/libabsl_hashtablez_sampler.a',
'container/libabsl_raw_hash_set.a',
'synchronization/libabsl_synchronization.a',
'synchronization/libabsl_graphcycles_internal.a',
'debugging/libabsl_stacktrace.a',
'debugging/libabsl_symbolize.a',
'debugging/libabsl_debugging_internal.a',
'debugging/libabsl_demangle_internal.a',
'time/libabsl_time.a',
'time/libabsl_time_zone.a',
'numeric/libabsl_int128.a',
'hash/libabsl_hash.a',
'hash/libabsl_city.a',
'hash/libabsl_low_level_hash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',
'base/libabsl_raw_logging_internal.a',
'profiling/libabsl_exponential_biased.a',
'strings/libabsl_strings.a',
'strings/libabsl_strings_internal.a',
'base/libabsl_throw_delegate.a']]
def query_seastar_flags(pc_file, use_shared_libs, link_static_cxx=False):
if use_shared_libs:
opt = '--shared'
@@ -1766,9 +1829,7 @@ def query_seastar_flags(pc_file, use_shared_libs, link_static_cxx=False):
'seastar_testing_libs': testing_libs}
pkgs = ['libsystemd',
'jsoncpp',
'absl_raw_hash_set',
'absl_hash']
'jsoncpp']
# Lua can be provided by lua53 package on Debian-like
# systems and by Lua on others.
pkgs.append('lua53' if have_pkg('lua53') else 'lua')
@@ -1831,9 +1892,11 @@ def get_extra_cxxflags(mode, mode_config, cxx, debuginfo):
return cxxflags
def get_release_cxxflags(scylla_version,
def get_release_cxxflags(scylla_product,
scylla_version,
scylla_release):
definitions = {'SCYLLA_VERSION': scylla_version,
definitions = {'SCYLLA_PRODUCT': scylla_product,
'SCYLLA_VERSION': scylla_version,
'SCYLLA_RELEASE': scylla_release}
return [f'-D{name}="\\"{value}\\""' for name, value in definitions.items()]
@@ -1885,17 +1948,17 @@ def write_build_file(f,
rule strip
command = scripts/strip.sh $in
rule package
command = scripts/create-relocatable-package.py --build-dir build/$mode $out
command = scripts/create-relocatable-package.py --build-dir $builddir/$mode --node-exporter-dir $builddir/node_exporter --debian-dir $builddir/debian/debian $out
rule stripped_package
command = scripts/create-relocatable-package.py --stripped --build-dir build/$mode $out
command = scripts/create-relocatable-package.py --stripped --build-dir $builddir/$mode --node-exporter-dir $builddir/node_exporter --debian-dir $builddir/debian/debian $out
rule debuginfo_package
command = dist/debuginfo/scripts/create-relocatable-package.py --build-dir build/$mode $out
command = dist/debuginfo/scripts/create-relocatable-package.py --build-dir $builddir/$mode --node-exporter-dir $builddir/node_exporter $out
rule rpmbuild
command = reloc/build_rpm.sh --reloc-pkg $in --builddir $out
rule debbuild
command = reloc/build_deb.sh --reloc-pkg $in --builddir $out
rule unified
command = unified/build_unified.sh --build-dir build/$mode --unified-pkg $out
command = unified/build_unified.sh --build-dir $builddir/$mode --unified-pkg $out
rule rust_header
command = cxxbridge --include rust/cxx.h --header $in > $out
description = RUST_HEADER $out
@@ -2025,6 +2088,7 @@ def write_build_file(f,
seastar_lib_ext = 'so' if modeval['build_seastar_shared_libs'] else 'a'
seastar_dep = f'$builddir/{mode}/seastar/libseastar.{seastar_lib_ext}'
seastar_testing_dep = f'$builddir/{mode}/seastar/libseastar_testing.{seastar_lib_ext}'
abseil_dep = ' '.join(f'$builddir/{mode}/abseil/{lib}' for lib in abseil_libs)
for binary in sorted(build_artifacts):
if binary in other or binary in wasms:
continue
@@ -2054,6 +2118,9 @@ def write_build_file(f,
if has_thrift:
local_libs += ' ' + maybe_static(args.staticthrift, '-lthrift')
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_system')
objs.extend(['$builddir/' + mode + '/' + artifact for artifact in [
'abseil/' + x for x in abseil_libs
]])
if binary in tests:
if binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
@@ -2067,14 +2134,14 @@ def write_build_file(f,
# quickly re-link the test unstripped by adding a "_g"
# to the test name, e.g., "ninja build/release/testname_g"
link_rule = perf_tests_link_rule if binary.startswith('test/perf/') else tests_link_rule
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write('build $builddir/{}/{}: {}.{} {} | {} {} {}\n'.format(mode, binary, link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep, abseil_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: {}.{} {} | {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write('build $builddir/{}/{}_g: {}.{} {} | {} {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep, abseil_dep))
f.write(' libs = {}\n'.format(local_libs))
else:
if binary == 'scylla':
local_libs += ' ' + "$seastar_testing_libs_{}".format(mode)
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write('build $builddir/{}/{}: {}.{} {} | {} {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep, abseil_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write(f'build $builddir/{mode}/{binary}.stripped: strip $builddir/{mode}/{binary}\n')
f.write(f'build $builddir/{mode}/{binary}.debug: phony $builddir/{mode}/{binary}.stripped\n')
@@ -2127,6 +2194,17 @@ def write_build_file(f,
mode=mode,
)
)
compiler_training_artifacts=[]
if mode == 'dev':
compiler_training_artifacts.append(f'$builddir/{mode}/scylla')
elif mode == 'release' or mode == 'debug':
compiler_training_artifacts.append(f'$builddir/{mode}/service/storage_proxy.o')
f.write(
'build {mode}-compiler-training: phony {artifacts}\n'.format(
mode=mode,
artifacts=str.join(' ', compiler_training_artifacts)
)
)
gen_dir = '$builddir/{}/gen'.format(mode)
gen_headers = []
@@ -2251,6 +2329,12 @@ def write_build_file(f,
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
for lib in abseil_libs:
f.write('build $builddir/{mode}/abseil/{lib}: ninja $builddir/{mode}/abseil/build.ninja\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/abseil\n'.format(**locals()))
f.write(' target = {lib}\n'.format(**locals()))
checkheaders_mode = 'dev' if 'dev' in modes else modes.keys()[0]
f.write('build checkheaders: phony || {}\n'.format(' '.join(['$builddir/{}/{}.o'.format(checkheaders_mode, hh) for hh in headers])))
@@ -2266,6 +2350,9 @@ def write_build_file(f,
f.write(
'build wasm: phony {}\n'.format(' '.join([f'$builddir/{binary}' for binary in sorted(wasms)]))
)
f.write(
'build compiler-training: phony {}\n'.format(' '.join(['{mode}-compiler-training'.format(mode=mode) for mode in default_modes]))
)
f.write(textwrap.dedent(f'''\
build dist-unified-tar: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz' for mode in default_modes])}
@@ -2278,54 +2365,54 @@ def write_build_file(f,
build dist-server: phony dist-server-tar dist-server-debuginfo dist-server-rpm dist-server-deb
rule build-submodule-reloc
command = cd $reloc_dir && ./reloc/build_reloc.sh --version $$(<../../build/SCYLLA-PRODUCT-FILE)-$$(sed 's/-/~/' <../../build/SCYLLA-VERSION-FILE)-$$(<../../build/SCYLLA-RELEASE-FILE) --nodeps $args
command = cd $reloc_dir && ./reloc/build_reloc.sh --version $$(<../../$builddir/SCYLLA-PRODUCT-FILE)-$$(sed 's/-/~/' <../../$builddir/SCYLLA-VERSION-FILE)-$$(<../../$builddir/SCYLLA-RELEASE-FILE) --nodeps $args
rule build-submodule-rpm
command = cd $dir && ./reloc/build_rpm.sh --reloc-pkg $artifact
rule build-submodule-deb
command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact
build tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
build tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/jmx
build dist-jmx-rpm: build-submodule-rpm tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/jmx
artifact = $builddir/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-jmx-deb: build-submodule-deb tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/jmx
artifact = $builddir/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-jmx-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-jmx: phony dist-jmx-tar dist-jmx-rpm dist-jmx-deb
build tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
build tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/java
build dist-tools-rpm: build-submodule-rpm tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/java
artifact = $builddir/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-deb: build-submodule-deb tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/java
artifact = $builddir/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
build tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
build tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/cqlsh
build dist-cqlsh-rpm: build-submodule-rpm tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/cqlsh
artifact = $builddir/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-cqlsh-deb: build-submodule-deb tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/cqlsh
artifact = $builddir/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
artifact = build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-cqlsh-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-cqlsh: phony dist-cqlsh-tar dist-cqlsh-rpm dist-cqlsh-deb
build tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
build tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/python3
args = --packages "{python3_dependencies}" --pip-packages "{pip_dependencies}" --pip-symlinks "{pip_symlinks}"
build dist-python3-rpm: build-submodule-rpm tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
artifact = build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build dist-python3-deb: build-submodule-deb tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
artifact = build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb
build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb dist-cqlsh-deb
@@ -2360,10 +2447,10 @@ def write_build_file(f,
f.write(textwrap.dedent('''\
rule configure
command = {python} configure.py --out=build.ninja.new $configure_args && mv build.ninja.new build.ninja
command = {python} configure.py --out={buildfile}.new $configure_args && mv {buildfile}.new {buildfile}
generator = 1
description = CONFIGURE $configure_args
build build.ninja {build_ninja_list}: configure | configure.py SCYLLA-VERSION-GEN $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE {args.seastar_path}/CMakeLists.txt
build {buildfile} {build_ninja_list}: configure | configure.py SCYLLA-VERSION-GEN $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE {args.seastar_path}/CMakeLists.txt
rule cscope
command = find -name '*.[chS]' -o -name "*.cc" -o -name "*.hh" | cscope -bq -i-
description = CSCOPE
@@ -2377,7 +2464,7 @@ def write_build_file(f,
description = List configured modes
build mode_list: mode_list
default {modes_list}
''').format(modes_list=' '.join(default_modes), build_ninja_list=' '.join([f'build/{mode}/{dir}/build.ninja' for mode in build_modes for dir in ['seastar']]), **globals()))
''').format(modes_list=' '.join(default_modes), build_ninja_list=' '.join([f'{outdir}/{mode}/{dir}/build.ninja' for mode in build_modes for dir in ['seastar', 'abseil']]), **globals()))
unit_test_list = set(test for test in build_artifacts if test in set(tests))
f.write(textwrap.dedent('''\
rule unit_test_list
@@ -2388,14 +2475,14 @@ def write_build_file(f,
f.write(textwrap.dedent('''\
build always: phony
rule scylla_version_gen
command = ./SCYLLA-VERSION-GEN
command = ./SCYLLA-VERSION-GEN --output-dir $builddir
restat = 1
build $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE: scylla_version_gen | always
rule debian_files_gen
command = ./dist/debian/debian_files_gen.py
command = ./dist/debian/debian_files_gen.py --build-dir $builddir
build $builddir/debian/debian: debian_files_gen | always
rule extract_node_exporter
command = tar -C build -xvpf {node_exporter_filename} --no-same-owner && rm -rfv build/node_exporter && mv -v build/{node_exporter_dirname} build/node_exporter
command = tar -C $builddir -xvpf {node_exporter_filename} --no-same-owner && rm -rfv $builddir/node_exporter && mv -v $builddir/{node_exporter_dirname} $builddir/node_exporter
build $builddir/node_exporter/node_exporter: extract_node_exporter | always
build $builddir/node_exporter/node_exporter.stripped: strip $builddir/node_exporter/node_exporter
build $builddir/node_exporter/node_exporter.debug: phony $builddir/node_exporter/node_exporter.stripped
@@ -2418,14 +2505,17 @@ def create_build_system(args):
extra_cxxflags = ' '.join(get_extra_cxxflags(mode, mode_config, args.cxx, args.debuginfo))
mode_config['cxxflags'] += f' {extra_cxxflags}'
mode_config['per_src_extra_cxxflags']['release.cc'] = ' '.join(get_release_cxxflags(scylla_version, scylla_release))
mode_config['per_src_extra_cxxflags']['release.cc'] = ' '.join(get_release_cxxflags(scylla_product, scylla_version, scylla_release))
if not args.dist_only:
global user_cflags, libs
# args.buildfile builds seastar with the rules of
# {outdir}/{mode}/seastar/build.ninja, and
# {outdir}/{mode}/seastar/seastar.pc is queried for building flags
for mode, mode_config in build_modes.items():
configure_seastar(outdir, mode, mode_config)
configure_abseil(outdir, mode, mode_config)
user_cflags += ' -isystem abseil'
for mode, mode_config in build_modes.items():
mode_config.update(query_seastar_flags(f'{outdir}/{mode}/seastar/seastar.pc',
@@ -2479,7 +2569,6 @@ def configure_using_cmake(args):
if args.staticboost:
settings['Boost_USE_STATIC_LIBS'] = 'ON'
source_dir = os.path.realpath(os.path.dirname(__file__))
build_dir = os.path.join(source_dir, 'build')

View File

@@ -89,7 +89,7 @@ public:
using counter_shard_view = basic_counter_shard_view<mutable_view::no>;
template <>
struct fmt::formatter<counter_shard_view> : fmt::formatter<std::string_view> {
struct fmt::formatter<counter_shard_view> : fmt::formatter<string_view> {
auto format(const counter_shard_view&, fmt::format_context& ctx) const -> decltype(ctx.out());
};
@@ -352,7 +352,7 @@ struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
};
template <>
struct fmt::formatter<counter_cell_view> : fmt::formatter<std::string_view> {
struct fmt::formatter<counter_cell_view> : fmt::formatter<string_view> {
auto format(const counter_cell_view&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -131,6 +131,7 @@ target_link_libraries(cql3
idl
wasmtime_bindings
Seastar::seastar
absl::headers
xxHash::xxhash
ANTLR3::antlr3
PRIVATE

View File

@@ -68,6 +68,7 @@ options {
#include "cql3/statements/ks_prop_defs.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/selection/selectable-expr.hh"
#include "cql3/dialect.hh"
#include "cql3/keyspace_element_name.hh"
#include "cql3/constants.hh"
#include "cql3/operation_impl.hh"
@@ -148,6 +149,8 @@ using uexpression = uninitialized<expression>;
listener_type* listener;
dialect _dialect;
// Keeps the names of all bind variables. For bind variables without a name ('?'), the name is nullptr.
// Maps bind_index -> name.
std::vector<::shared_ptr<cql3::column_identifier>> _bind_variable_names;
@@ -171,9 +174,14 @@ using uexpression = uninitialized<expression>;
return s;
}
void set_dialect(dialect d) {
_dialect = d;
}
bind_variable new_bind_variables(shared_ptr<cql3::column_identifier> name)
{
if (name && _named_bind_variables_indexes.contains(*name)) {
if (_dialect.duplicate_bind_variable_names_refer_to_same_variable
&& name && _named_bind_variables_indexes.contains(*name)) {
return bind_variable{_named_bind_variables_indexes[*name]};
}
auto marker = bind_variable{_bind_variable_names.size()};
@@ -554,6 +562,10 @@ usingTimeoutClause[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_USING K_TIMEOUT to=term { attrs->timeout = std::move(to); }
;
usingTimestampClause[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_USING K_TIMESTAMP ts=intValue { attrs->timestamp = std::move(ts); }
;
/**
* UPDATE <CF>
* USING TIMESTAMP <long>
@@ -984,16 +996,17 @@ alterKeyspaceStatement returns [std::unique_ptr<cql3::statements::alter_keyspace
/**
* ALTER COLUMN FAMILY <CF> ALTER <column> TYPE <newtype>;
* ALTER COLUMN FAMILY <CF> ADD <column> <newtype>; | ALTER COLUMN FAMILY <CF> ADD (<column> <newtype>,<column1> <newtype1>..... <column n> <newtype n>)
* ALTER COLUMN FAMILY <CF> DROP <column>; | ALTER COLUMN FAMILY <CF> DROP ( <column>,<column1>.....<column n>)
* ALTER COLUMN FAMILY <CF> DROP <column> [USING TIMESTAMP <ts>]; | ALTER COLUMN FAMILY <CF> DROP ( <column>,<column1>.....<column n>) [USING TIMESTAMP <ts>]
* ALTER COLUMN FAMILY <CF> WITH <property> = <value>;
* ALTER COLUMN FAMILY <CF> RENAME <column> TO <column>;
*/
alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
alterTableStatement returns [std::unique_ptr<alter_table_statement::raw_statement> expr]
@init {
alter_table_statement::type type;
auto props = cql3::statements::cf_prop_defs();
std::vector<alter_table_statement::column_change> column_changes;
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
auto attrs = std::make_unique<cql3::attributes::raw>();
}
: K_ALTER K_COLUMNFAMILY cf=columnFamilyName
( K_ALTER id=cident K_TYPE v=comparatorType { type = alter_table_statement::type::alter; column_changes.emplace_back(alter_table_statement::column_change{id, v}); }
@@ -1007,13 +1020,14 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
| '(' id1=cident { column_changes.emplace_back(alter_table_statement::column_change{id1}); }
(',' idn=cident { column_changes.emplace_back(alter_table_statement::column_change{idn}); } )* ')'
)
( usingTimestampClause[attrs] )?
| K_WITH properties[props] { type = alter_table_statement::type::opts; }
| K_RENAME { type = alter_table_statement::type::rename; }
id1=cident K_TO toId1=cident { renames.emplace_back(id1, toId1); }
( K_AND idn=cident K_TO toIdn=cident { renames.emplace_back(idn, toIdn); } )*
)
{
$expr = std::make_unique<alter_table_statement>(std::move(cf), type, std::move(column_changes), std::move(props), std::move(renames));
$expr = std::make_unique<alter_table_statement::raw_statement>(std::move(cf), type, std::move(column_changes), std::move(props), std::move(renames), std::move(attrs));
}
;

View File

@@ -12,6 +12,7 @@
#include "cql3/column_identifier.hh"
#include "cql3/expr/evaluate.hh"
#include "cql3/expr/expr-utils.hh"
#include "exceptions/exceptions.hh"
#include <optional>
namespace cql3 {

View File

@@ -119,13 +119,13 @@ struct hash<cql3::column_identifier_raw> {
}
template <> struct fmt::formatter<cql3::column_identifier> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<cql3::column_identifier> : fmt::formatter<string_view> {
auto format(const cql3::column_identifier& i, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", i.text());
}
};
template <> struct fmt::formatter<cql3::column_identifier_raw> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<cql3::column_identifier_raw> : fmt::formatter<string_view> {
auto format(const cql3::column_identifier_raw& id, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", id.text());
}

View File

@@ -454,7 +454,8 @@ sstring maybe_quote(const sstring& identifier) {
// many keywords but allow keywords listed as "unreserved keywords".
// So we can use any of them, for example cident.
try {
cql3::util::do_with_parser(identifier, std::mem_fn(&cql3_parser::CqlParser::cident));
// In general it's not a good idea to use the default dialect, but for parsing an identifier, it's okay.
cql3::util::do_with_parser(identifier, dialect{}, std::mem_fn(&cql3_parser::CqlParser::cident));
return identifier;
} catch(exceptions::syntax_exception&) {
// This alphanumeric string is not a valid identifier, so fall

View File

@@ -365,9 +365,9 @@ inline bool operator==(const cql3_type& a, const cql3_type& b) {
}
template <>
struct fmt::formatter<cql3::cql3_type>: fmt::formatter<std::string_view> {
struct fmt::formatter<cql3::cql3_type>: fmt::formatter<string_view> {
auto format(const cql3::cql3_type& t, fmt::format_context& ctx) const {
return formatter<std::string_view>::format(format_as(t), ctx);
return formatter<string_view>::format(format_as(t), ctx);
}
};

34
cql3/dialect.hh Normal file
View File

@@ -0,0 +1,34 @@
// Copyright (C) 2024-present ScyllaDB
// SPDX-License-Identifier: AGPL-3.0-or-later
#pragma once
#include <fmt/core.h>
namespace cql3 {
struct dialect {
bool duplicate_bind_variable_names_refer_to_same_variable = true; // if :a is found twice in a query, the two references are to the same variable (see #15559)
bool operator==(const dialect&) const = default;
};
inline
dialect
internal_dialect() {
return dialect{
.duplicate_bind_variable_names_refer_to_same_variable = true,
};
}
}
template <>
struct fmt::formatter<cql3::dialect> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cql3::dialect& d, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "cql3::dialect{{duplicate_bind_variable_names_refer_to_same_variable={}}}",
d.duplicate_bind_variable_names_refer_to_same_variable);
}
};

View File

@@ -574,7 +574,7 @@ struct fmt::formatter<E> : public fmt::formatter<cql3::expr::expression> {
};
template <>
struct fmt::formatter<cql3::expr::column_mutation_attribute::attribute_kind> : fmt::formatter<std::string_view> {
struct fmt::formatter<cql3::expr::column_mutation_attribute::attribute_kind> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(cql3::expr::column_mutation_attribute::attribute_kind k, FormatContext& ctx) const {
switch (k) {

View File

@@ -578,7 +578,7 @@ tuple_constructor_prepare_nontuple(const tuple_constructor& tc, data_dictionary:
}
template <> struct fmt::formatter<cql3::expr::untyped_constant::type_class> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<cql3::expr::untyped_constant::type_class> : fmt::formatter<string_view> {
auto format(cql3::expr::untyped_constant::type_class t, fmt::format_context& ctx) const {
using enum cql3::expr::untyped_constant::type_class;
std::string_view name;

View File

@@ -19,7 +19,7 @@
#include <fmt/core.h>
template <> struct fmt::formatter<std::vector<data_type>> : fmt::formatter<std::string_view> {
template <> struct fmt::formatter<std::vector<data_type>> : fmt::formatter<string_view> {
auto format(const std::vector<data_type>& arg_types, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -69,6 +69,16 @@ using bytes_opt = std::optional<bytes>;
template<typename ToType, typename FromType>
static data_value castas_fctn_simple(data_value from) {
auto val_from = value_cast<FromType>(from);
// Workaround for https://github.com/boostorg/multiprecision/issues/553 (the additional bug discovered post-closing)
if constexpr (std::is_floating_point_v<ToType> && std::is_same_v<FromType, utils::multiprecision_int>) {
static auto min = utils::multiprecision_int(std::numeric_limits<ToType>::lowest());
static auto max = utils::multiprecision_int(std::numeric_limits<ToType>::max());
if (val_from < min) {
return -std::numeric_limits<ToType>::infinity();
} else if (val_from > max) {
return std::numeric_limits<ToType>::infinity();
}
}
return static_cast<ToType>(val_from);
}

View File

@@ -332,6 +332,9 @@ functions::get(data_dictionary::database db,
if (!receiver_cf.has_value()) {
throw exceptions::invalid_request_exception("functions::get for token doesn't have a known column family");
}
if (schema == nullptr) {
throw exceptions::invalid_request_exception(format("functions::get for token cannot find {} table", *receiver_cf));
}
auto fun = ::make_shared<token_fct>(schema);
validate_types(db, keyspace, schema.get(), fun, provided_args, receiver_ks, receiver_cf);
return fun;

Some files were not shown because too many files have changed in this diff Show More