Compare commits

5386 Commits

Author SHA1 Message Date
Botond Dénes
f6c2624c86 Merge '[branch-5.0] - minimal fix for crash caused by empty primary key range in LWT update' from Jan Ciołek
In #13001 we found a test case which causes a crash because it didn't handle `UNSET_VALUE` properly:

```python3
def test_unset_insert_where(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?)')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])

def test_unset_insert_where_lwt(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?) IF NOT EXISTS')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])
```

This PR does an absolutely minimal change to fix the crash.
It adds a check the moment before the crash would happen.

To make sure that everything works correctly, and to detect any possible breaking changes, I wrote a bunch of tests that validate the current behavior.
I also ported some tests from the `master` branch, at least the ones that were in line with the behavior on `branch-5.0`.

The changes are the same as in #13133, just cherry-picked to `branch-5.0`

Closes #13178

* github.com:scylladb/scylladb:
  cql-pytest/test_unset: port some tests from master branch
  cql-pytest/test_unset: test unset value in UPDATEs with LWT conditions
  cql-pytest/test_unset: test unset value in UPDATEs with IF EXISTS
  cql-pytest/test_unset: test unset value in UPDATE statements
  cql-pytest/test_unset: test unset value in INSERTs with IF NOT EXISTS
  cql-pytest/test_unset: test unset value in INSERT statements
  cas_request: fix crash on unset value in primary key with LWT
2023-05-08 12:03:44 +03:00
Botond Dénes
f7d9afd209 Update seastar submodule
* seastar 07548b37...62fd873d (2):
  > core/on_internal_error: always log error with backtrace
  > on_internal_error: refactor log_error_and_backtrace

Fixes: #13786
2023-05-08 10:41:24 +03:00
Marcin Maliszkiewicz
b011cc2e78 db: view: use deferred_close for closing staging_sstable_reader
When consume_in_thread throws the reader should still be closed.

Related https://github.com/scylladb/scylla-enterprise/issues/2661

Closes #13398
Refs: scylladb/scylla-enterprise#2661
Fixes: #13413

(cherry picked from commit 99f8d7dcbe)
2023-05-08 09:58:46 +03:00
Botond Dénes
fb466dd7b7 readers: evictable_reader: skip progress guarantee when next pos is partition start
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
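
The special case can be illustrated with a toy model. This is a sketch only, not the evictable_reader code: `fill_buffer`, the `(kind, pos)` fragments and the way positions compare across partitions are all assumptions chosen to reproduce the symptom described above.

```python
def fill_buffer(source, last_pos, next_is_partition_start):
    """Toy buffer fill over `source`, an iterator of (kind, pos) fragments.

    Reads until the last fragment read is past `last_pos` (the progress
    guarantee), unless the next expected position is a partition start,
    in which case the guarantee loop is skipped.
    """
    buffer = []
    for kind, pos in source:
        buffer.append((kind, pos))
        if next_is_partition_start:
            # The partition-start fragment is progress by itself.
            break
        if pos > last_pos:
            break
    return buffer


# Hypothetical data: the previous fill ended at position 100 of another
# partition; positions in the new partition start over from small values.
new_partition = [
    ("partition_start", -1),
    ("row", 0),
    ("row", 1),
    ("partition_end", 2),
]

without_special_case = fill_buffer(iter(new_partition), 100, False)
with_special_case = fill_buffer(iter(new_partition), 100, True)
```

In this model the fill without the special case drains the whole source, while the fill with it stops after the partition-start fragment.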

Fixes: #13491

Closes #13563

(cherry picked from commit 72003dc35c)
2023-05-02 21:22:23 +03:00
Jan Ciolek
697e090659 cql-pytest/test_unset: port some tests from master branch
I copied cql-pytest tests from the master branch,
at least the ones that were compatible with branch-5.1

Some of them were expecting an InvalidRequest exception
in case of UNSET VALUES being present in places that
branch-5.1 allows, so I skipped these tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit c75359d664)
2023-04-28 03:25:27 +02:00
Jan Ciolek
2c518f3131 cql-pytest/test_unset: test unset value in UPDATEs with LWT conditions
Test what happens when an UNSET_VALUE is passed to
an UPDATE statement with an LWT condition.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 24f76f40b7)
2023-04-28 03:25:27 +02:00
Jan Ciolek
e941a5ac34 cql-pytest/test_unset: test unset value in UPDATEs with IF EXISTS
Test what happens when an UNSET_VALUE is passed to
an UPDATE statement with IF EXISTS condition.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 3f133cfa87)
2023-04-28 03:25:27 +02:00
Jan Ciolek
3a7ce5e8aa cql-pytest/test_unset: test unset value in UPDATE statements
Test what happens when an UNSET_VALUE is passed to
an UPDATE statement.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit d66e23b265)
2023-04-28 03:25:27 +02:00
Jan Ciolek
efa4f312f5 cql-pytest/test_unset: test unset value in INSERTs with IF NOT EXISTS
Add tests which test INSERT statements with IF NOT EXISTS,
when an UNSET_VALUE is passed for some column.
The tests are similar to the previous ones done for simple
INSERTs without IF NOT EXISTS.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 378e8761b9)
2023-04-28 03:25:27 +02:00
Jan Ciolek
fb4b71ea02 cql-pytest/test_unset: test unset value in INSERT statements
Add some tests which test what happens when an UNSET_VALUE
is passed to an INSERT statement.

Passing it for partition key column is impossible
because python driver doesn't allow it.

Passing it for clustering key column causes Scylla
to silently ignore the INSERT.

Passing it for a regular or static column
causes this column to remain unchanged,
as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit fc26f6b850)
2023-04-28 03:25:26 +02:00
Jan Ciolek
7387922a29 cas_request: fix crash on unset value in primary key with LWT
Doing an LWT INSERT/UPDATE and passing UNSET_VALUE
for the primary key column used to cause a crash.

This is a minimal fix for this crash.

Crash backtrace pointed to a place where
we tried doing .front() on an empty vector
of primary key ranges.

I added a check that the vector isn't empty.
If it's empty then let's throw an error
and mention that it's most likely
caused by an unset value.
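
The shape of this guard can be sketched in Python. This is an illustration only, not the C++ change in cas_request; `first_key_range` and this `InvalidRequest` class are hypothetical names made up for the example.

```python
class InvalidRequest(Exception):
    """Stand-in for the error returned to the client (hypothetical)."""


def first_key_range(ranges):
    """Return the first primary key range, refusing an empty list.

    The emptiness check sits right before the access that would
    otherwise fail, mirroring the "check at the last moment" approach.
    """
    if not ranges:
        raise InvalidRequest(
            "empty primary key range, most likely caused by an unset value"
        )
    return ranges[0]
```

Because the new branch is only reached when the list is empty, inputs that worked before take exactly the same path as before.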

This has been fixed on master,
but the PR that fixed it introduced
breaking changes, which I don't want
to add to branch-5.1.

This fix is absolutely minimal
- it performs the check at the
last moment before a crash.

It's not the prettiest, but it works
and can't introduce breaking changes,
because the new code gets activated
only in cases that would've caused
a crash.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7663dc31b8)
2023-04-28 03:25:24 +02:00
Raphael S. Carvalho
cb78c3bf2c replica: Fix undefined behavior in table::generate_and_propagate_view_updates()
Undefined behavior because the evaluation order is undefined.

With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().

The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for a row, which accesses schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.

Fixes #13093.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13092

(cherry picked from commit 3fae46203d)
2023-04-27 19:59:05 +03:00
Kefu Chai
aeac63a3ee dist/redhat: enforce dependency on %{release} also
* tools/python3 f725ec7...c888f39 (1):
  > dist: redhat: provide only a single version

s/%{version}/%{version}-%{release}/ in `Requires:` sections.

this enforces the runtime dependencies of exactly the same
releases between scylla packages.

Fixes #13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
2023-04-27 19:31:01 +03:00
Nadav Har'El
e7b50fb8d3 test/alternator: increase CQL connection timeout
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.

The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.

Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.

Fixes #13239

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13291

(cherry picked from commit 4fdcee8415)
2023-04-27 19:15:58 +03:00
Benny Halevy
6b21f2a351 utils: clear_gently: do not clear null unique_ptr
Otherwise the null pointer is dereferenced.

Add a unit test reproducing the issue
and testing this fix.

Fixes #13636

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
2023-04-24 17:51:31 +03:00
Petr Gusev
0db8e627a5 removenode: add warning in case of exception
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.

Fixes: #11722
Closes #11735

(cherry picked from commit 40bd9137f8)
2023-04-24 10:02:39 +02:00
Botond Dénes
f1121d2149 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
In `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
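
The effect of the resolution mismatch can be shown with plain integer arithmetic. This is a simplified sketch: `tombstone_bound_us` and `is_deleted` are hypothetical helpers, and treating the tombstone bound as exclusive is an assumption of the example.

```python
def tombstone_bound_us(new_entry_us, gc_older_than_us, resolution_us):
    """Upper bound of the range tombstone, truncated to `resolution_us`."""
    bound = new_entry_us - gc_older_than_us
    return bound - bound % resolution_us


def is_deleted(entry_us, bound_us):
    """An entry is covered if it is strictly older than the bound."""
    return entry_us < bound_us


# Two schema changes within the same millisecond (timestamps in microseconds).
prev_us = 1_682_000_000_000_250
new_us = 1_682_000_000_000_900

ms_bound = tombstone_bound_us(new_us, 0, 1000)  # millisecond resolution
us_bound = tombstone_bound_us(new_us, 0, 1)     # microsecond resolution
```

With the millisecond bound the previous entry is not older than the bound and survives; with the microsecond bound it is deleted as intended.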

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID

(cherry picked from commit 10c1f1dc80)
2023-04-23 16:03:39 +03:00
Beni Peled
a0ca8abe42 release: prepare for 5.0.13
2023-04-23 14:58:03 +03:00
Botond Dénes
8bceac1713 Merge 'Backport 5.0 distributed loader detect highest generation' from Benny Halevy
Backport of 4aa0b16852 to branch-5.0
Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy

We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789

\Refs https://github.com/scylladb/scylladb/issues/11789
\Fixes https://github.com/scylladb/scylladb/issues/11793

\Closes https://github.com/scylladb/scylladb/pull/11795

Closes #13613

* github.com:scylladb/scylladb:
  Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy
  replica: distributed_loader: reindent populate_keyspace
  replica: distributed_loader: coroutinize populate_keyspace
2023-04-21 14:29:04 +03:00
Botond Dénes
6bcc7c6ed5 Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789
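
A small sketch of why the full scan matters. It is not the distributed_loader code: `highest_generation`, the in-memory directory listing and the simplified file name pattern are assumptions made for illustration.

```python
import re

# Simplified, assumed file name pattern: <version>-<generation>-...
_SSTABLE_NAME = re.compile(r"^[a-z]+-(\d+)-")


def highest_generation(dirs):
    """Return the highest generation found in any of the given directories.

    `dirs` maps a directory name to the list of file names it contains.
    """
    highest = 0
    for files in dirs.values():
        for name in files:
            match = _SSTABLE_NAME.match(name)
            if match:
                highest = max(highest, int(match.group(1)))
    return highest


table_dir = {
    "": ["me-1-big-Data.db", "me-4-big-Data.db"],  # base directory
    "staging": ["me-3-big-Data.db"],
}

# Looking at staging alone suggests 4 as the next generation, which is taken.
next_from_staging_only = highest_generation({"staging": table_dir["staging"]}) + 1
# Scanning everything first yields a generation that is still free.
next_from_full_scan = highest_generation(table_dir) + 1
```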

Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793

Closes #11795

* github.com:scylladb/scylladb:
  distributed_loader: populate_column_family: reindent
  distributed_loader: coroutinize populate_column_family
  distributed_loader: table_population_metadata: start: reindent
  distributed_loader: table_population_metadata: coroutinize start_subdir
  distributed_loader: table_population_metadata: start_subdir: reindent
  distributed_loader: pre-load all sstables metadata for table before populating it

(cherry picked from commit 4aa0b16852)
2023-04-21 13:23:56 +03:00
Benny Halevy
67f85875cc replica: distributed_loader: reindent populate_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b3e2204fe6)
2023-04-21 13:23:28 +03:00
Benny Halevy
8b874cd4e4 replica: distributed_loader: coroutinize populate_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a3c1dc8cee)
2023-04-21 13:23:18 +03:00
Botond Dénes
b08c582134 mutation/mutation_compactor: consume_partition_end(): reset _stop
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retrieving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partition is interrupted
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engaged optional from `detach_state()`, because there is no state to
detach; the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, by overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_iteration::no` and `detach_state()` will return a disengaged
optional as it should in this case.
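
The intended behaviour can be sketched with a toy compactor. This is not the mutation_compactor code: the classes below are hypothetical, booleans stand in for `stop_iteration`, and `None` stands in for a disengaged optional.

```python
class ToyCompactor:
    """Forwards fragments to a consumer and remembers whether it stopped."""

    def __init__(self, consumer):
        self._consumer = consumer
        self._stop = False
        self._state = None

    def consume(self, fragment):
        self._state = fragment  # pretend this is the compaction state
        self._stop = self._consumer.consume(fragment)
        return self._stop

    def consume_partition_end(self):
        # Overwrite _stop with whatever the downstream consumer returns.
        self._stop = self._consumer.consume_partition_end()
        return self._stop

    def detach_state(self):
        # State is only handed out if consumption was really interrupted.
        return self._state if self._stop else None


class SkippingConsumer:
    """Stops the current partition, then asks to continue with the next."""

    def consume(self, fragment):
        return True

    def consume_partition_end(self):
        return False
```

A partition that is interrupted without reaching its end still yields its state, while a partition the consumer merely skipped yields nothing.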

Fixes: #12629

Closes #13365

(cherry picked from commit bae62f899d)
2023-04-18 03:18:25 -04:00
Avi Kivity
41556b5f63 Merge 'Backport "reader_concurrency_semaphore: don't evict inactive readers needlessly" to branch-5.0' from Botond Dénes
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.

Fixes: https://github.com/scylladb/scylladb/issues/11803

Closes #13517

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: don't evict inactive readers needlessly
  reader_concurrency_semaphore: add stats to record reason for queueing permits
  reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
  reader_concurrency_semaphore: add set_resources()
2023-04-17 12:26:38 +03:00
Raphael S. Carvalho
23e7e594c0 table: Fix disk-space related metrics
The total disk space used metric incorrectly reports the amount of
disk space ever used. It should report the size of
all sstables being used + the ones waiting to be deleted.
Live disk space used, by this definition, shouldn't account for the
ones waiting to be deleted,
and live sstable count shouldn't account for sstables waiting to
be deleted.

Fix all that.

Fixes #12717.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
2023-04-16 22:19:05 +03:00
Michał Chojnowski
e6ac13314d locator: token_metadata: get rid of a quadratic behaviour in get_address_ranges()
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.

This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
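
The size difference can be sketched as follows. This is an illustration under simplifying assumptions (every endpoint owns every range, as with everywhere_topology); the function names and signatures are hypothetical and do not match token_metadata.

```python
def all_address_ranges(endpoints, ranges):
    """Quadratic variant: materializes every <endpoint, range> pair."""
    return {endpoint: list(ranges) for endpoint in endpoints}


def address_ranges_for(endpoint, endpoints, ranges):
    """Linear variant: only the ranges owned by one endpoint."""
    return list(ranges) if endpoint in endpoints else []


endpoints = [f"node-{i}" for i in range(100)]
ranges = [(i, i + 1) for i in range(100)]

quadratic_pairs = sum(len(r) for r in all_address_ranges(endpoints, ranges).values())
linear_pairs = len(address_ranges_for("node-0", endpoints, ranges))
```

For 100 endpoints and 100 ranges the first variant holds 10000 pairs at once, the second only 100.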

Refs #10337
Refs #10817
Refs #10836
Refs #10837
Fixes #12724

(cherry picked from commit 9e57b21e0c)
2023-04-16 22:03:04 +03:00
Botond Dénes
382d815459 reader_concurrency_semaphore: don't evict inactive readers needlessly
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated and requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-aggressive eviction and
checking that it doesn't happen anymore.
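
The decision rule can be sketched like this. It is a simplified illustration, not the semaphore's implementation; `QueueReason` and `should_evict_inactive` are hypothetical names.

```python
from enum import Enum, auto


class QueueReason(Enum):
    """Why a waiter could not be admitted (names are illustrative)."""
    NEEDS_RESOURCES = auto()
    OTHER = auto()


def should_evict_inactive(inactive_count, waiter_reasons):
    """Evict only if it can help: some waiter is blocked on resources."""
    if inactive_count == 0:
        return False
    return any(r is QueueReason.NEEDS_RESOURCES for r in waiter_reasons)
```

Waiters queued for any other reason leave the inactive readers alone, so they do not have to be recreated later.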

Fixes: #11803

Closes #13286

(cherry picked from commit bd57471e54)
2023-04-14 05:04:10 -04:00
Botond Dénes
a867b2c0e5 reader_concurrency_semaphore: add stats to record reason for queueing permits
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in a new stats, one for each reason a permit
can be queued.

(cherry picked from commit 7b701ac52e)
2023-04-14 05:04:10 -04:00
Botond Dénes
846edf78c6 reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
So the caller can bump the appropriate counters or log the reason why
the request cannot be admitted.

(cherry picked from commit bb00405818)
2023-04-14 05:04:10 -04:00
Botond Dénes
0ccc07322b reader_concurrency_semaphore: add set_resources()
Allows changing the total or initial resources the semaphore has.
After calling `set_resources()` the semaphore will look as if it
was created with the specified amount of resources.

(cherry picked from commit ecc7c72acd)
2023-04-14 05:04:10 -04:00
Yaron Kaikov
0b170192a1 release: prepare for 5.0.12
2023-04-10 15:58:57 +03:00
Botond Dénes
fd4b2a3319 db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.

Fixes: #11905
Refs: #11836

Closes #11860

(cherry picked from commit 84a69b6adb)
2023-03-22 09:14:12 +02:00
Botond Dénes
416929fb2a Update seastar submodule
* seastar d1d40176...07548b37 (1):
  > reactor: re-raise fatal signals

Fixes: #9242
2023-03-22 08:26:32 +02:00
Kamil Braun
9d8d7048eb service: storage_proxy: sequence CDC preimage select with Paxos learn
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.

Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
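
The ordering can be sketched with asyncio. This is a toy model, not storage_proxy code: a dict stands in for the base table, a list for the CDC log, and all function names are hypothetical.

```python
import asyncio


async def select_preimage(table, key):
    await asyncio.sleep(0)  # yield, as a real read would
    return table.get(key)


async def write_base(table, key, value):
    await asyncio.sleep(0)
    table[key] = value


async def write_cdc(cdc_log, key, preimage, value):
    await asyncio.sleep(0)
    cdc_log.append((key, preimage, value))


async def learn_decision(table, cdc_log, key, value):
    # 1. Read the preimage first, so it cannot observe the update.
    preimage = await select_preimage(table, key)
    # 2. The base write and the CDC log write may still run concurrently.
    await asyncio.gather(
        write_base(table, key, value),
        write_cdc(cdc_log, key, preimage, value),
    )
```

Awaiting the select before starting the writes is what guarantees the preimage shows the state before the update.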

`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.

Fixes #12098

(cherry picked from commit 1ef113691a)
2023-03-21 17:11:00 +01:00
Takuya ASADA
bae4155ab2 docker: prevent hostname -i failure when server address is specified
In some Docker instance configurations, hostname resolution does not
work, so our script fails on startup because we use hostname -i to
construct cqlshrc.
To prevent the error, we can use --rpc-address or --listen-address
for the address, since it should be the same.

Fixes #12011

Closes #12115

(cherry picked from commit 642d035067)
2023-03-21 17:54:56 +02:00
Pavel Emelyanov
d6e2a326cf Merge '[backport] reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict() ' from Botond Dénes
This PR backports 2f4a793457 to branch-5.1. Said patch depends on some other patches that are not part of any release yet.

Closes #13224

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
  reader_permit: expose operator<<(reader_permit::state)
  reader_permit: add get_state() accessor
2023-03-17 14:15:17 +03:00
Botond Dénes
15645ff40b reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in several ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)

The list goes on. We already have an evict() method that does all this
correctly; use that instead of the current badly open-coded alternative.
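
The refactoring idea can be sketched as follows. This is a minimal toy, not the reader_concurrency_semaphore code: the classes are hypothetical and the ttl timer is left out for brevity.

```python
class Permit:
    """Minimal stand-in for a reader permit (hypothetical)."""

    def __init__(self, on_evict=None):
        self.state = "inactive"
        self.on_evict = on_evict


class ToySemaphore:
    def __init__(self):
        self.inactive = []
        self.inactive_reads = 0  # plays the role of _stats.inactive_reads

    def register_inactive(self, permit):
        self.inactive.append(permit)
        self.inactive_reads += 1

    def evict(self, permit):
        """Single place that performs every eviction step."""
        self.inactive.remove(permit)
        self.inactive_reads -= 1
        permit.state = "evicted"
        if permit.on_evict is not None:
            permit.on_evict()

    def clear_inactive_reads(self):
        # Delegate to evict() instead of re-implementing part of it.
        while self.inactive:
            self.evict(self.inactive[0])
```

Keeping all bookkeeping in one method means a step added to evict() later is picked up by clear_inactive_reads() automatically.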

This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.

Fixes: #13048

Closes #13049

(cherry picked from commit 2f4a793457)
2023-03-17 14:14:59 +03:00
Botond Dénes
a808fc7172 reader_permit: expose operator<<(reader_permit::state)
(cherry picked from commit ec1c615029)
2023-03-17 14:14:59 +03:00
Botond Dénes
dd260bfa82 reader_permit: add get_state() accessor
(cherry picked from commit 397266f420)
2023-03-17 14:14:59 +03:00
Takuya ASADA
c46935ed5c scylla_raid_setup: fix nonexistent out()
Since branch-5.0 does not have out(), it should be run(capture_output=True)
instead.

Closes #13155
2023-03-16 16:43:28 +02:00
Avi Kivity
985d6bc4c2 Merge 'scylla_raid_setup: prevent mount failed for /var/lib/scylla for branch-5.0' from Takuya ASADA
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Also added code to make sure the uuid and the uuid-based device path are valid.

Fixes #11359

Closes #13127

* github.com:scylladb/scylladb:
  scylla_raid_setup: run uuidpath existence check only after mount failed
  scylla_raid_setup: prevent mount failed for /var/lib/scylla
  scylla_raid_setup: check uuid and device path are valid
2023-03-09 23:04:52 +02:00
Takuya ASADA
7673ff4ae3 scylla_raid_setup: run uuidpath existence check only after mount failed
We added a UUID device file existence check in #11399. We expect the UUID
device file to be created before checking, and we wait for the creation with
"udevadm settle" after "mkfs.xfs".

However, we are actually getting an error saying the UUID device file is missing,
which probably means "udevadm settle" doesn't guarantee the device file is created
under some conditions.

To avoid the error, use var-lib-scylla.mount to wait until the UUID device
file is ready, and run the file existence check when the service has
failed.

Fixes #11617

Closes #11666

(cherry picked from commit a938b009ca)
2023-03-09 22:34:03 +09:00
Takuya ASADA
c441eebf46 scylla_raid_setup: prevent mount failed for /var/lib/scylla
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Fixes #11359

(cherry picked from commit 8835a34ab6)
2023-03-09 22:33:38 +09:00
Takuya ASADA
bf4fa80dd7 scylla_raid_setup: check uuid and device path are valid
Added code to make sure the uuid and the uuid-based device path are valid.

(cherry picked from commit 40134efee4)
2023-03-09 22:32:38 +09:00
Jan Ciolek
2010231fe9 cql3: preserve binary_operator.order in search_and_replace
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.

`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.

Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.

Fixes: #13055

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13056

(cherry picked from commit aa604bd935)
2023-03-09 12:53:01 +02:00
Takuya ASADA
0a51eb55e3 main: run --version before app_template initialize
Even in an environment that causes an error while initializing Scylla,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.

Fixes #11117

Closes #11179

(cherry picked from commit d7dfd0a696)
2023-03-09 12:48:25 +02:00
Avi Kivity
d9c6c6283b Update seastar submodule (tls fixes)
* seastar 9a7ba6d57e...d1d4017679 (2):
  > Merge 'tls: vec_push: handle async errors rather than throwing on_internal_error' from Benny Halevy
Fixes #11252
  > tls: vec_push: handle synchronous error from put
Fixes #11118
2023-03-09 12:45:41 +02:00
Tomasz Grabiec
90a5344261 row_cache: Destroy coroutine under region's allocator
The reason is alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside coroutine object stored in the update
variable in row_cache::do_update()

It is allocated under cache region, but in case of exception it will
be destroyed under the standard allocator. If update is successful, it
will be cleared under region allocator, so there is no problem in the
normal case.

Fixes #12068

Closes #12233

(cherry picked from commit 992a73a861)
2023-03-08 20:54:06 +02:00
Gleb Natapov
68da667288 lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more than once
If on the first call the capture is destroyed the second call may crash.

Fixes: #12958

Message-Id: <Y/sks73Sb35F+PsC@scylladb.com>
(cherry picked from commit 1ce7ad1ee6)
2023-03-08 18:52:22 +02:00
Pavel Emelyanov
9adb1a8fdd azure_snitch: Handle empty zone returned from IMDS
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.

Empty zones response should be handled regardless.

refs: #12185

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12274

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-03-02 09:18:04 +03:00
Pavel Emelyanov
7623fe01b7 snitch: Check http response codes to be OK
Several snitch drivers make http requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successful and read the response body to parse the
data they need from it.

That's not nice; add checks that requests finish with http OK statuses.

refs: #12185

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12287

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-03-02 09:17:57 +03:00
Botond Dénes
3b0a0c4876 types: unserialize_value for multiprecision_int,bool: don't read uninitialized memory
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
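
The fragment check can be sketched in Python. This is an illustration, not the unserialize_value code: `first_byte` is a hypothetical helper and a list of `bytes` objects stands in for a fragmented buffer.

```python
def first_byte(fragments):
    """Return the first byte of a fragmented value, or None if it is empty.

    Empty fragments are skipped instead of being indexed into.
    """
    for fragment in fragments:
        if fragment:
            return fragment[0]
    return None
```

Indexing the first fragment unconditionally would fail for `[b"", b"\x01"]`; skipping empty fragments first avoids that.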

Fixes: #12821
Fixes: #12823
Fixes: #12708

Closes #12824

(cherry picked from commit ef548e654d)
2023-02-23 22:38:39 +02:00
Yaron Kaikov
019d5cde1b release: prepare for 5.0.11
2023-02-23 14:30:57 +02:00
Gleb Natapov' via ScyllaDB development
a2e255833a lwt: upgrade stored mutations to the latest schema during prepare
Currently they are upgraded during learn on a replica. There are two
problems with this.  First the column mapping may not exist on a replica
if it missed this particular schema (because it was down for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second lwt
request coordinator may not have the schema for the mutation as well
(because it was freed from the registry already) and when a replica
tries to retrieve the schema from the coordinator the retrieval will fail
causing the whole request to fail with "Schema version XXXX not found"

Both of those problems can be fixed by upgrading stored mutations
during prepare on a node it is stored at. To upgrade the mutation its
column mapping is needed and it is guaranteed that it will be present
at the node the mutation is stored at, since it is a prerequisite for storing
it that the corresponding schema is available. After that the mutation
is processed using the latest schema, which will be available on all nodes.

Fixes #10770

Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
(cherry picked from commit 15ebd59071)
2023-02-22 21:58:20 +02:00
Tomasz Grabiec
f4aa5cacb1 db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
trim_clustering_row_ranges_to() is broken for non-full keys in reverse
mode. It will trim the range to
position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.

Fixes #12180
Refs #1446

(cherry picked from commit 536c0ab194)
2023-02-22 21:52:59 +02:00
Tomasz Grabiec
8ea9a16f9e types: Fix comparison of frozen sets with empty values
A frozen set can be part of the clustering key, and with compact
storage, the corresponding key component can have an empty value.

Comparison was not prepared for this, the iterator attempts to
deserialize the item count and will fail if the value is empty.

Fixes #12242

(cherry picked from commit 232ce699ab)
2023-02-22 21:44:49 +02:00
Michał Chojnowski
1aa5283a38 utils: config_file: fix handling of workdir,W in the YAML file
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
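
The name handling can be sketched as follows. This is a Python illustration of the idea, not the config_file code; `yaml_key` and `read_options` are hypothetical helpers, and a plain dict stands in for the parsed YAML document.

```python
def yaml_key(option_name):
    """Return the long name from a 'long_name,short_name' option spec."""
    return option_name.split(",", 1)[0]


def read_options(option_names, yaml_doc):
    """Look up each option in the YAML document by its long name only."""
    found = {}
    for name in option_names:
        key = yaml_key(name)
        if key in yaml_doc:
            found[key] = yaml_doc[key]
    return found
```

With this, a YAML file containing `workdir: /var/lib/scylla` matches the option declared as "workdir,W", and options without a short name are unaffected.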

Fixes #7478
Fixes #9500
Fixes #11503

Closes #11506

(cherry picked from commit af7ace3926)
2023-02-22 21:33:25 +02:00
Takuya ASADA
2e7b1858ad scylla_coredump_setup: fix coredump timeout settings
We currently configure only TimeoutStartSec, but probably it's not
enough to prevent coredump timeout, since TimeoutStartSec is maximum
waiting time for service startup, and there is another directive to
specify maximum service running time (RuntimeMaxSec).

To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (which
configures both TimeoutStartSec and TimeoutStopSec).

Fixes #5430

Closes #12757

(cherry picked from commit bf27fdeaa2)
2023-02-19 21:14:14 +02:00
Avi Kivity
2542b57ddc Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism, however, had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added without the fiber acting on them. Since it still held onto the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds detection of the inactive/waiter contradiction.

Fixes: #11923
Refs: #11770

Closes #12026

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
  reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
  reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach

(cherry picked from commit 15ee8cfc05)
2023-02-09 11:45:53 +02:00
Botond Dénes
01a9871fc3 reader_concurrency_semaphore: unify admission logic across all paths
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods, which admit permits on user
request (the front door), and maybe_admit_waiters(), which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged over time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.

Fixes: #11770

Closes #11784

(cherry picked from commit 7fbad8de87)
2023-02-09 11:45:47 +02:00
Beni Peled
6bb7fac8d8 release: prepare for 5.0.10 2023-02-06 14:42:32 +02:00
Botond Dénes
5dff7489b1 sstables: track decompressed buffers
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore and it has an accurate view of the
actual memory consumption of reads.

Fixes: #12448

Closes #12454

(cherry picked from commit c4688563e3)
2023-02-05 19:39:04 +02:00
Tomasz Grabiec
2775b1d136 row_cache: Fix violation of the "oldest version are evicted first" when evicting last dummy
Consider the following MVCC state of a partition:

   v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

Where === means a continuous range and --- means a discontinuous range.

After two LRU items are evicted (entry1 and entry2), we will end up with:

   v2: ---------------------- <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.

This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:

   v2: ---------------------- <9> ===== <last dummy>

...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.

The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.

The situation is not easy to trigger because it requires a certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.

Closes #12452

(cherry-picked from commit f97268d8f2)

Fixes #12451.
2023-02-05 19:39:04 +02:00
Botond Dénes
2ae5675c0f types: is_tuple(): handle reverse types
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will lose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail as only frozen UDTs are
allowed in primary key components.
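
As a rough illustration of the idea (not the actual C++ change), the following Python sketch unwraps a reversed wrapper before testing for a tuple type. The classes `TupleType`, `IntType` and `ReversedType` are hypothetical stand-ins, and treating a UDT as a kind of tuple type is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntType:
    """Hypothetical stand-in for a simple, non-tuple type."""


@dataclass(frozen=True)
class TupleType:
    """Hypothetical stand-in for a tuple-like type (assumed to cover UDTs)."""
    fields: tuple


@dataclass(frozen=True)
class ReversedType:
    """Hypothetical stand-in for a reversed<T> wrapper."""
    underlying: object


def is_tuple(t) -> bool:
    # Look through the reversed wrapper instead of letting it fall into
    # the default (False) case.
    while isinstance(t, ReversedType):
        t = t.underlying
    return isinstance(t, TupleType)


print(is_tuple(ReversedType(TupleType((IntType(),)))))
```

A check that only looked at the outermost type would report the wrapped tuple as a non-tuple, which is the misclassification the commit describes.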

Fixes: #12576

Closes #12579

(cherry picked from commit ebc100f74f)
2023-02-05 19:39:04 +02:00
Calle Wilund
d507ad9424 alterator::streams: Sort tables in list_streams to ensure no duplicates
Fixes #12601 (maybe?)

Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
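
A small Python sketch of why sorting helps with paging, under stated assumptions: `list_page` and `list_all` are hypothetical helpers, the page cursor is assumed to be the last ID returned, and this is not Alternator's actual implementation.

```python
import uuid
from typing import List, Optional, Sequence, Tuple


def list_page(table_ids: Sequence[uuid.UUID], limit: int,
              exclusive_start: Optional[uuid.UUID] = None
              ) -> Tuple[List[uuid.UUID], Optional[uuid.UUID]]:
    """Return one page of IDs and the cursor for the next page (or None)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Sorting gives every call the same total order, so a cursor taken
    # from one page is meaningful for the next call.
    ordered = sorted(table_ids)
    if exclusive_start is not None:
        ordered = [t for t in ordered if t > exclusive_start]
    page = ordered[:limit]
    cursor = page[-1] if len(ordered) > limit else None
    return page, cursor


def list_all(table_ids: Sequence[uuid.UUID], limit: int) -> List[uuid.UUID]:
    """Walk all pages and collect the IDs."""
    seen: List[uuid.UUID] = []
    cursor = None
    while True:
        page, cursor = list_page(table_ids, limit, cursor)
        seen.extend(page)
        if cursor is None:
            return seen
```

Because each page only returns IDs strictly greater than the cursor, an ID cannot appear twice. A table created between calls with an ID smaller than the cursor is skipped, matching the caveat in the message.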

(cherry picked from commit da8adb4d26)
2023-02-05 19:39:00 +02:00
Benny Halevy
413af945c0 view: row_lock: lock_ck: find or construct row_lock under partition lock
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.

This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.

This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.

This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.

Fixes #12632

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
2023-02-05 17:38:49 +02:00
Kefu Chai
9a71680dc7 cql3/selection: construct string_view using char* not size
before this change, we construct a sstring from a comma expression,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.

in this change

* instead of passing the size of the string_view,
  both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
  performance, as we don't need to perform a deep copy

the issue is reported by GCC-13:

```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
        auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
                                                           ^~~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12666

(cherry picked from commit 186ceea009)

Fixes #12739.

(cherry picked from commit b588b19620)
2023-02-05 13:51:32 +02:00
Botond Dénes
94b8baa797 Revert "reader_concurrency_semaphore: unify admission logic across all paths"
This reverts commit 0e388d2140.

This patch is suspected to be the cause of read timeouts.
Refs: #12435
2023-01-11 07:09:17 +02:00
Botond Dénes
e372a5fe0a Revert "Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes"
This reverts commit bf92c2b44c.

This patch is suspected to be the cause of read timeouts.
Refs: #12435
2023-01-11 07:08:16 +02:00
Asias He
692e5ed175 gossip: Improve get_live_token_owners and get_unreachable_token_owners
The get_live_token_owners returns the nodes that are part of the ring
and live.

The get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.

The token_metadata::get_all_endpoints returns nodes that are part of the
ring.

The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and call is_alive to check if
the node is up or down. So that the correctness does not depend on
any derived information.

This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.

Fixes #10296
Fixes #11928

Closes #11952

(cherry picked from commit 16bd9ec8b1)
2023-01-09 16:55:38 +02:00
Michał Chojnowski
5a299f65ff configure: don't reduce parsers' optimization level to 1 in release
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.

Fix this by only applying the -O1 to debug modes.
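
The effect can be sketched in Python as follows. The table `MODE_CXXFLAGS`, its flag values and the helper names are hypothetical and do not reproduce the real configure script; the sketch relies only on the compiler rule that the last -O option on the command line is the one that takes effect.

```python
# Hypothetical per-mode flags, for illustration only.
MODE_CXXFLAGS = {
    "debug": ["-O0", "-g"],
    "release": ["-O3"],
}


def parser_cxxflags(mode: str) -> list:
    """Flags for the generated parser sources in the given build mode."""
    flags = list(MODE_CXXFLAGS[mode])
    # Only debug builds get the extra -O1; appending it unconditionally
    # would also override the release optimization level.
    if mode == "debug":
        flags.append("-O1")
    return flags


def effective_opt_level(flags: list):
    """The last -O option wins, as with GCC and Clang."""
    levels = [f for f in flags if f.startswith("-O")]
    return levels[-1] if levels else None


for mode in MODE_CXXFLAGS:
    print(mode, effective_opt_level(parser_cxxflags(mode)))
```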

Fixes #12463

Closes #12460

(cherry picked from commit 08b3a9c786)
2023-01-08 01:35:15 +02:00
Botond Dénes
f4ae2fa5f9 Merge 'Branch 5.0: backport 'range_tombstone_change_generator: flush: emit closing range_tombstone_change'' from Benny Halevy
This series backports 0a3aba36e6 to branch 5.0.

It ensures that a closing range_tombstone_change is emitted if the highest tombstone is open ended
since range_tombstone_change_generator::flush does not do it by default.

With the additional testing added in 9a59e9369b87b1bcefed6d1d5edf25c5d3451bc4, unit tests fail without the additional patches in the series, so it exposes a latent bug in the branch where the closing range_tombstone_change is not always emitted when flushing on end of partition or end of position range.

One additional change was required for unit tests to pass:
```diff
diff --git a/range_tombstone_change_generator.hh b/range_tombstone_change_generator.hh
index 6f98be5dce..9cde8d9b20 100644
--- a/range_tombstone_change_generator.hh
+++ b/range_tombstone_change_generator.hh
@@ -78,6 +78,7 @@ class range_tombstone_change_generator {
     template<RangeTombstoneChangeConsumer C>
     void flush(const position_in_partition_view upper_bound, C consumer) {
         if (_range_tombstones.empty()) {
+            _lower_bound = upper_bound;
             return;
         }

```

Refs https://github.com/scylladb/scylla/issues/10316

Closes #10969

* github.com:scylladb/scylladb:
  reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change
  range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows
  range_tombstone_change_generator: flush: use tri_compare rather than less
  range_tombstone_change_generator: flush: return early if empty
2023-01-04 12:52:01 +02:00
Nadav Har'El
07c20bdfea materialized view: fix bug in some large modifications to base partitions
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.

To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.

The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.
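
The difference between "empty batch" and "done" can be shown with a minimal Python sketch. The class `ViewUpdateBuilder`, the `can_skip` callback and `build_all` are hypothetical simplifications; returning `None` here plays the role of the empty std::optional, and the real code is C++ and asynchronous.

```python
import itertools
from typing import Callable, Iterable, List, Optional

MAX_ROWS_FOR_VIEW_UPDATES = 100  # batch size named in the commit message


class ViewUpdateBuilder:
    """Hypothetical, heavily simplified stand-in for the real builder."""

    def __init__(self, base_rows: Iterable[int],
                 can_skip: Callable[[int], bool]):
        self._rows = iter(base_rows)
        self._can_skip = can_skip

    def build_some(self) -> Optional[List[int]]:
        """Process one batch; return None only when no input is left."""
        batch = list(itertools.islice(self._rows, MAX_ROWS_FOR_VIEW_UPDATES))
        if not batch:
            return None  # done: explicit signal, not "empty result"
        # A batch may legitimately produce zero updates if every row
        # in it can be skipped.
        return [row for row in batch if not self._can_skip(row)]


def build_all(builder: ViewUpdateBuilder) -> List[int]:
    updates: List[int] = []
    while True:
        some = builder.build_some()
        if some is None:
            return updates
        updates.extend(some)
```

A loop that stopped on the first empty list would produce nothing for an input whose first 100 rows are all skipped, which is the premature stop described above.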

The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.

Fixes #12297.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12305

(cherry picked from commit 92d03be37b)
2023-01-04 11:36:39 +02:00
Botond Dénes
8a36c4be54 evicatble_reader: avoid preemption pitfall around waiting for readmission
Permits have to wait for re-admission after having been evicted. This
happens via `reader_permit::maybe_wait_readmission()`. The user of this
method -- the evictable reader -- uses it to re-wait admission when the
underlying reader was evicted. There is one tricky scenario however,
when the underlying reader is created for the first time. When the
evictable reader is part of a multishard query stack, the created reader
might in fact be a resumed, saved one. These readers are kept in an
inactive state until actually resumed. The evictable reader shares its
permit with the to-be-resumed reader so it can check whether it has been
evicted while saved and needs to wait for readmission before being resumed.
In this flow it is critical that there is no preemption point between
this check and actually resuming the reader, because if there is, the
reader might end up actually recreated, without having waited for
readmission first.
To help avoid this situation, the existing `maybe_wait_readmission()` is
split into two methods:
* `bool reader_permit::needs_readmission()`
* `future<> reader_permit::wait_for_readmission()`

The evictable reader can now ensure there is no preemption point between
`needs_readmission()` and resuming the reader.

Fixes: #10187

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>
(cherry picked from commit 61028ad718)
2023-01-04 11:20:28 +02:00
Avi Kivity
bf92c2b44c Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism, however, had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added without the fiber acting on them. Since it still held onto the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds detection of the inactive/waiter contradiction.

Fixes: #11923
Refs: #11770

Closes #12026

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
  reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
  reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach

(cherry picked from commit 15ee8cfc05)
2023-01-03 16:46:44 +02:00
Botond Dénes
0e388d2140 reader_concurrency_semaphore: unify admission logic across all paths
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods, which admit permits on user
request (the front door), and maybe_admit_waiters(), which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged over time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.

Fixes: #11770

Closes #11784

(cherry picked from commit 7fbad8de87)
2023-01-03 16:46:30 +02:00
Botond Dénes
288eb9d231 Merge 'Backport 5.0: cleanup compaction: flush memtable' from Benny Halevy
This is a backport of 9fa1783892 (#11902) to branch-5.0

Flush the memtable before cleaning up the table so as not to leave any disowned tokens in the memtable,
as they might be resurrected if left in the memtable.

Refs #1239

Closes #12415

* github.com:scylladb/scylladb:
  table: perform_cleanup_compaction: flush memtable
  table: add perform_cleanup_compaction
  api: storage_service: add logging for compaction operations et al
2023-01-03 12:23:03 +02:00
Benny Halevy
9219a59802 table: perform_cleanup_compaction: flush memtable
We don't explicitly clean up the memtable, while
it might hold tokens disowned by the current node.

Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.

Note that non-owned ranges are invalidated in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.

Fixes #1239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from eb3a94e2bc)
2022-12-29 09:36:37 +02:00
Benny Halevy
f9cea4dc51 table: add perform_cleanup_compaction
Move the integration with compaction_manager
from the api layer to the table class so
it can also make sure the memtable is cleaned up in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from fc278be6c4)
2022-12-29 09:36:37 +02:00
Benny Halevy
081b2b76cc api: storage_service: add logging for compaction operations et al
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from 85523c45c0)
2022-12-29 09:36:20 +02:00
Anna Mikhlin
dfb229a18a release: prepare for 5.0.9 2022-12-29 09:25:47 +02:00
Takuya ASADA
60da855c2d scylla_setup: fix incorrect type definition on --online-discard option
The --online-discard option is defined as a string parameter since it doesn't
specify "action=", but it has a boolean default value (default=True).
This breaks "provisioning in a similar environment", since the code
assumes a boolean option is declared with "action='store_true'", which this one is not.

We should change the type of the option to int, and also specify
"choices=[0, 1]" just like --io-setup does.
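
A minimal argparse sketch of the proposed definition. The wrapper `build_parser` is hypothetical, and the integer default of 1 is an assumption standing in for the previous default=True; the message itself only names the type and the choices.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # An int option restricted to 0 or 1, instead of an untyped option
    # that silently becomes a string when given on the command line.
    parser.add_argument("--online-discard", type=int, choices=[0, 1],
                        default=1)  # default value is an assumption
    return parser


args = build_parser().parse_args(["--online-discard", "0"])
print(args.online_discard)
```

With `type=int`, the parsed value has the same type whether it comes from the default or from the command line, so code that re-emits the option does not have to guess how to format it.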

Fixes #11700

Closes #11831

(cherry picked from commit acc408c976)
2022-12-28 20:44:12 +02:00
Benny Halevy
1718861e94 main: shutdown: do not abort on storage_io_error
Do not abort in defer_verbose_shutdown if the callback
throws storage_io_error, similar and in addition to
the system errors handling that was added in
132c9d5933

As seen in https://github.com/scylladb/scylla/issues/9573#issuecomment-1148238291

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10740

(cherry picked from commit 1daa7820c9)
2022-12-28 19:29:17 +02:00
Petr Gusev
e03e9b1abe cql: batch statement, inserting a row with a null key column should be forbidden
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.

Fixes: #12060
(cherry picked from commit 7730c4718e)
2022-12-28 18:15:54 +02:00
Benny Halevy
26c51025c1 reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change
When flushing range tombstones up to
position_in_partition::after_all_clustered_rows(),
the range_tombstone_change_generator now emits
the closing range_tombstone_change, so there's
no need for the upgrading_consumer to do so too.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 002be743f6)
2022-12-28 16:23:11 +02:00
Benny Halevy
5c39a4524a range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows
When the highest tombstone is open ended, we must
emit a closing range_tombstone_change at
position_in_partition::after_all_clustered_rows().

Since all consumers need to do it, implement the logic
in the range_tombstone_change_generator itself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd171f309c)
2022-12-28 16:23:11 +02:00
Benny Halevy
9823e8d9c5 range_tombstone_change_generator: flush: use tri_compare rather than less
less is already using tri_compare internally,
and we'll use tri_compare for equality in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c5a6b3894)
2022-12-28 16:23:11 +02:00
Benny Halevy
b48c9cae95 range_tombstone_change_generator: flush: return early if empty
Optimize the common, empty case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 18a80a98b8)
(added _lower_bound = upper_bound on early return)
2022-12-28 16:23:11 +02:00
Nadav Har'El
14077d2def murmur3: fix inconsistent token for empty partition key
Traditionally in Scylla and in Cassandra, an empty partition key is mapped
to minimum_token() instead of the empty key's usual hash function (0).
The reasons for this are unknown (to me), but one possibility is that
having one known key that maps to the minimal token is useful for
various iterations.

In murmur3_partitioner.cc we have two variants of the token calculation
function - the first is get_token(bytes_view) and the second is
get_token(schema, partition_key_view). The first includes that empty-
key special case, but the second was missing this special case!

As Kamil first noted in #9352, the second variant is used when looking
up partitions in the index file - so if a partition with an empty-string
key is saved under one token, it will be looked up under a different
token and not found. I reproduced exactly this problem when fixing
issues #9364 and #9375 (empty-string keys in materialized views and
indexes) - where a partition with an empty key was visible in a
full-table scan but couldn't be found by looking up its key because of
the wrong index lookup.

I also tried an alternative fix - changing both implementations to return
minimum_token (and not 0) for the empty key. But this is undesirable -
minimum_token is not supposed to be a valid token, so the tokenizer and
sharder may not return a valid replica or shard for it, so we shouldn't
store data under such a token. We also have code (such as an increasing-
key sanity check in the flat mutation reader) which assumes that
no real key in the data can be minimum_token, and our plan is to start
allowing data with an empty key (at least for materialized views).

This patch does not risk a backward-incompatible disk format change
for two reasons:

1. In the current Scylla, there was no valid case where an empty partition
   key may appear. CQL and Thrift forbid such keys, and materialized-views
   and indexes also (incorrectly - see #9364, #9375) drop such rows.
2. Although Cassandra *does* allow empty partition keys, they are only
   allowed in materialized views and indexes - and we don't support reading
   materialized views generated by Cassandra (the user must re-generate
   them in Scylla).

When #9364 and #9375 are fixed by the next patch, empty partition keys
will start appearing in Scylla (in materialized views and in the
materialized view backing a secondary index), and this fix will become
important.

Fixes #9352
Refs #9364
Refs #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit bc4d0fd5ad)
2022-12-28 15:24:53 +02:00
Piotr Grabowski
25508705a8 type_json: fix wrong blob JSON validation
Fixes wrong condition for validating whether a JSON string representing
blob value is valid. Previously, strings such as "6" or "0392fa" would
pass the validation, even though they are too short or don't start with
"0x". Add those test cases to json_cql_query_test.cc.
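
A minimal Python sketch of a validity check consistent with the two failing inputs named above. The function name is hypothetical, and requiring an even number of hex digits, as well as accepting a bare "0x" as an empty blob, are assumptions of this sketch rather than statements about the patch.

```python
import string


def is_valid_blob_literal(s: str) -> bool:
    """Check that a JSON string looks like a blob literal such as "0x0392fa"."""
    # Must be long enough to hold the prefix, and must start with it.
    if len(s) < 2 or not s.startswith("0x"):
        return False
    digits = s[2:]
    # Assumption: each byte is written as exactly two hex digits.
    if len(digits) % 2 != 0:
        return False
    return all(c in string.hexdigits for c in digits)


for candidate in ("6", "0392fa", "0x0392fa"):
    print(candidate, is_valid_blob_literal(candidate))
```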

Fixes #10114

(cherry picked from commit f8b67c9bd1)
2022-12-28 15:17:31 +02:00
Botond Dénes
347da028e9 mutation_compactor: reset stop flag on page start
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous page could cause the new page
and subsequently the entire query to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.
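
The stale-flag problem can be sketched in Python. `PageCompactor` is a hypothetical, heavily simplified stand-in for the mutation compactor: a row that is empty after compaction is modelled as `None`, and `consume()` returning False means "stop".

```python
class PageCompactor:
    """Hypothetical model of a compactor that is reused across pages."""

    def __init__(self, page_size: int):
        self._page_size = page_size
        self._rows_in_page = 0
        self._stop = False

    def start_new_page(self) -> None:
        self._rows_in_page = 0
        # Without this reset, the decision taken at the end of the
        # previous page would leak into the new one.
        self._stop = False

    def consume(self, row) -> bool:
        """Feed one row; return True to continue, False to stop."""
        if row is not None:  # row survived compaction
            self._rows_in_page += 1
            if self._rows_in_page >= self._page_size:
                self._stop = True
        # A row that compacted away does not touch _stop, so a stale
        # True here would end the page before it produced anything.
        return not self._stop


compactor = PageCompactor(page_size=2)
compactor.consume("a")
compactor.consume("b")       # page is full, _stop is set
compactor.start_new_page()
print(compactor.consume(None))
```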

This commit also adds a unit test which reproduces the bug.

Fixes: #12361

Closes #12384

(cherry picked from commit b0d95948e1)
2022-12-25 09:45:50 +02:00
Yaron Kaikov
874fa15202 release: prepare for 5.0.8 2022-12-21 21:53:30 +02:00
Michał Chojnowski
99c03cb2af sstables: index_reader: always evict the local cache gently
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound exists. This is a source of reactor stalls.
Fix that.

Fixes #12271

Closes #12364

(cherry picked from commit d9269abf5b)
2022-12-21 13:43:26 +02:00
Botond Dénes
6c35d3c5cd Merge 'Backport nodeops abort thread use-after-free patches' from Pavel Emelyanov
This includes merges 396d9e6a46 and 2c021affd1

Things that got changed here:

1. All the node_ops_... stuff in storage_service was coroutinized after 5.0, so in this merge the changes were de-coroutinized back
2. Had to cherry-pick molding for UUID (69fcc053bb and 489e50ef3a)
3. tracker::is_aborted() was added after 5.0, it caused minor context conflict
4. watchdog interval was changed, also caused minor context conflict

refs: #10284

Closes #12335

* github.com:scylladb/scylladb:
  repair: use sharded abort_source to abort repair_info
  repair: node_ops_info: add start and stop methods
  storage_service: node_ops_abort_thread: abort all node ops on shutdown
  storage_service: node_ops_abort_thread: co_return only after printing log message
  storage_service: node_ops_meta_data: add start and stop methods
  repair: node_ops_info: prevent accidental copy
  repair: Remove ops_uuid
  repair: Remove abort_repair_node_ops() altogether
  repair: Subscribe on node_ops_info::as abortion
  repair: Keep abort source on node_ops_info
  repair: Pass node_ops_info arg to do_sync_data_using_repair()
  repair: Mark repair_info::abort() noexcept
  node_ops: Remove _aborted bit
  node_ops: Simplify construction of node_ops_metadata
  main: Fix message about repair service starting
  utils: uuid: make operator bool explicit
  utils: uuid: add null_uuid
2022-12-16 10:49:49 +02:00
Benny Halevy
707622ce15 repair: use sharded abort_source to abort repair_info
Currently we use a single shared_ptr<abort_source>
that can't be copied across shards.

Instead, use a sharded<abort_source> in node_ops_info so that each
repair_info instance will use an (optional) abort_source*
on its own shard.

Added respective start and stop methods, plus a local_abort_source
getter to get the shard-local abort_source (if available).

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
bab36b604c repair: node_ops_info: add start and stop methods
Prepare for adding a sharded<abort_source> member.

Wire start/stop in storage_service::node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
8840711e79 storage_service: node_ops_abort_thread: abort all node ops on shutdown
A later patch adds a sharded<abort_source> to node_ops_info.
On shutdown, we must stop it in an orderly manner, so use the node_ops_abort_thread
shutdown path (where node_ops_singal_abort is called with a nullopt)
to abort (and stop) all outstanding node_ops by passing
a null_uuid to node_ops_abort, and let it iterate over all
node ops to abort and stop them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
af18bb3fe9 storage_service: node_ops_abort_thread: co_return only after printing log message
Currently the function co_returns if (!uuid_opt)
so the log info message indicating it's stopped
is not printed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
6003cba7a8 storage_service: node_ops_meta_data: add start and stop methods
Prepare for starting and stopping repair node_ops_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
e9afd076eb repair: node_ops_info: prevent accidental copy
Delete node_ops_info copy and move constructors before
we add a sharded<abort_source> member for the per-shard repairs
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
c5f732d42a repair: Remove ops_uuid
It used to be used to abort repair_info by the corresponding node-ops
uuid, but this code is no longer there, so it's good to drop the uuid as
well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
13a1408135 repair: Remove abort_repair_node_ops() altogether
This code is dead after previous patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
6685e00dd4 repair: Subscribe on node_ops_info::as abortion
When node_ops_meta_data aborts it also kicks repair to find and abort
all relevant repair_infos. Now it can be simplified by subscribing
repair_meta on the abort source and aborting it without explicit kick

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
350bb57291 repair: Keep abort source on node_ops_info
Next patches will need to subscribe on node_ops_meta_data's abort source
inside repair code, so keep the pointer on node_ops_info too. At the
same time, the node_ops_info::abort becomes obsolete, because the same
check can be performed via the abort_source->abort_requested()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
e186ad5b6c repair: Pass node_ops_info arg to do_sync_data_using_repair()
Next patches will need to know more than the ops_uuid. The needed info
is (well -- will be) sitting on node_ops_info, so pass it along

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
139e9afc89 repair: Mark repair_info::abort() noexcept
Next patch will call it inside abort_source subscription callback which
requires the calling code to be noexcept

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
a42c6f190c node_ops: Remove _aborted bit
A short cleanup "while at it" -- the node_ops_meta_data doesn't need to
carry dedicated _aborted boolean -- the abort source that sets it is
available instantly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
2b8f0cbd97 node_ops: Simplify construction of node_ops_metadata
It always constructs node_ops_info the same way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Pavel Emelyanov
a2a762e18d main: Fix message about repair service starting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
aa973e2b9e utils: uuid: make operator bool explicit
Following up on 69fcc053bb

To prevent unintentional implicit conversions
e.g. to a number.
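
A rough illustration of the effect, on a hypothetical `uuid_like` type
rather than the real utils::UUID (the field names and the all-zero null
convention are assumptions of this sketch): with an explicit operator
bool, `if (id)` still works, but the value no longer converts silently
to an integer.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical, simplified UUID type; not the real utils::UUID.
struct uuid_like {
    std::int64_t most_sig_bits = 0;
    std::int64_t least_sig_bits = 0;

    // Assumption of this sketch: a UUID with all bits zero is "null".
    bool is_null() const noexcept { return most_sig_bits == 0 && least_sig_bits == 0; }

    // explicit: usable in `if (id)`, but not implicitly convertible to int.
    explicit operator bool() const noexcept { return !is_null(); }
};

static_assert(!std::is_convertible_v<uuid_like, int>);
static_assert(std::is_constructible_v<bool, uuid_like>);
```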

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220216081623.830627-1-bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
e0777f1112 utils: uuid: add null_uuid
and respective bool predicate and operator
and unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215113438.473400-1-bhalevy@scylladb.com>
2022-12-15 18:48:45 +03:00
Benny Halevy
cc6311cbc7 view: row_lock: lock_ck: serialize partition and row locking
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:

    1. lock_pk acquires an exclusive lock on the partition.
    2.a lock_ck attempts to acquire shared lock on the partition
        and any lock on the row. both cases currently use a fiber
        returning a future<rwlock::holder>.
    2.b since the partition is locked, the lock_partition times out
        returning an exceptional future.  lock_row has no such problem
        and succeeds, returning a future holding a rwlock::holder,
        pointing to the row lock.
    3.a the lock_holder previously returned by lock_pk is destroyed,
        calling `row_locker::unlock`
    3.b row_locker::unlock sees that the partition is not locked
        and erases it, including the row locks it contains.
    4.a when_all_succeeds continuation in lock_ck runs.  Since
        the lock_partition future failed, it destroys both futures.
    4.b the lock_row future is destroyed with the rwlock::holder value.
    4.c ~holder attempts to return the semaphore units to the row rwlock,
        but the latter was already destroyed in 3.b above.

Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above.

This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.

This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.

Fixes #12168

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12208

(cherry picked from commit 5007ded2c1)
2022-12-13 14:52:01 +02:00
Anna Mikhlin
0354e13718 release: prepare for 5.0.7 2022-12-07 14:57:09 +02:00
Nadav Har'El
2750d2e94b Merge 'alternator: fix wrong 'where' condition for GSI range key' from Marcin Maliszkiewicz
Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages).

Fixes [ #11800](https://github.com/scylladb/scylladb/issues/11800)

Closes #11926

* github.com:scylladb/scylladb:
  alternator: add maybe_quote to secondary indexes 'where' condition
  test/alternator: correct xfail reason for test_gsi_backfill_empty_string
  test/alternator: correct indentation in test_lsi_describe
  alternator: fix wrong 'where' condition for GSI range key

(cherry picked from commit ce7c1a6c52)
2022-12-05 20:53:19 +02:00
Benny Halevy
b4383a389b repair_reader: construct _reader_handle before _reader
Currently, the `_reader` member is explicitly
initialized with the result of the call to `make_reader`.
And `make_reader`, as a side effect, assigns a value
to the `_reader_handle` member.

Since C++ initializes class members sequentially,
in the order they are defined, the assignment to `_reader_handle`
in `make_reader()` happens before `_reader_handle` is initialized.

This patch fixes that by changing the definition order,
and consequently, the member initialization order
in the constructor so that `_reader_handle` will be (default-)initialized
before the call to `make_reader()`, avoiding the undefined behavior.
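
The ordering rule can be sketched with a hypothetical `reader_owner`
class (plain ints stand in for the real reader and handle types): the
handle is declared first, so it is already initialized when
`make_reader()` assigns to it.

```cpp
#include <cassert>

// Hypothetical stand-in for repair_reader; names and types are illustrative only.
class reader_owner {
    // Declared first, so it is initialized before _reader.
    int _reader_handle = 0;
    int _reader;

    // Side effect: assigns _reader_handle, like make_reader() in the message above.
    int make_reader() {
        _reader_handle = 42;
        return 7;
    }

public:
    // Members are initialized in declaration order, regardless of the
    // order in which they are listed in the constructor's initializer list.
    reader_owner() : _reader(make_reader()) {}

    int reader_handle() const { return _reader_handle; }
    int reader() const { return _reader; }
};
```

With the two declarations swapped, `make_reader()` would write to
`_reader_handle` before that member's own initializer had run.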

Fixes #10882

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10883

(cherry picked from commit 9c231ad0ce)
2022-12-05 20:33:58 +02:00
Nadav Har'El
f667c5923a materialized views: fix view writes after base table schema change
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.

This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
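
A simplified sketch of the caching pattern, assuming a hypothetical
`view_info_like` class in which strings stand in for the schema and the
prepared SELECT statement:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified model of view_info; the real class caches a
// statement object rather than a string.
class view_info_like {
    std::string _base_schema;
    mutable std::optional<std::string> _select_statement; // lazily computed cache

public:
    explicit view_info_like(std::string base_schema)
        : _base_schema(std::move(base_schema)) {}

    const std::string& select_statement() const {
        if (!_select_statement) {
            _select_statement = "SELECT * FROM base /* schema: " + _base_schema + " */";
        }
        return *_select_statement;
    }

    void set_base_info(std::string new_base_schema) {
        _base_schema = std::move(new_base_schema);
        _select_statement.reset(); // forget the statement built for the old schema
    }
};
```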

This patch also includes a reproducing test that failed before this
patch, and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.

Fixes #10026
Fixes #11542

(cherry picked from commit 2f2f01b045)
2022-12-05 20:09:36 +02:00
Botond Dénes
e4ba0c56df db/view/view_builder: don't drop partition and range tombstones when resuming
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these to be
resurrected and the view builder to generate view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.

Fixes: #11668

Closes #11671

(cherry picked from commit 5621cdd7f9)
2022-12-05 15:01:21 +02:00
Benny Halevy
329d55cc4f configure: add --perf-tests-debuginfo option
Provides separate control over debuginfo for perf tests
since enabling --tests-debuginfo affects both today
causing the Jenkins archives of perf tests binaries to
inflate considerably.

Refs https://github.com/scylladb/scylla-pkg/issues/3060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 48021f3ceb)

Fixes #12191
2022-12-04 17:20:33 +02:00
Petr Gusev
b956293f47 modification_statement: fix LWT insert crash if clustering key is null
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.

It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.

Fixes: #11954
(cherry picked from commit 0d443dfd16)
2022-12-04 15:00:27 +02:00
Nadav Har'El
6a8c2d3f56 Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.
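
The corrected control flow can be sketched as follows. The `row` and `restriction` types and the function name are hypothetical and do not mirror the actual code in `selection.cc`; only the early-return logic is the point.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical row and predicate types, for illustration only.
struct row {
    int ck1;
    int ck2;
    int regular_col;
};
using restriction = std::function<bool(const row&)>;

// A failing multi-column restriction rejects the row early, but a passing
// one does not short-circuit the remaining checks.
bool row_passes_filter(const row& r,
                       const std::vector<restriction>& multi_column,
                       const std::vector<restriction>& others) {
    for (const auto& check : multi_column) {
        if (!check(r)) {
            return false;
        }
    }
    for (const auto& check : others) {
        if (!check(r)) {
            return false;
        }
    }
    return true;
}
```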

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.

Fixes: #6200
Fixes: #12014

Closes #12031

* github.com:scylladb/scylladb:
  cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
  boost/restrictions-test: uncomment part of the test that passes now
  cql-pytest: enable test for filtering combined multi column and regular column restrictions
  cql3: don't ignore other restrictions when a multi column restriction is present during filtering

(cherry picked from commit 2d2034ea28)

Closes #12086
2022-11-26 14:24:08 +02:00
Piotr Grabowski
27a35c7f98 Update tools/jmx submodule (jackson dependency update)
* tools/jmx 53f7f55...fe351e8 (1):
  > Update jackson dependency

(cherry picked from commit 41b098f54e)

Refs #11929

Closes #11931
2022-11-20 20:10:14 +02:00
Pavel Emelyanov
d83134a245 Merge '[branch-5.0] multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed but it is
pointless, if the partition has no more data, just drop the header, we
won't need it on the next page.

The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.

Fixes: https://github.com/scylladb/scylladb/issues/9482

Closes #11912

* github.com:scylladb/scylladb:
  test/cql-pytest: add regression test for "IDL frame truncated" error
  mutation_compactor: detach_state(): make it no-op if partition was exhausted
2022-11-16 11:50:50 +03:00
Anna Mikhlin
b844d14829 release: prepare for 5.0.6 2022-11-13 16:39:30 +02:00
Eliran Sinvani
184df0393e cql: Fix crash upon use of the word empty for service level name
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash, this wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.

Tests: 1. The fixed issue scenario from github.
       2. Unit tests in release mode.

Fixes #11774

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>

Closes #11777

(cherry picked from commit ab7429b77d)
2022-11-10 20:43:21 +02:00
Nadav Har'El
1b550dd301 cql3: fix cql3::util::maybe_quote() for keywords
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.

maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.

So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.
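
A rough sketch of the quoting decision. Note the swapped-in technique:
the patch asks the CQL parser whether a name is a valid identifier,
whereas this sketch uses a tiny hard-coded keyword set containing only
the two keywords named above. All function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Assumption: a hard-coded sample of reserved keywords instead of the parser.
inline bool is_reserved_keyword_sketch(const std::string& name) {
    static const std::set<std::string> reserved = {"to", "where"};
    return reserved.count(name) > 0;
}

// True if the name starts with a lowercase letter and contains only
// lowercase letters, digits and '_'.
inline bool is_plain_lowercase_identifier(const std::string& name) {
    if (name.empty() || !std::islower(static_cast<unsigned char>(name[0]))) {
        return false;
    }
    for (unsigned char c : name) {
        if (!(std::islower(c) || std::isdigit(c) || c == '_')) {
            return false;
        }
    }
    return true;
}

// Unconditionally quote, doubling any embedded double-quote characters.
inline std::string quote_sketch(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        out += c;
        if (c == '"') {
            out += '"';
        }
    }
    out += '"';
    return out;
}

inline std::string maybe_quote_sketch(const std::string& name) {
    if (is_plain_lowercase_identifier(name) && !is_reserved_keyword_sketch(name)) {
        return name; // e.g. the unreserved keyword "int" stays unquoted
    }
    return quote_sketch(name);
}
```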

This patch also adds two tests that reproduce the bug and verify its
fix:

1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
   that maybe_quote() quotes reserved keywords like "to", but doesn't
   quote unreserved keywords like "int".

2. Add a test reproducing issue #9450 - creating a materialized view
   whose key column is a keyword. This new test passes on Cassandra,
   failed on Scylla before this patch, and passes after this patch.

It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:

1. Try hard not to introduce new reserved keywords. Instead, introduce
   unreserved keywords. We've been doing this even before recognizing
   this maybe_quote() future-compatibility problem.

2. In the next patch we will introduce quote() - which unconditionally
   quotes identifier names, even if lowercase. These quoted names will
   be uglier for lowercase names - but will be safe from future
   introduction of new keywords. So we can consider switching some or
   all uses of maybe_quote() to quote().

Fixes #9450

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
(cherry picked from commit 5d2f694a90)
2022-11-07 17:01:32 +02:00
Alexander Turetskiy
01ce53d7fb Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the Projection settings
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693

(cherry picked from commit 636e14cc77)
2022-11-07 17:01:32 +02:00
Jadw1
e9c7f89b32 CQL3: fromJson accepts string as bool
The problem was incompatibility with cassandra, which accepts bool
as a string in `fromJson()` UDF. The difference between Cassandra and
Scylla now is that Scylla accepts whitespace around the word in the
string, while Cassandra doesn't. Both are case insensitive.
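
The Scylla-side behaviour described above (surrounding whitespace
tolerated, case ignored) can be sketched with a hypothetical helper;
this is not the actual `fromJson()` code path:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: returns std::nullopt if the string is not a boolean.
inline std::optional<bool> parse_json_bool_string(std::string s) {
    auto not_space = [](unsigned char c) { return !std::isspace(c); };
    // Trim leading and trailing whitespace.
    s.erase(s.begin(), std::find_if(s.begin(), s.end(), not_space));
    s.erase(std::find_if(s.rbegin(), s.rend(), not_space).base(), s.end());
    // Lowercase the remainder so the comparison is case insensitive.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (s == "true") {
        return true;
    }
    if (s == "false") {
        return false;
    }
    return std::nullopt;
}
```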

Fixes: #7915
(cherry picked from commit 1902dbc9ff)
2022-11-07 17:01:32 +02:00
Takuya ASADA
93f468c12c locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service
EC2 instance metadata service can be busy, let's retry to connect with
interval, just like we do in scylla-machine-image.
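
A generic retry-with-interval loop might look like the sketch below.
The helper name, the attempt limit and the blocking sleep are all
assumptions of this sketch; the snitch itself runs on Seastar and would
not block a thread like this.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical helper: `fetch` returns a value on success, std::nullopt on failure.
inline std::optional<std::string> retry_with_interval(
        const std::function<std::optional<std::string>()>& fetch,
        unsigned max_attempts,
        std::chrono::milliseconds interval) {
    for (unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
        if (auto result = fetch()) {
            return result;
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(interval); // wait before the next try
        }
    }
    return std::nullopt; // all attempts failed
}
```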

Fixes #10250

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #11688

(cherry picked from commit 6b246dc119)
(cherry picked from commit e2809674d2)
2022-11-07 17:01:32 +02:00
Botond Dénes
e54ae9efd9 test/cql-pytest: add regression test for "IDL frame truncated" error
(cherry picked from commit 11af489e84)
2022-11-07 13:43:53 +02:00
Botond Dénes
ef40e59c0e mutation_compactor: detach_state(): make it no-op if partition was exhausted
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.

This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
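
A heavily simplified sketch of that contract, with hypothetical
`compactor_like` and `detached_state_like` types in which plain strings
stand in for mutation fragments:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model of the detached state: partition header plus the
// currently active range tombstone, if any.
struct detached_state_like {
    std::string partition_start;
    std::optional<std::string> active_range_tombstone;
};

class compactor_like {
    std::string _partition_start;
    std::optional<std::string> _active_range_tombstone;
    bool _partition_exhausted = false;

public:
    explicit compactor_like(std::string partition_start)
        : _partition_start(std::move(partition_start)) {}

    void set_active_range_tombstone(std::string rt) { _active_range_tombstone = std::move(rt); }
    void consume_end_of_partition() { _partition_exhausted = true; }

    // Disengaged when nothing is left to resume in the current partition.
    std::optional<detached_state_like> detach_state() const {
        if (_partition_exhausted) {
            return std::nullopt;
        }
        return detached_state_like{_partition_start, _active_range_tombstone};
    }
};
```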

(cherry picked from commit 70b4158ce0)
2022-11-07 13:42:43 +02:00
Botond Dénes
8c56b0b268 Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.

This series fixes this bug (it was a bug in our materialized views implementation) and adds a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).

Fixes #11801

Closes #11808

* github.com:scylladb/scylladb:
  test/alternator: add test for issue 11801
  MV: fix handling of view update which reassign the same key value
  materialized views: inline used-once and confusing function, replace_entry()

(cherry picked from commit e981bd4f21)
2022-11-01 13:25:22 +02:00
Kamil Braun
fc78d88783 service: raft: raft_group0: don't call _abort_source.request_abort()
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.

Fixes #10668.

Closes #10678

(cherry picked from commit ef7643d504)
2022-10-16 11:42:22 +03:00
Pavel Emelyanov
31a20c4c54 compaction_manager: Swallow ENOSPCs in ::stop()
While being stopped, the compaction manager may hit ENOSPC. This is not a
reason to fail the stop process with an abort; it is better to log a warning
and proceed as if nothing happened
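
A rough sketch of the pattern, using std::system_error in place of the
storage_io_error type mentioned in this series; the helper name and its
return convention are hypothetical:

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <iostream>
#include <system_error>

// Hypothetical helper: runs `do_stop` and swallows only ENOSPC.
// Returns true on a clean stop, false if ENOSPC was swallowed.
inline bool stop_swallowing_enospc(const std::function<void()>& do_stop) {
    try {
        do_stop();
        return true;
    } catch (const std::system_error& e) {
        if (e.code() != std::errc::no_space_on_device) {
            throw; // any other failure still propagates
        }
        std::cerr << "warning: ENOSPC while stopping, ignoring: " << e.what() << '\n';
        return false;
    }
}
```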

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:53 +03:00
Pavel Emelyanov
7e42bcfd61 exceptions: Mark storage_io_error::code() with noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:03 +03:00
Pavel Emelyanov
2107ffe2d2 compaction_manager: Shuffle really_do_stop()
Make it the future-returning method and setup the _stop_future in its
only caller. Makes next patch much simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:02 +03:00
Beni Peled
5a97a1060e release: prepare for 5.0.5 2022-10-09 08:44:14 +03:00
Nadav Har'El
2b0487c900 cql: validate bloom_filter_fp_chance up-front
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.

Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
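
An up-front check could look roughly like the sketch below. The
function name is hypothetical, the upper bound of 1.0 is an assumption,
and only the 6.71e-5 minimum comes from the text above.

```cpp
#include <cassert>
#include <stdexcept>

// Minimum false-positive rate quoted in the commit message.
constexpr double min_supported_bloom_filter_fp_chance = 6.71e-5;

// Hypothetical validation hook, meant to run when the table is created,
// long before any memtable is flushed.
inline void validate_bloom_filter_fp_chance(double fp_chance) {
    // Written as a negated range check so that NaN is rejected too.
    if (!(fp_chance >= min_supported_bloom_filter_fp_chance && fp_chance <= 1.0)) {
        throw std::invalid_argument(
            "bloom_filter_fp_chance must be between 6.71e-5 and 1.0");
    }
}
```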

This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).

Fixes #11524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11576

(cherry picked from commit 4c93a694b7)
2022-10-04 16:22:50 +03:00
Pavel Emelyanov
d3b3c53d9f system_keyspace/config: Swallow string->value cast exception
When updating an updateable value via CQL the new value comes as a
string that's then boost::lexical_cast-ed to the desired value. If the
cast throws the respective exception is printed in logs which is very
likely uncalled for.

fixes: #10394
tests: manual

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220503142942.8145-1-xemul@scylladb.com>
(cherry picked from commit 063d26bc9e)
2022-10-04 16:19:46 +03:00
Nadav Har'El
50c2c1b1d4 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298

(cherry picked from commit 941c719a23)
2022-10-03 14:28:16 +03:00
Tomasz Grabiec
aa647a637a test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone
The generator was first setting the marker then applied tombstones.

The marker was set like this:

  row.marker() = random_row_marker();

Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.

However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.

This could generate rows with markers uncompacted with shadowable tombstones.

This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.

Fix by making sure there are no key clashes.

Closes #11663

(cherry picked from commit 5268f0f837)
2022-10-02 16:45:07 +03:00
Michael Livshin
2c0040fcb3 allow pre-scrub snapshots of materialized views and secondary indices
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.

But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.

(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views.  Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)

Closes #10760.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(cherry picked from commit aab4cd850c)

Fixes #10760.
2022-10-02 14:04:11 +03:00
Nadav Har'El
54564adb7c alternator: forbid duplicate index (LSI and GSI) names
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exist only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate names.

So in this patch we add a test for this issue, and fix it.

Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.

Fixes #10789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8866c326de)
2022-10-02 13:00:03 +03:00
Tomasz Grabiec
839876e8f2 db: range_tombstone_list: Avoid quadratic behavior when applying
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).

This can cause reactor stalls and availability issues when replicas
apply such deletions.

This patch avoids the problem by reserving an exponentially increasing
amount of space. Also, to avoid large allocations, it switches the
container to chunked_vector.
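
The difference between the two reservation strategies can be sketched
on a plain std::vector (an assumption of this sketch; the patch uses
chunked_vector, and the helper names here are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Make room for one more element, growing capacity geometrically so that
// reallocations become rare as the log grows.
template <typename T>
void reserve_one_more(std::vector<T>& log) {
    if (log.size() == log.capacity()) {
        log.reserve(std::max<std::size_t>(1, log.capacity() * 2));
    }
}

// Counts how many times the capacity changes while appending n entries.
inline std::size_t count_reallocations(std::size_t n, bool exponential) {
    std::vector<int> log;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = log.capacity();
        if (exponential) {
            reserve_one_more(log);
        } else {
            log.reserve(log.size() + 1); // the pattern described above
        }
        if (log.capacity() != before) {
            ++reallocations;
        }
        log.push_back(static_cast<int>(i)); // space is already reserved
    }
    return reallocations;
}
```

Each reallocation moves every existing entry, so a strategy that
reallocates on nearly every push costs O(N^2) moves overall, while
doubling keeps the number of reallocations logarithmic in N.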

Fixes #11211

Closes #11215

(cherry picked from commit 7f80602b01)
2022-09-30 17:55:23 +03:00
Botond Dénes
36002e2b7c sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422

(cherry picked from commit be9d1c4df4)
2022-09-30 17:55:14 +03:00
Botond Dénes
91a8f9e09b test/lib/random_schema: add a simpler overload for fixed partition count
Some tests want to generate a fixed amount of random partitions, make
their life easier.

(cherry picked from commit 98f3d516a2)

Ref #11421 (prerequisite)
2022-09-30 17:54:55 +03:00
Michael Livshin
bc29f350dd batchlog_manager: warn when a batch fails to replay
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.

(Would info be more appropriate here than warning?)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10556

Fixes #10636

(cherry picked from commit 00ed4ac74c)
2022-09-29 12:14:56 +03:00
Asias He
4fe571f470 streaming: Allow drop table during streaming
Currently, if a table is dropped during streaming, the streaming would
fail with no_such_column_family error.

Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.

This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.

This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.

This patch also fixes a possible use-after-free issue by not passing
the table reference object around.

Fixes #10395

Closes #10396

(cherry picked from commit 953af38281)
2022-09-21 10:26:22 +03:00
Michał Radwański
ebf38eaead flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO.
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed, however it
hasn't been closed. However, if a reader didn't start any operations, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.

Code excerpt:
flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
    class transforming_reader : public flat_mutation_reader_v2::impl {
        // ...
    };
    return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
}
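
The bookkeeping idea can be sketched on a hypothetical `reader_like`
class (member names are illustrative; the real reader signals the error
differently than a plain assert):

```cpp
#include <cassert>

// Hypothetical, simplified reader; not the real flat_mutation_reader.
class reader_like {
    bool _close_required = false; // set once the reader has started any operation
    bool _closed = false;

public:
    void fill_buffer() { _close_required = true; } // stands in for "initiated IO"
    void close() noexcept { _closed = true; }

    // A reader that never started work may be destroyed without close().
    bool safe_to_destroy() const noexcept { return !_close_required || _closed; }

    ~reader_like() { assert(safe_to_destroy()); }
};
```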

Fixes #9065.
Fixes #11491

(cherry picked from commit 9ada63a9cb)
2022-09-21 10:25:18 +03:00
Beni Peled
1c82766f33 release: prepare for 5.0.4 2022-09-21 09:16:13 +03:00
Piotr Sarna
e1f78c33b4 Merge 'Fix mutation commutativity with shadowable tombstone'
from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric

(cherry picked from commit 484004e766)
2022-09-20 23:21:06 +02:00
Tomasz Grabiec
0634b5f734 test: row_cache: Use more narrow key range to stress overlapping reads more
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.

Closes #11260

(cherry picked from commit 8ee5b69f80)
2022-09-20 23:20:43 +02:00
Avi Kivity
6f020b26e1 Merge 'Backport 3 fixes for the evictable reader v2' from Botond Dénes
This pull request backports 3 important fixes from adc08d0ab9. These 3 commits fixed important bugs in the v2 variant of the evictable reader, but were not backported because they were part of a large series doing v2 conversion in general. This means that 5.0 was left with a buggy evictable reader v2, which is used by repair. So far in the wild we've seen one bug manifest itself: the evictable reader getting stuck, spinning in a tight loop in `evictable_reader_v2::do_fill_buffer()`, in turn making repair get stuck too.

Fixes: #11223

Closes #11540

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_test: add v2 specific evictable reader tests
  evictable_reader_v2: terminate active range tombstones on reader recreation
  evictable_reader_v2: restore handling of non-monotonically increasing positions
  evictable_reader_v2: simplify handling of reader recreation
2022-09-20 13:42:10 +03:00
Pavel Emelyanov
7f8dcc5657 messaging_service: Fix gossiper verb group
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.

fixes: #11465

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
2022-09-19 10:31:58 +03:00
Botond Dénes
20451760fe tools/scylla-sstable: fix description template
Quote '{' and '}' used in CQL example, so format doesn't try to
interpret it.
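
The brace-quoting rule described above can be illustrated with Python's `str.format`, which also treats `{` and `}` as placeholder delimiters and accepts doubled braces as literals. This is only a sketch of the idea: the template text and the `tool` field are hypothetical and are not taken from scylla-sstable, whose actual formatting code is not shown here.

```python
# Hypothetical description template containing a CQL map literal.
# Literal braces are doubled so that str.format() leaves them alone;
# only {tool} is treated as a replacement field.
TEMPLATE = "Usage of {tool}: INSERT INTO ks.tbl (pk, m) VALUES (0, {{'a': 1}});"


def render(tool: str) -> str:
    """Fill in the template; doubled braces come out as single braces."""
    return TEMPLATE.format(tool=tool)


rendered = render("scylla-sstable")
```

Without the doubling, the formatter would try to interpret `{'a': 1}` as a replacement field and fail.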

Fixes: #11571

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220221140652.173015-1-bdenes@scylladb.com>
(cherry picked from commit 10880fb0a7)
2022-09-19 06:54:25 +03:00
Michał Chojnowski
51b031d04e sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202

(cherry picked from commit cdb3e71045)
2022-09-18 13:29:35 +03:00
Botond Dénes
82d1446ca9 test/boost/mutation_reader_test: add v2 specific evictable reader tests
One is a reincarnation of the recently removed
test_multishard_combining_reader_non_strictly_monotonic_positions. The
latter was actually targeting the evictable reader but through the
multishard reader, probably for historic reasons (evictable reader was
part of the multishard reader family).
The other one checks that active range tombstones changes are properly
terminated when the partition ends abruptly after recreating the reader.

(cherry picked from commit 014a23bf2a)
2022-09-15 13:51:13 +03:00
Botond Dénes
e0acb0766d evictable_reader_v2: terminate active range tombstones on reader recreation
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone or even the partition started before will continue after
too. So we have to make sure to wrap up all loose threads when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.

(cherry picked from commit 9e48237b86)
2022-09-14 19:15:50 +03:00
Botond Dénes
4f26d489a0 evictable_reader_v2: restore handling of non-monotonically increasing positions
We thought that unlike v1, v2 will not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
  _next_position_in_partition, thus ensuring the next time we recreate
  the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
  larger position.

The code is just much nicer because it uses coroutines.

(cherry picked from commit 6db08ddeb2)
2022-09-14 19:15:49 +03:00
Botond Dénes
43cbc5c836 evictable_reader_v2: simplify handling of reader recreation
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not, all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward to use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
  circumstances due to a difference between the v1 and v2 versions of
  `is_end_of_stream()`.
* The position of the first non-dropped fragment was not validated
  (this was integrated into the range tombstone trimming which was
  thrown out by the conversion).

(cherry picked from commit 498d03836b)
2022-09-14 19:15:49 +03:00
Nadav Har'El
f0c521efdf alternator: clean error shutdown in case of TLS misconfiguration
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot rely on
the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.

This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:

<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.

After this patch we get the right error message:

ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])

Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.

Fixes #10025
Fixes #11520

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
(cherry picked from commit 7f89c8b3e3)
2022-09-11 14:43:18 +03:00
Beni Peled
b9a61c8e9a release: prepare for 5.0.3 2022-09-07 11:16:52 +03:00
Karol Baryła
32aa1e5287 transport/server.cc: Return correct size of decompressed lz4 buffer
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.

Fixes #11476

(cherry picked from commit 1c2eef384d)
2022-09-07 10:58:42 +03:00
Nadav Har'El
da6a126d79 cross-tree: fix header file self-sufficiency
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.

We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.

This is needed because our CI runs "ninja dev-headers".

Fixes #10995

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11457
2022-09-06 15:45:34 +03:00
Avi Kivity
d07e902983 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()

(cherry picked from commit afa7960926)
2022-09-02 11:39:43 +03:00
Piotr Sarna
3c0fc42f84 cql3: fix misleading error message for service level timeouts
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).

Fixes #10286

Closes #10294

(cherry picked from commit 85e95a8cc3)
2022-09-01 20:34:12 +03:00
Piotr Grabowski
964ccf9192 type_json: support integers in scientific format
Add support for specifying integers in scientific format (for example
1.234e8) in an INSERT JSON statement:

INSERT INTO table JSON '{"int_column": 1e7}';

Inserting a floating-point number ending with .0 is allowed, as
the fractional part is zero. Non-zero fractional part (for example
12.34) is disallowed. A new test is added to test all those behaviors.

Before the JSON parsing library was switched to RapidJSON from JsonCpp,
this statement used to work correctly, because JsonCpp transparently
casts double to integer value.

This behavior differs from Cassandra, which disallows those types of
numbers (1e7, 123.0 and 12.34).

Fix typo in if condition: "if (value.GetUint64())" to
"if (value.IsUint64())".
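
The acceptance rule described above (scientific notation and a zero fractional part are allowed, a non-zero fractional part is rejected) can be sketched as follows. This is an illustration in Python, not the RapidJSON-based code in type_json; the function name `json_number_to_int` is hypothetical.

```python
from decimal import Decimal, InvalidOperation


def json_number_to_int(text: str) -> int:
    """Convert a JSON number literal to an int.

    Accepts plain integers, scientific notation (1e7) and values whose
    fractional part is zero (123.0); rejects everything else (12.34).
    """
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}")
    # Reject NaN/Infinity and any value with a non-zero fractional part.
    if not value.is_finite() or value != value.to_integral_value():
        raise ValueError(f"not an integer: {text!r}")
    return int(value)
```

Using exact decimal arithmetic avoids the rounding that a round trip through a binary double could introduce for large values.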

Fixes #10100

(cherry picked from commit efe7456f0a)
2022-09-01 16:03:49 +03:00
Avi Kivity
dfdc128faf Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239

Closes #11240

* github.com:scylladb/scylladb:
  test: mvcc: Fix illegal use of maybe_refresh()
  tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
  tests: row_cache_test: Introduce one_shot mode to throttle
  row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
2022-08-11 18:36:44 +02:00
Yaron Kaikov
299122e78d release: prepare for 5.0.2 2022-08-07 16:15:02 +03:00
Avi Kivity
23a34d7e42 Merge 'Backport: Fix map subscript crashes when map or subscript is null' from Nadav Har'El
This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0.
Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but nevertheless was fairly straightforward - just copy the exact same checking code to its right place, and keep the exact same tests to see we indeed fixed the bug.

Refs #10535.

The original cover letter from https://github.com/scylladb/scylla/pull/10420:

In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically.

In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL.

However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.

Fixes https://github.com/scylladb/scylla/issues/10361
Fixes https://github.com/scylladb/scylla/issues/10399
Fixes https://github.com/scylladb/scylla/pull/10401

Closes #11142

* github.com:scylladb/scylla:
  test/cql-pytest: reproducer for CONTAINS NULL bug
  expressions: don't dereference invalid map subscript in filter
  expressions: fix invalid dereference in map subscript evaluation
  test/cql-pytest: improve tests for map subscripts and nulls
2022-07-28 15:31:28 +03:00
Nadav Har'El
67a2f3aa67 test/cql-pytest: reproducer for CONTAINS NULL bug
This is a reproducer for issue #10359 that a "CONTAINS NULL" and
"CONTAINS KEY NULL" restrictions should not match any set, but currently
do match non-empty or all sets.

The tests currently fail on Scylla, so marked xfail. They also fail on
Cassandra because Cassandra considers such a request an error, which
we consider a mistake (see #4776) - so the tests are marked "cassandra_bug".

Refs #10359.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412130914.823646-1-nyh@scylladb.com>
(cherry picked from commit ae0e1574dc)
2022-07-27 20:03:30 +03:00
Nadav Har'El
66e8cf8cea expressions: don't dereference invalid map subscript in filter
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).

Cassandra returns an invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:

 * For NULL, we consider m[NULL] to result in NULL - instead of an error.
   This behavior is more consistent with other expressions that contain
   null - for example NULL[2] and NULL<2 both result in NULL as well.
   Moreover, if in the future we allow more complex expressions, such
   as m[a] (where a is a column), we can find the subscript to be null
   for some rows and non-null for other rows - and throwing an "invalid
   query" in the middle of the filtering doesn't make sense.

 * For UNSET_VALUE, we do consider this an error like Cassandra, and use
   the same error message as Cassandra. However, the current implementation
   checks for this error only when the expression is evaluated - not
   before. It means that if the scan is empty before the filtering, the
   error will not be reported and we'll silently return an empty result
   set. We currently consider this ok, but we can also change this in the
   future by binding the expression only once (today we do it on every
   evaluation) and validating it once after this binding.
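
The semantics chosen in the two points above can be summarized in a small sketch. It is written in Python for illustration only and does not mirror the C++ expression evaluator; `UNSET`, `InvalidRequest`, `map_subscript` and `filter_matches` are hypothetical names, and treating a missing key as NULL is an assumption of this sketch.

```python
UNSET = object()  # hypothetical stand-in for UNSET_VALUE


class InvalidRequest(Exception):
    """Hypothetical stand-in for the driver-visible invalid request error."""


def map_subscript(m, key):
    """Evaluate m[key] with NULL represented as None."""
    if key is UNSET:
        # Checked only at evaluation time, as described above.
        raise InvalidRequest("Unsupported unset map key for column m")
    if m is None or key is None:
        # NULL "wins": both NULL[2] and m[NULL] evaluate to NULL.
        return None
    return m.get(key)  # assumption: a missing key also yields NULL


def filter_matches(m, key, expected):
    """Evaluate the filter m[key] = expected; a NULL result never matches."""
    value = map_subscript(m, key)
    return value is not None and value == expected
```

Because the unset check runs per evaluation, a scan that produces no rows never calls `map_subscript` and therefore never reports the error, which is the behavior the message describes.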

Fixes #10361
Fixes #10399

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit fbb2a41246)
2022-07-27 19:56:17 +03:00
Nadav Har'El
35b66c844c expressions: fix invalid dereference in map subscript evaluation
When we have a filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferenced an unset optional, and continued
processing the result of this dereference which resulted in undefined
behavior - sometimes we were lucky enough to get "marshaling error"
but other times Scylla crashed.

The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).

The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.

Fixes #10417

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 808a93d29b)
2022-07-27 19:50:24 +03:00
Nadav Har'El
9e7a1340b9 test/cql-pytest: improve tests for map subscripts and nulls
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending how it was run - alone
or with other tests - because it was using a shared table.

In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.

The two new tests (and the old test_map_subscript_null) pass on
Cassandra so still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).

The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:

    test/cql-pytest/run --count 100 --runxfail
        test_filtering.py::test_filtering_null_map_with_subscript

Refs #10361
Refs #10399
Refs #10401

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 189b8845fe)
2022-07-27 19:28:17 +03:00
Benny Halevy
d5a0750ef3 multishard_mutation_query: do_query: stop ctx if lookup_readers fails
lookup_readers might fail after populating some readers
and those better be closed before returning the exception.

Fixes #10351

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10425

(cherry picked from commit 055141fc2e)
2022-07-25 14:52:44 +03:00
Benny Halevy
618c483c73 sstables: time_series_sstable_set: insert: make exception safe
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
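
The rollback pattern described above can be sketched as follows. This is a Python illustration under stated assumptions, not the time_series_sstable_set code: the class `TwoIndexSet`, its dictionaries, and the `fail_second` test hook are all hypothetical.

```python
class TwoIndexSet:
    """Keeps the same item in two indexes that must stay in sync."""

    def __init__(self):
        self._forward = {}    # hypothetical analogue of _sstables
        self._reversed = {}   # hypothetical analogue of _sstables_reversed

    def insert(self, key, rkey, item, fail_second=False):
        self._forward[key] = item
        try:
            if fail_second:
                # Test hook simulating a failure of the second insertion.
                raise MemoryError("simulated allocation failure")
            self._reversed[rkey] = item
        except BaseException:
            # Undo the first insertion so both indexes stay consistent,
            # then let the original exception propagate.
            del self._forward[key]
            raise
```

After a failed insert, neither index contains the item, so a later retry starts from a clean state.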

Fixes #10787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd68b04fbf)
2022-07-25 14:21:45 +03:00
Tomasz Grabiec
f10fd1bc12 test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.

Fixes #10801
Refs #10793

(cherry picked from commit 0e78ad50ea)

Closes #10802

(cherry picked from commit 3bec1cc19f)
2022-07-25 14:19:48 +03:00
Tomasz Grabiec
1891f10141 memtable: Fix missing range tombstones during reads under certain rare conditions
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip a range tombstone for a certain pattern of deletions
and under certain sequence of events.

_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.

The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.

For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:

[a, c] @ t0
[a, b) @ t1, [b, d] @ t2

c > b

The proper sequence for such versions is (assuming d > c):

[a, b) @ t1,
[b, d] @ t2

Due to the bug, the reader will produce:

[a, b) @ t1,
[b, c] @ t0

The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).

The problem goes away once MVCC versions merge.
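
The expected ("proper") sequence for the example above can be checked with a small sketch that computes which tombstone is visible at each position. It is not the memtable reader's algorithm; it only derives the correct end result. Positions are modeled as integers and every range as half-open, which is a simplification of the mixed bounds used in the message, and `deoverlap` is a hypothetical name.

```python
def deoverlap(tombstones):
    """Return non-overlapping (start, end, timestamp) ranges.

    Input ranges are half-open [start, end). For every elementary
    segment the newest covering timestamp wins; adjacent segments
    with the same timestamp are merged.
    """
    bounds = sorted({p for s, e, _ in tombstones for p in (s, e)})
    result = []
    for lo, hi in zip(bounds, bounds[1:]):
        covering = [ts for s, e, ts in tombstones if s <= lo and hi <= e]
        if not covering:
            continue  # gap: no tombstone covers this segment
        ts = max(covering)
        if result and result[-1][1] == lo and result[-1][2] == ts:
            result[-1] = (result[-1][0], hi, ts)  # extend previous range
        else:
            result.append((lo, hi, ts))
    return result


# a=0, b=2, c=4, d=6, so that c > b and d > c as in the example.
a, b, c, d = 0, 2, 4, 6
merged = deoverlap([(a, c, 0), (a, b, 1), (b, d, 2)])
```

Here `merged` contains the range from a to b at t1 followed by the range from b to d at t2; the t0 tombstone is fully shadowed and never surfaces, unlike in the buggy output shown above.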

Fixes #10913
Fixes #10830

Closes #10914

(cherry picked from commit a6aef60b93)
2022-07-19 19:33:51 +03:00
Pavel Emelyanov
b177dacd36 Update seastar submodule (auto-increase latency goal fixes)
* seastar dbf79189...9a7ba6d5 (3):
  > io: Adjust IO latency goal on fair-queue level
  > reactor: Check IOPS/bandwidth and increase latency goal
  > Revert "io_queue: Auto-increase the io-latency goal"

refs: #10927

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 13:06:43 +03:00
Yaron Kaikov
283a722923 release: prepare for 5.0.1 2022-07-19 06:39:11 +03:00
Pavel Emelyanov
522d0a81e7 azure_snitch: Do nothing on non-io-cpu
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All but azure really do so.
The azure one gets dc/rack on all shards, which is excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc87d0)
2022-07-17 14:13:25 +03:00
Pavel Emelyanov
cd13911db4 Merge 'Scrub compaction: prevent mishandling of range tombstone changes' from Botond
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of active range tombstone.
Scrub in both SKIP and SEGREGATE mode closes the partition on
seeing the first invalid fragment (SEGREGATE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.

This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.

Fixes: #10168

Tests: unit(dev)

* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
  compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
  test/boost/mutation_test: add test for mutation_fragment_stream_validator
  mutation_fragment_stream_validator: validate range tombstone changes

(cherry picked from commit edd0481b38)
2022-07-14 18:49:13 +03:00
Nadav Har'El
32423ebc38 Merge 'Handle errors during snapshot' from Benny Halevy
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.

`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.

With that, errors are handled in table::snapshot and propagated
back to the callers.

Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.

Test: unit(dev), rest_api(debug)

* github.com:scylladb/scylla:
  table: snapshot: handle errors
  table: snapshot: get rid of skip_flush param
  database: truncate: skip flush when taking snapshot
  test: rest_api: storage_service: verify_snapshot_details: add truncate
  database: snapshot_on_all: flush before snapshot if needed
  table: make snapshot method private
  database: add snapshot_on_all
  snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
  snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
  api: storage_service: increase visibility of snapshot ops in the log
  api: storage_service: coroutinize take_snapshot and del_snapshot
  api: storage_service: take_snapshot: improve api help messages
  test: rest_api: storage_service: add test_storage_service_snapshot
  database: add flush_on_all variants
  test: rest_api: add test_storage_service_flush

(cherry picked from commit 2c39c4c284)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10975
2022-07-12 15:24:24 +03:00
Pavel Emelyanov
97054ee691 view: Fix trace-state pointer use after move
It's moved into .mutate_locally() but it is captured and used in its
continuation. It works well just because moved-from pointer looks like
nullptr and all the tracing code checks for it to be non-such.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/
       (CI job failed on post-actions thus it's red)

Fixes #11015

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220711134152.30346-1-xemul@scylladb.com>
(cherry picked from commit 5526738794)
2022-07-12 14:20:57 +03:00
Piotr Sarna
34085c364f view: exclude using static columns in the view filter
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).

Fixes #10851

Closes #10855

(cherry picked from commit bc3a635c42)
2022-07-11 17:06:55 +03:00
Takuya ASADA
323521f4c8 install.sh: install files with correct permission in strict umask setting
To avoid failing to run scripts as a non-root user, we need to set
permission explicitly on executables.

Fixes #10752

Closes #10840

(cherry picked from commit 13caac7ae6)
2022-07-10 16:46:30 +03:00
Asias He
1ad59d6a7b repair: Do not flush hints and batchlog if tombstone_gc_mode is not repair
The flush of hints and batchlog are needed only for the table with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.

Fixes #10004

Closes #10124

(cherry picked from commit ec59f7a079)
2022-07-04 10:31:51 +03:00
Nadav Har'El
d3045df9c9 Merge 'types: fix is_string for reversed types' from Piotr Sarna
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
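
A minimal sketch of the failure mode, written in Python rather than Scylla's C++. The classes `Utf8Type` and `ReversedType` and both helper functions are hypothetical stand-ins, not Scylla's actual type hierarchy; the only point is that a string check has to look through the reversed wrapper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8Type:
    """Hypothetical stand-in for a string column type."""
    name: str = "text"


@dataclass(frozen=True)
class ReversedType:
    """Hypothetical wrapper used for columns clustered in DESC order."""
    underlying: object


def is_string_naive(t) -> bool:
    # Broken variant: inspects only the outermost type.
    return isinstance(t, Utf8Type)


def is_string(t) -> bool:
    # Fixed variant: unwrap the reversed wrapper before checking.
    while isinstance(t, ReversedType):
        t = t.underlying
    return isinstance(t, Utf8Type)
```

With these definitions, `is_string_naive(ReversedType(Utf8Type()))` is false while `is_string(...)` on the same value is true, which mirrors why LIKE was rejected on a DESC column.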

Fixes #10183

Closes #10181

* github.com:scylladb/scylla:
  test: add a case for LIKE operator on a descending order column
  types: fix is_string for reversed types

(cherry picked from commit 733672fc54)
2022-07-03 17:59:33 +03:00
Benny Halevy
be48b7aa8b compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.

However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.

Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.

The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.

Fixes #10151

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
2022-07-03 14:28:47 +03:00
Takuya ASADA
3c4688bcfa scylla_coredump_setup: support new format of Storage field
The Storage field of "coredumpctl info" changed in systemd v248: it now appends
"(present)" to the end of the line when the coredump file is available.

Fixes #10669

Closes #10714
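
One way to tolerate both formats of the Storage line is sketched below. This is an illustration only: the regular expression, the `parse_storage` helper and the sample paths are assumptions, not the code in scylla_coredump_setup.

```python
import re

# Matches the "Storage:" line of coredumpctl-style output, with or without
# a trailing parenthesized note such as "(present)".
STORAGE_RE = re.compile(
    r"^\s*Storage:\s*(?P<path>\S+)(?:\s+\((?P<note>[^)]*)\))?\s*$",
    re.MULTILINE,
)


def parse_storage(info: str):
    """Return (path, note) from the Storage line, or None if there is none."""
    m = STORAGE_RE.search(info)
    if m is None:
        return None
    return m.group("path"), m.group("note")
```

The note is returned separately so a caller can compare the path alone, whichever systemd version produced the output.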

(cherry picked from commit ad2344a864)
2022-07-03 13:55:18 +03:00
Nadav Har'El
cc22021876 alternator: forbid empty AttributesToGet
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression parameters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.

Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax

However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.

So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
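
The check described above can be sketched as follows. `ValidationException` and `validate_attributes_to_get` are hypothetical names chosen for illustration; Alternator's real implementation is in C++ and is not shown in this message.

```python
class ValidationException(Exception):
    """Hypothetical stand-in for the error returned to the client."""


def validate_attributes_to_get(request: dict) -> None:
    # An absent AttributesToGet is fine; a present-but-empty list is not.
    attrs = request.get("AttributesToGet")
    if attrs is not None and len(attrs) == 0:
        raise ValidationException("AttributesToGet must not be empty")
```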

Fixes #10332

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
(cherry picked from commit 9c1ebdceea)
2022-07-03 13:35:50 +03:00
Yaron Kaikov
c9e79cb4a3 release: prepare for 5.0.0 2022-06-28 15:51:29 +03:00
Yaron Kaikov
f28542a71e release: prepare for 5.0.rc8 2022-06-12 14:44:47 +03:00
Pavel Emelyanov
527a75a4c0 Update seastar submodule (Calculate max IO lengths as lengths)
* seastar 8b2c13b3...dbf79189 (1):
  > Merge 'Calculate max IO lengths as lengths'
     io_queue: Type alias for internal::io_direction_and_length
     io_queue, fair_group: Throw instead of assert
     io_queue: Keep max lengths on board
     io_queue: Toss request_fq_ticket()
     io_queue: Introduce make_ticket() helper
     io_queue: Remove max_ticket_size
     io_queue: Make make_ticket() non-brancy
     io_queue: Add devid to group creation log

tests: cstress(release)
fixes: #10704
2022-06-09 21:15:21 +03:00
Avi Kivity
df00f8fcfb Update seastar submodule (json crash in describe_ring)
* seastar 7a430a0830...8b2c13b346 (1):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592.
2022-06-08 16:48:28 +03:00
Yaron Kaikov
41a00c744f release: prepare for 5.0.rc7 2022-06-02 15:13:59 +03:00
Avi Kivity
2d7b6cd702 messaging: do isolate default tenants
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.

In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:

 scheduling_group
 messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
     return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
     if (must_compress) {
         opts.compressor_factory = &compressor_factory;
     }
     opts.tcp_nodelay = must_tcp_nodelay;
     opts.reuseaddr = true;
-    opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    // We send cookies only for non-default statement tenant clients.
+    if (idx > 3) {
+        opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    }

This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was changed to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.

Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.

Fixes #9505.

Closes #10673

(cherry picked from commit c83393e819)
2022-06-01 17:20:30 +03:00
Avi Kivity
ff79228178 Merge 'Allow trigger off strategy compaction early for node operations' from Asias He
This patch set adds two commits that allow off-strategy compaction to be triggered early for node operations.

*) repair: Repair table by table internally

This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.

Before:

```
for range in ranges
   for table in tables
       repair(range, table)
```

After:

```
for table in tables
    for range in ranges
       repair(range, table)
```

The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This reduces the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.
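
The arithmetic above can be checked with a small model. It is a simplification under stated assumptions: each (range, table) repair yields exactly one sstable, the old order compacts only at the end of the job, and the new order compacts as soon as a table finishes. The function name is made up for this sketch.

```python
def peak_pending_sstables(tables: int, ranges: int, table_by_table: bool) -> int:
    """Peak number of sstables waiting for off-strategy compaction."""
    pending = 0
    peak = 0
    if table_by_table:
        # New order: finish all ranges of one table, then compact it.
        for _table in range(tables):
            for _range in range(ranges):
                pending += 1
                peak = max(peak, pending)
            pending = 0  # table finished, off-strategy compaction runs
    else:
        # Old order: every table stays unfinished until the last range.
        for _range in range(ranges):
            for _table in range(tables):
                pending += 1
                peak = max(peak, pending)
    return peak
```

For 50 tables and 256 ranges the model gives a peak of 12800 sstables in the old order and 256 in the new one, which is the 50X reduction quoted above.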

This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.

Refs: #10462

*) repair: Trigger off strategy compaction after all ranges of a table is repaired

When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.

To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.

Refs: #10462

Closes #10551

* github.com:scylladb/scylla:
  repair: Trigger off strategy compaction after all ranges of a table is repaired
  repair: Repair table by table internally

(cherry picked from commit e65b3ed50a)
2022-06-01 14:17:01 +03:00
Nadav Har'El
1803124cc6 alternator: allow DescribeTimeToLive even without TTL enabled
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!

This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.

The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.

After this patch, the following test now works on Scylla without
experimental flags turned on:

    test/alternator/run test_ttl.py::test_describe_ttl_without_ttl

Refs #10660

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8ecf1e306f)
2022-05-30 20:08:41 +03:00
Takuya ASADA
6fcbf66bfb scylla_sysconfig_setup: handle >=32CPUs correctly
It seems that 59adf05 has a bug: the regex pattern only handles the cpuset
pattern for the first 32 CPUs and ignores the rest.
We should extend the regex pattern to handle all CPUs.
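
A sketch of the difference, assuming the mask is printed as comma-separated 32-bit hex words (for example "0x0000ffff,0xffffffff") and may be preceded by warning lines. Both patterns and the `extract_cpu_mask` helper are illustrative assumptions, not the expressions used in scylla_sysconfig_setup.

```python
import re

# One or more comma-separated hex words, e.g. "0x0000ffff,0xffffffff".
MASK_RE = re.compile(r"0x[0-9a-fA-F]+(?:,0x[0-9a-fA-F]+)*")
# Narrower pattern that stops after the first word (illustrates the bug).
FIRST_WORD_RE = re.compile(r"0x[0-9a-fA-F]+")


def extract_cpu_mask(output: str, pattern=MASK_RE):
    """Return the mask from the last line that starts with one, or None."""
    for line in reversed(output.splitlines()):
        m = pattern.match(line.strip())
        if m:
            return m.group(0)
    return None
```

With a 48-CPU style mask, the narrower pattern silently returns only the first word, while the extended one returns the whole mask.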

Fixes #10523

Closes #10524

(cherry picked from commit a9dfe5a8f4)
2022-05-30 14:27:27 +03:00
Takuya ASADA
e9a3dee234 scylla_sysconfig_setup: avoid perse error on perftune.py --get-cpu-mask
Currently, we just pass the entire output of perftune.py when getting the CPU
mask from the script, but it may cause a parse error since the script may
also print warning messages.

To avoid that, we need to extract CPU mask from the output.

Fixes #10082

Closes #10107

(cherry picked from commit 59adf05951)
2022-05-30 14:25:21 +03:00
Avi Kivity
279cd44c7f Update seastar submodule (xfs project attribute zeroed)
* seastar 6745a43c10...7a430a0830 (1):
  > file: don't trample on xfs flags when setting xfs size hint

Fixes #10667.
2022-05-29 17:43:43 +03:00
Avi Kivity
c99f768381 Merge 'Rework off strategy compaction locking for branch 5.0' from Raphael "Raph" Carvalho
First patch removes incorrect usage of rwlock which should be restricted to minor and major compaction tasks.

Second patch revives a semaphore, which was lost in 6737c88045, as we want major to not wait on off-strategy completion before deciding whether or not it should proceed with execution. It wouldn't proceed with execution if user asked major to stop while waiting for a chance to run.

For master, we're going to rely on abortable variant of get_units() to allow major to be quickly aborted.

Fixes #10485.

Closes #10582

* github.com:scylladb/scylla:
  compaction_manager: Revive custom job semaphore
  compaction_manager: Remove rwlock usage in run_custom_job()
2022-05-29 17:38:01 +03:00
Tomasz Grabiec
89a540d54a sstable: partition_index_cache: Fix abort on bad_alloc during page loading
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.

The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch makes a more
general approach by relaxing the assert.

The assert manifested like this:

scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.

Fixes #10617

Closes #10653

(cherry picked from commit f87274f66a)
2022-05-27 09:50:32 +03:00
Yaron Kaikov
338edcc02e release: prepare for 5.0.rc6 2022-05-23 11:37:37 +03:00
Avi Kivity
a8eb5164b2 Update seastar submodule (io_queue delay metrics in 25ms granularity)
* seastar 4a30c44c4c...6745a43c10 (1):
  > metrics: Report IO total times as real numbers

Ref #10392
2022-05-19 18:20:15 +03:00
Raphael S. Carvalho
9accb44f9c compaction_manager: Revive custom job semaphore
In commit 6737c88045, we started using a single semaphore for
maintenance operations, which is a good change.

However, after introduction of off-strategy, major cannot proceed
until off-strategy is done reshaping all its input files.

If user requests major to abort, the command will only return
once off-strategy is done, and that can take lots of time.

In master, we'll allow pending major to be quickly aborted, but
that's not possible here as abortable variant of get_units()
is not available yet.

Here, we'll allow major to proceed in parallel to off-strategy,
so major can decide whether or not it should run in parallel.

Fixes #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-16 20:46:31 -03:00
Raphael S. Carvalho
8878007106 compaction_manager: Remove rwlock usage in run_custom_job()
The rwlock usage was introduced in 2017 commit 10eaa2339e.

Resharding was online back then and we wanted to serialize it with
major.

Rwlock usage should be restricted to major and minor, as clearly
stated in the documentation, but we're still using it in
run_custom_job().

It gains us nothing, it only prevents off-strategy and other
custom jobs from running concurrently to major.

Let's kill this as we want to allow off-strategy to not prevent
a major from happening in parallel, as the former works only
on the maintenance sstable set and won't interfere with
the latter.

Refs #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-16 20:45:54 -03:00
Yaron Kaikov
9da666e778 release: prepare for 5.0.rc5 2022-05-15 22:09:16 +03:00
Benny Halevy
aca355dec1 table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and prevent further flushes
during table::clear, in particular calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
2022-05-15 13:39:03 +03:00
Raphael S. Carvalho
efbb2efd3f compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular compaction for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
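
The guard can be pictured with Python's `None` playing the role of a disengaged optional. `LeveledState` and its method signature are hypothetical; only the shape of the check follows the description above.

```python
from typing import Optional


class LeveledState:
    """Hypothetical stand-in for the strategy's per-level bookkeeping."""

    def __init__(self) -> None:
        # Stays None until regular compaction runs for the first time.
        self.last_compacted_keys: Optional[list] = None

    def notify_completion(self, level: int, last_key: str) -> bool:
        # Skip the update when the state was never initialized, instead of
        # dereferencing an empty optional.
        if self.last_compacted_keys is None:
            return False
        self.last_compacted_keys[level] = last_key
        return True
```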

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508

(cherry picked from commit 8e99d3912e)
2022-05-15 13:20:11 +03:00
Eliran Sinvani
44dc5c4a1d Revert "table: disable_auto_compaction: stop ongoing compactions"
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported. After a
bisect, it was found that the change that caused it was adding some
code to table::disable_auto_compaction which stops ongoing
compactions and returns a future that resolves once all the compaction
tasks for a table, if any, have terminated. It turns out that this function
is used only at startup (and in REST API calls which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. It is then re-enabled after the resharding and loading
is done.
For a still-unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough, this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This revert regains the performance but also undoes a change which eventually
should get in once we find the actual culprit.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10559

Reopens #9313.

(cherry picked from commit 8e8dc2c930)
2022-05-15 08:50:38 +03:00
Juliusz Stasiewicz
6b34ba3a4f CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503

(cherry picked from commit 603dd72f9e)
2022-05-10 14:04:52 +02:00
Yaron Kaikov
f1e25cb4a6 release: prepare for 5.0.rc4 2022-05-10 07:35:53 +03:00
Benny Halevy
c9798746ae compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.

Refs #10418
(to be closed when the fix reaches branch-4.6)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10419

(cherry picked from commit 01f41630a5)
2022-05-09 09:35:53 +03:00
Eliran Sinvani
7f70ffc5ce prepared_statements: Invalidate batch statement too
It seems that batch prepared statements always return false for
depends_on. This in turn makes the removal criterion for the
prepared statements cache always false, which results in the
queries not being evicted.
Here we change the function to return the true state, meaning
they will return true if one of the sub-queries is dependent
upon the keyspace and/or column family.
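
A minimal sketch of the intended behavior. The `Statement` and `BatchStatement` classes are hypothetical Python stand-ins; only the name `depends_on` and its keyspace-plus-optional-table shape come from the commit messages in this series.

```python
from typing import Optional


class Statement:
    """Hypothetical single-table statement."""

    def __init__(self, keyspace: str, table: str) -> None:
        self.keyspace, self.table = keyspace, table

    def depends_on(self, keyspace: str, table: Optional[str] = None) -> bool:
        # Match on keyspace alone, or on keyspace and table when one is given.
        return self.keyspace == keyspace and (table is None or self.table == table)


class BatchStatement:
    """Hypothetical batch: depends on whatever any sub-statement depends on."""

    def __init__(self, statements) -> None:
        self.statements = list(statements)

    def depends_on(self, keyspace: str, table: Optional[str] = None) -> bool:
        return any(s.depends_on(keyspace, table) for s in self.statements)
```

A cache eviction predicate built on `depends_on` then also matches batches that touch the changed keyspace or table.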

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
2022-05-08 12:31:42 +03:00
Eliran Sinvani
551636ec89 cql3 statements: Change dependency test API to express better it's
purpose

Cql statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The latter took only a table name as a
parameter, which makes no sense. There could be multiple tables with the same
name, each in a different keyspace, and it doesn't make sense to
generalize the test - i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one, depends_on, which takes a
keyspace name and optionally also a table name, so that every logical
dependency test that makes sense is supported by a single API call.

(cherry picked from commit bf50dbd35b)

Ref #10129
2022-05-08 12:31:02 +03:00
Raphael S. Carvalho
e1130a01e7 table: Close reader if flush fails to peek into fragment
An OOM failure while peeking into fragment, to determine if reader will
produce any fragment, causes Scylla to abort as flat_mutation_reader
expects reader to be closed before destroyed. Let's close it if
peek() fails, to handle the scenario more gracefully.

Fixes #10027.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>
(cherry picked from commit 755cec1199)
2022-05-08 12:16:15 +03:00
Calle Wilund
b0233cb7c5 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)
2022-05-05 11:38:18 +02:00
Avi Kivity
e480c5bf4d Merge 'loading_cache: force minimum size of unprivileged ' from Piotr Grabowski
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.

When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.

This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
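
The example above can be replayed with a toy model. It is a sketch under assumptions: the privileged section is tracked only as a count, the unprivileged section is a plain LRU queue, and the "small" test is applied after the new entry is added. `simulate` is a made-up helper, not part of loading_cache.

```python
from collections import deque


def simulate(max_size: int, privileged: int, inserts: int, keep_min_unprivileged: bool):
    """Insert `inserts` new unprivileged entries; return what survives."""
    unprivileged = deque()
    for entry in range(1, inserts + 1):
        unprivileged.append(entry)
        while privileged + len(unprivileged) > max_size:
            small = len(unprivileged) < max_size / 2
            if keep_min_unprivileged and small and privileged > 0:
                privileged -= 1          # evict from the privileged section
            else:
                unprivileged.popleft()   # evict the oldest unprivileged entry
    return privileged, list(unprivileged)
```

With `max_size=50`, 49 privileged entries and 5 inserts, the old policy leaves only entry 5 in the unprivileged section, while the new policy keeps all five and shrinks the privileged section to 45.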

To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.

New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.

Fixes #10440.

Closes #10456

* github.com:scylladb/scylla:
  loading_cache_test: test prepared stmts cache
  loading_cache: force minimum size of unprivileged
  loading_cache: extract dropping entries to lambdas
  loading_cache: separately track size of sections
  loading_cache: fix typo in 'privileged'

(cherry picked from commit 5169ce40ef)
2022-05-04 14:35:53 +03:00
Tomasz Grabiec
7d90f7e93f loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 14:35:37 +03:00
Avi Kivity
3e6e8579c6 loading_cache: fix indentation of timestamped_val and two nested type aliases
timestamped_val (and two other type aliases) are nested inside loading_cache,
but indented as if they were top-level names. Adjust the indent to
avoid confusion.

Closes #10118

(cherry picked from commit d1a394fd97)

Ref #10117 - backport prerequisite
2022-05-04 14:35:15 +03:00
Avi Kivity
3e98e17d18 Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 17:22:57 +03:00
Avi Kivity
a214f8cf6e Update tools/java submodule (bad IPv6 addresses in nodetool)
* tools/java b1e09c8b8f...2241a63bda (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442
2022-04-28 11:33:15 +03:00
Benny Halevy
e8b92fe34d replica: distributed_database: populate_column_family: trigger offstrategy compaction only for the base directory
In https://github.com/scylladb/scylla/issues/10218
we see off-strategy compaction happening on a table
during the initial phases of
`distributed_loader::populate_column_family`.

It is caused by triggering offstrategy compaction
too early, when sstables are populated from the staging
directory in a144d30162.

We need to trigger offstrategy compaction only of the base
table directory, never the staging or quarantine dirs.

Fixes #10218

Test: unit(dev)
DTest: materialized_views_test.py::TestInterruptBuildProcess

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220316152812.3344634-1-bhalevy@scylladb.com>
(cherry picked from commit a1d0f089c8)
2022-04-24 17:38:53 +03:00
Nadav Har'El
fa479c84ac config: fix some types in system.config virtual table
The system.config virtual table prints each configuration variable of
type T based on the JSON printer specified in the config_type_for<T>
in db/config.cc.

For two variable types - experimental_features and tri_mode_restriction,
the specified converter was wrong: We used value_to_json<string> or
value_to_json<vector<string>> on something which was *not* a string.
Unfortunately, value_to_json silently cast the given objects into
strings, and the result was garbage: For example as noted in #10047,
for experimental_features instead of printing a list of feature *names*,
e.g., "raft", we got a bizarre list of one-byte strings with each feature's
number (which isn't documented or even guaranteed to not change) as well
as carriage-return characters (!?).

So the solution is a new printable_to_json<T> which works on a type T that
can be printed with operator<< - as in fact the above two types can -
and the type is converted into a string or vector of strings using this
operator<<, not a cast.

Also added a cql-pytest test for reading system.config and in particular
options of the above two types - checking that they contain sensible
strings and not "garbage" like before this patch.

Fixes #10047.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209090421.298849-1-nyh@scylladb.com>
(cherry picked from commit fef7934a2d)
2022-04-14 19:29:08 +03:00
Tomasz Grabiec
40c26dd2c5 utils/chunked_managed_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no user impact.

Fixes #10364.

Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
2022-04-13 09:48:34 +03:00
Tomasz Grabiec
2c6f069fd1 utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)
2022-04-13 09:47:24 +03:00
Avi Kivity
e27dff0c50 transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.
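
The downgrade can be sketched as a small mapping. The numeric codes are the ones defined by the CQL native protocol specification; the `ErrorCode` enum and `code_for_protocol` helper are illustrative Python names, not Scylla's transport code.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    # Values from the CQL native protocol specification.
    WRITE_TIMEOUT = 0x1100
    READ_TIMEOUT = 0x1200
    READ_FAILURE = 0x1300
    WRITE_FAILURE = 0x1500


# Failures introduced in v4 and the timeout each one degrades to.
_V3_DOWNGRADE = {
    ErrorCode.WRITE_FAILURE: ErrorCode.WRITE_TIMEOUT,
    ErrorCode.READ_FAILURE: ErrorCode.READ_TIMEOUT,
}


def code_for_protocol(code: ErrorCode, protocol_version: int) -> ErrorCode:
    # Clients older than v4 do not know the *_FAILURE codes.
    if protocol_version < 4:
        return _V3_DOWNGRADE.get(code, code)
    return code
```

The bug amounted to changing the exception type while still sending 0x1500 or 0x1300 on the wire; the mapping keeps the code and the exception type in step.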

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.

Fixes #5610.

Closes #10362

(cherry picked from commit 987e6533d2)
2022-04-13 09:47:24 +03:00
Tomasz Grabiec
3f03260ffb utils/chunked_managed_vector: Fix corruption in case there is more than one chunk
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.

Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.
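
A toy model of the invariant, assuming fixed-size chunks and a `reserve()` that may allocate several empty chunks ahead of the data. `ChunkedVector` here is a hypothetical Python class for illustration, not the C++ container.

```python
class ChunkedVector:
    """Toy chunked vector; chunk size and layout are illustrative only."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size
        self.chunks = []   # each chunk is a list of at most chunk_size items
        self.size = 0

    def reserve(self, n: int) -> None:
        # Allocate enough empty chunks to hold n items.
        while len(self.chunks) * self.chunk_size < n:
            self.chunks.append([])

    def push_back(self, item) -> None:
        self.reserve(self.size + 1)
        # Pick the chunk from the size, not self.chunks[-1]: after reserve()
        # the last chunk may be several chunks past the insertion point.
        self.chunks[self.size // self.chunk_size].append(item)
        self.size += 1

    def pop_back(self):
        if self.size == 0:
            raise IndexError("pop from empty ChunkedVector")
        self.size -= 1
        return self.chunks[self.size // self.chunk_size].pop()

    def __getitem__(self, i: int):
        return self.chunks[i // self.chunk_size][i % self.chunk_size]
```

Using `self.chunks[-1]` in `push_back()` or `pop_back()` would touch the wrong chunk as soon as `reserve()` has allocated more than one chunk beyond the current size.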

Currently, the container is only used in the sstable partition index
cache.

Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.

Introduced in 78e5b9fd85 (4.6.0)

Fixes #10290

Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
2022-04-08 10:53:33 +03:00
Takuya ASADA
1315135fca docker: enable --log-to-stdout which mistakenly disabled
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just default configuration of
scylla-server .deb package, --log-to-stdout is 0, same as normal installation.

We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.

Fixes #10270

Closes #10280

(cherry picked from commit bdefea7c82)
2022-04-07 12:13:19 +03:00
Yaron Kaikov
f92622e0de release: prepare for 5.0.rc3 2022-04-06 14:31:03 +03:00
Takuya ASADA
3bca608db5 docker: run scylla as root
Previous versions of the Docker image ran scylla as root, but cb19048
accidentally changed it to the scylla user.
To keep compatibility we need to revert this to root.

Fixes #10261

Closes #10325

(cherry picked from commit f95a531407)
2022-04-05 12:46:25 +03:00
Takuya ASADA
a93b72d5dd docker: revert scylla-server.conf service name change
We changed supervisor service name at cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to previous one.

Fixes #10269

Closes #10323

(cherry picked from commit 41edc045d9)
2022-04-05 12:40:59 +03:00
Benny Halevy
d58ca2edbd range_tombstone_list: insert_from: correct rev.update range_tombstone in not overlapping case
2nd std::move(start) looks like a typo
in fe2fa3f20d.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220404124741.1775076-1-bhalevy@scylladb.com>
(cherry picked from commit 2d80057617)

Fixes #10326
2022-04-05 12:39:13 +03:00
Alexey Kartashov
75740ace2a dist/docker: fix incorrect locale value
Docker build script contains an incorrect locale specification for LC_ALL setting,
this commit fixes that.

Fixes #10310

Closes #10321

(cherry picked from commit d86c3a8061)
2022-04-04 12:51:02 +03:00
Piotr Sarna
d7a1bf6331 cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
2022-04-03 11:20:49 +03:00
Avi Kivity
bbd7d657cc Update seastar submodule (pidof command not installed)
* seastar 1c0d622ba0...4a30c44c4c (1):
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238.
2022-03-29 12:36:06 +03:00
Avi Kivity
f5bf4c81d1 Merge 'replica/database: truncate: temporarily disable compaction on table and views before flush' from Benny Halevy
Flushing the base table triggers view building
and corresponding compactions on the view tables.

Temporarily disable compaction on both the base
table and all its views before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.

This should make truncate go faster.

In the process, this series also embeds `database::truncate_views`
into `truncate` and coroutinizes both.

Refs #6309

Test: unit(dev)

Closes #10203

* github.com:scylladb/scylla:
  replica/database: truncate: fixup indentation
  replica/database: truncate: temporarily disable compaction on table and views before flush
  replica/database: truncate: coroutinize per-view logic
  replica/database: open-code truncate_view in truncate
  replica/database: truncate: coroutinize run_with_compaction_disabled lambda
  replica/database: coroutinize truncate
  compaction_manager: add disable_compaction method

(cherry picked from commit aab052c0d5)
2022-03-28 15:40:40 +03:00
Benny Halevy
02e8336659 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
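
The ordering described above can be sketched in Python as follows. This is only a simplified model, not the C++ code: the `Cell` type and `compare_expiring_cells` are hypothetical names, all other cell attributes are taken to be equal, and the assumption is that the greater cell is the one kept by the merge (with a later expiry comparing greater).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    expiry: int  # absolute expiration time, in seconds
    ttl: int     # time-to-live the cell was written with, in seconds

def compare_expiring_cells(left, right):
    """Return <0, 0 or >0; the greater cell is the one the merge keeps."""
    # Assumption: a later expiry compares greater.
    if left.expiry != right.expiry:
        return -1 if left.expiry < right.expiry else 1
    # Same expiry: compare ttl in reverse order, so the smaller ttl
    # (the cell written at a later wall-clock time) compares greater.
    if left.ttl != right.ttl:
        return 1 if left.ttl < right.ttl else -1
    return 0

older = Cell(expiry=1000, ttl=100)  # written at t=900
newer = Cell(expiry=1000, ttl=60)   # written at t=940
winner = newer if compare_expiring_cells(older, newer) < 0 else older
```

Here `winner` is the cell with ttl 60, i.e. the one written later.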

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:00:11 +02:00
Benny Halevy
601812e11b atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the current time from the expiry
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.
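
A small arithmetic sketch of how this can happen, assuming the ttl is derived as expiry minus the time of migration (`ttl_for_migration` is a hypothetical helper, not the migrator's actual code):

```python
def ttl_for_migration(expiry, now):
    # The ttl is whatever time is left until the fixed expiry.
    return expiry - now

expiry = 1_700_000_000
first_ttl = ttl_for_migration(expiry, now=1_699_999_000)   # first migration run
second_ttl = ttl_for_migration(expiry, now=1_699_999_400)  # a re-run 400s later
```

Both runs write a cell with the same expiry, but `first_ttl` is 1000 and `second_ttl` is 600, so the two cells hash differently.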

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:00:11 +02:00
Benny Halevy
ea466320d2 atomic_cell: compare_atomic_cell_for_merge: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com>
(cherry picked from commit d43da5d6dc)
2022-03-24 18:00:11 +02:00
Benny Halevy
25ea831a15 atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deletion_time comparison
No need to first check that the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.

If they are the same the function returns std::strong_ordering::equal
anyhow and that is the same as `<=>` comparing identical values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
(cherry picked from commit be865a29b8)
2022-03-24 18:00:11 +02:00
Benny Halevy
8648c79c9e main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
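
The decision described above could be sketched like this in Python. The set of "environmental" errno values below is purely illustrative (the message does not list which errors are detected), and `shutdown_action` is a hypothetical helper that only returns the decision instead of terminating the process.

```python
import errno

# Assumption: an illustrative set of errors caused by the system, not by scylla.
ENVIRONMENTAL_ERRNOS = {errno.EIO, errno.ENOSPC, errno.EMFILE}

def shutdown_action(exc):
    """Decide how to terminate after an unhandled error during shutdown."""
    if isinstance(exc, OSError) and exc.errno in ENVIRONMENTAL_ERRNOS:
        # A real caller would use os._exit(255) here: exit early, no core dump.
        return "exit_255"
    # A real caller would use os.abort() here: explicit abort, core dump kept.
    return "abort"
```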

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
2022-03-24 14:48:52 +02:00
Nadav Har'El
7ae4d0e6f8 Seastar: backport Seastar fix for missing string escape in JSON output
Backported Seastar fix:
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz

Fixes #9061

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-23 20:29:50 +02:00
Piotr Sarna
f3564db941 expression: fix get_value for mismatched column definitions
As observed in #10026, after schema changes it somehow happened
that a column definition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
 - avoiding segfaults/memory corruption
 - making it easier to investigate the root cause of #10026
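
A minimal Python sketch of the same pattern, with hypothetical names (`find_column_index`, `get_value`). Python makes for a fitting analogy: indexing with -1 does not crash, it silently returns the last element, which is the kind of wrong-data access the explicit check guards against.

```python
def find_column_index(columns, name):
    """Return the position of `name` in `columns`, or -1 if it is absent."""
    for i, column in enumerate(columns):
        if column == name:
            return i
    return -1

def get_value(columns, row, name):
    idx = find_column_index(columns, name)
    if idx < 0:
        # Explicit check: fail loudly instead of reading row[-1].
        raise LookupError(f"column {name!r} not found in base table columns")
    return row[idx]
```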

Closes #10039

(cherry picked from commit 7b364fec9849e9a342af1c240e3a7185bf5401ef)
2022-03-21 10:37:48 +01:00
Pavel Emelyanov
97caf12836 Update seastar submodule (IO preemption overlap)
* seastar 47573503...8ef87d48 (3):
  > io_queue: Don't let preemption overlap requests
  > io_queue: Pending needs to keep capacity instead of ticket
  > io_queue: Extend grab_capacity() return codes

Fixes #10233
2022-03-17 11:26:38 +03:00
Yaron Kaikov
839d9ef41a release: prepare for 5.0.rc2 2022-03-16 14:35:52 +02:00
Benny Halevy
782bd50f92 compaction_manager: rewrite_sstables: do not acquire table write lock
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables) seen in #10060, and
possibly #10166.

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ffc3c)
2022-03-14 13:13:48 +02:00
Avi Kivity
0a4d971b4a Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

The series also adds two safety checks to LSA to catch such problems earlier.

Fixes #10056

\cc @slivne @bhalevy

Closes #10130

* github.com:scylladb/scylla:
  lsa: Abort when trying to free a standard allocator object not allocated through the region
  lsa: Abort when _non_lsa_memory_in_use goes negative
  tests: utils: cached_file: Validate occupancy after eviction
  test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
  utils: cached_file: Fix alloc-dealloc mismatch during eviction

(cherry picked from commit ff2cd72766)
2022-02-26 11:28:36 +02:00
Benny Halevy
22562f767f cql3: result_set: remove std::ref from comparator&
Applying std::ref on `RowComparator& cmp` hits the
following compilation error on Fedora 34 with
libstdc++-devel-11.2.1-9.fc34.x86_64

```
FAILED: build/dev/cql3/statements/select_statement.o
clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1  -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20  -ffile-prefix-map=/home/bhalevy/dev/scylla=.  -march=westmere -DBOOST_TEST_DYN_LINK   -Iabseil -fvisibility=hidden  -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT  -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc
In file included from cql3/statements/select_statement.cc:14:
In file included from ./cql3/statements/select_statement.hh:16:
In file included from ./cql3/statements/raw/select_statement.hh:16:
In file included from ./cql3/statements/raw/cf_statement.hh:16:
In file included from ./cql3/cf_name.hh:16:
In file included from ./cql3/keyspace_element_name.hh:16:
In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself
                = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))>
                                                     ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here
        reference_wrapper(_Up&& __uref)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)]
      = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>;
                                                        ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here
    : public __is_nothrow_constructible_impl<_Tp, _Args...>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
          return __and_<typename _Base::_Local_storage,
                 ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
              std::__partial_sort(__first, __last, __last, __comp);
                   ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
          std::__introsort_loop(__first, __last,
               ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
      std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp));
           ^
./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here
        std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
             ^
cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here
                rs->sort(_ordering_comparator);
                    ^
1 error generated.
ninja: build stopped: subcommand failed.
```

Fixes #10079.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com>
(cherry picked from commit 3e20fee070)

[avi: backport for developer quality-of-life rather than as a bug fix]
2022-02-16 10:07:11 +02:00
Raphael S. Carvalho
eb80dd1db5 Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME"
This reverts commit 4c05e5f966.

Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.

Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.

Fixes #10060.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
(cherry picked from commit a9427f150a)
2022-02-14 18:05:43 +02:00
Avi Kivity
51d699ee21 Update seastar submodule (overzealous log silencer)
* seastar 0d250d15ac...47573503cd (1):
  > log: Fix silencer to be shard-local and logger-global
Fixes #9784.
2022-02-14 17:54:54 +02:00
Avi Kivity
83a33bff8c Point seastar submodule at scylla-seastar.git
This allows us to backport Seastar fixes to this branch.
2022-02-14 17:54:16 +02:00
Nadav Har'El
273563b9ad alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
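
The resulting behavior can be sketched as follows, modelling an item as a plain Python dict and handling only two-level paths. `remove_nested` and `ValidationError` are hypothetical names, and treating an "x" that exists but is not a map as an error is an assumption of this sketch.

```python
class ValidationError(Exception):
    pass

def remove_nested(item, path):
    """Apply 'REMOVE x.y' to `item` in place (two-level paths only)."""
    top, key = path.split(".", 1)
    if top not in item or not isinstance(item[top], dict):
        # "x" itself is missing (or, by assumption, not a map): still an error.
        raise ValidationError("document paths not valid for this item")
    # "x" is a map: removing a missing "x.y" silently does nothing.
    item[top].pop(key, None)
    return item
```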

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 11:37:31 +02:00
Yaron Kaikov
891990ec09 release: prepare for 5.0.rc1 2022-02-06 16:41:05 +02:00
Yaron Kaikov
da0cd2b107 release: prepare for 5.0.rc0 2022-02-03 08:10:30 +02:00
Calle Wilund
1e66043412 commitlog: Fix double clearing of _segment_allocating shared_future.
Fixes #10020

Previous fix 445e1d3 tried to close one double invocation, but added
another, since it failed to ensure all potential nullings of the opt
shared_future happened before a new allocator could reset it.

This simplifies the code by making clearing the shared_future a
pre-requisite for resolving its contents (as read by waiters).

Also removes any need for try-catch etc.

Closes #10024
2022-02-02 23:26:17 +02:00
Nadav Har'El
cb6630040d docker: don't repeat "--alternator-address" option twice
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.

So this patch fixes the scyllasetup.py script to only pass this
parameter once.
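
The idea, as a rough sketch rather than the actual scyllasetup.py code (`alternator_args` is a hypothetical helper):

```python
def alternator_args(address, port=None, https_port=None):
    """Build the Alternator command-line options for Scylla."""
    args = []
    if port is not None:
        args.append(f"--alternator-port={port}")
    if https_port is not None:
        args.append(f"--alternator-https-port={https_port}")
    if args:
        # Added once, no matter how many of the port options were given.
        args.append(f"--alternator-address={address}")
    return args
```

With both ports given, the returned list holds a single --alternator-address entry.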

Fixes #10016.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
2022-02-02 23:26:11 +02:00
Botond Dénes
d309a86708 Merge 'Add keyspace_offstrategy_compaction api' from Benny Halevy
This series adds methods to perform offstrategy compaction, if needed, returning a future<bool>
so the caller can wait on it until compaction completes.
The returned value is true iff offstrategy compaction was needed.

The added keyspace_offstrategy_compaction calls perform_offstrategy_compaction on the specified keyspace and tables, and returns the number of tables that required offstrategy compaction.

A respective unit test was added to the rest_api pytest.

This PR replaces https://github.com/scylladb/scylla/pull/9095 that suggested adding an option to `keyspace_compaction`
since offstrategy compaction triggering logic is different enough from major compaction meriting a new api.

Test: unit (dev)

Closes #9980

* github.com:scylladb/scylla:
  test: rest_api: add unit tests for keyspace_offstrategy_compaction api
  api: add keyspace_offstrategy_compaction
  compaction_manager: get rid of submit_offstrategy
  table: add perform_offstrategy_compaction
  compaction_manager: perform_offstrategy: print ks.cf in log messages
  compaction_manager: allow waiting on offstrategy compaction
2022-02-02 13:15:31 +02:00
Nadav Har'El
79776ff2ff alternator: fix error handling during Alternator startup
A recent restructuring of the startup of Alternator (and also other
protocol servers) led to incorrect error-handling behavior during
startup: If an error was detected on one of the shards of the sharded
service (in alternator/server.cc), the sharded service itself was never
stopped (in alternator/controller.cc), leading to an assertion failure
instead of the desired error message.

A common example of this problem is when the requested port for the
server was already taken (this was issue #9914).

So in this patch, exception handling is removed from server.cc - the
exception will propagate to the code in controller.cc, which will
properly stop the server (including the sharded services) before
returning.

Fixes #9914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220130131709.1166716-1-nyh@scylladb.com>
2022-02-02 10:35:57 +01:00
Pavel Emelyanov
a026b4ef49 config: Add option to disable config updates via CQL
The system.config table allows changing config parameters, but this
change doesn't survive restarts and is considered to be dangerous
(sometimes). Add an option to disable the table updates. The option
is LiveUpdate and can be set to false via CQL too (once).

fixes #9976

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220201121114.32503-1-xemul@scylladb.com>
2022-02-01 14:30:47 +02:00
Takuya ASADA
c2ccdac297 move cloud related code from scylla repository to scylla-machine-image
Currently, cloud related code has cross-dependencies between
scylla and scylla-machine-image.
This is not a good way to implement it, and a single change can break both
packages.

To resolve the issue, we need to move all cloud related code to
scylla-machine-image, and remove them from scylla repository.

Change list:
 - move cloud part of scylla_util.py to scylla-machine-image
 - move cloud part of scylla_io_setup to scylla-machine-image
 - move scylla_ec2_check to scylla-machine-image
 - move cloud part of scylla_bootparam_setup to scylla-machine-image

Closes #9957
2022-02-01 11:26:59 +02:00
Tomasz Grabiec
00a9326ae7 Merge "raft: let modify_config finish on a follower that removes itself" from Kamil
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.

There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.

`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.

We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.

Fixes #9981.

A regression test was also added in randomized_nemesis_test.

* kbr/modify-config-finishes-v1:
  test: raft: randomized_nemesis_test: regression test for #9981
  raft: server: don't create local waiter in `modify_config`
2022-01-31 20:14:50 +01:00
Kamil Braun
97ff98f3a7 service: migration_manager: retry schema change command on transient failures
The call to `raft::server::add_entry` in `announce_with_raft` may fail
e.g. due to a leader change happening when we try to commit the entry.
In cases like this it makes sense to retry the command so we don't
prematurely report an error to the client.

This may result in double application of the command. Fortunately, the schema
change command is idempotent thanks to the group 0 state ID mechanism
(originally used to prevent conflicting concurrent changes from happening).
Indeed, once a command passes the state ID check, it changes the group 0
history last state ID, causing all later applications of that same
command to fail the check. Similarly, once a command fails the state ID
check, it means that the last state ID is different than the one
observed when the command was being constructed, so all further
applications of the command will also fail the check (it is not possible
for the last state ID to change from X to Y then back to X).
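
A toy model of this argument in Python. `Group0` here is a hypothetical in-memory stand-in, and a command is reduced to a (prev_state_id, new_state_id, change) tuple; none of this is the actual group 0 implementation.

```python
class Group0:
    def __init__(self, initial_state_id):
        self.last_state_id = initial_state_id
        self.applied = []

    def apply(self, command):
        prev_state_id, new_state_id, change = command
        # State ID check: only apply on top of the state the command observed.
        if prev_state_id != self.last_state_id:
            return False
        self.applied.append(change)
        self.last_state_id = new_state_id
        return True

group0 = Group0(initial_state_id="X")
command = ("X", "Y", "CREATE TABLE t")
first = group0.apply(command)   # passes the check, moves the state X -> Y
second = group0.apply(command)  # the retried duplicate now fails the check
```

The change is applied exactly once, even though the command was delivered twice.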

Note that this reasoning only works for commands with `prev_state_id`
engaged, such as the ones which we're using in
`migration_manager::announce_with_raft`. It would not work with
"unconditional commands" where `prev_state_id` is `nullopt` - for those
commands no state ID check is performed. It could still be safe to retry
those commands if they are idempotent for a different reason.

(Note: actually, our schema commands are already idempotent even without
the state ID check, because they simply apply a set of mutations, and
applying the same mutations twice is the same as applying them once.)
Message-Id: <20220131152926.18087-1-kbraun@scylladb.com>
2022-01-31 19:49:31 +01:00
Takuya ASADA
218dd3851c scylla_swap_setup: add --swap-size-bytes
Currently, --swap-size is not able to specify an exact file size because
the option takes its parameter only in GB.
To fix the limitation, let's add --swap-size-bytes to specify the swap size
in bytes.
We need this to preallocate the swapfile while building the IaaS
image.
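
A possible shape for the option handling, as a sketch only. Making the two options mutually exclusive and treating 1 GB as 1024**3 bytes are both assumptions, and `parse_swap_size` is a hypothetical helper.

```python
import argparse

def parse_swap_size(argv):
    """Return the requested swap size in bytes, or None if not specified."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--swap-size", type=int, help="swap size in GB")
    group.add_argument("--swap-size-bytes", type=int, help="swap size in bytes")
    args = parser.parse_args(argv)
    if args.swap_size_bytes is not None:
        return args.swap_size_bytes
    if args.swap_size is not None:
        return args.swap_size * 1024 ** 3  # assumption: GB means 2**30 bytes
    return None
```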

see scylladb/scylla-machine-image#285

Closes #9971
2022-01-31 18:32:32 +02:00
Benny Halevy
4272dd0b28 storage_proxy: mutate_counter_on_leader_and_replicate: use container to get to shard proxy
Rather than using the global helper, get_local_storage_proxy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220131151516.3461049-2-bhalevy@scylladb.com>
2022-01-31 18:14:31 +02:00
Benny Halevy
8acdc6ebdc storage_proxy: paxos: don't use global storage_proxy
Rather than calling get_local_storage_proxy(),
use paxos_response_handler::_proxy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220131151516.3461049-1-bhalevy@scylladb.com>
2022-01-31 18:14:31 +02:00
Calle Wilund
445e1d3e41 commitlog: Ensure we never have more than one new_segment call at a time
Refs #9896

Found by @eliransin. The call to new_segment was wrapped in with_timeout.
This means that if the primary caller timed out, we would leave the new_segment
call running, but potentially issue new ones for the next caller.

This could lead to the reserve segment queue being read simultaneously, which
is not what we want.

Change all callers to always wait on the shared_future, and clear it
only on result (exception or segment).
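
The pattern can be illustrated with asyncio, as an analogy rather than the Seastar code: `SegmentAllocator` is a hypothetical class, a shared task plays the role of the shared_future, and `asyncio.shield` keeps one caller's timeout from cancelling the shared call.

```python
import asyncio

class SegmentAllocator:
    def __init__(self):
        self._pending = None  # the single in-flight allocation, if any
        self.calls = 0        # how many times _new_segment actually ran

    async def _new_segment(self):
        self.calls += 1
        await asyncio.sleep(0.05)  # pretend the allocation takes a while
        return f"segment-{self.calls}"

    async def _run(self):
        try:
            return await self._new_segment()
        finally:
            # Cleared only once a result (segment or exception) is available.
            self._pending = None

    async def allocate(self, timeout):
        if self._pending is None:
            self._pending = asyncio.ensure_future(self._run())
        # The timeout applies to this waiter only, not to the shared call.
        return await asyncio.wait_for(asyncio.shield(self._pending), timeout)

async def demo():
    alloc = SegmentAllocator()
    try:
        await alloc.allocate(timeout=0.001)  # first caller gives up early
    except asyncio.TimeoutError:
        pass
    segment = await alloc.allocate(timeout=1.0)  # joins the same call
    return alloc.calls, segment
```

Running `asyncio.run(demo())` shows that the second caller joins the allocation started by the first instead of starting another one.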

Closes #10001
2022-01-31 16:50:22 +02:00
Nadav Har'El
8a745593a2 Merge 'alternator: fill UnprocessedKeys for failed batch reads' from Piotr Sarna
The DynamoDB protocol specifies that when getting items in a batch
fails only partially, unprocessed keys can be returned so that
the user can perform a retry.
Alternator used to fail the whole request if any of the reads failed,
but right now it instead produces the list of unprocessed keys
and returns them to the user, as long as at least 1 read was
successful.
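
In Python terms the new behavior is roughly the following. The response shape is simplified (no per-table nesting), and `batch_get_item` and `read_one` are hypothetical names.

```python
def batch_get_item(keys, read_one):
    """Read every key; report failed keys unless all of the reads failed."""
    items, unprocessed = [], []
    first_error = None
    for key in keys:
        try:
            items.append(read_one(key))
        except Exception as exc:
            unprocessed.append(key)
            if first_error is None:
                first_error = exc
    if keys and not items:
        # Total failure: fail the whole request instead of returning all keys.
        raise first_error
    return {"Responses": items, "UnprocessedKeys": unprocessed}
```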

This series comes with a test based on Scylla's error injection mechanism, and thus is only useful in modes which come with error injection compiled in. In release mode, expect to see the following message:
SKIPPED (Error injection not enabled in Scylla - try compiling in dev/debug/sanitize mode)

Fixes #9984

Closes #9986

* github.com:scylladb/scylla:
  test: add total failure case for GetBatchItem
  test: add error injection case for GetBatchItem
  test: add a context manager for error injection to alternator
  alternator: add error injection to BatchGetItem
  alternator: fill UnprocessedKeys for failed batch reads
2022-01-31 15:28:24 +02:00
Piotr Sarna
c87126198d test: add total failure case for GetBatchItem
The test verifies that if all reads from a batch operation
failed, the result is an error, and not a success response
with UnprocessedKeys parameter set to all keys.
2022-01-31 14:21:55 +01:00
Piotr Sarna
e79c2943fc test: add error injection case for GetBatchItem
The new test case is based on Scylla error injection mechanism
and forces a partial read by failing some requests from the batch.
2022-01-31 14:21:55 +01:00
Piotr Sarna
99c5bec0e2 test: add a context manager for error injection to alternator
With the new context manager it's now easier to request an error
to be injected via REST API. Note that error injection is only
enabled in certain build modes (dev, debug, sanitize)
and the test case will be skipped if it's not possible to use
this mechanism.
2022-01-31 14:21:55 +01:00
Tomasz Grabiec
8297ae531d Merge "Automatically retry CQL DDL statements in presence of concurrent changes" from Kamil
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
`group0_concurrent_modification` exception.

Catch the exception in CQL DDL statement execution function and retry.
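
The retry loop might look like this sketch. `Group0ConcurrentModification` and `execute_with_retry` are hypothetical Python stand-ins, and the cap on the number of attempts is an assumption made only to keep the example bounded.

```python
class Group0ConcurrentModification(Exception):
    pass

def execute_with_retry(attempt_change, max_attempts=10):
    """Run `attempt_change`, retrying while it loses to a concurrent change."""
    for _ in range(max_attempts - 1):
        try:
            return attempt_change()
        except Group0ConcurrentModification:
            continue  # another change won the race; start over
    # Last attempt: let the exception, if any, reach the caller.
    return attempt_change()
```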

In addition, improve the description of CQL DDL statements
in group 0 history table.

Add a test which checks that group 0 history grows iff a schema change does
not throw `group0_concurrent_modification`. Also check that the retry
mechanism works as expected.

* kbr/ddl-retry-v1:
  test: unit test for group 0 concurrent change protection and CQL DDL retries
  cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes
2022-01-31 14:12:35 +01:00
Tomasz Grabiec
b78bab7286 Merge "raft: fixes and improvements to the library and nemesis test" from Kamil
Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration,
ticking speed of servers.

This allowed us to catch a serious bug, which is fixed in the first patch.

The patchset also fixes bugs in the test itself and adds quality of life
improvements such as better diagnostics when inconsistency is detected.

* kbr/nemesis-random-v1:
  test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
  test: raft: randomized_nemesis_test: print details when detecting inconsistency
  test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
  test: raft: randomized_nemesis_test: keep server id in impure_state_machine
  test: raft: randomized_nemesis_test: frequent snapshotting configuration
  test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
  test: raft: randomized_nemesis_test: simplify ticker
  test: raft: randomized_nemesis_test: randomize network delay
  test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
  test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
  test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
  test: raft: randomized_nemesis_test: fix obsolete comment
  raft: fsm: print configuration entries appearing in the log
  raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
  raft: server: abort snapshot applications before waiting for rpc abort
  raft: server: logging fix
  raft: fsm: don't advance commit index beyond matched entries
2022-01-31 13:25:27 +01:00
Calle Wilund
7ca72ffd19 database: Make wrapped version of timed_out_error a timed_out_error
Refs #9919

In a6202ae, throw_commitlog_add_error was added to ensure we had more
info on errors generated when writing to the commit log.

However, several call sites catch timed_out_error explicitly, without
checking for nested exceptions.
97bb1be and 868b572 tried to deal with it, by using check routines.
It turns out there are call sites left, and while these should be
changed, it is safer and quicker for now to just ensure that
iff we have a timed_out_error, we throw yet another timed_out_error.

Closes #10002
2022-01-31 14:15:23 +02:00
Piotr Sarna
d50ed944f2 alternator: add error injection to BatchGetItem
When error injection is enabled at compile time, it's now possible
to inject an error into BatchGetItem in order to produce a partial
read, i.e. when only part of the items were retrieved successfully.
2022-01-31 12:56:00 +01:00
Piotr Sarna
31f4f062a2 alternator: fill UnprocessedKeys for failed batch reads
DynamoDB protocol specifies that when getting items in a batch
failed only partially, unprocessed keys can be returned so that
the user can perform a retry.
Alternator used to fail the whole request if any of the reads failed,
but right now it instead produces the list of unprocessed keys
and returns them to the user, as long as at least 1 read was
successful.
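
From the client's point of view, the retry that UnprocessedKeys enables might look like the sketch below. `batch_get` is a hypothetical callable returning `(items, unprocessed_keys)`; a real client would wrap a BatchGetItem request, and the `max_rounds` limit is an assumption.

```python
def batch_get_all(batch_get, keys, max_rounds=5):
    """Fetch `keys`, re-requesting whatever comes back as unprocessed."""
    items = []
    pending = list(keys)
    for _ in range(max_rounds):
        if not pending:
            break
        got, pending = batch_get(pending)
        items.extend(got)
    return items, pending


# Fake backend for illustration: the first call fails to read key 2.
state = {"first": True}


def fake_batch_get(keys):
    if state["first"]:
        state["first"] = False
        return [k for k in keys if k != 2], [2]
    return list(keys), []


items, leftover = batch_get_all(fake_batch_get, [1, 2, 3])
```

Anything still in `leftover` after the last round is left to the caller, which matches the idea that the user, not the server, decides whether to keep retrying.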

NOTE: tested manually by compiling Scylla with error injection,
which fails every nth request. It's rather hard to figure out
an automatic test case for this scenario.

Fixes #9984
2022-01-31 12:56:00 +01:00
Mikołaj Sielużycki
93d6eb6d51 compacting_reader: Support fast_forward_to position range.
Fast forwarding is delegated to the underlying reader and assumes that
it's supported. The only corner case requiring special handling that has
shown up in the tests is producing partition start mutation in the
forwarding case if there are no other fragments.

compacting state keeps track of uncompacted partition start, but doesn't
emit it by default. If end of stream is reached without producing a
mutation fragment, partition start is not emitted. This is invalid
behaviour in the forwarding case, so I've added a public method to
compacting state to force marking partition as non-empty. I don't like
this solution, as it feels like breaking an abstraction, but I didn't
come across a better idea.

Tests: unit(dev, debug, release)

Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>
2022-01-31 13:37:36 +02:00
Nadav Har'El
a25e265373 test/alternator: improve comment on why we need "global_random"
Improve the comment that explains why we needed to use an explicitly
shared random sequence instead of the usual "random". We now understand
that we need this workaround to undo what the pytest-randomly plugin does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220130155557.1181345-1-nyh@scylladb.com>
2022-01-31 10:07:56 +01:00
Nadav Har'El
59fe6a402c test/cql-pytest: use unique keys instead of random keys
Some of the tests in test/cql-pytest share the same table but use
different keys to ensure they don't collide. Before this patch we used a
random key, which was usually fine, but we recently noticed that the
pytest-randomly plugin may cause different tests to run through the *same*
sequence of random numbers and ruin our intent that different tests use
different keys.

So instead of using a *random* key, let's use a *unique* key. We can
achieve this uniqueness trivially - using a counter variable - because
anyway the uniqueness is only needed inside a single temporary table -
which is different in every run.
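
A counter-based key generator of this kind can be sketched as below. The helper names and the `key` prefix are illustrative assumptions, not necessarily what the test utilities contain.

```python
import itertools

# One process-wide counter; uniqueness only matters within a single test run.
_counter = itertools.count(1)


def unique_key_int():
    """Return an integer never returned before in this process."""
    return next(_counter)


def unique_key_string():
    """Return a string key derived from the same counter."""
    return f"key{next(_counter)}"


a, b = unique_key_int(), unique_key_int()
```

Because the counter never depends on a seed, reseeding the `random` module between tests cannot make two tests pick the same key.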

Another benefit is that it will now be clearer that the tests are
deterministic and not random - the intent of a random_string() key
was never to randomly walk the entire key space (random_string()
anyway had a pretty narrow idea of what a random string looks like) -
it was just to get a unique key.

Refs #9988 (fixes it for cql-pytest, but not for test/alternator)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-31 09:01:23 +02:00
Benny Halevy
1c25934399 test: rest_api: add unit tests for keyspace_offstrategy_compaction api
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:40:40 +02:00
Benny Halevy
f6431824a7 api: add keyspace_offstrategy_compaction
Perform offstrategy compaction via the REST API with
a new `keyspace_offstrategy_compaction` option.

This is useful for performing offstrategy compaction
post repair, after repairing all token ranges.

Otherwise, offstrategy compaction will only be
auto-triggered after a 5-minute idle timeout.

Like major compaction, the api call returns the offstrategy
compaction task future, so it's waited on.
The `long` result counts the number of tables that required
offstrategy compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:40:39 +02:00
Benny Halevy
02bd84fe79 compaction_manager: get rid of submit_offstrategy
Now that the table layer is using perform_offstrategy,
submit_offstrategy is no longer in use and can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:09:35 +02:00
Benny Halevy
e03b6eeff8 table: add perform_offstrategy_compaction
Expose an async method to perform offstrategy compaction, if needed.

Returns a future<bool> that is resolved when offstrategy_compaction completes.
The future value is true iff offstrategy compaction was required.

To be used in a following patch by the storage_service api.

Call it from `trigger_offstrategy_compaction` that triggers
offstrategy compaction in the background and warn about ignored
failures.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:09:35 +02:00
Benny Halevy
6b8e88d047 compaction_manager: perform_offstrategy: print ks.cf in log messages
So it would be easier to relate the messages to the
table for which it was submitted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:09:35 +02:00
Benny Halevy
69883d464e compaction_manager: allow waiting on offstrategy compaction
Return a future from perform_offstrategy, resolved
when the offstrategy compaction completes so that callers
can wait on it.

submit_offstrategy still submits the offstrategy compaction
in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:09:35 +02:00
Tomasz Grabiec
b734615f51 util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests with assert failures, e.g:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
2022-01-30 19:57:35 +02:00
Benny Halevy
3cee0f8bd9 shared_token_metadata: mutate_token_metadata: bump cloned copy ring_version
Currently this is done only in
storage_service::get_mutable_token_metadata_ptr
but it needs to be done here as well for code paths
calling mutate_token_metadata directly.

Currently, it is only called from network_topology_strategy_test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220130152157.2596086-1-bhalevy@scylladb.com>
2022-01-30 18:15:08 +02:00
Piotr Sarna
471205bdcf test/alternator: use a global random generator for all test cases
It was observed (perhaps it depends on the Python implementation)
that an identical seed was used for multiple test cases,
which violated the assumption that generated values are in fact
unique. Using a global generator instead makes sure that it was
only seeded once.
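
The effect can be reproduced in a few lines. In this sketch `simulate_reseed` is a hypothetical stand-in for whatever resets the shared seed before each test (the later commit a25e265373 attributes this to the pytest-randomly plugin), and `global_random` is an illustrative name.

```python
import random

# Private generator, seeded once when the module is imported.
global_random = random.Random()


def simulate_reseed():
    """Stand-in for something resetting the shared `random` module's seed."""
    random.seed(12345)


simulate_reseed()
shared_a = random.getrandbits(64)
private_a = global_random.getrandbits(64)

simulate_reseed()
shared_b = random.getrandbits(64)
private_b = global_random.getrandbits(64)
```

After each reseed the module-level generator repeats its sequence, while the private `random.Random` instance keeps advancing because `random.seed()` does not touch it.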

Tests: unit(dev) # alternator tests used to fail for me locally
  before this patch was applied
Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>
2022-01-30 16:40:20 +02:00
Tomasz Grabiec
3e31126bdf Merge "Brush up the initial tokens generation code" from Pavel Emelyanov
On start the storage_service sets up initial tokens. Some dangling
variables, checks and code duplication had accumulated over time.

* xemul/br-storage-service-bootstrap-leftovers:
  dht: Use db::config to generate initial tookens
  database, dht: Move get_initial_tokens()
  storage_service: Factor out random/config tokens generation
  storage_service: No extra get_replace_address checks
  storage_service: Remove write-only local variable
2022-01-28 15:54:45 +01:00
Pavel Emelyanov
89a7c750ea Merge "Deglobalize repair_meta_map" from Benny
This series moves the static thread_local repair_meta_map instances
into the repair_service shards.

Refs #9809

Test: unit(release) (including scylla-gdb)
Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled](release)

* git@github.com:bhalevy/scylla.git deglobalize-repair_meta_map-v1
  repair_service: deglobalize get_next_repair_meta_id
  repair_service: deglobalize repair_meta_map
  repair_service: pass reference to service to row_level_repair_gossip_helper
  repair_meta: define repair_meta_ptr
  repair_meta: move static repair_meta map functions out of line
  repair_meta: make get_set_diff a free function
  repair: repair_meta: no need to keep sharded<netw::messaging_service>
  repair: repair_meta: derive subordinate services from repair_service
  repair: pass repair_service to repair_meta
2022-01-28 14:12:33 +02:00
Avi Kivity
34252eda26 Update seastar submodule
* seastar 5524f229b...0d250d15a (6):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed
Fixes #9982
  > Merge "Enhance io-tester and its rate-limited job" from Pavel E
  > queue: pop: assert that the queue is not empty
  > io_queue: properly declare io_queue_for_tests
  > reactor: Fix off-by-end-of-line misprint in legacy configuration
  > fair_queue: Fix move constructor
2022-01-28 14:12:33 +02:00
Tomasz Grabiec
7ee79fa770 logalloc: Add more logging
Message-Id: <20220127232009.314402-1-tgrabiec@scylladb.com>
2022-01-28 14:12:33 +02:00
Kamil Braun
d10b508380 test: raft: randomized_nemesis_test: regression test for #9981 2022-01-27 17:50:40 +01:00
Kamil Braun
28b5792481 raft: server: don't create local waiter in modify_config
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.

There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.

`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.

We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.

Fixes #9981.
2022-01-27 17:49:40 +01:00
Pavel Emelyanov
1525c04db3 dht: Use db::config to generate initial tookens
The replica::database is passed into the helper just to get the
config from. Better to use config directly without messing with
the database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
77532a6a36 database, dht: Move get_initial_tokens()
The helper in question has nothing to do with replica/database and
is only used by dht to convert config option to a set of tokens.
It sounds like the helper deserves living where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
50170366ea storage_service: Factor out random/config tokens generation
There's a place in normal node start that parses the initial_token
option or generates num_tokens random tokens. This code has been used almost
unchanged since being ported from its Java version. Later there appeared
the dht::get_bootstrap_token() with the same internal logic.

This patch generalizes these two places. Logging messages are unified
too (dtests seem not to check those).

The change also improves a corner case. The normal node startup code doesn't
check whether initial_token is empty and num_tokens is 0, in which case it
generates an empty bootstrap_tokens set and fails later with an obscure
'remove_endpoint should be used instead' message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
7b521405e4 storage_service: No extra get_replace_address checks
The get_replace_address() returns optional<inet_address>, but
in many cases it's used under if (is_replacing()) branch which,
in turn, returns bool(get_replace_address()) and this is only
executed if the returned optional is engaged.

Extra checks can be removed, making the code a tiny bit shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:29 +03:00
Pavel Emelyanov
330f2cfcfc storage_service: Remove write-only local variable
The set of tokens used to be used after being filled, but now
it's write-only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-27 16:41:25 +03:00
Kamil Braun
4a52b802ac test: unit test for group 0 concurrent change protection and CQL DDL retries
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
2022-01-27 11:26:15 +01:00
Kamil Braun
edd8344706 cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
`group0_concurrent_modification` exception.

Catch the exception in CQL DDL statement execution function and retry.

In addition, the description of CQL DDL statements in group 0 history
table was improved.
2022-01-27 11:26:14 +01:00
Benny Halevy
f8db9e1bd8 repair_service: deglobalize get_next_repair_meta_id
Rather than using a static uint32_t next_id,
move the next_id variable into repair_service shard 0
and manage it there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:34:21 +02:00
Benny Halevy
90ba9013be repair_service: deglobalize repair_meta_map
Move the static repair_meta_map into the repair_service
and expose it from there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:01:47 +02:00
Benny Halevy
e6b6fdc9a0 repair_service: pass reference to service to row_level_repair_gossip_helper
Note that we can't pass the repair_service container()
from its ctor since it's not populated until all shards start.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 11:00:26 +02:00
Benny Halevy
3008ecfd4e repair_meta: define repair_meta_ptr
Keep repair_meta in repair_meta_map as shared_ptr<repair_meta>
rather than lw_shared_ptr<repair_meta> so it can be defined
in the header file and use only forward-declared
class repair_meta.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:18:14 +02:00
Benny Halevy
fdc0a9602c repair_meta: move static repair_meta map functions out of line
Define the static {get,insert,remove}_repair_meta functions out
of the repair_meta class definition, as a step toward moving them,
along with the repair_meta_map itself, to repair_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:15:09 +02:00
Benny Halevy
b5427cc6d1 repair_meta: make get_set_diff a free function
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:13:09 +02:00
Benny Halevy
224e7497e0 repair: repair_meta: no need to keep sharded<netw::messaging_service>
All repair_meta needs is the local instance.
It's a peering service, so container()
can be used if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:13:09 +02:00
Benny Halevy
c4ac92b2b7 repair: repair_meta: derive subordinate services from repair_service
Use repair_service as the authoritative source for
the database, messaging_service, system_distributed_keyspace,
and view_update_generator, similar to repair_info.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:12:53 +02:00
Benny Halevy
a71d6333e4 repair: pass repair_service to repair_meta
Prepare for holding the repair_meta_map in repair_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-27 09:12:51 +02:00
Tomasz Grabiec
ba6c02b38a Merge "Clear old entries from group 0 history when performing schema changes" from Kamil
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.

* kbr/g0-history-gc-v2:
  idl: group0_state_machine: fix license blurb
  test: unit test for clearing old entries in group0 history
  service: migration_manager: clear old entries from group 0 history when announcing
2022-01-26 16:12:40 +01:00
Kamil Braun
95ac8ead4f test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency 2022-01-26 16:09:41 +01:00
Kamil Braun
e249ea5aef test: raft: randomized_nemesis_test: print details when detecting inconsistency
If the returned result is inconsistent with the constructed model, print
the differences in detail instead of just failing an assertion.
2022-01-26 16:09:41 +01:00
Kamil Braun
1170e47af4 test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in impure_state_machine
Useful for debugging.
2022-01-26 16:09:41 +01:00
Kamil Braun
b8158e0b43 test: raft: randomized_nemesis_test: keep server id in impure_state_machine
Will be used for logging.
2022-01-26 16:09:41 +01:00
Kamil Braun
3c01449472 test: raft: randomized_nemesis_test: frequent snapshotting configuration
With probability 1/2, run the test with a configuration that causes
servers to take snapshots frequently.
2022-01-26 16:09:41 +01:00
Kamil Braun
7546a9ebb5 test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
Previously all servers were ticked at the same moment, every 10
network/timer ticks.

Now we tick each server with probability 1/10 on each network/timer
tick. Thus, on average, every server is ticked once per 10 ticks.
But now we're able to obtain more interesting behaviors.
E.g. we can now observe servers which are stalling for as long as 10 ticks
and servers which temporarily speed up to tick once per each network tick.
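
The probabilistic ticking can be illustrated with a small simulation. This is a Python sketch, not the test's C++ code; `run_ticks` and its parameters are hypothetical names.

```python
import random


def run_ticks(num_servers, num_network_ticks, rng, p=0.1):
    """Return how many times each server was ticked."""
    counts = [0] * num_servers
    for _ in range(num_network_ticks):
        for server in range(num_servers):
            # Each server is ticked independently with probability p.
            if rng.random() < p:
                counts[server] += 1
    return counts


counts = run_ticks(num_servers=3, num_network_ticks=10_000,
                   rng=random.Random(0))
```

On average each server is ticked about once per 10 network ticks, but the gaps between individual ticks vary, which is what produces the stalls and bursts mentioned above.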
2022-01-26 16:09:41 +01:00
Kamil Braun
5d986b2682 test: raft: randomized_nemesis_test: simplify ticker
Instead of taking a set of functions with different periods, take a
single function that is called on every tick. The periodicity can be
implemented easily on the user side.
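
As a rough illustration of the simplified interface, the sketch below uses a hypothetical `Ticker` class that calls one callback on every tick, with the periodicity handled inside the callback.

```python
class Ticker:
    """Calls a single callback on every tick; it has no notion of periods."""

    def __init__(self, on_tick):
        self._on_tick = on_tick
        self._tick = 0

    def tick(self):
        self._on_tick(self._tick)
        self._tick += 1


events = []


def on_tick(n):
    events.append(("network", n))      # runs on every tick
    if n % 10 == 0:
        events.append(("timer", n))    # every 10th tick, decided by the user


ticker = Ticker(on_tick)
for _ in range(20):
    ticker.tick()
```

Keeping the period logic in the callback also makes it easy to swap the fixed modulo for a randomized condition.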
2022-01-26 16:09:41 +01:00
Kamil Braun
173fb2bf36 test: raft: randomized_nemesis_test: randomize network delay
As a side effect, this causes messages to be delivered in a different
order than they were sent, adding even more chaos.
2022-01-26 16:09:41 +01:00
Kamil Braun
00c18adbb0 test: raft: randomized_nemesis_test: fix use-after-free in environment::crash()
The lambda attached to `_crash_fiber` was a coroutine. The coroutine
would use `this` captured by the lambda after the `co_await`, where the
lambda object (hence its captures) was already destroyed.

No idea why it worked before and sanitizers did not complain in debug
mode.
2022-01-26 16:09:41 +01:00
Kamil Braun
4c68e6a04c test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
Two-way RPC functions such as `send_snapshot` had a guard object which
was captured in a lambda passed to `with_gate`. The guard object, on
destruction, accessed the `rpc` object. Unfortunately, the guard object
could outlive the `rpc` object. That's because the lambda, and hence the
guard object, was destroyed after `with_gate` finished (it lived in the
frame of the caller of `with_gate`, i.e. `send_snapshot` and others),
so it could be destroyed after `rpc` (the gate prevents `rpc` from being
destroyed).

Make sure that the guard object is destroyed before `with_gate` finishes
by creating it inside the lambda body - not capturing inside the object.
2022-01-26 16:09:41 +01:00
Kamil Braun
871f0d00ce test: raft: randomized_nemesis_test: rpc: don't propagate gate_closed_exception outside
The `raft::rpc` interface functions are called by `raft::server_impl`
and the exceptions may be propagated outside the server, e.g. through
the `add_entry` API.

Translate the internal `gate_closed_exception` to an external
`raft::stopped_error`.
2022-01-26 16:09:41 +01:00
Kamil Braun
9da4ffc1c7 test: raft: randomized_nemesis_test: fix obsolete comment 2022-01-26 16:09:41 +01:00
Kamil Braun
22092d110a raft: fsm: print configuration entries appearing in the log
When appending or committing configuration entries, print them (on TRACE
level). Useful for debugging.
2022-01-26 16:09:41 +01:00
Kamil Braun
44a1a8a8b0 raft: operator<<(ostream&, ...) implementation for server_address and configuration
Useful for debugging.

Had to make `configuration` constructor explicit. Otherwise the
`operator<<` implementation for `configuration` would implicitly convert
the `server_address` to `configuration` when trying to output it, causing
infinite recursion.

Removed implicit uses of the constructor.
2022-01-26 16:09:41 +01:00
Kamil Braun
46f6a0cca5 raft: server: abort snapshot applications before waiting for rpc abort
The implementation of `rpc` may wait for all snapshot applications to
finish before it can finish aborting. This is what the
randomized_nemesis_test implementation did. This caused rpc abort to
hang in some scenarios.

In this commit, the order of abort calls is modified a bit. Instead of
waiting for rpc abort to finish and then aborting existing snapshot
applications, we call `rpc::abort()` and keep the future, then abort
snapshot applications, then wait on the future. Calling `rpc::abort()`
first is supposed to prevent new snapshot applications from starting;
a comment was added at the interface definition. The nemesis test
implementation had this property, and `raft_rpc` in group registry
was adjusted appropriately. Aborting the snapshot applications then
allows `rpc::abort()` to finish.
2022-01-26 16:06:45 +01:00
Kamil Braun
5577ad6c34 raft: server: logging fix 2022-01-26 15:54:14 +01:00
Kamil Braun
1216f39977 raft: fsm: don't advance commit index beyond matched entries
Otherwise it was possible to incorrectly mark obsolete entries from
earlier terms as committed, leading to inconsistencies between state
machine replicas.
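
One way to state the rule, assuming the fix corresponds to the follower-side rule from the Raft paper (set the commit index to the minimum of the leader's commit index and the index of the last new entry), is sketched below. The function and parameter names are hypothetical.

```python
def advance_commit_index(commit_index, leader_commit, last_matched_index):
    """Advance the commit index, but never past the entries known to match
    the leader's log. The commit index never moves backwards."""
    return max(commit_index, min(leader_commit, last_matched_index))


# The follower holds 10 entries, but only the first 6 are confirmed to match
# the leader; entries 7..10 may be stale leftovers from an earlier term.
new_commit = advance_commit_index(commit_index=3, leader_commit=8,
                                  last_matched_index=6)
```

Using the length of the local log instead of `last_matched_index` would yield 8 here and mark two unverified entries as committed.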

Fixes #9965.
2022-01-26 15:53:13 +01:00
Avi Kivity
df22396a34 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that the monitor mode of mdadm does not work on RAID0; according
to a RHEL developer this is expected behavior, not a bug.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduces the correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
2022-01-26 15:34:47 +02:00
Takuya ASADA
32f2eb63ac scylla_raid_setup: use mdmonitor only when RAID level > 0
We found that the monitor mode of mdadm does not work on RAID0; according
to a RHEL developer this is expected behavior, not a bug.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540
2022-01-26 22:33:07 +09:00
Takuya ASADA
cd57815fff Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
This reverts commit 0d8f932f0b,
because a RHEL developer explained that this is not a bug but expected behavior
(mdadm --monitor does not start when the RAID level is 0).
see: https://bugzilla.redhat.com/show_bug.cgi?id=2031936

So we should stop downgrading the mdadm package and modify our script not to
enable mdmonitor.service on RAID0, using it only for RAID5.
2022-01-26 22:33:06 +09:00
Gleb Natapov
579dcf187a raft: allow an option to persist commit index
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if the entire cluster
is restarted and hence there is no leader) a new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before restart
as soon as possible without any external communication.

For them this patch introduces a new persistence API that allows saving
and restoring the last seen commit index.

Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
2022-01-26 14:06:39 +01:00
Calle Wilund
43f51e9639 commitlog: Ensure we don't run continuation (task switch) with queues modified
Fixes #9955

In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recalculating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck null/zeroes).

Strictly speaking, seastar::queue::pop_eventually should be fixed to handle
the scenario, but nonetheless we can fix the usage here as well, by simply
copying objects and doing the calculation "in background" while we potentially
start popping the queue again.
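
The hazard can be modelled with a toy queue. This Python sketch is not Seastar code; `NaiveQueue` and its methods are hypothetical and only mimic a waiter that does not re-check emptiness when it is resumed.

```python
from collections import deque


class NaiveQueue:
    """Toy model: a woken waiter assumes its item is still in the queue."""

    def __init__(self):
        self.items = deque()
        self.waiter_woken = False

    def push(self, item):
        self.items.append(item)
        self.waiter_woken = True      # waiter is scheduled but has not run yet

    def pop(self):
        return self.items.popleft()   # caller must know the queue is non-empty

    def resume_waiter(self):
        # The resumed task pops without checking whether the item survived.
        assert self.waiter_woken
        self.waiter_woken = False
        return self.items.popleft()


q = NaiveQueue()
q.push("segment")        # wakes the task blocked waiting for an element
stolen = q.pop()         # the pushing task pops again before the waiter runs
try:
    q.resume_waiter()
    outcome = "ok"
except IndexError:
    outcome = "popped an empty queue"
```

Python raises `IndexError` here; per the description above, the C++ container performs no such bounds check, so the same sequence ends in undefined behaviour instead of an exception.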

Closes #9966
2022-01-26 13:51:01 +02:00
Avi Kivity
f5cd6ec419 Update tools/python3 submodule (relicensed to Apache License 2.0)
* tools/python3 8a77e76...f725ec7 (2):
  > Relicense to Apache 2.0
  > treewide: use Software Package Data Exchange (SPDX) license identifiers
2022-01-25 18:50:39 +02:00
Kamil Braun
f3c0c73d36 idl: group0_state_machine: fix license blurb 2022-01-25 17:48:46 +01:00
Kamil Braun
bf91dcd1e3 idl: group0_state_machine: fix license blurb 2022-01-25 13:14:47 +01:00
Kamil Braun
b863a63b08 test: unit test for clearing old entries in group0 history
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
2022-01-25 13:13:35 +01:00
Kamil Braun
e9083433a8 service: migration_manager: clear old entries from group 0 history when announcing
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.
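
The effect of the cleanup, though not its mechanism, can be modelled as below. The real change attaches a range tombstone to the mutation; this Python sketch simply filters a dictionary, and the function name, the `gc_duration` parameter and the inclusive cutoff are assumptions.

```python
from datetime import datetime, timedelta


def clear_old_entries(history, now, gc_duration=timedelta(weeks=1)):
    """Keep only entries that are not older than now - gc_duration.

    `history` maps an entry's timestamp to its description.
    """
    cutoff = now - gc_duration
    return {ts: desc for ts, desc in history.items() if ts >= cutoff}


now = datetime(2022, 1, 25)
history = {
    datetime(2022, 1, 10): "CREATE KEYSPACE ks",
    datetime(2022, 1, 24): "CREATE TABLE ks.t",
}
kept = clear_old_entries(history, now)
```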
2022-01-25 13:11:14 +01:00
Botond Dénes
eb42213db4 compact_mutation: close active range tombstone on page end
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output enforces this with a check on
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor however (or its consumer) can decide to
end the consume early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.

Fixes: #9907

Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
2022-01-25 09:52:30 +02:00
Gleb Natapov
e56e96ac5a raft: do not add new wait entries after abort
Abort signals stopped_error on all awaited entries, but if an entry is
added after this it will be destroyed without signaling and will cause
a waiter to get broken_promise.
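
The idea can be sketched in Python with plain futures. The `Waiters` class and the choice to hand back an already-failed future after abort are assumptions of this sketch; the real code may reject the new entry in a different way.

```python
from concurrent.futures import Future

class StoppedError(Exception):
    """Stand-in for the stopped_error signalled on abort."""

class Waiters:
    def __init__(self):
        self._awaited = {}
        self._aborted = False

    def add(self, index):
        fut = Future()
        if self._aborted:
            # After abort no new entry is registered; the waiter is failed
            # right away instead of being left to a broken promise.
            fut.set_exception(StoppedError())
            return fut
        self._awaited[index] = fut
        return fut

    def abort(self):
        self._aborted = True
        for fut in self._awaited.values():
            fut.set_exception(StoppedError())
        self._awaited.clear()

w = Waiters()
before = w.add(1)
w.abort()
after = w.add(2)
```

Both `before` and `after` end up holding `StoppedError`, so no waiter is left with an unsignalled future.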

Fixes #9688

Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
2022-01-25 09:52:30 +02:00
Tomasz Grabiec
c89b1953f8 Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil
We introduce a new table, `system.group0_history`.

This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

Group 0 commands, in addition to mutations which modify group 0 tables,
contain a "previous state ID" and a "new state ID".

The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.
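
A toy Python model of the check performed during `apply`, for illustration only. The class below and the use of plain strings as state IDs are assumptions of this sketch and do not mirror the real `group0_state_machine`.

```python
class Group0StateMachine:
    def __init__(self):
        self.history = []  # state IDs, newest last
        self.state = {}

    def last_state_id(self):
        return self.history[-1] if self.history else None

    def apply(self, prev_state_id, new_state_id, mutations):
        """Apply the command only if it was built on top of the latest state."""
        if prev_state_id != self.last_state_id():
            return False  # a concurrent change won the race: no-op
        self.state.update(mutations)
        # The history entry is added last, after the state was modified.
        self.history.append(new_state_id)
        return True

sm = Group0StateMachine()
first = sm.apply(None, "id-1", {"ks": "v1"})
# Two performers both read "id-1" as the last state ID and race.
a = sm.apply("id-1", "id-2a", {"ks": "v2a"})
b = sm.apply("id-1", "id-2b", {"ks": "v2b"})
```

Here `a` is applied and `b` becomes a no-op, because by the time `b` is applied the last state ID is already `id-2a`.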

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group 0
state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
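
One way to obtain such a timestamp, sketched in Python. This shows the monotonicity rule only; `next_state_timestamp` is a hypothetical helper and timestamps are modelled as plain microsecond integers rather than being extracted from a state ID.

```python
import time

def next_state_timestamp(last_ts, now_micros=None):
    """Return a timestamp strictly greater than last_ts.

    Uses the current time when the clock is ahead of the last state,
    and last_ts + 1 otherwise (e.g. when the clock went backwards).
    """
    if now_micros is None:
        now_micros = time.time_ns() // 1000
    if last_ts is None:
        return now_micros
    return max(now_micros, last_ts + 1)

behind = next_state_timestamp(100, now_micros=50)
ahead = next_state_timestamp(100, now_micros=200)
```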

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
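
The lifetimes of the two locks can be sketched with `asyncio`. This is a model under assumptions: `Group0Client`, `_apply` and `change` are hypothetical, and sending a command through Raft is reduced to a local call.

```python
import asyncio

class Group0Client:
    def __init__(self):
        self.read_apply_mutex = asyncio.Lock()
        self.operation_mutex = asyncio.Lock()
        self.log = []

    async def start_group0_operation(self):
        # Both locks are taken when the guard is obtained.
        await self.operation_mutex.acquire()
        await self.read_apply_mutex.acquire()
        # ... the group 0 state would be read here ...

    async def announce(self, name):
        # read_apply_mutex is released before the command is sent ...
        self.read_apply_mutex.release()
        try:
            await self._apply(name)
        finally:
            # ... while operation_mutex is released only after it is applied.
            self.operation_mutex.release()

    async def _apply(self, name):
        # apply holds read_apply_mutex but never operation_mutex.
        async with self.read_apply_mutex:
            await asyncio.sleep(0)
            self.log.append(name)

async def change(client, name):
    await client.start_group0_operation()
    await client.announce(name)

async def main():
    client = Group0Client()
    await asyncio.gather(*(change(client, n) for n in ("a", "b", "c")))
    return client.log

log = asyncio.run(main())
```

The three concurrent changes are serialized by `operation_mutex`, so each of them gets applied instead of racing with the others.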

* kbr/schema-state-ids-v4:
  service: migration_manager: `announce`: take a description parameter
  service: raft: check and update state IDs during group 0 operations
  service: raft: group0_state_machine: introduce `group0_command`
  service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
  service: migration_manager: convert migration request handler to coroutine
  db: system_keyspace: introduce `system.group0_history` table
  treewide: require `group0_guard` when performing schema changes
  service: migration_manager: introduce `group0_guard`
  service: raft: pass `storage_proxy&` to `group0_state_machine`
  service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot`
  service: raft: rename `schema_raft_state_machine` to `group0_state_machine`
  service: migration_manager: rename `schema_read_barrier` to `start_group0_operation`
  service: migration_manager: `announce`: split raft and non-raft paths to separate functions
  treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions
  service: migration_manager: put notifier call inside `async`
  service: migration_manager: remove some unused and disabled code
  db: system_distributed_keyspace: use current time when creating mutations in `start()`
  redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only
2022-01-25 09:52:30 +02:00
Avi Kivity
a105b09475 build: prepare for Scylla 5.0
We decided to name the next version Scylla 5.0, in honor of
Raft based schema management.
2022-01-25 09:52:30 +02:00
Avi Kivity
277303a722 build_indexes_virtual_reader: convert to flat_mutation_reader_v2
Since it doesn't handle range tombstones in any way, the conversion
consists of just using the new type names.

Closes #9948
2022-01-25 09:52:30 +02:00
Avi Kivity
007145e033 validation: complete transition to data_dictionary module
The API was converted in 00de5f4876,
but some #includes remain. Remove them.

Closes #9947
2022-01-25 09:52:30 +02:00
Avi Kivity
e74f570eda alternator: streams: fix use-after-free of data_dictionary in describe_stream()
In 4aa9e86924 ("Merge 'alternator: move uses of replica module to
data_dictionary' from Avi Kivity"), we changed alternator to use
data_dictionary instead of replica::database. However,
data_dictionary::database objects are different from replica::database
objects in that they don't have a stable address and need to be
captured by value (they are pointer-like). One capture in
describe_stream() was capturing a data_dictionary::database
by reference and so caused a use-after-free when the previous
continuation was deallocated.

Fix by capturing by value.

Fixes #9952.

Closes #9954
2022-01-25 09:52:30 +02:00
Kamil Braun
044e05b0d9 service: migration_manager: announce: take a description parameter
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.

This is how it looks now in a fresh cluster with a single statement
performed by the user:

cqlsh> select * from system.group0_history ;

 key     | state_id                             | description
---------+--------------------------------------+------------------------------------------------------
 history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 |                                    CQL DDL statement
 history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 |                                                 null
 history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f |                                                 null
 history | 9be784ca-7547-11ec-f297-f40f0073038e |                                                 null
 history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c |                                                 null
 history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 |                                                 null
 history | 9be160c2-7547-11ec-e0ea-29f4272345de |                                                 null
 history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd |                                                 null
 history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e |                                                 null
 history | 9bdb925a-7547-11ec-d754-aa2cc394a22c |                                                 null
 history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 |                                                 null
 history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
 history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 |        Create system_distributed(_everywhere) tables
 history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a |        Create system_distributed_everywhere keyspace
 history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 |                   Create system_distributed keyspace
2022-01-24 15:20:37 +01:00
Kamil Braun
6a00e790c7 service: raft: check and update state IDs during group 0 operations
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
2022-01-24 15:20:37 +01:00
Kamil Braun
509ac2130f service: raft: group0_state_machine: introduce group0_command
Objects of this type will be serialized and sent as commands to the
group 0 state machine. They contain a set of mutations which modify
group 0 tables (at this point: schema tables and group 0 history table),
the 'previous state ID' which is the last state ID present in the
history table when the operation described by this command has started,
and the 'new state ID' which will be appended to the history table if
this change is successful (successful = the previous state ID is still
equal to the last state ID in the history table at the moment of
application). It also contains the address of the node which constructed
this command.

The state ID mechanism will be described in more detail in a later
commit.
2022-01-24 15:20:37 +01:00
Kamil Braun
cc0c54ea15 service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.

We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.

Note that if a request is sent with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
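
The handler behaviour can be sketched as follows. The field name `group0_snapshot_transfer` and the function `handle_migration_request` are hypothetical, and mutations are modelled as plain strings.

```python
from dataclasses import dataclass

@dataclass
class SchemaPullOptions:
    # Hypothetical field name; False by default, so existing schema
    # pulls behave as they did before.
    group0_snapshot_transfer: bool = False

def handle_migration_request(options, schema_mutations, history_mutations):
    """Return schema mutations, plus group 0 history when it was requested."""
    result = list(schema_mutations)
    if options.group0_snapshot_transfer:
        result.extend(history_mutations)
    return result

plain = handle_migration_request(SchemaPullOptions(), ["schema"], ["history"])
full = handle_migration_request(
    SchemaPullOptions(group0_snapshot_transfer=True), ["schema"], ["history"])
```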
2022-01-24 15:20:37 +01:00
Kamil Braun
a944dd44ee service: migration_manager: convert migration request handler to coroutine
2022-01-24 15:20:37 +01:00
Kamil Braun
fad72daeb4 db: system_keyspace: introduce system.group0_history table
This table will contain a history of all group 0 changes applied through
Raft. Each change has an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

We will use these state IDs to check if a given change is still
valid at the moment it is applied (in `group0_state_machine::apply`),
i.e. that there wasn't a concurrent change that happened between
creating this change and applying it (which may invalidate it).
2022-01-24 15:20:37 +01:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00
Kamil Braun
742f036261 service: migration_manager: introduce group0_guard
This object will be used to "guard" group 0 operations. Obtaining it
will be necessary to perform a group 0 change (such as modifying the
schema), which will be enforced by the type system.

The initial implementation is a stub and only provides a timestamp which
will be used by callers to create mutations for group 0 changes. The
next commit will change all call sites to use the guard as intended.

The final implementation, coming later, will ensure linearizability of
group 0 operations.
2022-01-24 15:12:50 +01:00
Kamil Braun
f908da919c service: raft: pass storage_proxy& to group0_state_machine
We'll use it to update the group 0 history table.
2022-01-24 15:12:50 +01:00
Kamil Braun
dce8ece4b6 service: raft: raft_state_machine: pass snapshot_descriptor to transfer_snapshot
Currently it takes just the snapshot ID. Extend it by taking the whole
snapshot descriptor.

In following commits I use this to perform additional logging.
2022-01-24 15:12:50 +01:00
Kamil Braun
538cc6ecb9 service: raft: rename schema_raft_state_machine to group0_state_machine
Generalize the name so it doesn't suggest that group 0 contains only
schema state.
2022-01-24 15:12:50 +01:00
Kamil Braun
86762a1dd9 service: migration_manager: rename schema_read_barrier to start_group0_operation
1. Generalize the name so it mentions group 0, which schema will be a
   strict subset of.
2. Remove the fact that it performs a "read barrier" from the name. The
   function will be used in general to ensure linearizability of group0
   operations - both reads and writes. "Read barrier" is Raft-specific
   terminology, so it can be thought of as an implementation detail.
2022-01-24 15:12:50 +01:00
Kamil Braun
0f24b907b7 service: migration_manager: announce: split raft and non-raft paths to separate functions
2022-01-24 15:12:50 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
2022-01-24 15:12:50 +01:00
Kamil Braun
f97edb1dbd service: migration_manager: put notifier call inside async
`get_notifier().before_update_column_family(...)` requires being inside
`async`. Fix this.
2022-01-24 15:12:50 +01:00
Kamil Braun
3bab5c564a service: migration_manager: remove some unused and disabled code
`include_keyspace_and_announce` was no longer used.
`do_announce_new_type` only had a declaration, it was not used and there
was no definition.
2022-01-24 15:12:49 +01:00
Kamil Braun
0af5f74871 db: system_distributed_keyspace: use current time when creating mutations in start()
When creating or updating internal distributed tables in
`system_distributed_keyspace::start()`, hardcoded timestamps were used.

There were two reasons for this:
- to protect against issue #2129, where nodes would start without
  synchronizing schema with the existing cluster, creating the tables
  again, which would override any manual user changes to these tables.
  The solution was to use small timestamps (like api::min_timestamp) - the
  user-created schema mutations would always 'win' (because when they were
  created, they used current time).
- to eliminate unnecessary schema sync. If two nodes created these
  tables concurrently with different timestamps, the schemas would
  formally be different and would need to merge. This could happen
  during upgrades when we upgraded from a version which doesn't have
  these tables or doesn't have some columns.

The #2129 workaround is no longer necessary: when nodes start they always
have to sync schema with existing nodes; we also don't allow
bootstrapping nodes in parallel.

The second problem would happen during parallel bootstrap, which we
don't allow, or during parallel upgrade. The procedure we recommend is
rolling upgrade - where nodes are upgraded one by one. In this case only
one node is going to create/update the tables; following upgraded nodes
will sync schema first and notice they don't need to do anything. So if
procedures are followed correctly, the workaround is not needed. If
someone doesn't follow the procedures and upgrades nodes in parallel,
these additional schema synchronizations are not a big cost, so the
workaround doesn't give us much in this case either.

When schema changes are performed by Raft group 0, certain constraints
are placed on the timestamps used for mutations. For this we'll need to
be able to use timestamps which are generated based on current time.
2022-01-24 15:12:49 +01:00
Kamil Braun
63d3449bc3 redis: keyspace_utils: create_keyspace_if_not_exists_impl: call announce twice only
The code would previously `announce` schema mutations once per each keyspace and
once per each table. This can be reduced to two calls of `announce`:
once to create all keyspaces, and once to create all tables.

This should be further reduced to a single `announce` in the future.
Left a FIXME.

Motivation: after migrating to Raft, each `announce` will require a
`read_barrier` to achieve linearizability of schema operations. This
introduces latency, as it requires contacting a leader which then must
contact a quorum. The fewer announce calls, the better. Also, if all
sub-operations are reduced to a single `announce`, we get atomicity -
either all of these sub-operations succeed or none do.
2022-01-24 15:12:46 +01:00
Benny Halevy
188cedd533 test: lister_test: test_lister_abort: generate at least one entry
Without this fix, generate_random_content could generate 0 entries
and the expected exception would never be injected.

With it, we generate at least 1 entry and the test passes
with the offending random-seed:

```
random-seed=1898914316
Generated 1 dir entries
Aborting lister after 1 dir entries
test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed
```

Fixes #9953

Test: lister_test.test_lister_abort --random-seed=1898914316(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Gleb Natapov
d09864d61f redis: check for tables existence before creating
Do not create redis tables unconditionally on boot since this requires
issuing a Raft barrier, which cannot be done without a quorum.

Message-Id: <YefV0CqEueRL7G00@scylladb.com>
2022-01-23 17:52:44 +02:00
Benny Halevy
f439edca35 test: sstable_compaction_test: twcs_reshape_with_disjoint_set_test: take min_threshold into consideration
Take into account that get_reshaping_job selects only
buckets that have more than min_threshold sstables in them.

Therefore, with 256 disjoint sstables in different windows,
allow first or last windows to not be selected by get_reshaping_job
that will return at least disjoint_sstable_count - min_threshold + 1
sstables, and not more than disjoint_sstable_count.
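
The accepted range, worked through in Python. The `min_threshold` value of 4 is an assumption chosen for the example and `reshape_job_size_bounds` is a hypothetical helper, not the test's code.

```python
def reshape_job_size_bounds(disjoint_sstable_count, min_threshold):
    """Inclusive lower and upper bound on the number of selected sstables."""
    lower = disjoint_sstable_count - min_threshold + 1
    upper = disjoint_sstable_count
    return lower, upper

lower, upper = reshape_job_size_bounds(256, 4)
```

With 256 disjoint sstables and an assumed `min_threshold` of 4, the job may contain between 253 and 256 sstables.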

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Avi Kivity
ae6fdf1599 Update seastar submodule
* seastar 5025cd44ea...5524f229bb (3):
  > Merge "Simplify io-queue configuration" from Pavel E
  > fix sstring.find(): make find("") compatible with std::string
  > test: file_utils: test_non_existing_TMPDIR: no need to setenv

Contains patch from Pavel Emelyanov <xemul@scylladb.com>:

scylla-gdb: Remove _shares_capacity from fair-group debug

This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220121115643.6966-1-xemul@scylladb.com>
2022-01-21 17:38:05 +02:00
Piotr Jastrzebski
09d4438a0d cdc: Handle compact storage correctly in preimage
Base tables that use compact storage may have a special artificial
column that has an empty type.

c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.

This patch makes sure that generation of preimage ignores such empty
columns as well.

Fixes #9876
Closes #9910

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2022-01-20 13:23:38 +01:00
Nadav Har'El
350c3d0f6a alternator: update comment about default timeout
The comment explaining where the default Alternator timeout is set
became out-of-date. So fix it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220120092631.401563-1-nyh@scylladb.com>
2022-01-20 14:05:58 +02:00
Raphael S. Carvalho
5d654a6b9a compaction: don't copy owned ranges in cleanup ctor
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220119142322.39791-1-raphaelsc@scylladb.com>
2022-01-20 14:05:58 +02:00
Botond Dénes
a65b38a9f7 reader_permit: release_base_resources(): also update _resources
If the permit was admitted, _base_resources was already accounted in
_resources and therefore has to be deducted from it, otherwise the permit
will think it leaked some resources on destruction.
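
The accounting can be modelled in a few lines of Python. The `Permit` class below is a hypothetical simplification with a single integer resource; it is not the real `reader_permit`.

```python
class Permit:
    def __init__(self, base):
        self.base_resources = base
        self.resources = 0
        self.admitted = False

    def admit(self):
        # On admission the base resources are accounted in resources.
        self.resources += self.base_resources
        self.admitted = True

    def release_base_resources(self):
        if self.admitted:
            # Deduct what admission added; otherwise the permit would
            # still see these units as outstanding on destruction.
            self.resources -= self.base_resources
        self.base_resources = 0

    def leaked(self):
        """Resources still accounted when the permit is destroyed."""
        return self.resources

p = Permit(base=10)
p.admit()
p.release_base_resources()
```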

Test:
dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count)

Refs: #9751
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220119132550.532073-1-bdenes@scylladb.com>
2022-01-20 14:05:58 +02:00
Nadav Har'El
7cb6250c40 Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy
This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size

https://github.com/scylladb/scylla/issues/9897
https://github.com/scylladb/scylla/issues/9898

And adds a couple of unit tests to test the snapshot_ctl functionality.

Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug)

Closes #9899

* github.com:scylladb/scylla:
  table: get_snapshot_details: count allocated_size
  snapshot_ctl: cleanup true_snapshots_size
  snpashot_ctl: true_snapshots_size: do not map_reduce across all shards
2022-01-19 11:57:15 +02:00
Nadav Har'El
4aa9e86924 Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity
Alternator is a coordinator-side service and so should not access
the replica module. In this series all but one of uses of the replica
module are replaced with data_dictionary.

One case remains - accessing the replication map which is not
available (and should not be available) via the data dictionary.

The data_dictionary module is expanded with missing accessors.

Closes #9945

* github.com:scylladb/scylla:
  alternator: switch to data_dictionary for table listing purposes
  data_dictionary: add get_tables()
  data_dictionary: introduce keyspace::is_internal()
2022-01-19 11:34:25 +02:00
Avi Kivity
7399f3fae7 alternator: switch to data_dictionary for table listing purposes
As a coordinator-side service, alternator shouldn't touch the
replica module, so it is migrated here to data_dictionary.

One use case still remains that uses replica::keyspace - accessing
the replication map. This really isn't a replica-side thing, but it's
also not logically part of the data dictionary, so it's left using
replica::keyspace (using the data_dictionary::database::real_database()
escape hatch). Figuring out how to expose the replication map to
coordinator-side services is left for later.
2022-01-19 11:03:36 +02:00
Avi Kivity
f80d13c95c data_dictionary: add get_tables()
Unlike replica::database::get_column_families() which it replaces,
it returns a vector of tables rather than a map. Map-like access
is provided by get_table(), so it's redundant to build a new
map container to expose the same functionality.
2022-01-19 09:36:22 +02:00
Benny Halevy
94c2272c8e table: get_snapshot_details: count allocated_size
Rather than the logical file sizes, so as to account
for metadata overhead.

Fixes #9898

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-19 08:10:57 +02:00
Benny Halevy
5440739e1b snapshot_ctl: cleanup true_snapshots_size
Cleanup indentation and s/local_total/total/
as it is

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-19 07:50:53 +02:00
Benny Halevy
5db3cbe1e4 snpashot_ctl: true_snapshots_size: do not map_reduce across all shards
snapshot_ctl uses map_reduce over all database shards,
each counting the size of the snapshots directory,
which is shared, not per-shard.

So the total live size returned by it is multiplied by the number of shards.
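
The overcounting is easy to reproduce in a model. Both functions below are hypothetical; the shared snapshots directory is reduced to a single number.

```python
def true_snapshots_size_buggy(shard_count, shared_dir_size):
    # Every shard measures the same shared directory and the
    # per-shard results are summed.
    return sum(shared_dir_size for _ in range(shard_count))

def true_snapshots_size_fixed(shared_dir_size):
    # The shared directory is measured once.
    return shared_dir_size

buggy = true_snapshots_size_buggy(shard_count=8, shared_dir_size=1000)
fixed = true_snapshots_size_fixed(shared_dir_size=1000)
```

With 8 shards the old computation reports 8000 for a directory that holds 1000.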

Add a unit test to test that.

Fixes #9897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-19 07:50:53 +02:00
Nadav Har'El
1ce73c2ab3 Merge 'utils::is_timeout_exception: Ensure we handle nested exception types' from Calle Wilund
Fixes #9922

storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (because of bit rot and
intermixing) by wrapping the exception for informational purposes.

This adds a check of nested types in exception handling, as well as a test
for the routine itself.
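
A Python analogue of the nested check, for illustration. Python's exception chaining (`__cause__` / `__context__`) stands in for nested exceptions here, and `TimeoutException` and this `is_timeout_exception` are hypothetical, not the actual utility.

```python
class TimeoutException(Exception):
    """Stand-in for a timeout error type."""

def is_timeout_exception(exc):
    """True if exc, or any exception it wraps, is a timeout."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, (TimeoutException, TimeoutError)):
            return True
        seen.add(id(exc))  # guard against cycles in the chain
        exc = exc.__cause__ or exc.__context__
    return False

def wrapped_timeout():
    try:
        raise TimeoutException("read timed out")
    except TimeoutException as e:
        # Wrapping for informational purposes hides the original type.
        raise RuntimeError("while reading table ks.t") from e

try:
    wrapped_timeout()
except RuntimeError as e:
    detected = is_timeout_exception(e)
```

A plain `except TimeoutException` would miss the wrapped error, while walking the chain still recognizes it.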

Closes #9932

* github.com:scylladb/scylla:
  database/storage_proxy: Use "is_timeout_exception" instead of catch match
  utils::is_timeout_exception: Ensure we handle nested exception types
2022-01-18 23:49:41 +02:00
Calle Wilund
868b572ec8 database/storage_proxy: Use "is_timeout_exception" instead of catch match
Might miss cases otherwise.

v2: Fix broken control flow
v3: Avoid throw - use make_exception_future instead.
2022-01-18 15:40:41 +00:00
Avi Kivity
8350cabff3 data_dictionary: introduce keyspace::is_internal()
Instead of the replica module's is_internal_keyspace(), provide
it as part of data_dictionary. By making it a member of the keyspace
class, it is also more future proof in that it doesn't depend on
a static list of names.
2022-01-18 15:31:38 +02:00
Avi Kivity
5ed1a8217c redis: switch from replica/database to data_dictionary
redis uses replica/database only for data dictionary purposes;
switch it to the much lighter weight data_dictionary module.

Closes #9926
2022-01-18 13:26:29 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Raphael S. Carvalho
299ffb1e1a compaction: make TWCS reshape on a time bucket with tons of files much more efficient
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.

To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
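
A rough Python sketch of size-tiering inside one time bucket. The grouping rule (a file joins a tier when its size is within 0.5x to 1.5x of the tier average) and the thresholds are assumptions of this sketch, not necessarily what the reshape code uses.

```python
def size_tiers(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group file sizes into tiers of similar size."""
    tiers = []  # each item: [average_size, [sizes]]
    for s in sorted(sizes):
        for tier in tiers:
            avg = tier[0]
            if avg * bucket_low <= s <= avg * bucket_high:
                tier[1].append(s)
                tier[0] = sum(tier[1]) / len(tier[1])
                break
        else:
            tiers.append([s, [s]])
    return [t[1] for t in tiers]

def reshape_jobs(sizes, min_threshold=4, max_threshold=32):
    """Pick tiers with enough files, capped at max_threshold files each."""
    return [tier[:max_threshold]
            for tier in size_tiers(sizes)
            if len(tier) >= min_threshold]

# Five tiny files and one huge file in the same time bucket.
jobs = reshape_jobs([1, 1, 1, 1, 1, 1000])
```

The huge file is left out of the job, so it is not rewritten together with the tiny ones.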

Fixes #9938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
2022-01-18 12:33:54 +02:00
Botond Dénes
8ac7c4f523 docs/design-notes/IDL.md: fix typo: s/on only/only/
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220118094416.242409-1-bdenes@scylladb.com>
2022-01-18 12:30:39 +02:00
Benny Halevy
84e80f7b99 table: snapshot: handle error from seal_snapshot
If seal_snapshot fails we currently do not signal
the manifest_write semaphore and shards waiting for
it will be blocked forever.

Also, call manifest_write.wait in a `finally` clause
rather than in a `then` clause, even though
`my_work` future never fails at the moment,
to make this future proof.
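
The signalling side of the pattern, sketched with Python threading primitives. `seal_snapshot` and `write_manifest` here are hypothetical stand-ins; only the "signal in `finally`" shape is the point.

```python
import threading

def seal_snapshot(fail):
    if fail:
        raise OSError("seal failed")

def write_manifest(manifest_write, waiters, fail):
    try:
        seal_snapshot(fail)
    finally:
        # Signal even when sealing failed, so that the shards waiting
        # on the semaphore are not blocked forever.
        for _ in range(waiters):
            manifest_write.release()

sem = threading.Semaphore(0)
try:
    write_manifest(sem, waiters=3, fail=True)
except OSError:
    failed = True
else:
    failed = False
# Non-blocking acquires succeed only if the semaphore was signalled.
woken = [sem.acquire(blocking=False) for _ in range(3)]
```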

Fixes #9936

Test: database_test(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117181733.3706764-1-bhalevy@scylladb.com>
2022-01-18 12:17:01 +02:00
Avi Kivity
7260d8abed Merge "index_reader: improve verify_end_state()" from Botond
"
Said method should take care of checking that parsing stopped in a valid
state. This patch-set expands the existing but very lacking
implementation by improving the existing error message and adding an
additional check for prematurely exiting the parser in the middle of
parsing an index entry, something we've seen recently in #9446.
To help in debugging such issues, some additional information is added
to the trace messages.
The series also fixes a bug in the error handling code of the partition
index cache.

Refs: #9446

Tests: unit(dev)
"

* 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla:
  sstables/index_reader: process_state(): add additional information to trace logging
  sstables/index_reader: verify_end_state(): add check for premature EOS
  sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
  sstables/index_reader: add const sstable& to index_consume_entry_context
  sstables/index_reader: remove unused members from index_consume_entry_context
2022-01-18 12:13:08 +02:00
Benny Halevy
2ae69447b5 sstables: update_info_for_opened_data: accumulate allocated_size into bytes_on_disk
bytes_on_disk is intended to reflect the bytes allocated for the
sstable files on disk.

Accumulating the files logical size, as done today, causes a
discrepancy between information retrieved over the
storage_service/sstables_info api, like nodetool status or nodetool
cfstats and command line tools like df -H /var/lib/scylla.
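
The difference between the two sizes can be illustrated with a small
Python sketch (not part of the patch). It assumes a POSIX system, where
`st_blocks` is reported in 512-byte units; the helper names are made up
for the example.

```python
import os


def logical_size(path):
    """Size as seen by readers of the file (what was accumulated before)."""
    return os.stat(path).st_size


def allocated_size(path):
    """Bytes actually allocated on disk, which is what tools like df account for."""
    # st_blocks counts 512-byte units regardless of the filesystem block size.
    return os.stat(path).st_blocks * 512
```

For sparse or preallocated files the two values can differ noticeably.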

Fixes #9941

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220118070208.3963076-1-bhalevy@scylladb.com>
2022-01-18 11:33:36 +02:00
Botond Dénes
940874f3ff sstables/index_reader: process_state(): add additional information to trace logging
The amount of data available for parsing at the start of each entry, and
the parsed key size.
2022-01-18 10:38:11 +02:00
Botond Dénes
afb14508c4 sstables/index_reader: verify_end_state(): add check for premature EOS
Add a check which ensures that parsing ended in a valid state and not in
the middle of a half-parsed entry.
2022-01-18 10:38:11 +02:00
Botond Dénes
36c0fe904e sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
Errors during parsing are usually reported via malformed sstable
exception to signify their gravity of potentially being caused by
corrupt sstables. This patch converts the exception thrown in
`index_consume_entry_context::verify_end_state()`.
While at it the error message is improved as well. It currently suggests
that parsing was ended prematurely because data ran out, while in fact
the condition under which this error is thrown is the opposite: parsing
ended but there is unconsumed data left. The current state is also added
to the error message.
2022-01-18 10:38:11 +02:00
Botond Dénes
7508b4fd22 sstables/index_reader: add const sstable& to index_consume_entry_context
To be used by the next patches to throw malformed sstable exception.
2022-01-18 10:38:11 +02:00
Botond Dénes
9f3e5ae801 sstables/index_reader: remove unused members from index_consume_entry_context
The unused members are: _s and _file_name.
2022-01-18 10:38:11 +02:00
Avi Kivity
2754e29a9d Merge "tools: make cli command-based" from Botond
"
Currently commands are regular switches. This has several disadvantages:
* CLI programs nowadays use the command-based UX, so our tools are
  awkward to use to anybody used to that;
* They don't stand out from regular options;
* They are parsed at the same time as regular options, so all options
  have to be dumped to a single description;

This series migrates the tools to the command based CLI. E.g. instead of

    scylla sstable --validate --merge /path/to/sst1 /path/to/sst2

we now have:

    scylla sstable validate --merge /path/to/sst1 /path/to/sst2

Which makes it much clearer that "validate" is the command and "merge"
is an option. And it just looks better.

Internally the command is parsed and popped from argv manually just as
we do with the tool name in scylla main(). This means we know the
command before even building the
boost::program_options::options_description representation and thus
before creating the seastar::app_template instance. Consequently we can
tailor the options registered and the --help content (the application
description) to the command run.
So now "scylla sstable --help" prints only a general description of the
tool and a list of the supported operations. Invoking "scylla sstable
{operation} --help" will print a detailed description of the operation
along with its specific options. This greatly improves the documentation
and the usability of the tool.
"
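
The approach of popping the command before building the option
description can be sketched with Python's argparse, as an analogy to the
boost::program_options based code. The `OPERATIONS` table, its help
texts and the function names are invented for the example.

```python
import argparse

# Hypothetical per-operation descriptions and options.
OPERATIONS = {
    "validate": ("Validate the sstable(s).",
                 [("--merge", "Process all sstables as one merged stream.")]),
    "dump-data": ("Dump the content of the sstable(s).", []),
}


def split_operation(argv):
    """Pop the operation (command) from the front of argv, if there is one."""
    args = list(argv)
    if args and args[0] in OPERATIONS:
        return args.pop(0), args
    return None, args


def build_parser(operation):
    """Build a parser tailored to the operation, known before any option parsing."""
    if operation is None:
        return argparse.ArgumentParser(
            prog="scylla sstable",
            description="Supported operations: " + ", ".join(sorted(OPERATIONS)))
    description, options = OPERATIONS[operation]
    parser = argparse.ArgumentParser(prog=f"scylla sstable {operation}",
                                     description=description)
    for flag, help_text in options:
        parser.add_argument(flag, action="store_true", help=help_text)
    parser.add_argument("sstables", nargs="*")
    return parser
```

Because the operation is known first, `--help` only has to describe the
options that belong to that operation.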

Refs #9882

* 'tools-command-oriented-cli/v1' of https://github.com/denesb/scylla:
  tools/scylla-sstable: update general description
  tools/scylla-sstable: proper operation-specific --help
  tools/scylla-sstable: proper operation-specific options
  tools/scylla-sstable: s/dump/dump-data/
  tools/utils: remove now unused get_selected_operation() overload
  tools: take operations (commands) as positional arguments
  tools/utils: add positional-argument based overload of get_selected_operation()
  tools: remove obsolete FIXMEs
2022-01-17 17:03:39 +02:00
Botond Dénes
518abe7415 test/lib/mutation_diff: force textual conversion
If the compared mutations have binary keys, `colordiff` will declare the
file as binary and will refuse to compare them, beyond a very unhelpful
"binary files differ" summary. Add "-a" to the command line to force
treating all files as text.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220117131347.106585-1-bdenes@scylladb.com>
2022-01-17 15:27:53 +02:00
Michael Livshin
d7a993043d shard_reader: check that _reader is valid before dereferencing
After fc729a804, `shard_reader::close()` is not interrupted with an
exception any more if read-ahead fails, so `_reader` may in fact be
null.

Fixes #9923

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220117120405.152927-1-michael.livshin@scylladb.com>
2022-01-17 14:39:11 +02:00
Konstantin Osipov
b96f9a3580 migration manager: fix compile error on Ubuntu 20
Thanks to an older boost, there is an ambiguity in
name resolution between
boost::placeholders and std::placeholders.
Message-Id: <20220117094837.653145-2-kostja@scylladb.com>
2022-01-17 12:49:30 +02:00
Pavel Emelyanov
daf686739b redis: Use local storage proxy
The create_keyspace_if_not_exists_impl() gets global instance of
storage proxy, but its only caller (controller) already has it
and can pass it via an argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220117104226.22833-1-xemul@scylladb.com>
2022-01-17 12:44:22 +02:00
Benny Halevy
17e006106b token_metadata: update_normal_tokens: avoid unneeded sort when token ownership doesn't change
Currently, we first delete all existing token mappings
for the endpoint from _token_to_endpoint_map and then
we add all updated token mappings for it and set should_sort_tokens
if the token is newly inserted, but since we removed all
existing mappings for the endpoint unconditionally, we
will sort the tokens even if the token existed and
its ownership did not change.

This is worthwhile since there are scenarios where
none of the token ownership changes. Searching and
erasing tokens from the tokens unordered_set runs
at constant time on average so doing it for n tokens
is O(n), while sorting the tokens is O(n*log(n)).
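
A simplified Python sketch of the idea (an illustration, not the actual
token_metadata code): tokens whose ownership is unchanged are dropped
from the working set, and a re-sort is requested only when the set of
tokens in the map really changes. The function name mirrors the C++ one,
but the signature is invented.

```python
def update_normal_tokens(token_to_endpoint, endpoint, tokens):
    """Update the map in place; return True if the token list must be re-sorted."""
    remaining = set(tokens)
    should_sort = False
    for token in list(token_to_endpoint):
        if token_to_endpoint[token] != endpoint:
            continue
        if token in remaining:
            # Ownership did not change: O(1) average-time erase, no sort needed.
            remaining.discard(token)
        else:
            # The endpoint no longer owns this token.
            del token_to_endpoint[token]
            should_sort = True
    for token in remaining:
        if token not in token_to_endpoint:
            should_sort = True  # a token that was not in the map before
        token_to_endpoint[token] = endpoint
    return should_sort
```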

Test: unit(dev)
DTest: replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap(dev,debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-2-bhalevy@scylladb.com>
2022-01-17 12:18:42 +02:00
Benny Halevy
25977db7b4 token_metadata: remove update_normal_token entry point
It's currently used only by unit tests
and it is dangerous to use on a populated token_metadata
as update_normal_tokens assumes that the set of tokens
owned by the given endpoint is complete, i.e. previous
tokens owned by the endpoint are no longer owned by it,
but the single-token update_normal_token interface
seems cumulative (and has no documentation whatsoever).

It is better to remove this interface and calculate a
complete map of endpoint->tokens from the tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-1-bhalevy@scylladb.com>
2022-01-17 12:18:42 +02:00
Nadav Har'El
8fd5041092 cql: INSERT JSON should refuse empty-string partition key
Add the missing partition-key validation in INSERT JSON statements.

Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).

Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.

The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.

Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
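
The shape of the bug and the fix can be sketched in Python as follows.
This is an illustration only: the classes are stand-ins for the C++
statement classes, and the 65535-byte limit is an assumption made for
the example, not a value taken from this patch.

```python
MAX_KEY_LENGTH = 65535  # assumed limit, for illustration only


class InvalidRequest(Exception):
    pass


def validate_partition_key(key: bytes) -> None:
    if len(key) == 0:
        raise InvalidRequest("Key may not be empty")
    if len(key) > MAX_KEY_LENGTH:
        raise InvalidRequest(f"Key length of {len(key)} is longer than maximum of {MAX_KEY_LENGTH}")


class ModificationStatement:
    def build_partition_keys(self, key: bytes):
        validate_partition_key(key)
        return [key]


class InsertJsonStatement(ModificationStatement):
    def build_partition_keys(self, key: bytes):
        # The override must repeat the validation; forgetting this call
        # is the loophole described above.
        validate_partition_key(key)
        return [key]
```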

This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.

Reported by @FortTell
Fixes #9853.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
2022-01-17 09:46:18 +01:00
Calle Wilund
97bb1be6f7 utils::is_timeout_exception: Ensure we handle nested exception types
Fixes #9922

storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (because bit rot and
intermixing), by wrapping exception for information purposes.

This adds check of nested types in exception handling, as well as a test
for the routine itself.
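
The nested check can be sketched in Python, as an analogy to the C++
std::nested_exception handling. The built-in `TimeoutError` stands in
for the timeout exception types recognized by the real routine.

```python
def is_timeout_exception(exc):
    """Return True if exc, or any exception it wraps, is a timeout."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, TimeoutError):
            return True
        seen.add(id(exc))  # guard against cycles in the chain
        # Follow the explicit cause first, then the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False
```

A timeout wrapped by `raise RuntimeError("info") from timeout_error` is
still recognized, which is the case that was broken.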
2022-01-17 08:43:41 +00:00
Avi Kivity
985403ab99 view: convert build_progress_virtual_reader to flat_mutation_reader_v2
build_progress_virtual_reader is a virtual reader that trims off
the last clustering key column from an underlying base table. It
is here converted to flat_mutation_reader_v2.

Because range_tombstone_change uses position_in_partition, not
clustering_key_prefix, we need a new adjust_ckey() overload.

Note the transformation is likely incorrect. When trimming the
last clustering key column, an inclusive bound should
change to exclusive. However, the original code did not do this,
so we don't fix it here. It's immaterial anyway since the base
table doesn't include range tombstones.

Test: unit (dev)   (which has a test for this reader)

Closes #9913
2022-01-17 10:31:37 +02:00
Gleb Natapov
2aedf79152 idl-compiler: remove no longer used variable
types_with_const_appearances is no longer used. Remove it.

Message-Id: <YeUnoZXNcW0AdWWK@scylladb.com>
2022-01-17 10:30:30 +02:00
Nadav Har'El
82005b91b6 test/cql-pytest: really flush() in translated Cassandra tests
Some of the CQL tests translated from Cassandra into the test/cql-pytest
framework used the flush() function to force a flush to sstables -
presumably because this exercised yet another code path, or because it
reproduced bugs that Cassandra once had that were only visible when
reading from sstables - not from memtables.

Until now, this flush() function was stubbed and did nothing.
But we do have in test/cql-pytest a flush() implementation in
nodetool.py - which uses the REST API if possible and if not (e.g., when
running against Cassandra) uses the external "nodetool" command.
So in this patch flush() starts to use nodetool.flush() instead of
doing nothing.

The tests continue to pass as before after this patch, and there is no
noticeable slowdown (the flush does take time, but the few times it's
done is negligible in these tests).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220117073112.83994-1-nyh@scylladb.com>
2022-01-17 10:22:04 +02:00
Gleb Natapov
d65427ad81 thrift: correctly check for keyspace existence
d9c315891a broke the check for keyspace
existence. The condition is opposite. Fix it.

Fixes #9927

Message-Id: <YeUhtESDHQeMHiUW@scylladb.com>
2022-01-17 10:20:48 +02:00
Avi Kivity
fec0c09756 Merge "Convert scrub and validation to v2" from Botond
"
As a prerequisite the mutation fragment stream validator is converted to
v2 as well (but it still supports v1). We get one step closer to
eliminate conversions altogether from compaction.cc.

Tests: unit(dev)
"

* 'scrub-v2/v1' of https://github.com/denesb/scylla:
  mutation_writer: remove v1 version segregate_by_partition()
  compaction/compaction: remove v1 version of validate and scrub reader factory methods
  tools/scylla-sstable: migrate to v2
  test/boost/sstable_compaction_test: migrate validation tests to v2
  test/boost/sstable_compaction_test: migrate scrub tests to v2
  test/lib/simple_schema: add v2 of make_row() and make_static_row()
  compaction: use v2 version of mutation_writer::segregate_by_partition()
  mutation_writer: add v2 version of segregate_by_partition()
  compaction: migrate scrub and validate to v2
  mutation_fragment_stream_validator: migrate validator to v2
2022-01-16 18:25:07 +02:00
Avi Kivity
52b7778ae6 Merge "repair: make sure there is one permit per repair with count res" from Botond
"
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure concurrency limit is respected. However when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, creating a deadlock situation.
This patch solves this by dropping the count resource on the
repair-meta's permit when a non-local reader path is executed -- that is
a multishard reader is created.

Fixes: #9751
"

* 'repair-double-permit-block/v4' of https://github.com/denesb/scylla:
  repair: make sure there is one permit per repair with count res
  reader_permit: add release_base_resource()
2022-01-16 18:22:29 +02:00
Nadav Har'El
a30e71e27a alternator: doc, test: fix mentions of reverse queries
Now that issues #7586 and #9487 were fixed, reverse queries - even in
long partitions - work well, we can drop the claim in
alternator/docs/compatibility.md that reverse queries are buggy for
large partitions.

We can also remove the "xfail" mark from the test that checks this
feature, as it now passes.

Refs #7586
Refs #9487

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #9831
2022-01-16 17:46:26 +02:00
Gleb Natapov
dc886d96d1 idl-compiler: update the documentation with new features added recently
The series to move storage_proxy verbs to the IDL added new features to
the IDL compiler, but was lacking documentation. This patch documents
the features.
2022-01-16 15:12:07 +02:00
Mikołaj Sielużycki
f6d9d6175f sstables: Harden bad_alloc handling during memtable flush.
dirty_memory_manager monitors memory and triggers memtable flushing if
there is too much pressure. If bad_alloc happens during the flush, it
may break the loop and flushes won't be triggered automatically, leading
to blocked writes as memory won't be automatically released.

The solution is to add exception handling to the loop, so that the inner
part always returns a non-exceptional future (meaning the loop will
break only on node shutdown).

try/catch is used around on_internal_error instead of
on_internal_error_noexcept, as the latter doesn't have a version that
accepts an exception pointer. To get the exception message from
std::exception_ptr a rethrow is needed anyway, so this was a simpler
approach.

Fixes: #4174

Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>
2022-01-14 16:09:21 +02:00
Botond Dénes
b6828e899a Merge "Postpone reshape of SSTables created by repair" from Raphael
"
SSTables created by repair will potentially not conform to the
compaction strategy layout goal. If node shuts down before off-strategy
has a chance to reshape those files, node will be forced to reshape them
on restart. That causes unexpected downtime. Turns out we can skip
reshape of those files on boot, and allow them to be reshaped after node
becomes online, as if the node never went down. Those files will go
through same procedure as files created by repair-based ops. They will
be placed in maintenance set, and be reshaped iteratively until ready
for integration into the main set.
"

Fixes #9895.

tests: UNIT(dev).

* 'postpone_reshape_on_repair_originated_files' of https://github.com/raphaelsc/scylla:
  distributed_loader: postpone reshape of repair-originated sstables
  sstables: Introduce filter for sstable_directory::reshape
  table: add fast path when offstrategy is not needed
  sstables: add constant for repair origin
2022-01-14 14:05:09 +02:00
Botond Dénes
c727360eca db: convert data listeners to v2
To remove yet another back-and-forth conversion in
table::make_reader_v2().

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114085551.565752-1-bdenes@scylladb.com>
2022-01-14 13:57:44 +02:00
Avi Kivity
4995179c6f Merge "Use data_dictionary in client_state and validation" from Pavel E
"
The main motivation for the set is to expel query_processor.proxy().local_db()
calls from cql3/statements code. The only places that still use q.p. like this
are those calling client_state::has_..._access() checkers. Those checks can
go with the data_dictionary which is already available on the query processor.
This is the continuation of the 9643f84d ("Eliminate direct storage_proxy usage
from cql3 statements") patch set.

As a side effect the validation/ code, that's called from has_..._access checks,
is also converted to use data_dictionary.

tests: unit(dev, debug)
"

* 'br-cql3-dictionary' of https://github.com/xemul/scylla:
  validation: Make validate_column_family use data_dictionary::database
  client_state: Make has_access use data_dictionary::database
  client_state: Make has_schema_access use data_dictionary::database
  client_state: Make has_column_family_access use data_dictionary::database
  client_state: Make has_keyspace_access use data_dictionary::database
2022-01-14 13:55:22 +02:00
Raphael S. Carvalho
ae3b589f12 table: Reduce off-strategy space requirement if multiple compaction rounds are required
Off-strategy compaction works by iteratively reshaping the maintenance set
until it's ready for integration into the main set. As repair-based ops
produce disjoint sstables only, off-strategy compaction can complete
the reshape in a single round.
But if reshape ends up requiring more than one round, space requirement
for off-strategy to succeed can be high. That's because we're only
deleting input SSTables on completion. SSTables from maintenance set
can be only deleted on completion as we can only merge maintenance
set into main one once we're done reshaping[1]. But an SSTable that was
created by a reshape and later used as an input in another reshape can
be deleted immediately as its existence is not needed anywhere.

[1] We don't update maintenance set after each reshape round, because that
would mess with its disjointness. We also don't iteratively merge
maintenance set into main set, as the data produced by a single round
is potentially not ready for integration into main set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220111202950.111456-1-raphaelsc@scylladb.com>
2022-01-14 13:46:31 +02:00
Botond Dénes
3005b9b5f8 Merge "move raft verbs to the IDL" from Gleb Natapov
"
The series moves raft verbs to the IDL and also fixes some verbs to be one
way like they were intended to be.
"

* 'gleb/raft-idl' of github.com:scylladb/scylla-dev:
  raft service: make one way raft messages truly one way
  raft: move raft verbs to the IDL
  raft: split idl to rpc and storage
  idl-compiler: always produce const variant of serializers
  raft: simplify raft idl definitions
2022-01-14 13:40:20 +02:00
Pavel Emelyanov
00de5f4876 validation: Make validate_column_family use data_dictionary::database
And instantly convert the validate_keyspace() as it's not called
from anywhere but the validate_column_family().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-14 13:00:53 +03:00
Pavel Emelyanov
71c3a7525b client_state: Make has_access use data_dictionary::database
This db argument is only needed to be pushed into
cdc::is_log_for_some_table() helper. All callers already have
the d._d.::database at hands and convert it into .real_database()
call-time, so this patch effectively generalizes those calls to
the .real_database().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-14 12:59:35 +03:00
Pavel Emelyanov
f22eb22b8b client_state: Make has_schema_access use data_dictionary::database
It's now called with d._d.::database converted to .real_database()
right in the argument passing, so this change can be treated as
the generalization of that .real_database() call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-14 12:55:53 +03:00
Pavel Emelyanov
b6bc7a9b29 client_state: Make has_column_family_access use data_dictionary::database
Straightforward replacement. Internals of the has_column_family_access()
temporarily get .real_database(), but it will be changed soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-14 12:55:15 +03:00
Pavel Emelyanov
1ed237120a client_state: Make has_keyspace_access use data_dictionary::database
Straightforward replacement. Internals of the has_keyspace_access()
temporarily get .real_database(), but it will be changed soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-14 12:54:01 +03:00
Botond Dénes
3ce526082f mutation_writer: remove v1 version segregate_by_partition() 2022-01-14 10:19:56 +02:00
Botond Dénes
a7f4ab6b14 compaction/compaction: remove v1 version of validate and scrub reader factory methods 2022-01-14 10:19:56 +02:00
Botond Dénes
77c9f252a1 tools/scylla-sstable: migrate to v2 2022-01-14 08:54:26 +02:00
Botond Dénes
74d3a9223c test/boost/sstable_compaction_test: migrate validation tests to v2 2022-01-14 08:54:26 +02:00
Botond Dénes
0e1bdca71b test/boost/sstable_compaction_test: migrate scrub tests to v2 2022-01-14 08:54:26 +02:00
Botond Dénes
da0c5adcc3 test/lib/simple_schema: add v2 of make_row() and make_static_row() 2022-01-14 08:54:26 +02:00
Botond Dénes
d57634ad46 compaction: use v2 version of mutation_writer::segregate_by_partition() 2022-01-14 08:54:26 +02:00
Botond Dénes
e772326b10 mutation_writer: add v2 version of segregate_by_partition()
Just a facade using converters behind the scenes. The actual segregator
is not worth migrating to v2 while mutation and the flushing readers
don't have v2 versions. Still, migrating all users to a v2 API allows
the conversion to happen at a single point where more work is necessary,
instead of scattered around all the users.
We leave the v1 version in place to aid incremental migration to the v2
one.
2022-01-14 08:54:26 +02:00
Botond Dénes
b315d17c2a compaction: migrate scrub and validate to v2
We add v2 version of external API but leave the old v1 in place to help
incremental migration. The implementation is migrated to v2.
2022-01-14 08:54:26 +02:00
Botond Dénes
f61fcfbada mutation_fragment_stream_validator: migrate validator to v2
Add support for validating v2 streams while still keeping the v1
support. Since the underlying logic is largely independent of the format
version, this is simple to do and will allow incremental migration of
users.
2022-01-14 08:54:26 +02:00
Kamil Braun
168c6f47f9 replica: database: allow disabling optimized TWCS queries through compaction strategy options
As requested from field engineering, add a way to disable
the optimized TWCS query algorithm (use regular query path)
just in case a bug or a performance regression shows up in
production.

To disable the optimized query path, add
'enable_optimized_twcs_queries': 'false' to compaction strategy options,
e.g.
```
alter table ks.t with compaction =
    {'class': 'TimeWindowCompactionStrategy',
     'enable_optimized_twcs_queries': 'false'};
```

Setting the `enable_optimized_twcs_queries` key to anything other than
`'false'` (note: a boolean `false` expands to a string `'false'`) or
skipping it (re)enables the optimized query path.
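
The rule for interpreting the option can be written down as a short
Python sketch (illustrative only; the helper name is made up and the
comparison is an exact string match, as described above):

```python
def optimized_twcs_queries_enabled(compaction_options):
    """Only the exact string 'false' disables the optimized query path."""
    return compaction_options.get("enable_optimized_twcs_queries", "true") != "false"
```

So a missing key, `'true'`, or any other value keeps the optimized path
enabled.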

Note: the flag can be set in a cluster in the middle of upgrade. Nodes
which do not understand it simply ignore it, but they do store it in
their schema tables (they store the entire `compaction` map). After
these nodes are upgraded, they will understand the flag and act
accordingly.

Note: in the situation above, some nodes may use the optimized path and
some may use the regular path. This may happen also in a fully upgraded
cluster when compaction options are changed concurrently to reads;
there is a short period of time where the schema change propagates and
some nodes got the flag but some didn't.

This should not be a problem since the optimization does not change the
returned read results (unless there is a bug).

Generally, the flag is not intended for normal use, but for field
engineers to disable it in case of a serious problem.

Ref #6418.

Closes #9900
2022-01-14 07:10:02 +02:00
Kamil Braun
4c3fb9ac68 conf: update description of reversed_reads_auto_bypass_cache in scylla.yaml
Message-Id: <20220111123937.10750-1-kbraun@scylladb.com>
2022-01-13 23:49:01 +01:00
Kamil Braun
fe0366f6bc cdc: check_and_repair_cdc_streams: fix indentation 2022-01-13 23:10:18 +02:00
Juliusz Stasiewicz
ea46439858 cdc: check_and_repair_cdc_streams: regenerate if too many streams are present
If the number of streams exceeds the number of token ranges
it indicates that some spurious streams from decommissioned
nodes are present.

In such a situation - simply regenerate.

Fixes #9772

Closes #9780
2022-01-13 23:10:18 +02:00
Nadav Har'El
a0cad9585f merge: move tests to use new schema announcement API
Merged patch series from Gleb Natapov:

The series moves tests to use new schema announcement API and removes
the old one.

Gleb Natapov (7):
  test: convert database_test to new schema announcement api
  test use new schema announcement api in cql_test_env.cc
  test: move cql_query_test.cc to new schema announcement api
  test: move memtable_test.cc to new schema announcement api
  test: move schema_change_test.cc to new schema announcement api
  migration_manager: drop unused announce_ functions
  migration_manager: assert that raft ops are done on shard 0

 service/migration_manager.hh     |  5 ---
 service/migration_manager.cc     | 52 ++++++++------------------------
 test/boost/cql_query_test.cc     |  3 +-
 test/boost/database_test.cc      |  5 +--
 test/boost/memtable_test.cc      |  2 +-
 test/boost/schema_change_test.cc | 18 ++++++-----
 test/lib/cql_test_env.cc         |  2 +-
 7 files changed, 31 insertions(+), 56 deletions(-)
2022-01-13 23:10:18 +02:00
Gleb Natapov
0169e4d7ed migration_manager: assert that raft ops are done on shard 0
Now that all consumers run on shard zero we can assert it.
2022-01-13 23:10:18 +02:00
Gleb Natapov
1ff85020b5 migration_manager: drop unused announce_ functions 2022-01-13 23:10:18 +02:00
Gleb Natapov
f0a41c102a test: move schema_change_test.cc to new schema announcement api 2022-01-13 23:10:18 +02:00
Gleb Natapov
512556914a test: move memtable_test.cc to new schema announcement api 2022-01-13 23:10:13 +02:00
Botond Dénes
d6efe27545 Merge 'db: config: add a flag to disable new reversed reads algorithm' from Kamil Braun
Just in case the new algorithm turns out to be buggy, or give a
performance regression, add a flag to fall-back to the old algorithm for
use in the field.

Closes #9908

* github.com:scylladb/scylla:
  db: config: add a flag to disable new reversed reads algorithm
  replica: table: remove obsolete comment about reversed reads
2022-01-13 23:09:02 +02:00
Gleb Natapov
be46109af6 test: move cql_query_test.cc to new schema announcement api 2022-01-13 23:09:02 +02:00
Avi Kivity
63d254a8d2 Merge 'gms, service: futurize and coroutinize gossiper-related code' from Pavel Solodovnikov
This series greatly reduces gossipers' dependence on `seastar::async` (yet, not completely).

`i_endpoint_state_change_subscriber` callbacks are converted to return futures (again, to get rid of `seastar::async` dependency), all users are adjusted appropriately (e.g. `storage_service`, `cdc::generation_service`, `streaming::stream_manager`, `view_update_backlog_broker` and `migration_manager`).
This includes futurizing and coroutinizing the whole function call chain up to the `i_endpoint_state_change_subscriber` callback functions.

To aid the conversion process, a non-`seastar::async` dependent variant of `utils::atomic_vector::for_each` is introduced (`for_each_futurized`). A different name is used to clearly distinguish converted and non-converted code, so that the last step (remove `seastar::async()` wrappers around callback-calling code in gossiper) is easier. This is left for a follow-up series, though.

Tests: unit(dev)

Closes #9844

* github.com:scylladb/scylla:
  service: storage_service: coroutinize `set_gossip_tokens`
  service: storage_service: coroutinize `leave_ring`
  service: storage_service: coroutinize `handle_state_left`
  service: storage_service: coroutinize `handle_state_leaving`
  service: storage_service: coroutinize `handle_state_removing`
  service: storage_service: coroutinize `do_drain`
  service: storage_service: coroutinize `shutdown_protocol_servers`
  service: storage_service: coroutinize `excise`
  service: storage_service: coroutinize `remove_endpoint`
  service: storage_service: coroutinize `handle_state_replacing`
  service: storage_service: coroutinize `handle_state_normal`
  service: storage_service: coroutinize `update_peer_info`
  service: storage_service: coroutinize `do_update_system_peers_table`
  service: storage_service: coroutinize `update_table`
  service: storage_service: coroutinize `handle_state_bootstrap`
  service: storage_service: futurize `notify_*` functions
  service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges`
  repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair`
  locator: reconnectable_snitch_helper: coroutinize `reconnect`
  gms: i_endpoint_state_change_subscriber: make callbacks to return futures
  utils: atomic_vector: introduce future-returning `for_each` function
  utils: atomic_vector: rename `for_each` to `thread_for_each`
  gms: gossiper: coroutinize `start_gossiping`
  gms: gossiper: coroutinize `force_remove_endpoint`
  gms: gossiper: coroutinize `do_status_check`
  gms: gossiper: coroutinize `remove_endpoint`
2022-01-13 23:09:02 +02:00
Gleb Natapov
100b44f5ff test use new schema announcement api in cql_test_env.cc 2022-01-13 23:09:02 +02:00
Avi Kivity
230eac439e Update seastar submodule
* seastar ae8d1c28a2...5025cd44ea (2):
  > Merge "Lazy IO capacity replenishment" from Pavel E
Fixes #9893
  > configure.py: don't use deprecated mktemp()
2022-01-13 23:09:02 +02:00
Gleb Natapov
5dffc8ed3e test: convert database_test to new schema announcement api 2022-01-13 23:09:02 +02:00
Gleb Natapov
c500a90902 raft service: make one way raft messages truly one way
Raft core does not expect replies for most messages it sends, but they
are defined as two way by the IDL currently. Fix them to be one way.
2022-01-13 13:14:46 +02:00
Gleb Natapov
b1fea20d36 raft: move raft verbs to the IDL 2022-01-13 13:14:46 +02:00
Gleb Natapov
8a25b740df raft: split idl to rpc and storage
Storage uses only small part of the IDL, so it can include only the part
that is relevant to it.
2022-01-13 13:14:46 +02:00
Gleb Natapov
b0dee71b34 idl-compiler: always produce const variant of serializers
Currently const variant is produced only if a type and its const usage
are in the same idl file, but a type can be defined in one file and used
as const in another.
2022-01-13 13:14:46 +02:00
Gleb Natapov
c5474f9ac2 raft: simplify raft idl definitions
We may use high level types in the IDL.
2022-01-13 13:14:46 +02:00
Nadav Har'El
f842f65794 Merge 'thrift: switch replica::database uses to data_dictionary' from Avi Kivity
replica::database is (as its name indicates) a replica-side service, while thrift
is coordinator-side. Convert thrift's use of replica::database for data dictionary
lookups to the data_dictionary module. Since data_dictionary was missing a
get_keyspaces() operation, add that.

Thrift still uses replica::database to get the schema version. That should be
provided by migration_manager, but changing that is left for later.

Closes #9888

* github.com:scylladb/scylla:
  thrift: switch from replica module to data_dictionary module
  thrift: simplify execute_schema_command() calling convention
  data_dictionary: add get_keyspaces() method
2022-01-13 10:52:30 +02:00
Nadav Har'El
343c521e28 alternator: avoid large contiguous allocation in BatchGetItem
The BatchGetItem request can return a very large response - according to
DynamoDB documentation up to 16 MB, but presently in Alternator, we allow
even more (see #5944).

The problem is that the existing code prepares the entire response as
a large contiguous string, resulting in oversized allocation warnings -
and potentially allocation failures. So in this patch we estimate the size
of the BatchGetItem response, and if it is "big enough" (currently over
100 KB), we return it with the recently added streaming output support.

This streaming output doesn't avoid the extra memory copies unfortunately,
but it does avoid a *contiguous* allocation which is the goal of this
patch.
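The size-threshold decision described above can be sketched in Python. This is only an illustration of the idea, not Alternator's C++ code: the names `STREAMING_THRESHOLD`, `stream_json_items` and `make_response` are hypothetical, and the size estimate is assumed to be supplied by the caller.

```python
import json
from typing import Iterator, Union

# Hypothetical constant; the commit message cites "currently over 100 KB".
STREAMING_THRESHOLD = 100 * 1024  # bytes


def stream_json_items(items: list) -> Iterator[str]:
    """Yield a JSON array piece by piece instead of as one contiguous string."""
    yield "["
    for i, item in enumerate(items):
        if i:
            yield ","
        yield json.dumps(item)
    yield "]"


def make_response(items: list, estimated_size: int) -> Union[str, Iterator[str]]:
    """Return a plain string for small responses, a chunk iterator for large ones."""
    if estimated_size > STREAMING_THRESHOLD:
        return stream_json_items(items)
    return json.dumps(items)
```

As in the commit message, the streamed variant still copies the data; what it avoids is holding the whole response in one contiguous buffer.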

After this patch, one oversized allocation warning is gone from the test:

    test/alternator/run test_batch.py::test_batch_get_item_large

(a second oversized allocation is still present, but comes from the
unrelated BatchWriteItem issue #8183).

Fixes #8522

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220111170541.637176-1-nyh@scylladb.com>
2022-01-13 09:46:08 +01:00
Kamil Braun
e98711cfcb db: config: add a flag to disable new reversed reads algorithm
Just in case the new algorithm turns out to be buggy, or causes a
performance regression, add a flag to fall back to the old algorithm for
use in the field.
2022-01-12 18:59:19 +01:00
Avi Kivity
6205d40d5f thrift: switch from replica module to data_dictionary module
Thrift is a coordinator-side service and should not touch the replica
module. Switch it to data_dictionary.

The switch is straightforward with two exceptions:
 - client_state still receives replica::database parameters. After
   this change it will be easier to adapt client_state too.
 - calls to replica::database::get_version() remain. They should be
   rerouted to migration_manager instead, as that deals with schema
   management.
2022-01-12 19:54:38 +02:00
Kamil Braun
7fb7a406e7 replica: table: remove obsolete comment about reversed reads 2022-01-12 17:57:08 +01:00
Avi Kivity
85061b694b thrift: simplify execute_schema_command() calling convention
execute_schema_command is always called with the same first two
parameters, which are always defined from the thrift_handler
instance that contains its caller. Simplify it by making it a member
function.

This simplifies migration to data_dictionary in the next patch.
2022-01-12 18:56:47 +02:00
Avi Kivity
631a19884d data_dictionary: add get_keyspaces() method
Mirroring replica::database::get_keyspaces(), for Thrift's use.

We return a vector instead of a hash map. Random access is already
available via database::find_keyspace(). The name is available
via the keyspace metadata, and in fact Thrift ignores the map
name and uses the metadata name. Using a simpler type reduces
include dependencies for this heavily used module.

The function is plumbed to replica::database::get_keyspaces() so
it returns the same data.
2022-01-12 18:24:38 +02:00
Raphael S. Carvalho
a144d30162 distributed_loader: postpone reshape of repair-originated sstables
SSTables created by repair will potentially not conform to the compaction strategy
layout goal. If node shuts down before off-strategy has a chance to
reshape those files, node will be forced to reshape them on restart. That
causes unexpected downtime. Turns out we can skip reshape of those files
on boot, and allow them to be reshaped after node becomes online, as if
the node never went down. Those files will go through same procedure as
files created by repair-based ops. They will be placed in maintenance set,
and be reshaped iteratively until ready for integration into the main set.

Fixes #9895.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-12 13:14:31 -03:00
Nadav Har'El
8bcd23fa02 Merge: move rest of internal ddl users to use raft from Gleb
The patch series moves the rest of internal ddl users to do schema
change over raft (if enabled). After that series only tests are left
using old API.

* 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits)
  migration_manager: drop no longer used functions
  system_distributed_keyspace: move schema creation code to use raft
  auth: move table creation code to use raft
  auth: move keyspace creation code to use raft
  table_helper: move schema creation code to use raft
  cql3: make query_processor inherit from peering_sharded_service
  table_helper: make setup_table() static
  table_helper: co-routinize setup_keyspace()
  redis: move schema creation code to go through raft
  thrift: move system_update_column_family() to raft
  thrift: authenticate a statement before verifying in system_update_column_family()
  thrift: co-routinize system_update_column_family()
  thrift: move system_update_keyspace() to raft
  thrift: authenticate a statement before verifying in system_update_keyspace()
  thrift: co-routinize system_update_keyspace()
  thrift: move system_drop_keyspace() to raft
  thrift: authenticate a statement before verifying in system_drop_keyspace()
  thrift: co-routinize system_drop_keyspace()
  thrift: move system_add_keyspace() to raft
  thrift: co-routinize system_add_keyspace()
  ...
2022-01-12 18:09:08 +02:00
Raphael S. Carvalho
f9e33f7046 sstables: Introduce filter for sstable_directory::reshape
This will be useful to allow sstable_directory user to filter out
sstables that should not be reshaped. The default filter is
implemented as including everything.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-12 11:54:17 -03:00
Gleb Natapov
2aec9009ef migration_manager: drop no longer used functions 2022-01-12 16:40:06 +02:00
Gleb Natapov
9ce62bcc33 system_distributed_keyspace: move schema creation code to use raft 2022-01-12 16:40:06 +02:00
Gleb Natapov
50b7806c57 auth: move table creation code to use raft 2022-01-12 16:40:06 +02:00
Gleb Natapov
4273a3308c auth: move keyspace creation code to use raft 2022-01-12 16:40:06 +02:00
Gleb Natapov
03184bd786 table_helper: move schema creation code to use raft 2022-01-12 16:40:06 +02:00
Gleb Natapov
eb62e81843 cql3: make query_processor inherit from peering_sharded_service
This way we can get to a distributed object from a shard-local one.
2022-01-12 16:40:06 +02:00
Gleb Natapov
e2a29d9239 table_helper: make setup_table() static
It will make it easier to move schema creation to shard 0.
2022-01-12 16:40:06 +02:00
Gleb Natapov
3995f75b30 table_helper: co-routinize setup_keyspace()
Also replace open-coded loops with more modern c++ alternatives.
2022-01-12 16:40:05 +02:00
Gleb Natapov
5b4982d01f redis: move schema creation code to go through raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
dd36150a7d thrift: move system_update_column_family() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
bcfdcc51d6 thrift: authenticate a statement before verifying in system_update_column_family()
Otherwise it is possible to infer if a table exists without having proper
credentials.
2022-01-12 16:33:16 +02:00
Gleb Natapov
aec413d0f7 thrift: co-routinize system_update_column_family() 2022-01-12 16:33:16 +02:00
Gleb Natapov
d9c315891a thrift: move system_update_keyspace() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
7ffbdde554 thrift: authenticate a statement before verifying in system_update_keyspace()
Otherwise it is possible to infer if a table exists without having proper
credentials.
2022-01-12 16:33:16 +02:00
Gleb Natapov
1b4538f5bd thrift: co-routinize system_update_keyspace() 2022-01-12 16:33:16 +02:00
Gleb Natapov
64b8f4fe50 thrift: move system_drop_keyspace() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
52fc815f24 thrift: authenticate a statement before verifying in system_drop_keyspace()
Otherwise it is possible to infer if a table exists without having proper
credentials.
2022-01-12 16:33:16 +02:00
Gleb Natapov
45ff7e30a1 thrift: co-routinize system_drop_keyspace() 2022-01-12 16:33:16 +02:00
Gleb Natapov
a17f82c647 thrift: move system_add_keyspace() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
3a3a3f693e thrift: co-routinize system_add_keyspace() 2022-01-12 16:33:16 +02:00
Gleb Natapov
845b617256 thrift: move system_drop_column_family() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
9b6a9b104e thrift: co-routinize system_drop_column_family() 2022-01-12 16:33:16 +02:00
Gleb Natapov
7cfedb50bb thrift: move system_add_column_family() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
e4ac3c2777 thrift: authenticate a statement before verifying in system_add_column_family()
Otherwise it is possible to infer if a table exists without having proper
credentials.
2022-01-12 16:33:16 +02:00
Gleb Natapov
d5f14306d0 thrift: co-routinize system_add_column_family() 2022-01-12 16:33:16 +02:00
Gleb Natapov
1491cc2906 alternator: move create_table() to raft 2022-01-12 16:33:16 +02:00
Gleb Natapov
0cd6d283ad alternator: move update_table() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
7ee39ff94b alternator: move validation in update_table() to the beginning 2022-01-12 16:33:15 +02:00
Gleb Natapov
740b2181e1 alternator: move update_tags() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
57be1b773e alternator: move delete_table() to raft 2022-01-12 16:33:15 +02:00
Gleb Natapov
0ac20b5494 alternator: make some functions static
Make add_stream_options, supplement_table_info, supplement_table_stream_info static. They only need a pointer
to storage_proxy, so pass it directly.
2022-01-12 16:33:15 +02:00
Gleb Natapov
2e4a8bdfaa alternator: co-routinize delete_table() 2022-01-12 16:33:15 +02:00
Gleb Natapov
459539e812 migration_manager: do not allow creating keyspace with arbitrary timestamp
This was needed to fix issue #2129 which only manifested itself with
auto_bootstrap set to false. The option is ignored now and we always
wait for schema to sync during boot.
2022-01-12 16:33:15 +02:00
Botond Dénes
bdcbf3f71b Merge 'database: Add error message with mutation info on commit log apply failure' from Calle Wilund
Fixes #9408

While it is rare, some customer issues have shown that we can run into cases where commit log apply (writing mutations to it) fails badly. In the known cases, due to oversized mutations. While these should have been caught earlier in the call chain really, it would probably help both end users and us (trying to figure out how they got so big and how they got so far) if we added info to the errors thrown (and printed), such as ks, cf, and mutation content.

Somewhat controversial, this makes the apply with CL decision path coroutinized, mainly to make the error handling for the more informative wrapper exception easier/less ugly. Could perhaps do with futurize_invoke + then_wrapper also. But future is coroutines...

This is as stated somewhat problematic, it adds an allocation to perf_simple_query::write path (because of crap clang cr frame folding?). However, tasks/op remain constant and actual tps (though unstable) remain more or less the same (on my crappy measurements).

Counter path is unaffected, as coroutine frame alloc replaces with(...)

dtest for the wrapped exception on separate pr.

Closes #9412

* github.com:scylladb/scylla:
  database: Add error message with mutation info on commit log apply failure
  database: coroutinize do_apply and apply_with_commitlog
2022-01-12 16:16:29 +02:00
Raphael S. Carvalho
6aa221a247 table: add fast path when offstrategy is not needed
If there's nothing in maintenance set, then there's no need to submit
an offstrategy request to manager.
2022-01-12 11:15:54 -03:00
Raphael S. Carvalho
34be8842ad sstables: add constant for repair origin
Make comparisons easy and avoid duplication

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-12 11:13:58 -03:00
Calle Wilund
a6202ae079 database: Add error message with mutation info on commit log apply failure
Fixes #9408

While it is rare, some customer issues have shown that we can run into cases
where commit log apply (writing mutations to it) fails badly. In the known
cases, due to oversized mutations. While these should have been caught earlier
in the call chain really, it would probably help both end users and us (trying
to figure out how they got so big and how they got so far) if we added info
to the errors thrown (and printed), such as ks, cf, and mutation content.
2022-01-12 14:04:23 +00:00
Calle Wilund
63ea666ca0 database: coroutinize do_apply and apply_with_commitlog
Somewhat controversial. Making the apply with CL decision path
coroutinized, mainly to be able to in next patch make error handling
more informative (because we will have exceptions that are immediate
and/or futurized).

This is as stated somewhat problematic, it adds an allocation to
perf_simple_query::write path (because of crap clang cr frame folding?).
However, tasks/op remain constant and actual tps (though unstable)
remain more or less the same (on my crappy measurements).

Counter path is unaffected, as coroutine frame alloc replaces with(...)
alloc, and all is same and dandy.

I am hoping that the simpler error + verbose code will compensate for
the extra alloc.
2022-01-12 14:04:15 +00:00
Nadav Har'El
23e93a26b3 Merge 'Alternator: stream results + chunk results to remove large allocations' from Calle Wilund
Refs: #9555

When running the "Kraken" dynamodb streams test to provoke the issue observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB.

This tries to address the first issue, partly by making query_result_view serialization using chunked vector instead of linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire.
Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so...

This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-)

Closes #9713

* github.com:scylladb/scylla:
  alternator::executor: Use streamed result for scan etc if large result
  alternator::streams: Use streamed result in get_records if large result
  executor/server: Add routine to make stream object return
  rjson: Add print to stream of rjson::value
  query_idl: Make qr_partition::rows/query_result::partitions chunked
2022-01-12 15:53:31 +02:00
Calle Wilund
f73ca9659b alternator::executor: Use streamed result for scan etc if large result
Avoids large allocations for larger scans.
Todo: determine threshold
2022-01-12 13:34:49 +00:00
Calle Wilund
0c1ff5c2f5 alternator::streams: Use streamed result in get_records if large result
If we have a reasonable result set to send back to client, use direct
streaming of the object.

Todo: determine threshold.
2022-01-12 13:34:49 +00:00
Calle Wilund
4a8a7ef8b4 executor/server: Add routine to make stream object return
Simply retains result object and sets json::json_return_type to
streaming callback.
2022-01-12 13:34:49 +00:00
Calle Wilund
e2d7225df8 rjson: Add print to stream of rjson::value
Allows direct stream of object to seastar::stream. While not 100%
efficient, it has the advantage of avoiding large allocations
(long string) for huge result messages.
2022-01-12 13:34:49 +00:00
Avi Kivity
134601a15e Merge "Convert input side of mutation compactor to v2" from Botond
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"

* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
  table: add make_reader_v2()
  querier: convert querier_cache and {data,mutation}_querier to v2
  compaction: upgrade compaction::make_interposer_consumer() to v2
  mutation_reader: remove unnecessary stable_flattened_mutations_consumer
  compaction/compaction_strategy: convert make_interposer_consumer() to v2
  mutation_writer: migrate timestamp_based_splitting_writer to v2
  mutation_writer: migrate shard_based_splitting_writer to v2
  mutation_writer: add v2 clone of feed_writer and bucket_writer
  flat_mutation_reader_v2: add reader_consumer_v2 typedef
  mutation_reader: add v2 clone of queue_reader
  compact_mutation: make start_new_page() independent of mutation_fragment version
  compact_mutation: add support for consuming a v2 stream
  compact_mutation: extract range tombstone consumption into own method
  range_tombstone_assembler: add get_range_tombstone_change()
  range_tombstone_assembler: add get_current_tombstone()
2022-01-12 14:37:19 +02:00
Avi Kivity
4118f2d8be treewide: replace deprecated seastar::later() with seastar::yield()
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().

This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so clear-cut,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
measuring the work done on the last tick is inaccurate anyway (you need
thousands of ticks to get any amount of confidence in the
measurement), so in the end it doesn't matter what we pick.

Tests: unit (dev)

Closes #9904
2022-01-12 12:19:19 +01:00
Avi Kivity
0e5d196499 Merge "move storage proxy verbs to the IDL" from Gleb
* 'gleb/sp-idl-v1' of github.com:scylladb/scylla-dev:
  storage_proxy: move all verbs to the IDL
  idl-compiler: allow const references in send() parameter list
  idl-compiler: support smart pointers in verb's return value
  idl-compiler: support multiple return value and optional in a return value
  idl-compiler: handle :: at the beginning of a type
  idl-compiler: sending one way message without timeout does not require ret value specialization as well
  storage_proxy: convert more address vectors to inet_address_vector_replica_set
2022-01-12 12:34:18 +02:00
Nadav Har'El
7a9f69ec38 Merge 'lister cleanup and test' from Benny Halevy
Split off of #9835.

The series removes extraneous includes of lister.hh from header files
and adds a unit test for lister::scan_dir to test throwing an exception
from the walker function passed to `scan_dir`.

Test: unit(dev)

Closes #9885

* github.com:scylladb/scylla:
  test: add lister_list
  lister: add more overloads of fs::path operator/ for std::string and string_view
  resource_manager: remove unnecessary include of lister.hh from header file
  sstables: sstable_directory: remove unnecessary include of lister.hh from header file
2022-01-12 08:20:07 +01:00
Nadav Har'El
c5f29fe3ea configure.py: don't use deprecated mktemp()
configure.py uses the deprecated Python function tempfile.mktemp().
Because this function is labeled a "security risk" it is also a magnet
for automated security scanners... So let's replace it with the
recommended tempfile.mkstemp() and avoid future complaints.

The actual security implications of this mktemp() call are negligible to
non-existent: First it's just the build process (configure.py), not
the build product itself. Second, the worst that an attacker (which
needs to run in the build machine!) can do is to cause a compilation
test in configure.py to fail because it can't write to its output file.
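For illustration, the replacement pattern looks roughly like this. It is a minimal sketch, not the code from configure.py: the helper name `write_probe_source` and the `.cc` suffix are assumptions made for the example.

```python
import os
import tempfile


def write_probe_source(source: str) -> str:
    """Write source text to a fresh temporary file and return its path.

    tempfile.mkstemp() creates and opens the file in one step, so there is
    no window between choosing a name and creating the file, which is the
    race that makes the deprecated tempfile.mktemp() a flagged risk.
    """
    fd, path = tempfile.mkstemp(suffix=".cc")
    # mkstemp() returns an OS-level file descriptor; wrap it so it gets closed.
    with os.fdopen(fd, "w") as f:
        f.write(source)
    return path


path = write_probe_source("int main() { return 0; }\n")
try:
    with open(path) as f:
        print(f.read(), end="")
finally:
    # Unlike a context-managed temporary file, mkstemp() leaves cleanup to the caller.
    os.remove(path)
```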

Reported by @srikanthprathi

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220111121924.615173-1-nyh@scylladb.com>
2022-01-11 17:06:14 +02:00
Benny Halevy
1e6829e9f1 test: add lister_list
Test the lister class.

In particular the ability to abort the lister
when the walker function throws an exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-11 17:04:16 +02:00
Benny Halevy
8444e50e6a lister: add more overloads of fs::path operator/ for std::string and string_view
To make it easier to append a std::string to a filesystem::path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-11 17:04:16 +02:00
Benny Halevy
f4cd535e3d resource_manager: remove unnecessary include of lister.hh from header file
But define namespace fs = std::filesystem in the header
since many use sites already depend on it
and it's a convention throughout scylla's code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-11 17:04:16 +02:00
Benny Halevy
b9c41dc0fd sstables: sstable_directory: remove unnecessary include of lister.hh from header file
The source file depends on it, not the header.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-11 17:04:16 +02:00
Botond Dénes
97d74de8fc Merge "flat_mutation_reader: clone evictable_reader & convert some others" from Michael Livshin
"
The first patch introduces evictable_reader_v2, and the second one
further simplifies it.  We clone instead of converting because there
is at least one downstream (by way of multishard_combining_reader) use
that is not itself straightforward to convert at the moment
(multishard_mutation_query), and because evictable_reader instances
cannot be {up,down}graded (since users also access the underlying
buffers).  This also means that shard_reader, reader_lifecycle_policy
and multishard_combining_reader have to be cloned.
"

* tag 'clone-evictable-reader-to-v2/v3' of https://github.com/cmm/scylla:
  convert make_multishard_streaming_reader() to flat_mutation_reader_v2
  convert table::make_streaming_reader() to flat_mutation_reader_v2
  convert make_flat_multi_range_reader() to flat_mutation_reader_v2
  view_update_generator: remove unneeded call to downgrade_to_v1()
  introduce multishard_combining_reader_v2
  introduce shard_reader_v2
  introduce the reader_lifecycle_policy_v2 abstract base
  evictable_reader_v2: further code simplifications
  introduce evictable_reader_v2 & friends
2022-01-11 17:01:08 +02:00
Botond Dénes
d21803c5d0 Merge "Remove global storage proxy from pagers code" from Pavel Emelyanov
"
The fix is in keeping shared proxy pointer on query_pager.

tests: unit(dev)
"
* 'br-keep-proxy-on-pager-2' of https://github.com/xemul/scylla:
  pager: Use local proxy pointer
  pager: Keep shared pointer to proxy onboard
2022-01-11 17:01:08 +02:00
Nadav Har'El
9d0eaeb90a test/scylla-gdb: enable test for "scylla fiber"
After the rewrite of the test/scylla-gdb, the test for "scylla fiber"
was disabled - and this patch brings it back.

For the "scylla fiber" operation to do something interesting (and not just
print an error message and seem to succeed...) it needs a real task pointer.
The old code interrupted Scylla in a breakpoint and used get_local_tasks(),
but in the new test framework we attach to Scylla while it's idle, so
there are no ready tasks. So in this patch we use the find_vptrs()
function to find a continuation from http_server::do_accept_one() - it
has an interesting fiber of 5 continuations.

After this patch all 33 tests in test/scylla-gdb/test_misc.py pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220110211813.581807-1-nyh@scylladb.com>
2022-01-11 17:01:08 +02:00
Avi Kivity
861cc1d304 Update seastar submodule
* seastar 28fe4214e5...ae8d1c28a2 (3):
  > cross-tree: convert deprecated later() to yield()
  > future: deprecate later(), and add two alternatives
  > reactor: improve lowres_clock, lowres_system_clock granularity
2022-01-11 17:01:08 +02:00
Nadav Har'El
7f5ca5bf3f Merge 'replica: move distributed_loader to replica module' from Avi Kivity
distributed_loader is a replica-side thing, so it belongs in the
replica module ("distributed" refers to its ability to load
sstables in their correct shards). So move it to the replica
module.

The change exposes a dependency on the construction
order of static variables (which isn't defined), so we remove
the dependency in the first two patches.

Closes #9891

* github.com:scylladb/scylla:
  replica: move distributed_loader into replica module
  tracing: make sure keyspace and table names are available to static constructors
  auth: make sure keyspace and table names are available to static constructors
2022-01-11 17:01:08 +02:00
Pavel Emelyanov
4dd1c15b7b Merge v3 of "Deglobalize repair tracker" from Benny
This series gets rid of the global repair_tracker
and thread-local node_ops_metrics instances.

It does so by first making the repair_tracker sharded,
with an instance per repair_service shard,
then exposing the repair_service::repair_tracker
and keeping a reference to the repair_service in repair_info.

Then the node_ops_metrics instances are moved from
thread-local global variables to class repair_service.

The motivation for this series is twofold:
1. There is a global effort to get rid of global services
   and instantiate all services on the stack of main() or cql_test_env.
2. As part of https://github.com/scylladb/scylla/issues/9809,
   we would like to eventually use a generic job tracer for both repair
   and compaction, so this would be one of the preliminary steps to get there.

Refs #9809

Test: unit(release) (including scylla-gdb)
Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled]
(Still seeing https://github.com/scylladb/scylla/issues/9785 but nothing worse)

* github.com:bhalevy/scylla.git deglobalize-repair-tracker-v4
  repair: repair_tracker: get rid of _the_tracker
  repair: repair_service: move free abort_repair_node_ops function to repair_service
  repair_service: deglobalize node_ops_metrics
  repair: node_ops_metrics: fixup indentation
  repair: node_ops_metrics: declare in header file
  repair: repair_info: add check_in_shutdown method
  repair: use repair_info to get to the repair tracker
  repair: move tracker-dependent free functions to repair_service
  repair: tracker: mark get function const
  repair_service: add repair_tracker getter
  repair: make repair_tracker sharded
  repair: repair_tracker: get rid of unused abort_all_abort_source
  repair: repair_tracker: get rid of unused shutdown abort source
2022-01-11 17:01:08 +02:00
Nadav Har'El
261c4b80b5 Update tools/java submodule
* tools/java 6249bfbe2f...b1e09c8b8f (1):
  > dist/debian:set either python (>=2.7) or python2
2022-01-11 17:01:08 +02:00
Calle Wilund
706f20442b query_idl: Make qr_partition::rows/query_result::partitions chunked
When doing potentially large (internal) queries, i.e. alternator streams,
we can cause large allocations here.
2022-01-11 13:52:40 +00:00
Michael Livshin
1f27e12dc6 convert make_multishard_streaming_reader() to flat_mutation_reader_v2
All changes are mechanical.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
be5118a7c9 convert table::make_streaming_reader() to flat_mutation_reader_v2
All changes are mechanical.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
221cd264db convert make_flat_multi_range_reader() to flat_mutation_reader_v2
Mechanical changes and a resulting downgrade in one caller (which is
itself converted later).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
91d38ef2a9 view_update_generator: remove unneeded call to downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
7f0e228cbb introduce multishard_combining_reader_v2
All changes are mechanical.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
4bc0deb7e9 introduce shard_reader_v2
Needed for multishard_combining_reader_v2 (see next commit), all
changes are mechanical.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
6499361b6a introduce the reader_lifecycle_policy_v2 abstract base
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
b053716e74 evictable_reader_v2: further code simplifications
Almost all mechanical: not passing a `reader` parameter around when we
know it's the `_reader` member, folding a short one-use method into
its caller.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Michael Livshin
402dbd2ca7 introduce evictable_reader_v2 & friends
Cloning instead of converting because there is at least one
downstream (via multishard_combining_reader) use that is not
straightforward to convert (multishard_mutation_query).

The clone is mostly mechanical and much simpler than the original,
because it does not have to deal with range tombstones when deciding
if it is safe to pause the wrapped reader, and also does not have to
trim any range tombstones.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Pavel Solodovnikov
236591be83 service: storage_service: coroutinize set_gossip_tokens
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
6aeccbb3b8 service: storage_service: coroutinize leave_ring
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
648c79347a service: storage_service: coroutinize handle_state_left
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
b23c19bfb6 service: storage_service: coroutinize handle_state_leaving
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
99195d637d service: storage_service: coroutinize handle_state_removing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
8052ad12cc service: storage_service: coroutinize do_drain
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:45 +03:00
Pavel Solodovnikov
1593507f32 service: storage_service: coroutinize shutdown_protocol_servers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
0bee6976e3 service: storage_service: coroutinize excise
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
c7d2a09424 service: storage_service: coroutinize remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
210c482c4f service: storage_service: coroutinize handle_state_replacing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
adfc8f8346 service: storage_service: coroutinize handle_state_normal
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
ba113439de service: storage_service: coroutinize update_peer_info
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
b46ebd4fe5 service: storage_service: coroutinize do_update_system_peers_table
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
aa363acc4b service: storage_service: coroutinize update_table
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
f8dbaa3722 service: storage_service: coroutinize handle_state_bootstrap
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
f0f4a74817 service: storage_service: futurize notify_* functions
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
9edf2182ab service: storage_service: coroutinize handle_state_replacing_update_pending_ranges
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
4fcf31f11c repair: row_level_repair_gossip_helper: coroutinize remove_row_level_repair
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
badbfd521c locator: reconnectable_snitch_helper: coroutinize reconnect
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
5dcfb94d5a gms: i_endpoint_state_change_subscriber: make callbacks to return futures
Coroutinize a few simple callbacks in the process.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
adf7138b3b utils: atomic_vector: introduce future-returning for_each function
Introduce a variant of `for_each` function not requiring
`seastar::async` context.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
b958e85c54 utils: atomic_vector: rename for_each to thread_for_each
To emphasize that the function requires `seastar::thread`
context to function properly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
445876a125 gms: gossiper: coroutinize start_gossiping
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
04b3172e6b gms: gossiper: coroutinize force_remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
a01c900d66 gms: gossiper: coroutinize do_status_check
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
42ff01eee2 gms: gossiper: coroutinize remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Raphael S. Carvalho
49eeacff37 compaction_manager: make run_with_compaction_disabled() barrier out non-regular compactions
run_with_compaction_disabled() is used to temporarily disable compaction
for a table T. Not only regular compaction, but all types.
Turns out it's stopping all types but it's only preventing new regular
compactions from starting. So major for example can start even with
compaction temporarily disabled. This is fixed by not allowing
compaction of any type if disabled. This wasn't possible before as
scrub incorrectly ran entirely with compaction disabled, so it wouldn't
be able to start, but now it only disables compaction while retrieving
its candidate list.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220107154942.59800-1-raphaelsc@scylladb.com>
2022-01-10 18:57:16 +02:00
Raphael S. Carvalho
1c23d1099a Make population more resilient when reshape fails
Reshape isn't mandatory for correctness, unlike resharding.
So we can allow boot to continue even in the face of a reshape
failure. Without this, boot will fail right away due to
unhandled exception. This is intended to make population
more resilient as any exception, even "benign" ones,
may cause boot to fail. It's better to allow boot to
continue from where it left off, as if there's an exception
like io error, or OOM, population will be unable to
complete anyway.

This patch was written based on observation that dangling
errors in interposer consumer used by compaction can cause
a different exception to be triggered, like broken_promise,
when user asked reshape to stop. This can no longer happen
now, but better safe than sorry.
So regular compaction can now pick on backlog once node is
online.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220107130539.14899-1-raphaelsc@scylladb.com>
2022-01-10 18:57:16 +02:00
Avi Kivity
4392c20bd3 replica: move distributed_loader into replica module
distributed_loader is a replica-side thing, so it belongs in the
replica module ("distributed" refers to its ability to load
sstables in their correct shards). So move it to the replica
module.
2022-01-10 15:25:28 +02:00
Avi Kivity
bfa4abaf6b tracing: make sure keyspace and table names are available to static constructors
Static constructors (specifically for the `system_keyspaces` global variable)
need their dependencies to be already constructed when their own
construction begins. Because tracing uses seastar::sstring, which is not
constexpr, we must change it to std::string_view (which is). Change
the type and perform the required adjustments. The definition is moved
to the header file for simplicity.
2022-01-10 15:24:57 +02:00
Gleb Natapov
1db151bd75 storage_proxy: move all verbs to the IDL
Define all verbs in the IDL instead of manually coding them.
2022-01-10 14:58:28 +02:00
Gleb Natapov
c998f77cd2 idl-compiler: allow const references in send() parameter list
Currently both the send function's parameters and the rpc handler's
parameters have to be values. Sometimes we want the send function to
receive a const reference to a value to avoid copying, while the handler
still obviously needs to get it by value. Support that by introducing one
more type attribute, [[ref]]. If present, the code generator makes the send
function's argument look like 'const type&' while the handler's argument
remains 'type'.
2022-01-10 14:44:20 +02:00
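The effect of the [[ref]] attribute described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the actual idl-compiler code; the function name `parameter_types` and the sample type `std::vector<int>` are hypothetical.

```python
def parameter_types(cpp_type, attributes=()):
    """Return (send_parameter_type, handler_parameter_type) for one verb parameter.

    Hypothetical helper: with the "ref" attribute the send side takes a
    const reference to avoid a copy, while the handler always takes a value.
    """
    send_type = f"const {cpp_type}&" if "ref" in attributes else cpp_type
    handler_type = cpp_type
    return send_type, handler_type


print(parameter_types("std::vector<int>", ["ref"]))
print(parameter_types("std::vector<int>"))
```

Keeping the handler type unchanged reflects the commit's point that only the send helper benefits from avoiding the copy.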
Gleb Natapov
f3d5507f86 idl-compiler: support smart pointers in verb's return value
A verb's handler may return a 'foreign_ptr<smart_ptr<type>>' value which
is received on the client side as a naked 'type'. The current verb generator
code can only support a symmetric handler/send helper, where the return type of
a handler matches the return type of a send function. Fix that by adding two
new attributes that can annotate a return type: unique_ptr,
lw_shared_ptr. If unique_ptr attribute is present the return type of a
handler will be 'foreign_ptr<unique_ptr<type>>' and the return type of a
send function will be just 'type'.
2022-01-10 14:29:37 +02:00
Gleb Natapov
9329234941 idl-compiler: support multiple return value and optional in a return value
RPC verbs can be extended to return more than one value, and new values
are returned as rpc::optional. When adding a return value to a verb,
its return value becomes rpc::tuple<type1, type2, type3>. In addition,
new return values may be marked as rpc::optional for backwards
compatibility. The patch allows parsing return expressions of the form:
  -> type1, type2 [[version 1.1.0]]
which will be translated into:
  rpc::tuple<type1, rpc::optional<type2>>
2022-01-10 14:23:51 +02:00
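As a rough illustration of the return-expression translation described in the commit above, the following Python sketch maps the IDL form to the C++ type string. It is not the real idl-compiler implementation: `translate_return` is a hypothetical name, and the sketch assumes that types contain no top-level commas and that any `[[version ...]]` attribute marks the value as optional.

```python
import re

# Matches a trailing [[version X.Y.Z]] attribute on a return value.
_VERSION_ATTR = re.compile(r"\s*\[\[version [^\]]*\]\]\s*$")


def translate_return(expr):
    """Translate an IDL return expression into a C++ return type string."""
    cpp_types = []
    # Assumption: no multi-argument templates, so a plain comma split is enough.
    for part in expr.split(","):
        part = part.strip()
        if _VERSION_ATTR.search(part):
            # A versioned value was added later, so older peers may omit it.
            base = _VERSION_ATTR.sub("", part).strip()
            cpp_types.append(f"rpc::optional<{base}>")
        else:
            cpp_types.append(part)
    if len(cpp_types) == 1:
        return cpp_types[0]
    return "rpc::tuple<" + ", ".join(cpp_types) + ">"


print(translate_return("type1, type2 [[version 1.1.0]]"))
# rpc::tuple<type1, rpc::optional<type2>>
```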
Gleb Natapov
9c88ea2303 idl-compiler: handle :: at the beginning of a type
Currently types starting with '::' like '::ns::type' cause parsing
errors. Fix it.
2022-01-10 14:22:48 +02:00
Gleb Natapov
cf8c42ee42 idl-compiler: sending one way message without timeout does not require ret value specialization as well 2022-01-10 14:16:20 +02:00
Gleb Natapov
ff6a0fffaf storage_proxy: convert more address vectors to inet_address_vector_replica_set 2022-01-10 13:48:20 +02:00
Benny Halevy
50a361c280 repair: repair_tracker: get rid of _the_tracker
the global _the_tracker pointer is no longer used,
remove it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 12:03:57 +02:00
Benny Halevy
ceb08b9302 repair: repair_service: move free abort_repair_node_ops function to repair_service
Do not depend on the_repair_tracker().

With that, the_repair_tracker() is no longer used
and should be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:59:22 +02:00
Benny Halevy
6bd78eb9a6 repair_service: deglobalize node_ops_metrics
Embed the node_ops_metrics instance in
a sharded repair_service member.

Test: curl -silent http://127.0.0.1:9180/metrics | grep node_ops | grep -v "^#"
      on a freshly started scylla instance.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:57:54 +02:00
Benny Halevy
a9c30f47fe repair: node_ops_metrics: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:52:58 +02:00
Benny Halevy
91cee22792 repair: node_ops_metrics: declare in header file
For de-globalizing its thread-local instance
by placing a node_ops_metrics member in repair_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:52:54 +02:00
Benny Halevy
95176098d1 repair: repair_info: add check_in_shutdown method
Replacing the free check_in_shutdown function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:49:40 +02:00
Benny Halevy
abeca95093 repair: use repair_info to get to the repair tracker
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:41:10 +02:00
Benny Halevy
4db57267a6 repair: move tracker-dependent free functions to repair_service
These functions are called from the api layer.
Continue to hide the repair tracker from the caller
but use the repair_service already available
at the api layer to invoke the respective high-level
methods without requiring `the_repair_tracker()`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:40:09 +02:00
Benny Halevy
6f7acc2029 repair: tracker: mark get function const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:26:29 +02:00
Benny Halevy
861852214c repair_service: add repair_tracker getter
And rename the global repair_tracker getter to
`the_repair_tracker` as the first step to get rid of it.

repair_service methods now use the repair_service::repair_tracker
method.

The global getter was renamed to `the_repair_tracker()`
temporarily while eliminating it in this series
to help distinguish it from repair_service::repair_tracker().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:25:32 +02:00
Benny Halevy
2f9e701570 repair: make repair_tracker sharded
Rather than keeping all shards' semaphore
and repair_info:s on the tracker's single-shard instance,
instantiate it on all shards, tracking the local
repair jobs on its local shard.

For now, until it's deglobalized, turn
_the_tracker into static thread_local pointer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 11:04:37 +02:00
Benny Halevy
415e67f3c2 repair: repair_tracker: get rid of unused abort_all_abort_source
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 10:57:10 +02:00
Benny Halevy
6650cb543b repair: repair_tracker: get rid of unused shutdown abort source
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-10 10:54:57 +02:00
Pavel Emelyanov
281ce3cbc6 pager: Use local proxy pointer
There are a few places that need the storage proxy and use a
global method to obtain it. Since the previous patch, the pager
has a local non-null pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-10 07:58:57 +03:00
Pavel Emelyanov
095d93eaf8 pager: Keep shared pointer to proxy onboard
Pagers are created by alternator and the select statement; both
have the proxy reference at hand. Next, the pager's unique_ptr
is put on the lambda of its fetch_page() continuation and thus
it survives the fetch_page execution and then gets destroyed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-01-10 07:58:57 +03:00
Avi Kivity
05fa3e07f4 Update seastar submodule
* seastar 655078dfdb...28fe4214e5 (2):
  > program_options: avoid including boost/program_options.hpp when possible
  > smp: split smp_options out of smp.hh
2022-01-09 19:56:39 +02:00
Nadav Har'El
3cc058d193 sstables: add missing include of seastar/core/metrics.hh
sstables/sstables.cc uses seastar::metrics but was missing an include of
<seastar/core/metrics.hh>. It probably received this include through
some other random included Seastar header (e.g., smp.hh).

Now that we're reducing the unnecessary inclusions in Seastar (an ongoing
effort of Seastar patches), it is no longer included implicitly, and we
need to include it explicitly in sstables.cc.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220109162823.511781-1-nyh@scylladb.com>
2022-01-09 18:30:50 +02:00
Nadav Har'El
63bd0807b4 test/scylla-gdb: skip tests on aarch64
As already noted in commit eac6fb8, many of the scylla-gdb tests fail on
aarch64 for various reasons. The solution used in that commit was to
have test/scylla-gdb/run pretend to succeed - without testing anything -
when not running on x86_64. This workaround was accidentally lost when
scylla-gdb/run was recently rewritten.

This patch brings this workaround back, but in a slightly different form -
Instead of the run script not doing anything, the tests do get called, but
the "gdb" fixture in test/scylla-gdb/conftest.py causes each individual
test to be skipped.

The benefit of this approach is that it can easily be improved in the
future to only skip (or xfail) specific tests which are known to fail on
aarch64, instead of all of them - as half of the tests do pass on aarch64.

Fixes #9892.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220109152630.506088-1-nyh@scylladb.com>
2022-01-09 17:34:23 +02:00
Avi Kivity
57188de09e Merge 'Make dc/rack encryption work for some cases where NAT hides endpoint ips' from Eliran Sinvani
This is a consolidation of #9714 and #9709 PRs by @elcallio that were reviewed by @asias
The last comment on those was that they should be consolidated in order not to create a security degradation for
ec2 setups.
For some cases it is impossible to determine dc or rack association for nodes on outgoing connections.
One example is when some IPs are hidden behind a NAT layer.
In some cases this creates problems where one side of the connection is aware of the rack/dc association while the
other isn't.
The solution here is a two-stage one:
1. First add a gossip reverse lookup that will help us determine the rack/dc association for a broader (hopefully all) range
 of setups and NAT situations.
2. When this fails, be more strict about downgrading a node, which tries to ensure that both sides of the connection will at least
 downgrade the connection instead of just failing to start when it is not possible for one side to determine rack/dc association.

Fixes #9653
/cc @elcallio @asias

Closes #9822

* github.com:scylladb/scylla:
  messaging_service: Add reverse mapping of private ip -> public endpoint
  production_snitch_base: Do reverse lookup of endpoint for info
  messaging_service: Make dc/rack encryption check for connection more strict
2022-01-09 16:40:49 +02:00
Nadav Har'El
7b5a8d3bcc init.hh: add missing include of boost/program_options.hpp
init.hh relies on boost::program_options but forgot to include the
header file <boost/program_options.hpp> for it. Today, this doesn't
matter, because Seastar unnecessarily includes <boost/program_options.hpp>
from unrelated header files (such as smp.hh) - so it ends up not being
missing.

But we plan to clean up Seastar from those unnecessary includes, and
then including what we need in init.hh will become important.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220109123152.492466-1-nyh@scylladb.com>
2022-01-09 15:56:58 +02:00
Avi Kivity
7f285965d8 auth: make sure keyspace and table names are available to static constructors
Static constructors (specifically for the `system_keyspaces` global variable)
need their dependencies to be already constructed when their own
construction begins. Enforce that for auth keyspace and table names
using the constinit keyword.
2022-01-09 12:51:22 +02:00
Avi Kivity
6c53717a39 replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.

Closes #9889
2022-01-09 11:08:10 +02:00
Botond Dénes
31777dfec8 repair: make sure there is one permit per repair with count res
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure concurrency limit is respected. However when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, creating a deadlock situation.
This patch solves this by dropping the count resource on the
repair-meta's permit when a non-local reader path is executed -- that is
a multishard reader is created.
2022-01-07 14:06:31 +02:00
Botond Dénes
4762ddec0f reader_permit: add release_base_resource()
Signals base resources to the semaphore and zeros it.
This basically undoes admission.
2022-01-07 14:06:31 +02:00
Botond Dénes
1a97f4c355 table: add make_reader_v2()
In fact the existing `make_reader()` is renamed to `make_reader_v2()`,
dropping the `downgrade_to_v1()` from the returned reader. To ease
incremental migration we add a `make_reader()` implementation which
downgrades this reader back to v1.
`table::as_mutation_source()` is also updated to use the v2 reader
factory method.
2022-01-07 13:52:43 +02:00
Botond Dénes
85c42a5d76 querier: convert querier_cache and {data,mutation}_querier to v2
The shard_mutation_querier is left using a v1 reader in its API as the
multishard query code is not ready yet. When saving this reader it is
upgraded to v2 and on lookup it is downgraded to v1. This should cancel
out thanks to upgrade/downgrade unwrapping.
2022-01-07 13:52:26 +02:00
Botond Dénes
15d8ea983e compaction: upgrade compaction::make_interposer_consumer() to v2
Almost all (except the scrub one) actual interposer consumers are v2.
2022-01-07 13:52:14 +02:00
Botond Dénes
aa3c943f4c mutation_reader: remove unnecessary stable_flattened_mutations_consumer
Said wrapper was conceived to make the unmovable `compact_mutation` movable,
because readers wanted movable consumers. But `compact_mutation` has been
movable for years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
2022-01-07 13:52:07 +02:00
Botond Dénes
1ba19c2aa4 compaction/compaction_strategy: convert make_interposer_consumer() to v2
The underlying timestamp-based splitter is v2 already.
2022-01-07 13:51:59 +02:00
Botond Dénes
9826b5d732 mutation_writer: migrate timestamp_based_splitting_writer to v2 2022-01-07 13:51:48 +02:00
Botond Dénes
0601a465a2 mutation_writer: migrate shard_based_splitting_writer to v2 2022-01-07 13:48:53 +02:00
Botond Dénes
92244ae8ec mutation_writer: add v2 clone of feed_writer and bucket_writer
Since we have multiple writers using this that we don't want to migrate
all at once, we create a v2 version of said classes so we can migrate
them incrementally.
2022-01-07 13:48:43 +02:00
Botond Dénes
2d7625f4b3 flat_mutation_reader_v2: add reader_consumer_v2 typedef
v2 version of the reader_consumer typedef.
2022-01-07 13:48:36 +02:00
Botond Dénes
8556cb78cc mutation_reader: add v2 clone of queue_reader
As this reader is used in a wide variety of places, it would be a
nightmare to upgrade all such sites in one go. So create a v2 clone and
migrate users incrementally.
2022-01-07 13:47:53 +02:00
Botond Dénes
e8a918b25c compact_mutation: make start_new_page() independent of mutation_fragment version
By using partition_region instead of mutation_fragment::kind. This will
make incremental migration of users to v2 easier.
2022-01-07 13:47:39 +02:00
Botond Dénes
790e73141f compact_mutation: add support for consuming a v2 stream
Consuming either a v1 or v2 stream is supported now, but compacted
fragments are still emitted in the v1 format, thus the compactor acts as an
online downgrader when consuming a v2 stream. This allows pushing out
downgrade to v1 on the input side all the way into the compactor. This
means that reads for example can now use an all v2 reader pipeline, the
still mandatory downgrade to v1 happening at the last possible place:
just before creating the result-set. Mandatory because our intra-node
ABI is still v1.
There are consumers who are ready for v2 in principle (e.g. compaction),
they have to wait a little bit more.
2022-01-07 13:42:31 +02:00
Botond Dénes
1d842e980a compact_mutation: extract range tombstone consumption into own method
Next patch wants to reuse the same code.
2022-01-07 13:42:17 +02:00
Botond Dénes
172c094388 range_tombstone_assembler: add get_range_tombstone_change() 2022-01-07 13:41:34 +02:00
Botond Dénes
3efb17a661 range_tombstone_assembler: add get_current_tombstone() 2022-01-07 13:41:25 +02:00
Botond Dénes
0f60cc84f4 Merge 'replica: create a replica module' from Avi Kivity
Move the ::database, ::keyspace, and ::table classes to a new replica
namespace and replica/ directory. This designates objects that only
have meaning on a replica and should not be used on a coordinator
(but note that not all replica-only classes should be in this module,
for example compaction and sstables are lower-level objects that
deserve their own modules).

The module is imperfect - some additional classes like distributed_loader
should also be moved, but there is only one way to untie Gordian knots.

Closes #9872

* github.com:scylladb/scylla:
  replica: move ::database, ::keyspace, and ::table to replica namespace
  database: Move database, keyspace, table classes to replica/ directory
2022-01-07 13:37:40 +02:00
Botond Dénes
4f4df25687 tools/scylla-sstable: update general description
We now have detailed per-operation descriptions, so remove
operation-specific parts of the general one and instead add more details
on the common options and arguments.
2022-01-07 12:05:49 +02:00
Botond Dénes
c6d61d47b7 tools/scylla-sstable: proper operation-specific --help
Add a detailed description to each of the operations. This description
replaces the general one when the operation specific help is displayed
(scylla sstable {operation} --help). The existing short description of
the operations is demoted to a summary and is made even shorter. This
will serve as the headline on the operation specific help page, as well
as the summary on the operation listing.
This allows the specifics of each operation to be detailed at length
instead of the terse summary that was available before.
2022-01-07 12:05:48 +02:00
Botond Dénes
51deb051d9 tools/scylla-sstable: proper operation-specific options
Operation-specific options are a mess currently. Some of them are in the
general options, all individual operations having to check for their
presence and warn if unsupported ones are set. These options were
general only when scylla-sstable had a single operation (dump). They
(most of them) became specific as soon as a second one was added.
Other specific options are in the awkward to use (both on the CLI
and in code) operation-specific option map.

This patch cleans this mess up. Each operation declares the options it
supports and these are only added to the command line when the specific
operation is chosen. General options now only contain options that are
truly universal.
As a result scylla-sstable has an operation-specific --help content now.
Operation-specific options are only printed when the operation is
selected:

    scylla sstable --help

will only print generic options, while:

    scylla sstable dump-data --help

will also print options specific to said operation. The description is
the same still, but this will be fixed in the next patch too.
2022-01-07 12:05:48 +02:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Botond Dénes
9b5fa12c3d tools/scylla-sstable: s/dump/dump-data/
We now have dump-{component} for all sstable components, so rename dump
to dump-data to follow the established naming scheme and to clear any
possible confusion about what it dumps.
2022-01-07 11:23:54 +02:00
Botond Dénes
41dec2dd50 tools/utils: remove now unused get_selected_operation() overload 2022-01-07 11:23:54 +02:00
Botond Dénes
6d4b17976f tools: take operations (commands) as positional arguments
Instead of switches. E.g.:

    scylla sstable dump ...

instead of:

    scylla sstable --dump

This is more in line with how most CLI interfaces work nowadays.
2022-01-07 09:38:05 +02:00
Botond Dénes
062ffaa571 tools/utils: add positional-argument based overload of get_selected_operation()
As opposed to the current one, which expects the operation to be given
with the --operation syntax, this new overload expects it as the first
positional argument. If found and valid, it is extracted from the
arglist and returned. Otherwise exit() is invoked to simplify error
handling.
2022-01-07 09:38:05 +02:00
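The positional-argument lookup described in the commit above could look roughly like the following Python sketch. The real helper is C++ in tools/utils; this rendering, its signature, and the error text are assumptions made purely for illustration.

```python
import sys


def get_selected_operation(args, operations):
    """Pop and return the operation named by the first positional argument.

    Hypothetical sketch: `args` is the argument list without the program
    name, `operations` is the list of valid operation names.
    """
    if args and args[0] in operations:
        # Found and valid: extract it from the argument list and return it.
        return args.pop(0)
    # Otherwise exit right away, which keeps error handling in callers simple.
    print(f"error: expected one of: {', '.join(operations)}", file=sys.stderr)
    sys.exit(1)


args = ["dump", "--help"]
operation = get_selected_operation(args, ["dump"])
print(operation, args)
```

After the call, `args` holds only the remaining options, so they can be handed to the regular option parser unchanged.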
Botond Dénes
2c16fc8e9b tools: remove obsolete FIXMEs 2022-01-07 07:21:05 +02:00
Raphael S. Carvalho
07fba4ab5d compaction_manager: Abort reshape for tables waiting for a chance to run
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.

Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev

Fixes #9836.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
2022-01-06 18:04:16 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Raphael S. Carvalho
4c28c49bc7 compaction_manager: make return of maybe_stop_on_error less confusing
maybe_stop_on_error() is confusing because it returns true if the task
can be retried, which goes in the opposite direction of its semantics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220106143233.459903-1-raphaelsc@scylladb.com>
2022-01-06 16:39:15 +02:00
Avi Kivity
b850b34bcc build: reduce inline threshold on aarch64 to 300
We see coroutine miscompiles with 600.

Fixes #9881.

Closes #9883
2022-01-06 15:13:27 +02:00
Nadav Har'El
6e2d29300c test/scylla-gdb: a rewrite, using pytest
This patch is an almost complete rewrite of the test/scylla-gdb
framework for testing Scylla's gdb commands.

The goals of this rewrite are described in issue #9864. In short, the
goals are:

1. Use pytest to define individual test cases instead of one long Python
   script. This will make it easier to add more tests, to run only
   individual tests (e.g., test/scylla-gdb/run somefile.py::sometest),
   to understand which test failed when it fails - and a lot of other
   pytest conveniences.

2. Instead of an ad-hoc shell script to run Scylla, gdb, and the test,
   use the same Python code which is used in other test suites (alternator,
   cql-pytest, redis, and more). The resulting handling of the temporary
   resources (processes, directories, IP address) is more robust, and
   interrupting test/scylla-gdb/run will correctly kill its child
   processes (both Scylla and gdb).

All existing gdb tests (except one - more on this below...) were
easily rewritten in the new framework.

The biggest change in this patch is who starts what. Before this patch,
"run" starts gdb, which in turn starts Scylla, stops it on a breakpoint,
and then runs various tests. After this patch, "run" starts Scylla on
its own (like it does in test/cql-pytest/run, et al.), and then gdb runs
pytest - and in a pytest fixture attaches to the running Scylla process.
The biggest benefit of this approach is that "run" is aware of both gdb
and Scylla, and can kill both abruptly with SIGKILL to end the test.
But there's also a downside to this change: One of the tests (of "scylla
fiber") needs access to some task object. Before this patch, Scylla was
stopped on a breakpoint, and a task was available at that point. After
this patch, we attach gdb to an idle Scylla, and the test cannot find
any task to use. So the test_fiber() test fails for now.

One way we could perhaps fix it is to add a breakpoint and "continue"
Scylla a bit more after attaching to it. However, I could not find the right
breakpoint - and we may also need to send a request to Scylla to
get it to reach that breakpoint. I'm still looking for a better way
to have access to some "task" object we can test on.

Fixes #9864.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220102221534.1096659-1-nyh@scylladb.com>
2022-01-06 11:29:55 +02:00
Nadav Har'El
d9fe6f4c96 Merge: main: improve tool integration
This set contains follow-up fixes to folding tools into the scylla
executable:
* Improve the app description of scylla w.r.t. tools
* Add a new --list-tools option
* Error out when the first argument is unrecognized

Tests: unit(dev)

Botond Dénes (3):
  main: rephrase app description
  main: add move tool listing to --list-tools
  main: improve handling of non-matching argv[1]

 main.cc | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)
2022-01-06 10:06:28 +02:00
Botond Dénes
a37b4bbbaf main: improve handling of non-matching argv[1]
Be silent when argv[1] starts with "-", it is probably an option to
scylla (and "server" is missing from the cmd line).
Print an error and stop when argv[1] doesn't start with "-" and thus the
user presumably meant to start either the server or a tool and mistyped
it. Instead of trying to guess what they meant, stop with a clear error
message.
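
The rule described above can be sketched roughly as follows. This is only an illustration in Python, not the code in main.cc: `select_main`, `KNOWN_TOOLS`, the tool name `example-tool` and the error text are all hypothetical.

```python
# Hypothetical tool names, for illustration only; the real list is
# whatever --list-tools prints.
KNOWN_TOOLS = {"example-tool"}


def select_main(argv):
    """Return the name of the main to run, based on argv[1]."""
    if len(argv) < 2:
        return "server"          # nothing given: run the server
    arg = argv[1]
    if arg == "server" or arg in KNOWN_TOOLS:
        return arg
    if arg.startswith("-"):
        return "server"          # probably an option to scylla: stay silent
    # Does not look like an option: most likely a mistyped tool name.
    raise SystemExit(f"error: unrecognized first argument: {arg!r}")
```

With this sketch, `select_main(["scylla", "--some-option"])` selects the server, while a mistyped word such as `sever` stops with an error instead of being guessed at.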
2022-01-06 06:59:59 +02:00
Botond Dénes
fe0bfa1d7b main: add move tool listing to --list-tools
And make it the central place listing available tools (to minimize the
places to update when adding a new one). The description is edited to
point to this command instead of listing the tools itself.
2022-01-06 06:58:44 +02:00
Botond Dénes
ab0e39503b main: rephrase app description
Remove "compatible with Apache Cassandra", scylla is much more than that
already.

Rephrase the part describing the included tools such that it is clear
that the scylla server is the main thing and the tools are the "extra"
additions. Also use the term "tool" instead of the term "app".
2022-01-06 06:37:32 +02:00
Botond Dénes
92727ac36c sstables/partition_index_cache: destroy entry ptr on error
The error-handling code removes the cache entry but this leads to an
assertion because the entry is still referenced by the entry pointer
instance which is returned on the normal path. To avoid this clear the
pointer on the error path and make sure there are no additional
references kept to it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220105140859.586234-2-bdenes@scylladb.com>
2022-01-05 19:03:24 +01:00
Nadav Har'El
6ebf32f4d7 types: deinline template throw_with_backtrace<marshal_exception, sstring>
When a template is instantiated in a header file which is included by many
source files, the compiler needs to compile it again and again.
ClangBuildAnalyzer helps find the worst cases of this happening, and one
of the worst happens to be

   seastar::throw_with_backtrace<marshal_exception, sstring>

This specific template function takes (according to ClangBuildAnalyzer)
362 milliseconds to instantiate, and this is done 312 (!) times, because
it reaches virtually every Scylla source file via either types.hh or
compound.hh which use this idiom.

Unfortunately, C++ as it exists today does not have a mechanism to
avoid compiling a specific template instantiation if this was already
done in some other source file. But we can do this manually using
the C++11 feature of "extern template":

1. For a specific template instance, in this case
   seastar::throw_with_backtrace<marshal_exception, sstring>,
   all source files except one specify it as "extern template".
   This means that the code for it will NOT be built in this source
   file, and the compiler assumes the linker will eventually supply it.

2. At the same time, one source file instantiates this template
   instance once regularly, without "extern".

The numbers from ClangBuildAnalyzer suggest that this patch should
reduce total build time by 1% (in dev build mode), but this is hard to
measure in practice because the very long build time (210 CPU minutes on
my laptop) usually fluctuates by more than 1% in consecutive runs.
However, we've seen in the past that a good estimate of build time is
the total produced object size (du -bc build/dev/**/*.o). This patch
indeed reduces this total object size (in dev build mode) by exactly 1%.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220105171453.308821-1-nyh@scylladb.com>
2022-01-05 19:23:40 +02:00
Avi Kivity
d01e1a774b Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El
The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2*300 CPU seconds wasted.

In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include.

This patch reduces the entire build time (of build/dev/scylla) by 4% - removing almost 10 CPU minutes (!) from the build.

Closes #9875

* github.com:scylladb/scylla:
  build performance: do not include <seastar/net/ip.hh>
  build performance: speed up inclusion of <gm/inet_address.hh>
2022-01-05 17:55:07 +02:00
Nadav Har'El
6012f6f2b6 build performance: do not include <seastar/net/ip.hh>
In a previous patch, we noticed that the header file <gms/inet_address.hh>,
which is included, directly or indirectly, by most source files,
includes <seastar/net/ip.hh> which is very slow to compile, and
replaced it by the much faster-to-include <seastar/net/ipv[46]_address.hh>.

However, we also included <seastar/net/ip.hh> in types.hh - and that
too is included by almost every file, so the actual saving from the
above patch was minimal. So in this patch we replace this include too.
After this patch Scylla does not include <seastar/net/ip.hh> at all.

According to ClangBuildAnalyzer, this reduces the average time to include
types.hh (multiply this by 312 times!) from 4 seconds to 1.8 seconds,
and reduces total build time (dev mode) by about 3%.

Some of the source files were now missing some include directives, that
were previously included in ip.hh - so we need to add those explicitly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-05 17:29:21 +02:00
Tomasz Grabiec
382797a627 tests: perf: perf_fast_forward: Fix test_large_partition_slicing_clustering_keys for scylla_bench_large_part_ds1 schema
The test case assumed int32 partition key, but
scylla_bench_large_part_ds1 has int64 partition key. This resulted in
no results to be returned by the reader.

Fix by introducing a partition key factory on the data source level.

Message-Id: <20220105150550.67951-1-tgrabiec@scylladb.com>
2022-01-05 17:18:06 +02:00
Nadav Har'El
788b9c7bc0 dbuild: better documentation for how to use with ccache
dbuild's README contained some vague and very partial hints on how to use
ccache with dbuild. Replace them with more concrete instructions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211229180433.781906-1-nyh@scylladb.com>
2022-01-05 16:53:08 +02:00
Botond Dénes
015d09a926 tools: utils: add configure_tool_mode()
Which configures seastar to act more appropriately for a tool app. I.e.
don't act as if it owns the place, taking over all system resources.
These tools are often run on a developer machine, or even next to a
running scylla instance, we want them to be the least intrusive
possible.
Also use the new tool mode in the existing tools.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211220143104.132327-1-bdenes@scylladb.com>
2022-01-05 15:33:57 +02:00
Asias He
c5784c1149 repair: Sort follower nodes by proximity
Sort follower nodes by the proximity so that in the step where the
master node gets missing rows from repair follower nodes, the master
node has a chance to get the missing rows from a near node first (e.g.,
local dc node), avoiding getting rows from a far node.

For example:

dc1: n1, n2
dc2: n3, n4
dc3: n5, n6

Run repair on n1, with this patch, n1 will get data from n2 which is in the same dc first.

[shard 0] repair - Repair 1 out of 1 ranges, id=[id=1, uuid=8b0040bd-5aa5-42e1-bb9f-58c5e7052aec],
shard=0, keyspace=ks, table={cf}, range=(-6734413101754081925, -6539883972247625343],
peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3},
live_peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3}

[shard 0] repair - Before sort = {127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3}

[shard 0] repair - After sort = {127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3}

[shard 0] repair - Started Row Level Repair (Master): local=127.0.39.1,
peers={127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3}
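
The reordering in the log above is what a stable "same DC first" sort produces. A minimal Python sketch, assuming nodes n1..n6 map to 127.0.39.1..127.0.39.6 (the `DC_OF` table and `sort_by_proximity` helper are hypothetical, and the real proximity sort may consider more than the DC):

```python
# Assumed mapping of the example nodes n1..n6 to the addresses in the log.
DC_OF = {
    "127.0.39.1": "dc1", "127.0.39.2": "dc1",
    "127.0.39.3": "dc2", "127.0.39.4": "dc2",
    "127.0.39.5": "dc3", "127.0.39.6": "dc3",
}


def sort_by_proximity(local, peers):
    """Stable sort: peers in the local node's DC come first."""
    local_dc = DC_OF[local]
    # False sorts before True, so same-DC peers move to the front while
    # all other peers keep their relative order.
    return sorted(peers, key=lambda peer: DC_OF[peer] != local_dc)


before = ["127.0.39.5", "127.0.39.6", "127.0.39.2", "127.0.39.4", "127.0.39.3"]
after = sort_by_proximity("127.0.39.1", before)
```

Because the sort is stable, only 127.0.39.2 moves; the remote peers stay in their original order, as in the "After sort" line.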

Closes #9769
2022-01-05 14:09:59 +02:00
Nadav Har'El
e7e9001808 test/alternator: add more tests for GSI "Projection"
We already have multiple tests for the unimplemented "Projection" feature
of GSI and LSI (see issue #5036). This patch adds seven more test cases,
focusing on various types of error conditions (e.g., trying to project
the same attribute twice), esoteric corner cases (it's fine to list a key in
NonKeyAttributes!), and corner cases that I expect we will have in our
implementation (e.g., a projected attribute may either be a real Scylla
column or just an element in a map column).

All new tests pass on DynamoDB and fail on Alternator (due to #5036), so
marked with "xfail".

Refs #5036.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211228193748.688060-1-nyh@scylladb.com>
2022-01-05 10:35:36 +02:00
Avi Kivity
53a83c4b1e Merge "flat_mutation_reader: convert flat_mutation_reader_from_mutations to v2" from Botond
"
Like flat_mutation_reader_from_fragments, this reader is also heavily
used by tests to compose a specific workload for readers above it. So
instead of converting it, we add a v2 variant and leave the v1 variant
in place.
The v2 variant was written from scratch to have built-in support for
reading in reverse. It is built on `mutation::consume()` to avoid
duplicating the logic of consuming the contents of the mutation. To
avoid stalls, `mutation::consume()` gets support for pausing and
resuming consuming a mutation.

Tests: unit(dev)
"

* 'flat_mutation_reader_from_mutations_v2/v2' of https://github.com/denesb/scylla:
  flat_mutation_reader: convert make_flat_mutation_reader_from_mutation() v2
  flat_mutation_reader: extract mutation slicing into a function
  mutation: consume(): make it pausable/resumable
  mutation: consume(): restructure clustering iterator initialization
  test/boost/mutation_test: add rebuild test for mutation::consume()
2022-01-05 10:23:17 +02:00
Avi Kivity
2e958b3555 Merge "Coroutinization of compaction sstable rewrite procedure" from Raphael
"
Completes coroutinization of rewrite_sstables().

tests: UNIT(debug)
"

* 'rewrite_sstable_coroutinization' of https://github.com/raphaelsc/scylla:
  compaction_manager: coroutinize main loop in sstable rewrite procedure
  compaction_manager: coroutinize exception handling in sstable rewrite procedure
  compaction_manager: mark task::finish_compaction() as noexcept
  compaction_manager: make maybe_stop_on_error() more flexible
2022-01-05 10:15:19 +02:00
Raphael S. Carvalho
426450dc04 treewide: remove useless include of database.hh
Wrote a script based on cpp-include to find places that needlessly
included database.hh, which is expensive to process during
build time.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>
2022-01-05 10:15:19 +02:00
Nadav Har'El
dcc42d3815 configure.py: re-run configure.py if the build/ directory is gone
When you run "configure.py", the result is not only the creation of
./build.ninja - it also creates build/<mode>/seastar/build.ninja
and build/<mode>/abseil/build.ninja. After a "rm -r build" (or "ninja
clean"), "ninja" will no longer work because those files are missing
when Scylla's ninja tries to run ninja in those internal projects.

So we need to add a dependency, e.g., that running ninja in Seastar
requires build/<mode>/seastar/build.ninja to exist, and also say
that the rule that (re)runs "configure.py" generates those files.

After this patch,

        configure.py --with-some-parameters --of-your-choice
        rm -r build
        ninja

works - "ninja" will re-run configure.py with the same parameters
when it needs Seastar's or Abseil's build.ninja.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211230133702.869177-1-nyh@scylladb.com>
2022-01-05 10:15:19 +02:00
Nadav Har'El
5fbeae9016 cql-pytest: add a couple of default-TTL tests
This patch adds a new cql-pytest test file - test_ttl.py - with
currently just a couple of tests for the "with default_time_to_live"
feature. One is a basic test, and the second reproduces issue #9842 -
that "using ttl 0" should override the default time to live, but
doesn't.

The test for #9842, test_default_ttl_0_override, fails on Scylla and
passes on Cassandra, and is marked "xfail".

Refs #9842.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211227091502.553577-1-nyh@scylladb.com>
2022-01-05 10:15:19 +02:00
Benny Halevy
e0a351e0c6 compaction_manager: stop_compaction: disallow specific types
We can stop only specific compaction types.

Reshard should be excluded since it mustn't be stopped.

And other types of compaction types like "VALIDATION" or "INDEX_BUILD"
are valid in terms of their syntax but unsupported by scylla so we better
return an error rather than appear to support them.

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222133449.2177746-1-bhalevy@scylladb.com>
2022-01-05 09:32:20 +02:00
Botond Dénes
62d82b8b0e flat_mutation_reader: convert make_flat_mutation_reader_from_mutation() v2
Since this reader is also heavily used by tests to compose a specific
workload for readers above it, we just add a v2 variant, instead of
changing the existing v1 one.
The v2 variant was written from scratch to have built-in support for
reading in reverse. It is built on `mutation::consume()` to avoid
duplicating the logic of consuming the contents of the mutation.

A v2 native unit test is also added.
2022-01-05 09:06:16 +02:00
Botond Dénes
2d1bb90c8e flat_mutation_reader: extract mutation slicing into a function 2022-01-05 09:06:16 +02:00
Botond Dénes
e8ca07abed mutation: consume(): make it pausable/resumable
To avoid stalls or overconsumption for consumers which have a limit on
how much they want to consume in one go, the mutation::consume() is made
pausable/resumable. This happens via a cookie which is now returned as
part of the returned result, and which can be passed to a later
consume call to resume the previous one.
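
The cookie idea can be illustrated with a small Python sketch. It is not the C++ interface of mutation::consume(): `ConsumeResult`, `consume`, `consume_all` and the use of a plain index as the cookie are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ConsumeResult:
    consumed: list   # fragments handed out during this call
    cookie: int      # position to pass to the next call
    done: bool       # True once everything was consumed


def consume(fragments, limit, cookie=0):
    """Consume at most `limit` fragments, resuming from `cookie`."""
    end = min(cookie + limit, len(fragments))
    return ConsumeResult(fragments[cookie:end], end, end == len(fragments))


def consume_all(fragments, limit):
    """Drive consume() to completion in bounded steps (limit must be >= 1)."""
    out, cookie, done = [], 0, False
    while not done:
        res = consume(fragments, limit, cookie)
        out.extend(res.consumed)
        cookie, done = res.cookie, res.done
    return out
```

Each call does a bounded amount of work, and the caller decides when (or whether) to resume by passing the returned cookie back in.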
2022-01-05 09:06:16 +02:00
Botond Dénes
f1391d5c27 mutation: consume(): restructure clustering iterator initialization
Instead of having a branch per each value of `consume_in_reverse`, have
just two ifs with two branches each for clustering rows and range
tombstones respectively, to facilitate further patching.
2022-01-05 07:29:36 +02:00
Nadav Har'El
3fbbad7d60 build performance: speed up inclusion of <gm/inet_address.hh>
The header file <gms/inet_address.hh> is included, directly or
indirectly, from 291 source files in Scylla. It is hard to reduce this
number because Scylla relies heavily on IP addresses as keys to
different things. So it is important that this header file be fast to
include. Unfortunately it wasn't... ClangBuildAnalyzer measurements
showed that each inclusion of this header file added a whopping 2 seconds
(in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU
minutes - were spent just on this header file. It was actually worse
because the build also spent additional time on template instantiation
(more on this below).

So in this patch we:

1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid
   including it in one place that doesn't need it. This is just
   cosmetic, and doesn't significantly speed up the build.

2. Move the to_sstring() implementation from the .hh to the .cc. This saves
   a lot of time on template instantiations - previously every source
   file instantiated this to_sstring(), which was slow (that "format"
   thing is slow).

3. Do not include <seastar/net/ip.hh> which is a huge file including
   half the world. All we need from it is the type "ipv4_address",
   so instead include just the new <seastar/net/ipv4_address.hh>.
   This change brings most of the performance improvement.
   Some source files forgot to include various Seastar header files
   because the includes-everything ip.hh did it - so we need to add
   these missing includes in this patch.

After this patch, ClangBuildAnalyzer reports that the cost of
inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326
seconds. Additionally, the format<inet_address> template instantiation,
done 291 times at about half a second each, is also gone.

All in all, this patch should reduce around 10 CPU minutes from the build.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-04 21:07:23 +02:00
Raphael S. Carvalho
f0b816d8e8 compaction_manager: coroutinize main loop in sstable rewrite procedure
With this patch, rewrite_sstables() is now fully coroutinized.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 16:03:23 -03:00
Raphael S. Carvalho
c85ba1e694 compaction_manager: coroutinize exception handling in sstable rewrite procedure
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:39:54 -03:00
Raphael S. Carvalho
59a65742f9 compaction_manager: mark task::finish_compaction() as noexcept
As it's intended to be used in a deferred action.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:30:04 -03:00
Raphael S. Carvalho
3fe4c2e517 compaction_manager: make maybe_stop_on_error() more flexible
It's hard to integrate maybe_stop_on_error() with coroutines as it
accepts a resolved future, not an exception pointer. Let's adjust
its interface, making it more flexible to work with.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:28:30 -03:00
Raphael S. Carvalho
9a1fdb0635 sstables: stop including unused expensive headers
database.hh is expensive to include, and turns out it's no longer
needed. Also stop including other unused ones.

build time of sstables.o reduces by ~3% (cleared all caches and set
cpu frequency to a fixed value before building sstables.o from
scratch)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220104175908.98833-1-raphaelsc@scylladb.com>
2022-01-04 20:14:01 +02:00
Asias He
25b036f35b repair: Improve memory usage tracking and oom protection
Currently, the repair parallelism is calculated from the amount of memory
allocated to repair and the memory usage per repair instance. However, the
memory usage per repair instance does not take into account the max possible
memory usage caused by repair followers.

As a result, when repairing a table with more replicas, e.g.,
3 DCs with 3 replicas each, the repair master node would use 9X the repair
buffer size in the worst case. This would cause OOM when the system is
under pressure.

This patch introduces a semaphore to cap the max memory usage.

Each repair instance takes the max possible memory usage budget before
it starts. This ensures repair would never use more than the memory
allocated to repair.
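
As a rough illustration of reserving the worst-case budget up front, here is a Python sketch. It is not the repair code: `MemoryBudget` and `worst_case_usage` are hypothetical names, the sketch refuses instead of waiting like a real semaphore would, and the "one buffer per replica" formula is an assumption derived from the 9X example above.

```python
class MemoryBudget:
    """Counting budget: units are reserved up front and released afterwards."""

    def __init__(self, total):
        self.total = total
        self.available = total

    def try_acquire(self, units):
        if units > self.available:
            return False     # caller has to wait; nothing is over-committed
        self.available -= units
        return True

    def release(self, units):
        self.available = min(self.total, self.available + units)


def worst_case_usage(buffer_size, nr_replicas):
    # Assumption: the master may hold one buffer's worth of rows per replica.
    return buffer_size * nr_replicas
```

With a total budget of 20 units and 9 units per instance, two instances can start and a third has to wait until one of them releases its reservation.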

Fixes #9817

Closes #9818.
2022-01-04 20:11:36 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that
used to work might no longer work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
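
The four modes can be summarized as a small decision function. This Python sketch is only a reading of the description above: `can_gc` and its parameters are hypothetical, and the exact rule used for 'repair' mode (comparing the deletion time with the last repair time minus the propagation delay) is an assumption, not something this message specifies.

```python
def can_gc(mode, deletion_time, now, gc_grace_seconds,
           last_repair_time=None, propagation_delay=0):
    """Decide whether a tombstone may be purged (all times in seconds)."""
    if mode == "timeout":
        return deletion_time + gc_grace_seconds <= now
    if mode == "disabled":
        return False
    if mode == "immediate":
        return True
    if mode == "repair":
        # Assumption: only tombstones older than the last repair of their
        # range, minus the propagation delay, are safe to drop.
        if last_repair_time is None:
            return False
        return deletion_time < last_repair_time - propagation_delay
    raise ValueError(f"unknown tombstone_gc mode: {mode!r}")
```

In this sketch a range that was never repaired keeps its tombstones in 'repair' mode, which matches the stated goal of removing a tombstone only after the covering range is repaired.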

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Avi Kivity
5eccb42846 Merge "Host tool executables in the scylla main executable" from Botond
"
A big problem with scylla tool executables is that they include the
entire scylla codebase and thus they are just as big as the scylla
executable itself, making them impractical to deploy on production
machines. We could try to combat this by selectively including only the
actually needed dependencies but even ignoring the huge churn of
sorting out our dependency hell (which we should do at one point anyway),
some tools may genuinely depend on most of the scylla codebase.

A better solution is to host the tool executables in the scylla
executable itself, selecting the actual main function to run
in some way. The tools themselves don't contain a lot of code so
this won't cause any considerable bloat in the size of the scylla
executable itself.
This series does exactly this, folds all the tool executables into the
scylla one, with main() switching between the actual main it will
delegate to based on an argv[1] command line argument. If this is a known
tool name, the respective tool's main will be invoked.
If it is "server", missing or unrecognized, the scylla main is invoked.

Originally this series used argv[0] as the means to switch between the
main to run. This approach was abandoned for the approach mentioned above
for the following reasons:
* No launcher script, hard link, soft link or similar games are needed to
  launch a specific tool.
* No packaging needed, all tools are automatically deployed.
* Explicit tool selection, no surprises after renaming scylla to
  something else.
* Tools are discoverable via scylla's description.
* Follows the trend set by modern command line multi-command or multi-app
  programs, like git.

Fixes: #7801

Tests: unit(dev)
"

* 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla:
  main,tools,configure.py: fold tools into scylla exec
  tools: prepare for inclusion in scylla's main
  main: add skeleton switching code on argv[1]
  main: extract scylla specific code into scylla_main()
2022-01-04 17:55:07 +02:00
Calle Wilund
73c4a2f42b messaging_service: Add reverse mapping of private ip -> public endpoint
For quick reverse lookup.

(cherry picked from commit c86296f2a8)
2022-01-04 15:14:58 +02:00
Botond Dénes
5e547dcc8a test/boost/mutation_test: add rebuild test for mutation::consume()
In the next patches we will refactor mutation::consume(). Before doing
that add another test, which rebuilds the consumed mutation, comparing
it with the original.
2022-01-04 11:43:46 +02:00
Nadav Har'El
e0ebde0f4f Update seastar submodule
The split of <seastar/net/ip.hh> will be useful for reducing the build
time (ip.hh is huge and we don't need to include most of it)
Refs #1

* seastar 8d15e8e6...655078df (13):
  > net: split <seastar/net/ip.hh>
  > Merge "Rate-limited IO capacity management" from Pavel E
  > util: closeable/stoppable: Introduce cancel()
  > loop: Improve concepts to match requirements
  > Merge "scoped_critical_alloc_section make conditional and volatile" from Benny
  > Added variadic version of when_any
  > websocket: define CryptoPP::byte for older cryptopp
  > tests: fix build (when libfmt >= 8) by adding fmt::runtime()
  > foreign_ptr: destroy_on: fixup indentation
  > foreign_ptr: expose async destroy method
  > when_all: when_all_state::wait_all move scoped_critical_alloc_section to captures
  > json: json_return_type: provide copy constructor and assignment operator
  > json: json_element: mark functions noexcept
2022-01-03 22:52:24 +02:00
Calle Wilund
3c02cab2f7 commitlog: Don't allow error_handler to swallow exception
Fixes #9798

If an exception in allocate_segment_ex is (sub)type of std::system_error,
commit_error_handler might _not_ cause throw (doh), in which case the error
handling code would forget the current exception and return an unusable
segment.

Now only used as an exception pointer replacer.

Closes #9870
2022-01-03 22:46:31 +02:00
Nadav Har'El
8774fc83d3 test/rest_api: fix "--ssl" option
test/rest_api has a "--ssl" option to use encrypted CQL. It's not clear
to me why this is useful (it doesn't actually test encryption of the
REST API!), but as long as we have such an option, it should work.

And it didn't work because of a typo - we set a "check_cql" variable to the
right function, but then forgot to use it and used run.check_cql instead
(which is just for unencrypted cql).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220102123202.1052930-1-nyh@scylladb.com>
2022-01-02 15:53:25 +02:00
Benny Halevy
fc729a804b shard_reader: Continue after read_ahead error
If read ahead failed, just issue a log warning
and proceed to close the reader.

Currently co_await will throw and the evictable reader
won't be closed.

This is seen occasionally in testing, e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/1010/artifact/logs-all.debug.2/1640918573898_lwt_banking_load_test.py%3A%3ATestLWTBankingLoad%3A%3Atest_bank_with_nemesis/node2.log
```
ERROR 2021-12-31 02:40:56,160 [shard 0] mutation_reader - shard_reader::close(): failed to stop reader on shard 1: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
```

Fixes #9865.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220102124636.2791544-1-bhalevy@scylladb.com>
2022-01-02 15:52:09 +02:00
Pavel Emelyanov
36905ce19d scylla-gdb: Do not try to unpack None-s
When 'scylla fiber' calls _walk the latter can validly return back None
pointer (see 74ffafc8a7 scylla-gdb.py: scylla fiber: add actual return
to early return). This None is not handled by the caller but is unpacked
as if it was a valid tuple.
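
The failure mode is the usual Python one: unpacking `None` raises a TypeError. A minimal sketch of the guarded pattern, where `walk` and `collect` are hypothetical stand-ins and not the functions in scylla-gdb.py:

```python
def walk(chain, start):
    """Hypothetical stand-in for _walk: returns (task, next) or None."""
    if start not in chain:
        return None
    return start, chain[start]


def collect(chain, start):
    """Follow the chain, stopping cleanly when walk() yields None."""
    seen = []
    current = start
    while current is not None:
        res = walk(chain, current)
        if res is None:      # check before unpacking
            break
        task, current = res
        seen.append(task)
    return seen
```

Without the `if res is None` check, `task, current = res` would fail as soon as `walk()` returned None.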

fixes: #9860
tests: scylla-gdb(release, failure not reproduced though)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211231094311.2495-1-xemul@scylladb.com>
2021-12-31 22:21:58 +02:00
Pavel Emelyanov
946e03351e scylla-gdb: Handle rate-limited IO scheduler groups
The capacity accounting was changed, scylla-gdb.py should know
the new layout. On error -- fall back to current state.

tests: scylla-gdb(release, current and patched seastar)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211231073427.32453-1-xemul@scylladb.com>
2021-12-31 22:20:45 +02:00
Pavel Solodovnikov
904de0a094 gms: introduce two gossip features for raft-based cluster management
The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features.

These features provide a way to organize the automatic
switch to raft-based cluster management.

The scheme is as follows:
 1. Every new node declares support for raft-based cluster ops.
 2. At the moment, no nodes in the cluster can actually use
    raft for cluster management, until the `SUPPORTS*` feature is enabled
    (i.e. understood by every node in the cluster).
 3. After the first `SUPPORTS*` feature is enabled, the nodes
    can declare support for the second, `USES*` feature, which
    means that the node can actually switch to use raft-based cluster
    ops.
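
The two-step scheme can be modelled in a few lines of Python. This is a toy model, not the gossiper code: a cluster is represented as a list of per-node feature sets, and `enabled` and `advertise` are hypothetical helpers.

```python
SUPPORTS = "SUPPORTS_RAFT_CLUSTER_MANAGEMENT"
USES = "USES_RAFT_CLUSTER_MANAGEMENT"


def enabled(feature, cluster):
    """A feature is enabled once every node in the cluster advertises it."""
    return all(feature in node for node in cluster)


def advertise(cluster):
    """One round: nodes add USES only after SUPPORTS is enabled everywhere."""
    if enabled(SUPPORTS, cluster):
        return [node | {USES} for node in cluster]
    return cluster
```

As long as one node has not declared `SUPPORTS*`, `advertise()` leaves the cluster unchanged, so no node starts using raft-based cluster ops early.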

The scheme ensures that even if some nodes are down while
transitioning to the new bootstrap mechanism, they can easily
switch to the new procedure without risking disruption of the
cluster.

The features are not actually wired to anything yet,
providing a framework for the integration with `raft_group0`
code, which is the subject of a follow-up series.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>
2021-12-30 11:05:45 +02:00
Tomasz Grabiec
7038dc7003 lsa: Fix segment leak on memory reclamation during alloc_buf
alloc_buf() calls new_buf_active() when there is no active segment to
allocate a new active segment. new_buf_active() allocates memory
(e.g. a new segment) so may cause memory reclamation, which may cause
segment compaction, which may call alloc_buf() and re-enter
new_buf_active(). The first call to new_buf_active() would then
override _buf_active and cause the segment allocated during segment
compaction to be leaked.

This then causes abort when objects from the leaked segment are freed
because the segment is expected to be present in _closed_segments, but
isn't. boost::intrusive::list::erase() will fail on assertion that the
object being erased is linked.

Introduced in b5ca0eb2a2.

Fixes #9821
Fixes #9192
Fixes #9825
Fixes #9544
Fixes #9508
Refs #9573

Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>
2021-12-30 11:02:08 +02:00
Piotr Jastrzebski
85f5277a05 max_result_size: Expand the comment
Add a description of how the SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT cluster
feature is used and note that only coordinators check it. Decision made
by a coordinator is immutable for the whole request and can be checked
by looking at page_size field. If it's set to 0 or unset then we're
handling the struct in the old way. Otherwise, new way is used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9855
2021-12-29 17:34:15 +02:00
Avi Kivity
e6f7ade60c Update tools/java submodule (python2 dependency)
* tools/java 8fae618f7f...6249bfbe2f (1):
  > dist/debian: replace "python (>=2.7)" with "python2"

Ref #9498.
2021-12-29 17:31:53 +02:00
Avi Kivity
9e74556413 Merge 'Support reverse reads in the row cache natively' from Tomasz Grabiec
This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read.

The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after 703aed3277.

Refs: #1413

Tests:

  - unit [dev]
  - manual query with build/dev/scylla and cache tracing on

Closes #9454

* github.com:scylladb/scylla:
  tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries
  row_cache: partition_snapshot_row_cursor: Print more details about the current version vector
  row_cache: Improve trace-level logging
  config: Use cache for reversed reads by default
  config: Adjust reversed_reads_auto_bypass_cache description
  row_cache: Support reverse reads natively
  mvcc: partition_snapshot: Support slicing range tombstones in reverse
  test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition
  row_cache: Log produced range tombstones
  test: Make produces_range_tombstone() report ck_ranges
  tests: lib: random_mutation_generator: Extract make_random_range_tombstone()
  partition_snapshot_row_cursor: Support reverse iteration
  utils: immutable-collection: Make movable
  intrusive_btree: Make default-initialized iterator cast to false
2021-12-29 16:53:25 +02:00
Avi Kivity
4a323772c1 Merge 'Use the same page size limit in reverse queries as in forward reads' from Piotr Jastrzębski
The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody)

This limit is then used by the replica to decide when to break pages and, in case of reversed clustering order reads, when to fail the read when accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they can reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding.

Most readers are now supporting reading in reverse natively, and only reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to less memory footprint).

Tests: unit(dev)

Closes #9815

* github.com:scylladb/scylla:
  storage_proxy: Send page_size in the read_command
  gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
  result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
  max_result_size: Add page_size field
2021-12-29 15:04:01 +02:00
Nadav Har'El
4374c73d82 Merge 'Fix bad lowres_clock::duration assumptions' from Avi Kivity
Some code assumes that lowres_clock::duration is milliseconds, but public
documentation never claimed that. Harden the code for a change in the
definition by removing the assumptions.

Closes #9850

* github.com:scylladb/scylla:
  loading_cache: fix mixup of std::chrono::milliseconds and lowres_clock::duration
  service: storage_proxy: fix lowres_clock::duration assumption
  service: misc_services: fix lowres_clock::duration assumption
  gossip: fix lowres_clock::duration assumption
2021-12-28 23:32:26 +02:00
Avi Kivity
d40722d598 loading_cache: fix mixup of std::chrono::milliseconds and lowres_clock::duration
lowres_clock uses the two types interchangeably, although they are not
defined to be the same. Fix by using only lowres_clock::duration.
2021-12-28 21:19:08 +02:00
Avi Kivity
966bb3c8f0 service: storage_proxy: fix lowres_clock::duration assumption
calculate_delay() implicitly converts a lowres_clock::duration to
std::chrono::microseconds. This fails if lowres_clock::duration has
higher resolution than microseconds.

Fix by using an explicit conversion, which always works.
2021-12-28 21:17:14 +02:00
Avi Kivity
e2a3f974d6 service: misc_services: fix lowres_clock::duration assumption
recalculate_hitrates() is defined to return future<lowres_clock::duration>
but actually returns future<std::chrono::milliseconds>. This fails
if the two types are not the same.

Fix by returning the declared type.
2021-12-28 21:15:40 +02:00
Avi Kivity
49a603af39 gossip: fix lowres_clock::duration assumption
The variable diff is assigned a type of std::chrono::milliseconds
but later used to store the difference between two
lowres_clock::time_point samples. This works now because the two
types are the same, but fails if lowres_clock::duration changes.

Remove the assumption by using lowres_clock::duration.
2021-12-28 21:13:59 +02:00
Piotr Jastrzebski
7fa3fa6e65 storage_proxy: Send page_size in the read_command
When the whole cluster is already supporting
separate_page_size_and_safety_limit,
start sending page_size in read_command. This new value will be used
for determining the page size instead of hard_limit.

Fixes #9487
Fixes #7586

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:38:02 +01:00
Piotr Jastrzebski
02d5997377 gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
This new feature will be used to determine whether the whole cluster
is ready to use additional page_size field in max_result_size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:38:02 +01:00
Piotr Jastrzebski
1ca39458f2 result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
This means when page_size is sent together with read_command it will be
used for paged queries instead of the hard_limit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:38:01 +01:00
Piotr Jastrzebski
ae2c199bcd max_result_size: Add page_size field
With this new field comes a new member function called get_page_size.
This new function will be used by the result_memory_accounter to decide
when to cut a page.

The behaviour of get_page_size depends on whether page_size field is
set. This is distinguished by page size being equal to 0 or not. When
page_size is equal to 0 then it's not set and hard_limit will be
returned from get_page_size. Otherwise, get_page_size will return
page_size field.

When read_command is received from an old node, page_size will be equal
to 0 and hard_limit will be used to determine the page size. This is
consistent with the behaviour on the old nodes.
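
The fallback rule described above can be modelled in a few lines. This is a Python sketch of the logic only, not the actual C++ definition of max_result_size; the class name `MaxResultSize` and the sample sizes are illustrative assumptions, while `hard_limit`, `page_size` and `get_page_size` follow the names used in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaxResultSize:
    """Illustrative model; only the fields named in the commit message are included."""
    hard_limit: int
    page_size: int = 0  # 0 means "not set", e.g. the command came from an old node

    def get_page_size(self) -> int:
        # Fall back to hard_limit when no page_size was sent.
        return self.hard_limit if self.page_size == 0 else self.page_size


MB = 1024 * 1024  # sample sizes below are assumptions chosen for illustration
from_old_node = MaxResultSize(hard_limit=100 * MB)
from_new_node = MaxResultSize(hard_limit=100 * MB, page_size=1 * MB)
print(from_old_node.get_page_size())  # hard_limit is used
print(from_new_node.get_page_size())  # page_size is used
```

Using 0 as the "not set" marker means an old node needs no change at all: it never fills the field, and the receiver behaves as before.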

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:37:49 +01:00
Valerii Ponomarov
12fa68fe67 scylla_util: return boolean calling systemd_unit.available
As of now, 'systemd_unit.available' works correctly only when the
provided unit is present.
When the provided systemd unit is absent, it raises an Exception
instead of returning a boolean.

So, make it return a boolean in both cases.
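
One way to get a boolean in both cases is to inspect the exit status of `systemctl` rather than letting a failed call raise. The sketch below is an assumption about the shape of such a check, not the actual scylla_util code; the function name `unit_available` and the injectable `run` parameter (there only so the function can be exercised without systemd) are hypothetical.

```python
import subprocess


def unit_available(unit, run=subprocess.run):
    """Return True if systemd can show the unit's files, False otherwise."""
    # `systemctl cat` exits with a non-zero status when the unit does not exist.
    # Without check=True, subprocess.run() reports that status instead of raising.
    result = run(['systemctl', 'cat', unit],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

A caller can then write `if unit_available('some.service'):` without wrapping the call in a try block.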

Fixes https://github.com/scylladb/scylla/issues/9848

Closes #9849
2021-12-28 15:14:04 +02:00
Tomasz Grabiec
2a3450dfb7 Merge "db: save supported features after passing gossip feature check" from Pavel Solodovnikov
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.

This makes the `system.local#supported_features` column store the
advertised feature set. Leave a comment in the definition of
`system.local` schema to reflect that.

Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.

Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).

* manmanson/save_adv_features_v2:
  db: save supported features after passing gossip feature check
  db: add `save_local_supported_features` function
2021-12-28 11:26:11 +02:00
Nadav Har'El
b8786b96f4 commitlog: fix missing wait for semaphore units
Commit dcc73c5d4e introduced a semaphore
for excluding concurrent recalculations - _reserve_recalculation_guard.

Unfortunately, the two places in the code which tried to take this
guard just called get_units() - which returns a future<units>, not
units - and never waited for this future to become available.

So this patch adds the missing "co_await" needed to wait for the
units to become available.
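
The same class of bug can be shown with a Python asyncio analogy (this is not the commitlog code, and `guard` is only a stand-in for _reserve_recalculation_guard): calling an asynchronous acquire without waiting for it produces an object that represents the pending operation, and nothing is held yet.

```python
import asyncio


async def demo():
    guard = asyncio.Semaphore(1)

    pending = guard.acquire()          # bug analog: only creates a coroutine object
    held_before_wait = guard.locked()  # the guard is still free at this point

    await pending                      # fix analog: wait until the guard is really held
    held_after_wait = guard.locked()

    guard.release()
    return held_before_wait, held_after_wait


result = asyncio.run(demo())
```

In the C++ code the role of `await` is played by the added co_await on the future returned by get_units().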

Fixes #9770.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>
2021-12-27 16:56:30 +02:00
Eliran Sinvani
6d9d00ec9c configure.py: Set seastar scheduling groups count explicitly
In order to have stability and also regression control, we set
the scheduling groups parameter explicitly.

Closes #9847
2021-12-27 15:48:45 +02:00
Takuya ASADA
6a834261fb scylla_coredump_setup: prevent coredump timeout on systemd-coredump@.service
On newer versions of systemd-coredump, coredumps are handled in
systemd-coredump@.service, which may time out while running the
systemd unit, like this:
  systemd[1]: systemd-coredump@xxxx.service: Service reached runtime time limit. Stopping.
To prevent that, we need to override TimeoutStartSec=infinity.

Fixes #9837

Closes #9841
2021-12-27 13:58:07 +02:00
Takuya ASADA
0d8f932f0b scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).

Fixes #9540

Closes #9782
2021-12-27 12:07:34 +02:00
Takuya ASADA
7064ae3d90 dist: fix scylla-housekeeping uuid file chmod call
Should use chmod() on a file, not fchmod()
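
The distinction matters because the two calls take different arguments. A minimal Python sketch, assuming the call is made from a Python script; the helper name `write_uuid_file` and the mode 0o644 are illustrative assumptions, not taken from scylla-housekeeping:

```python
import os


def write_uuid_file(path, uuid_text, mode=0o644):
    """Write the uuid to `path` and set the file permissions by path."""
    with open(path, 'w') as f:
        f.write(uuid_text + '\n')
    # os.chmod() takes a path; os.fchmod() expects an open file descriptor.
    os.chmod(path, mode)
```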

Fixes #9683

Closes #9802
2021-12-27 11:47:06 +02:00
Raphael S. Carvalho
ad82ede5f3 compaction: simplify rewrite_sstables() with coroutine
rewrite_sstables() is terribly nested, making it hard to read.
as usual, can be nicely simplified with coroutines.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223135012.56277-1-raphaelsc@scylladb.com>
2021-12-26 14:10:52 +02:00
Piotr Sarna
a36c8990ab docs: move service_levels.md to design-notes
Along the way, our flat structure for docs was changed
to categorize the documents, but service_levels.md was forward-ported
later and missed the created directory structure, so it was created
as a sole document in the top directory. Move it to where the other
similar docs live.

Message-Id: <68079d9dd511574ee32fce15fec541ca75fca1e2.1640248754.git.sarna@scylladb.com>
2021-12-26 14:10:52 +02:00
Piotr Sarna
483a98aa14 docs: add AssemblyScript example to wasm.md
The paragraph about WebAssembly missed a very useful language,
AssemblyScript. An example for it is provided in this patch.

Message-Id: <8d6ea1038f2944917316de29c7ca5cce88b2a148.1640248754.git.sarna@scylladb.com>
2021-12-26 14:10:52 +02:00
Avi Kivity
9643f84d81 Merge "Eliminate direct storage_proxy usage from cql3 statements" from Pavel E
"

The token metadata and features should be kept on the query_processor itself,
so finally the "storage" API would look like this:

      6 .query()
      5 .get_max_result_size()
      2 .mutate_with_triggers()
      2 .cas()
      1 .truncate_blocking()

The get_max_result_size() is probably also worth moving away from storage,
it seems to have nothing to do with it.

tests: unit(dev)
"

* 'br-query-processor-in-cql-statements' of https://github.com/xemul/scylla:
  cql3: Generalize bounce-to-shard result creation
  cql3: Get data dictionary directly from query_processor
  create_keyspace_statement: Do not use proxy.shared_from_this()
  cas_request: Make read_command() accept query_processor
  select_statement: Replace all proxy-s with query_processor
  create_|alter_table_statement: Make check_restricted_table_properties() accept query_processor
  create_|alter_keyspace_statement: Make check_restricted_replication_strategy() accept query_processor
  role_management_statement: Make validate_cluster_support() accept query_processor
  drop_index_statement: Make lookup_indexed_table() accept query_processor
  batch_|modification_statement: Make get_mutations accept query_processor
  modification_statement: Replace most of proxy-s with query_processor
  batch_statement: Replace most of proxy-s with query_processor
  cql3: Make create_arg_types()/prepare_type() accept query_processor
  cql3: Make .validate_while_executing() accept query_processor
  cql3: Make execution stages carry query_processor over
  cql3: Make .validate() and .check_access() accept query_processor
2021-12-26 14:10:52 +02:00
Nadav Har'El
e4b2dfb54d alternator ttl: when node is down, secondary node continues to expire
The current implementation of the Alternator expiration (TTL) feature
has each node scan for expired partitions in its own primary ranges.
This means that while a node is down, items in its primary ranges will
not get expired.

But it doesn't have to be this way: If only a single node is
down, and RF=3, the items that node owns are still readable with QUORUM -
so these items can still be safely read and checked for expiration - and
also deleted.

This patch implements a fairly simple solution: When a node completes
scanning its own primary ranges, it also checks whether any of its *secondary*
ranges (ranges where it is the *second* replica) has its primary owner
down. For such ranges, this node will scan them as well. This secondary
scan stops if the remote node comes back up, but in that case it may
happen that both nodes will work on the same range at the same time.
The risks in that are minimal, though, and amount to wasted work and
duplicate deletion records in CDC. In the future we could avoid this by
using LWT to claim ownership on a range being scanned.
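
The selection rule can be sketched as follows. This is a simplified Python model, not the Alternator implementation; `ranges_to_scan`, the `(token_range, replicas)` representation and the `is_alive` callback are hypothetical names introduced for illustration.

```python
def ranges_to_scan(me, ranges, is_alive):
    """Pick the ranges a node scans for expired items.

    ranges: list of (token_range, replicas) with replicas ordered primary first.
    is_alive: callable telling whether a node is currently up.
    """
    # Every node always scans the ranges it is the primary replica for.
    primary = [r for r, replicas in ranges if replicas[0] == me]
    # It additionally scans ranges where it is the second replica
    # and the primary owner is down.
    secondary = [
        r for r, replicas in ranges
        if len(replicas) > 1 and replicas[1] == me and not is_alive(replicas[0])
    ]
    return primary + secondary
```

Because liveness is only a local observation, two nodes can briefly pick the same range when the primary comes back, which is the duplicate work the message accepts.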

We have a new dtest (see a separate patch), alternator_ttl_tests.py::
TestAlternatorTTL::test_expiration_with_down_node, which reproduces this
and verifies this fix. The test starts a 5-node cluster, with 1000 items
with random tokens which are due to be expired immediately. The test
expects to see all items expiring ASAP, but when one of the five nodes
is brought down, this doesn't happen: Some of the items are not expired,
until this patch is used.

Fixes #9787

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211222131933.406148-1-nyh@scylladb.com>
2021-12-26 14:10:52 +02:00
Pavel Solodovnikov
83862d9871 db: save supported features after passing gossip feature check
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.

This makes the `system.local#supported_features` column store the
advertised feature set. Leave a comment in the definition of
`system.local` schema to reflect that.

Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.

Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-23 12:48:37 +03:00
Pavel Emelyanov
d98dd0ff80 cql3: Generalize bounce-to-shard result creation
The main intention is actually to free the qp.proxy() from the
need to provide the get_stats() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 11:28:44 +03:00
Pavel Emelyanov
d32de22ee8 cql3: Get data dictionary directly from query_processor
After previous patches there's a whole bunch of places that do

  qp.proxy().data_dictionary()

while the data_dictionary is present on the query processor itself
and there's a public method to get one. So use it everywhere.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 11:28:44 +03:00
Pavel Emelyanov
ec101e8b56 create_keyspace_statement: Do not use proxy.shared_from_this()
The prepare_schema_mutations is not a sleeping method, so there's no
point in getting call-local shared pointer on proxy. Plain reference
is more than enough.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 11:28:44 +03:00
Pavel Emelyanov
b29d3f1758 cas_request: Make read_command() accept query_processor
Just replace the argument and patch the callers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
da4c29105d select_statement: Replace all proxy-s with query_processor
This is the largest user of proxy argument. Fix them all and
their callers (all sit in the same .cc file).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
70ad1d9933 create_|alter_table_statement: Make check_restricted_table_properties() accept query_processor
Patch check_restricted_table_properties() and its callers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
2ca8a580d9 create_|alter_keyspace_statement: Make check_restricted_replication_strategy() accept query_processor
Patch the check_restricted_replication_strategy() and its callers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
0ea9e2636f role_management_statement: Make validate_cluster_support() accept query_processor
Patch internal role_management_statement's methods to use query_processor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
4c2343e8dd drop_index_statement: Make lookup_indexed_table() accept query_processor
Patch internal drop_index_statement's methods to use query_processor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
7a15f1c402 batch_|modification_statement: Make get_mutations accept query_processor
This completes the batch_ and modification_statement rework.
Also touch the private batch_statement::read_command while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
b1b230548b modification_statement: Replace most of proxy-s with query_processor
There are some internal methods that use proxy argument. Replace
most of them with query_processor, next patch will fix the rest --
those that interact with batch statement.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
3bad767f67 batch_statement: Replace most of proxy-s with query_processor
There are some proxy arguments left in the batch_statement internals.
Fix most of them to be query_processors. Few remainders will come
later as they rely on other statements to be fixed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
83c79b8133 cql3: Make create_arg_types()/prepare_type() accept query_processor
Change the methods' argument, then fix compiler errors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:28 +03:00
Pavel Emelyanov
3d373597eb cql3: Make .validate_while_executing() accept query_processor
The schema_altering_statement declares this pure virtual method. This
patch changes its first argument from proxy into query processor and
fixes the resulting compiler errors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:54:27 +03:00
Pavel Emelyanov
bce2ed9c6c cql3: Make execution stages carry query_processor over
The batch_ , modification_ and select_ statements get proxy from
query processor just to push it through execution stage. Simplify
that by pushing the query processor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:53:44 +03:00
Pavel Emelyanov
b990ca5550 cql3: Make .validate() and .check_access() accept query_processor
This is mostly a sed script that replaces methods' first argument
plus fixes of compiler-generated errors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-23 10:53:44 +03:00
Benny Halevy
f7b8b809d0 sstables: parse chunked_vector<std::integral Members>: maximize chunk size
Currently this parse function reads only 100KB worth
of members in each iteration.

Since the default max_chunk_capacity is 128KB,
100KB underutilizes the chunk capacity, and it could
be safely increased to the max to reduce the number of
allocations and corresponding calls to read_exactly
for large arrays.

Expose utils::chunked_vector::max_chunk_capacity
so that the caller doesn't have to guess this number,
and use it in parse().
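
A quick way to see the effect is to count the reads needed for a large array at each chunk size. The sketch below is Python arithmetic only, not the parse() code; `read_sizes` and the 1MB array size are assumptions made for illustration.

```python
def read_sizes(total_bytes, chunk_bytes):
    """Sizes of the successive reads needed to consume total_bytes."""
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        n = min(remaining, chunk_bytes)
        sizes.append(n)
        remaining -= n
    return sizes


KB = 1024
array_bytes = 1024 * KB  # hypothetical 1MB array of integral members
reads_before = len(read_sizes(array_bytes, 100 * KB))  # old 100KB step
reads_after = len(read_sizes(array_bytes, 128 * KB))   # step equal to the chunk capacity
```

For this sample size the number of reads drops from 11 to 8, and every read except possibly the last fills a whole chunk.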

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>
2021-12-22 15:47:37 +02:00
Benny Halevy
d95f6602a7 sstables: coroutinize parse functions
Simplify the implementation using coroutines.
This also has the potential to coalesce multiple
allocations into one.

test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-1-bhalevy@scylladb.com>
2021-12-22 15:47:37 +02:00
Benny Halevy
2f2e3b2e84 test: lib: index_reader_assertions: close reader before it is destroyed
Otherwise, it may trip an assertion when the underlying
file is closed, as seen in e.g.:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4318/artifact/testlog/x86_64_release/sstable_3_x_test.test_read_rows_only_index.4174.log
```
test/boost/sstable_3_x_test.cc(0): Entering test case "test_read_rows_only_index"
sstable_3_x_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
Backtrace:
  0x22557e8
  0x2286842
  0x7f2799e99a1f
  /lib64/libc.so.6+0x3d2a1
  /lib64/libc.so.6+0x268a3
  /lib64/libc.so.6+0x26788
  /lib64/libc.so.6+0x35a15
  0x222c53d
  0x222c548
  0xb929cc
  0xc0b23b
  0xa84bbf
  0x24d0111
```

Decoded:
```
__GI___assert_fail at :?
~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:205
~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:202
std::default_delete<seastar::data_source_impl>::operator()(seastar::data_source_impl*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~data_source at ././seastar/include/seastar/core/iostream.hh:55
 (inlined by) ~input_stream at ././seastar/include/seastar/core/iostream.hh:254
 (inlined by) ~continuous_data_consumer at ././sstables/consumer.hh:484
 (inlined by) ~index_consume_entry_context at ././sstables/index_reader.hh:116
 (inlined by) std::default_delete<sstables::index_consume_entry_context<sstables::index_consumer> >::operator()(sstables::index_consume_entry_context<sstables::index_consumer>*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~index_bound at ././sstables/index_reader.hh:395
 (inlined by) ~index_reader at ././sstables/index_reader.hh:435
std::default_delete<sstables::index_reader>::operator()(sstables::index_reader*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~index_reader_assertions at ././test/lib/index_reader_assertions.hh:31
 (inlined by) operator() at ./test/boost/sstable_3_x_test.cc:4630
```

Test: unit(dev), sstable_3_x_test.test_read_rows_only_index(release X 10000)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222132858.2155227-1-bhalevy@scylladb.com>
2021-12-22 15:33:22 +02:00
Raphael S. Carvalho
e80cb51b6a distributed_loader: make shutdown clean by properly handling compaction_stopped exception
Today, when resharding is interrupted, shutdown will not be clean
because stopped exception interrupts the shutdown process.
Let's handle stopped exception properly, to allow shutdown process
to run to completion.

Refs #9759

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211221175717.62293-1-raphaelsc@scylladb.com>
2021-12-22 15:08:31 +02:00
Botond Dénes
def6d48307 Merge 'gdb: Introduce "scylla lsa-check"' from Tomasz Grabiec
Catches inconsistencies in LSA state.

Currently:

  - discrepancy between segment set in _closed_segments and shard's
    segment descriptors

  - cross-shard segment references in _closed_segments

  - discrepancy in _closed_occupancy stats and what's in segment
    descriptors

  - segments not present in _closed_segments but present in
    segment descriptors

Refs https://github.com/scylladb/scylla/issues/9544

Closes #9834

* github.com:scylladb/scylla:
  gdb: Introduce "scylla lsa-check"
  gdb: Make get_base_class_offset() also see indirect base classes
2021-12-22 15:08:31 +02:00
Pavel Emelyanov
7286374dba migration_manager: Remove last occurrence of get_local_storage_proxy()
The migration manager got local storage proxy reference recently, but one
method still uses the global call. Fix it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211221120034.21824-1-xemul@scylladb.com>
2021-12-22 15:08:31 +02:00
Botond Dénes
aba68c8f83 Merge "reader_concurrency_semaphore: convert to flat_mutation_reader_v2" from Michael
"
The second patch in this series is a mechanical conversion of
reader_concurrency_semaphore to flat_mutation_reader_v2, and caller
updates.

The first patch is needed to pass the test suite, since without it a
real reader version conversion would happen on every entry to and exit
from reader_concurrency_semaphore, which is stressful (for example:
mutation_reader_test.test_multishard_streaming_reader reaches 8191
conversions for a couple of readers, which somehow causes it to catch
SIGSEGV in diverse and seemingly-random places).

Note that in a real workload it is unreasonable to expect readers being
parked in a reader_concurrency_semaphore to be pristine, so
short-circuiting their version conversions will be impossible and this
workaround will not really help.
"

* tag 'rcs-v2-v4' of https://github.com/cmm/scylla:
  reader_concurrency_semaphore: convert to flat_mutation_reader_v2
  short-circuit flat mutation reader upgrades and downgrades
2021-12-22 15:08:31 +02:00
Tomasz Grabiec
3e81318587 gdb: Introduce "scylla lsa-check"
Catches inconsistencies in LSA state.

Currently:

  - discrepancy between segment set in _closed_segments and shard's
    segment descriptors

  - cross-shard segment references in _closed_segments

  - discrepancy in _closed_occupancy stats and what's in segment
    descriptors

  - segments not present in _closed_segments but present in
    segment descriptors
2021-12-21 21:18:52 +01:00
Tomasz Grabiec
d754504fa2 gdb: Make get_base_class_offset() also see indirect base classes
I need it so that segment_descriptor is seen as inheriting from
list_base_hook<>, which it does via log_heap_hook.
2021-12-21 21:18:52 +01:00
Michael Livshin
a1b8ba23d2 reader_concurrency_semaphore: convert to flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-12-21 11:26:17 +02:00
Michael Livshin
9f656b96ac short-circuit flat mutation reader upgrades and downgrades
When asked to upgrade a reader that itself is a downgrade, try to
return the original v2 reader instead, and likewise when downgrading
upgraded v1 readers.

This is desirable because version transformations can result from,
say, entering/leaving a reader concurrency semaphore, and the amount
of such transformations is practically unbounded.

Such short-circuiting is only done if it is safe, that is: the
transforming reader's buffer is empty and its internal range tombstone
tracking state is discardable.
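
The idea can be modelled with plain wrapper objects. This is a Python sketch, not the flat_mutation_reader code: the classes `ConvertingReader`, `Upgraded` and `Downgraded` and their fields are hypothetical stand-ins, and only the function names upgrade_to_v2() and downgrade_to_v1() come from the code base.

```python
class ConvertingReader:
    """Hypothetical stand-in for a reader that adapts one stream version to the other."""

    def __init__(self, underlying):
        self.underlying = underlying
        self.buffer = []                         # converted but not yet consumed fragments
        self.tombstone_state_discardable = True  # no range tombstone is being tracked

    def can_short_circuit(self):
        # Safe only if dropping the wrapper loses no data and no state.
        return not self.buffer and self.tombstone_state_discardable


class Upgraded(ConvertingReader):
    """v2 view over a v1 reader."""


class Downgraded(ConvertingReader):
    """v1 view over a v2 reader."""


def upgrade_to_v2(reader):
    if isinstance(reader, Downgraded) and reader.can_short_circuit():
        return reader.underlying  # hand back the original v2 reader
    return Upgraded(reader)


def downgrade_to_v1(reader):
    if isinstance(reader, Upgraded) and reader.can_short_circuit():
        return reader.underlying  # hand back the original v1 reader
    return Downgraded(reader)
```

With this rule a pristine reader that crosses a version boundary back and forth never accumulates more than one wrapper, while a reader with buffered fragments is wrapped again so nothing is lost.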

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-12-21 11:26:17 +02:00
Raphael S. Carvalho
64ec1c6ec6 table: Make sure major compaction doesn't miss data in memtable
Make sure that major will compact data in all sstables and memtable,
as tombstones sitting in memtable could shadow data in sstables.
For example, a tombstone in memtable deleting a large partition could
be missed in major, so space wouldn't be saved as expected.
Additionally, write amplification is reduced as data in memtable
won't have to travel through tiers once flushed.

Fixes #9514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217160055.96693-2-raphaelsc@scylladb.com>
2021-12-21 07:21:34 +02:00
Raphael S. Carvalho
e1e8e020fe tests: Allow memtable to be flushed through column_family_for_tests
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217160055.96693-1-raphaelsc@scylladb.com>
2021-12-21 07:21:26 +02:00
Botond Dénes
bb0874b28b main,tools,configure.py: fold tools into scylla exec
The infrastructure is now in place. Remove the proxy main of the tools,
and add appropriate `else if` statements to the executable switch in
main.cc. Also remove the tool applications from the `apps` list and add
their respective sources as dependencies to the main scylla executable.
With this, we now have all tool executables living inside the scylla
main one.
2021-12-20 18:27:25 +02:00
Botond Dénes
0761113d8b tools: prepare for inclusion in scylla's main
Rename actual main to `${tool_name}_main` and have a proxy main call it.
In the next patch we will get rid of these proxy mains and the tool
mains will be invoked from scylla's main, if the `argv[0]` matches their
name.
The main functions are included in a new `tools/entry_point.hh` header.
2021-12-20 18:27:19 +02:00
Botond Dénes
972d789a27 main: add skeleton switching code on argv[1]
To prepare for the scylla executable hosting more than one app,
switching between them using argv[1]. This is consistent with how most
modern multi-app/multi-command programs work, one prominent example
being git.
For now only one app is present: scylla itself, called "server". If
argv[1] is missing or unrecognized, this is what is used as the default
for backward-compatibility.
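
The dispatch rule can be sketched like this. It is a Python model of the behaviour described above, not the code in main.cc; `select_app` and its return shape are hypothetical.

```python
def select_app(argv, apps, default='server'):
    """Return (app_name, remaining_args) for a multi-app executable."""
    if len(argv) > 1 and argv[1] in apps:
        return argv[1], argv[2:]
    # Missing or unrecognized argv[1]: run the default app with all arguments intact,
    # so existing command lines keep working.
    return default, argv[1:]
```

For example, both `['scylla', 'server', '--smp', '2']` and the older form `['scylla', '--smp', '2']` select the server app with the arguments `['--smp', '2']`.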

The scylla app also gets a description, which explains that scylla hosts
multiple apps and lists all the available ones.
2021-12-20 18:26:38 +02:00
Raphael S. Carvalho
e05859c3f9 compaction: kill unused code for resharding_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Raphael S. Carvalho
d1f2fd7f03 compaction: rename compacting_sstable_writer to compacted_fragments_writer
the name compacting_sstable_writer is misleading as it doesn't perform
any compaction. let's rename it to a name that reflects more what it
does.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Avi Kivity
f190434beb Merge "table,sstable_set: use v2 readers below the cache" from Botond
"
Convert sstable_set and table::make_sstable_reader() to v2. With this
all readers below cache use the v2 format.

Tests: unit(dev)
"

* 'table-make-sstable-reader-v2/v1' of https://github.com/denesb/scylla:
  table: upgrade make_sstable_reader() to v2
  sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2
  sstables/sstable_set: remove unused and undefined make_reader() member
2021-12-20 17:53:44 +02:00
Botond Dénes
18cddd3279 table: upgrade make_sstable_reader() to v2
With this all readers below cache use the v2 format (except kl/la
readers).
2021-12-20 17:40:46 +02:00
Botond Dénes
1a4ca831a4 main: extract scylla specific code into scylla_main()
main() now contains only generic setup and teardown code and it
delegates to scylla_main().
In the next patches we want to wire in tool executables into the scylla
one. This will require selecting the main to run at runtime.
scylla_main() will be just one of those (the default).
2021-12-20 17:31:46 +02:00
Botond Dénes
9027c6f936 sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2
With this all methods of the sstable set create v2 readers.
2021-12-20 17:17:33 +02:00
Botond Dénes
847eddf19a sstables/sstable_set: remove unused and undefined make_reader() member 2021-12-20 17:17:31 +02:00
Botond Dénes
55bb70a878 Merge "Make sure TWCS per-window major includes all files" from Raphael
"
TWCS performs STCS on a window as long as it's the most recent one.
From there on, TWCS will compact all files in the past window into
a single file. With some moderate write load, it could happen that
there's still some compaction activity in that past window, meaning
that per-window major may miss some files being currently compacted.
As a result, a past window may contain more than 1 file after all
compaction activity is done on its behalf, which may increase read
amplification. To avoid that, TWCS will now make sure that per-window
major is serialized, to make sure no files are missed.

Fixes #9553.

tests: unit(dev).
"

* 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla:
  TWCS: Make sure major on past window is done on all its sstables
  TWCS: remove needless param for STCS options
  TWCS: kill unused param in newest_bucket()
  compaction: Implement strategy control and wire it
  compaction: Add interface to control strategy behavior.
2021-12-20 17:12:50 +02:00
Avi Kivity
e772fcbd57 Merge "Convert combined reader to v2" from Botond
"
Users are adjusted by sprinkling `upgrade_to_v2()` and
`downgrade_to_v1()` where necessary (or removing any of these where
possible). No attempt was made to optimize and reduce the amount of
v1<->v2 conversions. This is left for follow-up patches to keep this set
small.

The combined reader is composed of 3 layers:
1. fragment producer - pop fragments from readers, return them in batches
  (each fragment in a batch having the same type and pos).
2. fragment merger - merge fragment batches into single fragments
3. reader implementation glue-code

Converting layers (1) and (3) was mostly mechanical. The logic of
merging range tombstone changes is implemented at layer (2), so the two
different producer (layer 1) implementations we have share this logic.

Tests: unit(dev)
"

* 'combined-reader-v2/v4' of https://github.com/denesb/scylla:
  test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging
  mutation_reader: convert make_clustering_combined_reader() to v2
  mutation_reader: convert position_reader_queue to v2
  mutation_reader: convert make_combined_reader() overloads to v2
  mutation_reader: combined_reader: convert reader_selector to v2
  mutation_reader: convert combined reader to v2
  mutation_reader: combined_reader: attach stream_id to mutation_fragments
  flat_mutation_reader_v2: add v2 version of empty reader
  test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation
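
The layering described above can be sketched roughly as follows. This is an illustrative Python model, not the C++ implementation: fragments are plain `(pos, kind, payload)` tuples, the payload is an assumed `(timestamp, value)` pair, and "highest timestamp wins" stands in for the real merge rules. Range tombstone change merging is not modelled.

```python
import heapq
from itertools import groupby


def tag(stream_id, reader):
    # Attach the id of the originating reader to every fragment.
    for pos, kind, payload in reader:
        yield (pos, kind, stream_id, payload)


def produce_batches(readers):
    """Layer 1: pop fragments from sorted readers, return them in batches
    where every fragment has the same position and kind."""
    same = lambda f: (f[0], f[1])
    merged = heapq.merge(*(tag(i, r) for i, r in enumerate(readers)), key=same)
    for _, batch in groupby(merged, key=same):
        yield list(batch)


def merge_batch(batch):
    """Layer 2: merge one batch into a single fragment (toy rule:
    the payload with the highest timestamp wins)."""
    pos, kind = batch[0][0], batch[0][1]
    winner = max((f[3] for f in batch), key=lambda payload: payload[0])
    return (pos, kind, winner)


def combined_reader(readers):
    """Layer 3: glue code exposing the merged stream."""
    for batch in produce_batches(readers):
        yield merge_batch(batch)


r1 = [(1, "row", (10, "a")), (3, "row", (5, "c"))]
r2 = [(1, "row", (20, "b")), (2, "row", (7, "x"))]
print(list(combined_reader([r1, r2])))
```

Keeping the merge rule in `merge_batch` alone mirrors the point made above: any number of producer implementations can share it.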
2021-12-20 14:01:03 +02:00
Pavel Solodovnikov
96799a72d9 db: add save_local_supported_features function
This is a utility function for writing the supported
feature set to the `system.local` table.

Will be used to move the corresponding part from
`system_keyspace::setup_version` to the gossiper
after passing remote feature check, effectively making
`system.local#supported_features` store the advertised
features (which already passed the feature check).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-20 13:31:52 +03:00
Botond Dénes
7f331cee01 test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging
Stressing the range tombstone change merging logic.
2021-12-20 09:29:05 +02:00
Botond Dénes
e1bbc4a480 mutation_reader: convert make_clustering_combined_reader() to v2
Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() on
call sites; no attempt at optimization was made.
2021-12-20 09:29:05 +02:00
Botond Dénes
2364144b19 mutation_reader: convert position_reader_queue to v2
By removing the converting (v1->v2) constructor of
`reader_and_upper_bound` and adjusting its users.
2021-12-20 09:29:05 +02:00
Botond Dénes
aeddcf50a1 mutation_reader: convert make_combined_reader() overloads to v2
Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() on
call sites; no attempt at optimization was made.
2021-12-20 09:29:05 +02:00
Botond Dénes
1554b94b78 mutation_reader: combined_reader: convert reader_selector to v2 2021-12-20 09:29:05 +02:00
Botond Dénes
71835bdee1 mutation_reader: convert combined reader to v2
The meat of the change is on the fragment merger level, which is now
also responsible for merging range tombstone changes. The fragment
producers are just mechanically converted to v2 by appending `_v2` to
the appropriate type names.
The beauty of this approach is that range tombstone merging happens in a
single place, shared by all fragment producers (there are 2 of them).

Selectors and factory functions are left as v1 for now, they will be
converted incrementally by the next patches.
2021-12-20 09:29:05 +02:00
Calle Wilund
4df008adcc production_snitch_base: Do reverse lookup of endpoint for info
Refs #9709
Refs #9653

If we don't find immediate info about an endpoint, check if
we're being asked about a "private" ip for the endpoint.
If so, give info for this.
2021-12-20 06:20:46 +02:00
Calle Wilund
4778770814 messaging_service: Make dc/rack encryption check for connection more strict
Fixes #9653

When doing an outgoing connection, in an internode_encryption=dc/rack situation
we should not use endpoint/local broadcast solely to determine if we can
downgrade a connection.

If gossip/message_service determines that we will connect to a different
address than the "official" endpoint address, we should use this to determine
association of target node, and similarly, if we bind outgoing connection
to interface != bc we need to use this to decide local one.

Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc
until such time that gossip can give accurate info on dc/rack for "internal"
ip addresses of nodes.
2021-12-20 06:20:46 +02:00
Asias He
eba4a4fba4 repair: Allow ignoring dead nodes for replace operation
Consider

1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3

We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.

Replace operation in step 3 will fail because n3 is down.
We would see errors like below:

replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.

In the above example, currently, there is no way to replace any of the
dead nodes.

Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.

Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.

This patch adds the --ignore-dead-nodes-for-replace option to allow running
the replace operation in best-effort mode. Please note, use this option
only if the dead nodes are completely broken and down, and there is no
way to fix the node and bring it back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issue.

Fixes #9757

Closes #9758
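
The scenario above can be modelled with a small sketch. This is a hypothetical Python model of the check, not the repair code itself; `check_replace` and its parameters are made-up names, and only the `--ignore-dead-nodes-for-replace` idea comes from the patch.

```python
def check_replace(needed_nodes, live_nodes, ignored_dead_nodes=frozenset()):
    """Return the nodes the replace operation will sync with.

    Raises if a needed node is down and was not explicitly ignored.
    """
    needed = set(needed_nodes) - set(ignored_dead_nodes)
    down = needed - set(live_nodes)
    if down:
        raise RuntimeError(
            f"Nodes={sorted(down)} needed for replace operation are down")
    return sorted(needed)


# n6 replaces n2 while n3 is also down.
needed = {"n1", "n3", "n4", "n5"}
live = {"n1", "n4", "n5"}
try:
    check_replace(needed, live)
except RuntimeError as e:
    print("without the option:", e)
print("with n3 ignored:", check_replace(needed, live, ignored_dead_nodes={"n3"}))
```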
2021-12-20 00:49:03 +02:00
Avi Kivity
7bdc999bba service: paxos_state: wean off get_local_storage_proxy()
Instead of calling get_local_storage_proxy in paxos_state, get it from the
caller (who is, in fact, storage_proxy or one of its components).

Some of the callers, although they are storage_proxy components, don't
have a storage_proxy reference handy and so they ignominiously call
get_local_storage_proxy() themselves. This will be adjusted later.

The other callers who are, in fact, storage_proxy, have to take special
care not to cross a shard boundary. When they do, smp::submit_to()
is converted to sharded::invoke_on() in order to get the correct local instance.

Test: unit (dev)

Closes #9824
2021-12-20 00:31:13 +02:00
Nadav Har'El
252ce8afd4 Merge 'Extend stop compaction api' from Benny Halevy
Allow stopping compaction by type on a given keyspace and list of tables.

Also add api unit test suite that tests the existing `stop_compaction` api
and the new `stop_keyspace_compaction` api.

Fixes #9700

Closes #9746

* github.com:scylladb/scylla:
  api: storage_service: validate_keyspace: improve exception error message
  api: compaction_manager: add stop_keyspace_compaction
  api: storage_service: expose validate_keyspace and parse_tables
  api: compaction_manager: stop_compaction: fix type description
  compaction_manager: stop_compaction: expose optional table*
  test: api: add basic compaction_manager test
2021-12-20 00:18:46 +02:00
Tomasz Grabiec
1c80d7fec4 tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries 2021-12-19 22:43:52 +01:00
Tomasz Grabiec
d678890757 row_cache: partition_snapshot_row_cursor: Print more details about the current version vector
Now the format is the same as for the "heap" version vector. Contains
positions and continuity flags. Helps in debugging.

Before:

  {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[0], heap=[
  ], latest_iterator=[{position: clustered,ckp{...},-1}]}

After:

 {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[{v=0, pos={position: clustered,ckp{...},-1}, cont=false}], heap=[
  ], latest_iterator=[{position: clustered,ckp{...},-1}]}
2021-12-19 22:41:35 +01:00
Tomasz Grabiec
5196d450bd row_cache: Improve trace-level logging
Print MVCC snapshot to help distinguish reads which use different snapshots.

Also, print the whole cursor, not just its position. This helps in
determining which MVCC version the iterator comes from.
2021-12-19 22:41:35 +01:00
Tomasz Grabiec
65a1a0247a config: Use cache for reversed reads by default 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
9fd1120ad5 config: Adjust reversed_reads_auto_bypass_cache description
Bypassing cache is no longer necessary to use native reverse readers.
2021-12-19 22:41:35 +01:00
Tomasz Grabiec
63351483f0 row_cache: Support reverse reads natively
Some implementation notes below.

When iterating in reverse, _last_row is after the current entry
(_next_row) in table schema order, not before like in the forward
mode.

Since there is no dummy row before all entries, reverse iteration must
be now prepared for the fact that advancing _next_row may land not
pointing at any row. The partition_snapshot_row_cursor maintains
continuity() correctly in this case, and positions the cursor before
all rows, so most of the code works unchanged. The only exception is
in move_to_next_entry(), which now cannot assume that failure to
advance to an entry means it can end a read.

maybe_drop_last_entry() is not implemented in reverse mode, which may
expose reverse-only workload to the problem of accumulating dummy
entries.

ensure_population_lower_bound() was not updating _last_row after
inserting the entry in the latest version. This was not a problem for
forward reads because they do not modify the row in the partition
snapshot represented by _last_row. They only need the row to be there
in the latest version after the call. It's different for reversed
reads, which change the continuity of the entry represented by
_last_row, hence _last_row needs to have the iterator updated to point
to the entry from the latest version, otherwise we'd set the
continuity of the previous version entry which would corrupt the
continuity.
2021-12-19 22:41:35 +01:00
Tomasz Grabiec
d0c367f44f mvcc: partition_snapshot: Support slicing range tombstones in reverse 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
87c921dff5 test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition
There may be unconsumed but expected fragments in the stream at the
time of the call to produces_partition_end().

Call check_rts() sooner to avoid failures.
2021-12-19 22:41:35 +01:00
Tomasz Grabiec
b3618163f8 row_cache: Log produced range tombstones 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
5f45d45c55 test: Make produces_range_tombstone() report ck_ranges 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
26ed0081a4 tests: lib: random_mutation_generator: Extract make_random_range_tombstone() 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
757fc1275f partition_snapshot_row_cursor: Support reverse iteration 2021-12-19 22:41:35 +01:00
Tomasz Grabiec
86791845ec utils: immutable-collection: Make movable
Classes with reference fields are not movable by default.
2021-12-19 22:41:35 +01:00
Pavel Emelyanov
d88ae7edae Merge 'migration_manager: retire global storage proxy refs' from Avi Kivity
Replace get_storage_proxy() and get_local_storage_proxy() with
constructor-provided references. Some unneeded cases were removed.

Test: unit (dev)

Closes #9816

* github.com:scylladb/scylla:
  migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference
  migration_manager: don't keep storage_proxy alive during schema_check verb
  mm: don't capture storage proxy shared_ptr during background schema merge
  mm: remove stats on schema version get
2021-12-17 17:53:08 +03:00
Raphael S. Carvalho
f508f54f3e table: move min_compaction_threshold() and compaction_enforce_min_threshold() into table_state
Compaction specific methods can be implemented in table_state only,
as they aren't needed elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211214191822.164223-1-raphaelsc@scylladb.com>
2021-12-17 10:00:31 +02:00
Piotr Sarna
f49c20aa24 thrift: drop obtaining incorrect permits
The thrift layer started partially having admission control
after commit ef1de114f0,
but code inspection suggests that it might cause use-after-free
in a few cases, when a permit is obtained more than once per
handling - due to the fact that some functions tail-called other
functions, which also obtain a permit.
These extraneous permits are not taken anymore.

Tests: "please trust me" + cassandra-stress in thrift mode
Message-Id: <ac5d711288b22c5fed566937722cceeabc234e16.1639394937.git.sarna@scylladb.com>
2021-12-17 09:35:24 +02:00
Avi Kivity
7c23ed888d Update tools/jmx submodule (dropping unneeded dependencies)
* tools/jmx 2c43d99...53f7f55 (1):
  > pom.xml: drop unneeded logging dependencies
2021-12-16 21:54:36 +02:00
Avi Kivity
a97731a7e5 migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference
A static helper also gained a storage_proxy parameter.
2021-12-16 21:05:47 +02:00
Avi Kivity
aca9029c24 migration_manager: don't keep storage_proxy alive during schema_check verb
The schema_check verb doesn't leak tasks, so when the verb is
unregistered it will be drained. So protection for storage_proxy lifetime
can be removed.
2021-12-16 21:04:27 +02:00
Avi Kivity
26c656f6ed mm: don't capture storage proxy shared_ptr during background schema merge
The definitions_update() verb captures a shared_ptr to storage_proxy
to keep it alive while the background task executes.

This was introduced in (2016!):

commit 1429213b4c
Author: Pekka Enberg <penberg@scylladb.com>
Date:   Mon Mar 14 17:57:08 2016 +0200

    main: Defer migration manager RPC verb registration after commitlog replay

    Defer registering migration manager RPC verbs after commitlog has has
    been replayed so that our own schema is fully loaded before other other
    nodes start querying it or sending schema updates.
    Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>

when moving this code from storage_proxy.cc.

Later, better protection with a gate was added:

commit 14de126ff8
Author: Pavel Emelyanov <xemul@scylladb.com>
Date:   Mon Mar 16 18:03:48 2020 +0300

    migration_manager: Run background schema merge in gate

    The call for merge_schema_from in some cases is run in the
    background and thus is not aborted/waited on shutdown. This
    may result in use-after-free one of which is

    merge_schema_from
     -> read_schema_for_keyspace
         -> db::system_keyspace::query
             -> storage_proxy::query
                 -> query_partition_key_range_concurrent

    in the latter function the proxy._token_metadata is accessed,
    while the respective object can be already free (unlike the
    storage_proxy itself that's still leaked on shutdown).

    Related bug: #5903, #5999 (cannot reproduce though)
    Tests: unit(dev), manual start-stop
           dtest(consistency.TestConsistency, dev)
           dtest(schema_management, dev)

    Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
    Reviewed-by: Pekka Enberg <penberg@scylladb.com>
    Message-Id: <20200316150348.31118-1-xemul@scylladb.com>

Since now the task execution is protected by the gate and
therefore migration_manager lifetime (which is contained within
that of storage_proxy, as it is constructed afterwards), capturing
the shared_ptr is not needed, and we therefore remove it, as
it uses the deprecated global storage_proxy accessors.
2021-12-16 21:01:06 +02:00
Botond Dénes
7db31e1bdb mutation_reader: combined_reader: attach stream_id to mutation_fragments
The fragment producer component of the combined reader returns a batch
of fragments on each call to operator()(). These fragments are merged
into a single one by the fragment merger. This patch adds a stream id to
each fragment in the batch which identifies the stream (reader) it
originates from. This will be used in the next patches to associate
range-tombstone-changes originating from the same stream with each other.
2021-12-16 14:57:49 +02:00
Botond Dénes
c193bbed82 flat_mutation_reader_v2: add v2 version of empty reader
Convert the v1 implementation to v2, downgrade to v1 in the existing
`make_empty_flat_reader()`.
2021-12-16 14:57:49 +02:00
Botond Dénes
f15f4952be test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation
Currently the test assumes that fragments represent weakly monotonic
upper bounds and therefore unconditionally overwrites the upper-bound on
receiving each fragment. Range tombstones however violate this as a
range tombstone with a smaller position (lower bound) may have a higher
upper bound than some or all fragments that follow it in the stream.
This causes test failures after converting the combined reader to
v2, but not before, no idea why.
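
A minimal illustration of the pitfall, assuming each fragment is modelled as a `(position, upper_bound)` pair (hypothetical representation; the patch does not spell out the fix, so taking the maximum is only one way to avoid it):

```python
def end_bound_overwrite(fragments):
    # Assumes upper bounds never go backwards: keep only the last one.
    upper = None
    for _pos, frag_upper in fragments:
        upper = frag_upper
    return upper


def end_bound_max(fragments):
    # Keep the largest upper bound seen so far instead.
    upper = None
    for _pos, frag_upper in fragments:
        if upper is None or frag_upper > upper:
            upper = frag_upper
    return upper


# A range tombstone covering [1, 10] followed by rows at positions 2 and 3.
stream = [(1, 10), (2, 2), (3, 3)]
print(end_bound_overwrite(stream), end_bound_max(stream))
```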
2021-12-16 14:57:49 +02:00
Nadav Har'El
9ae98dbe92 Merge 'Reduce boot time for dtest setup' from Asias He
This patch helps to speed up node boot up for test setups like dtest.

Nadav reported

```
With Asias's two patches to Scylla, and my patch to enable it in dtest:

Boot time of 5 nodes is now down to 9 seconds!

Remember we started this exercise with 214 seconds? :-)
```

Closes #9808

* github.com:scylladb/scylla:
  storage_service: Recheck tokens before throw in storage_service::bootstrap
  gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero
2021-12-16 13:44:42 +02:00
Pavel Emelyanov
b2a62d2b59 Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone) resulting in
different (non-canonical) result depending on the order of appending.

Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x.

Appending [a, x) then [x, b) then [x, x) would give [a, b)
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b)

The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical.

Fixes #9661

Closes #9764

* github.com:scylladb/scylla:
  range_tombstone_list: Deoverlap adjacent empty ranges
  range_tombstone_list: Convert to work in terms of position_in_partition
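
A toy model of why dropping empty ranges makes the result canonical. It assumes half-open integer ranges `[start, end)` that all carry the same tombstone and arrive with non-decreasing start positions; `append_range` is a hypothetical helper, not the range_tombstone_list API.

```python
def append_range(ranges, start, end):
    """Append [start, end) to a list of deoverlapped ranges."""
    if start >= end:
        return ranges  # empty range tombstone: drop it
    if ranges and ranges[-1][1] >= start:
        # Adjacent or overlapping: extend the last range.
        last_start, last_end = ranges[-1]
        ranges[-1] = (last_start, max(last_end, end))
    else:
        ranges.append((start, end))
    return ranges


a, x, b = 0, 5, 10
order1 = []
for r in [(a, x), (x, b), (x, x)]:
    append_range(order1, *r)
order2 = []
for r in [(a, x), (x, x), (x, b)]:
    append_range(order2, *r)
print(order1, order2)
```

With the empty range dropped, both append orders give the same list.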
2021-12-16 10:00:40 +03:00
Avi Kivity
c40043b142 mm: remove stats on schema version get 2021-12-15 18:56:18 +02:00
Nadav Har'El
d323b82cf6 Merge 'Introduce data_dictionary module' from Avi Kivity
The full user-defined structure of the database (keyspaces,
tables, user-defined types, and similar metadata, often known
as the schema in other databases) is needed by much of the
front-end code. But in Scylla it is deeply intertwined with the
replica data management code - ::database, ::keyspace, and
::table. Not only does the front-end not need data access, it
cannot get correct data via these objects since they represent
just one replica out of many.

This dual-role is a frequent cause of recompilations. It was solved
to some degree by forward declarations, but there is still a lot
of incidental dependencies.

To solve this, we introduce a data_dictionary module (and
namespace) to exclusively deal with greater schema metadata.
It is an interface, with a backing implementation by the existing code,
so it doesn't add a new source of truth. The plan is to allow mock
implementations for testing as well.

Test: unit (dev, release, debug).

Closes #9783

* github.com:scylladb/scylla:
  cql3, related: switch to data_dictionary
  test: cql_test_env: provide access to data_dictionary
  storage_proxy: provide access to data_dictionary
  database: implement data_dictionary interface
  data_dictionary: add database/keyspace/table objects
  data_dictionary: move keyspace_metadata to data_dictionary
  data_dictionary: move user_types_metadata to new module data_dictionary
2021-12-15 18:29:28 +02:00
Avi Kivity
87917d2536 Merge "gms: gossiper: coroutinize a few small functions" from Pavel S
"
Start converting small functions in gossiper code
from using `seastar::thread` context to coroutines.

For now, the changes are quite trivial.
Later, larger code fragments will be converted
to eliminate uses of `seastar::async` function calls.

Moving the code to coroutines makes the code a bit
more readable and also makes it immediately evident that a given
function is async just by looking at the signature (for
example, for void-returning functions, a coroutine
will return `future<>` instead of `void` in case of
a seastar::thread-using function).

Tests: unit(dev)
"

* 'coro_gossip_v1' of https://github.com/ManManson/scylla:
  gms: gossiper: coroutinize `maybe_enable_features`
  gms: gossiper: coroutinize `wait_alive`
  gms: gossiper: coroutinize `add_saved_endpoint`
  gms: gossiper: coroutinize `evict_from_membership`
2021-12-15 16:02:18 +02:00
Tomasz Grabiec
87e3552cb8 intrusive_btree: Make default-initialized iterator cast to false
This patch makes the following expression true:

 !bool(iterator_base{})

It's a reasonable expectation upon which subsequent patches will rely.
2021-12-15 13:54:40 +01:00
Avi Kivity
d768e9fac5 cql3, related: switch to data_dictionary
Stop using database (and including database.hh) for schema related
purposes and use data_dictionary instead.

data_dictionary::database::real_database() is called from several
places, for these reasons:

 - calling yet-to-be-converted code
 - callers with a legitimate need to access data (e.g. system_keyspace)
   but with the ::database accessor removed from query_processor.
   We'll need to find another way to supply system_keyspace with
   data access.
 - to gain access to the wasm engine for testing whether
   user-defined functions compile. We'll have to find another way to
   do this as well.

The change is a straightforward replacement. One case in
modification_statement had to change a capture, but everything else
was just a search-and-replace.

Some files that lost "database.hh" gained "mutation.hh", which they
previously had access to through "database.hh".
2021-12-15 13:54:23 +02:00
Avi Kivity
399e2895f1 test: cql_test_env: provide access to data_dictionary
Allow tests to have access to the data_dictionary.
2021-12-15 13:54:18 +02:00
Avi Kivity
c2da20484d storage_proxy: provide access to data_dictionary
Probably storage_proxy is not the correct place to supply
data_dictionary, but it is available to practically all of
the coordinator code, so it is convenient.
2021-12-15 13:54:08 +02:00
Avi Kivity
1de0a4b823 database: implement data_dictionary interface
Implement the new data_dictionary interface using the existing
::database, ::keyspace, and ::table classes. The implementation
is straightforward. This will allow the coordinator code to access
the full schema without depending on the gnarly bits that compose
::database, like reader_concurrency_semaphore or the backlog
controller.
2021-12-15 13:53:46 +02:00
Avi Kivity
e55a606423 data_dictionary: add database/keyspace/table objects
Add metadata-only counterparts to ::database, ::keyspace, and ::table.

Apart from being metadata-only objects suitable for the coordinator,
the new types are also type-erased and so they can be mocked without
being linked to ::database and friends.

We use a single abstract class to mediate between data_dictionary
objects and the objects they represent (data_dictionary::impl).
This makes the data_dictionary objects very lightweight - they
only contain a pointer to the impl object (of which only one
needs to be instantiated), and a reference to the object that
is represented. This allows these objects to be easily passed
by value.

The abstraction is leaky: in one place it is outright breached
with data_dictionary::database::real_database() that returns
a ::database reference. This is used so that we can perform the
transition incrementally. Another place is that one of the
methods returns a secondary_index_manager, which in turn grants
access to the real objects. This will be addressed later, probably
by type erasing as well.

This patch only contains the interface, and no implementation. It
is somewhat messy since it mimics the years-old evolution of the
real objects, but maybe it will be easier to improve it now.
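
The handle/impl split described above can be sketched in Python. This is an analogy under stated assumptions, not the C++ interface: `Impl`, `Database`, `MockImpl` and `keyspace_names` are illustrative names, and the "represented object" is an ordinary dict.

```python
from abc import ABC, abstractmethod


class Impl(ABC):
    """Single abstract mediator between handles and the real objects."""

    @abstractmethod
    def keyspace_names(self, db_ref):
        ...


class Database:
    """Lightweight handle: an impl plus a reference to the represented object."""

    __slots__ = ("_impl", "_ref")

    def __init__(self, impl, ref):
        self._impl = impl
        self._ref = ref

    def keyspace_names(self):
        # The handle knows nothing about the represented object's type.
        return self._impl.keyspace_names(self._ref)


class MockImpl(Impl):
    """Mock backing for tests: the 'database' is just a dict of keyspaces."""

    def keyspace_names(self, db_ref):
        return sorted(db_ref)


db = Database(MockImpl(), {"ks2": {}, "ks1": {}})
print(db.keyspace_names())
```

Because the handle holds only two references, it is cheap to copy, and a mock impl can stand in without pulling in the real database classes.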
2021-12-15 13:52:31 +02:00
Avi Kivity
3945acaa2d data_dictionary: move keyspace_metadata to data_dictionary
Like user_types_metadata, keyspace_metadata does not grant
data access, just metadata, and so belongs in data_dictionary.
2021-12-15 13:52:21 +02:00
Avi Kivity
021c7593b8 data_dictionary: move user_types_metadata to new module data_dictionary
The new module will contain all schema related metadata, detached from
actual data access (provided by the database class). User types is the
first contents to be moved to the new module.
2021-12-15 13:52:10 +02:00
Asias He
b4eb270e89 storage_service: Recheck tokens before throw in storage_service::bootstrap
In case count_normal_token_owners check fails, sleep and retry. This
makes test setups like skip_wait_for_gossip_to_settle = 0 and
ring_delay_ms = 0 work.
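
The "sleep and retry" idea can be sketched generically. The exact condition checked in storage_service::bootstrap is not reproduced here; `retry_check` and its parameters are hypothetical, and `check` stands for whatever predicate must eventually hold.

```python
import time


def retry_check(check, retries=5, delay_s=0.01, sleep=time.sleep):
    """Re-run `check` a few times before giving up, sleeping in between."""
    for _attempt in range(retries):
        if check():
            return True
        sleep(delay_s)
    raise RuntimeError("check still failing after retries")


# A check that only succeeds on the third call, as if gossip needed
# a moment to catch up.
calls = []


def flaky():
    calls.append(1)
    return len(calls) >= 3


print(retry_check(flaky, delay_s=0))
```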
2021-12-15 19:40:43 +08:00
Asias He
78d0cc4ab5 gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero
skip_wait_for_gossip_to_settle == 0 means do not wait for
gossip to settle at all. This is not respected in
gossiper::wait_for_range_setup and in gossiper::wait_for_gossip for
initial sleeps.

Setting skip_wait_for_gossip_to_settle to zero is not allowed in
production clusters anyway; it is mostly used by tests like dtest to
reduce the cluster boot-up time. Respect the skip_wait_for_gossip_to_settle
zero setting and avoid any sleep and wait completely.
2021-12-15 19:40:43 +08:00
Tzach Livyatan
d6fbabbf8c fix typo in repair_based_node_ops.md
Fix https://github.com/scylladb/scylla/issues/9786

Closes #9788
2021-12-15 09:56:21 +02:00
Avi Kivity
3ac622bdd8 Merge "Add v2 versions of make_forwadable() and make_flat_mutation_reader_from_fragments()" from Botond
"
These two readers are crucial for writing tests for any composable
reader so we need v2 versions of them before we can convert and test the
combined reader (for example). As these two readers are often used in
situations where the payload they deliver is specially crafted for the
test at hand, we keep their v1 versions too to avoid conversion meddling
with the tests.

Tests: unit(dev)
"

* 'forwarding-and-fragment-reader-v2/v1' of https://github.com/denesb/scylla:
  flat_mutation_reader_v2: add make_flat_mutation_reader_from_fragments()
  test/lib/mutation_source_test: don't force v1 reader in reverse run
  mutation_source: add native_version() getter
  flat_mutation_reader_v2: add make_forwardable()
  position_in_partition: add after_key(position_in_partition_view)
  flat_mutation_reader: make_forwardable(): fix indentation
  flat_mutation_reader: make_forwardable(): coroutinize reader
2021-12-14 20:43:09 +02:00
Raphael S. Carvalho
be6cfa4a83 table: only stop regular compaction when disabling auto compaction
disable auto compaction API is about regular compactions, so maintenance
operations like cleanup must not be stopped.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211213133541.36015-1-raphaelsc@scylladb.com>
2021-12-14 15:49:50 +02:00
Benny Halevy
f3a4ae1460 database: add debug messages around drop and truncate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211214104357.2224926-1-bhalevy@scylladb.com>
2021-12-14 14:26:33 +02:00
Benny Halevy
32d61a3d09 test: sstable_directory_test_table_lock_works: verify that truncate is blocked on the the table lock
The test in its current form is invalid, as database::remove
does remove the table's name from its listing
as well as from the keyspace metadata, so it won't be found
after that.

That said, database::drop_column_family then proceeds
to truncate and stop the table, after calling await_pending_ops,
and the latter should indeed block on the lock taken by the test.

This change modifies the test to create some sstables in the
table's directory before starting the sstable_directory.

Then, when executing "drop table" in the background,
wait until the table is not found by db.find_column_family.

That would fail the test before this change.
See https://jenkins.scylladb.com/job/scylla-enterprise/job/next/1442/artifact/testlog/x86_64_debug/sstable_directory_test.sstable_directory_test_table_lock_works.4720.log
```
INFO  2021-12-13 14:00:17,298 [shard 0] schema_tables - Dropping ks.cf id=00487bc0-5c1d-11ec-9e3b-a44f824027ae version=b10c4994-31c7-3f5a-9591-7fedb0273c82
test/boost/sstable_directory_test.cc(453): fatal error: in "sstable_directory_test_table_lock_works": unexpected exception thrown by table_ok.get()
```

At this point, the test verifies again that the sstables are still on
disk (and no truncate happened), and only after drop completed,
the table should not exist on disk.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211214104407.2225080-1-bhalevy@scylladb.com>
2021-12-14 14:26:17 +02:00
Nadav Har'El
31eeb44d28 alternator: fix error on UpdateTable for non-existent table
When the UpdateTable operation is called for a non-existent table, the
appropriate error is ResourceNotFoundException, but before this patch
we ran into an exception, which resulted in an ugly "internal server
error".

In this patch we use the existing get_table() function which most other
operations use, and which does all the appropriate verifications and
generates the appropriate Alternator api_error instead of letting
internal Scylla exceptions escape to the user.

This patch also includes a test for UpdateTable on a non-existent table,
which used to fail before this patch and pass afterwards. We also add a
test for DeleteTable in the same scenario, and see it didn't have this
bug. As usual, both tests pass on DynamoDB, which confirms we generate
the right error codes.

Fixes #9747.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>
2021-12-14 13:09:27 +01:00
Tomasz Grabiec
b228ddabb7 Merge "Move schema altering statement to raft" from Gleb
The series is on top of "wire up schema raft state machine". It will
apply without, but will not work obviously (when raft is disabled it
does nothing anyway).

This series does not provide any linearisability just yet though. It
only uses raft as a means to distribute schema mutations. To achieve
linearisability more work is needed. We need to at least make sure
that schema mutations use monotonically increasing timestamps and,
since schema altering statements are RMW, that no modifications to the
schema were made between schema mutation creation and application. If
there were, the operation needs to be restarted.

* scylla-dev/gleb/raft-schema-v5: (59 commits)
  cql3: cleanup mutation creation code in ALTER TYPE
  cql3: use migration_manager::schema_read_barrier() before accessing a schema in altering statements
  cql3: bounce schema altering statement to shard 0
  migration_manager: add is_raft_enabled() to check if raft is enabled on a cluster
  migration_manager: add schema_read_barrier() function
  migration_manager: make announce() raft aware
  migration_manager: co-routinize announce() function
  migration_manager: pass raft_gr to the migration manager
  migration_manager: drop view_ptr array from announce_column_family_update()
  mm: drop unused announce_ methods
  cql3: drop schema_altering_statement::announce_migration()
  cql3: drop has_prepare_schema_mutations() from schema altering statement
  cql3: drop announce_migration() usage from schema_altering_statement
  cql3: move DROP AGGREGATE statement to prepare_schema_mutations() api
  migration_manager: add prepare_aggregate_drop_announcement() function
  cql3: move DROP FUNCTION statement to prepare_schema_mutations() api
  migration_manager: add prepare_function_drop_announcement() function
  cql3: move CREATE AGGREGATE statement to prepare_schema_mutations() api
  migration_manager: add prepare_new_aggregate_announcement() function
  cql3: move CREATE FUNCTION statement to prepare_schema_mutations() api
  ...
2021-12-14 11:05:32 +01:00
Piotr Sarna
feea7cb920 Merge 'cql3: disentangle column_identifier from selectable' from Avi Kivity
column_identifier serves two purposes: a value type used to denote an
identifier (which may or may not map to a table column), and `selectable`
implementation used for selecting table columns. This stands in the way
of further refactoring - the unification of the WHERE clause prepare path
(prepare_expression()) and the SELECT clause prepare path
(prepare_selectable()).

Reduce the entanglement by moving the selectable-specific parts to a new
type, selectable_column, and leaving column_identifier as a pure value type.

Closes #9729

* github.com:scylladb/scylla:
  cql3: move selectable_column to selectable.cc
  cql3: column_identifier: split selectable functionality off from column_identifier
2021-12-14 10:37:32 +01:00
Benny Halevy
b28314c2e5 database: find_uuid: update comment
To agree with 8cbecb1c21.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211214073228.2159674-1-bhalevy@scylladb.com>
2021-12-14 11:17:50 +02:00
Nadav Har'El
815324713e test/alternator: add more tests for ADD operand mismatch
The "ADD" operator in UpdateItem's AttributeUpdates supports a number of
types (numbers, sets and strings), and should result in a ValidationException
if the attribute's existing type is different from the type of the
operand - e.g., trying to ADD a number to an attribute which has a set
as a value.

So far we only had partial testing for this (we tested the case where
both operands are sets, but of different types) so this patch adds the
missing tests. The new tests pass (on both Alternator and DynamoDB) -
we don't have a bug there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211213195023.1415248-1-nyh@scylladb.com>
2021-12-14 11:15:23 +02:00
Botond Dénes
425c0b0394 test/cql-pytest/nodetool.py: fix take_snapshot() for cassandra
take_snapshot() contained copypasta from flush() for the nodetool
variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211208110129.141592-1-bdenes@scylladb.com>
2021-12-14 11:15:23 +02:00
Takuya ASADA
6870938842 scylla_raid_setup: fix typo
Closes #9790
2021-12-14 11:15:23 +02:00
Benny Halevy
c89876c975 compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.

Throwing a compaction_stopped_exception, rather than co_return:ing
an exception, will be handled as any other exception, including closing
the reader.

Fixes #9766

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
2021-12-14 11:15:23 +02:00
Benny Halevy
d38206587e table: enable_auto_compaction: trigger compaction
auto_compaction has been disabled so sstables
may have already been accumulated and require compaction.

Do not wait for new sstables to be written to trigger
compaction, trigger compaction right away.

Refs #9784

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211212090632.1257829-1-bhalevy@scylladb.com>
2021-12-14 11:15:23 +02:00
Gleb Natapov
1ba9cc8836 cql3: cleanup mutation creation code in ALTER TYPE
Now that we have only one user for do_announce_migration() function it
can be simplified (and renamed to something more appropriate).
2021-12-14 09:01:42 +02:00
Gleb Natapov
72a55c584e cql3: use migration_manager::schema_read_barrier() before accessing a schema in altering statements
Schema altering statements do read/modify/write on the schema. To make
sure a statement accesses the latest schema it needs to execute a raft read
barrier before accessing the local schema copy.
2021-12-14 09:01:42 +02:00
Gleb Natapov
31a873c80e cql3: bounce schema altering statement to shard 0
Since raft's group zero resides on shard 0 only, let's bounce all schema
altering statements to shard 0 (if raft is enabled) to make it easier to
use it.
2021-12-14 09:01:42 +02:00
Gleb Natapov
6e5061a12d migration_manager: add is_raft_enabled() to check if raft is enabled on a cluster 2021-12-14 09:01:42 +02:00
Gleb Natapov
955e582fb6 migration_manager: add schema_read_barrier() function
The function is responsible for calling raft's group zero read barrier in
case it is enabled.
2021-12-14 09:01:42 +02:00
Gleb Natapov
9ee4ba143a migration_manager: make announce() raft aware
If raft is enabled use it to distribute schema change instead of direct
RPC calls.
2021-12-14 09:01:40 +02:00
Gleb Natapov
3fd834222a migration_manager: co-routinize announce() function 2021-12-14 09:00:33 +02:00
Tomasz Grabiec
78a6474982 range_tombstone_list: Deoverlap adjacent empty ranges
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone), resulting in
different (non-canonical) results depending on the order of appending.

 Suppose that [a, b] covers [x, x)

Appending [a, x) then [x, b) then [x, x) would give [a, b)
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b)

Fix by dropping empty range tombstones.
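
The order dependence described above can be illustrated with a toy model. This is only a sketch, not Scylla's range_tombstone_list: it assumes integer positions, half-open [start, end) ranges that all carry the same tombstone, and appends arriving in position order. The names `append_range` and `build` are hypothetical.

```python
def append_range(ranges, start, end):
    """Append half-open [start, end) to a sorted, non-overlapping list.

    Toy model: every range is assumed to carry the same tombstone, so
    adjacent ranges can simply be merged.
    """
    if start >= end:
        # An empty range covers nothing: drop it so the result does not
        # depend on where in the sequence it was appended.
        return ranges
    if ranges and ranges[-1][1] == start:
        # Adjacent to the previous range: extend it instead of adding a new one.
        ranges[-1] = (ranges[-1][0], end)
    else:
        ranges.append((start, end))
    return ranges


def build(appends):
    """Apply a sequence of appends to an empty list and return the result."""
    out = []
    for start, end in appends:
        append_range(out, start, end)
    return out
```

With a = 1, x = 5 and b = 9, both append orders from the message above produce the same single range once empty ranges are dropped.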
2021-12-13 21:31:36 +01:00
Raphael S. Carvalho
8eace8fc49 TWCS: Make sure major on past window is done on all its sstables
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.

Fixes #9553.

Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:10:43 -03:00
Raphael S. Carvalho
2dc890d8e6 TWCS: remove needless param for STCS options
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:40 -03:00
Raphael S. Carvalho
41a5736aaf TWCS: kill unused param in newest_bucket()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:36 -03:00
Raphael S. Carvalho
49f40c8791 compaction: Implement strategy control and wire it
This implements the strategy control interface for both manager and
tests, and wires it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:23 -03:00
Raphael S. Carvalho
6d9466052e compaction: Add interface to control strategy behavior.
This interface is akin to table_state, but is the compaction manager's
representative instead.
It will allow the compaction manager to set goals and constraints on
compaction strategies. It will start by allowing a strategy to know
if there's ongoing compaction, which is useful for virtually all
strategies. For example, LCS may want to compact L0 in parallel
with higher levels, to avoid L0 falling behind.
This interface can be easily extended to allow manager to switch
to a reclaim mode, if running out of space, etc.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 15:55:37 -03:00
Nadav Har'El
41c7b2fb4b test/cql-pytest run: fix inaccurate comment
The code in test/cql-pytest/run.py can start Scylla (or Cassandra, or
Redis, etc.) in a random IP address in 127.*.*.*. We explained in a
comment that 127.0.0.* is used by CCM so we avoid it in case someone
runs both dtest and test.py in parallel on the same machine.

But this observation was not accurate: Although the original CCM did use
only 127.0.0.*, in Scylla's CCM we added in 2017, in commit
00d3ba5562567ab83190dd4580654232f4590962, the ability to run multiple
copies of CCM in parallel; CCM now uses 127.0.*.*, not just 127.0.0.*.
So we need to correct this in the comment.

Luckily, the code doesn't need to change! We already avoided the entire
127.0.*.* for simplicity, not just 127.0.0.*.
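
The address choice described above can be sketched as follows. This is an illustration only, not the code in test/cql-pytest/run.py; the function name `pick_loopback_ip` and the exact octet ranges are assumptions.

```python
import random


def pick_loopback_ip(rng=random):
    """Return a random address in 127.*.*.* that is outside 127.0.*.*.

    Hypothetical helper: the second octet is never 0, so the whole
    127.0.*.* block (used by CCM) is avoided, not just 127.0.0.*.
    """
    second = rng.randint(1, 254)  # skip 0 to stay out of 127.0.*.*
    third = rng.randint(0, 254)
    fourth = rng.randint(1, 254)
    return f"127.{second}.{third}.{fourth}"
```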

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211212151339.1361451-1-nyh@scylladb.com>
2021-12-13 18:12:11 +02:00
Avi Kivity
e44a28dce4 Merge "compaction: Allow data from different buckets (e.g. windows) to be compacted together" from Raphael
"
Today, data from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer works on behalf
of a particular bucket. To solve this problem, the mutation compaction process is being
moved from consumer into producer, such that the interposer consumer, which is responsible
for segregation, will be fed with compacted data and forward it into the owner bucket.

Fixes #9662.

tests: unit(debug).
"

* 'compact_across_buckets_v2' of github.com:raphaelsc/scylla:
  tests: sstable_compaction_test: add test_twcs_compaction_across_buckets
  compaction: Move mutation compaction into producer for TWCS
  compaction: make enable_garbage_collected_sstable_writer() more precise
2021-12-12 15:07:15 +02:00
Gleb Natapov
e9fafea5c1 migration_manager: pass raft_gr to the migration manager
Migration manager will use raft group zero to distribute schema
changes.
2021-12-11 12:31:07 +02:00
Gleb Natapov
38e1f85959 migration_manager: drop view_ptr array from announce_column_family_update()
No users pass it any longer.
2021-12-11 12:31:07 +02:00
Gleb Natapov
a13ebe13c9 mm: drop unused announce_ methods 2021-12-11 12:31:07 +02:00
Gleb Natapov
730171f4df cql3: drop schema_altering_statement::announce_migration()
It is no longer used.
2021-12-11 12:31:07 +02:00
Gleb Natapov
837a153b34 cql3: drop has_prepare_schema_mutations() from schema altering statement
It is no longer used.
2021-12-11 12:31:07 +02:00
Gleb Natapov
f5e10b23dd cql3: drop announce_migration() usage from schema_altering_statement
Now that all schema altering statements support
prepare_schema_mutations() we can drop announce_migration() usage.
2021-12-11 12:31:07 +02:00
Gleb Natapov
e632ded782 cql3: move DROP AGGREGATE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
07103d915e migration_manager: add prepare_aggregate_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
e904236cd4 cql3: move DROP FUNCTION statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
25ae8a6376 migration_manager: add prepare_function_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
9d1d54bc93 cql3: move CREATE AGGREGATE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
7430750674 migration_manager: add prepare_new_aggregate_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
156f234996 cql3: move CREATE FUNCTION statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
10c14cd044 migration_manager: add prepare_new_function_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
9ec0db660c cql: get rid of mutable members in DROP/CREATE FUNCTION
Instead of using a mutable member as a way to pass data between
functions just return the data directly to a caller.
2021-12-11 12:31:07 +02:00
Gleb Natapov
661651a836 cql3: move statement validation to execute time for function related statements
To be able to confine raft to the execution time of a statement we need to
move all schema access to the execution time as well. Since the
validation code accesses the schema, let's run it during execution.
2021-12-11 12:31:07 +02:00
Gleb Natapov
1d448f59a0 cql3: move DROP INDEX statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
05801b99c6 cql3: factor out mutation creation code into a separate function for DROP INDEX
The function will be used in the next patch.
2021-12-11 12:31:07 +02:00
Gleb Natapov
09128719dc cql3: move DROP VIEW statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
25294e4460 migration_manager: add prepare_view_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
4528273e54 cql3: move DROP TYPE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
87b52c30e7 migration_manager: add prepare_type_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
36745b6b73 cql3: move statement validation to execute time for DROP TYPE
To be able to confine raft to the execution time of a statement we need to
move all schema access to the execution time as well. Since the
validation code accesses the schema, let's run it during execution.
2021-12-11 12:31:07 +02:00
Gleb Natapov
d438a3285e cql3: move DROP TABLE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
471d48d277 migration_manager: add prepare_column_family_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
68cf743554 cql3: move DROP KEYSPACE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
f1cc1fb96e migration_manager: add prepare_keyspace_drop_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
4574981f9e cql3: move CREATE INDEX statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
2d1b318d36 cql3: CREATE INDEX do not re-create targets array twice
The validation and execution code recreate the same array twice. Avoid
it by returning the array from verification function.
2021-12-11 12:31:07 +02:00
Gleb Natapov
6029ba6f5b cql3: factor out mutation creation code into a separate function for CREATE INDEX
The function will be used in the next patch.
2021-12-11 12:31:07 +02:00
Gleb Natapov
a3e1cb932a cql3: move statement validation to execute time for CREATE INDEX
To be able to confine raft to the execution time of a statement we need to
move all schema access to the execution time as well. Since the
validation code accesses the schema, let's run it during execution.
2021-12-11 12:31:07 +02:00
Gleb Natapov
5f30b5802c cql3: move ALTER KEYSPACE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
d79e426fb6 migration_manager: add prepare_keyspace_update_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
456d2e28c3 cql3: move CREATE KEYSPACE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
011f38a2f1 migration_manager: add prepare_new_keyspace_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
edff0cf4db cql3: move ALTER TYPE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
db1c9cec20 cql3: move ALTER VIEW statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
94bc34bb20 cql3: factor out mutation creation code into a separate function for ALTER VIEW
The function will be used in the next patch.
2021-12-11 12:31:07 +02:00
Gleb Natapov
a4afc69b87 migration_manager: add prepare_view_update_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
82acc9aa05 cql3: move CREATE VIEW statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
c294d7b1ca cql3: factor out mutation creation code into a separate function for CREATE VIEW statement
The function will be used in the next patch.
2021-12-11 12:31:07 +02:00
Gleb Natapov
3f47210374 migration_manager: add prepare_new_view_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
af6b3d985d cql3: move ALTER TABLE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
688efff6b5 cql3: factor out mutation creation code into a separate function for ALTER TABLE
The function will be used in the next patch.
2021-12-11 12:31:07 +02:00
Gleb Natapov
7cc629980b migration_manager: add prepare_column_family_update_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
5af2c342a3 migration_manager: add prepare_update_type_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
67661b6e66 cql3: move CREATE TYPE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
5649daf76a migration_manager: add prepare_new_type_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
5e9af3c414 cql3: move CREATE TABLE statement to prepare_schema_mutations() api 2021-12-11 12:31:07 +02:00
Gleb Natapov
20dbd717ff migration_manager: add prepare_new_column_family_announcement() function
The function only generates mutations for the announcement, but does not
send them out. Will be used by the later patches.
2021-12-11 12:31:07 +02:00
Gleb Natapov
b2af64ec5e cql3: introduce schema_altering_statement::prepare_schema_mutations() as announce_migration() alternative
Instead of announcing schema mutations the new function will return
them. The caller is responsible for announcing them. To ease the transition,
make the API optional. Statements that do not have it will use the old
announce_migration() method.
2021-12-11 12:31:07 +02:00
Gleb Natapov
2f95a29209 migration_manager: add include_keyspace() function
Currently a keyspace mutation is included into schema mutation list just
before announcement. Move the inclusion to a separate function. It will
be used later when instead of announcing new schema the mutation array
will be returned.
2021-12-11 12:31:07 +02:00
Pavel Solodovnikov
47533bca65 gms: gossiper: coroutinize maybe_enable_features
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:39:48 +03:00
Pavel Solodovnikov
3993c6a9fb gms: gossiper: coroutinize wait_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:30:32 +03:00
Pavel Solodovnikov
a6ff04dd24 gms: gossiper: coroutinize add_saved_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:23:35 +03:00
Pavel Solodovnikov
23dd8b66c5 gms: gossiper: coroutinize evict_from_membership
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:15:03 +03:00
Raphael S. Carvalho
7c90088152 tests: sstable_compaction_test: add test_twcs_compaction_across_buckets
Verify that TWCS compaction can now compact data across time windows,
like a tombstone which will cause all shadowed data to be purged
once they're all compacted together.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 17:14:45 -03:00
Raphael S. Carvalho
9b8aa1e9ae compaction: Move mutation compaction into producer for TWCS
If interposer is enabled, like the timestamp-based one for TWCS, data
from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer
will belong to a different bucket.
To remove this limitation, let's move the mutation compactor from
consumer into producer, such that compacted data will be fed into
the interposer, before it segregates data.
We're short-circuiting this logic if TWCS isn't in use as
compacting reader adds overhead to compaction, given that this reader
will pop fragments from combined sstable reader, compact them using
mutation_compactor and finally push them out to the underlying
reader.

without compacting reader (e.g. STCS + no interposer):
228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops)
224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops)
224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops)

with compacting reader (e.g. TWCS + interposer):
221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops)
216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops)
215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops)

So the cost of compacting data across buckets is ~3.5%, which happens
only with interposer enabled and GC writer disabled.
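
The ~3.5% figure can be reproduced from the throughput numbers above. The sketch below assumes the runs are paired in the order listed, which the message does not state; the variable and function names are illustrative.

```python
# Throughput in partitions/sec, copied from the measurements above.
without_reader = [228255.92, 224636.13, 224582.38]
with_reader = [221376.19, 216611.65, 215975.51]


def overhead_percent(base, slower):
    """Relative throughput loss of `slower` compared to `base`, in percent."""
    return (base - slower) / base * 100.0


# Pair the runs in listed order (an assumption) and average the losses.
per_run = [overhead_percent(b, s) for b, s in zip(without_reader, with_reader)]
mean_overhead = sum(per_run) / len(per_run)
```

Under that pairing the per-run losses are roughly 3.0%, 3.6% and 3.8%, which averages close to the quoted 3.5%.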

Fixes #9662.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 17:14:44 -03:00
Pavel Emelyanov
b0a8c153f7 select_statement: Remove unused proxy args and captures
The generate_view_paging_state_from_base_query_results() has unused
proxy argument that's carried over quite a long stack for nothing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211210175203.26197-1-xemul@scylladb.com>
2021-12-10 20:39:55 +02:00
Raphael S. Carvalho
484269cd8f compaction: make enable_garbage_collected_sstable_writer() more precise
we only want to enable GC writer if incremental compaction is required.
let's make it more precise by checking that size limit for sstable
isn't disabled, so GC writer will only be enabled for compaction
strategies that really need it. So strategies that don't need it
won't pay the penalty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 15:22:08 -03:00
Avi Kivity
3f862f9ece cql3: move selectable_column to selectable.cc
Move selectable_column to selectable.cc (and to the cql3::selection
namespace). This cleans up column_identifier.hh so it is now a pure
vocabulary header.
2021-12-10 19:51:57 +02:00
Avi Kivity
3305d1d514 cql3: column_identifier: split selectable functionality off from column_identifier
column_identifier serves two purposes: one is as a general value type
that names a value, for example in column_specification. The other
is as a `selectable` derived class specializing in selecting columns
from a base table. Obviously, to select a column from a table you
need to know its name, but to name some value (which might not
be a table column!) you don't need the ability to select it from
a table.

The mix-up stands in the way of unifying the select-clause
(`selectable`) and where-clause (previously known as `term`)
expression prepare paths. This is because the already existing
where-clause result, `expr::column_value`, is implemented as
`column_definition*`, while the select clause equivalent,
`column_identifier`, can't contain a column_definition because
not all uses of column_identifier name a schema column.

To fix this, split column_identifier into two: column_identifier
retains the original use case of naming a value, while a new class
`selectable_column` has the additional ability of selecting a
column in a select clause. It still doesn't use column_definition,
that will be adjusted later.
2021-12-10 19:51:55 +02:00
Botond Dénes
04306d762f tools/scylla-sstables: remove unused variables and captures
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211210142949.527545-1-bdenes@scylladb.com>
2021-12-10 18:24:08 +03:00
Juliusz Stasiewicz
351f142791 cdc/check_and_repair_cdc_streams: ignore LEFT endpoints
When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla
would throw. This behavior is fixed so that LEFT nodes are simply ignored.

Fixes #9771

Closes #9778
2021-12-10 15:28:14 +01:00
Raphael S. Carvalho
e0758fded1 compaction_manager: make get_compaction_state() private
internal method that should never be directly used by the outside
world.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211210120806.19233-1-raphaelsc@scylladb.com>
2021-12-10 17:19:24 +03:00
Botond Dénes
39426b1aa3 flat_mutation_reader_v2: add make_flat_mutation_reader_from_fragments()
The main difference compared to v1 (apart from having _v2 suffix at
relevant places) is how slicing and reversing works. The v2 variant has
native reverse support built-in because the reversing reader is not
something we want to convert to v2.

A native v2 mutation-source test is also added.
2021-12-10 15:48:49 +02:00
Botond Dénes
20e45987b5 test/lib/mutation_source_test: don't force v1 reader in reverse run
Currently in the reverse run we wrap the test-provided mutation-source
and create a v1 reader with it, forcing a conversion if the
mutation-source has a v2 factory. Worse still, if the test is v2 native,
there will be a double conversion. This patch fixes this by creating a
wrapper mutation-source appropriate to the version of the underlying
factory of the wrapped mutation-source.
2021-12-10 15:48:49 +02:00
Botond Dénes
d8870d2fe1 mutation_source: add native_version() getter
So tests can determine the native version of the factory function and
create the native reader if needed, to avoid unnecessary conversions.
2021-12-10 15:48:49 +02:00
Botond Dénes
76ee3f029c flat_mutation_reader_v2: add make_forwardable()
Not a completely straightforward conversion as the v2 version has to
make sure to emit the current range tombstone change after
fast_forward_to() (if it changes compared to the current one before fast
forwarding).
Changes are around the two new members `_tombstone_to_emit` and
`maybe_emit_tombstone()`.
2021-12-10 15:48:49 +02:00
Botond Dénes
a7866f783f position_in_partition: add after_key(position_in_partition_view) 2021-12-10 15:48:49 +02:00
Botond Dénes
7306f53be1 flat_mutation_reader: make_forwardable(): fix indentation 2021-12-10 15:19:18 +02:00
Botond Dénes
2468a0602b flat_mutation_reader: make_forwardable(): coroutinize reader
For improved readability and to facilitate further patching.
2021-12-10 15:19:18 +02:00
Tomasz Grabiec
4d302dfa1a Merge "Fix exception safety of rows insertion" from Pavel Emelyanov
There are several places that (still) use the throwing b-tree .insert_before()
method and don't manage the inserted object lifetime. Some of those places
also leave the leaked rows_entry on the LRU, delaying the assertion failure
until the time those entries get evicted (#9728)

To prevent such surprises in the future, the set removes the non-safe
inserters from the B-tree code. Actually most of this set is that removal
plus preparations for reviewability.

* xemul/br-rows-insertion-exception-safety-2:
  btree: Earnestly discourage from insertion of plain references
  row-cache: Handle exception (un)safety of rows_entry insertion
  partition_snapshot_row_cursor: Shuffle ensure_result creation
  mutation_partition: Use B-tree insertion sugar
  tests: Make B-tree tests use unique-ptrs for insertion
2021-12-10 13:55:18 +01:00
Pavel Emelyanov
6b4b170025 btree: Earnestly discourage from insertion of plain references
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Pavel Emelyanov
ee103636ac row-cache: Handle exception (un)safety of rows_entry insertion
The B-tree's insert_before() is a throwing operation, its caller
must account for that. When the rows_entry's collection was
switched to B-tree all the risky places were fixed by ee9e1045,
but a few places went under the radar.

In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.

In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
is quite nasty, because those entries can be evicted; the eviction code
tries to get the rows_entry iterator from "this", but the hook happens
to be unattached (because insertion threw) and fails the assert.

fixes: #9728

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Pavel Emelyanov
9fd8db318d partition_snapshot_row_cursor: Shuffle ensure_result creation
Both places get the C-pointer on the freshly allocated rows_entry,
insert it where needed and return back the dereferenced pointer.

The C-pointer is going to become smart-pointer that would go out
of scope before return. This change prepares for that by constructing
the ensure_result from the iterator, that's returned from insertion
of the entry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Pavel Emelyanov
e03f7191d9 mutation_partition: Use B-tree insertion sugar
The B-tree insertion methods accept smart pointers and
automatically release the ownership after exception-risky
part is passed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Pavel Emelyanov
5a405a4273 tests: Make B-tree tests use unique-ptrs for insertion
The non-smart-pointers overloads are going away, prepare
tests for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Nadav Har'El
03d67440ef alternator: test additional metrics and fix another broken counter
In issue #9406 we noticed that a counter for BatchGetItem operations
was missing. When we fixed it, we added a test which checked this
counter - but only this counter. It was left as a TODO to test the rest
of the Alternator metrics, and this is what this patch does.

Here we add a comprehensive test for *all* of the operations supported
by Scylla and how they increase the appropriate operation counter.

With this test we discovered a new bug: the DescribeTimeToLive operation
incremented the UpdateTimeToLiveCounter :-( So in this patch we also
include a fix for that bug, and the new test verifies that it is fixed.

In addition to the operation counters, Alternator also has additional
metrics and we also added tests for some of them - but not all. The
remaining untested metrics are listed in a TODO comment.
Message-Id: <20211206154727.1170112-1-nyh@scylladb.com>
2021-12-10 08:08:54 +02:00
Benny Halevy
cca956bce2 database_test: snapshot_with_quarantine_works: get the list of sstables from table object
Rather than the filesystem, to reduce flakiness.

Also, add some test logging.

Fixes #9763

Test: database_test(debug, release)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211209175144.854896-1-bhalevy@scylladb.com>
2021-12-09 21:01:25 +02:00
Nadav Har'El
006fa588a3 alternator ttl: correct misleading typo in error message
Alternator's support for the DynamoDB API TTL features is experimental,
so if a user attempts to use one of the TTL API requests, an error message
is returned that the experimental feature must be turned on first.

The message incorrectly said that the name of the experimental flag
to turn on is "alternator_ttl", with an underscore. But that's a typo -
it should be "alternator-ttl" with a hyphen.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211209183428.1336526-1-nyh@scylladb.com>
2021-12-09 20:47:05 +02:00
Benny Halevy
8728fd480d database_test: do_with_some_data: get the return func future
do_with_some_data runs a function in a seastar thread.

It needs to get() the future func returns rather
than propagating it.

This solves a secondary failure due to abandoned future
when the test case fails, as seen in
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4254/artifact/testlog/x86_64_debug/database_test.snapshot_with_quarantine_works.381.log
```
test/boost/database_test.cc(903): fatal error: in "snapshot_with_quarantine_works": critical check expected.empty() has failed
WARN  2021-12-08 00:35:16,300 [shard 0] seastar - Exceptional future ignored: boost::execution_aborted, backtrace: 0x10935e50 0x16ff2d8d 0x16ff2a4d 0x16ff5033 0x16ff5ec2 0x162d4ce9 0x10a2bdb5 0x10a2bd24 0x10a54ca4 0x10a27cf3 0x10a22151 0x10a67c9d 0x10a67a78 0x163ac37e 0x163b29e9 0x163b7690 0x163b51c1 0x17c212df 0x17c1f097 0x17bf8b4c 0x17bf83f2 0x17bf82a2 0x17bf7d52 0x10f8bf5a 0x166db84b /lib64/libpthread.so.0+0x9298 /lib64/libc.so.6+0x100352
...
*** 1 abandoned failed future(s) detected
Failing the test because fail was requested by --fail-on-abandoned-failed-futures
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211209174512.851945-1-bhalevy@scylladb.com>
2021-12-09 21:11:56 +03:00
Nadav Har'El
c6f2afb93d Merge 'cql3: Allow to skip EQ restricted columns in ORDER BY' from Jan Ciołek
In queries like:
```cql
SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c1 ASC, c2 ASC)
```
we can skip the requirement to specify ordering for `c1` column.

The `c1` column is restricted by an `EQ` restriction, so it can have
at most one value anyway, there is no need to sort.

This commit makes it possible to write just:
```cql
SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c2 ASC)
```
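
The rule described above can be sketched in a few lines of Python. This is an illustration only, not the logic in select_statement.cc: the function name `order_by_is_valid` and the list/set representation of columns are assumptions made for the example, and ordering directions (ASC/DESC) are left out.

```python
def order_by_is_valid(clustering_columns, eq_restricted, order_by_columns):
    """Return True if order_by_columns follows the clustering key order,
    allowing clustering columns with an EQ restriction to be skipped."""
    remaining = iter(clustering_columns)
    for wanted in order_by_columns:
        for candidate in remaining:
            if candidate == wanted:
                break  # matched, continue with the next ORDER BY column
            if candidate not in eq_restricted:
                return False  # skipped a column that has no EQ restriction
        else:
            return False  # ran out of clustering columns without a match
    return True
```

With clustering columns `["c1", "c2"]` and an EQ restriction on `c1`, both `["c1", "c2"]` and `["c2"]` are accepted; without the EQ restriction, `["c2"]` alone is rejected.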

I reorganized the ordering code, I feel that it's now clearer and easier to understand.
It's possible to only introduce a small change to the existing code, but I feel like it becomes a bit too messy.
I tried it out on the [`orderby_disorder_small`](https://github.com/cvybhu/scylla/commits/orderby_disorder_small) branch.

The diff is a bit messy because I moved all ordering functions to one place,
it's better to read [select_statement.cc](https://github.com/cvybhu/scylla/blob/orderby_disorder/cql3/statements/select_statement.cc#L1495-L1658) lines 1495-1658 directly.

In the new code it would also be trivial to allow specifying columns in any order, we would just have to sort them.
For now I commented out the code needed to do that, because the point of this PR was to fix #2247.
Allowing this would require some more work changing the existing tests.

Fixes: #2247

Closes #9518

* github.com:scylladb/scylla:
  cql-pytest: Enable test for skipping eq restricted columns in order by
  cql3: Allow to skip EQ restricted columns in ORDER BY
  cql3: Add has_eq_restriction_on_column function
  cql3: Reorganize orderings code
2021-12-09 21:11:56 +03:00
Nadav Har'El
36c3b92b19 alternator, schema_loader: get rid of deprecation warnings
Seastar moved the read_entire_stream(), read_entire_stream_contiguous()
and skip_entire_stream() from the "httpd" namespace to the "util"
namespace. Using them with their old names causes deprecation warnings
when compiling alternator/server.cc.

This patch fixes the namespace (and adds the new include) to get rid of
the deprecation warnings.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211209132759.1319420-1-nyh@scylladb.com>
2021-12-09 21:11:56 +03:00
Avi Kivity
242e19195f Merge "table: Prevent resurrecting data from memtable on compaction" from Mikołaj
"
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp. If there
are no memtables, max timestamp is used.
"

* 'check-memtable-at-compact-tombstone-discard/v2' of github.com:mikolajsieluzycki/scylla:
  table: Prevent resurrecting data from memtable on compaction
  table: Add min_memtable_timestamp function to table
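
A minimal Python sketch of the purge rule from the cover letter above. The names `MAX_TIMESTAMP` and `may_purge_tombstone`, and the use of a plain list of per-memtable minimum timestamps, are assumptions for illustration. The other conditions a real compaction checks (such as tombstone expiry) are omitted, and whether the comparison is strict is also an assumption.

```python
MAX_TIMESTAMP = 2**63 - 1  # stand-in for "max timestamp" (assumed value)

def min_memtable_timestamp(memtable_min_timestamps):
    """Smallest timestamp held in any memtable, or MAX_TIMESTAMP if none."""
    return min(memtable_min_timestamps, default=MAX_TIMESTAMP)

def may_purge_tombstone(tombstone_timestamp, memtable_min_timestamps):
    """A tombstone may be purged only if it is older than all memtable data."""
    return tombstone_timestamp < min_memtable_timestamp(memtable_min_timestamps)
```

For example, a tombstone with timestamp 180 is kept while a memtable holds data with timestamp 150, because purging it could resurrect that data.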
2021-12-09 21:11:56 +03:00
Piotr Sarna
2ec36a6c53 alternator,ttl: limit parallelism to 1 page
Right now we do not really have any parallelism in the alternator
TTL service, but in order to be future-proof, a semaphore
is instantiated to ensure that we only handle 1 page of a scan
at a time, regardless of how many tables are served.
This commit also removes the FIXME regarding the service permit
- using an empty permit is a conscious decision, because the
parallelism is limited by other means (see above).
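
The idea of a single shared semaphore limiting the whole service to one page at a time can be illustrated with Python's asyncio. This is only an analogy for the Seastar semaphore used in the patch; `scan_table`, `run_scans` and the log format are hypothetical names for the example.

```python
import asyncio

async def scan_table(name, pages, page_limiter, log):
    for page in range(pages):
        async with page_limiter:  # at most one page in flight overall
            log.append(("start", name, page))
            await asyncio.sleep(0)  # stand-in for processing the page
            log.append(("end", name, page))

async def run_scans():
    page_limiter = asyncio.Semaphore(1)  # shared by all tables
    log = []
    await asyncio.gather(
        scan_table("t1", 2, page_limiter, log),
        scan_table("t2", 2, page_limiter, log),
    )
    return log
```

Because the semaphore is shared, every "start" entry in the log is immediately followed by its own "end" entry, no matter how many tables are scanned concurrently.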

Tests: unit(release)
Message-Id: <b5f0c94f1afbead1f940a210911cc05f70900dcd.1638990637.git.sarna@scylladb.com>
2021-12-09 21:11:55 +03:00
Asias He
9859c76de1 storage_service: Wait for seastar::get_units in node_ops
seastar::get_units returns a future, so we have to wait for it.

Fixes #9767

Closes #9768
2021-12-09 21:11:55 +03:00
Jan Ciolek
13d367dada cql-pytest: Enable test for skipping eq restricted columns in order by
This test was marked as xfail, but now the functionality it tests has been implemented.

In my opinion the expected error message makes no sense, the message was:
"Order by currently only supports the ordering of columns following their declared order in the PRIMARY KEY"
in cases where a restriction was missing on one column.

This has been changed to:
"Unsupported order by relation - column {} doesn't have an ordering or EQ relation."

Because of that I had to modify the test to accept messages from both Scylla and Cassandra.
The expected error message pattern is now "rder by", because that's the largest common part.
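
A quick Python check that "rder by" is indeed a substring of both messages quoted above. pytest's `match=` argument uses `re.search`, so the same call is used here. The column name `c2` substituted for the `{}` placeholder is an arbitrary example.

```python
import re

OLD_MESSAGE = ("Order by currently only supports the ordering of columns "
               "following their declared order in the PRIMARY KEY")
NEW_MESSAGE = ("Unsupported order by relation - column c2 doesn't have "
               "an ordering or EQ relation.")
PATTERN = "rder by"

def matches(message):
    """True if the message would satisfy pytest.raises(..., match=PATTERN)."""
    return re.search(PATTERN, message) is not None
```

The pattern drops the leading letter because one message starts with an uppercase "O" and the other uses a lowercase "o".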

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 14:59:47 +01:00
Benny Halevy
85f10138f0 api: storage_service: validate_keyspace: improve exception error message
Generate the error message using the
no_such_keyspace(ks_name) exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:40:21 +02:00
Benny Halevy
6805ce5bd9 api: compaction_manager: add stop_keyspace_compaction
Allow stopping compaction by type on a given keyspace
and list of tables.

Add respective rest_api test.

Fixes #9700

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:40:13 +02:00
Benny Halevy
522a32f19f api: storage_service: expose validate_keyspace and parse_tables
To be used by the compaction_manager api in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:25:53 +02:00
Mikołaj Sielużycki
504efe0607 table: Prevent resurrecting data from memtable on compaction
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
2021-12-09 13:22:14 +01:00
Benny Halevy
71c95faeee api: compaction_manager: stop_compaction: fix type description
List only the compaction types we support stopping.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:17:38 +02:00
Benny Halevy
fed7319698 compaction_manager: stop_compaction: expose optional table*
To be used by api layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:14:49 +02:00
Mikołaj Sielużycki
7ce0ca040d table: Add min_memtable_timestamp function to table
2021-12-09 13:14:38 +01:00
Benny Halevy
4535cb5cb3 test: api: add basic compaction_manager test
Test compaction_manager/stop_compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 13:59:06 +02:00
Jan Ciolek
a548c2dac4 cql3: Allow to skip EQ restricted columns in ORDER BY
In queries like:
SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c1 ASC, c2 ASC)

we can skip the requirement to specify ordering for c1 column.

The c1 column is restricted by an EQ restriction, so it can have
only one value anyway, there is no need to sort.

This commit makes it possible to write just:
SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c2 ASC)

Fixes: #2247

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 12:07:02 +01:00
Jan Ciolek
7bbfa48bc5 cql3: Add has_eq_restriction_on_column function
Adds a function that checks whether a given expression has an EQ restriction
on the specified column.

It finds restrictions like
col = ...
or
(col, col2) = ...

IN restrictions don't count, they aren't EQ restrictions
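
A rough Python model of this check, assuming each restriction is represented as a `(columns, operator)` pair. That representation is an assumption for the example and is much simpler than the expression tree in cql3.

```python
def has_eq_restriction_on_column(restrictions, column):
    """Return True if some '=' restriction mentions the column.

    Covers both single-column (col = ...) and multi-column
    ((col, col2) = ...) forms; IN restrictions are ignored.
    """
    for columns, operator in restrictions:
        if operator == "=" and column in columns:
            return True
    return False
```

For instance, with restrictions `[(("c1",), "="), (("c2", "c3"), "="), (("c4",), "IN")]`, the columns `c1`, `c2` and `c3` count as EQ restricted, while `c4` does not.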

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 12:06:43 +01:00
Jan Ciolek
f76a1cd4bf cql3: Reorganize orderings code
Reorganized the code that handles column ordering (ASC or DESC).
I feel that it's now clearer and easier to understand.

Added an enum that describes column ordering.
It has two possible values: ascending or descending.
It used to be a bool that was sometimes called 'reversed',
which could mean multiple things.

Instead of column.type->is_reversed() != <ordering bool>
there is now a function called are_column_select_results_reversed.

Split checking if ordering is reversed and verifying whether it's correct into two functions.
Before all of this was done by is_reversed()

This is a preparation to later allow skipping ORDER BY restrictions on some columns.
Adding this to the existing code caused it to get quite complex,
but this new version is better suited for the task.

The diff is a bit messy because I moved all ordering functions to one place,
it's better to read select_statement.cc lines 1495-1651 directly.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 12:06:42 +01:00
Nadav Har'El
f9673309aa docs: protocols.md - add information on Redis listening address
The description in protocols.md of the Redis protocol server in Scylla
explains how its port can be configured, but not how the listening IP
address can be configured. It turns out that the same "rpc_address" that
controls CQL's and Thrift's IP address also applies to Redis. So let's
document that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211208160206.1290916-1-nyh@scylladb.com>
2021-12-08 20:14:52 +01:00
Nadav Har'El
e032f92c5c Merge 'api/storage service: validate table names' from Benny Halevy
This series fixes a couple issues around generating and handling of no_such_keyspace and no_such_column_family exceptions.

First, it removes std::throw_with_nested around their throw sites in the respective database::find_* functions.
Fixes #9753

And then, it introduces a `validate_tables` helper in api/storage_service.cc that generates a `bad_param_exception` in order to set the correct http response status if a non-existing table name is provided in the `cf` http request parameter.
Fixes #9754

The series also adds a test for the REST API under test/rest_api that verifies the storage_service enable/disable auto_compaction api and checks the error codes for non-existing keyspace or table.

Test: unit(dev)

Closes #9755

* github.com:scylladb/scylla:
  api: storage_service: add parse_tables
  database: un-nest no_such_keyspace and no_such_column_family exceptions
  database: throw internal error when failing uuid returned by find_uuid
  database: find_uuid: throw no_such_column_family exception if ks/cf were not found
  test: rest_api: add storage_service test
  test: add basic rest api test
  test: cql-pytest: wait for rest api when starting scylla
2021-12-08 16:54:48 +02:00
Benny Halevy
ff63ad9f6e api: storage_service: add parse_tables
Splits and validates the cf parameter, containing an optional
comma-separated list of table names.

If any table is not found and a no_such_column_family
exception is thrown, wrap it in a `bad_param_exception`
so it will translate to `reply::status_type::bad_request`
rather than `reply::status_type::internal_server_error`.

With that, hide the split_cf function from api/api.hh
since it was used only from api/storage_service
and new use sites should use validate_tables instead.
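
The split-validate-translate flow can be sketched in Python as follows. The exception classes and `find_table` are stand-ins for the C++ ones named above, and returning an empty list for an empty parameter is an assumption of this sketch, not a statement about the real API.

```python
class NoSuchColumnFamily(Exception):
    """Stand-in for the database-level 'table not found' error."""

class BadParamException(Exception):
    """Stand-in for the API-level error that maps to HTTP 400."""

def find_table(known_tables, keyspace, name):
    if name not in known_tables:
        raise NoSuchColumnFamily(f"{keyspace}.{name}")
    return name

def parse_tables(known_tables, keyspace, cf_param):
    """Split a comma-separated table list and validate every name."""
    names = [n for n in cf_param.split(",") if n] if cf_param else []
    try:
        return [find_table(known_tables, keyspace, n) for n in names]
    except NoSuchColumnFamily as e:
        # Report a client error instead of an internal server error.
        raise BadParamException(str(e)) from e
```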

Fixes #9754

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:42:40 +02:00
Benny Halevy
a3bd7806e7 database: un-nest no_such_keyspace and no_such_column_family exceptions
These were thrown in the respective database::find_*
function as nested exception since
d3fe0c5182.

Wrapping them in nested exceptions just makes them
harder to figure out and work with and apparently serves
no purpose.

Without these nested_exception we can correctly detect
internal errors when synchronously failing to find
a uuid returned by find_uuid(ks, cf).

Fixes #9753

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:35:38 +02:00
Benny Halevy
ac49e5fff1 database: throw internal error when failing uuid returned by find_uuid
find_uuid returns a uuid found for ks_name.table_name.
In some cases, we immediately and synchronously use that
uuid to lookup other information like the table&
or the schema.  Failing to find that uuid indicates
an internal error when no preemption is possible.

Note that yielding could allow deletion of the table
to sneak in and invalidate the uuid.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:35:38 +02:00
Benny Halevy
8cbecb1c21 database: find_uuid: throw no_such_column_family exception if ks/cf were not found
Rather than masquerading all errors as std::out_of_range("")
convert only the std::out_of_range error from _ks_cf_to_uuid.at()
to no_such_column_family(ks, cf).  That relieves all callers of
find_uuid from doing that conversion themselves.

For example, get_uuid in api/column_family now only deals with converting
no_such_column_family to bad_param_exception, as it needs to do
at the api level, rather than generating a similar error from scratch.

Other call sites required no intervention.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:35:38 +02:00
Benny Halevy
5eb32aa57c test: rest_api: add storage_service test
FIXME: negative tests for not-found tables
should result in a requests.codes.bad_request
but currently result in requests.codes.internal_server_error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:35:36 +02:00
Tomasz Grabiec
fe2fa3f20d range_tombstone_list: Convert to work in terms of position_in_partition
This makes it comprehensible, and a bit simpler.
2021-12-08 15:16:18 +01:00
Piotr Sarna
d486b496a6 alternator,ttl: start scans from a random token range
This patch addresses yet another FIXME from alternator/ttl.cc.
Namely, scans are now started from a random, owned token range
instead of always starting with the first range.
This mechanism is expected to reduce the probability of some
ranges being starved when the scanning process is often restarted,
e.g. due to nodes failing.
Should the mechanism prove insufficient for some users, a more complete
solution is to regularly persist the state of the scanning process
in a table (distributed if we want to allow other nodes to pick up
from where a dead node left off), but that induces overhead.
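
One way to picture the random starting point is a rotation of the list of owned ranges, sketched below in Python. Wrapping around to cover the ranges before the starting point is an assumption of this sketch, as is the name `scan_order`; the actual code in alternator/ttl.cc may iterate differently.

```python
import random

def scan_order(owned_ranges, rng=random):
    """Return the owned ranges starting from a randomly chosen one."""
    if not owned_ranges:
        return []
    start = rng.randrange(len(owned_ranges))
    # Rotate so the scan begins at `start` and still visits every range.
    return owned_ranges[start:] + owned_ranges[:start]
```

Over many restarts each range has the same chance of being scanned first, which is what reduces the risk of starving the later ranges.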

Tests: unit(release) (including a long loop over the ttl pytest)

Message-Id: <7fc3f6525ceb69725c41de10d0fb6b16188349e3.1638387924.git.sarna@scylladb.com>
Message-Id: <db198e743ca9ed1e5cc659e73da342fbce2c882a.1638473143.git.sarna@scylladb.com>
2021-12-08 16:15:53 +02:00
Benny Halevy
26257cfa6d test: add basic rest api test
Test system/uptime_ms to start with.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:05:33 +02:00
Benny Halevy
01f2e8b391 test: cql-pytest: wait for rest api when starting scylla
Some of the tests, like nodetool.py, use the scylla REST API.
Add a check_rest_api function that queries http://<node_addr>:10000/
that is served once scylla starts listening on the API port
and call it via run.wait_for_services.
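
A hedged sketch of what such a readiness probe can look like, using only the Python standard library. The names `rest_api_is_up` and `wait_for` are illustrative and are not the helpers in test/cql-pytest/run.py.

```python
import time
import urllib.error
import urllib.request

def rest_api_is_up(node_addr, port=10000, timeout=1.0):
    """Return True once something answers HTTP on the API port."""
    try:
        urllib.request.urlopen(f"http://{node_addr}:{port}/", timeout=timeout)
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except OSError:
        return False  # connection refused, timeout, and similar failures
    return True

def wait_for(check, timeout=10.0, interval=0.1):
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could combine the two as `wait_for(lambda: rest_api_is_up("127.0.0.1"))`.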

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-08 16:05:32 +02:00
Piotr Sarna
26288c1a86 test,alternator: make TTL tests less prone to false negatives
On my local machine, a 3 second deadline proved to cause flakiness
of test_ttl_expiration case, because its execution time is just
around 3 seconds.
This patch addresses the problem by bumping the local timeout to 10
(and 15 for test_ttl_expiration_long, since it's dangerously near
the 10 second deadline on my machine as well).

Moreover, some test cases short-circuited once they detected that
all needed items expired, but other ones lacked it and always used
their full time slot. Since 10 seconds is a little too long for
a single test case, even one marked with --veryslow, this patch
also adds a couple of other short-circuits.
One exception is test_ttl_expiration_hash_wrong_type, which actually
depends on the fact that we should wait for the whole loop to finish.
Since this case was never flaky for me with the 3 second timeout,
it's left as is.
Theoretically, test_ttl_expiration also kind of depends on checking
the condition more than once (because the TTL of one of the values
is bumped on each iteration), but empirical evidence shows that
multiple iterations always occur in this test case anyway - for
me, it always spun at least 3 times.

Tests: unit(release)

Message-Id: <a0a479929dac37daace744e0a970567a8aa3b518.1638431933.git.sarna@scylladb.com>
2021-12-08 16:02:45 +02:00
Raphael S. Carvalho
c3c23dd1e5 multishard_mutation_query: make multi_range_reader::fill_buffer() work even after EOS
If fill_buffer() is called after EOS, the underlying reader will
be fast forwarded to a range pointed to by an invalid iterator,
producing incorrect results.

fill_buffer() is changed to return early if EOS was found,
meaning that underlying reader already fast forwarded to
all ranges managed by multi_range_reader.

Usually, consume facilities check for EOS before calling
fill_buffer(), but most reader implementations also check for EOS
themselves to avoid correctness issues. Let's do the same here.
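
The early-return guard can be illustrated with a toy Python reader. This is a simplified model, not the C++ multi_range_reader: the class below just concatenates lists, and its attribute names are assumptions for the example.

```python
class MultiRangeReader:
    """Toy reader that walks a list of ranges, one per fill_buffer() call."""

    def __init__(self, ranges):
        self._ranges = list(ranges)
        self._next = 0
        self._eos = False
        self.buffer = []

    def is_end_of_stream(self):
        return self._eos

    def fill_buffer(self):
        if self._eos:
            return  # already past the last range: do nothing
        if self._next == len(self._ranges):
            self._eos = True
            return
        self.buffer.extend(self._ranges[self._next])
        self._next += 1
```

Calling `fill_buffer()` any number of times after the end of stream leaves the buffer unchanged instead of touching a range that does not exist.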

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211208131423.31612-1-raphaelsc@scylladb.com>
2021-12-08 15:39:11 +02:00
Avi Kivity
f28552016f Update seastar submodule
* seastar f8a038a0a2...8d15e8e67a (21):
  > core/program_options: preserve defaultness of CLI arguments
  > log: Silence logger when logging
  > Include the core/loop.hh header inside when_all.hh header
  > http: Fix deprecated wrappers
  > foreign_ptr: Add concept
  > util: file: add read_entire_file
  > short_streams: move to util
  > Revert "Merge: file: util: add read_entire_file utilities"
  > foreign_ptr: declare destroy as a static method
  > Merge: file: util: add read_entire_file utilities
  > Merge "output_stream: handle close failure" from Benny
  > net: bring local_address() to seastar::connected_socket.
  > Merge "Allow programatically configuring seastar" from Botond
  > Merge 'core: clean up memory metric definitions' from John Spray
  > Add PopOS to debian list in install-dependencies.sh
  > Merge "make shared_mutex functions exception safe and noexcept" from Benny
  > on_internal_error: set_abort_on_internal_error: return current state
  > Implementation of iterator-range version of when_any
  > net: mark functions returning ethernet_address noexcept
  > net: ethernet_address: mark functions noexcept
  > shared_mutex: mark wake and unlock methods noexcept

Contains patch from Botond Dénes <bdenes@scylladb.com>:

db/config: configure logging based on app_template::seastar_options

Scylla has its own config file which supports configuring aspects of
logging, in addition to the built-in CLI logging options. When applying
this configuration, the CLI provided option values have priority over
the ones coming from the option file. To implement this scylla currently
reads CLI options belonging to seastar from the boost program options
variable map. The internal representation of CLI options however does not
constitute an API of seastar and is thus subject to change (even if
unlikely). This patch moves away from this practice and uses the new
shiny C++ api: `app_template::seastar_options` to obtain the current
logging options.
2021-12-08 14:21:11 +02:00
Tomasz Grabiec
5eaca85e4b Merge "wire up schema raft state machine" from Gleb
This series wires up the schema state machine to process raft commands
and transfer snapshots. The series assumes that raft group zero is used
for schema transfer only and that a single raft command contains a single
schema change in the form of a canonical_mutation array. Both assumptions
may change in which case the code will be changed accordingly, but we
need to start somewhere.

* scylla-dev/gleb/schema-raft-sm-v2:
  schema raft sm: request schema sync on schema_state_machine snapshot transfer
  raft service: delegate snapshot transfer to a state machine implementation
  schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply()
2021-12-08 13:14:28 +01:00
Nadav Har'El
92e7fbe657 test/alternator: check correct error for unknown operation
Add a short test verifying that Alternator responds with the correct
error code (UnknownOperationException) when receiving an unknown or
unsupported operation.

The test passes on both AWS and Alternator, confirming that the behavior
is the same.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206125710.1153008-1-nyh@scylladb.com>
2021-12-08 13:56:38 +02:00
Gleb Natapov
f25424edcd storage_service: remove unused function.
is_auto_bootstrap() function is no longer used.

Message-Id: <YbCVXPI4hE8wgT4T@scylladb.com>
2021-12-08 13:55:32 +02:00
Botond Dénes
0aa4e5e726 test/cql-pytest: mv virtual_tables.py -> test_virtual_tables.py
For consistency with the other tests.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211208102108.126492-1-bdenes@scylladb.com>
2021-12-08 12:23:22 +02:00
Gavin Howell
c6e0a807b4 Update wasm.md
Grammar correction, sentence re-write.

Closes #9760
2021-12-08 10:24:53 +01:00
Tomasz Grabiec
2a36377bb3 Merge "test: raft: randomized_nemesis_test: introduce server stop/crash nemesis" from Kamil
We begin by preparing the `persistence` class so that the storage can be
reused across different Raft server instances: the test keeps a shared
pointer to the storage so that when a server stops, a new server with
the same ID can be reconstructed with this storage.

We then modify `environment` so that server instances can be removed and
replaced in the middle of operations.

Finally we prepare a nemesis operation which gracefully stops or
immediately crashes a randomly picked server and run this operation
periodically in `basic_generator_test`.

One important change that changes the API of `raft::server` is included:
the metrics are not automatically registered in `start()`. This is
because metric registration modifies global data structures, which
cannot be done twice with the same set of metrics (and we would do it
when we restart a server with the same ID). Instead,
`register_metrics()` is exposed in the `raft::server` interface to be
called when running servers in production.

* kbr/crashes-v3:
  raft: server: print the ID of aborted server
  test: raft: randomized_nemesis_test: run stop_crash nemesis in `basic_generator_test`
  test: raft: randomized_nemesis_test: introduce `stop_crash` operation
  test: raft: randomized_nemesis_test: environment: implement server `stop` and `crash`
  raft: server: don't register metrics in `start()`
  test: raft: randomized_nemesis_test: raft_server: return `stopped_error` when called during abort
  test: raft: randomized_nemesis_test: handle `raft::stopped_error`
  test: raft: randomized_nemesis_test: handle missing servers in `environment` call functions
  test: raft: randomized_nemesis_test: environment: split `new_server` into `new_node` and `start_server`
  test: raft: randomized_nemesis_test: remove `environment::get_server`
  test: raft: randomized_nemesis_test: construct `persistence_proxy` outside `raft_server<M>::create`
  test: raft: randomized_nemesis_test: persistence_proxy: store a shared pointer to `persistence`
  test: raft: randomized_nemesis_test: persistence: split into two classes
  test: raft: logical_timer: introduce `sleep_until`
2021-12-07 22:16:23 +01:00
Nadav Har'El
bb0f8c3cdf Merge 'build: disable superword-level parallism (slp) on clang' from Avi Kivity
Clang (and gcc) can combine loads and stores of independent variables
into wider operations, often using vector registers. This reduces
instruction count and execution unit occupancy. However, clang
is too aggressive and generates loads that break the store-to-load
forwarding rules: a load must be the same size or smaller than the
corresponding store, or it will execute with a large penalty.

Disabling slp results in larger but faster code. Comparing
before and after on Zen 3:

slp:

226766.49 tps ( 75.1 allocs/op,  12.1 tasks/op,   45073 insns/op)
226679.57 tps ( 75.1 allocs/op,  12.1 tasks/op,   45074 insns/op)
226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
225884.34 tps ( 75.1 allocs/op,  12.1 tasks/op,   45068 insns/op)
225998.16 tps ( 75.1 allocs/op,  12.1 tasks/op,   45056 insns/op)

median 226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
median absolute deviation: 284.45
maximum: 226766.49
minimum: 225884.34

no slp:

228195.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   45109 insns/op)
227773.76 tps ( 75.1 allocs/op,  12.1 tasks/op,   45123 insns/op)
228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
228157.43 tps ( 75.1 allocs/op,  12.1 tasks/op,   45129 insns/op)
228072.29 tps ( 75.1 allocs/op,  12.1 tasks/op,   45128 insns/op)

median 228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
median absolute deviation: 68.45
maximum: 228195.33
minimum: 227773.76

Disabling slp increases the instruction count by ~60 instructions per op
(0.13%) but increases throughput by 0.85%. This shows the impact of the
violation is quite high. It can also be observed by the effect on
stalled cycles:

slp:

         44,932.70 msec task-clock                #    0.993 CPUs utilized
            13,618      context-switches          #  303.075 /sec
                33      cpu-migrations            #    0.734 /sec
             1,695      page-faults               #   37.723 /sec
   211,997,160,633      cycles                    #    4.718 GHz                      (71.67%)
     1,118,855,786      stalled-cycles-frontend   #    0.53% frontend cycles idle     (71.67%)
     1,258,837,025      stalled-cycles-backend    #    0.59% backend cycles idle      (71.66%)
   454,445,559,376      instructions              #    2.14  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.66%)
    83,557,588,477      branches                  #    1.860 G/sec                    (71.67%)
       174,313,252      branch-misses             #    0.21% of all branches          (71.67%)

no-slp:

         44,579.83 msec task-clock                #    0.986 CPUs utilized
            13,435      context-switches          #  301.369 /sec
                33      cpu-migrations            #    0.740 /sec
             1,691      page-faults               #   37.932 /sec
   210,070,080,283      cycles                    #    4.712 GHz                      (71.68%)
     1,066,774,628      stalled-cycles-frontend   #    0.51% frontend cycles idle     (71.68%)
     1,082,255,966      stalled-cycles-backend    #    0.52% backend cycles idle      (71.66%)
   455,067,924,891      instructions              #    2.17  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.68%)
    83,597,450,748      branches                  #    1.875 G/sec                    (71.65%)
       151,897,866      branch-misses             #    0.18% of all branches          (71.68%)

Note the differences in "backend cycles idle" and "stalled cycles
per insn".
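
The summary statistics for the Zen 3 throughput runs can be recomputed from the five samples listed above with a short Python script. The helper name `median_absolute_deviation` is chosen for this example; the sample values and the median instruction counts are copied from the tables.

```python
from statistics import median

def median_absolute_deviation(samples):
    """Median of the absolute distances from the sample median."""
    center = median(samples)
    return median(abs(x - center) for x in samples)

slp_tps = [226766.49, 226679.57, 226168.79, 225884.34, 225998.16]
no_slp_tps = [228195.33, 227773.76, 228088.98, 228157.43, 228072.29]

# Relative change of the medians when slp is disabled.
throughput_gain = (median(no_slp_tps) - median(slp_tps)) / median(slp_tps)
insns_increase = (45117 - 45061) / 45061  # median insns/op, no slp vs slp

print(f"throughput: {throughput_gain:+.2%}, instructions: {insns_increase:+.2%}")
```

The medians and deviations match the figures reported above, and the throughput gain rounds to the quoted 0.85%.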

I also observed the same pattern on a much older generation Intel (although
the baseline instructions per clock there are around 0.56).

slp:

42232.64 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)
42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
42331.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   44857 insns/op)
42315.89 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)
42410.19 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)

median 42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
median absolute deviation: 12.46
maximum: 42410.19
minimum: 42232.64

no-slp:

42464.18 tps ( 75.1 allocs/op,  12.1 tasks/op,   44886 insns/op)
42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
42783.95 tps ( 75.1 allocs/op,  12.1 tasks/op,   44961 insns/op)
42671.23 tps ( 75.1 allocs/op,  12.1 tasks/op,   44947 insns/op)
42487.82 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)

median 42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
median absolute deviation: 144.06
maximum: 42783.95
minimum: 42464.18

slp:

         26,877.01 msec task-clock                #    0.989 CPUs utilized
            15,621      context-switches          #    0.581 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,322      page-faults               #    0.002 M/sec
    96,084,360,190      cycles                    #    3.575 GHz                      (72.55%)
    71,435,545,235      stalled-cycles-frontend   #   74.35% frontend cycles idle     (72.57%)
    59,531,573,539      stalled-cycles-backend    #   61.96% backend cycles idle      (70.96%)
    53,273,420,083      instructions              #    0.55  insn per cycle
                                                  #    1.34  stalled cycles per insn  (72.55%)
    10,240,844,987      branches                  #  381.026 M/sec                    (72.57%)
        94,348,150      branch-misses             #    0.92% of all branches          (72.57%)

no-slp:

         26,381.66 msec task-clock                #    0.971 CPUs utilized
            15,586      context-switches          #    0.591 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,318      page-faults               #    0.002 M/sec
    94,317,505,691      cycles                    #    3.575 GHz                      (72.59%)
    69,693,601,709      stalled-cycles-frontend   #   73.89% frontend cycles idle     (72.59%)
    57,579,078,046      stalled-cycles-backend    #   61.05% backend cycles idle      (58.08%)
    53,260,417,953      instructions              #    0.56  insn per cycle
                                                  #    1.31  stalled cycles per insn  (72.60%)
    10,235,123,948      branches                  #  387.964 M/sec                    (72.60%)
        96,002,988      branch-misses             #    0.94% of all branches          (72.62%)

Closes #9752

* github.com:scylladb/scylla:
  build: rearrange -O3 and -f<optimization-option> options
  build: disable superword-level parallism (slp) on clang
2021-12-07 18:01:26 +02:00
Avi Kivity
c519857beb build: rearrange -O3 and -f<optimization-option> options
It turns out that -O3 enables -fslp-vectorize even if it is
disabled before -O3 on the command line. Rearrange the code
so that -O3 is before the more specific optimization options.
2021-12-07 17:52:32 +02:00
Juliusz Stasiewicz
5a8741a1ca cdc: Throw when ALTERing cdc options without "enabled":"..."
The problem was that such a command:
```
alter table ks.cf with cdc={'ttl': 120};
```
would assume that "enabled" parameter is the default ("false") and,
in effect, disable CDC on that table. This commit forces the user
to specify that key.
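
The new behavior amounts to a validation step like the Python sketch below. The function name, the exception type and the message text are hypothetical; only the rule itself (reject the ALTER when the "enabled" key is absent) comes from this commit.

```python
def validate_cdc_alter_options(options):
    """Reject an ALTER of cdc options that does not say whether CDC is enabled."""
    if "enabled" not in options:
        # Previously a missing key was treated as enabled=false,
        # which silently disabled CDC on the table.
        raise ValueError('cdc options must specify "enabled"')
    return options
```

Under this rule, `{'ttl': 120}` is rejected, while `{'enabled': True, 'ttl': 120}` is accepted.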

Fixes #6475

Closes #9720
2021-12-07 17:37:44 +02:00
Avi Kivity
04ad07b072 build: disable superword-level parallism (slp) on clang
Clang (and gcc) can combine loads and stores of independent variables
into wider operations, often using vector registers. This reduces
instruction count and execution unit occupancy. However, clang
is too aggressive and generates loads that break the store-to-load
forwarding rules: a load must be the same size or smaller than the
corresponding store, or it will execute with a large penalty.

Disabling slp results in larger but faster code. Comparing
before and after on Zen 3:

slp:

226766.49 tps ( 75.1 allocs/op,  12.1 tasks/op,   45073 insns/op)
226679.57 tps ( 75.1 allocs/op,  12.1 tasks/op,   45074 insns/op)
226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
225884.34 tps ( 75.1 allocs/op,  12.1 tasks/op,   45068 insns/op)
225998.16 tps ( 75.1 allocs/op,  12.1 tasks/op,   45056 insns/op)

median 226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
median absolute deviation: 284.45
maximum: 226766.49
minimum: 225884.34

no slp:

228195.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   45109 insns/op)
227773.76 tps ( 75.1 allocs/op,  12.1 tasks/op,   45123 insns/op)
228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
228157.43 tps ( 75.1 allocs/op,  12.1 tasks/op,   45129 insns/op)
228072.29 tps ( 75.1 allocs/op,  12.1 tasks/op,   45128 insns/op)

median 228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
median absolute deviation: 68.45
maximum: 228195.33
minimum: 227773.76
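
The summary lines above (median and median absolute deviation) can be reproduced from the five samples of each run. A minimal sketch, with hypothetical helper names that are not taken from perf_simple_query:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Median of a sample. Precondition: v is non-empty.
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    const auto n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Median of the absolute deviations from the sample median.
double median_absolute_deviation(const std::vector<double>& v) {
    const double m = median(v);
    std::vector<double> dev;
    dev.reserve(v.size());
    for (double x : v) {
        dev.push_back(std::fabs(x - m));
    }
    return median(std::move(dev));
}

// Relative change from `before` to `after`, in percent.
double relative_gain_percent(double before, double after) {
    return (after - before) / before * 100.0;
}
```

Feeding in the five slp tps samples gives a median of 226168.79 and a deviation of 284.45, and comparing the two medians gives the throughput gain quoted in this message.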

Disabling slp increases the instruction count by ~60 instructions per op
(0.13%) but increases throughput by 0.85%. This shows the impact of the
violation is quite high. It can also be observed by the effect on
stalled cycles:

slp:

         44,932.70 msec task-clock                #    0.993 CPUs utilized
            13,618      context-switches          #  303.075 /sec
                33      cpu-migrations            #    0.734 /sec
             1,695      page-faults               #   37.723 /sec
   211,997,160,633      cycles                    #    4.718 GHz                      (71.67%)
     1,118,855,786      stalled-cycles-frontend   #    0.53% frontend cycles idle     (71.67%)
     1,258,837,025      stalled-cycles-backend    #    0.59% backend cycles idle      (71.66%)
   454,445,559,376      instructions              #    2.14  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.66%)
    83,557,588,477      branches                  #    1.860 G/sec                    (71.67%)
       174,313,252      branch-misses             #    0.21% of all branches          (71.67%)

no-slp:

         44,579.83 msec task-clock                #    0.986 CPUs utilized
            13,435      context-switches          #  301.369 /sec
                33      cpu-migrations            #    0.740 /sec
             1,691      page-faults               #   37.932 /sec
   210,070,080,283      cycles                    #    4.712 GHz                      (71.68%)
     1,066,774,628      stalled-cycles-frontend   #    0.51% frontend cycles idle     (71.68%)
     1,082,255,966      stalled-cycles-backend    #    0.52% backend cycles idle      (71.66%)
   455,067,924,891      instructions              #    2.17  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.68%)
    83,597,450,748      branches                  #    1.875 G/sec                    (71.65%)
       151,897,866      branch-misses             #    0.18% of all branches          (71.68%)

Note the differences in "backend cycles idle" and "stalled cycles
per insn".

I also observed the same pattern on a much older generation Intel (although
the baseline instructions per clock there are around 0.56).

slp:

42232.64 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)
42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
42331.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   44857 insns/op)
42315.89 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)
42410.19 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)

median 42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
median absolute deviation: 12.46
maximum: 42410.19
minimum: 42232.64

no-slp:

42464.18 tps ( 75.1 allocs/op,  12.1 tasks/op,   44886 insns/op)
42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
42783.95 tps ( 75.1 allocs/op,  12.1 tasks/op,   44961 insns/op)
42671.23 tps ( 75.1 allocs/op,  12.1 tasks/op,   44947 insns/op)
42487.82 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)

median 42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
median absolute deviation: 144.06
maximum: 42783.95
minimum: 42464.18

slp:

         26,877.01 msec task-clock                #    0.989 CPUs utilized
            15,621      context-switches          #    0.581 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,322      page-faults               #    0.002 M/sec
    96,084,360,190      cycles                    #    3.575 GHz                      (72.55%)
    71,435,545,235      stalled-cycles-frontend   #   74.35% frontend cycles idle     (72.57%)
    59,531,573,539      stalled-cycles-backend    #   61.96% backend cycles idle      (70.96%)
    53,273,420,083      instructions              #    0.55  insn per cycle
                                                  #    1.34  stalled cycles per insn  (72.55%)
    10,240,844,987      branches                  #  381.026 M/sec                    (72.57%)
        94,348,150      branch-misses             #    0.92% of all branches          (72.57%)

no-slp:

         26,381.66 msec task-clock                #    0.971 CPUs utilized
            15,586      context-switches          #    0.591 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,318      page-faults               #    0.002 M/sec
    94,317,505,691      cycles                    #    3.575 GHz                      (72.59%)
    69,693,601,709      stalled-cycles-frontend   #   73.89% frontend cycles idle     (72.59%)
    57,579,078,046      stalled-cycles-backend    #   61.05% backend cycles idle      (58.08%)
    53,260,417,953      instructions              #    0.56  insn per cycle
                                                  #    1.31  stalled cycles per insn  (72.60%)
    10,235,123,948      branches                  #  387.964 M/sec                    (72.60%)
        96,002,988      branch-misses             #    0.94% of all branches          (72.62%)
2021-12-07 17:08:38 +02:00
Raphael S. Carvalho
648c921af2 cql3: statements: Fix UB when getting memory consumption limit for unpaged query
get_max_result_size() is called on a slice that was already moved in
the previous argument. This results in use-after-move with clang, whose
evaluation order is left-to-right.
For paged queries, max_result_size is later overridden by query_pager,
but for unpaged and/or reversed queries it can happen that max result
size incorrectly contains the 1MB limit for paged, non-reversed queries.
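
The general hazard can be sketched outside of Scylla. C++ does not specify the order in which function arguments are evaluated, so reading an object in one argument while moving it in another is only safe if the read is hoisted out first. The types and names below are hypothetical and are not the cql3 code.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical command object that takes ownership of a slice.
struct command {
    std::string slice;
    std::size_t max_size;
};

command make_command(std::string slice, std::size_t max_size) {
    return command{std::move(slice), max_size};
}

// Hazardous shape (not compiled here):
//     make_command(std::move(s), s.size());
// If the first argument is initialized first, s.size() reads a
// moved-from string.

// Safe shape: read everything that is needed before the move.
command make_command_safely(std::string s) {
    const std::size_t n = s.size();  // evaluated before the move
    return make_command(std::move(s), n);
}
```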

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211207145133.69764-1-raphaelsc@scylladb.com>
2021-12-07 16:57:01 +02:00
Avi Kivity
edaa0c468d cql3: expr: standardize on struct tag for expression components
Expression components are pure data, so emphasize this by using
the struct tag consistently. This is just a cosmetic change.

Closes #9740
2021-12-07 15:46:25 +02:00
Botond Dénes
2e5440bdf2 Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho
Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too.

There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2.

tests: unit(debug).

Closes #9725

* github.com:scylladb/scylla:
  compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
  sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
2021-12-07 15:17:38 +02:00
Raphael S. Carvalho
2435bd14c6 compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:57 -03:00
Raphael S. Carvalho
c6399005a3 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:55 -03:00
Raphael S. Carvalho
aebbe68239 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:53 -03:00
Raphael S. Carvalho
c3c070a5ca sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:51 -03:00
Raphael S. Carvalho
6b664067dd sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
Cannot be fully converted to flat_mutation_reader_v2 yet, as the
selector is built on combined_reader interface which is still not
converted. So only updated wherever possible.
Subsequent work will update sstable_set readers, which use the
selector, to flat_mutation_reader_v2.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:49 -03:00
Kamil Braun
75bab2beec raft: server: print the ID of aborted server 2021-12-07 11:23:34 +01:00
Kamil Braun
45fe0d015d test: raft: randomized_nemesis_test: run stop_crash nemesis in basic_generator_test
There is a separate thread that periodically stops/crashes and restarts
a randomly chosen server, so the nemesis runs concurrently with
reconfigurations and network partitions.
2021-12-07 11:23:34 +01:00
Kamil Braun
f9073b864f test: raft: randomized_nemesis_test: introduce stop_crash operation
An operation which chooses a server randomly, randomly chooses whether
to crash or gracefully stop it, performs the chosen operation, and
restarts the server after a selected delay.
2021-12-07 11:23:34 +01:00
Kamil Braun
168390d4bb test: raft: randomized_nemesis_test: environment: implement server stop and crash
`stop` gracefully stops a running server, `crash` immediately "removes" it
(from the point of view of the rest of the environment).

We cannot simply destroy a running server. Read the comments in `crash`
to understand how it's implemented.
2021-12-07 11:23:34 +01:00
Kamil Braun
485c0b1819 raft: server: don't register metrics in start()
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).

Metrics are global so `register_metrics()` cannot be called on
two servers that have the same ID, which is useful e.g. in tests when we
want to simulate server stops and restarts.
2021-12-07 11:23:33 +01:00
Kamil Braun
429f87160b test: raft: randomized_nemesis_test: raft_server: return stopped_error when called during abort
Don't return `gate_closed_exception` which is an internal implementation
detail and which callers don't expect.
2021-12-07 11:22:52 +01:00
Kamil Braun
c79dacc028 test: raft: randomized_nemesis_test: handle raft::stopped_error
Include it in possible call result types. It will start appearing when
we enable server aborts in the middle of the test.
2021-12-07 11:22:52 +01:00
Kamil Braun
25a8772306 test: raft: randomized_nemesis_test: handle missing servers in environment call functions
`environment` functions for performing operations on Raft servers:
`is_leader`, `call`, `reconfigure`, `get_configuration`,
currently assume that a server is running on each node at all times and
that it never changes. Prepare these functions for missing/restarting
servers.
2021-12-07 11:22:51 +01:00
Kamil Braun
d281b2c0ea test: raft: randomized_nemesis_test: environment: split new_server into new_node and start_server
Soon it will be possible to stop a server and then start a completely
new `raft::server` instance but which uses the same ID and persistence,
simulating a server restart. For this we introduce the concept of a
"node" which keeps the persistence alive (through a shared pointer). To
start a server - using `start_server` - we must first create a node on
which it will be running through `new_node`. `new_server` is now a
short function which does these two things.
2021-12-07 11:22:51 +01:00
Kamil Braun
5c803ae1d0 test: raft: randomized_nemesis_test: remove environment::get_server
To perform calls to servers in a Raft cluster, the test code would first
obtain a reference to a server through `get_server` and then call the
server directly. This will not be safe when we implement server crashes
and restarts as servers will disappear in the middle of operations; we don't
want the test code to keep references to no-longer-existing servers.

In the new API the test will call the `environment` to perform
operations, giving it the server ID. `environment` will handle
disappearing servers underneath.
2021-12-07 11:22:51 +01:00
Kamil Braun
0d64fbc39d test: raft: randomized_nemesis_test: construct persistence_proxy outside raft_server<M>::create 2021-12-07 11:22:51 +01:00
Kamil Braun
4e8a86c6a1 test: raft: randomized_nemesis_test: persistence_proxy: store a shared pointer to persistence
We want the test to be able to reuse `persistence` even after
`persistence_proxy` is destroyed for simulating server restarts. We'll
do it by having the test keep a shared pointer to `persistence`.

To do that, instead of storing `persistence` by value and constructing
it inside `persistence_proxy`, store it by `lw_shared_ptr` which is
taken through the constructor (so `persistence` itself is now
constructed outside of `persistence_proxy`).
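
A minimal sketch of this ownership arrangement, using `std::shared_ptr` as a stand-in for Seastar's `lw_shared_ptr`. The members of `persistence` and `persistence_proxy` shown here are hypothetical and much simpler than the ones in the test.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical "persistent storage": just a log of integers.
struct persistence {
    std::vector<int> log;
};

// The proxy no longer constructs the storage itself; it receives a
// shared pointer, so the storage can outlive the proxy.
class persistence_proxy {
    std::shared_ptr<persistence> _persistence;
public:
    explicit persistence_proxy(std::shared_ptr<persistence> p)
        : _persistence(std::move(p)) {}
    void store(int entry) { _persistence->log.push_back(entry); }
    std::size_t stored() const { return _persistence->log.size(); }
};
```

The test keeps its own `std::shared_ptr<persistence>`; destroying one proxy and constructing another over the same pointer then models a restart that sees the previously stored entries.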
2021-12-07 11:22:51 +01:00
Kamil Braun
16b1d2abcc test: raft: randomized_nemesis_test: persistence: split into two classes
The previous `persistence` implemented the `raft::persistence` interface
and had two different responsibilities:
- representing "persistent storage", with the ability to store and load
  stuff to/from it,
- accessing in-memory state shared with a corresponding instance of
  `impure_state_machine` that is running along `persistence` inside
  a `raft::server`.

For example, `persistence::store_snapshot_descriptor` would persist not
only the snapshot descriptor, but also the corresponding snapshot. The
descriptor was provided through a parameter but the snapshot wasn't. To
obtain the snapshot we use a data structure (`snapshots_t`) that both
`persistence` and `impure_state_machine` had a reference to.

We split `persistence` into two classes:
- `persistence` which handles only the first responsibility, i.e.
  storing and loading stuff; everything to store is provided through
  function parameters (e.g. now we have a `store_snapshot` function
  which takes both the snapshot and its descriptor through the
  parameters) and everything to load is returned directly by functions
  (e.g. `load_snapshot` returns a pair containing both the descriptor
  and corresponding snapshot)
- `persistence_proxy` (for lack of a better name) which implements
  `raft::persistence`, contains the above `persistence` inside and
  shares a data structure with `impure_state_machine`
(so `persistence_proxy` corresponds to the old `persistence`).

The goal is to prepare for reusing the persisted stuff between different
instances of `raft::server` running in a single test when simulating
server shutdowns/crashes and restarts. When destroying a `raft::server`,
we destroy its `impure_state_machine` and `persistence_proxy` (we are
forced to because constructing a `raft::server` requires a `unique_ptr`
to `raft::persistence`), but we will be able to keep the underlying
`persistence` for the next instance (if we simulate a restart) - after a
slight modification made in the next commit.
2021-12-07 11:22:51 +01:00
Kamil Braun
c1db77fa61 test: raft: logical_timer: introduce sleep_until
Allows sleeping until a given time point arrives.
2021-12-07 11:22:51 +01:00
Avi Kivity
79bcdc104e Merge "Fix stateful multi-range scans" from Botond
"
Currently stateful (readers being saved and resumed on page boundaries)
multi-range scans are broken in multiple ways. Trying to use them can
result in anything from use-after-free (#6716) or getting corrupt data
(#9718). Luckily no-one is doing such queries today, but this started to
change recently as code such as Alternator TTL and distributed
aggregate reads started using this.
This series fixes both problems and adds a unit test too exercising this
previously completely unused code-path.

Fixes: #6716
Fixes: #9718

Tests: unit(dev, release, debug)
"

* 'fix-stateful-multi-range-scans/v1' of https://github.com/denesb/scylla:
  test/boost/multishard_mutation_query_test: add multi-range test
  test/boost/multishard_mutation_query_test: add multi-range support
  multishard_mutation_query: don't drop data during stateful multi-range reads
  multishard_combining_reader: reader_lifecycle_policy: allow saving read range on fast-forward
2021-12-07 12:19:56 +02:00
Nadav Har'El
ca46c3ba8f test/redis: replace run script with shorter Python script
In the past, we had very similar shell scripts for test/alternator/run,
test/cql-pytest/run and test/redis/run. Most of the code of all three
scripts was identical - dealing with starting Scylla in a temporary
directory, running pytest, and so on. The code duplication meant that
every time we fixed a bug in one of those scripts, or added an important
boot-time parameter to Scylla, we needed to fix all three scripts.

The solution was to convert the run scripts to Python, and to use a
common library, test/cql-pytest/run.py, for the main features shared
by all scripts - starting Scylla, waiting for protocols to be available,
and running pytest.

However, we only did this conversion for alternator and cql-pytest -
redis remained the old shell scripts. This patch completes the
conversion also for redis. As expected, no change was needed to the
run.py library code, which was already strong enough for the needs of
the redis tests.

Fixes #9748.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211207081423.1187847-1-nyh@scylladb.com>
2021-12-07 12:18:07 +02:00
Avi Kivity
395b30bca8 mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2
As part of the drive to move over to flat_mutation_reader_v2, update
make_filtering_reader(). Since it doesn't examine range tombstones
(only the partition_start, to filter the key) the entire patch
is just glue code upgrading and downgrading users in the pipeline
(or removing a conversion, in one case).

Test: unit (dev)

Closes #9723
2021-12-07 12:18:07 +02:00
Raphael S. Carvalho
6737c88045 compaction_manager: use single semaphore for serialization of maintenance compactions
We have three semaphores for serialization of maintenance ops.
1) _rewrite_sstables_sem: for scrub, cleanup and upgrade.
2) _major_compaction_sem: for major
3) _custom_job_sem: for reshape, resharding and offstrategy

scrub, cleanup and upgrade should be serialized with major,
so rewrite sem should be merged into major one.

offstrategy is also a maintenance op that should be serialized
with others, to reduce compaction aggressiveness and space
requirement.

resharding is one-off operation, so can be merged there too.
the same applies for reshape, which can take long and not
serializing it with other maintenance activity can lead to
exhaustion of resources and high space requirement.

let's have a single semaphore to guarantee their serialization.

deadlock isn't an issue because locks are always taken in same
order.
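
The idea of funnelling every maintenance operation through one lock can be sketched with the standard library. This is an assumption-laden illustration: the real code uses a Seastar semaphore and asynchronous jobs, and `maintenance_serializer` is a hypothetical name.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// One lock shared by all maintenance operations (major, scrub, cleanup,
// upgrade, reshape, resharding, offstrategy), so at most one runs at a time.
class maintenance_serializer {
    std::mutex _maintenance_lock;       // stand-in for the single semaphore
    std::vector<std::string> _history;  // order in which jobs were admitted
public:
    // Runs `job` while holding the lock. The job must not call run()
    // itself, since std::mutex is not reentrant.
    void run(const std::string& name, const std::function<void()>& job) {
        std::lock_guard<std::mutex> guard(_maintenance_lock);
        _history.push_back(name);
        job();
    }
    const std::vector<std::string>& history() const { return _history; }
};
```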

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211201182046.100942-1-raphaelsc@scylladb.com>
2021-12-07 12:18:07 +02:00
Eliran Sinvani
426fc8db3a Repair: add a stringify function for node_ops_cmd
Adding a stringify function for the node_ops_cmd enum
will make the log output more readable and will (hopefully)
make it possible to do initial analysis without consulting
the source code.

Refs #9629

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #9745
2021-12-07 12:18:07 +02:00
Nadav Har'El
d3abff9ea1 test/alternator: validate that TagResource needs a Tags parameter
A short new test to verify that in the TagResource operation, the
Tags parameter - specifying which tags to set - is required.

The test passes on both AWS and Alternator - they both produce a
ValidationException in this case (the specific human-readable error
message is different, though, so we don't check it).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206140541.1157574-1-nyh@scylladb.com>
2021-12-06 15:08:16 +01:00
Avi Kivity
f907205b92 utils: logalloc: correct and adjust timing unit in stall report
The stall report uses the millisecond unit, but actually reports
nanoseconds.

Switch to microseconds (milliseconds are a bit too coarse) and
use the safer "duration / 1us" style rather than "duration::count()"
that leads to unit confusion.
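
The difference between the two styles can be shown with plain `std::chrono`. This is a sketch with a hypothetical helper name, not the logalloc code:

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Dividing a duration by a unit duration yields a plain number expressed
// in that unit, regardless of the tick period of the argument.
std::chrono::nanoseconds::rep stall_in_micros(std::chrono::nanoseconds d) {
    return d / 1us;
}

// By contrast, d.count() returns ticks of whatever period `d` happens to
// have (nanoseconds here), which is easy to mislabel in a report.
```

For a 2.5 ms stall, `stall_in_micros` returns 2500 while `count()` on the same nanosecond duration returns 2500000.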

Fixes #9733.

Closes #9734
2021-12-06 09:51:57 +02:00
Botond Dénes
ade4cdf0e7 Merge "compaction: quarantine invalid sstables" from Benny Halevy
"
This series adds an optional "quarantine" subdirectory
to the table data directory that may contain sstables
that are fenced-off from regular compaction.

The motivation, as discussed in
https://github.com/scylladb/scylla/issues/7658
and
https://github.com/scylladb/scylla/issues/9537#issuecomment-953635973,
is to prevent regular compaction from spreading sstable corruption
further to other sstables, and allow investigating the invalid sstables
using the scylla-sstable tool, or scrubbing them in segregate mode.

When sstables are found to be invalid in scrub::mode::validate
they are moved to the quarantine directory, where they will still
be available for reading, but will not be considered for regular
or major compaction.

By default scrub, in all other modes, will consider all sstables,
including the quarantined ones.  To make it more efficient, a
new option, quarantine_mode, was added and exposed via the
storage_service/keyspace_scrub api. When set to quarantine_mode::only,
scrub will read only the quarantined sstables, so that the user can
start with validate mode to detect invalid sstables and quarantine
them, then scrub/segregate only the quarantined sstables.

Test: unit(dev), database_test(debug)
DTest:
nodetool_additional_test.py:TestNodetool.{scrub_ks_sstable_with_invalid_fragment_test,scrub_segregate_sstable_with_invalid_fragment_test,scrub_segregate_ks_sstable_with_invalid_fragment_test,scrub_sstable_with_invalid_fragment_test,scrub_with_multi_nodes_expect_data_rebuild_test,scrub_with_one_node_expect_data_loss_test,validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test,validate_sstable_with_invalid_fragment_test}
"

* tag 'quarantine-invalid-sstables-v6' of github.com:bhalevy/scylla:
  test: sstable_compaction_test: add sstable_scrub_quarantine_mode_test
  compaction: scrub: add quarantine_mode option
  compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub
  compaction: scrub_sstables_validate_mode: quarantine invalid sstables
  test: database_test: add snapshot_with_quarantine_works
  test: database_test: add populate_from_quarantine_works
  distributed_loader: populate_keyspace: populate also from the quarantine dir
  distributed_loader: populate_column_family: add must_exist param
  sstables: add is_quarantined
  sstables: add is_eligible_for_compaction
  sstables: define symbolic names for table subdirectories
2021-12-06 08:58:43 +02:00
Takuya ASADA
ea20f89c56 dist: allow running scylla-housekeeping with strict umask setting
To avoid failing scylla-housekeeping in strict umask environment,
we need to chmod a+r on repository file and housekeeping.uuid.

Fixes #9683

Closes #9739
2021-12-05 20:46:46 +02:00
Benny Halevy
044e4a6b72 token_metadata: delete private constructor
It is not used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211205174306.450536-1-bhalevy@scylladb.com>
2021-12-05 19:49:29 +02:00
Avi Kivity
32beb9e7e4 Merge "Keep proxy reference from thrift" from Pavel E
"
Thrift is one of the users of global storage proxy instance.
This set removes all such calls from the thrift/ code.

tests: unit(dev)
"

* 'br-thrift-reference-storage-proxy' of https://github.com/xemul/scylla:
  thrift: Use local proxy reference in do_paged_slice
  thrift: Use local proxy reference in handler methods
  thrift: Keep sharded proxy reference on thrift_handler
2021-12-05 19:22:33 +02:00
Benny Halevy
9ed72cac95 test: sstable_compaction_test: add sstable_scrub_quarantine_mode_test
For each quarantine mode:
Validate sstables to quarantine one of them,
then scrub with the given quarantine mode
and verify from the output whether the quarantined
sstable was scrubbed or not.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:29:58 +02:00
Benny Halevy
cc122984d6 compaction: scrub: add quarantine_mode option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:29:04 +02:00
Benny Halevy
60ff28932c compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub
So we can pass additional options on top of the scrub mode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:21:37 +02:00
Benny Halevy
bbe275f37d compaction: scrub_sstables_validate_mode: quarantine invalid sstables
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.

Refs #7658

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:14:16 +02:00
Benny Halevy
3eabfad9fc test: database_test: add snapshot_with_quarantine_works
Test that snapshot includes quarantined sstables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
11b54d44d9 test: database_test: add populate_from_quarantine_works
Test that we load quarantined sstables by
creating a dataset, moving an sstable to the quarantine dir,
and then reloading the table and verifying the dataset.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
075962b45a distributed_loader: populate_keyspace: populate also from the quarantine dir
sstables in the quarantine subdirectory are part of the table.
They're just not eligible for non-scrub compaction.

Call populate_column_family also for the quarantine subdirectory,
allowing it to not exist.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
f643dc90a9 distributed_loader: populate_column_family: add must_exist param
Check if the directory to be loaded exists.

Currently must_exist=true in all cases,
but it may be set to false when loading directories
that may not exist.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
13e7b00f2e sstables: add is_quarantined
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
07c5ddf182 sstables: add is_eligible_for_compaction
Currently compaction_manager tracks sstables
based on !requires_view_building() and similarly,
table::in_strategy_sstables picks up only sstables
that are not in staging.

is_eligible_for_compaction() generalizes this condition
in preparation for adding a quarantine subdirectory for
invalid sstables that should not be compacted as well.
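
A rough sketch of what such a generalized predicate could look like once quarantine exists. Everything here is hypothetical: `sstable_info` is not the real sstable class, and deriving the flags from a subdirectory name is an assumption made only for the example.

```cpp
#include <string>

// Hypothetical stand-in for an sstable, identified by the table
// subdirectory it lives in ("" means the main table directory).
struct sstable_info {
    std::string subdir;

    // Assumption: sstables in "staging" still need view building.
    bool requires_view_building() const { return subdir == "staging"; }
    bool is_quarantined() const { return subdir == "quarantine"; }

    // A single predicate that callers use instead of spelling out
    // each exclusion themselves.
    bool is_eligible_for_compaction() const {
        return !requires_view_building() && !is_quarantined();
    }
};
```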

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
bdc53880d4 sstables: define symbolic names for table subdirectories
Define the "staging", "upload", and "snapshots" subdirectory
names as named const expressions in the sstables namespace
rather than relying on their string representation,
which could lead to typos.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Avi Kivity
bfdab1e92e alternator: ttl: don't initialize vector from initializer_list in coroutine
Initializing a vector from an initializer_list defeats move construction
(since initializer_list is const). Moreover it is suspected to cause a
crash due to a miscompile. In any case, this patch fixes the crash.
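
The first point can be demonstrated with a type that counts its copies. This sketch covers only the lost move construction, not the suspected coroutine miscompile; `tracked` and the two builder functions are hypothetical.

```cpp
#include <utility>
#include <vector>

int copy_count = 0;  // number of copy constructions of `tracked`

struct tracked {
    tracked() = default;
    tracked(const tracked&) { ++copy_count; }
    tracked(tracked&&) noexcept {}
};

// The elements of an initializer_list are const, so the vector has to
// copy each one even though the list itself was built by moving.
std::vector<tracked> build_from_list(tracked a, tracked b) {
    return {std::move(a), std::move(b)};
}

// Appending one by one keeps every element move-constructed.
std::vector<tracked> build_with_push_back(tracked a, tracked b) {
    std::vector<tracked> v;
    v.reserve(2);
    v.push_back(std::move(a));
    v.push_back(std::move(b));
    return v;
}
```

Calling `build_from_list` costs two copies, while `build_with_push_back` costs none.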

Fixes #9735.

Closes #9736
2021-12-05 17:51:05 +02:00
Avi Kivity
8d724835eb Merge 'select_statement: Calculate _restrictions->need_filtering() only once' from Jan Ciołek
Originally mentioned in: https://github.com/scylladb/scylla/pull/9481#issuecomment-982698208

Currently we call `_restrictions->need_filtering()` each time a prepared select is executed.
This is not super efficient - `need_filtering` has to scan through the whole AST and analyze it.

This PR calculates value of `_restrictions->need_filtering()` only once and then uses this precomputed value.
I ran `perf_simple_query` on my laptop throttled to 1GHz and it looks like this saves ~1000 instructions/op.

```bash
median 38459.09 tps ( 75.1 allocs/op,  12.1 tasks/op,   46099 insns/op)
median 38743.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   46091 insns/op)
median 38489.52 tps ( 75.1 allocs/op,  12.1 tasks/op,   46097 insns/op)
median 38492.10 tps ( 75.1 allocs/op,  12.1 tasks/op,   46102 insns/op)
median 38478.65 tps ( 75.1 allocs/op,  12.1 tasks/op,   46098 insns/op)

median 38930.07 tps ( 75.1 allocs/op,  12.1 tasks/op,   44922 insns/op)
median 38777.52 tps ( 75.1 allocs/op,  12.1 tasks/op,   44904 insns/op)
median 39325.41 tps ( 75.1 allocs/op,  12.1 tasks/op,   44925 insns/op)
median 38640.51 tps ( 75.1 allocs/op,  12.1 tasks/op,   44907 insns/op)
median 39075.89 tps ( 75.1 allocs/op,  12.1 tasks/op,   44920 insns/op)

./build/release/test/perf/perf_simple_query --cpuset 1 -m 1G --random-seed 0 --task-quota-ms 10 --operations-per-shard 1000000
```

Closes #9727

* github.com:scylladb/scylla:
  select_statement: Use precomputed value of _restrictions->need_filtering()
  select_statement: Store whether restrictions need filtering in a variable
2021-12-05 13:38:51 +02:00
Takuya ASADA
097a6ee245 dist: add support im4gn/is4gen instance on AWS
Add support for next-generation, storage-optimized ARM64 instance types.

Fixes #9711

Closes #9730
2021-12-05 13:20:01 +02:00
Nadav Har'El
de21455dfe Rename one logger which had a space in its name
We had a logger called "query result log", with spaces, which made it
impossible to enable it with the REST API due to missing percent
decoding support in our HTTP server (see #9614).

Although that HTTP server bug should be fixed as well (in Seastar -
see scylladb/seastar#725), there is no good reason to have a logger
name with a space in it. This is the only logger whose name has
a space: We have 77 other loggers using underscores (_) in their name,
and only 9 using hyphens (-). So in this patch we choose the more
popular alternative - an underscore.

Fixes #9614.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211205093732.1092553-1-nyh@scylladb.com>
2021-12-05 12:18:21 +02:00
Pavel Emelyanov
db5678bb7f Merge "Kill unused code in compaction" from Raphael
tests: unit(dev).

* github.com/raphaelsc/scylla.git cleanups_for_compaction_12_03
  compaction_strategy: kill unused compaction_strategy_type::major
  compaction: Log skip of fully expired sstables
  compaction_strategy: kill unused can_compact_partial_runs()
  compaction: kill useless on_skipped_expired_sstable()
  compaction: merge _total_input_sstables and _ancestors
2021-12-03 19:22:08 +03:00
Jan Ciolek
22c3e00c44 select_statement: Use precomputed value of _restrictions->need_filtering()
Instead of calculating _restrictions->need_filtering() each time,
we can now use the value that has been already calculated.

This used to happen during query execution,
so we get an increase in performance.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-03 17:03:53 +01:00
Jan Ciolek
075b3a45fd select_statement: Store whether restrictions need filtering in a variable
Instead of calculating _restrictions->need_filtering()
we can calculate it only once and then use this computed variable.

It turns out that _restrictions->need_filtering() is called
during execution of prepared statements and it has to scan through the whole AST,
so doing it only once gives us a performance gain.
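
The idea can be sketched in a few lines of Python. This is a toy model, not the actual C++ code: the classes, the `clauses` list and the member name `_restrictions_need_filtering` are hypothetical stand-ins, and `scans` exists only to count how often the expensive walk runs.

```python
class Restrictions:
    """Stand-in for the restrictions object; need_filtering() is the costly call."""

    def __init__(self, clauses):
        self.clauses = clauses
        self.scans = 0  # how many times the full scan was performed

    def need_filtering(self):
        self.scans += 1
        return any(clause.get("needs_filtering") for clause in self.clauses)


class SelectStatement:
    def __init__(self, restrictions):
        self._restrictions = restrictions
        # Computed once, at prepare time, and stored in a variable.
        self._restrictions_need_filtering = restrictions.need_filtering()

    def execute(self):
        # Execution only reads the stored flag; no scan happens here.
        return "filtered" if self._restrictions_need_filtering else "direct"
```

However many times the prepared statement is executed, the scan runs exactly once.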

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-03 17:01:09 +01:00
Raphael S. Carvalho
2f9f089eda compaction_strategy: kill unused compaction_strategy_type::major
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:27:10 -03:00
Raphael S. Carvalho
0e3d388ebb compaction: Log skip of fully expired sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:25:48 -03:00
Raphael S. Carvalho
9725e5efa9 compaction_strategy: kill unused can_compact_partial_runs()
This strategy method was introduced unnecessarily. We assumed it was
going to be needed, but it turns out it was never needed, not even
for ICS. Also, it's built on a wrong assumption, as an output
sstable run being generated can never be compacted in parallel
because the non-overlapping requirement can be easily broken.
LCS, for example, can allow parallel compaction on different runs
(levels), but correctness cannot be guaranteed when the same runs
are compacted in parallel.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:20:51 -03:00
Raphael S. Carvalho
7a7a2467fa compaction: kill useless on_skipped_expired_sstable()
It was introduced by commit 5206a97915 because a fully expired sstable
wouldn't be registered and therefore could never be removed from the backlog
tracker. This is no longer possible as table is now responsible for
removing all input sstables. So let's kill on_skipped_expired_sstable()
as it's now only boilerplate we don't need.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:29 -03:00
Raphael S. Carvalho
32c2534e91 compaction: merge _total_input_sstables and _ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:23 -03:00
Pavel Emelyanov
d86b35f474 thrift: Use local proxy reference in do_paged_slice
This place needs some more care than a simple replacement

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-03 17:56:04 +03:00
Pavel Emelyanov
35c35602ae thrift: Use local proxy reference in handler methods
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-03 17:56:04 +03:00
Pavel Emelyanov
2d8272dc03 thrift: Keep sharded proxy reference on thrift_handler
Carried via main -> controller -> server -> factory -> handler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-03 17:48:19 +03:00
Piotr Sarna
0bd139e81c Merge 'cql3: expr: detemplate and deinline find_in_expression()
... and count_if()' from Avi Kivity

The expression code provides some utilities to examine and manipulate
expressions at prepare time. These are not (or should not be) in the fast
path and so should be optimized for compile time and code footprint
rather than run time.

This series does so by detemplating and deinlining find_in_expression()
and count_if().

Closes #9712

* github.com:scylladb/scylla:
  cql3: expr: adjust indentation in recurse_until()
  cql3: expr: detemplate count_if()
  cql3: expr: detemplate count_if()
  cql3: expr: rewrite count_if() in terms of recurse_until()
  cql3: expr: deinline recurse_until()
  cql3: expr: detemplate find_in_expression
2021-12-03 15:41:07 +01:00
Piotr Sarna
3867ca2fd6 Merge 'cql3: Don't allow unset values inside UDT' from Jan Ciołek
Scylla doesn't support unset values inside UDT.
The old code used to convert `unset` to `null`, which seems incorrect.

There is an extra space in the error message to retain compatibility with Cassandra.

Fixes: #9671

Closes #9724

* github.com:scylladb/scylla:
  cql-pytest: Enable test for UDT with unset values
  cql3: Don't allow unset values inside UDT
2021-12-03 15:36:55 +01:00
Jan Ciolek
3ae8752812 cql-pytest: Enable test for UDT with unset values
The test testUDTWithUnsetValues was marked as xfail,
but now the issue has been fixed and we can enable it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-03 14:46:21 +01:00
Jan Ciolek
be14904416 cql3: Don't allow unset values inside UDT
Scylla doesn't support unset values inside UDT.
The old code used to convert unset to null, which seems incorrect.

There is an extra space in the error message to retain compatibility with Cassandra.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-03 14:46:21 +01:00
Gleb Natapov
b954d91b4f migration_manager: co-routinize announce_column_family_drop function.
Message-Id: <20211202150531.1277448-30-gleb@scylladb.com>
2021-12-03 13:29:34 +02:00
Botond Dénes
fb8d268251 test/boost/multishard_mutation_query_test: add multi-range test 2021-12-03 10:51:45 +02:00
Botond Dénes
2b613a13a5 test/boost/multishard_mutation_query_test: add multi-range support
In the test infrastructure code, so we can add tests passing multiple
ranges to the tested `multishard_{mutation,data}_query()`, exercising
multi-range functionality.
2021-12-03 10:51:45 +02:00
Botond Dénes
5380cb0102 multishard_mutation_query: don't drop data during stateful multi-range reads
When multiple ranges are passed to `multishard_{mutation,data}_query()`,
it wraps the multishard reader with a multi-range one. This interferes
with the disassembly of the multishard reader's buffer at the end of the
page, because the multi-range reader becomes the top-level reader,
denying direct access to the multishard reader itself, whose buffer is
then dropped. This confuses the reading logic, causing data corruption
on the next page(s). A further complication is that the multi-range
reader can include data from more than one range in its buffer when
filling it. To solve this, a special-purpose multi-range reader is introduced
and used instead of the generic one, which solves both these problems by
guaranteeing that:
* Upon calling fill_buffer(), the entire content of the underlying
  multishard reader is moved to that of the top-level multi-range
  reader. So calling `detach_buffer()` guarantees to remove all
  unconsumed fragments from the top-level readers.
* fill_buffer() will never mix data from more than one range. It will
  always stop on range boundaries and will only cross if the last range
  was consumed entirely.

With this, multi-range reads finally work with reader-saving.
2021-12-03 10:45:06 +02:00
Botond Dénes
953603199e multishard_combining_reader: reader_lifecycle_policy: allow saving read range on fast-forward
The reader_lifecycle_policy API was created around the idea of shard
readers (optionally) being saved and reused on the next page. To do
this, the lifecycle policy has to also be able to control the lifecycle
of by-reference parameters of readers: the slice and the range. This was
possible from day 1, as the readers are created through the lifecycle
policy, which can intercept and replace the said parameters with copies
that are created in stable storage. There was one hole in the design
though: fast-forwarding, which can change the range of the read, without
the lifecycle policy knowing about this. In practice this results in
fast-forwarded readers being saved together with the wrong range, their
range reference becoming stale. The only lifecycle implementation prone
to this is the one in `multishard_mutation_query.cc`, as it is the only
one actually saving readers. It will fast-forward its reader when the
query happens over multiple ranges. There were no problems related to
this so far because no one passes more than one range to said functions,
but this is incidental.
This patch solves this by adding an `update_read_range()` method to the
lifecycle policy, allowing the shard reader to update the read range
when being fast forwarded. To allow the shard reader to also have
control over the lifecycle of this range, a shared pointer is used. This
control is required because when an `evictable_reader` is the top-level
reader on the shard, it can invoke `create_reader()` with an edited
range after `update_read_range()`, replacing the fast-forwarded-to
range with a new one, yanking it out from under the feet of the
evictable reader itself. By using a shared pointer here, we can ensure
the range stays alive while it is the current one.
2021-12-03 10:27:44 +02:00
Raphael S. Carvalho
4a02e312f6 compaction: increase disjoint tolerance in TWCS reshape
When reshaping TWCS table in relaxed mode, which is the case for
offstrategy and boot, disjoint tolerance is too strict, which can
lead those processes to do more work than needed.
Let's increase the tolerance to max threshold, which will limit the
amount of sstables opened in compaction to a reasonable amount.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130132538.56285-1-raphaelsc@scylladb.com>
2021-12-03 06:38:42 +02:00
Raphael S. Carvalho
6ad630c095 scylla-gdb.py: fix unique ptr on newer libstdc++
unfortunately, correctness of std_unique_ptr and similar depends on
their implementation in libstdc++. let's support unique ptr on
newer systems while maintaining backward compatibility.

./test.py --mode=release scylla-gdb now passes for me, also verified
`scylla compaction-tasks` produces correct info.

Fixes #9677.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211202173534.359672-1-raphaelsc@scylladb.com>
2021-12-03 06:33:54 +02:00
Avi Kivity
3b82ef854d Merge "Some compaction manager cleanups" from Raphael
"
couple of preparatory changes for coroutinization of manager
"

* 'some_compaction_manager_cleanups_v5' of github.com:raphaelsc/scylla:
  compaction_manager: move check_for_cleanup into perform_cleanup()
  compaction_manager: replace get_total_size by one liner
  compaction_manager: make consistent usage of type and name table
  compaction_manager: simplify rewrite_sstables()
  compaction_manager: restore indentation
2021-12-02 19:53:13 +02:00
Pavel Emelyanov
5cfeac0c90 paxos: Drop forward declarations of seastar pointers
They will break compilation after next seastar update, but the
good news is that scylla compiles even without them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211202173643.1070-1-xemul@scylladb.com>
2021-12-02 19:49:03 +02:00
Konstantin Osipov
bdb924cdac cql3: co-routinize create_table_statement::announce_migration()
Message-Id: <20211202150531.1277448-4-gleb@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
e4f35e2139 migration_manager: Eliminate storage service from passive announcing
Currently storage service acts as a glue between database schema value
and the migration manager "passive_announce" call. This interposing is
not required, migration manager can do all the management itself, and
the linkage can be done in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
a751a1117a migration_manager: Coroutinize drain()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
eb8e30f696 migration_manager: Rename stop to drain then bring it back
Because today's migration_manager::stop is called at drain time.
Keep the .stop for next patch, but since it's called when the
whole migration_manager stops, guard it against re-entrances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
798f4b0e3f migration_manager: Sanitize (maybe_)schedule_schema_pull
Both calls are now private. Also the non-maybe one can become void
and handle pull exceptions by itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
421679e428 migration_manager: Schedule schema pulls upon gossip events
Move the calls from respective storage service notification callbacks.
One non-move change is that token metadata that was available on the
storage service should be taken from storage proxy, but this change
is aligned with future changes -- migration manager depends on proxy
and will get a local proxy reference some day.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Pavel Emelyanov
d4d0bd147e migration_manager: Subscribe on gossiper events
This is to start schema pulls upon on_join, on_alive and on_change ones
in the next patch. Migration manager already has gossiper reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Botond Dénes
259649c779 sstables/index_reader: improved diagnostics on missing index entry
Add the summary index and the bound's address to the error message, so
it can be correlated with other trace level logging when investigating a
problem.

Refs: #9446

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>
2021-12-02 19:43:30 +02:00
Botond Dénes
f0b9519999 test/lib/exception_utils: add message_matches() predicate
Which checks the message against the given regex.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-1-bdenes@scylladb.com>
2021-12-02 19:43:30 +02:00
Nadav Har'El
605a2de398 config: change default prometheus_address handling, again
In the very recent commit 3c0e703 fixing issue #8757, we changed the
default prometheus_address setting in scylla.yaml to "localhost", to
match the default listen_address in the same file. We explained in that
commit how this helped developers who use an unchanged scylla.yaml, and
how it didn't hurt pre-existing users who already had their own scylla.yaml.

However, it was quickly noted by Tzach and Amnon that there is one use case
that was hurt by that fix:

Our existing documentation, such as the installation guide
https://www.scylladb.com/download/?platform=centos asks the user to take
our initial scylla.yaml, and modify listen_address, rpc_address, seeds,
and cluster_name - and that's it. That document - and others - don't
tell the user to also override prometheus_address, so users will likely
forget to do so - and monitoring will not work for them.

So this patch includes a different solution to #8757.
What it does is:
1. The setting of prometheus_address in scylla.yaml is commented out.
2. In config.cc, prometheus_address defaults to empty.
3. In main.cc, if prometheus_address is empty (i.e., was not explicitly
   set by the user), the value of listen_address is used instead.

In other words, the idea is that prometheus_address, if not explicitly set
by the user, should default to listen_address - which is the address used
to listen to the internal Scylla inter-node protocol.
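
The fallback described above can be sketched as follows. This is a minimal Python model of the rule, not the code in config.cc or main.cc; the function name and the dict-based configuration are hypothetical.

```python
def effective_prometheus_address(config: dict) -> str:
    """Return the address Prometheus should bind to (hypothetical helper)."""
    listen_address = config.get("listen_address", "localhost")
    # An empty or missing value means "not explicitly set by the user".
    prometheus_address = config.get("prometheus_address", "")
    # Fall back to listen_address only when nothing was set explicitly.
    return prometheus_address or listen_address
```

An untouched configuration yields "localhost", a configuration that only sets listen_address exposes Prometheus on that address, and an explicit prometheus_address always wins.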

Because the documentation already tells the user to set listen_address
and to not leave it set to localhost, setting it will also open up
prometheus, thereby solving #9701. Meanwhile, developers who leave the
default listen_address=localhost will also get prometheus_address=localhost,
so the original #8757 is solved as well. Finally, for users who had an old
scylla.yaml where prometheus_address was explicitly set to something,
this setting will continue to be used. This was also a requirement of
issue #8757.

Fixes #9701.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211129155201.1000893-1-nyh@scylladb.com>
2021-12-02 19:43:30 +02:00
Avi Kivity
7cfd278c32 db: size_estimates_virtual_reader: convert to flat_mutation_reader_v2
As part of changing the codebase to flat_mutation_reader_v2,
change size_estimates_virtual_reader.

Since the bulk of the work is done by
make_flat_mutation_reader_from_mutations() (which is unchanged),
only glue code is affected. It is also not performance sensitive,
so the extra conversions are unimportant.

Test: unit (dev)

Closes #9707
2021-12-02 19:43:30 +02:00
Avi Kivity
b920f2500d db: virtual_table: convert chained_delegating_reader to v2
As part of changing the codebase to flat_mutation_reader_v2,
change chained_delegating_reader and its user virtual_table.

Since the reader does not process fragments (only forwarding
things around), only glue code is affected. It is also not
performance sensitive, so the extra conversions are unimportant.

Test: unit (dev)

Closes #9706
2021-12-02 19:43:30 +02:00
Raphael S. Carvalho
6d750d4f59 compaction_manager: move check_for_cleanup into perform_cleanup()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00
Raphael S. Carvalho
9aed7e9d67 compaction_manager: replace get_total_size by one liner
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00
Raphael S. Carvalho
760cfd93fb compaction_manager: make consistent usage of type and name table
new code in manager adopted name and type table, whereas historical
code still uses name and type column family. let's make it consistent
for newcomers to not get confused.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:27 -03:00
Gleb Natapov
cab1a1c403 schema raft sm: request schema sync on schema_state_machine snapshot transfer
If the schema state machine requests snapshot transfer it means that
it missed some schema mutations and needs a full sync. We already have
a function that does it: migration_manager::submit_migration_task(),
so call it on a snapshot transfer.
2021-12-02 14:55:29 +02:00
Raphael S. Carvalho
e460f72250 compaction_manager: simplify rewrite_sstables()
as rewrite_sstables() switched to coroutine, it can be simplified
by not using smart pointers to handle lifetime issues.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 08:15:41 -03:00
Raphael S. Carvalho
48124fc15a compaction_manager: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 08:15:38 -03:00
Gleb Natapov
431f931d2a raft service: delegate snapshot transfer to a state machine implementation
We want raft service to support all kinds of state machines and most
services provided by it may indeed be shared. But snapshot transfer is
very state machine specific and thus cannot be put into the raft service.
This patch delegates snapshot transfer implementation to a state machine
implementation.
2021-12-02 10:54:44 +02:00
Gleb Natapov
fd109ecff1 schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply()
This patch wires up schema_raft_state_machine::apply() function. For now
it assumes that a raft command contains single schema change in the form
of a schema mutation array. It may change later (we may add more info to
a schema), but for now this will do.
2021-12-02 10:46:32 +02:00
Piotr Sarna
761c691149 alternator,ttl: simplify getting primary key column values
Key column values fetched during the TTL scan have a well-defined
order - primary columns come first. This assumption is now used
to simplify getting the values from rows during scans without
having to consult result metadata first.

Tests: unit(release)
Message-Id: <dcb19b8bab0dd02838693fe06d5a835ea2f378ff.1638357005.git.sarna@scylladb.com>
2021-12-02 10:29:41 +02:00
Piotr Sarna
337906bc1c alternator: precompute scan range parameters in a function
This commit addresses a very simple FIXME left in alternator TTL
implementation - it reduces the number of parameters passed
to scan_table_ranges() by enclosing the parameters in a separate
object.

Tests: unit(release)

Message-Id: <214afcd9d5c1968182ad98550105f82add216c80.1638354094.git.sarna@scylladb.com>
2021-12-02 10:04:05 +02:00
Raphael S. Carvalho
de165b864c repair: Enable off-strategy compaction for rebuild
Let's enable offstrategy for repair based rebuild, for it to take
advantage of offstrategy benefits, one of the most important
being compaction not acting aggressively, which is important
for both reducing operation time and delivering good latency
while the operation is running.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130115957.13779-1-raphaelsc@scylladb.com>
2021-12-02 09:58:58 +02:00
David Garcia
954d5d5d63 Fix cql docs error
Closes #9613
2021-12-02 09:58:58 +02:00
Avi Kivity
ef3edcf848 test: refine test suite names exposed via xunit format
The test suite names seen by Jenkins are suboptimal: there is
no distinction between modes, and the ".cc" suffix of file names
is interpreted as a class name, which is converted to a tree node
that must be clicked to expand. Massage the names to remove
unnecessary information and add the mode.

Closes #9696
2021-12-02 09:58:58 +02:00
Avi Kivity
9edd86362a test: sstable_test: don't read compressed file size from closed file
We read the compressed file size from a file that was already closed,
resulting in EBADF on my machine. Not sure why it works for everyone
else.

Fix by reading the size using the path.
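
The failure mode is easy to reproduce outside the test. The Python sketch below is an illustration under the assumption of a POSIX-like system, not the sstable_test code: querying a descriptor after it is closed fails with EBADF, while asking for the size by path needs no open descriptor.

```python
import errno
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 128)
        fd = f.fileno()
    # The descriptor is closed at this point; querying it is an error.
    try:
        os.fstat(fd)
        fstat_errno = None
    except OSError as e:
        fstat_errno = e.errno
    # Asking the filesystem by path does not need an open descriptor.
    size_by_path = os.path.getsize(path)
```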

Closes #9675
2021-12-01 16:28:46 +02:00
Raphael S. Carvalho
f23e0d7f2d compaction_manager: Disconsider inactive tasks when filtering sstables
After commit 1f5b17f, overlapping can be introduced in level 1 because
procedure that filters out sstables from partial runs is considering
inactive tasks, so L1 sstables can be incorrectly filtered out from
next compaction attempt. When L0 is merged into L1, overlapping is
then introduced in L1 because old L1 sstables weren't considered in
L0 -> L1 compaction.

From now on, compaction_manager::get_candidates() will only consider
active tasks, to make sure actual partial runs are filtered out.

Fixes #9693.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129180459.125847-1-raphaelsc@scylladb.com>
2021-12-01 16:11:44 +02:00
Raphael S. Carvalho
9de7abdc80 compaction: LCS: Fix inefficiency when pushing SSTables to higher levels
To satisfy backlog controller, commit 28382cb25c changed LCS to
incrementally push sstables to highest level *when* there's nothing else
to be done.

That's overkill because controller will be satisfied with level L being
fanout times larger than L-1. No need to push everything to last level as
it's even worse than a major, because any file being promoted will
overlap with ~10 files in next level. At least, the cost is amortized by
multiple iterations, but terrible write amplification is still there.
Consequently, this reduces overall efficiency.
For example, it might happen that LCS in table A starts pushing everything
to highest level, when table B needs resources for compaction to reduce its
backlog. Increased write amplification in A may prevent other tables
from reducing their backlog in a timely manner.

It's clear that LCS should stop promoting as soon as level L is 10x
larger than L-1, so strategy will still be satisfied while fixing the
inefficiency problem.

Now the layout will look as follows:
SSTables in each level: [0, 2, 15, 121]

Previously, once the table stopped being written to, it looked like this:
SSTables in each level: [0, 0, 0, 138]

It's always good to have everything in a single run, but that comes
with a high write amplification cost which we cannot afford in steady
state. With this change, the layout will still be good enough to make
everybody happy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129143606.71257-1-raphaelsc@scylladb.com>
2021-12-01 16:10:25 +02:00
Gleb Natapov
f2ab5f4e60 raft service: insert a new raft instance into the servers' list only after it is started
The RPC module starts dispatching calls to a server the moment it is in the
servers' list, but until raft::server::start() completes the instance is
not fully created yet and is not ready to accept anything. Fix the code
that initializes a new raft group to insert the new raft instance into the list
only after it is started.

Message-Id: <YZTFFW9v0NlV7spR@scylladb.com>
2021-12-01 13:11:49 +01:00
Nadav Har'El
94eb5c55c8 Merge 'Loading cache improve eviction use policy' from Vladislav Zolotarov
This series introduces a new version of a loading_cache class.
The old implementation was susceptible to a "pollution" phenomenon, in which a frequently used entry can get evicted by an intensive burst of "used once" entries pushed into the cache.

The new version is going to have privileged and unprivileged cache sections, and there's a new loading_cache template parameter - SectionHitThreshold. The new cache algorithm goes as follows:
  * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
  * New cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction, the "unprivileged" section's
    least recently used entries are evicted first.
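
The rules above can be sketched as a small Python model. It is a toy under stated assumptions, not the loading_cache implementation: the class and method names are hypothetical, expiration and reload are left out, and the privileged section is only evicted from when the unprivileged one is empty.

```python
from collections import OrderedDict


class TwoSectionCache:
    """Toy model of a cache with privileged and unprivileged LRU sections."""

    def __init__(self, max_size: int, section_hit_threshold: int):
        self.max_size = max_size
        self.threshold = section_hit_threshold
        # Each section maps key -> [value, hit_count], oldest entry first.
        self.unprivileged = OrderedDict()
        self.privileged = OrderedDict()

    def put(self, key, value):
        if key in self.privileged:
            self.privileged[key][0] = value
            self.privileged.move_to_end(key)
            return
        hits = self.unprivileged.pop(key, [None, 0])[1]
        # New entries always start in the unprivileged section.
        self.unprivileged[key] = [value, hits]
        self._shrink()

    def get(self, key):
        if key in self.privileged:
            self.privileged.move_to_end(key)
            return self.privileged[key][0]
        if key not in self.unprivileged:
            return None
        entry = self.unprivileged[key]
        entry[1] += 1
        if entry[1] > self.threshold:
            # Read more than the threshold: promote to the privileged section.
            del self.unprivileged[key]
            self.privileged[key] = entry
        else:
            self.unprivileged.move_to_end(key)
        return entry[0]

    def _shrink(self):
        # Evict least recently used unprivileged entries first.
        while len(self.unprivileged) + len(self.privileged) > self.max_size:
            victims = self.unprivileged or self.privileged
            victims.popitem(last=False)
```

In this model, an entry that has been read often enough to be promoted survives a burst of "used once" insertions, because the burst only displaces other unprivileged entries.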

More details may be found in #8674.

In addition, during testing another issue was found in the authorized_prepared_statements_cache: #9590.
There is a patch that fixes it as well.

Closes #9708

* github.com:scylladb/scylla:
  loading_cache: account unprivileged section evictions
  loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
  authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
  loading_cache::timestamped::lru_entry: refactoring
  loading_cache.hh: rearrange the code (no functional change)
  loading_cache: use std::pmr::polymorphic_allocator
2021-12-01 13:13:53 +02:00
Calle Wilund
3e21fea2b6 test_streamts: test_streams_starting_sequence_number fix 'LastEvaluatedShardId' usage
It is not part of raw response, but of the 'StreamDescription' object.
Test fails intermittently depending on PK randomization.

Closes #9710
2021-12-01 11:05:40 +02:00
Avi Kivity
03755b362a Merge 'compaction_manager api: stop ongoing compactions' from Benny Halevy
This series extends `compaction_manager::stop_ongoing_compaction` so it can be used from the api layer for:
- table::disable_auto_compaction
- compaction_manager::stop_compaction

Fixes #9313
Fixes #9695

Test: unit(dev)

Closes #9699

* github.com:scylladb/scylla:
  compaction_manager: stop_compaction: wait for ongoing compactions to stop
  compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
  compaction_manager: unify stop_ongoing_compactions implementations
  compaction_manager: stop_ongoing_compactions: add compaction_type option
  compaction_manager: get_compactions: get a table* parameter
  table: disable_auto_compaction: stop ongoing compactions
  compaction_manager: make stop_ongoing_compactions public
  table: futurize disable_auto_compactions
2021-11-30 19:08:14 +02:00
Avi Kivity
2c613b027d cql3: expr: adjust indentation in recurse_until()
Whitespace changes only.
2021-11-30 17:57:53 +02:00
Avi Kivity
f7f77df143 cql3: expr: detemplate count_if()
No functional changes. This prepare-path function does not need to
be inlined.
2021-11-30 17:52:15 +02:00
Avi Kivity
3a96b74e49 cql3: expr: detemplate count_if()
count_if() is a prepare-path function and does not need to be
a template. Type-erase it with noncopyable_function.
2021-11-30 17:50:34 +02:00
Avi Kivity
6f9e56e678 cql3: expr: rewrite count_if() in terms of recurse_until()
Counting is just recursing without early termination, and counting
as a side effect.
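
A toy Python version of the same idea, operating on nested tuples rather than Scylla's expression type (the signatures here are assumptions made for illustration): the predicate passed to the walker counts matches and always answers "keep going", so the walk visits every node.

```python
def recurse_until(expr, predicate):
    """Depth-first walk; return the first node where predicate is true, else None."""
    if predicate(expr):
        return expr
    if isinstance(expr, (list, tuple)):
        for child in expr:
            found = recurse_until(child, predicate)
            if found is not None:
                return found
    return None


def count_if(expr, predicate):
    """Count matching nodes by walking the whole tree without early termination."""
    count = 0

    def visit(node):
        nonlocal count
        if predicate(node):
            count += 1
        return False  # never terminate early

    recurse_until(expr, visit)
    return count
```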
2021-11-30 17:49:00 +02:00
Avi Kivity
c01188c414 cql3: expr: deinline recurse_until()
As a prepare-path function, it has no business being inline.
2021-11-30 17:41:16 +02:00
Avi Kivity
d0177d4b85 cql3: expr: detemplate find_in_expression
find_in_expression() is not in a fast path but is quite large
and inlined due to being a template. Detemplate it into a
recurse_until() utility function, and keep only the minimal
code in a template.

The recurse_until is still inline to simplify review, but
will be deinlined in the next patch.
2021-11-30 17:37:24 +02:00
Avi Kivity
595cc328b1 Merge 'cql3: Remove term, replace with expression' from Jan Ciołek
This PR finally removes the `term` class and replaces it with `expression`.

* There was some trouble with `lwt_cache_id` in `expr::function_call`.
  The current code works the following way:
  * for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called.
  * `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls
  * Later when some condition is met we want to clear cache ids of all those collected function calls. To do this we iterate through shared pointers collected in `prepare_context` and clear cache id for each of them.

  This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer.
  To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids.

  I also experimented with doing this without any shared pointers, maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time, we could clear the cache in one place, but forget about a copy. Doing it using shared pointers more closely matches the original behaviour.
The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch
* `shared_ptr<term>` being `nullptr` could mean:
  * It represents a cql value `null`
  * That there is no value, like `std::nullopt` (for example in `attributes.hh`)
  * That it's a mistake, it shouldn't be possible

  A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`, we then know that it's not an optional value.

* On the other hand `raw_value` cast to bool means:
   * `false` - null or unset
   * `true` - some value, maybe empty

I ran a simple benchmark on my laptop to see how performance is affected:
```
build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10
```
* On master (a21b1fbb2f) I get:
  ```
  176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)

  median 176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)
  median absolute deviation: 0.00
  maximum: 176506.60
  minimum: 176506.60
  ```
* On this branch I get:
  ```
  172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)

  median 172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)
  median absolute deviation: 0.00
  maximum: 172225.30
  minimum: 172225.30
  ```

Closes #9481

* github.com:scylladb/scylla:
  cql3: Remove remaining mentions of term
  cql3: Remove term
  cql3: Rename prepare_term to prepare_expression
  cql3: Make prepare_term return an expression instead of term
  cql3: expr: Add size check to evaluate_set
  cql3: expr: Add expr::contains_bind_marker
  cql3: expr: Rename find_atom to find_binop
  cql3: expr: Add find_in_expression
  cql3: Remove term in operations
  cql3: Remove term in relations
  cql3: Remove term in multi_column_restrictions
  cql3: Remove term in term_slice, rename to bounds_slice
  cql3: expr: Remove term in expression
  cql3: expr: Add evaluate_IN_list(expression, options)
  cql3: Remove term in column_condition
  cql3: Remove term in select_statement
  cql3: Remove term in update_statement
  cql3: Use internal cql format in insert_prepared_json_statement cache
  types: Add map_type_impl::serialize(range of <bytes, bytes>)
  cql3: Remove term in cql3/attributes
  cql3: expr: Add constant::view() method
  cql3: expr: Implement fill_prepare_context(expression)
  cql3: expr: add expr::visit that takes a mutable expression
  cql3: expr: Add receiver to expr::bind_variable
2021-11-30 16:39:39 +02:00
Avi Kivity
078f69c133 Merge "raft: (service) implement group 0 as a service" from Kostja
"
To ensure consistency of schema and topology changes,
Scylla needs a linearizable storage for this data
available at every member of the database cluster.

The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.

The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0), adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership, ensures the group is started when the
server boots.

The state machine for the group, i.e. the actual storage
for cluster-wide information still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.

Another Raft related service was implemented earlier: Raft Group
Registry. The purpose of the registry is to allow Scylla have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"

* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
  raft: (server) improve tracing
  raft: (metrics) fix spelling of waiters_awaken
  raft: make forwarding optional
  raft: (service) manage Raft configuration during topology changes
  raft: (service) break a dependency loop
  raft: (discovery) introduce leader discovery state machine
  system_keyspace: mark scylla_local table as always-sync commitlog
  system_keyspace: persistence for Raft Group 0 id and Raft Server Id
  raft: add a test case for adding entries on follower
  raft: (server) allow adding entries/modify config on a follower
  raft: (test) replace virtual with override in derived class
  raft: (server) fix a typo in exception message
  raft: (server) implement id() helper
  raft: (server) remove apply_dummy_entry()
  raft: (test) fix missing initialization in generator.hh
2021-11-30 16:24:51 +02:00
Raphael S. Carvalho
0d5ac845e1 compaction: Make cleanup withstand better disk pressure scenario
It's not uncommon for cleanup to be issued against an entire keyspace,
which may be composed of tons of tables. To increase chances of success
if low on space, cleanup will now start from smaller tables first, such
that bigger tables will have more space available, once they're reached,
to satisfy their space requirement.

parallel_for_each() is dropped and wasn't needed given that manager
performs per-shard serialization of cleanup jobs.
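
The ordering idea can be sketched in a few lines. This is an illustrative
Python sketch, not the C++ code from the patch; the `Table` class, its
`disk_bytes` field and the `cleanup_order` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    disk_bytes: int  # hypothetical: current on-disk size of the table


def cleanup_order(tables):
    """Return the tables sorted so that the smallest is cleaned up first."""
    return sorted(tables, key=lambda t: t.disk_bytes)


tables = [Table("big", 500), Table("small", 10), Table("medium", 120)]
# Jobs are then run one after another in this order, not in parallel.
order = [t.name for t in cleanup_order(tables)]
print(order)
```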

Refs #9504.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130133712.64517-1-raphaelsc@scylladb.com>
2021-11-30 16:15:24 +02:00
Benny Halevy
957003e73f compaction_manager: stop_compaction: wait for ongoing compactions to stop
Similar to #9313, stop_compaction should also reuse the
stop_ongoing_compactions() infrastructure and wait on ongoing
compactions of the given type to stop.

Fixes #9695

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
b9ba181d3c compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
Normally, "Stopping 0 tasks for 0 ongoing compactions for table ..."
is not very interesting so demote its log_level to debug.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
03e969dbef compaction_manager: unify stop_ongoing_compactions implementations
Now stop_ongoing_compactions(reason) is equivalent to
stop_ongoing_compactions(reason, nullptr, std::nullopt),
so share the code of the latter for the former entry point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:07 +02:00
Benny Halevy
94011bdcca compaction_manager: stop_ongoing_compactions: add compaction_type option
And make the table optional as well, so it can be used
by stop_compaction() to stop a particular compaction type on all tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:07:47 +02:00
Benny Halevy
a419759835 compaction_manager: get_compactions: get a table* parameter
Optionally get only the compactions running on the provided table.
This is required for stop_ongoing_compactions on a given table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:34 +02:00
Benny Halevy
4affa801a5 table: disable_auto_compaction: stop ongoing compactions
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.

Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compaction jobs are
running when nodetool disableautocompaction returns successfully.

Fixes #9313

Test: sstable_compaction_test,sstable_directory_test(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:34 +02:00
Benny Halevy
3c721eb228 compaction_manager: make stop_ongoing_compactions public
So it can be used directly by table code
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:29 +02:00
Raphael S. Carvalho
3006394312 compaction: Allow incremental compaction with interposer consumer
Until commit c94e6f8567, interposer consumer wouldn't work
with our GC writer, needed for incremental compaction correctness.
Now that the technical debt is gone, let's allow incremental
compaction with interposer consumer.

The only change needed is serialization of the replacer, so that two
consumers cannot step on each other's toes, like when we have concurrent
bucket writers with TWCS.

sstable_compaction_test.test_bug_6472 passes with this change,
which was added when #6472 was fixed by not allowing incremental
compaction with interposer consumer.

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>
2021-11-30 15:24:17 +02:00
Eliran Sinvani
ddd7248b3b testlib: close index_reader to avoid racing condition
In order to avoid the race condition introduced in 9dce1e4, the
index_reader should be closed prior to its destruction.
Only 4.4 and earlier releases are exposed to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it, since that is most likely what developers
who change the index reader in the future will assume.

Ref #9704 (because only 4.4 and earlier releases are vulnerable).

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #9705
2021-11-30 13:05:24 +01:00
Benny Halevy
b60d697084 table: futurize disable_auto_compactions
So it can stop ongoing compactions and wait
for them to complete.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 08:33:04 +02:00
Vlad Zolotarov
4cb245fe3c loading_cache: account unprivileged section evictions
Add a template parameter that supplies a static callbacks object used to
increment a counter of evictions from the unprivileged section.

Entries evicted from the cache while still in the unprivileged
section indicate inefficient usage of the cache, which should be
investigated.

This patch instruments the authorized_prepared_statements_cache and
prepared_statements_cache objects to provide non-empty callbacks.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 21:45:53 -05:00
Vlad Zolotarov
1a9c6d9fd3 loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
This patch implements a simple variation of LFRU eviction policy:
  * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
  * New cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction "unprivileged" section's
    least recently used entries are evicted first.

Note:
With a 2-section cache it's not enough for a new entry to have the latest timestamp
in order not to be evicted right after insertion: e.g. if all other entries
are from the privileged section.

And obviously we want to allow new cache entries to be added to a cache.

Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the problem.
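
The policy described above can be sketched as a toy model. This is a
simplified Python illustration, not the C++ loading_cache code: the class
name `TwoSectionCache` and the `hit_threshold` parameter (standing in for
SectionHitThreshold) are assumptions, and expiration and reload are left out.

```python
from collections import OrderedDict


class TwoSectionCache:
    """Toy two-section cache: new entries start unprivileged and are
    promoted once they have been read more than `hit_threshold` times."""

    def __init__(self, max_size, hit_threshold):
        self.max_size = max_size
        self.hit_threshold = hit_threshold
        self.unprivileged = OrderedDict()  # key -> (value, hits), LRU first
        self.privileged = OrderedDict()    # key -> value, LRU first

    def __len__(self):
        return len(self.unprivileged) + len(self.privileged)

    def _shrink(self, target):
        # Evict least recently used entries, unprivileged section first.
        while len(self) > max(target, 0):
            section = self.unprivileged if self.unprivileged else self.privileged
            section.popitem(last=False)

    def put(self, key, value):
        if key in self.privileged:
            self.privileged[key] = value
            self.privileged.move_to_end(key)
            return
        if key in self.unprivileged:
            _, hits = self.unprivileged[key]
            self.unprivileged[key] = (value, hits)
            self.unprivileged.move_to_end(key)
            return
        # Make room first, then insert, so the new entry is never the victim.
        self._shrink(self.max_size - 1)
        self.unprivileged[key] = (value, 0)

    def get(self, key):
        if key in self.privileged:
            self.privileged.move_to_end(key)
            return self.privileged[key]
        if key not in self.unprivileged:
            return None
        value, hits = self.unprivileged.pop(key)
        hits += 1
        if hits > self.hit_threshold:
            self.privileged[key] = value  # promoted to the second section
        else:
            self.unprivileged[key] = (value, hits)  # re-added as most recent
        return value


cache = TwoSectionCache(max_size=2, hit_threshold=1)
cache.put("a", 1)
cache.get("a")
cache.get("a")      # second read promotes "a"
cache.put("b", 2)
cache.put("c", 3)   # evicts "b", the unprivileged LRU entry, before inserting
```

Shrinking to `max_size - 1` before the insert is the reordering the note
refers to: the new entry is added only after room has been made for it.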

Fixes #8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 21:45:21 -05:00
Pavel Solodovnikov
e3f922c48b raft: write raft log in user memory
System dirty memory space is limited by 10MB capacity.
This means that memtables cannot accumulate more than
5MB before they are flushed to sstables.

This can impact performance under load.

Move the `system.raft` table to the regular dirty
memory space.

Fixes: #9692
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211129200044.1144961-1-pa.solodovnikov@scylladb.com>
2021-11-29 23:51:24 +01:00
Vlad Zolotarov
66c150769b authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
Always "touch" a prepared_statements_cache entry when it's accessed via
authorized_prepared_statements_cache.

If we don't do this it may turn out that the most recently used prepared statement doesn't have
the newest last_read timestamp and can get evicted before the not-so-recently-read statement if
we need to create space in the prepared statements cache for a new entry.

And this is going to trigger an eviction of the corresponding entry from the authorized_prepared_cache
breaking the LRU paradigm of these caches.

Fixes #9590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 17:37:25 -05:00
Nadav Har'El
d9c5c4eab6 test/alternator: tests for Select parameter in GSI and LSI
We already have tests for the behavior of the "Select" parameter when
querying a base table, but this patch adds additional tests for its
behavior when querying a GSI or a LSI. There are some differences:
Select=ALL_PROJECTED_ATTRIBUTES is not allowed for base tables, but is
allowed - and in fact is the default - for GSI and LSI. Also, GSI may
not allow ALL_ATTRIBUTES (which is the default for base tables) if
only a subset of the attributes were projected.

The new tests xfail because the Select and Projection features have
not yet been implemented in Alternator. They pass in DynamoDB.
After this patch we have (hopefully) complete test coverage of the
Select feature, which will be helpful when we start implementing it.

Refs #5058 (Select)
Refs #5036 (Projection)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125100443.746917-1-nyh@scylladb.com>
2021-11-29 20:28:43 +01:00
Nadav Har'El
1c279118f4 test/alternator: more test cases for Select parameter
Add to the existing tests for the Select parameter of the Query and Scan
operations another check: That when Select is ALL_ATTRIBUTES or COUNT,
specifying AttributesToGet or ProjectionExpression is forbidden -
because the combination doesn't make sense.

The expanded test continues to xfail on Alternator (because the Select
parameter isn't yet implemented), and passes on DynamoDB. Strengthening
the tests for this feature will be helpful when we decide to implement it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125074128.741677-1-nyh@scylladb.com>
2021-11-29 20:28:25 +01:00
Vlad Zolotarov
cbabde9622 loading_cache::timestamped::lru_entry: refactoring
  * Store a reference to a parent (loading_cache) object instead of holding
    references to separate fields.
  * Access loading_cache fields via accessors.
  * Move the LRU "touch" logic to the loading_cache.
  * Keep only a plain "list entry" logic in the lru_entry class.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Vlad Zolotarov
9125b4545e loading_cache.hh: rearrange the code (no functional change)
Hide internal classes inside the loading_cache class:
  * Simpler calls - no need for a tricky back-referencing to access loading_cache fields.
  * Cleaner interface.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Vlad Zolotarov
fd92718f48 loading_cache: use std::pmr::polymorphic_allocator
Use std::pmr::polymorphic_allocator instead of
std::allocator - the former allows us not to define the
allocated object during the template specification.

As a result we won't have to have lru_entry defined
before loading_cache, which in turn would allow us
to rearrange classes making all classes internal to
loading_cache and hence simplifying the interface.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Raphael S. Carvalho
1f3135abb4 sstable_set: use for_each_sstable() in make_crawling_reader()
sstable_set_impl::all() may have to copy all sstables from multiple
sets, if compound. Let's avoid this overhead by using
sstable_set_impl::for_each_sstable().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211127181037.56542-1-raphaelsc@scylladb.com>
2021-11-29 19:59:39 +02:00
Michael Livshin
f0e2ada748 fix mutation_source::operator bool() for v2 factories
A mutation source is valid when it has either a v1 or v2 flat mutation
reader factory, but `operator bool()` only checks for the former.

Fixes #9697

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9698
2021-11-29 19:50:37 +02:00
Nadav Har'El
8618346331 config: automate experimental_features_t::all()
The experimental_features_t has an all() method, supposedly returning
all values of the enum - but it's easy to forget to update it when
adding a new experimental feature - and it's currently out-of-sync
(it's missing the ALTERNATOR_TTL option).
We already have another method, map(), where a new experimental feature
must be listed otherwise it can't be used, so let's just take all()'s
values from map(), automatically, instead of forcing developers to keep
both lists up-to-date.
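
The idea of deriving one list from the other can be sketched like this.
It is a Python illustration rather than the actual C++ change; the
`Feature` enum, `feature_map()` and `all_features()` are hypothetical
stand-ins, and the listed feature names are only examples.

```python
from enum import Enum


class Feature(Enum):  # hypothetical stand-in for the C++ enum
    UDF = 1
    ALTERNATOR_STREAMS = 2
    ALTERNATOR_TTL = 3


def feature_map():
    """Name -> value table; a feature must be listed here to be usable."""
    return {
        "udf": Feature.UDF,
        "alternator-streams": Feature.ALTERNATOR_STREAMS,
        "alternator-ttl": Feature.ALTERNATOR_TTL,
    }


def all_features():
    """Derive the full list from feature_map() instead of keeping a
    second hand-written list that can drift out of sync."""
    return sorted(set(feature_map().values()), key=lambda f: f.value)


print(all_features())
```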

Note that using the all() function to enable all experimental features
is not recommended - the best practice is to enable specific experimental
features, not all of them. Nevertheless, this all() function is still used
in one place - in the cql_repl tool - which uses it to enable all
experimental features.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211108135601.78460-1-nyh@scylladb.com>
2021-11-29 18:44:23 +02:00
Tomasz Grabiec
3226c5bf9d Merge 'sstables: mx: enable position fast-forwarding in reverse mode' from Kamil Braun
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.

As a preparation for the change, we extend the sstable reversing reader
random schema test with a fast-forwarding test and include some minor
fixes.

Fixes #9427.

Closes #9484

* github.com:scylladb/scylla:
  query-request: add comment about clustering ranges with non-full prefix key bounds
  sstables: mx: enable position fast-forwarding in reverse mode
  test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding
  test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call
  test: mutation_source_test: extract `forwardable_reader_to_mutation` function
  test: random_schema: fix clustering column printing in `random_schema::cql`
2021-11-29 16:01:53 +01:00
Raphael S. Carvalho
80a1ebf0f3 compaction_manager: Fix race when selecting sstables for rewrite operations
Rewrite operations are scrub, cleanup and upgrade.

Race can happen because 'selection of sstables' and 'mark sstables as
compacting' are decoupled. So any deferring point in between can lead
to a parallel compaction picking the same files. After commit 2cf0c4bbf,
files are marked as compacting before rewrite starts, but it didn't
take into account the commit c84217ad which moved retrieval of
candidates to a deferring thread, before rewrite_sstables() is even
called.

Scrub isn't affected by this because it uses a coarse grained approach
where whole operation is run with compaction disabled, which isn't good
because regular compaction cannot run until its completion.

From now on, selection of files and marking them as compacting will
be serialized by running them with compaction disabled.
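
A rough analogy of the fix, in Python: selection and marking happen as one
step, so a second caller cannot pick the same files in between. The
`CompactionState` class is hypothetical, and the lock only stands in for
"running with compaction disabled"; Scylla itself uses cooperative
scheduling rather than threads.

```python
import threading


class CompactionState:
    def __init__(self, sstables):
        self.sstables = set(sstables)
        self.compacting = set()
        self._lock = threading.Lock()  # stands in for "compaction disabled"

    def select_and_mark(self, wanted):
        """Pick candidates and mark them as compacting in one step."""
        with self._lock:
            picked = {s for s in self.sstables if wanted(s)} - self.compacting
            self.compacting |= picked
            return picked


state = CompactionState(["a", "b", "c"])
first = state.select_and_mark(lambda s: True)
second = state.select_and_mark(lambda s: True)  # nothing left to pick
```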

Now cleanup will also retrieve sstables with compaction disabled,
meaning it will no longer leave uncleaned files behind, which is
important to avoid data resurrection if node regains ownership of
data in uncleaned files.

Fixes #8168.
Refs #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>
2021-11-29 16:27:29 +02:00
Avi Kivity
bcadd8229b Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj
"
When memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.

There are mixed performance implications of this change:
- If no rows are removed, there is a ~12% penalty on writing. Read times
  are not affected. A heuristic is implemented to avoid this problem -
  compaction is executed only if there are tombstones.
- Read and write performance linearly improves with percentage of rows
  removed. At ~15% of rows removed, writes become faster than without
  compaction.
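
The heuristic can be sketched as follows. This is a deliberately
simplified Python model, not the memtable code: entries are
`(key, value)` pairs in write order, `TOMBSTONE` is a hypothetical marker,
and timestamps, range tombstones and the sstable writer are ignored.

```python
TOMBSTONE = object()  # hypothetical marker for a deletion


def flush(entries, has_tombstones):
    """Return the entries that would be written out on flush."""
    if not has_tombstones:
        # Fast path: nothing can be shadowed, so skip compaction entirely.
        return list(entries)
    out = []
    for key, value in entries:
        if value is TOMBSTONE:
            # Drop earlier writes shadowed by this tombstone; the tombstone
            # itself is kept, since it may shadow data in older sstables.
            out = [(k, v) for k, v in out if k != key]
        out.append((key, value))
    return out


compacted = flush([("a", 1), ("b", 2), ("a", TOMBSTONE)], has_tombstones=True)
print(len(compacted))
```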

In the tables below in columns 4 and 7, values below 100% denote
improvement and values over 100% denote regression. The tests were
performed on a table with 5 columns and the exact percentages will vary
across different table schemas.

1. percentage removed
2. write duration/row no compaction
3. write duration/row with compaction
4. write performance new/old
5. read duration/row no compaction
6. read duration/row with compaction
7. read performance new/old
  1           2           3          4           5           6          7
5      8.91E-07    9.64E-07    108.25%    6.05E-07    5.76E-07    95.23%
10     9.28E-07    9.94E-07    107.15%    6.14E-07    5.56E-07    90.55%
15     9.27E-07    9.21E-07    99.43%     6.24E-07    5.27E-07    84.39%
20     9.28E-07    9.03E-07    97.31%     6.19E-07    4.83E-07    78.03%
25     9.49E-07    8.58E-07    90.40%     6.40E-07    4.59E-07    71.76%
30     9.68E-07    8.28E-07    85.61%     6.35E-07    4.20E-07    66.07%
35     9.81E-07    8.07E-07    82.26%     6.38E-07    3.88E-07    60.85%
40     9.97E-07    7.81E-07    78.35%     6.43E-07    3.59E-07    55.91%
45     1.01E-06    7.59E-07    75.28%     6.45E-07    3.34E-07    51.75%
50     1.02E-06    7.30E-07    71.52%     6.55E-07    3.00E-07    45.78%
55     1.06E-06    7.08E-07    66.97%     6.65E-07    2.70E-07    40.56%
60     1.04E-06    6.87E-07    66.20%     6.62E-07    2.40E-07    36.22%
65     1.05E-06    6.56E-07    62.49%     6.60E-07    2.12E-07    32.04%
70     1.06E-06    6.34E-07    59.58%     6.66E-07    1.80E-07    27.07%
75     1.07E-06    6.09E-07    56.90%     6.69E-07    1.50E-07    22.38%
80     1.09E-06    5.84E-07    53.58%     6.80E-07    1.20E-07    17.62%
85     1.10E-06    5.56E-07    50.49%     6.83E-07    9.00E-08    13.18%
90     1.11E-06    5.33E-07    47.92%     6.90E-07    5.97E-08    8.66%
95     1.12E-06    5.07E-07    45.10%     6.93E-07    3.04E-08    4.39%
100    1.14E-06    4.87E-07    42.77%     6.97E-07    6.56E-12    0.00%

1. percentage removed
2. write instructions retired/row no compaction
3. write instructions retired/row with compaction
4. write performance new/old
5. read instructions retired/row no compaction
6. read instructions retired/row with compaction
7. read performance new/old
  1     2       3         4    5       6          7
5   10276   11188   108.88% 7735    7297    94.34%
10  10463   10891   104.09% 7797    6913    88.66%
15  10633   10596   99.65%  7852    6529    83.15%
20  10811   10300   95.27%  7910    6145    77.69%
25  10997   9998    90.92%  7976    5755    72.15%
30  11177   9707    86.85%  8033    5376    66.92%
35  11353   9412    82.90%  8092    4992    61.69%
40  11522   9111    79.07%  8143    4604    56.54%
45  11708   8819    75.32%  8208    4224    51.46%
50  11877   8520    71.74%  8259    3836    46.45%
55  12064   8228    68.20%  8325    3456    41.51%
60  12240   7928    64.77%  8382    3069    36.61%
65  12419   7635    61.48%  8440    2688    31.85%
70  12598   7339    58.26%  8499    2304    27.11%
75  12768   7043    55.16%  8549    1920    22.46%
80  12977   6747    51.99%  8616    1536    17.83%
85  13131   6451    49.13%  8673    1152    13.28%
90  13311   6155    46.24%  8731    767     8.78%
95  13487   5858    43.43%  8790    383     4.36%
100 13657   5562    40.73%  8841    0       0.00%
"

* 'add-compacting-reader-when-flushing-memtable-v6' of github.com:mikolajsieluzycki/scylla:
  memtable-sstable: Add compacting reader when flushing memtable.
  memtable-sstable: Track existence of tombstones in memtable.
2021-11-29 15:15:59 +02:00
Mikołaj Sielużycki
a88f7df195 memtable-sstable: Add compacting reader when flushing memtable.
When memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.

Performance tests of this change have shown a regression in a common
case where there are no deletes. A heuristic is employed to skip
compaction unless there are tombstones in the memtable to minimise
the impact of that issue.
2021-11-29 13:19:42 +01:00
Mikołaj Sielużycki
6dd9f63f3b memtable-sstable: Track existence of tombstones in memtable.
Add flags indicating whether the memtable contains tombstones. They can be used as a
heuristic to determine if a memtable should be compacted on
flush. It's an intermediate step until we can compact during applying
mutations on a memtable.
2021-11-29 13:06:12 +01:00
Kamil Braun
b2b242d0ad query-request: add comment about clustering ranges with non-full prefix key bounds
2021-11-29 11:10:49 +01:00
Kamil Braun
8722e0d23c sstables: mx: enable position fast-forwarding in reverse mode
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.
2021-11-29 11:10:49 +01:00
Kamil Braun
ea6310961c test: sstable_conforms_to_mutation_source_test: extend test_sstable_reversing_reader_random_schema with fast-forwarding
The test would check whether the forward and reverse readers returned
consistent results when created in non-forwarding mode with slicing.

Do the same but using fast-forwarding instead of slicing.

To do this we require a vector of `position_range`s. We also need a
vector of `clustering_range`s for the existing test. We modify the
existing `random_ranges` function to return `position_range`s instead of
`clustering_range`s since `position_range`s are easier to reason about,
especially when we consider non-full clustering key prefixes. A function
is introduced to convert a `position_range` to a `clustering_range` for
the existing test.
2021-11-29 11:10:46 +01:00
Benny Halevy
cf528d7df9 database: shutdown: don't shutdown keyspaces yet
Don't shutdown the keyspaces just yet,
since they are needed during shutdown.

FIXME: restore when #8995 is fixed and no queries are issued
after the database shuts down.

Refs #8995
Fixes #9684

Test: unit(dev)
- scylla-gdb test fails locally with #9677

DTest: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test(dev)
- running now into #8995. dtest fails with unexpected error: "storage_proxy - Exception when communicating with
  127.0.62.4, to read from system_distributed.service_levels:
  seastar::gate_closed_exception (gate closed)"

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211127083348.146649-2-bhalevy@scylladb.com>
2021-11-29 11:59:45 +02:00
Benny Halevy
93367ba55f effective_replication_map_factory: temporarily unregister outstanding maps when destroyed
The next patch will disable stopping the keyspaces
in database shutdown due to #9684.

This will leave outstanding e_r_m:s when the factory
is destroyed. They must be unregistered from the factory
so they won't try to submit_background_work()
to gently clear their contents.

Support that temporarily until shutdown is fixed
to ensure there are no outstanding e_r_m:s when
the factory is destroyed, at which point this
can turn into an internal error.

Refs #8995
Refs #9684

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211127083348.146649-1-bhalevy@scylladb.com>
2021-11-29 11:59:44 +02:00
Nadav Har'El
1e2ecd282a Merge 'Harden compaction manager remove' from Benny Halevy
This series hardens compaction_manager::remove by:
- adding debug logging around task execution and stopping.
- accessing compaction_state as lw_shared_ptr rather than via a raw pointer.
  - with that, detaching it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped.
- adding a write_lock in `remove` to make sure the lock is not held by any stray task.

Test: unit(dev), sstable_compaction_test(debug)
Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug)

Closes #9636

* github.com:scylladb/scylla:
  compaction_manager: add compaction_state when table is constructed
  compaction_manager: remove: fixup indentation
  compaction_manager: remove: detach compaction_state before stopping ongoing compactions
  compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
  compaction_manager: task: keep a reference on compaction_state
  test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
  test: sstable_utils: compact_sstables: deregister compaction also on error path
  test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregiser_compaction also on error path
  test: compaction_manager_test: add debug logging to register/deregister compaction
  test: compaction_manager_test: deregister_compaction: erase by iterator
  test: compaction_manager_test: move methods out of line
  compaction_manager: compaction_state: use counter for compaction_disabled
  compaction_manager: task: delete move and copy constructors
  compaction_manager: add per-task debug log messages
  compaction_manager: stop_ongoing_compactions: log number of tasks to stop
2021-11-28 22:12:52 +02:00
Avi Kivity
b23af15432 tests: consolidate boost xunit result files
The recent parallelization of boost unit tests caused an increase
in xml result files. This is challenging to Jenkins, since it
appears to use rpc-over-ssh to read the result files, and as a result
it takes more than an hour to read all result files when the Jenkins
main node is not on the same continent as the agent.

To fix this, merge the result files in test.py and leave one result
file per mode. Later we can leave one result file overall (integrating
the mode into the testsuite name), but that can wait.
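
Merging result files could look roughly like the sketch below. It is not
the code added to test.py; the `merge_xunit` helper is hypothetical, and
the assumption that each input is either a bare `<testsuite>` or a
`<testsuites>` wrapper is ours.

```python
import xml.etree.ElementTree as ET


def merge_xunit(xml_documents):
    """Combine several xunit documents into one <testsuites> element."""
    merged = ET.Element("testsuites")
    for doc in xml_documents:
        root = ET.fromstring(doc)
        # Accept either a bare <testsuite> or a <testsuites> wrapper.
        suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
        merged.extend(suites)
    return merged


docs = [
    '<testsuite name="a"><testcase name="t1"/></testsuite>',
    '<testsuites><testsuite name="b"><testcase name="t2"/></testsuite></testsuites>',
]
merged = merge_xunit(docs)
print(ET.tostring(merged, encoding="unicode"))
```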

Tested on a local Jenkins instance (just reading the result files,
not the entire build).

Closes #9668
2021-11-28 22:12:52 +02:00
Piotr Sarna
ecd122a1b0 Merge 'alternator: rudimentary implementation of TTL expiration service' from Nadav Har'El
In this patch series we add an implementation of an
expiration service to Alternator, which periodically scans the data in
the table, looking for expired items and deleting them.
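
One pass of such a scan can be modelled very roughly as below. This is a
Python sketch under our own assumptions, not the Alternator code: the
table is a plain dict, `expire_items` and `ttl_attribute` are hypothetical
names, and only numeric expiration times (seconds since the epoch) count.

```python
import time


def expire_items(table, ttl_attribute, now=None):
    """Delete items whose TTL attribute is a number in the past.

    Returns the keys that were deleted.
    """
    now = time.time() if now is None else now

    def is_expired(item):
        value = item.get(ttl_attribute)
        # Non-numeric or missing TTL attributes never expire an item.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return value <= now

    expired = [key for key, item in table.items() if is_expired(item)]
    for key in expired:
        del table[key]
    return expired


table = {"a": {"exp": 100}, "b": {"exp": 300}, "c": {"exp": "100"}, "d": {}}
deleted = expire_items(table, "exp", now=200)
```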

We also continue to improve the TTL test suite to cover additional
corner cases discovered during the development of the code.

This implementation is good enough to make all existing tests but one,
plus a few new ones, pass, but is still a very partial and inefficient
implementation littered with FIXMEs throughout the code. Among other
things, this initial implementation doesn't do anything reasonable about pacing of
the scan or about multiple tables, it scans entire items instead of only the
needed parts, and because each shard "owns" a different subset of the
token ranges, if a node goes down, partitions which it "owns" will not
get expired.

The current tests cannot expose these problems, so we will need to develop
additional tests for them.

Because this implementation is very partial, the Alternator TTL continues
to remain "experimental", cannot be used without explicitly enabling this
experimental feature, and must not be used for any important deployment.

Refs #5060 but doesn't close the issue (let's not close it until we have a
reasonably complete implementation - not this partial one).

Closes #9624

* github.com:scylladb/scylla:
  alternator: fix TTL expiration scanner's handling of floating point
  test/alternator: add TTL test for more data
  test/alternator: remove "xfail" tag from passing tests in test_ttl.py
  test/alternator: make test_ttl.py tests fast on Alternator
  alternator: initial implmentation of TTL expiration service
  alternator: add another unwrap_number() variant
  alternator: add find_tag() function
  test/alternator: test another corner case of TTL setting
  test/alternator: test TTL expiration for table with sort key
  test/alternator: improve basic test for TTL expiration
  test/alternator: extract is_aws() function
2021-11-28 22:12:52 +02:00
Avi Kivity
25bd945a2c Merge "reverse range scans: use the correct schema for result building" from Botond
"
Reverse queries have to use the reverse schema (query schema) for the
read itself but the table schema for the result building, according to
the established interface with the coordinator (half-reverse format).
Range scans were using the query schema for both, which produced
un-parseable reconcilable results for mutation range scans.
This series fixes this and adds unit tests to cover this previously
uncovered area.
"

Fixes #9673.

* 'reverse-range-scan-test/v1' of https://github.com/denesb/scylla:
  test/boost/multishard_mutation_query_test: add reverse read test
  test/boost/multishard_mutation_query_test: add test for combinations of limits, paging and stateful
  test/boost/multishard_mutation_query_test: generalize read_partitions_with_paged_scan()
  test/boost/multishard_mutation_query_test: add read_all_partitions_one_by_one() overload with slice
  multishard_mutation_query: fix reverse scans
  partition_slice: init all fields in copy ctor
  partition_slice: operator<<: print the entire partition row limit
  partition_slice_builder: add with_partition_row_limit()
2021-11-28 14:18:28 +02:00
Avi Kivity
ec775ba292 Merge "Remove more gms::get(_local)?_gossiper() calls" from Pavel E
"
This set covers simple but diverse cases:
- cache hitrate calculator
- repair
- system keyspace (virtual table)
- dht code
- transport event notifier

All the places just require straightforward argument passing.
And a preparation in transport -- event notifier needs a backref
to the owning server.

Remaining after this set is the snitch<->gossiper interaction
and the cache hitrate app state update from table code.

tests: unit(dev)
"

* 'br-unglobal-gossiper-cont' of https://github.com/xemul/scylla:
  transport: Use server gossiper in event notifier
  transport: Keep backreference from event_notifier
  transport: Keep gossiper on server
  dht: Pass gossiper to range_streamer::add_ranges
  dht: Pass gossiper argument to bootstrap
  system_keyspace: Keep gossiper on cluster_status_table
  code: Carry gossiper down to virtual tables creation
  repair: Use local gossiper reference
  cache_hitrate_calculator: Keep reference on gossiper
2021-11-28 14:18:28 +02:00
Tomasz Grabiec
0df14a48cf Merge "gms: features should keep 'enabled' state" from Pavel Solodovnikov
This patchset implements part of the solution of the
problem described in the https://github.com/scylladb/scylla/issues/4458.

Introduce a new key `enabled_features` in the `system.scylla_local`
table, update it when each gms feature is enabled,
then read them from the table on node startup and
perform validation and re-enable these features early.

The solution provides a way to prevent prohibited node
downgrades: when a node does not understand some features
that were enabled previously, it means it's doing a
prohibited downgrade procedure.

Also, enabling features early shortens the time frame
during which a feature is not enabled on a node, which
can also affect cluster liveness (until a node contacts others
to discover the feature state in the cluster and re-enables
the features).

Features should be enabled before commitlog starts replaying
since some features affect storage (for example, when
determining used sstable format).

* manmanson/persist_enabled_features_v8:
  gms: feature_service: re-enable features on node startup
  gms: gossiper: maybe_enable_features() should enable features in seastar::async context
  gms: feature_service: expose registered features map
  gms: feature_service: persist enabled features
  gms: move `to_feature_set()` function from gossiper to feature_service
2021-11-28 14:18:28 +02:00
Pavel Solodovnikov
1365e2f13e gms: feature_service: re-enable features on node startup
Re-enable previously persisted enabled features on node
startup. The features list to be enabled is read from
`system.scylla_local#enabled_features`.

In case an unknown feature is encountered, the node
fails to boot with an exception, because that means
the node is doing a prohibited downgrade procedure.

Features should be enabled before commitlog starts replaying
since some features affect storage (for example, when
determining used sstable format).
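
The startup check described above can be sketched in Python. This is an
illustration only, not the feature_service code: the function name
`reenable_persisted_features`, the dict of known features and the
comma-separated encoding are all assumptions made for the example.

```python
from typing import Dict

def reenable_persisted_features(persisted: str, known: Dict[str, bool]) -> None:
    """Re-enable features listed in a persisted comma-separated string.

    `known` maps every feature name this node understands to its enabled
    flag (hypothetical representation). An unknown name means the node is
    older than the data it is reading, so startup must fail.
    """
    names = [name for name in persisted.split(",") if name]
    unknown = [name for name in names if name not in known]
    if unknown:
        # Validate everything first so that nothing is enabled on failure.
        raise RuntimeError(
            "prohibited downgrade: unknown features " + ", ".join(unknown))
    for name in names:
        known[name] = True
```

In this sketch validation happens before any feature is enabled, so a
failed boot leaves the in-memory state untouched.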

This patch implements a part of solution proposed by Tomek
in https://github.com/scylladb/scylla/issues/4458.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:24 +02:00
Pavel Solodovnikov
777985b64d gms: gossiper: maybe_enable_features() should enable features in seastar::async context
Since `gms::feature::enable()` requires `seastar::async` context
to be present.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
5b5fbb4b33 gms: feature_service: expose registered features map
This will be used for re-enabling previously enabled cluster
features, which will be introduced in later patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
a2f5ad432f gms: feature_service: persist enabled features
Save each feature enabled through the feature_service
instance in the `system.scylla_local` table under the
'enabled_features' key.

The features would be persisted only if the underlying
query context used by `db::system_keyspace` is initialized.

Since `system.scylla_local` table is essentially a
string->string map, use an ad-hoc method for serializing
enabled features set: the same as used in gossiper for
translating supported features set via gossip.

The entry should be saved before we enable the feature so
that crash-after-enable is safe.
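
A rough Python sketch of the persist-before-enable ordering follows. The
class, the plain dict standing in for `system.scylla_local`, and the
comma-separated encoding are assumptions for illustration; only the
`enabled_features` key and the `to_feature_set()` name come from this
series.

```python
from typing import Dict, Set

def serialize_features(features: Set[str]) -> str:
    # Assumed encoding: names joined with commas, sorted for stable output.
    return ",".join(sorted(features))

def to_feature_set(value: str) -> Set[str]:
    # Inverse of serialize_features(); an empty string yields an empty set.
    return {name for name in value.split(",") if name}

class FeatureService:
    """Hypothetical stand-in for a feature service backed by a key/value table."""

    def __init__(self, local_table: Dict[str, str]):
        self._local = local_table      # stands in for system.scylla_local
        self.enabled: Set[str] = set()

    def enable(self, name: str) -> None:
        persisted = to_feature_set(self._local.get("enabled_features", ""))
        persisted.add(name)
        # Persist first: a crash after this line but before the next one
        # leaves a record that startup can re-enable from.
        self._local["enabled_features"] = serialize_features(persisted)
        self.enabled.add(name)
```

If the order were reversed, a crash between the two steps could leave a
feature that was in use but never recorded.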

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
e891f874df gms: move to_feature_set() function from gossiper to feature_service
This utility will also be used for de-serialization of
persisted enabled features, which will be introduced in a
later patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Nadav Har'El
f1997be989 alternator: fix TTL expiration scanner's handling of floating point
The expiration-time attribute used by Alternator's TTL feature has a
numeric type, meaning that it may be a floating point number - not just
an integer, and implemented as big_decimal which has a separate integer
mantissa and exponent. Our code which checked expiration incorrectly
looked only at the mantissa - resulting in incorrect handling of
expiration times which have a fractional part - 123.4 was treated as
1234 instead of 123.

This patch fixes the big_decimal handling in the expiration checking,
and also extends test_ttl.py::test_ttl_expiration to check a
non-integer floating-point expiration time as well as one with an
exponent. The new checks pass on DynamoDB; on Alternator they failed
before this patch and pass with it.
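
The mantissa/exponent mix-up can be illustrated with a small Python
sketch. `BigDecimal` below is a hypothetical stand-in (an integer
mantissa plus a decimal scale), not Alternator's C++ type, and the
`<=` comparison in `is_expired()` is an assumption of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BigDecimal:
    """Hypothetical stand-in: value == unscaled * 10 ** (-scale)."""
    unscaled: int
    scale: int

def buggy_expiration_seconds(d: BigDecimal) -> int:
    # The bug described above: only the integer mantissa is consulted.
    return d.unscaled

def expiration_seconds(d: BigDecimal) -> int:
    # Apply the exponent before comparing against the clock.
    if d.scale <= 0:
        return d.unscaled * 10 ** (-d.scale)
    # Floor division drops the fractional part of values such as 123.4.
    return d.unscaled // 10 ** d.scale

def is_expired(d: BigDecimal, now: int) -> bool:
    # Assumed rule for this sketch: expired once the time is not in the future.
    return expiration_seconds(d) <= now
```

With 123.4 encoded as mantissa 1234 and scale 1, the buggy variant
yields 1234 while the fixed one yields 123.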

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
84e0004ff6 test/alternator: add TTL test for more data
The existing TTL tests use only tiny tables, so don't exercise the
expiration-time scanner's use of paging. So in this patch we add
another test with a much larger table (with 40,000 items).

To verify that this test indeed checks paging, I stopped the scanner's
iteration after one page, and saw that this test starts failing (but
the smaller tests all pass).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
baea76c33b test/alternator: remove "xfail" tag from passing tests in test_ttl.py
Most tests in test_ttl.py now pass, so remove their "xfail" tag.
The only remaining failing test is test_ttl_expiration_streams -
which cannot yet pass because the expiration event is not yet marked.

Note that the fact that almost all tests for Alternator's TTL feature
now pass does not mean the feature is complete. The current implementation
is very partial and inefficient, and only works reasonably in tests on
a single node. The current tests cannot expose these problems, so
we will need to develop additional tests for them. The tests will of
course remain useful to see that as the implementation continues to
improve, none of the tests that already work will break.

The Alternator TTL continues to remain "experimental", cannot be used
without explicitly enabling this experimental feature, and must not be
used for any important deployment.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
0b97da5f46 test/alternator: make test_ttl.py tests fast on Alternator
The tests for the TTL feature in test/alternator/test_ttl.py take a huge
amount of time on DynamoDB - 10 to 30 minutes (!) - because DynamoDB delays
expiration of items a long time after their intended expiration times.

We intend Scylla's implementation to have a configurable delay for the
expiration scanner, which we will be able to configure to very short
delays for tests, so these tests can be made much faster on Scylla.
In this patch we change all of the tests to finish much more quickly
on Scylla.

Many of the tests still fail, because the TTL feature is not implemented
yet.

Although after this change all the tests in test_ttl.py complete in
a reasonable amount of time (around 3 seconds each), we still mark
them as "veryslow" and the "--runveryslow" flag is needed to run them.
We should consider changing this in the future, so that these tests will
run as part of our default test suite.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
13a3aca460 alternator: initial implementation of TTL expiration service
In this patch we add an incomplete implementation of an expiration
service to Alternator, which periodically scans the data in the table,
looking for expired items and deleting them.

This implementation involves a new "expiration service" which runs a
background scan in each shard. Each shard "owns" a subset of the token
ranges - the intersection of the node's primary ranges with this shard's
token ranges - and scans those ranges over and over, deleting any items
which are found expired.
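
As an illustration of the "intersection" above, here is a Python sketch
that intersects two lists of token ranges. It assumes half-open,
non-wrapping `(start, end)` integer ranges that do not overlap within a
list; real token ranges wrap around the ring, which this sketch
ignores.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end), hypothetical representation

def intersect_ranges(a: List[Range], b: List[Range]) -> List[Range]:
    """Return the parts of the ring covered by both range lists."""
    a, b = sorted(a), sorted(b)
    result: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            result.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result
```

A shard would then scan only the ranges returned for the pair (node
primary ranges, shard ranges).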

This implementation is good enough to make all existing tests but one
pass, but is still a partial and inefficient implementation littered with
FIXMEs throughout the code. Among other things, this implementation
doesn't do anything reasonable about pacing of the scan or about multiple
tables, it scans entire items instead of only the needed parts, and
if a node goes down, the part of the token range which it "owns" will not
be scanned for expiration (we need living nodes to take over the
background expiration work for dead nodes).

The current tests cannot expose these problems, so we will need to develop
additional tests for them.

Because this implementation is very partial, the Alternator TTL continues
to remain "experimental", cannot be used without explicitly enabling this
experimental feature, and must not be used for any important deployment.
The new TTL expiration service will only run (at the moment) in the
background if both the Alternator TTL experimental feature and
Alternator itself are enabled.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
f7e984110d alternator: add another unwrap_number() variant
We have an unwrap_number() function which in case of data errors (such
as the value not being a number) throws an exception with a given
string used in the message.

In this patch we add a variant of unwrap_number() - try_unwrap_number() -
which doesn't take a message, and doesn't throw exceptions - instead it
returns an empty std::optional if the given value is not a number.
This function is useful in places where we need to know whether we got a
number or not, and neither outcome is an error. We'll use it in a
following patch to parse expiration times for the TTL feature.
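
The two variants can be sketched in Python as follows. This mirrors the
idea only, not the C++ signatures: it assumes the value is a
DynamoDB-style JSON object such as `{"N": "123.4"}` and uses `Decimal`
in place of the real number type.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

def try_unwrap_number(value) -> Optional[Decimal]:
    """Non-throwing variant: None means "not a number", which is not an error."""
    if not isinstance(value, dict) or len(value) != 1 or "N" not in value:
        return None
    raw = value["N"]
    if not isinstance(raw, str):
        return None
    try:
        number = Decimal(raw)
    except InvalidOperation:
        return None
    # Decimal also parses "NaN" and "Infinity"; treat those as non-numbers.
    return number if number.is_finite() else None

def unwrap_number(value, diagnostic: str) -> Decimal:
    """Throwing variant: the caller supplies context for the error message."""
    number = try_unwrap_number(value)
    if number is None:
        raise ValueError(f"{diagnostic}: invalid number object")
    return number
```

An expiration scanner can call the non-throwing variant and simply skip
items for which it returns None.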

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:37 +02:00
Nadav Har'El
be969ff995 alternator: add find_tag() function
find_tag() returns the value of a specific tag on a table, or nothing if
it doesn't exist. Unlike the existing get_tags_of_table() above, if the
table is missing the tags extension (e.g., is not an Alternator table)
it's not an error - we return nothing, as in the case that tags exist
but not this tag.
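
The difference between the two lookups can be sketched in Python. The
schema is modeled as a plain dict with an optional "tags" extension,
which is an assumption of this example and not the real schema type.

```python
from typing import Dict, Optional

def get_tags_of_table(schema: Dict) -> Dict[str, str]:
    """Strict lookup: a table without the tags extension is an error."""
    extensions = schema.get("extensions", {})
    if "tags" not in extensions:
        raise ValueError("table is missing the tags extension")
    return extensions["tags"]

def find_tag(schema: Dict, tag: str) -> Optional[str]:
    """Lenient lookup: a missing extension and a missing tag both yield None."""
    tags = schema.get("extensions", {}).get("tags")
    if tags is None:
        return None
    return tags.get(tag)
```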

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:36 +02:00
Nadav Har'El
88f175d0a8 test/alternator: test another corner case of TTL setting
Although it isn't terribly useful, an Alternator user can enable TTL
with an expiration-time attribute set to a *key* attribute. Because
expiration times should be numeric - not other types like strings -
DynamoDB could warn the user when a chosen key attribute has a
non-numeric type (since key attributes do have fixed types!). But DynamoDB
doesn't warn about this - it simply expires nothing. This test
verifies that it indeed does this.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:36 +02:00
Nadav Har'El
a982d161ad test/alternator: test TTL expiration for table with sort key
The basic test for TTL expiration, test_ttl.py::test_ttl_expiration,
uses a table with only a partition key. Most of the item expiration
logic is exactly the same for tables that also have a sort key, but
the step of *deleting* the item is different, so let's add a test
that verifies that also in this case, the expired item is properly
deleted.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:36 +02:00
Nadav Har'El
69b4f53aa9 test/alternator: improve basic test for TTL expiration
This patch improves test_ttl.py::test_ttl_expiration in two ways:

First, it checks yet another case - that items that have the wrong type
for the expiration-time column (e.g., a string) never get expired - even
if that string happens to contain a number that looks like an expiration
time.

Second, instead of the huge 15-minute duration for this test, the
test now has a configurable duration; We still need to use a very long
duration on AWS, but in Scylla we expect to be able to configure the
TTL scan frequency, and can finish this test in just a few seconds!
We already have experimental code which makes this test pass in just
3 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:36 +02:00
Nadav Har'El
fd9a6cf851 test/alternator: extract is_aws() function
Extract a boolean function is_aws() out of the "scylla_only" fixture, so
it can be used in tests for other purposes.

For example, in the next patch the TTL tests will use it to pick
different timeouts on AWS (where TTL expiration has huge many-minute
delays) and on Scylla (which can be configured to have very short delays).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-25 22:01:36 +02:00
Konstantin Osipov
eea82f1262 raft: (server) improve tracing 2021-11-25 12:35:43 +03:00
Konstantin Osipov
0d830d4c11 raft: (metrics) fix spelling of waiters_awaken
The usage of awake and awaken is quite messy, but
awoken is more common for passive voice, so use waiters_awoken.
2021-11-25 12:35:43 +03:00
Konstantin Osipov
6d28927550 raft: make forwarding optional
In absence of abort_source or timeouts in Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.

Optionally disable forwarding in basic_generator_test.
2021-11-25 12:35:43 +03:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state in which
a removed node, or a node whose addition failed, is stuck in the Raft
configuration but is no longer present in gossip-managed
system tables. In other words, keep gossip the primary source of
truth. For this purpose, carefully choose the timing of when we
join and leave Raft group 0:

Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
Konstantin Osipov
96e2594207 raft: (service) break a dependency loop
Break a dependency loop raft_rpc <-> raft_group_registry
via raft_address_map. Pass raft_address_map to raft_rpc and
raft_gossip_failure_detector explicitly, not entire raft_group_registry.

Extract server_for_group into a helper class. It's going to be used by
raft_group0 so make it easier to reference.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
8ee88a9d8a raft: (discovery) introduce leader discovery state machine
Introduce a special state machine used to find
a leader of an existing Raft cluster or create
a new cluster.

This state machine should be used when a new
Scylla node has no persisted Raft Group 0 configuration.

The algorithm is initialized with a list of seed
IP addresses, the IP address of this server, and
this server's Raft server id.

The IP addresses are used to construct an initial list of peers.

Then, the algorithm tries to contact each peer (excluding self) from
its peer list and share the peer list with this peer, as well as
get the peer's peer list. If this peer is already part of
some Raft cluster, this information is also shared. On a response
from a peer, the current peer's peer list is updated. The
algorithm stops when all peers have exchanged peer information or
one of the peers responds with id of a Raft group and Raft
server address of the group leader.

(If any of the peers fails to respond, the algorithm re-tries
ad infinitum with a timeout).

More formally, the algorithm stops when one of the following is true:
- it finds an instance with initialized Raft Group 0, with a leader
- all the peers have been contacted, and this server's
  Raft server id is the smallest among all contacted peers.
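
The stop conditions can be sketched in Python. This is a simplified
model, not the state machine from this patch: server ids are plain
integers, `PeerReply` and `discovery_outcome` are hypothetical names,
and networking, retries and timeouts are left out.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple

@dataclass(frozen=True)
class PeerReply:
    peers: FrozenSet[int]   # peer ids known to the responder
    leader: Optional[int]   # set if the responder already has a group 0 leader

def discovery_outcome(my_id: int, peers: Set[int],
                      replies: Dict[int, PeerReply]) -> Tuple[str, Optional[int]]:
    """Decide whether discovery can stop, given the replies seen so far."""
    # Merge the peer lists learned from every response.
    known = set(peers) | {my_id}
    for reply in replies.values():
        known |= reply.peers
    # Stop condition 1: somebody already knows a group 0 leader.
    for reply in replies.values():
        if reply.leader is not None:
            return ("join", reply.leader)
    # Stop condition 2: everyone was contacted and we have the smallest id.
    others = known - {my_id}
    if others.issubset(replies) and my_id == min(known):
        return ("create", None)
    # Otherwise keep contacting peers.
    return ("continue", None)
```

In this model a server that has heard from everyone but does not hold
the smallest id keeps polling until some peer reports a leader.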
2021-11-25 11:50:38 +03:00
Konstantin Osipov
30e3227e0b system_keyspace: mark scylla_local table as always-sync commitlog
It is infrequently updated (typically once at start) but stores
critical state for this instance survival (Raft Group 0 id, Raft
server id, sstables format), so always write it to commit log
in sync mode.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
fd295850fe system_keyspace: persistence for Raft Group 0 id and Raft Server Id
Implement system_keyspace helpers to persist Raft Group 0 id
and Raft Server id.

Do not use coroutines in a template function to work around
https://bugs.llvm.org/show_bug.cgi?id=50345
2021-11-25 11:50:38 +03:00
Konstantin Osipov
65e549946f raft: add a test case for adding entries on follower 2021-11-25 11:50:38 +03:00
Konstantin Osipov
e3751068fe raft: (server) allow adding entries/modify config on a follower
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.
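
A hedged Python sketch of the bounce-and-retry rule follows. The
exception classes and the `servers` mapping are stand-ins invented for
the example, as is the `max_bounces` cap; the point is only that a
"not a leader" reply is retried against the hinted leader while an
uncertain outcome is propagated.

```python
from typing import Callable, Dict, Optional

class NotALeader(Exception):
    def __init__(self, leader: Optional[str]):
        super().__init__("not a leader")
        self.leader = leader  # hint about the current leader, if known

class CommitStatusUnknown(Exception):
    """The entry may or may not have been committed."""

def add_entry(servers: Dict[str, Callable[[str], int]], start: str,
              command: str, max_bounces: int = 10) -> int:
    """Submit `command`, following leader hints until some server accepts it."""
    target = start
    for _ in range(max_bounces):
        try:
            return servers[target](command)
        except NotALeader as e:
            if e.leader is None:
                raise  # no hint to follow
            target = e.leader  # bounce to the hinted leader and retry
        # CommitStatusUnknown is deliberately not caught: retrying could
        # add the same entry twice.
    raise RuntimeError("too many bounces")
```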

The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.

When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but
a higher term) with a local entry we're still waiting on.

This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.

Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.

Similarly to being able to add entries, allow modifying the
Raft group configuration on a follower. The implementation
works the same way as adding entries - it forwards the command
to the leader.

Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
ae5dc8e980 raft: (test) replace virtual with override in derived class
Clang 12 complains if use of override is inconsistent,
so stick to it everywhere.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
8f303844df raft: (server) fix a typo in exception message 2021-11-25 11:50:38 +03:00
Konstantin Osipov
9cde1cdf71 raft: (server) implement id() helper
There is no easy way to get server id otherwise.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
b9faf41513 raft: (server) remove apply_dummy_entry()
It's currently unused, and going forward we'd like to make
it work on the follower, which requires a new implementation.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
2763fdd3b7 raft: (test) fix missing initialization in generator.hh
A missing initialization in poll_timeout of class interpreter
could manifest itself as a sporadically failing
randomized_nemesis_test.

The test would prematurely run out of allowed limit of virtual
clock ticks.
2021-11-25 11:50:38 +03:00
Pavel Emelyanov
c04ddc5aa9 transport: Use server gossiper in event notifier
The notifier is automatically a friend of the server and can access its
private fields without additional wrappers/decorations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:56:05 +03:00
Pavel Emelyanov
2cb18c2404 transport: Keep backreference from event_notifier
The event_notifier is a private server subclass that's created once
per server to handle events from storage_service. The notifier needs
gossiper that already sits on the server, and to get it the simplest
way is to equip notifier with the server backreference. Since these
two objects are in strict 1:1 relation this reference is safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:55:41 +03:00
Pavel Emelyanov
43951318c8 transport: Keep gossiper on server
The gossiper is needed by the transport::event_notifier. There's
already gossiper reference on the transport controller, but it's
a local reference, because controller doesn't need more. This
patch upgrades controller reference to sharded<> and propagates
it further up to the server.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:54:45 +03:00
Pavel Emelyanov
831f18e392 dht: Pass gossiper to range_streamer::add_ranges
A continuation of the previous patch. The range_streamer needs
gossiper too, and is called from boot_strapper and storage_service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:54:16 +03:00
Pavel Emelyanov
6a2f6068cb dht: Pass gossiper argument to bootstrap
The boot_strapper::bootstrap needs gossiper and is called only from
the storage_service code that has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:53:56 +03:00
Pavel Emelyanov
aaf268ae58 system_keyspace: Keep gossiper on cluster_status_table
This table gets endpoint states map from global gossiper. Now there's
a local reference nearby.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:53:18 +03:00
Pavel Emelyanov
ef1960d034 code: Carry gossiper down to virtual tables creation
One of the tables needs gossiper and uses global one. This patch
prepares the fix by patching the main -> register_virtual_tables
stack with the gossiper reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:55 +03:00
Pavel Emelyanov
1168c4154b repair: Use local gossiper reference
There are two places in repair that call for global gossiper instance.
However, the repair_service already has sharded gossiper on board, and
it can use it directly in the first place.

The second place is called from inside repair_info method. This place
is fixed by keeping the gossiper reference on the info, just like it's
done for other services that info needs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:37 +03:00
Pavel Emelyanov
770d34796b cache_hitrate_calculator: Keep reference on gossiper
The calculator needs to update its app-state on gossiper. Keeping
a reference is safe -- gossiper starts early, the calculator -- at
the very very end, stop in reverse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:52:27 +03:00
Nadav Har'El
2bdc31f8a3 test/alternator: two more tests for unimplemented Select=COUNT
This patch adds two more tests for the unimplemented Select=COUNT
feature (which asks to only count queried items and not return the
actual items). Because this feature has not yet been implemented in
Alternator (Refs #5058), the new tests xfail. They pass on DynamoDB.

The two tests added here are for the interaction of the Select=COUNT
feature with filters - in one of the two supported syntaxes (QueryFilter
and FilterExpression). We want to verify that even though the user
doesn't need the content of the items (since only the counts were
requested), they are still retrieved from disk as needed for doing
proper filtering - but not returned.
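
The expected behaviour can be modeled in a few lines of Python. This is
a toy model, not the Alternator code path: `query_count` and the list
of dict items are hypothetical, while the `Count` and `ScannedCount`
field names are those of DynamoDB's Query response.

```python
from typing import Callable, Dict, Iterable

def query_count(items: Iterable[Dict], predicate: Callable[[Dict], bool]) -> Dict[str, int]:
    """Model Select=COUNT with a filter: read every item, return only counts."""
    scanned = 0
    matched = 0
    for item in items:
        scanned += 1
        # The item content is still needed here to evaluate the filter...
        if predicate(item):
            matched += 1
    # ...but no items are included in the response.
    return {"Count": matched, "ScannedCount": scanned}
```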

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211124225429.739744-1-nyh@scylladb.com>
2021-11-25 08:47:14 +01:00
Mikołaj Sielużycki
44f4ea38c5 test: Future-proof reader conversions tests.
Query time must be fetched after populate. If compaction is executed
during populate it may be executed with timestamp later than query_time.
This would cause the test's expected compaction and the compaction during
populate to be executed at different time points, producing different
results. The result would be sporadic test failures depending on relative
timing of those operations. If no other mutations happen after populate,
and query_time is later than the compaction time during population, we're
guaranteed to have the same results.
Message-Id: <20211123134808.105068-1-mikolaj.sieluzycki@scylladb.com>
2021-11-24 21:01:57 +01:00
Michał Chojnowski
08f7b81b36 dist: scylla_io_setup: run iotune for supported but not preconfigured AWS instance types
Currently, for AWS instances in `is_supported_instance_class()` other than
i3* and *gd (for example: m5d), scylla_io_setup neither provides
preconfigured values for io_properties.yaml nor runs iotune nor fails.
This silently results in a broken io_properties.yaml, like so:

disks:
  - mountpoint: /var/lib/scylla

Fix that.

Closes #9660
2021-11-24 18:28:13 +02:00
Avi Kivity
f3faa48f8b Merge "Unglobal stream manager" from Pavel E
"
There's a nest of globals in streaming/ code. The stream_manager
itself and a whole lot of its dependencies (database, sys_dist_ks,
view_update_generator and messaging). Also streaming code gets
gossiper instance via global call.

The fix is, as usual, in keeping the sharded<stream_manager> in
the main() code and pushing its reference everywhere. Somewhere
in the middle the global pointers go away, replaced with
respective references pushed to the stream_manager ctor.

This reveals an implicit dependency:

  storage_service -> stream_manager

tests: unit(dev),
       dtest.cdc_tests.cluster_reduction_with_cdc(dev)
       v1: dtest.bootstrap_test.add_node(dev)
       v1: dtest.bootstrap_test.simple_bootstrap(dev)
"

* 'br-unglobal-stream-manager-3-rebase' of https://github.com/xemul/scylla: (26 commits)
  streaming, main: Remove global stream_manager
  stream_transfer_task: Get manager from session (result-future)
  stream_transfer_task: Keep Updater fn onboard
  stream_transfer_task: Remove unused database reference
  stream_session: Use manager reference from result-future
  stream_session: Capture container() in message handler
  stream_session: Keep stream_manager reference
  stream_session: Remove unused default constructor
  stream_result_future: Use local manager reference
  stream_result_future: Keep stream_manager reference
  stream_plan: Keep stream_manager onboard
  dht: Keep stream_manager on board
  streaming, api: Use captured manager in handlers
  streaming, api: Standardize the API start/stop
  storage_service: Sanitize streaming shutdown
  storage_service: Keep streaming_manager reference
  stream_manager: Use container() in notification code
  streaming: Move get_session into stream_manager
  streaming: Use container.invoke_on in rpc handlers
  streaming: Fix interaction with gossiper
  ...
2021-11-24 12:23:18 +02:00
Pavel Emelyanov
4a34226aa6 streaming, main: Remove global stream_manager
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
50e6d334a9 stream_transfer_task: Get manager from session (result-future)
When the task starts it needs the stream_manager to get messaging
service and database from. There's a session at hand, and this
session is properly initialized, thus it has the result-future.
Voila -- we have the manager!

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
95d26bc420 stream_transfer_task: Keep Updater fn onboard
The helper function called send_mutation_fragments needs the manager
to update stats about stream_transfer_task as it goes on. Carrying the
manager over its stack is quite boring, but there's a helper send_info
object that lives there. Equip the guy with the updating function and
capture the manager by it early to kill one more usage of the global
stream_manager call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
9ee208de8d stream_transfer_task: Remove unused database reference
The send_info helper keeps it, but doesn't use it. Remove.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
a3b4d4d3cf stream_session: Use manager reference from result-future
When the stream_session initializes, it is equipped with
the shared pointer to the stream_result_future very early. In
all the places where stream_session needs the manager this
pointer is alive and the session can get the manager from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
56f5327450 stream_session: Capture container() in message handler
The stream_mutation_fragments handler needs to access the manager. Since
the handler is registered by the manager itself, it can capture the
local manager reference and use container() where appropriate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
db33607eb2 stream_session: Keep stream_manager reference
The manager is needed to get messaging service and database from.
Actually, the database can be pushed through arguments in all the
places, so effectively the session only needs the messaging. However,
the stream tasks need the manager badly and there's no other
place to get it from other than the session.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
f2ae080c63 stream_session: Remove unused default constructor
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
307a2583ee stream_result_future: Use local manager reference
The reference is present in all the required places already.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
5b748a72de stream_result_future: Keep stream_manager reference
The stream_result_future needs manager to register on it and to
unregister from it. Also the result-future is referenced from
stream_session that also needs the manager (see next patches).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
3087422d4d stream_plan: Keep stream_manager onboard
The plan itself doesn't need it, but it creates some lower level
objects that do. Next patches will use this reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
c593f8624d dht: Keep stream_manager on board
This is the preparation for the future patching. The stream_plan
creation will need the manager reference, so keep one on dht
object in advance. These are only created from the storage service
bootstrap code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
5166a98ce4 streaming, api: Use captured manager in handlers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
fd920e2420 streaming, api: Standardize the API start/stop
Today's idea of API reg/unreg is to carry the target service via
lambda captures down to the route handlers and unregister those
handlers before the target is about to stop.

This patch makes it so for the streaming API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
390a971bd8 storage_service: Sanitize streaming shutdown
Use local reference and don't use 'is_stopped' boolean as the
whole stop_transport is guarded with its own lock.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
aaa58b7b89 storage_service: Keep streaming_manager reference
The manager is drained() on drain/decommission/isolate. Since now
it's storage_service who orchestrates all of the above, it needs
an explicit reference on the target.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:35 +03:00
Pavel Emelyanov
3a9eb6af28 stream_manager: Use container() in notification code
Continuation of the previous patch -- some native stream_manager methods
can enjoy using container() call. One nit -- the [] access to the map
of statistics now runs in const context and cannot create elements, so
switch this place into .at() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:59 +03:00
Pavel Emelyanov
8ab96a8362 streaming: Move get_session into stream_manager
This makes the code a bit shorter and helps removing one more call
for global stream manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:59 +03:00
Pavel Emelyanov
228b4520a6 streaming: Use container.invoke_on in rpc handlers
This will help to reduce the usage of global manager instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:59 +03:00
Pavel Emelyanov
c2c676784a streaming: Fix interaction with gossiper
Streaming manager registers itself in gossiper, so it needs an explicit
dependency reference. Also it forgets to unregister itself, so do it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:59 +03:00
Pavel Emelyanov
73e10c7aed streaming: Move start/stop onto common rails
In case of streaming this mostly means dropping the global
init/uninit calls and replacing them with sharded<stream_manager>
instance. It's still global, but it's being fixed atm.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Pavel Emelyanov
08818ffe75 streaming: Rename .stop() into .shutdown()
The start/stop standard is becoming like

    sharded<foo> foo;
    foo.start();
    defer([] { foo.stop() });
    foo.invoke_on_all(&foo::start);
    ...
    defer([] { foo.shutdown() });
    wait_for_stop_signal();
    /* quit making the above defers self-unroll */

where .shutdown() for a service would mean "do whatever is
appropriate to start stopping, the real synchronous .stop() will
come some time later".

According to that, rename .stop() as it's really the mentioned
preparation, not real stopping.
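
A rough Python analogue of this ordering, as a sketch only: Service and
run() are hypothetical names, and contextlib.ExitStack stands in for the
defers, which unroll in reverse order of registration. It is not how
seastar implements sharded<> or defer().

```python
from contextlib import ExitStack

class Service:
    """Hypothetical stand-in for a sharded service; records lifecycle calls."""
    def __init__(self, log):
        self.log = log
        self.log.append("construct")   # analogue of sharded<foo>::start()

    def start(self):
        self.log.append("start")       # analogue of invoke_on_all(&foo::start)

    def shutdown(self):
        self.log.append("shutdown")    # begin stopping, nothing is torn down yet

    def stop(self):
        self.log.append("stop")        # the real, final stop

def run():
    log = []
    with ExitStack() as defers:
        foo = Service(log)
        defers.callback(foo.stop)      # registered first, runs last
        foo.start()
        defers.callback(foo.shutdown)  # registered last, runs first
        log.append("running")          # analogue of wait_for_stop_signal()
    return log
```

Leaving the with-block plays the role of the defers self-unrolling:
shutdown() is called before stop().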

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Pavel Emelyanov
ba298bd5c6 streaming: Remove global dependency pointers
Now they are not needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Pavel Emelyanov
6d7eb76fad streaming: Use get_stream_manager to get dependencies
Currently streaming uses global pointers to save and get a
dependency. Now all the dependencies live on the manager,
this patch changes all the places in streaming/ to get the
needed dependencies from it, not from global pointer (next
patch will remove those globals).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Pavel Emelyanov
e448774588 streaming: Move rpc verbs reg/unreg into manager
As a part of streaming start/stop unification.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Pavel Emelyanov
165971fb7f streaming: Initialize stream manager with proper deps
The stream manager is going to become central point of control
for the streaming subsys. This patch makes its dependencies
explicit and prepares the ground for further patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:15:58 +03:00
Nadav Har'El
e71131091a cql-pytest: translate Cassandra's tests for user-defined types
This is a translation of Cassandra's CQL unit test source file
validation/entities/UserTypesTest.java into our cql-pytest
framework.

This test file includes 26 tests for various features and corners of
the user-defined type feature. Two additional tests which were more
involved to translate were dropped with a comment explaining why.

All 26 tests pass on Cassandra, and all but one pass on Scylla:
The test testUDTWithUnsetValues fails on Scylla, so it is marked xfail.
It reproduces a previously-unknown Scylla bug:

  Refs #9671: In some cases, trying to assign an UNSET value into part
              of a UDT is not detected

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211124074001.708183-1-nyh@scylladb.com>
2021-11-24 10:37:15 +02:00
Botond Dénes
05ae2f88b8 test/boost/multishard_mutation_query_test: add reverse read test
Which also tests combinations of limits, paging and statefulness.

Fixes: #9328

This patch fixes the above issue by providing the test that the issue
asked for before it could be considered fixed. The bug described therein
was already fixed by an earlier patch.
2021-11-23 18:32:32 +02:00
Avi Kivity
965ea4a3fa Merge "tools/scylla-sstable: add dumpers for all components" from Botond
"
Except for TOC, Filter, Digest and CRC32, these are trivial to read with
any text/binary editor.
"

* 'scylla-sstable-dump-components' of https://github.com/denesb/scylla:
  tools/scylla-sstable: add --dump-scylla-metadata
  tools/scylla-sstable: add --dump-statistics
  tools/scylla-sstable: add --dump-summary
  tools/scylla-sstable: add --dump-compression-info
  tools/scylla-sstable: extract unsupported flag checking into function
  sstables/sstable: add scylla metadata getter
  sstables/sstable: add statistics accessor
2021-11-23 16:13:02 +02:00
Michał Sala
27ff3e7de7 storage_proxy: check partition ranges contiguity
storage_proxy::query_partition_key_range_concurrent() iterates through
vnodes produced by its argument query_ranges_to_vnodes_generator&&
ranges_to_vnodes and tries to merge them. This commit introduces
checking if subsequent vnodes are contiguous with each other, before
merging them.
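
The idea can be sketched in Python as follows. This is an illustration
under simplifying assumptions, not the storage_proxy code: ranges are
modelled as hypothetical half-open integer pairs (start, end), and
merge_contiguous() is a made-up helper name.

```python
def merge_contiguous(ranges):
    """Merge neighbouring (start, end) ranges only when they touch.

    `ranges` is assumed to be sorted; a range is merged into the previous
    one only if it starts exactly where the previous one ends.
    """
    merged = []
    for start, end in ranges:
        if merged and merged[-1][1] == start:
            # Contiguous with the previous range: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            # A gap (or the first range): start a new merged range.
            merged.append((start, end))
    return merged
```

Without the contiguity check, the gap between two non-adjacent ranges
would silently become part of the merged range.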

Fixes #9167

Closes #9175
2021-11-23 15:48:55 +02:00
Botond Dénes
9746dbe20d Merge "Add --cpus option to test.py" from Pavel Emelyanov
"
When provided, all the tests are started under 'taskset -c $value'.
This is _not_ the same as just doing 'taskset -c ... ./test.py ...'
because in the latter case test.py will compete with all the tests
for the provided cpuset and may not be able to run at desired speed.
With this option it's possible to isolate the tests themselves on a
cpuset without affecting the test.py performance.

One of the examples when test.py speed can be critical is catching
flaky tests that reveal their buggy nature only when run in a tight
environment. The combination of --cpus, --repeat and --jobs creates
nice pressure on the cpu, and keeping the test.py out of the mincer
lets it fork and exec (and wait) the tests really fast.

tests: unit(dev, with and without --cpus)
"
* 'br-test-taskset-2' of https://github.com/xemul/scylla:
  test.py: Add --cpus option
  test.py: Lazily calculate args.jobs
2021-11-23 15:06:59 +02:00
Botond Dénes
a5b5171a73 test/boost/multishard_mutation_query_test: add test for combinations of limits, paging and stateful 2021-11-23 14:23:35 +02:00
Botond Dénes
25713b1d62 test/boost/multishard_mutation_query_test: generalize read_partitions_with_paged_scan()
Extract all logic related to issuing the actual read and building the
combined result. This is now done by a ResultBuilder template object,
which allows reusing the paging logic for both mutation and data scans.
ResultBuilder implementations for both are also provided by this patch.
The paging logic is also fixed to work correctly with
per-partition-row-limit.
2021-11-23 14:23:35 +02:00
Botond Dénes
810cc8bd1c test/boost/multishard_mutation_query_test: add read_all_partitions_one_by_one() overload with slice 2021-11-23 14:23:35 +02:00
Botond Dénes
3210dee4a6 multishard_mutation_query: fix reverse scans
The read itself has to be done with the reversed schema (query schema)
but the result building has to be done with the table schema. For data
queries this doesn't matter, but replicate the distinction for
consistency (and because this might change).
2021-11-23 14:22:01 +02:00
Botond Dénes
15af80800a partition_slice: init all fields in copy ctor
_partition_row_limit_high_bits was left out for some reason, corrupting
the per-partition row limit.
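
A small Python model of the failure mode, assuming the limit is a 64-bit
value kept as two 32-bit halves (the class and function names here are
hypothetical). In this sketch the forgotten half is simply zero, whereas
in C++ the omitted member holds whatever it happens to hold.

```python
MASK32 = 0xFFFFFFFF

class Slice:
    """Hypothetical model: a 64-bit row limit stored as low/high 32-bit halves."""
    def __init__(self, limit):
        self.low_bits = limit & MASK32
        self.high_bits = (limit >> 32) & MASK32

    def limit(self):
        # Reassemble the full value from both halves.
        return (self.high_bits << 32) | self.low_bits

def buggy_copy(src):
    """Copies only the low half, like the copy ctor before the fix."""
    dst = Slice(0)
    dst.low_bits = src.low_bits
    return dst

def fixed_copy(src):
    """Copies both halves."""
    dst = buggy_copy(src)
    dst.high_bits = src.high_bits
    return dst
```

Any limit of 2**32 or more is truncated by buggy_copy() and preserved by
fixed_copy().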
2021-11-23 14:21:50 +02:00
Botond Dénes
c372b9676d partition_slice: operator<<: print the entire partition row limit
Not just the low bits.
2021-11-23 14:21:50 +02:00
Botond Dénes
3881de6353 partition_slice_builder: add with_partition_row_limit() 2021-11-23 14:21:50 +02:00
Pavel Emelyanov
bd24c1eecf Merge "Deglobalize batchlog_manager" from Benny
This series gets rid of the global batchlog_manager instance.

It does so by first allowing to set a global pointer
and instantiating stack-local instances in main and
cql_test_env.

Expose the cql_test_env batchlog_manager to tests
so they won't need the global `get_batchlog_manager()` as
used in batchlog_manager_test.test_execute_batch.

Then we pass a reference to the `sharded<db::batchlog_manager>` to
storage_service so it can be used instead of the global one.

Derive batchlog_manager from peering_sharded_service so it
gets its `container()` rather than relying on the global `get_batchlog_manager()`.

And finally, handle a circular dependency between the batchlog_manager,
that relies on the query_processor that, in turn, relies on the storage_proxy,
and the storage_proxy itself that depends on the batchlog_manager for
`mutate_atomically`.

Moved `endpoint_filter` to gossiper so `storage_proxy::mutate_atomically`
can call it via the `_gossiper` member it already has.
The function requires a gossiper object rather than a batchlog_manager
object.

Also moved `get_batch_log_mutation_for` to storage_proxy so it can be
called from `sync_write_to_batchlog` (also from the mutate_atomically path)

Test: unit(dev)
DTest: batch_test.py:TestBatch.test_batchlog_manager_issue(dev)

* git@github.com:bhalevy/scylla.git deglobalize-batchlog_manager-v2
  get rid of the global batchlog_manager
  batchlog_manager: get_batch_log_mutation_for: move to storage_proxy
  batchlog_manager: endpoint_filter: move to gossiper
  batchlog_manager: do_batch_log_replay: use lambda coroutine
  batchlog_manager: derive from peering_sharded_service
  storage_service: keep a reference to the batchlog_manager
  test: cql_test_env: expose batchlog_manager
  main: allow setting the global batchlog_manager
2021-11-23 15:10:50 +03:00
Benny Halevy
1740833324 test: sstable_compaction_test: autocompaction_control_test: use deferred_stop
To auto-stop the table and the compaction_manager, making the
test case exception-safe.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122204340.1020932-2-bhalevy@scylladb.com>
2021-11-23 12:10:12 +02:00
Benny Halevy
dfa6a494c2 test: sstable_compaction_test: require smp::count==1 where needed
These test cases may crash if running with more shards.
This is not required for test.py runs, but rather when
running the test manually using the command line.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122204340.1020932-1-bhalevy@scylladb.com>
2021-11-23 12:10:12 +02:00
Kamil Braun
a33b0649b1 Merge 'Block creation of MV on CDC Log' from Piotr Jastrzębski
Add a restriction in create_view_statement to disallow creation of MV for CDC Log table.

Also add a CQL test that checks the new restriction works.

Test: unit(dev)

Fixes #9233
Closes #9663

* 'fix9233' of https://github.com/haaawk/scylla:
  tests: Add cql test to verify it's impossible to create MV for CDC Log
  cql3: Make it impossible to create MV on CDC log
2021-11-23 10:51:02 +01:00
Nadav Har'El
3c0e7037be conf/scylla.yaml: change default Prometheus listen address
Developers often run Scylla with the default conf/scylla.yaml provided
with the source distribution. The existing default listens for all ports
but one (19042, 10000, 9042, 7000) on the *localhost* IP address (127.0.0.1).
But just one port - 9180 (Prometheus metrics) - is listened on 0.0.0.0.
This patch changes the default to be 127.0.0.1 for port 9180 as well.

Note that this just changes the default scylla.yaml - users can still
choose whatever listening address they want by changing scylla.yaml
and/or passing command line parameters.

The benefits of this patch are:
1. More consistent.
2. Better security for developers (don't open ports on external
   addresses while testing).
3. Allow test/cql-pytest/run to run in parallel with a default run of
   Scylla (currently, it fails to run Scylla on a random IP address,
   because the default run of Scylla already took port 9180 on all IP
   addresses).

The third benefit is what led me to write this patch. Fixes #8757.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210530130307.906051-1-nyh@scylladb.com>
2021-11-23 11:45:35 +02:00
Benny Halevy
ff18c0c14c messaging_service: remove unused include of db/system_keyspace.hh
As a followup to eba20c7e5d
"messaging_service: init_local_preferred_ip_cache: get preferred ips from caller".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211123080457.1247970-1-bhalevy@scylladb.com>
2021-11-23 11:12:36 +03:00
Pavel Emelyanov
dcefe98fbb test.py: Add --cpus option
The option accepts taskset-style cpulist and limits the launched tests
respectively. When specified, the default number of jobs is adjusted
accordingly; if --jobs is given it overrides this "default" as expected.
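
One plausible shape of this, as a hedged sketch rather than the actual
test.py code: count_cpus() and wrap_with_taskset() are hypothetical
helpers, and only the plain "a-b,c" cpulist form is handled (no stride
syntax).

```python
def count_cpus(cpulist):
    """Count the CPUs named by a taskset-style list such as "0-3,8"."""
    total = 0
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total

def wrap_with_taskset(cmd, cpus):
    """Prefix a test command with 'taskset -c <cpus>' when cpus is given."""
    if cpus is None:
        return list(cmd)
    return ["taskset", "-c", cpus] + list(cmd)
```

Only the launched test gets the prefix, so the runner itself stays off
the restricted cpuset; count_cpus() could feed the default job count.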

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-23 11:08:41 +03:00
Pavel Emelyanov
0246841c5e test.py: Lazily calculate args.jobs
Next patch will need to know if the --jobs option was specified or the
caller is OK with the default. One way to achieve it is to keep 0 as the
default and set the default value afterwards.
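
A minimal argparse sketch of the trick; parse_args() is a hypothetical
wrapper and the fallback value (the CPU count) is an assumption, not
necessarily what test.py computes.

```python
import argparse
import os

def parse_args(argv):
    parser = argparse.ArgumentParser()
    # 0 means "not specified"; the real default is filled in afterwards.
    parser.add_argument("--jobs", "-j", type=int, default=0)
    args = parser.parse_args(argv)

    jobs_given = args.jobs != 0
    if not jobs_given:
        # Assumed fallback for illustration only.
        args.jobs = os.cpu_count() or 1
    return args, jobs_given
```

Later code can then consult jobs_given to decide whether another option
is allowed to adjust the default.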

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-23 11:05:56 +03:00
Nadav Har'El
253387ea07 alternator: implement AttributeUpdates DELETE operation with Value
In the DynamoDB API, UpdateItem's AttributeUpdates parameter (the older
syntax, which was superseded by UpdateExpression) has a DELETE operation
that can do two different things: It can delete an attribute, or it can
delete elements from a set. Before this patch we only implemented the
first feature, and this patch implements the second.

Note that unlike the ordinary delete, the second feature - set subtraction -
is a read-modify-write operation. This is not only because of Alternator's
serialization (as JSON strings, not CRDTs) - but also fundamentally because
of the API's guarantees - e.g., the operation is supposed to fail if the
attribute's existing value is *not* a set of the correct type, so it
needs to read the old value.
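
The read-modify-write nature can be illustrated with a toy Python model
(not Alternator's code). Items are plain dicts, delete_from_set() is a
hypothetical helper, and two behaviours are assumptions of this sketch:
a missing attribute is treated as a no-op, and an attribute whose set
becomes empty is removed.

```python
def delete_from_set(item, attr, values):
    """Return a copy of `item` with `values` subtracted from the set at `attr`."""
    if not values:
        raise ValueError("DELETE with an empty set is rejected in this sketch")
    if attr not in item:
        return dict(item)              # assumption: nothing to delete is a no-op

    old = item[attr]                   # the "read" of read-modify-write
    if not isinstance(old, (set, frozenset)):
        raise TypeError(f"attribute {attr!r} is not a set")
    elem_types = {type(v) for v in old} | {type(v) for v in values}
    if len(elem_types) != 1:
        raise TypeError("set element types do not match")

    new = set(old) - set(values)       # the "modify"
    result = dict(item)
    if new:
        result[attr] = new             # the "write"
    else:
        del result[attr]               # assumption: empty sets are not stored
    return result
```

The type checks are the part that cannot be done without first reading
the existing value.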

The test for this feature begins to pass, so its "xfail" mark is
removed. After this, all tests in test/alternator/test_item.py pass :-)

Fixes #5864.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211103151206.157184-1-nyh@scylladb.com>
2021-11-23 08:51:06 +01:00
Benny Halevy
0a33762fb1 compaction_manager: add compaction_state when table is constructed
With that, it is always expected that _compaction_state[cf]
exists when compaction jobs are submitted.

Otherwise, throw std::out_of_range exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
29dd24ab46 compaction_manager: remove: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
46ac139490 compaction_manager: remove: detach compaction_state before stopping ongoing compactions
So that the compaction_state won't be found from this point on,
while stopping the ongoing compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
75a2509b07 compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
Now that compaction tasks enter the compaction_state gate there is
no point in stopping ongoing compaction in parallel to closing the gate.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
3940ffb085 compaction_manager: task: keep a reference on compaction_state
And hold its gate to make sure the compaction_state outlives
the task and can be used to wait on all tasks and functions
using it.

With that, do not access _compaction_state[cf] to acquire
shared/exclusive locks but rather get to it via
task->compaction_state so it can be detached from
_compaction_state while task is running, if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
f482d8f377 test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
It must remove itself from the compaction_manager,
that will stop_ongoing_compactions.

Without that we're hitting
```
sstable_compaction_test: ./seastar/include/seastar/core/gate.hh:56: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
```

when destroying the compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
3955829286 test: sstable_utils: compact_sstables: deregister compaction also on error path
We need to call deregister_compaction(cdata) also if
compact_sstables failed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:39:10 +02:00
Benny Halevy
d344765ec6 get rid of the global batchlog_manager
Now that it's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
744275df73 batchlog_manager: get_batch_log_mutation_for: move to storage_proxy
And rename to get_batchlog_mutation_for while at it,
as it's about the batchlog, not batch_log.

This resolves a circular dependency between the
batchlog_manager and the storage_proxy that required
it in the case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
55967a8597 batchlog_manager: endpoint_filter: move to gossiper
There's nothing in this function that actually requires
the batchlog manager instance.

It uses a random number engine that's moved along with it
to class gossiper.

This resolves a circular dependency between the
batchlog_manager and storage_proxy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
85d0bbb4fc batchlog_manager: do_batch_log_replay: use lambda coroutine
Simplify the function implementation and error handling
by invoking a lambda coroutine on shard 0 that keeps
a gate holder and semaphore units on its stack, for RAII-
style unwinding.

It then may invoke a function on another shard, using
the peered service container() to do the
replay on the destination shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
691afe1c4d batchlog_manager: derive from peering_sharded_service
So that do_batch_log_replay can get the sharded
batchlog_manager as container().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
9cde52c58f storage_service: keep a reference to the batchlog_manager
Rather than accessing the global batchlog_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
c6d82891cc test: cql_test_env: expose batchlog_manager
And use in batchlog_manager_test.test_execute_batch
to help deglobalize the batchlog_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
03039e8f8a main: allow setting the global batchlog_manager
As a prerequisite to deglobalizing the batchlog_manager,
allow setting a global pointer to it and instantiate
the sharded<db::batchlog_manager> on the main/cql_test_env
stack.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Benny Halevy
5fb66ecd03 test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregister_compaction also on error path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:09:40 +02:00
Benny Halevy
8d7909de83 test: compaction_manager_test: add debug logging to register/deregister compaction
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:09:40 +02:00
Benny Halevy
ca97c919eb test: compaction_manager_test: deregister_compaction: erase by iterator
No need to search for the task again in the list.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:09:40 +02:00
Benny Halevy
5d6ea651d7 test: compaction_manager_test: move methods out of line
No need for them to be inlined in the sstable_utils.hh.

While at it, mark constructor noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:09:40 +02:00
Benny Halevy
e7ab1f8581 compaction_manager: compaction_state: use counter for compaction_disabled
We'd like to use compaction_state::gate both for functions
running with compaction disabled and for tasks referring
to the compaction_state so that stop_ongoing_compactions
could wait on all functions referring to the state structure.

This is also cleaner with respect to not relying on
gate::use_count() when re-submitting regular compaction
when compaction is re-enabled.
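
A Python sketch of the counter idea, with hypothetical names and a plain
integer standing in for re-submitting regular compaction. It shows why
a counter handles nested "disabled" sections without consulting the
gate's use count.

```python
from contextlib import contextmanager

class CompactionState:
    """Hypothetical per-table state with a disable counter."""
    def __init__(self):
        self.compaction_disabled_counter = 0
        self.resubmits = 0

    def compaction_disabled(self):
        return self.compaction_disabled_counter > 0

    @contextmanager
    def run_with_compaction_disabled(self):
        self.compaction_disabled_counter += 1
        try:
            yield
        finally:
            self.compaction_disabled_counter -= 1
            if not self.compaction_disabled():
                # Last disabler is gone: re-submit regular compaction.
                self.resubmits += 1
```

Compaction is re-submitted exactly once, when the outermost section
exits.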

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:08:42 +02:00
Benny Halevy
3268c94e72 compaction_manager: task: delete move and copy constructors
We use a lw_shared_ptr<task> everywhere.
So prevent moving or copying task objects.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00
Benny Halevy
0cc6060552 compaction_manager: add per-task debug log messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00
Benny Halevy
1d8d472028 compaction_manager: stop_ongoing_compactions: log number of tasks to stop
get_compactions().size() may return 0 while there are
non-zero tasks to stop.

Some tasks may not be marked as `compaction_running` since
they are either:
- postponed (due to compaction manager throttling of regular compaction)
- sleeping before retry.

In both cases we still want to stop them so the log message
should reflect both the number of ongoing compactions
and the actual number of tasks we're stopping.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00
Tomasz Grabiec
1d84bc6c3b sstables: partition_index_cache: Avoid abort due to benign bad_alloc inside allocating section
shared_promise::get_shared_future() is marked noexcept, but can
allocate memory. It is invoked by sstable partition index cache inside
an allocating section, which means that allocations can throw
bad_alloc even though there is memory to reclaim, i.e. under normal
conditions.

Fix by allocating the shared_promise in stable memory, in the
standard allocator via lw_shared_ptr<>, so that it can be accessed outside
allocating section.

Fixes #9666

Tests:

  - build/dev/test/boost/sstable_partition_index_cache_test

Message-Id: <20211122165100.1606854-1-tgrabiec@scylladb.com>
2021-11-22 19:07:51 +02:00
Tomasz Grabiec
1e4da2dcce cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
2021-11-22 17:42:49 +02:00
Benny Halevy
6b6cf73b48 test: manual: gossip: stop services on exit
All sharded services that were started must
be stopped before being destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122081305.789375-3-bhalevy@scylladb.com>
2021-11-22 16:15:43 +02:00
Benny Halevy
d2703eace7 test: remove gossip_test
First, it doesn't test the gossiper so
it's unclear why we have it at all.
And it doesn't test anything more than what we test
using the cql_test_env either.

For testing gossip there is test/manual/gossip.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122081305.789375-2-bhalevy@scylladb.com>
2021-11-22 16:15:41 +02:00
Tomasz Grabiec
0d080d19fb Merge "raft: improve handling of non voting members" from Gleb
This series contains fixes for non voting members handling for stepdown
and stable leader check.

* scylla-dev/raft-stepdown-fixes-v2:
  raft: handle non voting members correctly in stepdown procedure
  raft: exclude non voting nodes from the stable leader check
  raft: fix configuration::can_vote() to work correctly with joint config
2021-11-22 12:00:44 +01:00
Benny Halevy
ce9836e2fd messaging_service: init_local_preferred_ip_cache: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211119143523.3424773-2-bhalevy@scylladb.com>
2021-11-22 13:29:21 +03:00
Benny Halevy
eba20c7e5d messaging_service: init_local_preferred_ip_cache: get preferred ips from caller
To avoid back-calling the system_keyspace from the messaging layer
let the system_keyspace get the preferred ips vector and pass it
down to the messaging_service.

This is part of the effort to deglobalize the system keyspace
and query context.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211119143523.3424773-1-bhalevy@scylladb.com>
2021-11-22 13:29:17 +03:00
Gleb Natapov
e56022a8ba migration_manager: co-routinize announce_column_family_update
The patch also removes the usage of map_reduce() because it is no longer needed
after 6191fd7701 that drops futures from the view mutation building path.
The patch preserves yielding point that map_reduce() provides though by
calling coroutine::maybe_yield() explicitly.

Message-Id: <YZoV3GzJsxR9AZfl@scylladb.com>
2021-11-22 10:48:25 +02:00
Benny Halevy
599ed69023 repair_service: do_decommission_removenode_with_repair: maybe yield
everywhere_replication_strategy::calculate_natural_endpoints
is synchronous and doesn't yield, so add maybe_yield() calls
when looping over many token ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com>
Message-Id: <20211121102606.76700-2-bhalevy@scylladb.com>
2021-11-22 10:48:25 +02:00
Benny Halevy
9d2631daaf token_metadata: calculate_pending_ranges_for_leaving: maybe yield
We see long stalls as reported in
https://github.com/scylladb/scylla/issues/8030#issuecomment-974783526

everywhere_replication_strategy::calculate_natural_endpoints
is synchronous and doesn't yield, so add maybe_yield() calls
when looping over many token ranges.

Refs #8030

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com>
Message-Id: <20211121102606.76700-1-bhalevy@scylladb.com>
2021-11-22 10:48:25 +02:00
Benny Halevy
df5ccb8884 storage_service: get_changed_ranges_for_leaving: maybe yield
We see long stalls as reported in
https://github.com/scylladb/scylla/issues/8030#issuecomment-974647167

Even before the change to use erm->get_natural_endpoints,
everywhere_replication_strategy::calculate_natural_endpoints
is synchronous and doesn't yield, so add maybe_yield() calls
when looping over all token ranges.

Refs #8030

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com>
2021-11-21 11:31:56 +02:00
Raphael S. Carvalho
2b2f0eae05 compaction: STCS: kill needless include of database.hh
This is part of work for reducing compilation time and removing
layer violation in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211120042727.114909-1-raphaelsc@scylladb.com>
2021-11-21 11:28:29 +02:00
Raphael S. Carvalho
8d9704c030 compaction: LCS: kill needless include of database.hh
This is part of work for reducing compilation time and removing
layer violation in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211120042232.106651-1-raphaelsc@scylladb.com>
2021-11-20 18:28:55 +02:00
Avi Kivity
96e9c3951c Merge "Finally stop including database.hh in compaction.cc" from Raphael
"
After this series, compaction will finally stop including database.hh.

tests: unit(debug).
"

* 'stop_including_database_hh_for_compaction' of github.com:raphaelsc/scylla:
  compaction: stop including database.hh
  compaction: switch to table_state in get_fully_expired_sstables()
  compaction: switch to table_state
  compaction: table_state: Add missing methods required by compaction
2021-11-20 18:28:05 +02:00
Raphael S. Carvalho
06405729ce compaction: stop including database.hh
After switching to table_state, compaction code can finally stop
including database.hh.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:03 -03:00
Raphael S. Carvalho
69ab5c9dff compaction: switch to table_state in get_fully_expired_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:02 -03:00
Raphael S. Carvalho
d89edad9fb compaction: switch to table_state
Make the compaction procedure switch to table_state. The only function in
compaction.cc still directly using table is
get_fully_expired_sstables(T,...), but subsequently we'll make it
switch to table_state and then we can finally stop including database.hh
in the compaction code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:01 -03:00
Raphael S. Carvalho
12137bca73 compaction: table_state: Add missing methods required by compaction
These are the only methods left for compaction to switch to
table_state, so compaction can finally stop including database.hh

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:05:59 -03:00
Piotr Jastrzebski
16de68aba5 tests: Add cql test to verify it's impossible to create MV for CDC Log
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-11-19 17:34:09 +01:00
Piotr Jastrzebski
e12ee2d9cc cql3: Make it impossible to create MV on CDC log
Fixes #9233

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-11-19 17:33:10 +01:00
Avi Kivity
f3d5b2b2b0 Merge "Add effective_replication_map factory" from Benny
"
Add a sharded locator::effective_replication_map_factory that holds
shared effective_replication_maps.

To search for e_r_m in the factory, we use a compound `factory_key`:
<replication_strategy type, replication_strategy options, token_metadata ring version>.

Start the sharded factory in main (plus cql_test_env and tools/schema_loader)
and pass a reference to it to storage_proxy and storage_service.

For each keyspace, use the registry to create the effective_replication_map.

When registered, effective_replication_map objects erase themselves
from the factory when destroyed. effective_replication_map then schedules
a background task to clear_gently its contents, protected by the e_r_m_f::stop()
function.
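
The keyed lookup and the erase-on-destroy behaviour can be modelled in
Python roughly like this. It is a sketch with hypothetical class names;
a WeakValueDictionary stands in for the explicit self-erasure, and the
background clear_gently step is not modelled.

```python
import weakref

class EffectiveReplicationMap:
    """Hypothetical placeholder for the shared map object."""
    def __init__(self, key):
        self.key = key

class Factory:
    """Registry of shared maps, keyed by a compound factory key."""
    def __init__(self):
        # Entries vanish once no caller holds the map any more.
        self._maps = weakref.WeakValueDictionary()

    def get_or_create(self, strategy_type, options, ring_version):
        # Options are turned into a sorted tuple so the key is hashable.
        key = (strategy_type, tuple(sorted(options.items())), ring_version)
        erm = self._maps.get(key)
        if erm is None:
            erm = EffectiveReplicationMap(key)
            self._maps[key] = erm
        return erm
```

Two keyspaces with the same strategy, options and ring version share one
map; a new ring version yields a new one.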

Note that for non-shard 0 instances, if the map
is not found in the registry, we construct it
by cloning the precalculated replication_map
from shard 0 to save the cpu cycles of re-calculating
it time and again on every shard.

Test: unit(dev), schema_loader_test(debug)
DTest: bootstrap_test.py:TestBootstrap.decommissioned_wiped_node_can_join_test update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_with_repair_test (dev)
"

* tag 'effective_replication_map_factory-v7' of https://github.com/bhalevy/scylla:
  effective_replication_map: clear_gently when destroyed
  database: shutdown keyspaces
  test: cql_test_env: stop view_update_generator before database shuts down
  effective_replication_map_factory: try cloning replication map from shard 0
  tools: schema_loader: start a sharded erm_factory
  storage_service: use erm_factory to create effective_replication_map
  keyspace: use erm_factory to create effective_replication_map
  effective_replication_map: erase from factory when destroyed
  effective_replication_map_factory: add create_effective_replication_map
  effective_replication_map: enable_lw_shared_from_this
  effective_replication_map: define factory_key
  keyspace: get a reference to the erm_factory
  main: pass erm_factory to storage_service
  main: pass erm_factory to storage_proxy
  locator: add effective_replication_map_factory
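
The lookup-or-create behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C++ implementation: `ErmFactory`, `make_key`, `create` and `registered` are hypothetical names, and a `WeakValueDictionary` stands in for the "erase itself from the factory when destroyed" behaviour.

```python
import weakref


class EffectiveReplicationMap:
    """Stand-in for the shared, per-key replication map object."""

    def __init__(self, key, replication_map):
        self.key = key
        self.replication_map = replication_map


class ErmFactory:
    """Hypothetical registry keyed by (strategy type, options, ring version)."""

    def __init__(self):
        # Weak values: an entry disappears once the last strong reference to
        # the map is dropped, mimicking "erase from factory when destroyed".
        self._maps = weakref.WeakValueDictionary()
        self.calculations = 0

    @staticmethod
    def make_key(strategy_type, options, ring_version):
        # Options are sorted so that dict ordering does not affect the key.
        return (strategy_type, tuple(sorted(options.items())), ring_version)

    def create(self, strategy_type, options, ring_version, calculate):
        key = self.make_key(strategy_type, options, ring_version)
        erm = self._maps.get(key)
        if erm is None:
            # Not registered yet: pay for the calculation once and share it.
            self.calculations += 1
            erm = EffectiveReplicationMap(key, calculate())
            self._maps[key] = erm
        return erm

    def registered(self):
        return len(self._maps)
```

Two calls with the same strategy type, options and ring version share one object; changing the ring version yields a new key and therefore a new calculation.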
2021-11-19 18:19:38 +02:00
Botond Dénes
f8a6857987 tools/scylla-sstable: add --dump-scylla-metadata
Dumps the scylla component.
2021-11-19 15:52:41 +02:00
Botond Dénes
a0d1c0948c tools/scylla-sstable: add --dump-statistics
Dumps the statistics component.
2021-11-19 15:52:41 +02:00
Botond Dénes
d3dbf1b0e4 tools/scylla-sstable: add --dump-summary
Dumps the summary component.
2021-11-19 15:52:41 +02:00
Botond Dénes
5f59aabc1b tools/scylla-sstable: add --dump-compression-info
Dump the compression-info component.
2021-11-19 15:52:41 +02:00
Botond Dénes
25e9e1f2d4 tools/scylla-sstable: extract unsupported flag checking into function
Some of the common flags are unsupported for dumping components other
than the data one. Currently this is checked in the only non-data
dumper: dump-index. Move this into a separate function in preparation of
adding dumpers for other components as well.
2021-11-19 15:52:41 +02:00
Botond Dénes
16e105c8e1 sstables/sstable: add scylla metadata getter 2021-11-19 15:52:41 +02:00
Botond Dénes
78a57c34f9 sstables/sstable: add statistics accessor 2021-11-19 15:52:38 +02:00
Raphael S. Carvalho
c94e6f8567 compaction: Merge GC writer into regular compaction writer
Turns out most of regular writer can be reused by GC writer, so let's
merge the latter into the former. We gain a lot of simplification,
lots of duplication is removed, and additionally, GC writer can now
be enabled with interposer as it can be created on demand by
each interposer consumer (will be done in a later patch).

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211119120841.164317-1-raphaelsc@scylladb.com>
2021-11-19 14:19:50 +02:00
GavinJE
f8c91bdd1e Update debugging.md
Line 7 does not display correctly in reality.
"crashed" appears as "chrashed" on the website.
Bug needs to be fixed.

Closes #9652
2021-11-19 14:21:53 +03:00
GavinJE
22fa7ecf99 Update compaction_controller.md
Line 15.

"ee" changed to "they"

Closes #9651
2021-11-19 14:19:20 +03:00
Benny Halevy
eed3e95704 effective_replication_map: clear_gently when destroyed
Prevent reactor stalls by gently clearing the replication_map
and token_metadata_ptr when the effective_replication_map is
destroyed.

This is done in the background, protected by the
effective_replication_map_factory::stop() method.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
cd0061dcb5 database: shutdown keyspaces
release the keyspace effective_replication_map during
shutdown so that effective_replication_map_factory
can be stopped cleanly with no outstanding e_r_m:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
1e259665fe test: cql_test_env: stop view_update_generator before database shuts down
We can't have view updates happening after the database shuts down.
In particular, mutateMV depends on the keyspace effective_replication_map
and it is going to be released when all keyspaces shut down, in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
866e1b8479 effective_replication_map_factory: try cloning replication map from shard 0
Calculating a new effective_replication_map on each shard
is expensive.  To try to save that, use the factory key to
look up an e_r_m on shard 0 and if found, use it to clone
its replication map and use that to make the shard-local
e_r_m copy.

In the future, we may want to improve that in 2 ways:
- instead of always going to shard 0, use hash(key) % smp::count
to create the first copy.
- make full copies only on NUMA nodes and keep a shared pointer
on all other shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
0a3d66839a tools: schema_loader: start a sharded erm_factory
This is required for an upcoming change to create effective_replication_map
on all shards in storage_service::replicate_to_all_cores.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
23e1344b72 storage_service: use erm_factory to create effective_replication_map
Instead of calculating the effective_replication_map
in replicate_to_all_cores, use effective_replication_map_factory::
create_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
cb240ffbae keyspace: use erm_factory to create effective_replication_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:41 +02:00
Benny Halevy
6754e6ca2b effective_replication_map: erase from factory when destroyed
The effective_replication_map_factory keeps naked pointers
to outstanding effective_replication_map:s.
These are kept valid using a shared effective_replication_map_ptr.

When the last shared ptr reference is dropped the effective_replication_map
object is destroyed, therefore the raw pointer to it in the factory
must be erased.

This now happens in ~effective_replication_map when the object
is marked as registered.

Registration happens when effective_replication_map_factory inserts
the newly created effective_replication_map to its _replication_maps
map, and the factory calls effective_replication_map::set_factory.

Note that effective_replication_map may be created temporarily
and not be inserted to the factory's map, therefore erase
is called only when required.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:52:20 +02:00
Benny Halevy
8a6fbe800f effective_replication_map_factory: add create_effective_replication_map
Make a factory key using the replication_strategy type
and config options, plus the token_metadata ring version
and use it to search for an already-registered effective_replication_map.

If not found, calculate a new effective_replication_map
and register it using the above key.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
ecba37dbfd effective_replication_map: enable_lw_shared_from_this
So an effective_replication_map_ptr can be generated
using a raw pointer by effective_replication_map_factory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
f4f41e2908 effective_replication_map: define factory_key
To be used to locate the effective_replication_map
in the to-be-introduced effective_replication_map_factory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
5947de7674 keyspace: get a reference to the erm_factory
To be used for creating effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
1d7556d099 main: pass erm_factory to storage_service
To be used for creating effective_replication_map
when token_metadata changes, and update all
keyspaces with it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
242043368e main: pass erm_factory to storage_proxy
To be used for creating the effective_replication_map per keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
3fed73e7c2 locator: add effective_replication_map_factory
It will be used further to create shared copies
of effective_replication_map based on replication_strategy
type and config options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Benny Halevy
3c0fec6b17 storage_proxy: paxos_response_handler::prune: demote write timeout error printout to debug level
Similar to other timeout handling paths, there is no need to print an
ERROR for timeout as the error is not returned anyhow.

Eventually the error will be reported at the query level
when the query times out or fails in any other way.

Also, similar to `storage_proxy::mutate_end`, traces were added
also for the error cases.

FWIW, these extraneous timeout errors cause dtest failures.
E.g. alternator_tests:AlternatorTest.test_slow_query_logging

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211118153603.2975509-1-bhalevy@scylladb.com>
2021-11-19 11:09:09 +03:00
Raphael S. Carvalho
5f7ee2e135 test: sstable_compaction_test: fix twcs_reshape_with_disjoint_set_test by using a non-coarse timestamp resolution
We're using a coarse resolution when rounding clock time for sstables to
be evenly distributed across time buckets. Let's use a better resolution,
to make sure sstables won't fall into the edges.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211118172126.34545-1-raphaelsc@scylladb.com>
2021-11-19 11:09:09 +03:00
Pavel Emelyanov
1dd08e367e test, cross-shard-barrier: Increase stall detector period
The test checks every 100 * smp::count milliseconds that a shard
had been able to make at least one step. Shards, in turn, take up
to 100 ms sleeping breaks between steps. It seems like on heavily
loaded nodes the checking period is too small and the test
stuck-detector shoots false-positives.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211118154932.25859-1-xemul@scylladb.com>
2021-11-19 11:09:09 +03:00
Mikołaj Sielużycki
87a212fa56 memtable-sstable: Fix indentation in table::try_flush_memtable_to_sstable.
Message-Id: <20211118131441.215628-3-mikolaj.sieluzycki@scylladb.com>
2021-11-19 11:09:09 +03:00
Mikołaj Sielużycki
6df07f7ff7 memtable-sstable: Convert table::try_flush_memtable_to_sstable to coroutines.
I intentionally store lambdas in variables and pass them to
with_scheduling_group using std::ref. Coroutines don't put variables
captured by lambdas on stack frame. If the lambda containing them is not
stored, the captured variables will be lost, resulting in stack/heap use
after free errors. An alternative is to capture variables, then create
local variables inside lambda bodies that contain a copy/moved version
of the captured ones. For example, if the post_flush lambda wasn't
stored in a dedicated variable, then it wouldn't be put on the coroutine
frame. At the first co_await inside of it, the lambda object along with
variables captured by it (old and &newtabs created inside square
brackets) would go away. The underlying objects (e.g. newtabs created in
the outer scope) would still be valid, but the reference to it would be
gone, causing most of the tests to fail.

Message-Id: <20211118131441.215628-2-mikolaj.sieluzycki@scylladb.com>
2021-11-19 11:09:09 +03:00
Kamil Braun
0f404c727e test: raft: randomized_nemesis_test: better RPC message receiving implementation
The previous implementation based on `delivery_queue` had a serious
defect: if receiving a message (`rpc::receive`) blocked, other messages
in the queue had to wait. This would cause, for example, `vote_request`
messages to stop being handled by a server if the server was in the middle
of applying a snapshot.

Now `rpc::receive` returns `void`, not `future<>`. Thus we no longer
need `delivery_queue`: the network message delivery function can simply
call `rpc::receive` directly. Messages which require asynchronous work
to be performed (such as snapshot application) are handled in
`rpc::receive` by spawning a background task. The number of such
background tasks is limited separately for each message type; now if
we exceed that limit, we drop other messages of this type (previously
they would queue up indefinitely and block not only other messages
of this type but different types as well).
Message-Id: <20211116163316.129970-1-kbraun@scylladb.com>
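
The receive scheme described above (a synchronous receive that spawns bounded background work and drops messages over the limit) can be sketched with asyncio. This is a rough model, not the test's C++ code: `Receiver`, its `limits` mapping, the default limit of 1 and the `demo` coroutine are all assumptions made for illustration.

```python
import asyncio
from collections import Counter


class Receiver:
    """Hypothetical model: receive() never blocks the caller."""

    def __init__(self, limits):
        self._limits = dict(limits)      # max background tasks per message type
        self._in_flight = Counter()
        self.dropped = Counter()
        self._tasks = set()

    def receive(self, msg_type, work):
        # Synchronous: either spawn a background task or drop the message.
        if self._in_flight[msg_type] >= self._limits.get(msg_type, 1):
            self.dropped[msg_type] += 1
            return False
        self._in_flight[msg_type] += 1
        task = asyncio.create_task(self._run(msg_type, work))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return True

    async def _run(self, msg_type, work):
        try:
            await work()
        finally:
            self._in_flight[msg_type] -= 1

    async def drain(self):
        await asyncio.gather(*list(self._tasks))


async def demo():
    receiver = Receiver({"snapshot": 1})
    gate = asyncio.Event()

    async def slow():                    # e.g. applying a snapshot
        await gate.wait()

    async def quick():                   # e.g. handling a vote_request
        pass

    results = [
        receiver.receive("snapshot", slow),
        receiver.receive("snapshot", slow),      # over the limit: dropped
        receiver.receive("vote_request", quick), # not blocked by the snapshot
    ]
    gate.set()
    await receiver.drain()
    return results, dict(receiver.dropped)
```

In this model a slow snapshot only causes further snapshot messages to be dropped; messages of other types are still accepted immediately.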
2021-11-19 11:09:09 +03:00
Botond Dénes
a51529dd15 protocol_servers: strengthen guarantees of listen_addresses()
In early versions of the series which proposed protocol servers, the
interface had two methods answering pretty much the same question of
whether the server is running or not:
* listen_addresses(): empty list -> server not running
* is_server_running()

To reduce redundancy and to avoid possible inconsistencies between the
two methods, `is_server_running()` was scrapped, but re-added by a
follow-up patch because `listen_addresses()` proved to be unreliable as
a source for whether the server is running or not.
This patch restores the previous state of having only
`listen_addresses()` with two additional changes:
* rephrase the comment on `listen_addresses()` to make it clear that
  implementations must return empty list when the server is not running;
* those implementations that have a reliable source of whether the
  server is running or not, use it to force-return an empty list when
  the server is not running

Tests: dtest(nodetool_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211117062539.16932-1-bdenes@scylladb.com>
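
The contract described above can be illustrated with a small Python sketch. The class and helper names below are hypothetical and only model the idea that an empty address list is the single source of truth for "not running".

```python
class ProtocolServer:
    """Hypothetical server that tracks its own running state."""

    def __init__(self, addresses):
        self._configured = list(addresses)
        self._running = False

    def start(self):
        self._running = True

    def stop(self):
        self._running = False

    def listen_addresses(self):
        # Contract: must return an empty list when the server is not running,
        # even though the configured addresses are still known.
        return list(self._configured) if self._running else []


def is_server_running(server):
    # A generic API can derive the running state from the address list alone.
    return bool(server.listen_addresses())
```

With this contract, a caller never needs a second method that could disagree with `listen_addresses()`.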
2021-11-19 11:09:09 +03:00
Gleb Natapov
814dea3600 raft: handle non voting members correctly in stepdown procedure
For leader stepdown purposes a non voting member is not different
from a node outside of the config. The patch makes the relevant code paths
check for both conditions.
2021-11-18 11:35:29 +02:00
Gleb Natapov
6a9b3cdb49 raft: exclude non voting nodes from the stable leader check
If a node is a non voting member it cannot be a leader, so the stable
leader rule should not be applied to it. This patch aligns non voting
node behaviour with a node that was removed from the cluster. Both of
them stepdown from leader position if they happen to be a leader when
the state change occurred.
2021-11-18 11:18:13 +02:00
Raphael S. Carvalho
4b1bb26d5a compaction: Make maybe_replace_exhausted_sstables_by_sst() more robust
Make it more robust by tracking both partial and sealed sstables.
This way, maybe_r__e__s__by_sst() won't pick partial sstables as
part of incremental compaction. It works today because interposer
consumer isn't enabled with incremental compaction, so there's
a single consumer which will have sealed the sstable before
the function for early replacement is called, but the story is
different if both are enabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211117135817.16274-1-raphaelsc@scylladb.com>
2021-11-17 17:21:53 +02:00
Avi Kivity
bc75e2c1d1 treewide: wrap runtime formats with fmt::runtime for fmt 8
fmt 8 checks format strings at compile time, and requires that
non-compile-time format strings be wrapped with fmt::runtime().

Do that, and to allow coexistence with fmt 7, supply our own
do-nothing version of fmt::runtime() if needed. Strictly speaking
we shouldn't be introducing names into the fmt namespace, but this
is transitional only.

Closes #9640
2021-11-17 15:21:36 +02:00
Gleb Natapov
6744b466e4 cql3: co-routinize alter_type_statement::announce_migration
Message-Id: <YZUAlx3fHdVRSlqX@scylladb.com>
2021-11-17 15:20:37 +02:00
Gavin Howell
28f8c3987e docs/alternator: copyedit alternator.md
Line 41.
Grammar correction needed. Unclear meaning in sentence.
word "message" added after "error". Comma added after "message".

Closes #9648
2021-11-17 15:06:21 +02:00
Gavin Howell
7b0a5cdeb2 docs/alternator: typo in compatibility.md
Line 170.
"PoinInTime" changed to "PointInTime"

Closes #9650
2021-11-17 15:03:40 +02:00
Calle Wilund
a8bb4dcd28 tls: Add certficate_revocation_list option for client/server encryption options
Fixes #9630

Adds support for importing a CRL (certificate revocation list). This will be
monitored and reloaded like certs/keys. Allows blacklisting individual certs.

Closes #9655
2021-11-17 14:24:22 +02:00
Nadav Har'El
82bcc2cbd2 Merge: redis: get controller in line
Merged patch series from Botond Dénes:

Redis's controller, unlike all other protocols' controllers, is called
service and is not even in the redis namespace. This is made even worse
by the redis directory also having a server.{hh,cc}, making one always
second guessing on which is what.
This series applies to the redis controller the convention used by
(almost) all other service controller classes:
* They are called controller
* They are in a file called ${protocol}/controller.{hh,cc}
* They are in a namespace ${protocol}

(Thrift is not perfectly following this either).

Botond Dénes (3):
  redis: redis_service: move in redis namespace
  redis: redis::service -> redis::controller
  redis: mv service.* -> controller.*

 configure.py                        |  2 +-
 main.cc                             | 10 ++++-----
 redis/{service.cc => controller.cc} | 32 ++++++++++++++++-------------
 redis/{service.hh => controller.hh} | 10 ++++-----
 4 files changed, 29 insertions(+), 25 deletions(-)
 rename redis/{service.cc => controller.cc} (87%)
 rename redis/{service.hh => controller.hh} (93%)
2021-11-17 14:19:36 +02:00
Botond Dénes
d4d4c0ace7 redis: mv service.* -> controller.* 2021-11-17 13:58:49 +02:00
Botond Dénes
618adeddd8 redis: redis::service -> redis::controller
Follow the naming scheme for the controller class/instance used by all
other protocol controllers:
* rename class: service -> controller;
* rename variable in main.cc: redis -> redis_ctl;
2021-11-17 13:47:44 +02:00
Botond Dénes
95510c6f92 redis: redis_service: move in redis namespace 2021-11-17 13:44:41 +02:00
Piotr Jastrzebski
033a75ff96 cdc: Don't support "on" and "off" values for preimage any more
This is an undocumented feature that causes confusion so let's get rid
of it.

tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9639
2021-11-17 11:54:11 +01:00
Tzach Livyatan
0b6c49b03e docs website: update latest branch to 4.5
Closes #9638
2021-11-17 12:33:22 +02:00
Avi Kivity
12d29b28ab raft: generator: correct constraints on members
A member variable is a reference, not a pure value, so std::same_as<>
needs to be given a reference (and clang 13 insists). However, clang
12 doesn't accept the correct constraint, so use std::convertible_to<>
as a compromise.

Closes #9642
2021-11-17 11:27:52 +02:00
Gleb Natapov
8f64a6d2d2 raft: fix configuration::can_vote() to work correctly with joint config
Fix configuration::can_vote() to return true if a node is a voting
member in any of the configs.
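
A minimal sketch of the rule, assuming a joint configuration is modelled as a current and a previous set of members, each carrying a voting flag. The `Member` and `Configuration` types here are illustrative, not the raft module's actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    id: str
    can_vote: bool = True


@dataclass
class Configuration:
    current: frozenset
    previous: frozenset = frozenset()   # non-empty only while the config is joint

    def can_vote(self, node_id):
        # A node may vote if it is a voting member in any of the configs.
        return any(
            m.id == node_id and m.can_vote
            for m in self.current | self.previous
        )
```

For example, a node that is a voter in the previous config but a non-voter in the current one can still vote while the configuration is joint.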
2021-11-17 11:06:42 +02:00
Benny Halevy
9548220b70 compaction_manager: submit_offstrategy: remove task in finally clause
Now, when the offstrategy task is stopped, it exits the repeat
loop if (!can_proceed(task)) without going through
_tasks.remove(task) - causing the assert in compaction_manager::remove
to trip, as stop_ongoing_compactions will be resolved
while the task is still listed in _tasks.

Fixes #9634

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-17 09:53:59 +02:00
Avi Kivity
720e9521f0 utils: build_id: correct fmt include
fmt::print(std::ostream&) is in <fmt/ostream.h>

Closes #9641
2021-11-17 09:02:57 +02:00
Avi Kivity
edcdbc16d3 db: heat weighted load balancing: remove unused variable total_deficit
The variable is write-only.

Closes #9647
2021-11-17 09:02:23 +02:00
Avi Kivity
e51fcc22f3 sstable_loader: add missing include <cfloat>
Needed for FLT_EPSILON

Closes #9646
2021-11-17 09:01:49 +02:00
Avi Kivity
2c1e30a12a test: network_topology_strategy_test: remove unused variable total_rf
It is write-only.

Closes #9645
2021-11-17 09:01:24 +02:00
Avi Kivity
cba07a3145 test: perf: fix format string for scheduling_latency_measurer
Need a colon to introduce the format after the default argument
specifier.

Found by fmt 8.

Closes #9644
2021-11-17 09:00:56 +02:00
Avi Kivity
6ece375fc8 repair: add missing include <cfloat>
Needed for FLT_EPSILON

Closes #9643
2021-11-17 09:00:11 +02:00
Amos Kong
32e62252e1 debian/build_offline_installer.sh: config apt to keep downloaded packages
The downloaded packages might be deleted automatically after installation,
in which case we would provide an incomplete installer to the user.

This patch configures apt to keep the downloaded packages before
installation.

Signed-off-by: Amos Kong <kongjianjun@gmail.com>

Closes #9592
2021-11-16 17:47:01 +02:00
Avi Kivity
e2c27ee743 Merge 'commitlog: recalculate disk footprint on delete_segment exceptions' from Calle Wilund
If we get errors/exceptions in delete_segments we can (and probably will) lose track of disk footprint counters. This can in turn, if using hard limits, cause us to block indefinitely on segment allocation since we might think we have a larger footprint than we actually do.

Of course, if we actually fail deleting a segment, it is 100% true that we still technically hold this disk footprint (now unreachable), but for cases where for example outside forces (or wacky tests) delete a file behind our backs, this might not be true. One could also argue that our footprint is the segments and file names we keep track of, and the rest is exterior sludge.

In any case, if we have any exceptions in delete_segments, we should recalculate disk footprint based on current state, and restart all new_segment paths etc.

Fixes #9348

(Note: this is based on previous PR #9344 - so shows these commits as well. Actual changes are only the latter two).

Closes #9349

* github.com:scylladb/scylla:
  commitlog: Recalculate footprint on delete_segment exceptions
  commitlog_test: Add test for exception in alloc w. deleted underlying file
  commitlog: Ensure failed-to-create-segment is re-deleted
  commitlog::allocate_segment_ex: Don't re-throw out of function
2021-11-16 17:44:56 +02:00
Tomasz Grabiec
bf6898a5a0 lsa: Add sanity checks around lsa_buffer operations
We've been observing hard to explain crashes recently around
lsa_buffer destruction, where the containing segment is absent in
_segment_descs which causes log_heap::adjust_up to abort. Add more
checks to catch certain impossible scenarios which can lead to this
sooner.

Refs #9192.
Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com>
2021-11-16 14:25:02 +02:00
Tomasz Grabiec
4d627affc3 lsa: Mark compact_segment_locked() as noexcept
We cannot recover from a failure in this method. The implementation
makes sure it never happens. Invariants will be broken if this
throws. Detect violations early by marking as noexcept.

We could make it exception safe and try to leave the data structures
in a consistent state but the reclaimer cannot make progress if this throws, so
it's pointless.

Refs #9192
Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com>
2021-11-16 14:23:10 +02:00
Pavel Emelyanov
a62631d441 config: Enable developer-mode by default in dev/debug modes
Other than looking sane, this change continues the tradition, founded by
the --workdir option, of freeing the developer from the annoying
necessity to type too many options when scylla is started by hand
for devel purposes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211116104815.31822-1-xemul@scylladb.com>
2021-11-16 12:53:33 +02:00
Pavel Emelyanov
ba16318457 generic_server: Keep server alive during conn background processing
There's at least one tiny race in generic_server code. The trailing
.handle_exception after the conn->process() captures this, but since the
whole continuation chain happens in the background, that this can be
released thus causing the whole lambda to execute on freed generic_server
instance. This, in turn, is not nice because captured this is used to get
a _logger from.

The fix is based on the observation that all connections pin the server
in memory until all of them (connections) are destructed. Said that, to
keep the server alive in the aforementioned lambda it's enough to make
sure the conn variable (it's lw_shared_ptr on the connection) is alive in
it. Not to generate a bunch of tiny continuations with identical set of
captures -- tail the single .then_wrapped() one and do whatever is needed
to wrap up the connection processing in it.

tests: unit(dev)
fixes: #9316

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211115105818.11348-1-xemul@scylladb.com>
2021-11-16 11:10:39 +02:00
Pavel Emelyanov
6131aea203 scylla-gdb: Handle new fair_queue::priority_class_data representation
In the full-duplex capable scheduler the _handles list contains
direct pointers on pclass data, not lw_shared_ptr's.

Most of the time this container is empty so this bug is not
triggerable right at once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211116084250.21399-1-xemul@scylladb.com>
2021-11-16 11:07:21 +02:00
Botond Dénes
b136746040 mutation_partition: deletable_row::apply(shadowable_tombstone): remove redundant maybe_shadow()
Shadowing is already checked by the underlying row_tombstone::apply().
This redundant check was introduced by a previous fix to #9483
(6a76e12768). The rest of that patch is
good.

Refs: #9483
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115091513.181233-1-bdenes@scylladb.com>
2021-11-15 17:50:41 +01:00
Kamil Braun
e48e0ff7db test: sstable_conforms_to_mutation_source_test: fix vector::erase call
... in `test_sstable_reversing_reader_random_schema`.
The call was missing an end iterator.
2021-11-15 17:32:22 +01:00
Kamil Braun
3abcbf6875 test: mutation_source_test: extract forwardable_reader_to_mutation function
The function shall be used in other places as well.
2021-11-15 17:32:17 +01:00
Kamil Braun
9f0e13dd0b test: random_schema: fix clustering column printing in random_schema::cql
Also leave a FIXME to include the key ordering in the string as well.
2021-11-15 17:30:59 +01:00
Botond Dénes
64bb48855c flat_mutation_reader: revamp flat_mutation_reader_from_mutations()
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
  reverse reads where it is not possible to follow the convention of
  passing the query schema which is reversed compared to that of the
  mutations.
* Now that we don't depend on the mutations for the schema, we can lift
  the restriction on mutations not being empty: this leads to safer
  code. When the mutations parameter is empty, an empty reader is
  created.
Add "make_" prefix to follow convention of similar reader factory
functions.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
2021-11-15 17:58:46 +02:00
Nadav Har'El
6e1344eb4f alternator: better error handling for wrongly-encoded numbers
In the DynamoDB API, a number is encoded in JSON requests as something
like: {"N": "123"} - the type is "N" and the value "123". Note that the
value of the number is encoded as a string, because the floating-point
range and accuracy of DynamoDB differs from what various JSON libraries
may support.

We have a function unwrap_number() which supported the value of the
number being encoded as an actual number, not a string. But we should
NOT support this case - DynamoDB doesn't. In this patch we add a test
that confirms that DynamoDB doesn't, and remove the unnecessary case
from unwrap_number(). The unnecessary case also had a FIXME, so it's
a good opportunity to get rid of a FIXME.

When writing the test, I noticed that the error which DynamoDB returns
in this case is SerializationException instead of the more usual
ValidationException. I don't know why, but let's also change the error
type in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211115125738.197099-1-nyh@scylladb.com>
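
The validation described above can be sketched in Python as follows. This is not Alternator's `unwrap_number()`; the exception classes are local stand-ins, and the handling of inputs other than a single `"N"` entry is an assumption made only to keep the example complete.

```python
from decimal import Decimal, InvalidOperation


class SerializationException(Exception):
    """Stand-in for the error type returned for wrongly-encoded numbers."""


class ValidationException(Exception):
    """Stand-in for the more usual validation error."""


def unwrap_number(value):
    # Expect exactly {"N": "<digits>"}; anything else is rejected.
    if not isinstance(value, dict) or list(value) != ["N"]:
        raise ValidationException("expected a value of type N")
    raw = value["N"]
    if not isinstance(raw, str):
        # {"N": 123} is NOT accepted: the number must be encoded as a string.
        raise SerializationException("number value must be encoded as a string")
    try:
        # Decimal avoids the range and accuracy limits of binary floats.
        return Decimal(raw)
    except InvalidOperation:
        raise ValidationException(f"not a valid number: {raw!r}")
```

Here `{"N": "123"}` is accepted, while `{"N": 123}` raises the serialization error rather than being silently converted.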
2021-11-15 14:47:49 +01:00
Botond Dénes
802e3642a0 Update tools/java submodule
* tools/java cb6c1d07a7...8fae618f7f (2):
  > removeNode: support ignoreNodes options
  > build: replace yum with dnf
2021-11-15 15:41:27 +02:00
Botond Dénes
f313706d80 Update tools/jmx submodule
* tools/jmx d6225c5...2c43d99 (2):
  > removeNode: support ignoreNodes options
  > build: replace yum with dnf
2021-11-15 15:41:27 +02:00
Avi Kivity
a19d00ef9b dist: scylla_raid_setup: mount XFS with online discard
Online discard asks the disk to erase flash memory cells as soon
as files are deleted. This gives the disk more freedom to choose
where to place new files, so it improves performance.

On older kernel versions, and on really bad disks, this can reduce
performance so we add an option to disable it.

Since fstrim is pointless when online discard is enabled, we
don't configure it if online discard is selected.

I tested it on an AWS i3.large instance, the flag showed up in
`mount` after configuration.

Closes #9608
2021-11-15 14:16:08 +02:00
Avi Kivity
c17101604f Merge 'Revert "scylla_util.py: return bool value on systemd_unit.is_active()"' from Takuya ASADA
In scylla_util.py, we provide `systemd_unit.is_active()` to return the `systemctl is-active` output.
When we introduced the systemd_unit class, we just returned the `systemctl is-active` output as a string, but we later changed the return value to bool (2545d7fd43).
This was to avoid a scripting bug: `if unit.is_active():` always evaluates to True, even when it returns "failed" or "inactive".
However, this was probably a mistake,
because a systemd unit does not have just two states, like "start" / "stop"; it has many states.

And we already use multiple unit states ("activating", "failed", "inactive", "active") in our Cloud image login prompt:
https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_login#L135
After we merged 2545d7fd43, the login prompt is broken, because it does not return a string as the script expects (https://github.com/scylladb/scylla-machine-image/issues/241).

I think we should revert 2545d7fd43; it should return exactly the same value as `systemctl is-active` reports.

Fixes #9627
Fixes scylladb/scylla-machine-image#241

Closes #9628

* github.com:scylladb/scylla:
  scylla_ntp_setup: use string in systemd_unit.is_active()
  Revert "scylla_util.py: return bool value on systemd_unit.is_active()"
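
The pitfall and the string-based alternative can be shown with a short Python sketch. The helpers below are hypothetical and take the state string as an argument instead of calling `systemctl`, so they only illustrate how callers might compare against the reported state.

```python
def is_running(state):
    # Compare explicitly: any non-empty string, including "failed", is truthy.
    return state == "active"


def describe_unit_state(state):
    """Map a `systemctl is-active` style state to a short message."""
    messages = {
        "active": "running",
        "activating": "starting",
        "failed": "failed",
        "inactive": "not running",
    }
    return messages.get(state, f"unknown state: {state}")
```

A caller that needs a yes/no answer uses an explicit comparison such as `is_running(state)`, while a caller like a login prompt can still distinguish all reported states.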
2021-11-15 13:56:28 +02:00
Takuya ASADA
279fabe9b4 scylla_ntp_setup: use string in systemd_unit.is_active()
Since we reverted 2545d7fd43, we need to
use string instead of bool value.
2021-11-15 19:50:31 +09:00
Takuya ASADA
d646673705 Revert "scylla_util.py: return bool value on systemd_unit.is_active()"
This reverts commit 2545d7fd43.

Fixes #9627
Fixes scylladb/scylla-machine-image#241
2021-11-15 19:50:31 +09:00
Pavel Emelyanov
4e86936850 redis: Remove stop_server deferred action from main
Commit 3f56c49a9e put redis into protocol_servers list of storage
service. Since then there's no need in explicit stop_server call
on shutdown -- the protocol_servers thing will do it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211109154259.1196-1-xemul@scylladb.com>
2021-11-15 11:58:44 +02:00
Avi Kivity
4d7a013e94 sstables: mx: writer: make large partition stats accounting branch-free
It is bad form to introduce branches just for statistics, since branches
can be expensive (even when perfectly predictable, they consume branch
history resources). Switch to simple addition instead; this should
not cause any cache misses since we already touch other statistics
earlier.

The inputs are already boolean, but cast them to boolean just so it
is clear we're adding 0/1, not a count.

Closes #9626
2021-11-15 11:28:48 +02:00
Benny Halevy
9d4262e264 protocol_server: add per-protocol is_server_running method
Change b0a2a9771f broke
the generic api implementation of
is_native_transport_running that relied on
the addresses list being empty after the server is stopped.

To fix that, this change introduces a pure virtual method:
protocol_server::is_server_running that can be implemented
by each derived class.
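
The shape of this change, sketched in Python rather than the C++ of
the actual patch: the base class declares the query and each server
answers from its own state instead of the API guessing from the
address list. `ExampleServer` and its `start`/`stop` methods are
hypothetical.

```python
from abc import ABC, abstractmethod


class ProtocolServer(ABC):
    """Illustrative stand-in for the protocol_server interface."""

    @abstractmethod
    def is_server_running(self) -> bool:
        """Each derived class reports its own running state."""


class ExampleServer(ProtocolServer):
    """Hypothetical derived server that tracks its state explicitly."""

    def __init__(self):
        self._running = False

    def start(self):
        self._running = True

    def stop(self):
        self._running = False

    def is_server_running(self) -> bool:
        return self._running
```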

Test: unit(dev)
DTest: nodetool_additional_test.py:TestNodetool.binary_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211114135248.588798-1-bhalevy@scylladb.com>
2021-11-14 16:01:31 +02:00
Avi Kivity
c9b8b84411 build: replace yum with dnf
dnf has replaced yum on Fedora and CentOS. On modern versions of Fedora,
you have to install an extra package to get the old name working, so
avoid that inconvenience and use dnf directly.

Closes #9622
2021-11-14 14:41:47 +02:00
Michael Livshin
a7511cf600 system keyspace: record partitions with too many rows
Add "rows" field to system.large_partitions.  Add partitions to the
table when they are too large or have too many rows.

Fixes #9506

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9577
2021-11-14 14:25:18 +02:00
Avi Kivity
98ec98ba36 Update seastar submodule
* seastar 04c6787b35...f8a038a0a2 (1):
  > http: disable Nagle's algorithm for the http server

Fixes #9619.
2021-11-14 13:21:06 +02:00
Avi Kivity
6cb3caaf39 Update seastar submodule
* seastar a189cdc45...04c6787b3 (12):
  > Convert std::result_of to std::invoke_result
  > Merge "IO queue full-duplex mode" from Pavel E
  > Merge "Report bytes/ops for R and W separately" from Pavel E
  > websocket: override std::exception::what() correctly
  > tests: websocket_test: remove unused lambda capture
  > Merge "Improve IO classes preemption" from Pavel E
  > Revert "Merge "Improve IO classes preemption" from Pavel E"
  > Merge "Add skeleton implementation of a WebSocket server" from Piotr S
  > Merge "Improve IO classes preemption" from Pavel E
  > io_queue: Add starvation time metrics (per-class)
  > Revert "Merge "Add skeleton implementation of a WebSocket server" from Piotr S"
  > Merge "Add skeleton implementation of a WebSocket server" from Piotr S
2021-11-13 11:56:28 +02:00
Piotr Sarna
cc544ba117 service: coroutinize client_state.cc
No functional changes, but makes the code shorter and gets rid
of a few allocations.
Coroutinizing has_column_family_access is deliberately skipped and
commented, since some callers expect this function to throw instead
of returning an exceptional future.

Message-Id: <958848a1eeeef490b162d2d2b805c8a14fc9082b.1636704996.git.sarna@scylladb.com>
2021-11-12 21:52:29 +02:00
Tomasz Grabiec
4e3b54d9fe Merge "Teach scylla-gdb.py duplex IO queues" from Pavel Emelyanov
Fresh seastar has duplex IO queues (and some more goodies). The
former one needs respective changes in scylla-gdb.py

* xemul/br-gdb-duplex-ioqueues:
  scylla-gdb: Support new fair_{queue|group}s layout
  scylla-gdb: Add boost::container::small_vector wrapper
  scylla-gdb: Fix indentation aft^w before next patch
2021-11-12 19:43:22 +01:00
Pavel Emelyanov
123286d5cd database: Remove infinite_bound_range_deletion bits
Have been unused for quite a while already

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211112150837.24125-1-xemul@scylladb.com>
2021-11-12 19:40:17 +01:00
Pavel Emelyanov
5877b84a1a range_streamer: Remove stream_plan from
The streamer creates stream_plan "on demand" and doesn't use the on-board one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211112180335.27831-1-xemul@scylladb.com>
2021-11-12 19:38:45 +01:00
Pavel Emelyanov
29892af828 scylla-gdb: Support new fair_{queue|group}s layout
In the recent seastar io_queues carry several fair_queues on board,
so do the io_groups. The queues are in boost small_vector, the groups
are in a vector of unique_ptrs. This patch adds this knowledge to
scylla-gdb script.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-12 16:16:25 +03:00
Pavel Emelyanov
c032794556 scylla-gdb: Add boost::container::small_vector wrapper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-12 16:15:51 +03:00
Pavel Emelyanov
b321cccaad scylla-gdb: Fix indentation aft^w before next patch
The upcoming seastar update will turn fair_groups and fair_queues
into arrays. Thus scylla-gdb will need to iterate over them with
some sort of loop. This patch makes the queue/group printing indentation
match this future loop body and prepares the loop variables while
at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-12 16:11:59 +03:00
Gleb Natapov
123ece611b lwt: co-routinize accept_proposal
Message-Id: <20211111163942.121827-4-gleb@scylladb.com>
2021-11-11 22:13:26 +02:00
Gleb Natapov
588768f4af lwt: co-routinize prepare_ballot
Message-Id: <20211111163942.121827-3-gleb@scylladb.com>
2021-11-11 22:13:26 +02:00
Gleb Natapov
61b2e41a23 lwt: co-routinize begin_and_repair_paxos
Message-Id: <20211111163942.121827-2-gleb@scylladb.com>
2021-11-11 22:13:26 +02:00
Avi Kivity
f74b258928 Merge "Add the system.config virtual table (updateable)" from Pavel E
"
Scylla can be configured via a bunch of config files plus
a bunch of commandline options. Collecting these altogether
can be challenging.

The proposed table solves a big portion of this by dumping
the db::config contents as a table. For convenience (and,
maybe, to facilitate Benny's CLI) it's possible to update
the 'value' column of the table with CQL request.

There exists a PR with a table that exports loglevels in a
form of a table. The updating technique used in this set
is applicable to that table as well.

tests: compilation(dev, release, debug), unit(debug)
"

* 'br-db-config-virtual-table-3' of https://github.com/xemul/scylla:
  tests: Unit test for system.config virtual table
  system_keyspace: Table with config options
  code: Push db::config down to virtual tables
  storage_proxy: Propagate virtual table exceptions messages
  table: Virtual writer hook (mutation applier)
  table: Rewrap table::apply()
  table: Mark virtual reader branch with unlikely
  utils: Add config_src::source_name() method
  utils: Ability to set_value(sstring) for an option
  utils: Internal change of config option
  utils: Mark some config_file methods noexcept
2021-11-11 22:13:26 +02:00
Yaron Kaikov
060a91431d dist/docker/debian/build_docker.sh: debian version fix for rc releases
When building a docker image we rely on the `VERSION` value from
`SCYLLA-VERSION-GEN`. For `rc` releases only, there is a difference
between the configured version (X.X.rcX) and the actual debian package
we generate (X.X~rcX).

Use a solution similar to the one I used in dcb10374a5.

Fixes: #9616

Closes #9617
2021-11-11 22:13:26 +02:00
Pavel Emelyanov
e6ef5e7e43 tests: Unit test for system.config virtual table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
4a70e0aa57 system_keyspace: Table with config options
A config option value is reported as 'text' type and contains
a string as it would look in the json config.

The table is UPDATE-able. Only the 'value' column can be set
and the value accepted must be a string. It will be converted into
the option type automatically, however in the current implementation
it's not 100% precise -- conversion is a lexicographical cast which
only works for simple types. However, liveupdate-able values are
only of those types, so it works in supported cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
947e4c9a10 code: Push db::config down to virtual tables
The db::config reference is available on the database, which
can be obtained from the virtual_table itself. The problem is that
it's a const reference, while system.config will be updateable
and will need a non-const reference.

Adding non-const get_config() on the database looks wrong. The
database shouldn't be used as config provider, even the const
one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
1ea301ad07 storage_proxy: Propagate virtual table exceptions messages
The intention is to return some meaningful info to the CQL caller
if a virtual table update fails. Unfortunately the "generic" error
reporting in CQL is not extremely flexible, so the best option
seems to be to report a regular write failure with a custom message in it.

For now this only works for virtual table errors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
5aefc48e28 table: Virtual writer hook (mutation applier)
Symmetrically to virtual reader one, add the virtual writer
callback on a table that will be in charge of applying the
provided mutation.

If a virtual table doesn't override this apply method the
dedicated exception is thrown. Next patch will catch it and
propagate back to caller, so it's a new exception type, not
existing/std one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
80460f66fc table: Rewrap table::apply()
The main motivation is to have future returning apply (to be used
by next patches). As a side effect -- indentation fix and private
dirty_memory_region_group() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 16:39:34 +03:00
Pavel Emelyanov
c3d15c3e18 table: Mark virtual reader branch with unlikely
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Pavel Emelyanov
b3fee616ea utils: Add config_src::source_name() method
To get a human-readable string from abstract source type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Pavel Emelyanov
d513034ca4 utils: Ability to set_value(sstring) for an option
An updateable system.config table will soon appear that
will push sstrings into named_value-s. Prepare for this change
by adding the respective .set_value() call. Since the update
only works for LiveUpdate-able options, and inability to do it
can be propagated back to the caller, make this method return
true/false depending on whether the update took place.
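
A rough Python sketch of the contract described above: a string comes
in, it is converted to the option type, and the return value tells the
caller whether the update took place. The class `LiveOption` and its
`parse` argument are hypothetical and do not mirror the real
named_value code.

```python
class LiveOption:
    """Hypothetical stand-in for a config option; not the real named_value."""

    def __init__(self, name, value, parse, live_updatable):
        self.name = name
        self.value = value
        self._parse = parse                  # converts the incoming string to the option type
        self._live_updatable = live_updatable

    def set_value(self, text: str) -> bool:
        """Apply a string update and report whether it took place."""
        if not self._live_updatable:
            return False                     # the caller can report this back to the user
        self.value = self._parse(text)
        return True
```

Returning a bool instead of raising keeps the decision about how to
report the refusal with the caller.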

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Pavel Emelyanov
c226c0a149 utils: Internal change of config option
When a named_value is .set_value()-d the caller may specify the reason
for this change. If not specified it's set to None, but None means
"it was there by default and wasn't changed" so it's a bit of a lie.

Add an explicit Internal reason. It's actually used by the directories
thing that updates all directories according to --workdir option.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Pavel Emelyanov
2959ebf393 utils: Mark some config_file methods noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Botond Dénes
b58403fb63 Merge "Flatten database drain" from Pavel E
"
Draining the database is now scattered across the do_drain()
method of the storage_service. Also it tells shutdown drain
from API drain.

This set packs this logic into the database::drain() method.

tests: unit(dev), start-stop-drain(dev)
"

* 'br-database-drain' of https://github.com/xemul/scylla:
  database, storage_service: Pack database::drain() method
  storage_service: Shuffle drain sequence
  storage_service, database: Move flush-on-drain code
  storage_service: Remove bool from do_drain
2021-11-11 08:19:35 +02:00
Tomasz Grabiec
a084c8c10f Merge "raft fixes for bugs found by randomized nemesis testing" from Gleb
The series fixes issues:

server may use the wrong configuration after applying a remote snapshot, causing a split-brain situation

assertion in raft::server_impl::notify_waiters()

snapshot transfer to a server removed from the configuration should be aborted

cluster may become stuck when a follower takes a snapshot after an accepted entry that the leader didn't learn about

* scylla-dev/random-test-fixes-v2:
  raft: rename rpc_configuration to configuration in fsm output
  raft: test: test case for the issue #9552
  raft: fix matching of a snapshotted log on a follower
  raft: abort snapshot transfer to a server that was removed from the configuration
  raft: fix race between snapshot application and committing of new entries
  raft: test: add test for correct last configuration index calculation during snapshot application
  raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index
  raft: correctly truncate the log in a persistence module during snapshot application
2021-11-10 20:36:53 +01:00
Avi Kivity
d949202615 Update tools/java submodule (PyYAML dependency removal)
* tools/java fd10821045...cb6c1d07a7 (1):
  > dist: remove unneeded dependency to PyYAML
2021-11-10 14:16:01 +02:00
Raphael S. Carvalho
49863ab11c tests: sstable_compaction_test: Fix test compaction_with_fully_expired_table
column_family_for_tests was missing the schema which contained the
gc_grace_seconds used by the test.

Fixes #8872.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211109163440.75592-1-raphaelsc@scylladb.com>
2021-11-09 19:21:57 +02:00
Michał Radwański
eff392073c memtable: fix gcc function argument evaluation order induced use after move
clang evaluates function arguments from left to right, while gcc does so
in reverse. Therefore, this code can be correct on clang and incorrect
on gcc:
```
f(x.sth(), std::move(x))
```

This patch fixes one such instance of this bug, in memtable.cc.

Fixes #9605.

Closes #9606
2021-11-09 19:21:57 +02:00
Avi Kivity
d2e02ea7aa Merge "Abstract table for compaction layer with table_state" from Raphael
"
table_state is being introduced for compaction subsystem, to remove table dependency
from compaction interface, fix layer violations, and also make unit testing
easier as table_state is an abstraction that can be implemented even with no
actual table backing it.

In this series, compaction strategy interfaces are switching to table_state,
and eventually, we'll make compact_sstables() switch to it too. The idea is
that no compaction code will directly reference a table object, but only work
with the abstraction instead. So compaction subdirectory can stop
including database.hh altogether, which is a great step forward.
"

* 'table_state_v5' of https://github.com/raphaelsc/scylla:
  sstable_compaction_test: switch to table_state
  compaction: stop including database.hh for compaction_strategy
  compaction: switch to table_state in estimated_pending_compactions()
  compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
  compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
  DTCS: reduce table dependency for task estimation
  LCS: reduce table dependency for task estimation
  table: Implement table_state
  compaction: make table param of get_fully_expired_sstables() const
  compaction_manager: make table param of has_table_ongoing_compaction() const
  Introduce table_state
2021-11-09 19:21:57 +02:00
Pavel Emelyanov
2005b4c330 Merge branch 'move_disable_compaction_to_manager/v6' from Raphael S. Carvalho
Move run_with_compaction_disabled() into compaction manager

run_with_compaction_disabled() living in table is a layer violation as the
logic of disabling compaction for a table T clearly belongs to manager
and table shouldn't be aware of such implementation details.
This makes things less error prone too as there's no longer a need for
coordination between table and manager.
Manager now takes all the responsibility.

* 'move_disable_compaction_to_manager/v6' of https://github.com/raphaelsc/scylla:
  compaction: move run_with_compaction_disabled() from table into compaction_manager
  compaction_manager: switch to coroutine in compaction_manager::remove()
  compaction_manager: add struct for per table compaction state
  compaction_manager: wire stop_ongoing_compactions() into remove()
  compaction_manager: introduce stop_ongoing_compactions() for a table
  compaction_manager: prevent compaction from being postponed when stopping tasks
  compaction_manager: extract "stop tasks" from stop_ongoing_compactions() into new function
2021-11-09 19:21:56 +02:00
Pavel Emelyanov
43f6a13a30 database, storage_service: Pack database::drain() method
The storage_service::do_drain() now ends up with shutting down
compaction manager, flushing CFs and shutting down commitlog.
All three belong to the database and deserve being packed into
a single database::drain() method.

A note -- these steps are cross-shard synchronized, but database
already has a barrier for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-09 19:17:38 +03:00
Pavel Emelyanov
906cac0f86 storage_service: Shuffle drain sequence
Right now the draining sequence is

 - stop transport (protocol servers, gossiper, streaming)
 - shutdown tracing
 - shutdown compaction manager
 - flush CFs
 - drain batchlog manager
 - stop migration manager
 - shutdown commitlog

This violates the layering -- both batchlog and migration managers
are higher-level services than the database, so they should be
shutdown/drained before it, i.e. -- before shutting down compaction
manager and flushing all CFs.
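
The reordering can be sketched as a stable partition of the list
above: database-level steps move to the end while every other step
keeps its relative order. The resulting sequence is inferred from this
message and is an assumption, not copied from the patch; the step
names are the ones listed above.

```python
OLD_SEQUENCE = [
    "stop transport",
    "shutdown tracing",
    "shutdown compaction manager",
    "flush CFs",
    "drain batchlog manager",
    "stop migration manager",
    "shutdown commitlog",
]

# Steps that belong to the database layer according to the message.
DATABASE_STEPS = {"shutdown compaction manager", "flush CFs", "shutdown commitlog"}


def reorder(sequence):
    """Move database-level steps after all higher-level ones, keeping order."""
    higher = [step for step in sequence if step not in DATABASE_STEPS]
    database = [step for step in sequence if step in DATABASE_STEPS]
    return higher + database
```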

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-09 19:13:56 +03:00
Pavel Emelyanov
82509c9e74 storage_service, database: Move flush-on-drain code
Flushing all CFs on shutdown is now fully managed in storage service
and it looks weird. Some better place for it seems to be the database
itself.

Moving the flushing code also implies moving the drain_progress thing
and patching the relevant API call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-09 19:11:49 +03:00
Pavel Emelyanov
aba475fe1d storage_service: Remove bool from do_drain
The do_drain() today tells shutdown drain from API drain. The reason
is that compaction manager subscribes on the main's abort signal and
drains itself early. Thus, on regular drain it needs this extra kick
that would crash if called from shutdown drain.

This differentiation should sit in the compaction manager itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-09 19:10:13 +03:00
Raphael S. Carvalho
df4bce03ae sstable_compaction_test: switch to table_state
Let's make compaction tests switch to table_state. All disabled ones
can now be reenabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:35:45 -03:00
Raphael S. Carvalho
bb5a8682f3 compaction: stop including database.hh for compaction_strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:29:47 -03:00
Raphael S. Carvalho
e2f6a47999 compaction: switch to table_state in estimated_pending_compactions()
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:28 -03:00
Raphael S. Carvalho
93ae9225f7 compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
From now on, get_major_compaction_job() will use table_state instead of
a plain reference to table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:22 -03:00
Raphael S. Carvalho
d881310b52 compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also make testing easier.

Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:52:14 -03:00
Raphael S. Carvalho
9f2d2eee98 DTCS: reduce table dependency for task estimation
Similar to LCS, let's reduce table dependency in DTCS, to make it
easier to switch to table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:50:29 -03:00
Raphael S. Carvalho
83fc59402f LCS: reduce table dependency for task estimation
let's reduce table dependency from LCS task estimation, to make
it easier to switch to table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:50:27 -03:00
Raphael S. Carvalho
03c819b8f5 table: Implement table_state
This is the first implementation of table_state, intended to be used
within compaction. It contains everything needed for compaction
strategies. Subsequently, compaction strategy interface will replace
table by table_state, and later all compaction procedures.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:45:40 -03:00
Raphael S. Carvalho
29df862f57 compaction: make table param of get_fully_expired_sstables() const
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:41:54 -03:00
Raphael S. Carvalho
ff4953206b compaction_manager: make table param of has_table_ongoing_compaction() const
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:41:52 -03:00
Raphael S. Carvalho
ccb87a6b24 Introduce table_state
This abstraction is intended to be used within compaction layer,
to replace direct usage of table. This will simplify interfaces,
and also simplify testing as an actual table is no longer
strictly required.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:41:44 -03:00
Gleb Natapov
7aac6c2086 raft: rename rpc_configuration to configuration in fsm output
The field is generic and is now used not only for rpc configuration.
2021-11-09 15:16:57 +02:00
Gleb Natapov
db25f1dbb8 raft: test: test case for the issue #9552
test that if a leader tries to append an entry that falls inside
a follower's snapshot the protocol stays alive.
2021-11-09 14:51:40 +02:00
Gleb Natapov
a59779155f raft: fix matching of a snapshotted log on a follower
There can be a situation where a leader will send to a follower entries
that the latter already snapshotted. Currently a follower considers those
to be outdated appends and it rejects them, but it may cause the
follower progress to be stuck:

- A is a leader, B is a follower, there are other followers which A used to commit entries
- A remembers that the last matched entry for B is 10, so the next entry to send is 11. A managed to commit the 11 entry using other followers
- A sends entry 11 to B
- B receives it, accepts, and updates its commit index to 11. It sends a success reply to A, but it never reaches A due to a network partition
- B takes a snapshot at index 11
- A sends entry 11 to B again
- B rejects it since it is inside the snapshot
- A receives the reject and retries from the same entry
- The same thing happens again

We should not reject such outdated entries since if they fall inside a
snapshot it means they match (according to log matching property).
Accepting them lets the case above make progress.
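
The retry loop from the scenario above can be modelled in a few lines.
This is a toy model under loud assumptions: `follower_reply` and
`leader_rounds` are hypothetical names, entries past the snapshot are
assumed to always append cleanly, and none of it is the real raft fsm
code.

```python
def follower_reply(snapshot_idx: int, entry_idx: int, accept_snapshotted: bool) -> bool:
    """Return True if the follower acknowledges the entry."""
    if entry_idx <= snapshot_idx:
        # The entry is already covered by the snapshot, so by the log
        # matching property it matches; the fix is to acknowledge it.
        return accept_snapshotted
    return True  # simplification: entries past the snapshot always append


def leader_rounds(snapshot_idx, next_idx, accept_snapshotted, max_rounds=5):
    """Count resend rounds until the entry is acknowledged, or None if stuck."""
    for round_no in range(1, max_rounds + 1):
        if follower_reply(snapshot_idx, next_idx, accept_snapshotted):
            return round_no
        # A reject makes the leader retry from the same entry.
    return None
```

With a snapshot at index 11 and the leader resending entry 11, the old
behaviour never gets an acknowledgement, while the new one gets it on
the first round.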

Fixes #9552
2021-11-09 14:51:40 +02:00
Gleb Natapov
9d505c48de raft: abort snapshot transfer to a server that was removed from the configuration
If a node is removed from a config we should stop transferring snapshot
to it. Do exactly that.

Fixes #9547
2021-11-09 14:51:40 +02:00
Gleb Natapov
88a6e2446d raft: fix race between snapshot application and committing of new entries
Completion notification code assumes that previous snapshot is applied
before new entries are committed, otherwise it asserts that some
notifications were missing. But currently commit notifications and
snapshot application run in different fibers, so there can be a race between
those.

Fix that by moving commit notification into applier fiber as well.

Fixes #9550
2021-11-09 14:51:40 +02:00
Gleb Natapov
3a88fa5f70 raft: test: add test for correct last configuration index calculation during snapshot application
2021-11-09 14:51:40 +02:00
Gleb Natapov
a04eb2d51f raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index
The log maintains _last_conf_idx and _prev_conf_idx indexes into the log
to point to where the latest and previous configuration can be found.
If they are zero it means that the latest config is in the snapshot.
When a snapshot with trailing entries is applied we can safely reset those
indexes that are smaller than the snapshot one to zero because the
snapshot will have the latest config anyway. This simplifies maintenance
of those indexes since their value will not depend on user configured
snapshot_trailing parameter.
2021-11-09 14:03:36 +02:00
Calle Wilund
3929b7da1f commitlog: Add explicit track var for "wasted space" to avoid double counting
Refs #9331

In segment::close() we add space to the manager's "wasted" counter. In the destructor,
if we can cleanly delete/recycle the file we subtract it again. However, if we never
went through close (shutdown - ok, exception in batch_cycle - not ok), we can
end up subtracting numbers that were never added in the first place.
Just keep track of the bytes added in a var.

Observed behaviour in above issue is timeouts in batch_cycle, where we
declare the segment closed early (because we cannot add anything more safely
- chunks could get partial/misplaced). Exception will propagate to caller(s),
but the segment will not go through actual close() call -> destructor should
not assume such.
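
A small sketch of the accounting idea, with hypothetical `Manager` and
`Segment` classes (the real commitlog code is C++ and differs): the
segment remembers how much it added to the shared counter and later
subtracts exactly that amount, which is zero if close() never ran.

```python
class Manager:
    """Holds the shared "wasted" counter."""

    def __init__(self):
        self.wasted = 0


class Segment:
    """Hypothetical segment that tracks what it contributed to the counter."""

    def __init__(self, manager, size, used):
        self.manager = manager
        self.size = size
        self.used = used
        self._waste = 0  # bytes this segment actually added to manager.wasted

    def close(self):
        self._waste = self.size - self.used
        self.manager.wasted += self._waste

    def release(self):
        """Stand-in for the destructor path that deletes/recycles the file."""
        self.manager.wasted -= self._waste  # zero if close() was never called
        self._waste = 0
```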

Closes #9598
2021-11-09 09:15:44 +02:00
Botond Dénes
4b6c0fe592 mutation_writer/feed_writer: don't drop readers with small amount of content
Due to an error in transforming the above routine, readers that have at most a
buffer's worth of content are dropped without being consumed.
This is due to the outer consume loop being conditioned on
`is_end_of_stream()`, which will be set for readers that eagerly
pre-fill their buffer and also have no more data than what is in their
buffer.
Change the condition to also check for `is_buffer_empty()` and only drop
the reader if both of these are true.
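
The difference between the two loop conditions can be reproduced with
a toy reader. `Reader` and `feed` below are hypothetical Python
stand-ins for the C++ reader and feed_writer(); only the method names
`is_end_of_stream()` and `is_buffer_empty()` come from this message.

```python
class Reader:
    """Toy reader that eagerly pre-fills its buffer on construction."""

    def __init__(self, items, buffer_size):
        self._pending = list(items)
        self._buffer = []
        self._size = buffer_size
        self.fill_buffer()

    def fill_buffer(self):
        while self._pending and len(self._buffer) < self._size:
            self._buffer.append(self._pending.pop(0))

    def is_end_of_stream(self):
        return not self._pending  # nothing left beyond the buffer

    def is_buffer_empty(self):
        return not self._buffer

    def pop(self):
        return self._buffer.pop(0)


def feed(reader, check_buffer):
    """Consume the reader; check_buffer=False models the buggy condition."""
    out = []
    while True:
        if check_buffer:
            done = reader.is_end_of_stream() and reader.is_buffer_empty()
        else:
            done = reader.is_end_of_stream()
        if done:
            break
        if reader.is_buffer_empty():
            reader.fill_buffer()
        while not reader.is_buffer_empty():
            out.append(reader.pop())
    return out
```

A reader whose whole content fits into one buffer is at end of stream
right after construction, so only the combined condition consumes it.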

Fixes: #9594

Tests: unit(mutation_writer_test --repeat=200, dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108092923.104504-1-bdenes@scylladb.com>
2021-11-09 09:15:44 +02:00
Avi Kivity
9e2b6176a2 Merge "Run gossiper message handlers in a gate" from Pavel E
"
When gossiper processes its messages in the background some of
the continuations may pop up after the gossiper is shutdown.
This, in turn, may result in unwanted code being executed when
it is not expected.

In particular, storage_service notification hooks may try to
update system keyspace (with "fresh" peer info/state/tokens/etc).
This update doesn't work after drain because drain shuts down
commitlog. The intention was that gossiper did _not_ notify
anyone after drain, because it's shut down during drain too.
But since there are background continuations left, it's not
working as expected.

refs: #9567
tests: unit(dev), dtest.concurrent_schema_changes.snapshot(dev)
"

* 'br-gossiper-background-messages-2' of https://github.com/xemul/scylla:
  gossiper: Guard background processing with gate
  gossiper: Helper for background messaging processing
2021-11-09 09:15:44 +02:00
Avi Kivity
b0a2a9771f Merge "Sanitize hostnames resolving on start" from Pavel E
"
On start scylla resolves several hostnames into addresses. Different
places use different hostname selection logic, e.g. the API address
can be the listen one if the dedicated option is not set. Failure to
resolve a hostname is reported with an exception that (sometimes)
contains the hostname, but it doesn't look very convenient -- better
to know the config option name. Also resolving of different hostnames
has different decoration around, e.g. prometheus carries a main-local
lambda just to nicely wrap the try/catch block.

This set unifies this zoo and makes main() shorter and less hairy:

1. All failures to resolve a hostname are reported with an
   exception containing the relevant config option

2. The || operator for named_value's is introduced to make
   the option selection look as short as

     resolve(cfg->some_address() || cfg->another_address())

3. All sanity checks are explicit and happen early in main

4. No dangling local variables carrying the cfg->...() value

5. Use resolved IP when logging a "... is listening on ..."
   message after a service start

tests: unit(dev)
"

* 'br-ip-resolve-on-start' of https://github.com/xemul/scylla:
  main: Move fb-utilities initialization up the main
  code: Use utils::resolve instead of inet_address::lookup
  main: Remove unused variable
  main: Sanitize resolving of listen address
  main: Sanitize resolving of broadcast address
  main: Sanitize resolving of broadcast RPC address
  main: Sanitize resolving of API address
  main: Sanitize resolving of prometheus address
  utils: Introduce || operator for named_values
  db.config: Verbose address resolver helper
  main: Remove api-port and prometheus-port variables
  alternator: Resolve address with the help of inet_address
  redis, thrift: Remove unused captures
2021-11-09 09:15:40 +02:00
Michael Livshin
f6bbc7fc9b tests: remove remaining uses of sstable_assertions::make_reader_v1()
* flat_reader_assertions::produces_range_tombstone() does not actually
  check range tombstones beyond the fact that they are in fact range
  tombstones (unless non-empty ck_ranges is passed). Fix the immediate
  problem, change assertion logic to take split and overlapping range
  tombstones into account properly, and also fix several
  accidentally-incorrect tests.
  Fixes #9470

* Convert the remaining sstable_3_x reader tests to v2, now that they
  are more correct and only the actual conversion remains.

This deals with the sstable reader tests that involve range
tombstones.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-11-09 09:13:51 +02:00
Botond Dénes
5b3ac3147b db/schema_tables: merge_tables_and_views(): match old/new view with old/new base table
For altered tables, the above function creates schema objects
representing before/after (old/new) table states. In case of views,
there is a matching mechanism to set the base table field of the view to
the appropriate base table object. This works by iterating over the list
of altered tables and selecting the "new_schema" field of the first
instance matching the keyspace/name of the base-table. This ends up
pairing the after/new version of the base table to both the before and
after version of the view. This means the base attached to the view is
possibly incompatible with the view it is attached to.
This patch fixes this by passing the schema generation (before/after) to
the function responsible for this matching, so it can select the
appropriate version of the base table.
For example, given the following input to `merge_tables_and_views()`:

    tables_before = { t1_before }
    tables_after = { t1_after }
    views_before = { v1_before }
    views_after = { v1_after }

Before this patch, the `base_schema` field of `v1_before` would be
`t1_after`, while it obviously should be `t1_before`. This sounds scary
but has no practical implications currently as `v1_before` is only
computed and then discarded without being used.
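
The matching rule can be sketched as a lookup that takes the
generation into account. `AlteredTable` and `find_base` are
hypothetical names and plain strings stand in for schema objects; the
real code in db/schema_tables is C++.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlteredTable:
    """Hypothetical record of one altered table with both schema versions."""
    keyspace: str
    name: str
    old_schema: str
    new_schema: str


def find_base(altered, keyspace, name, generation):
    """Return the base schema matching the view's generation ("before"/"after")."""
    for table in altered:
        if (table.keyspace, table.name) == (keyspace, name):
            return table.old_schema if generation == "before" else table.new_schema
    return None  # the base table was not altered
```

Before the fix the equivalent of this lookup always returned
`new_schema`, whichever view version asked.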

Tests: unit(dev)

Fixes: #9586
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108124806.151268-1-bdenes@scylladb.com>
2021-11-09 09:13:51 +02:00
Raphael S. Carvalho
33b39a2bfc compaction: move run_with_compaction_disabled() from table into compaction_manager
That's intended to fix a bad layer violation as table was given the
responsibility of disabling compaction for a given table T, but that
logic clearly belongs to compaction_manager instead.

Additionally, a gate will be used instead of a counter, as the former provides
the manager with a way to synchronize with functions running under
run_with_compaction_disabled, so remove() can wait for their
termination.
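
What a gate adds over a plain counter can be shown with a small
asyncio model. This `Gate` class is a hypothetical Python sketch of
the idea, not seastar's gate: close() refuses new entries and waits
until everything that entered has left.

```python
import asyncio


class Gate:
    """Hypothetical gate: counts entries and lets close() wait for them."""

    def __init__(self):
        self._count = 0
        self._closed = False
        self._drained = asyncio.Event()

    def enter(self):
        if self._closed:
            raise RuntimeError("gate closed")
        self._count += 1

    def leave(self):
        self._count -= 1
        if self._closed and self._count == 0:
            self._drained.set()

    async def close(self):
        self._closed = True
        if self._count == 0:
            self._drained.set()
        await self._drained.wait()


async def demo():
    """A remove()-like caller waits for a function running under the gate."""
    gate = Gate()
    log = []

    async def work():
        gate.enter()
        try:
            await asyncio.sleep(0.01)
            log.append("work done")
        finally:
            gate.leave()

    task = asyncio.create_task(work())
    await asyncio.sleep(0)   # let work() enter the gate
    await gate.close()       # returns only after work() has left
    log.append("removed")
    await task
    return log
```

A counter alone would tell the manager that something is running, but
would give it nothing to wait on.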

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 15:12:46 -03:00
Raphael S. Carvalho
52feb41468 compaction_manager: switch to coroutine in compaction_manager::remove()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:24:39 -03:00
Raphael S. Carvalho
aa9b1c1fa3 compaction_manager: add struct for per table compaction state
This will make it easier to pack all state data for a given table T.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:24:33 -03:00
Raphael S. Carvalho
7876bd4331 compaction_manager: wire stop_ongoing_compactions() into remove()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:24:24 -03:00
Raphael S. Carvalho
c0047bb9c0 compaction_manager: introduce stop_ongoing_compactions() for a table
New variant of stop_ongoing_compactions() which will stop all
compactions for a given table. Will be reused in both remove()
and by run_with_compaction_disabled() which will soon be moved into
the compaction_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:24:14 -03:00
Raphael S. Carvalho
2f293fa09c compaction_manager: prevent compaction from being postponed when stopping tasks
stop_tasks() must make sure that no ongoing task will postpone compaction
when asked to stop. Therefore, let's set all tasks as stopping before
any deferring point, such that no task will postpone compaction for
a table which is being stopped.

compaction_manager::remove() already handles this race with the same
method, and given that remove() will later switch to stop_tasks(),
let's do the same in stop_tasks().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:23:57 -03:00
Raphael S. Carvalho
0643faafd7 compaction_manager: extract "stop tasks" from stop_ongoing_compactions() into new function
Procedure will be reused to stop a list of tasks

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-08 14:23:37 -03:00
Pavel Emelyanov
92e8e217b7 main: Move fb-utilities initialization up the main
Setting up the fb_utilities addresses sits in the middle of
starting/stopping the real services. It's a bit cleaner to
make it earlier.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
2f9c21644b code: Use utils::resolve instead of inet_address::lookup
There are some users of the latter call left. They all suffer
from the same problem -- the lack of verbosity on resolving
errors.

While at it also get rid of useless local variables that are
only there to carry the cfg->...() option over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
7cf4e848ec main: Remove unused variable
This one left hanging after the previous patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
3a9b0d83fc main: Sanitize resolving of listen address
Nothing special here, just get rid of the one-shot local variable
and use utils::resolve to improve the verbosity of the
exception thrown on error.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
f190d99998 main: Sanitize resolving of broadcast address
To resolve this one main selects between the config option of
the same name or picks the listen address. Similarly to the
broadcast RPC address, on error the thrown exception is very
generic and doesn't tell which option contained the faulty
address.

The utils::resolve, || operator and dedicated explicit sanity
check make this place look better.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
a1b6600e7f main: Sanitize resolving of broadcast RPC address
The broadcast RPC address is taken from either the config
option of the same name or from the rpc_address one. Also
there's a sanity check on the latter. On resolution failure
it's impossible to find out which option caused this, just
the seastar-level exception is printed.

Using recently added utils helper and || for named values
makes things shorter. The sanity check for INADDR_ANY is
moved up in main() to where the other option sanity checks
sit.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
98ab8d9827 main: Sanitize resolving of API address
To find out the API address there's a main-local lambda to make
the verbose exception as well as an ?:-selection of which option
to use as the API address.

Using the utils::resolve and recently introduced || for named
values makes things much nicer and shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
3188161f93 main: Sanitize resolving of prometheus address
Right now there's a main-local lambda to resolve the address
and throw some meaningful exception.

Using recently introduced utils::resolve() helper makes things
look nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
b3a4f9e194 utils: Introduce || operator for named_values
Those named_values that support .empty() check can be "selected"
like this

    auto& v = option_a() || option_b() || option_c();

This code will put into v a reference to the first non-empty
named_value out of a/b/c.
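
A minimal sketch of how such a selection operator could look. The `named_value` class below is a toy stand-in with only a name, a string value and `.empty()`; the real class and its operator may differ.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-in for a config option; the real named_value is richer.
class named_value {
    std::string _name;
    std::string _value;
public:
    named_value(std::string name, std::string value)
        : _name(std::move(name)), _value(std::move(value)) {}
    const std::string& name() const { return _name; }
    const std::string& operator()() const { return _value; }
    bool empty() const { return _value.empty(); }
};

// Select the left operand unless it is empty. Chaining is left-associative,
// so a || b || c yields the first non-empty option, or c if all are empty.
// Unlike the built-in ||, an overloaded one always evaluates both operands.
const named_value& operator||(const named_value& a, const named_value& b) {
    return a.empty() ? b : a;
}
```

Returning a reference keeps the option name attached to the selected value, which is what lets a later error message say which option was faulty.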

This "selection" is actually used on start when scylla decides
which config options to use as listen/broadcast/rpc/etc. addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
71ce7c6e87 db.config: Verbose address resolver helper
The helper works on named_value() and throws an exception containing
the option name for convenient error reporting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Pavel Emelyanov
df08fb3025 main: Remove api-port and prometheus-port variables
Those variables just pollute the main's scope for no gain.
It's simpler and more friendly to the next patches to use
cfg-> stuff directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:04:07 +03:00
Pavel Emelyanov
acb7068ab5 alternator: Resolve address with the help of inet_address
Alternator needs to look up its address without preferring ipv4
or ipv6. To do so it calls a seastar method, but the same effect is
achieved by calling inet_address::lookup.

This change makes all places in scylla resolve addresses in a
similar way, makes this code line shorter and removes the need
to specifically explain the alternator hunks from next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:04:07 +03:00
Pavel Emelyanov
7f6fbaf3c6 redis, thrift: Remove unused captures
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 16:39:30 +03:00
Jenkins
158f47dfc7 release: prepare for 4.7.dev 2021-11-08 09:46:13 +02:00
Pavel Emelyanov
9fccf7f3af gossiper: Guard background processing with gate
When shut down, the gossiper may have some messages being processed in
the background. This brings two problems.

First, the gossiper itself is about to disappear soon and messages
might step on the freed instance (however, this one is not real now,
gossiper is not freed for real, just ::stop() is called).

Second, messages processing may notify other subsystems which, in
turn, do not expect this after gossiper is shutdown.

The common solution to this is to run background code through a gate
that gets closed at some point, the ::shutdown() in gossiper case.
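
The gate pattern can be illustrated with a toy, single-threaded sketch. `simple_gate` and `handle_in_background()` are hypothetical names used only for illustration; this is not the seastar::gate API and it does no real waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>

// Toy, single-threaded stand-in for a gate; not the seastar::gate API.
class simple_gate {
    std::size_t _in_flight = 0;
    bool _closed = false;
public:
    // Enter the gate; refused once close() has been called.
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_in_flight;
    }
    void leave() {
        assert(_in_flight > 0);
        --_in_flight;
    }
    // Stop admitting new work. A real gate would also let the caller
    // wait until in_flight() drops to zero.
    void close() { _closed = true; }
    bool is_closed() const { return _closed; }
    std::size_t in_flight() const { return _in_flight; }
};

// Run a background message handler only if the gate is still open.
// Returns false when the message is dropped because of shutdown.
bool handle_in_background(simple_gate& g, const std::function<void()>& handler) {
    if (g.is_closed()) {
        return false;
    }
    g.enter();
    try {
        handler();
    } catch (...) {
        g.leave();
        throw;
    }
    g.leave();
    return true;
}
```

Once the gate is closed at shutdown, later messages are dropped instead of notifying subsystems that no longer expect them.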

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 10:25:03 +03:00
Pavel Emelyanov
42f44adb98 gossiper: Helper for background messaging processing
Some messages are processed by gossiper on shard0 in the no-wait
manner. Add a generic helper for that to facilitate next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 10:24:44 +03:00
Takuya ASADA
76519751bc install.sh: add fix_system_distributed_tables.py to the package
Related with #4601

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2021-11-08 08:07:49 +02:00
Michael Livshin
806b5310fd tests: remove remaining uses of sstable_assertions::make_reader_v1()
This deals with the sstable reader tests that involve range
tombstones.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-11-08 00:56:39 +02:00
Michael Livshin
4941e2ec41 tests: fix range tombstone checking and deal with the fallout
flat_reader_assertions::produces_range_tombstone() does not actually
check range tombstones beyond the fact that they are in fact range
tombstones (unless non-empty ck_ranges is passed).

Fixing the immediate problem reveals that:

* The assertion logic is not flexible enough to deal with
  creatively-split or creatively-overlapping range tombstones.

* Some existing tests involving range tombstones are in fact wrong:
  some assertions may (at least with some readers) refer to wrong
  tombstones entirely, while others assert wrong things about right
  tombstones.

* Range tombstones in pre-made sstables (such as those read by
  sstable_3_x_test) have deletion time drift, and that now has to be
  somehow dealt with.

This patch (which is not split into smaller ones because that would
either generate unreasonable amount of work towards ensuring
bisectability or entail "temporarily" disabling problematic tests,
which is cheating) contains the following changes:

* flat_reader_assertions check range tombstones more carefully, by
  accumulating both expected and actually-read range tombstones into
  lists and comparing those lists when a partition ends (or when the
  assertion object is destroyed).

* flat_reader_assertions::may_produce_tombstones() can take
  constraining ck_ranges.

* Both flat_reader_assertions and flat_reader_assertions_v2 can be
  instructed to ignore tombstone deletion times, to help with tests that
  read pre-made sstables.

* Affected tests are changed to reflect reality.  Most changes to
  tests make sense; the only one I am not completely sure about is in
  test_uncompressed_filtering_and_forwarding_range_tombstones_read.

Fixes #9470

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-11-08 00:56:39 +02:00
Avi Kivity
247f2b69d5 Merge "system tables: create the schema more efficiently" from Botond
"
System tables currently almost uniformly use a pattern like this to
create their schema:

    return schema_builder(make_shared_schema(...))
    // [...]
    .with_version(...)
    .build(...);

This pattern is very wasteful because it first creates a schema, then
dismantles it just to recreate it again. This series abolishes this
pattern without much churn by simply adding a constructor to schema
builder that takes identical parameters to `make_shared_schema()`,
then simply removing `make_shared_schema()` from these users, who now
build a schema builder object directly and build the schema only once.

Tests: unit(dev)
"

* 'schema-builder-make-shared-schema-ctor/v1' of https://github.com/denesb/scylla:
  treewide: system tables: don't use make_shared_schema() for creating schemas
  schema_builder: add a constructor providing make_shared_schema semantics
  schema_builder: without_column(): don't assume column_specification exists
  schema: add static variant of column_name_type()
2021-11-07 18:23:22 +02:00
Takuya ASADA
546e4adf9e dist/docker: configure default locale correctly
Since cqlsh requires a UTF-8 locale, we should configure the default locale
correctly, both for a directly executed shell with docker and via SSH.
(Directly executed shell means "docker exec -ti <image> /bin/bash")

For SSH, we need to set the correct parameter in /etc/default/locale, which
can be set by the update-locale command.
However, a directly executed shell won't load this parameter, because it is
configured at PAM but we skip login in this case.
To fix this issue, we also need to set the locale variables in the container
image configuration (ENV in Dockerfile, --env in buildah).

Fixes #9570

Closes #9587
2021-11-07 17:03:12 +02:00
Takuya ASADA
201a97e4a4 dist/docker: fix bashrc filename for Ubuntu
For Debian variants, correct filename is /etc/bash.bashrc.

Fixes #9588

Closes #9589
2021-11-07 17:01:13 +02:00
Avi Kivity
7a3930f7cf Merge 'More nodetool-replacing virtual tables' from Botond Dénes
This PR introduces 4 new virtual tables aimed at replacing nodetool commands, working towards the long-term goal of replacing nodetool completely at least for cluster information retrieval purposes.
As you may have noticed, most of these replacements are not exact matches. This is on purpose. I feel that the nodetool commands are somewhat chaotic: they might have had a clear plan on what command prints what, but after years of organic development they are a mess of fields that feel like they don't belong. In addition to this, they are centered on C* terminology which often sounds strange or doesn't make any sense for scylla (off-heap memory, counter cache, etc.).
So in this PR I tried to do a few things:
* Drop all fields that don't make sense for scylla;
* Rename/reformat/rephrase fields that have a corresponding concept in scylla, so that it uses the scylla terminology;
* Group information in tables based on some common theme;

With these guidelines in mind let's look at the virtual tables introduced in this PR:
* `system.snapshots` - replacement for `nodetool listsnapshots`;
* `system.protocol_servers`- replacement for `nodetool statusbinary` as well as `Thrift active` and `Native Transport active` from `nodetool info`;
* `system.runtime_info` - replacement for `nodetool info`, not an exact match: some fields were removed, some were refactored to make sense for scylla;
* `system.versions` - replacement for `nodetool version`, prints all versions, including build-id;

Closes #9517

* github.com:scylladb/scylla:
  test/cql-pytest: add virtual_tables.py
  test/cql-pytest: nodetool.py: add take_snapshot()
  db/system_keyspace: add versions table
  configure.py: move release.cc and build_id.cc to scylla_core
  db/system_keyspace: add runtime_info table
  db/system_keyspace: add protocol_servers table
  service: storage_service: s/client_shutdown_hooks/protocol_servers/
  service: storage_service: remove unused unregister_client_shutdown_hook
  redis: redis_service: implement the protocol_server interface
  alternator: controller: implement the protocol_server interface
  transport: controller: implement the protocol_server interface
  thrift: controller: implement the protocol_server interface
  Add protocol_server interface
  db/system_keyspace: add snapshots virtual table
  db/virtual_table: remove _db member
  db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables()
  docs/design-notes/system_keyspace.md: add listing of existing virtual tables
  docs/guides: add virtual-tables.md
2021-11-07 16:55:31 +02:00
Avi Kivity
c6ac1462c2 build, submodules: use utc for build datestamp
This helps packages built on different machines have the
same datestamp, if started at the same time.

* tools/java 05ec511bbb...fd10821045 (1):
  > build: use utc for build datestamp

* tools/jmx 48d37f3...d6225c5 (1):
  > build: use utc for build datestamp

* tools/python3 c51db54...8a77e76 (1):
  > build: use utc for build datestamp

[avi: commit own patches as this one requires excessive coordination
      across submodules, for something quite innocuous]

Ref #9563 (doesn't really fix it, but helps a little)
2021-11-07 15:58:48 +02:00
Avi Kivity
1d4f6498c8 Update tools/python3 submodule for .orig cleanup
* tools/python3 279aae1...c51db54 (1):
  > reloc: clean up '.orig' temporary directory before building deb package
2021-11-07 15:55:49 +02:00
Botond Dénes
e991604918 schema: make private constructor invokable via make_lw_shared
The schema has a private constructor, which means it can't be
constructed with `make_lw_shared()` even by classes which are otherwise
able to invoke the private constructor themselves.
This results in such classes (`schema_builder`) resorting to building a
local schema object, then invoking `make_lw_shared()` with the schema's
public move constructor. Moving a schema is not cheap at all however, so
each `schema_builder::build()` call results in two expensive schema
construction operations.
We could make `make_lw_shared()` a friend of `schema` to resolve this,
but then we'd de-facto open the private constructor to the world.
Instead this patch introduces a private tag type, which is added to the
private constructor, which is then made public. Everybody can invoke the
constructor but only friends can create the private tag instance
required to actually call it.
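
A minimal sketch of the private tag technique, using `std::make_shared` as a stand-in for `make_lw_shared()`. The class bodies are hypothetical and stripped down to a single name field.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

class schema_builder;

class schema {
    // Only schema and its friends can name this tag type.
    struct private_tag {};
    std::string _name;
public:
    // Public so that std::make_shared can call it, but callers need a
    // private_tag argument, which outside code cannot name.
    schema(private_tag, std::string name) : _name(std::move(name)) {}
    const std::string& name() const { return _name; }
    friend class schema_builder;
};

class schema_builder {
    std::string _name;
public:
    explicit schema_builder(std::string name) : _name(std::move(name)) {}
    // Constructs the schema in place inside the shared allocation,
    // with no intermediate local object to move from.
    std::shared_ptr<schema> build() const {
        return std::make_shared<schema>(schema::private_tag{}, _name);
    }
};
```

The builder pays for one construction only, while code that is not a friend still cannot write `schema::private_tag`.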

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211105085940.359708-1-bdenes@scylladb.com>
2021-11-07 12:51:09 +02:00
Tomasz Grabiec
31bc1eb681 Merge 'Memtable reversing reader: fix computing rt slice, if there was previously emitted range tombstone.' from Michał Radwański
This PR started by realizing that, in the memtable reversing reader,
`do_refresh_state` was never called in tests with `last_row` and
`last_rts` that are not `std::nullopt`.

Changes
- fix memtable test (`test_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes,
- fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`).

Fixes #9486

Closes #9572

* github.com:scylladb/scylla:
  partition_snapshot_reader: fix indentation in fill_buffer
  range_tombstone_list: {lower,upper,}slice share comparator implementation
  test: memtable: add full_compaction in background
  partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set
  range_tombstone_list: add lower_slice
2021-11-05 15:27:03 +01:00
Botond Dénes
6993a55ff3 test/cql-pytest: add virtual_tables.py
Presence and column check for virtual tables. Where possible (and
simple) more is checked.
2021-11-05 16:26:21 +02:00
Botond Dénes
18f9d329ed test/cql-pytest: nodetool.py: add take_snapshot() 2021-11-05 16:26:01 +02:00
Botond Dénes
d51aa66a8a db/system_keyspace: add versions table
Contains all version related information (`nodetool version` and more).
Example printout:

    (cqlsh) select * from system.versions;

     key   | build_id                                 | build_mode | version
    -------+------------------------------------------+------------+-------------------------------
     local | aaecce2f5068b0160efd04a09b0e28e100b9cd9e |        dev | 4.6.dev-0.20211021.0d744fd3fa
2021-11-05 15:42:42 +02:00
Botond Dénes
5c87263ff8 configure.py: move release.cc and build_id.cc to scylla_core
These two files were only added to the scylla executable and some
specific unit tests. As we are about to use the symbols defined in these
files in some scylla_core code move them there.
2021-11-05 15:42:42 +02:00
Botond Dénes
89cc016f07 db/system_keyspace: add runtime_info table
Loosely contains the equivalent of the `nodetool info` command, with some
notable differences:
* Protocol server related information is in `system.protocol_servers`;
* Information about memory, memtable and cache is reformatted to be
  tailored to scylla: C* specific terminology and metrics are dropped;
* Information that doesn't change and is already in `system.local` is
  not contained;
* Added trace-probability too (`nodetool gettraceprobability`);

TODO(follow-up): exceptions.
2021-11-05 15:42:42 +02:00
Botond Dénes
78adda197f db/system_keyspace: add protocol_servers table
Lists all the client protocol servers and their status. Example output:

    (cqlsh) select * from system.protocol_servers;

     name             | is_running | listen_addresses                      | protocol | protocol_version
    ------------------+------------+---------------------------------------+----------+------------------
     native transport |       True | ['127.0.0.1:9042', '127.0.0.1:19042'] |      cql |            3.3.1
           alternator |      False |                                    [] | dynamodb |
                  rpc |      False |                                    [] |   thrift |           20.1.0
                redis |      False |                                    [] |    redis |

This prints the equivalent of `nodetool statusbinary` and the "Thrift
active" and "Native Transport active" fields from the `nodetool info`
output with some additional information:
* It contains alternator and redis status;
* It contains the protocol version;
* It contains the listen addresses (if respective server is running);
2021-11-05 15:42:42 +02:00
Botond Dénes
3f56c49a9e service: storage_service: s/client_shutdown_hooks/protocol_servers/
Replace the simple client shutdown hook registry mechanism with a more
powerful registry of the protocol servers themselves. This allows
enumerating the protocol servers at runtime, checking whether they are
running or not and starting/stopping them.
2021-11-05 15:42:42 +02:00
Botond Dénes
e9c9a39c06 service: storage_service: remove unused unregister_client_shutdown_hook
Nobody seems to unregister client shutdown hooks ever. We are about
to refactor the client shutdown hook machinery so remove this unused
code to make this easier.
2021-11-05 15:42:41 +02:00
Botond Dénes
f56f4ade22 redis: redis_service: implement the protocol_server interface
In the process de-globalize redis service and pass dependencies in the
constructor.
2021-11-05 15:42:41 +02:00
Botond Dénes
8ddfdd8aa9 alternator: controller: implement the protocol_server interface 2021-11-05 15:42:41 +02:00
Botond Dénes
134fa98ff4 transport: controller: implement the protocol_server interface 2021-11-05 15:42:41 +02:00
Botond Dénes
bda0d0ccba thrift: controller: implement the protocol_server interface 2021-11-05 15:42:41 +02:00
Botond Dénes
3ff8ba9146 Add protocol_server interface
We want to replace the current
`storage_service::register_client_shutdown_hook()` machinery with
something more powerful. We want to register all running client protocol
servers with the storage service, allowing enumerating these at runtime,
checking whether they are running or not and starting/stopping them.
As the first step towards this, we introduce an abstract interface that
we are going to implement at the controllers of the various protocol
servers we have. Then we will switch storage service to collect pointers
to this interface instead of simple stop functors.
2021-11-05 15:42:41 +02:00
Botond Dénes
64f658aea4 db/system_keyspace: add snapshots virtual table
Lists the equivalent of the `nodetool listsnapshots` command.
2021-11-05 15:42:41 +02:00
Botond Dénes
f0281eaa98 db/virtual_table: remove _db member
This member is potentially dangerous as it only becomes non-null
sometime after the virtual table object is constructed. This is asking
for nullptr dereference.
Instead, remove this member and have virtual table implementations that
need a db ask for it in the constructor; it is available in
`register_virtual_tables()` now.
2021-11-05 15:42:41 +02:00
Botond Dénes
200e2fad4d db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables()
As some virtual tables will need the distributed versions of these.
2021-11-05 15:42:41 +02:00
Botond Dénes
185c5f1f5b docs/design-notes/system_keyspace.md: add listing of existing virtual tables
As well as a link to the newly added docs/guides/virtual-tables.md
2021-11-05 15:42:39 +02:00
Michał Radwański
ee601b7d87 partition_snapshot_reader: fix indentation in fill_buffer 2021-11-05 10:51:58 +01:00
Michał Radwański
35b1c3ff52 range_tombstone_list: {lower,upper,}slice share comparator
implementation

slice (2 overloads), upper_slice, lower_slice previously had
implementations of a comparator. Move out the common structs, so that
all 4 of them can share implementation.
2021-11-05 10:51:58 +01:00
Botond Dénes
b8c156d4f7 docs/guides: add virtual-tables.md
Explaining what virtual tables are, what are good candidates for virtual
tables and how you can write one.
2021-11-05 11:49:27 +02:00
Botond Dénes
ccf5c31776 treewide: system tables: don't use make_shared_schema() for creating schemas
`make_shared_schema()` is a convenience method for creating a schema in
a single function call, however it doesn't have all the advanced
capabilities of `schema_builder`. So most users (which all happen to be
system tables) pass the schema created by it to schema builder
immediately to do some further tweaking, effectively building the schema
twice. This is wasteful.
This patch changes all these users to use the newly added
`schema_builder()` constructor which has the same signature (and
therefore ease-of-use) as `make_shared_schema()`.
2021-11-05 11:41:04 +02:00
Botond Dénes
4dea339e0c schema_builder: add a constructor providing make_shared_schema semantics
make_shared_schema() is often used to create a schema that is then
passed to schema_builder to modify it further. This is wasteful as the
schema is built just to be disassembled and rebuilt again. To replace
this wasteful pattern we provide a schema_builder constructor that has
the same signature as `make_shared_schema()`, allowing follow-up
modifications on the schema before it is fully built.
2021-11-05 11:41:04 +02:00
Botond Dénes
476f49c693 schema_builder: without_column(): don't assume column_specification exists
It is only created for column definitions in the schema constructor, so
it will only exist for schema builders created from schema instances.
It is not guaranteed that `without_column()` will only be called on
such builder instances so ensure the implementation doesn't depend on
it.
2021-11-05 11:41:04 +02:00
Botond Dénes
d3833c5978 schema: add static variant of column_name_type()
So schema_builder can use it too (without a schema instance at hand).
2021-11-05 11:41:04 +02:00
Michael Livshin
60f76155a7 build: have configure.py create compile_commands.json
compile_commands.json (a.k.a. "compdb",
https://clang.llvm.org/docs/JSONCompilationDatabase.html) is intended
to help stand-alone C-family LSP servers index the codebase as
precisely as possible.

The actively maintained LSP servers with good C++ support are:
- Clangd (https://clangd.llvm.org/)
- CCLS (https://github.com/MaskRay/ccls)

This change causes a successful invocation of configure.py to create a
unified Scylla+Seastar+Abseil compdb for every selected build mode,
and to leave a valid symlink in the source root (if a valid symlink
already exists, it will be left alone).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9558
2021-11-05 11:28:37 +02:00
Raphael S. Carvalho
4950ce539c schema: replace outdated comment on default compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211104210043.199156-1-raphaelsc@scylladb.com>
2021-11-05 00:35:41 +02:00
Nadav Har'El
5e52858295 rjson, alternator: rename set() functions add()
The rjson::set() *sounds* like it can set any member of a JSON object
(i.e., map), but that's not true :-( It calls the RapidJson function
AddMember() so it can only add a member to an object which doesn't have
a member with the same name (i.e., key). If it is called with a key
that already has a value, the result may have two values for the same
key, which is ill-formed and can cause bugs like issue #9542.

So in this patch we begin by renaming rjson::set() and its variant to
rjson::add() - to suggest to its user that this function only adds
members, without checking if they already exist.

After this rename, I was left with dozens of calls to the set() functions
that need to be changed to either add() - if we're sure that the object
cannot already have a member with the same name - or to replace() if
it might.

The vast majority of the set() calls were starting with an empty item
and adding members with fixed (string constant) names, so these can
be trivially changed to add().

It turns out that *all* other set() calls - except the one fixed in
issue #9542 - can also use add() because there are various "excuses"
why we know the member names will be unique. A typical example is
a map with column-name keys, where we know that the column names
are unique. I added comments in front of such non-obvious uses of
add() which are safe.

Almost all uses of rjson except a handful are in Alternator, so I
verified that all Alternator test cases continue to pass after this
patch.

Fixes #9583
Refs #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104152540.48900-1-nyh@scylladb.com>
2021-11-04 16:35:38 +01:00
Nadav Har'El
b95e431228 alternator: fix bug in ReturnValues=ALL_NEW
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.

The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.

So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
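
The difference between the two operations can be shown on a toy object that, like a RapidJSON object, stores its members as a list. `json_object`, `add()`, `replace()` and `count_key()` below are hypothetical illustrations, not the rjson API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy JSON object: an ordered list of (key, value) members. Hypothetical
// stand-in, not the rjson or RapidJSON API.
using json_object = std::vector<std::pair<std::string, std::string>>;

// Append a member without checking for an existing key, so calling it
// with a key that is already present leaves two members with that key.
void add(json_object& obj, std::string key, std::string value) {
    obj.emplace_back(std::move(key), std::move(value));
}

// Overwrite the member if the key exists, otherwise append it.
void replace(json_object& obj, const std::string& key, std::string value) {
    auto it = std::find_if(obj.begin(), obj.end(),
                           [&](const auto& m) { return m.first == key; });
    if (it != obj.end()) {
        it->second = std::move(value);
    } else {
        obj.emplace_back(key, std::move(value));
    }
}

// Count the members stored under a key; more than one means ill-formed.
std::size_t count_key(const json_object& obj, const std::string& key) {
    return static_cast<std::size_t>(
        std::count_if(obj.begin(), obj.end(),
                      [&](const auto& m) { return m.first == key; }));
}
```

Starting from a copy of the old item and overriding an attribute with the append-only operation leaves both the old and the new value in the object, which is the ill-formed state described above.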

This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.

In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().

Fixes #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
2021-11-04 16:34:58 +01:00
Michał Radwański
cac9ac5126 test: memtable: add full_compaction in background
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in the process of
reading from a clustering range, we can force a state refresh with
non-nullopt positions of the last row and last range tombstone.

Note: this inability to test affected only the reversing reader.
2021-11-04 16:19:54 +01:00
Michał Radwański
94b263e356 partition_snapshot_reader: fix obtaining rt_slice, if Reversing and
_last_rts was set

If Reversing and _last_rts was set, the created rt_slice still contained
range tombstones between *_last_rts and (snapshot) clustering range end.
This is wrong - the correct range is between (snapshot) clustering range
begin and *_last_rts.
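
The corrected bounds can be shown on a toy model in which positions are plain integers. `remaining_slice()` is a hypothetical helper for illustration; real range tombstones are ranges rather than points, so this is a deliberate simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Toy model, not the partition_snapshot_reader code: positions are ints
// and `sorted_positions` holds range tombstone positions in ascending order.
//
// Returns the tombstones still to be emitted for the clustering range
// [start, end], in emission order. `last` is the position of the last
// tombstone already emitted before the state refresh, if any.
std::vector<int> remaining_slice(const std::vector<int>& sorted_positions,
                                 int start, int end,
                                 std::optional<int> last, bool reversing) {
    assert(start <= end);
    std::vector<int> out;
    for (int p : sorted_positions) {
        if (p < start || p > end) {
            continue;
        }
        if (last) {
            // Forward reads resume after `last`: (last, end].
            // Reversed reads resume before `last`: [start, last).
            if (reversing ? p >= *last : p <= *last) {
                continue;
            }
        }
        out.push_back(p);
    }
    if (reversing) {
        std::reverse(out.begin(), out.end());
    }
    return out;
}
```

A reversed read walks from the range end towards its start, so what is left after `last` lies below it, not above it.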
2021-11-04 16:10:07 +01:00
Jan Ciolek
51a8a1f89b cql3: Remove remaining mentions of term
There were a few places where term was still mentioned.
Removed/replaced term with expression.

search_and_replace is still done only on LHS of binary_operator
because the existing code would break otherwise.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:57:00 +01:00
Jan Ciolek
e458340821 cql3: Remove term
term isn't used anywhere now. We can remove it and all classes that derive from it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
dcd3199037 cql3: Rename prepare_term to prepare_expression
prepare_term now takes an expression and returns a prepared expression.
It should be renamed to prepare_expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
219f1a4359 cql3: Make prepare_term return an expression instead of term
prepare_term is now the only function that uses terms.
Change it so that it returns expression instead of term
and remove all occurrences of expr::to_expression(prepare_term(...))

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
c84e941df9 cql3: expr: Add size check to evaluate_set
In old code sets::delayed_value::bind() contained a check that each serialized value is less than certain size.
I missed this when implementing evaluate(), so it's brought back to ensure identical behaviour.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
7bc65868eb cql3: expr: Add expr::contains_bind_marker
Add a function that checks whether there is a bind marker somewhere inside an expression.
It's important to note that even when there are no bind markers, there can be other things that prevent immediate evaluation of an expression.
For example an expression can contain calls to nonpure functions.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
080286cb96 cql3: expr: Rename find_atom to find_binop
Soon there will be other functions that
also search in expression, find_atom would be confusing then.
find_binop is a more descriptive name.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
7cabed9ebf cql3: expr: Add find_in_expression
find_in_expression is a function that looks into the expression
and finds the given expression variant for which the predicate function returns true.
If nothing is found returns nullptr.

For example:
find_in_expression<binary_operator>(e, [](const binary_operator&) {return true;})
Will return the first binary operator found in the expression.

It is now used in find_atom, and soon will be used in other similar functions.
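
The search described above can be sketched in Python. This is only an illustration of the idea: the `Constant`, `BinaryOperator` and `Conjunction` classes are hypothetical stand-ins for the C++ expression variant, and `None` plays the role of nullptr.

```python
from dataclasses import dataclass, field


@dataclass
class Constant:
    value: object


@dataclass
class BinaryOperator:
    lhs: object
    op: str
    rhs: object


@dataclass
class Conjunction:
    children: list = field(default_factory=list)


def find_in_expression(expr, node_type, predicate):
    """Return the first node of node_type accepted by predicate, else None."""
    if isinstance(expr, node_type) and predicate(expr):
        return expr
    # Collect child expressions; leaves such as Constant have none.
    if isinstance(expr, BinaryOperator):
        children = [expr.lhs, expr.rhs]
    elif isinstance(expr, Conjunction):
        children = expr.children
    else:
        children = []
    # Depth-first, left-to-right search.
    for child in children:
        found = find_in_expression(child, node_type, predicate)
        if found is not None:
            return found
    return None
```

Passing a predicate that always returns true yields the first node of the requested type, which mirrors the binary_operator example above.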

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
890c8f4026 cql3: Remove term in operations
Replace term with expression in cql3/operation and its children.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
3b4dc39eb8 cql3: Remove term in relations
Replace uses of term with expression in cql3/*relation.hh

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
7f2ecf1aa2 cql3: Remove term in multi_column_restrictions
Replace all uses of term with expression in cql3/multi_column_restrictions

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
a59683b929 cql3: Remove term in term_slice, rename to bounds_slice
term_slice is an interval from one term to the other.
[term1, term2]
Replaced terms with expressions.
Because the name has 'term' in it, it was changed to bounds_slice.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
e37906ae34 cql3: expr: Remove term in expression
Some struct inside the expression variant still contained term.
Replace those terms with expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:44 +01:00
Pavel Emelyanov
6e97d2ce87 Merge branch 'compaction_cleanup_and_improvements_v2' from Raphael S. Carvalho
Cleanup and improvements for compaction

* 'compaction_cleanup_and_improvements_v2' of https://github.com/raphaelsc/scylla:
  compaction: fix outdated doc of compact_sstables()
  table: fix indentation in compact_sstables()
  table: give a more descriptive name to compaction_data in compact_sstables()
  compaction_manager: rename submit_major_compaction to perform_major_compaction
  compaction: fix indentation in compaction.hh
  compaction: move incremental_owned_ranges_checker into cleanup_compaction
  compaction: make owned ranges const in cleanup_compaction
  compaction: replace outdated comment in regular_compaction
  compaction: give a more descriptive name to compaction_data
  compaction_manager: simplify creation of compaction_data
2021-11-04 17:27:07 +03:00
Raphael S. Carvalho
132a840ed5 compaction: fix outdated doc of compact_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 11:09:24 -03:00
Raphael S. Carvalho
98dd57113f table: fix indentation in compact_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 11:09:24 -03:00
Raphael S. Carvalho
51aa79e267 table: give a more descriptive name to compaction_data in compact_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 11:09:24 -03:00
Gleb Natapov
bdf7d1a411 raft: correctly truncate the log in a persistence module during snapshot application
When a remote snapshot is applied, the log is completely cleared because
snapshot transfer happens only when a common log prefix cannot be found,
so we cannot be sure that the existing entries in the log are correct.
Currently this happens only for the in-memory log, by calling apply_snapshot
with trailing set to zero; when the persistence module is called to
store the snapshot, _config.snapshot_trailing is used, which can be
non-zero. This may cause the log to contain incorrect entries after restart.

The patch fixes this by using zero trailing for non local snapshots.

Fixes #9551
2021-11-04 15:11:19 +02:00
Raphael S. Carvalho
8ce9cda391 compaction_manager: rename submit_major_compaction to perform_major_compaction
For symmetry, let's call it perform_*, as it doesn't work like the submission
functions, which don't wait for the result, like the one for minor
compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:54:00 -03:00
Raphael S. Carvalho
0d745912d0 compaction: fix indentation in compaction.hh
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:50:46 -03:00
Raphael S. Carvalho
5af9a690c1 compaction: move incremental_owned_ranges_checker into cleanup_compaction
let's move checker into cleanup as it's not needed elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:49:44 -03:00
Raphael S. Carvalho
04ef2124c6 compaction: make owned ranges const in cleanup_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:47:12 -03:00
Raphael S. Carvalho
d86c2491d4 compaction: replace outdated comment in regular_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:45:34 -03:00
Raphael S. Carvalho
b344db1696 compaction: give a more descriptive name to compaction_data
info is no longer descriptive, as compaction now works with
compaction_data instead of compaction_info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:43:08 -03:00
Raphael S. Carvalho
63dc4e2107 compaction_manager: simplify creation of compaction_data
there's no need for wrapping compaction_data in shared_ptr, also
let's kill unused params in create_compaction_data to simplify
its creation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:35:49 -03:00
Takuya ASADA
9b4cf8c532 scylla_util.py: On is_gce(), return False when it's on GKE
GKE metadata server does not provide same metadata as GCE, we should not
return True on is_gce().
So try to fetch machine-type from the metadata server, and return False if it
responds with 404 Not Found.
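
A minimal sketch of such a check follows. It is not the code from scylla_util.py: the helper names are hypothetical, the URL is the commonly documented GCE metadata endpoint (whether scylla_util.py uses exactly this URL is an assumption), and the `fetch` parameter exists only so the logic can be exercised without a metadata server.

```python
import urllib.error
import urllib.request

# Assumed endpoint; GCE metadata requests need the Metadata-Flavor header.
MACHINE_TYPE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/machine-type"
)


def fetch_machine_type(url=MACHINE_TYPE_URL, timeout=1.0):
    """Fetch the machine-type string from the metadata server."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def has_gce_machine_type(fetch=fetch_machine_type):
    """Return False when the metadata server answers 404 for machine-type."""
    try:
        fetch()
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        # Other HTTP errors are not interpreted here; let the caller decide.
        raise
```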

Fixes #9471

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9582
2021-11-04 12:49:06 +02:00
Avi Kivity
a64458e71a Merge "Run test cases in parallel by default" from Pavel E
"
Some time ago the --parallel-cases option was introduced,
set to False by default. Now everything is ready for
making it True.

Running in a BYO job shows that it takes 30 minutes less to
complete the debug tests. Other timings remain almost the same.

tests: unit(dev), unit(debug)
"

* 'br-parallel-cases-by-default' of https://github.com/xemul/scylla:
  test.py: Run parallel cases by default
  test, raft: Keep many-400 case out of debug mode
  test.py: Cache collected test-cases
2021-11-04 10:10:08 +02:00
Pavel Emelyanov
d1679b66f2 test.py: Run parallel cases by default
There were a few missing bits before this could become the default.

- default max number of AIOs, now tests are run with the greatly
  reduced value

- 1.5 hours single case from database_test, now it's split and
  scales with --parallel-cases

- suite add_test methods called in a loop for --repeat options,
  patch #1 from this set fixes it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-04 10:47:13 +03:00
Pavel Emelyanov
12cf69e5f5 test, raft: Keep many-400 case out of debug mode
This case takes 45+ minutes, which is 1.5 times longer than the
second longest case out there. I propose to keep the many-400
case out of debug runs; there's a many-100 one which is configured
the same way but uses 4x fewer "nodes".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-04 10:47:13 +03:00
Pavel Emelyanov
0d0ccd50b5 test.py: Cache collected test-cases
The add_test method of a suite can be called several times in a
row e.g. in case of --repeat option or because there are more
than one custom_args entries in the suite.yaml file. In any case
it's pointless to re-collect the test cases by launching the
test binary again, it's much faster (and 100% safe) to keep the
list of cases from the previous call and re-use it if the test
name matches.
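
The caching idea can be sketched as below. The `Suite` class and the `collect_cases` callable are hypothetical stand-ins and not the actual test.py code; `collect_cases` represents the expensive step of launching the test binary to list its cases.

```python
class Suite:
    """Hypothetical stand-in for a test.py suite."""

    def __init__(self, collect_cases):
        # collect_cases(name) lists the cases of a test binary (expensive).
        self._collect_cases = collect_cases
        self._last_name = None
        self._last_cases = []
        self.tests = []

    def add_test(self, name, args=()):
        # Re-collect only when the test name differs from the previous call.
        if name != self._last_name:
            self._last_cases = list(self._collect_cases(name))
            self._last_name = name
        for case in self._last_cases:
            self.tests.append((name, case, tuple(args)))
```

With this shape, calling add_test repeatedly for the same name (as --repeat does) launches the collection step once.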

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-04 10:47:13 +03:00
Avi Kivity
e1817b536f build: clobber user/group info from node_exporter tarball
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)

Reset the uid/gid information so this doesn't happen.

Closes #9579
2021-11-04 09:27:13 +02:00
Raphael S. Carvalho
ab0217e30e compaction: Improve overall efficiency by not diluting it with relatively inefficient jobs
Compaction efficiency can be defined as how much backlog is reduced per
byte read or written.
We know a few facts about efficiency:
1) the more files are compacted together (the fan-in) the higher the
efficiency will be, however...
2) the bigger the size difference of input files the worse the
efficiency, i.e. higher write amplification.

So compactions with similar-sized files are the most efficient ones,
and their efficiency increases with a higher number of files.

However, in order to not have bad read amplification, number of files
cannot grow out of bounds. So we have to allow parallel compaction
on different tiers, but to avoid "dilution" of overall efficiency,
we will only allow a compaction to proceed if its efficiency is
greater than or equal to the efficiency of ongoing compactions.

For the time being, we'll assume that strategies don't pick candidates
with wildly different sizes, so efficiency is only calculated as a
function of compaction fan-in.
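
As a rough sketch of that admission rule, assuming efficiency is represented by fan-in alone (the function name is hypothetical, and comparing against every ongoing compaction is one reading of the rule):

```python
def may_start(candidate_fan_in, ongoing_fan_ins):
    """Allow a new compaction only if it would not dilute overall efficiency."""
    # With nothing running there is nothing to dilute, so all() is True.
    return all(candidate_fan_in >= fan_in for fan_in in ongoing_fan_ins)
```

Under this reading, a fan-in of 2 is rejected while a fan-in-4 job is running, but a fan-in of 4 or more is admitted.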

Now when system is under heavy load, then fan-in threshold will
automatically grow to guarantee that overall efficiency remains
stable.

Please note that fan-in is defined in number of runs. LCS compaction
on higher levels will have a fan-in of 2. Under heavy load, it
may happen that LCS will temporarily switch to size-tiered mode
for compaction to keep up with amount of data being produced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103215110.135633-2-raphaelsc@scylladb.com>
2021-11-03 20:03:23 +02:00
Raphael S. Carvalho
0db70a8d98 compaction: STCS: pick bucket with largest fan-in instead
STCS is considering the smallest bucket, out of the ones which contain
more than min_threshold elements, to be the most interesting one
to compact now.
That's basically saying we'll only compact larger tiers once we're
done with smaller ones. That can be problematic because under heavy
load, larger tiers cannot be compacted in a timely manner even though
they're the ones contributing the most to read amplification.
For example, if we're producing sstables in smaller tiers at roughly
the same rate that we can compact them, then it may happen that
larger tiers will not be compacted even though new sstables are
being pushed to them. Therefore, backlog will not be reduced in a
satisfactory manner, so read latency is affected.
By picking the bucket with largest fan-in instead, we'll choose the
most efficient compaction, as we'll target buckets which can reduce
more from backlog once compacted.
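
A minimal sketch of the new selection, with buckets modeled as plain lists of sstables. The function name is hypothetical, and treating "at least min_threshold" as the eligibility test is an assumption:

```python
def pick_bucket(buckets, min_threshold=4):
    """Pick the eligible bucket with the largest fan-in, or None."""
    # Assumption: a bucket is eligible once it holds min_threshold sstables.
    eligible = [b for b in buckets if len(b) >= min_threshold]
    # Previously the smallest tier won; now the largest fan-in does.
    return max(eligible, key=len, default=None)
```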

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103215110.135633-1-raphaelsc@scylladb.com>
2021-11-03 20:03:19 +02:00
Raphael S. Carvalho
e9cb56cd81 table: Adjust partition estimation for segregation on memtable flush
If memtable flush is segregated into multiple files, partition
estimation becomes inaccurate and consequently bloom filters are
bigger than needed, leading to an increase in memory consumption.
To fix this, let's wire adjust_partition_estimate() into the flush
procedure, such that original estimation will be adjusted if
segregation is going to be performed. That's done by feeding
mutation_source_metadata, which will leave original estimation
unchanged if no segregation is needed, but will adjust it
otherwise.

Fixes #9581.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-2-raphaelsc@scylladb.com>
2021-11-03 17:51:03 +02:00
Raphael S. Carvalho
2340cfa957 memtable-sstable: Extend interface to allow adjustment of estimated partitions
Without tweaking interface, there was no way to adjust estimated
partitions on flush. For example, when segregating a memtable for
TWCS, all produced sstables would have an estimation equal to
the memtable size, even though each only contains a subset of it,
which leads to a significant increase in memory consumption for
bloom filters. Subsequent work will use this interface to perform
the adjustment.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-1-raphaelsc@scylladb.com>
2021-11-03 17:51:03 +02:00
Avi Kivity
08ce119703 Merge "Fix twcs reshape disjoint test case" from Pavel E
"
There are 3 overlapping problems with the test case. It
has a use after move that masks a wrong window selection, and it
relies on the time-since-epoch being aligned with the time
window by chance.

tests: unit(dev)
"

* 'br-twcs-test-fixes' of https://github.com/xemul/scylla:
  test, compaction: Do not rely on random timestamp
  test, compaction: Fix use after move in twcs reshape
2021-11-03 17:38:29 +02:00
Asias He
f5f5714aa6 repair: Return HTTP 400 when repair id is not found
There are two APIs for checking the repair status and they behave
differently in case the id is not found.

```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}

{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```

The correct status code is 400 as this is a parameter error and should
not be retried.

Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.
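
The distinction can be illustrated with a small sketch. The exception class and both helper functions are hypothetical and do not come from the repair code or from any particular HTTP client:

```python
from http import HTTPStatus


class UnknownRepairId(Exception):
    """Hypothetical error for a repair id the server does not know."""


def status_for(error):
    # A bad parameter is the caller's mistake: 4xx, retrying cannot help.
    if isinstance(error, UnknownRepairId):
        return HTTPStatus.BAD_REQUEST
    return HTTPStatus.INTERNAL_SERVER_ERROR


def client_should_retry(status):
    # Typical client policy: only server-side (5xx) failures are retried.
    return 500 <= int(status) <= 599
```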

After this patch:

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}

Fixes #9576

Closes #9578
2021-11-03 17:15:40 +02:00
Pavel Emelyanov
9628d72964 test, compaction: Do not rely on random timestamp
Again, there's a sub-case with sequential time stamps that still
works by chance. This time it's because splitting 256 sstables
into buckets of maximum 8 ones is allowed to have the 1st and the
last ones with less than 8 items in it, e.g. 3, 8, ..., 8, 5. The
exact generation depends on the time-since-epoch at which it
starts.

When all the cases are run altogether this time luckily happens
to be well-aligned with 8-hours and the generated buckets are
filled perfectly. When this particular test-case is run all alone
(e.g. by --run_test or --parallel-cases) then the starting time
becomes different and it gets less than 4 sstables in its first
bucket.

The fix is in adjusting the starting time to be aligned with the
8 hours window.
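
The effect of alignment can be checked with a short sketch. Timestamps are plain seconds since the epoch and both function names are hypothetical; the unaligned case reproduces the 3, 8, ..., 8, 5 split mentioned above.

```python
from collections import Counter

HOUR = 3600


def align_down(ts, window):
    """Round a timestamp down to the start of its time window."""
    return ts - ts % window


def bucket_sizes(start, count, step, window):
    """Sizes of the time-window buckets for `count` sstables spaced by `step`."""
    counts = Counter(align_down(start + i * step, window) for i in range(count))
    return [counts[k] for k in sorted(counts)]
```

Starting 5 hours into an 8-hour window gives a first bucket of 3 and a last bucket of 5, while starting at `align_down(start, 8 * HOUR)` fills all 32 buckets with 8 sstables each.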

Actually, the 8 hours appeared in the previous patch, before which
it was 24 hours. Nonetheless, the above reasoning applies to any
size of the time window that's less than 256, so it's still an
independent fix.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-03 15:41:19 +03:00
Pavel Emelyanov
c6ce6b9ca1 test, compaction: Fix use after move in twcs reshape
The options are std::move-d twice -- first into the schema builder, then
into the compaction strategy. Surprisingly, the 2nd move is what makes the
test work.

There's a sub-case in this case that checks sstables with incremental
timestamps with 1 hour step -- 0h, 1h, 2h, ... 255h. Next, the twcs
buckets generator obeys a minimal threshold of 4 sstables per bucket.
Those with fewer sstables are not included in the job. Finally,
since the options used to create the twcs are empty after the 1st
move the default window of 24 hours is used. If they were configured
correctly with 1 hour window then all buckets would contain 1 sstable
and the generated job would become empty.
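
The arithmetic behind this can be sketched as follows. The function is hypothetical and models only window bucketing plus the minimal threshold, with sstable i carrying timestamp i * step_hours:

```python
def job_size(num_sstables, step_hours, window_hours, min_threshold=4):
    """Count sstables that land in buckets meeting the minimal threshold."""
    buckets = {}
    for i in range(num_sstables):
        window = (i * step_hours) // window_hours
        buckets[window] = buckets.get(window, 0) + 1
    # Buckets below the threshold are left out of the job.
    return sum(n for n in buckets.values() if n >= min_threshold)
```

With 256 hourly sstables, a 24-hour window puts every sstable into a large enough bucket, whereas a 1-hour window leaves one sstable per bucket and the job is empty.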

So the fix is both -- don't move after move and make the window size
large enough to fit more sstables than the mentioned minimum.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-03 15:09:15 +03:00
Piotr Sarna
f36bbe05b4 Merge 'alternator: add support for AttributeUpdates Add operation' from Nadav Har'El
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).

This two-patch series adds this missing feature. The first patch just moves
an existing function to where we can reuse it, and the second patch is
the actual implementation of the feature (and enabling its test).

Fixes #5893

Closes #9574

* github.com:scylladb/scylla:
  alternator: add support for AttributeUpdates ADD operation
  alternator: move list_concatenate() function
2021-11-03 09:33:50 +01:00
Avi Kivity
075ceb8918 Merge 'AWS: add scylla_io_setup preset parameters for ARM instances' from Takuya ASADA
Currently, scylla-server fails to start on ARM instances because scylla_io_setup does not have preset parameters for them, even though the instance types were added to the 'supported instance' list.
To fix this, we need to add I/O parameter presets to scylla_io_setup.

Also, we mistakenly added EBS-only instances in a004b1da30 and need to remove them.
Instances that do not have an ephemeral disk should be 'unsupported instances': our AMI still runs on them, but we print a warning message at the login prompt, and the user is required to run scylla_io_setup.

Fixes #9493

Closes #9532

* github.com:scylladb/scylla:
  scylla_util.py: remove EBS only ARM instances from support instance list
  scylla_io_setup: support ARM instances on AWS
2021-11-03 10:19:59 +02:00
Nadav Har'El
00335b1901 alternator: add support for AttributeUpdates ADD operation
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).

This patch adds this feature, and the test for it begins to pass so its
"xfail" marker is removed.

Fixes #5893

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-03 10:19:26 +02:00
Nadav Har'El
7e6c5394f3 alternator: move list_concatenate() function
The list_concatenate() function was only used for UpdateExpression's
ADD operation, so we made it a static function in the source file where
it was used. In the next patch, we'll want to use it in another place
(AttributeUpdates' ADD operation), so let's move it to the same file
where similar functions for sets exist.

This patch is almost entirely a code move, but also makes one small
change: list_concatenate() used to throw an exception if one of the
arguments wasn't a list, but the text of this exception was specific to
UpdateExpression. So in the new version, we return a null value in this
case - and the caller checks for it and throws the right exception.
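
The resulting division of labor can be sketched in Python, with `None` standing in for the null value. The caller name and the error text are hypothetical, not Alternator's actual message:

```python
def list_concatenate(a, b):
    """Concatenate two lists; return None if either argument is not a list."""
    if not isinstance(a, list) or not isinstance(b, list):
        return None
    return a + b


def update_expression_add(a, b):
    """Hypothetical caller that supplies its own, context-specific error."""
    result = list_concatenate(a, b)
    if result is None:
        raise ValueError("UpdateExpression: ADD operands must both be lists")
    return result
```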

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-03 10:19:26 +02:00
Nadav Har'El
56eb994d8f alternator: allow Authorization header to be without spaces
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.

At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.

In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
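
A minimal sketch of such a tolerant parser follows. It is not the code in server::verify_signature(); the function name is hypothetical, and it assumes the algorithm name is still separated from the components by a space:

```python
import re


def parse_authorization(header):
    """Split an Authorization header into (algorithm, {key: value})."""
    algorithm, _, rest = header.strip().partition(" ")
    fields = {}
    # Components may be separated by commas, whitespace, or both.
    for part in re.split(r"[,\s]+", rest.strip()):
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed component: {part!r}")
        fields[key] = value
    return algorithm, fields
```

Both the comma-plus-space form and the comma-only form produced by the old boto library parse to the same result.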

Fixes #9568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
2021-11-03 06:38:28 +02:00
Takuya ASADA
4a96a8145e scylla_util.py: remove EBS only ARM instances from support instance list
Since we require ephemeral disks for our AMI, these EBS-only ARM
instances cannot be in its 'supported instance' list.
We are still able to run our AMI on these instance types, but the login message
warns that it is an 'unsupported instance type' and requires running
scylla_io_setup manually.
2021-11-03 10:26:42 +09:00
Takuya ASADA
4e8060ba72 scylla_io_setup: support ARM instances on AWS
Add preset parameters for AWS ARM instances.

Fixes #9493
2021-11-03 10:26:42 +09:00
Avi Kivity
b3d5651fd7 Update seastar submodule
* seastar 083898a172...a189cdc45d (7):
  > print: deprecate print() family
  > treewide: replace uses of fmt_print() and fprint() with direct fmt calls
  > circular_buffer: mark clear noexcept
  > circular_buffer: mark trivial methods noexcept
  > Merge "file: allow destroying append_challenged_posix_file_impl following a close failure" from Benny
  > merge: Add parsing HTTP response status
  > inet_address: fix usage of `htonl` for clang
2021-11-02 19:26:09 +02:00
garanews
7a6a59eb7c fix some typo in docs
Closes #9510
2021-11-02 19:59:16 +03:00
Botond Dénes
6ad0a2989c compaction/scrub: segregate input only in segregate mode
scrub_compaction assumes that `make_interposer_consumer()` is called
only when `use_interposer_consumer()` returns true. This is false, so in
effect scrub always ends up using the segregating interposer. Fix this
by short-circuiting the former method when the latter returns false,
returning the passed-in consumer unchanged.

Tests: unit(dev)

Fixes #9541

Closes #9564
2021-11-02 15:25:22 +02:00
Avi Kivity
15a80bb5ce Update tools/jmx submodule
* tools/jmx 5c383b6...48d37f3 (1):
  > StorageService: scrub: fix scrubMode is empty condition

Ref scylladb/scylla-jmx#180.
2021-11-02 15:21:31 +02:00
Avi Kivity
2f23d22739 Merge 'Scrub compaction: add a new in-memory partition segregation method' from Botond Dénes
The current disk-based segregation method works well enough for most cases, but it struggles greatly when there are a lot of partitions in the input. When this is the case it will produce tons of buckets (sstables), in the order of hundreds or even thousands. This puts a huge strain on different parts of the system.
This series introduces a new segregation method which specializes in the case of many small partitions. If the conditions are right, it can cause a drastic reduction of buckets. In one case I tested, a 1.1GB sstable with 3.6M partitions in it produced just 2 output sstables, down from the 500+ with the on-disk method.
This new method uses a memtable to sort out-of-order partitions. In-order partitions bypass this sorting altogether and go to the disk directly. This method is not suitable for cases where either the partitions are large or the total amount of data is large. For those the disk-based method should be used. Scrub compaction decides on the method to use based on heuristics.

Tests: unit(dev)

Closes #9548

* github.com:scylladb/scylla:
  compaction: scrub_compaction: add bucket count to finish message
  test/boost: mutation_writer_test: harden the partition-based segregator test
  mutation_writer: remove now unused on-disk partition segregator
  compaction,test: use the new in-memory segregator for scrub
  mutation_writer/partition_based_splitting_writer: add memtable-based segregator
2021-11-02 15:18:41 +02:00
Tomasz Grabiec
00814dcadc Merge "raft: randomized_nemesis_test: perform cluster reconfigurations" from Kamil
We introduce a new operation to the framework: `reconfiguration`.
The operation sends a reconfiguration request to a Raft cluster. It
bounces a few times in case of `not_a_leader` results.

A side effect of the operation is modifying a `known` set of nodes which
the operation's state has a reference to. This `known` set can then be
used by other operations (such as `raft_call`s) to find the current
leader.

For now we assume that reconfigurations are performed sequentially. If a
reconfiguration succeeds, we change `known` to the new configuration. If
it fails, we change `known` to be the set sum of the previous
configuration and the current configuration (because we don't know what
the configuration will eventually be - the old or the attempted one - so
any member of the set sum may eventually become a leader).
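
The update rule for `known` can be sketched as below; the function name is hypothetical and nodes are represented by arbitrary hashable ids:

```python
def update_known(known, attempted, succeeded):
    """Return the new `known` set after a reconfiguration attempt."""
    known, attempted = set(known), set(attempted)
    # On failure either configuration may end up in effect, so keep both.
    return attempted if succeeded else known | attempted
```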

We use a dedicated thread (similarly to the network partitioning thread)
to periodically perform random reconfigurations.

* kbr/reconfig-v2:
  test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test
  test: raft: randomized_nemesis_test: improve the bouncing algorithm
  test: raft: randomized_nemesis_test: handle more error types
  test: raft: randomized_nemesis_test put `variant` and `monostate` `ostream` `operator<<`s into `std` namespace
  test: raft: randomized_nemesis_test: `reconfiguration` operation
2021-11-02 13:55:45 +01:00
Botond Dénes
eaf4454ac8 compaction: scrub_compaction: add bucket count to finish message
It is useful to know how many buckets (output sstables) scrub produced
in total. The end compaction message will only report those still open
when the scrub finished, but will omit those that were closed in the
middle.
2021-11-02 12:24:37 +02:00
Botond Dénes
e4e369053b test/boost: mutation_writer_test: harden the partition-based segregator test
Test both methods: the "old" disk-based one and the recently added
in-memory one, with different configurations and also add additional
checks to ensure they don't lose data.
2021-11-02 12:24:37 +02:00
Botond Dénes
74f2290e49 mutation_writer: remove now unused on-disk partition segregator
Also removes related tests, including the exception safety test which
just spins forever with the memtable method.
2021-11-02 12:24:33 +02:00
Michał Radwański
07e78807e6 range_tombstone_list: add lower_slice
lower_slice returns the range tombstones which have end inside range
[start, before).
2021-11-02 10:50:31 +01:00
Botond Dénes
f2f529855d compaction,test: use the new in-memory segregator for scrub 2021-11-02 09:00:44 +02:00
Botond Dénes
18599f26fa mutation_writer/partition_based_splitting_writer: add memtable-based segregator
The current method of segregating partitions doesn't work well for a huge
number of small partitions. For especially bad input, it can produce
hundreds or even thousands of buckets. This patch adds a new segregator
specialized for this use-case. This segregator uses a memtable to sort
out-of-order partitions in-memory. When the memtable size reaches the
provided max-memory limit, it is flushed to disk and a new empty one is
created. In-order partitions bypass the sorting altogether and go to the
fast-path bucket.
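
The idea can be sketched with plain lists. This is a simplification and not the actual writer: partitions are reduced to comparable keys, the memory limit is counted in buffered keys rather than bytes, and a sorted list stands in for the memtable.

```python
def segregate(keys, max_buffered):
    """Split a key stream into sorted buckets; the first is the fast path."""
    fast_path = []   # in-order keys, written straight through
    flushed = []     # sorted buckets produced by flushing the buffer
    buffer = []      # stand-in for the memtable holding out-of-order keys
    for key in keys:
        if not fast_path or key > fast_path[-1]:
            fast_path.append(key)
        else:
            buffer.append(key)
            # Flush once the (simplified) memory limit is reached.
            if len(buffer) >= max_buffered:
                flushed.append(sorted(buffer))
                buffer = []
    if buffer:
        flushed.append(sorted(buffer))
    return [fast_path] + flushed
```

Every returned bucket is sorted, and the number of buckets grows with the amount of out-of-order data divided by the buffer limit rather than with the number of out-of-order partitions.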

The new method is not used yet, this will come in the next patch.
2021-11-02 08:23:16 +02:00
Asias He
9e8fc63585 repair: Fix range intersection for start_token and end_token option
The range::subtract() was used as a trick to implement
range::intersection() when intersection was not available at that time.

The following code is problematic:

dht::token_range given_range_complement(tok_end, tok_start);

because dht::token_range should be non-wrapping range.

To fix, use compat::unwrap_into() to unwrap the range and use
the range::intersection() to calculate the intersection.
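
The unwrap-then-intersect approach can be sketched as follows. Ranges are modeled as `(start, end)` tuples meaning the interval (start, end], and the helper names are hypothetical rather than the real dht or compat API:

```python
import math


def unwrap(start, end):
    """Split a possibly wrapping (start, end] range into non-wrapping ones."""
    if start < end:
        return [(start, end)]
    # A wrapping range covers everything up to `end` and everything past `start`.
    return [(-math.inf, end), (start, math.inf)]


def intersection(a, b):
    """Intersection of two non-wrapping ranges, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def intersect_all(ranges, start, end):
    """Intersect each local range with the (unwrapped) requested range."""
    result = []
    for r in ranges:
        for piece in unwrap(start, end):
            hit = intersection(r, piece)
            if hit is not None:
                result.append(hit)
    return result
```

Feeding this sketch the local ranges from the examples below gives the same intersections as the log output in both the non-wrapping (5, 100] and the wrapping (100, 5] case.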

Note even if the given_range_complement is problematic, current code
generates correct intersections.

Example 1:
$ curl -X POST 'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=5&endToken=100'

[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=aa71a192-5967-4f05-99b8-5febd9d81d50], options {{ endToken -> 100}, { startToken -> 5}}

[shard 0] repair - start=5, end=100, given_range_complement=(100, 5], wrange=(100, 5], is_wrap_around=true,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={(5, 100]}

[shard 0] repair - New method intersections={(5, 100]}

Example 2:

$ curl -X POST
'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=100&endToken=5'

[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=f6076438-015c-4bdc-8ebd-0a55664365fa], options {{ endToken -> 5}, { startToken -> 100}}

[shard 0] repair - start=100, end=5, given_range_complement=(5, 100], wrange=(5, 100], is_wrap_around=false,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}

[shard 0] repair - New method intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}

Fixes #9560

Closes #9561
2021-11-01 12:43:49 +02:00
Nadav Har'El
e6d17d8de2 test/cql-pytest: remove "xfail" label on a reproducer for a fixed bug
The two cql-pytest tests:
        test_frozen_collection.py::test_wrong_set_order_in_nested
        test_frozen_collection.py::test_wrong_set_order_in_nested_2

Which used to fail, and were therefore marked "xfail", got fixed by commit
5589f348e7 ("cql3: expr: Implement
evaluate(expr::bind_variable)"). That commit made the handling of bound
variables in prepared statements more rigorous, and in particular
made sure that sets are re-sorted not only if they are at the top
level of the value (as happened in the old code), but also if they
are nested inside some other container. This explains the surprising
fact that we could only reproduce the bug with prepared statements, and
only with nested sets - while top-level sets worked correctly.

As the tests no longer failed and the bug tested by them really did
get fixed, in this patch we remove the "xfail" marker from these
tests.

Closes #7856. This issue was really fixed by the aforementioned commit,
but let's close it now.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211029221731.1113554-1-nyh@scylladb.com>
2021-11-01 10:29:32 +02:00
Avi Kivity
1aff7d19c2 treewide: replace seastar::fmt_print() with fmt::print()
We shouldn't be using Seastar as a text formatting library; that's
not its focus. Use fmt directly instead. fmt::print() doesn't return
the output stream which is a minor inconvenience, but that's life.

Closes #9556
2021-11-01 10:05:16 +02:00
Avi Kivity
bd0b573a92 Merge 'Partition based splitting writer exception safety' from Botond Dénes
The partition based splitting writer (used by scrub) was found to be exception-unsafe, converting an `std::bad_alloc` to an assert failure. This series fixes the problem and adds a unit test checking exception safety against `std::bad_alloc`, fixing the other related problems found along the way.

Fixes: https://github.com/scylladb/scylla/issues/9452

Closes #9453

* github.com:scylladb/scylla:
  test: mutation_writer_test: add exception safety test for segregate_by_partition()
  mutation_writer: segregate_by_partition(): make exception safe
  mutation_reader: queue_reader_handle: make abandoned() exception safe
  mutation_writer: feed_writers(): make it a coroutine
  mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement
2021-10-31 21:15:19 +02:00
Takuya ASADA
c9499230c3 docker: add stopwaitsecs
We need stopwaitsecs just like we do TimeoutStopSec=900 on
scylla-server.service, to avoid timeout on scylla-server shutdown.

Fixes #9485

Closes #9545
2021-10-31 20:38:10 +02:00
Vlad Zolotarov
79b0654d60 time_window_compaction_strategy: put expired sstables in a separate compaction task
It's much more efficient to have a separate compaction task that consists
entirely of expired sstables and make sure it gets a unique "weight" than
to mix expired sstables with non-expired sstables, which adds unpredictable
latency to the eviction of an expired sstable.

This change also improves the visibility of eviction events because now
they are always going to appear in the log as compactions that compact into
an empty set.

Fixes #9533

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>

Closes #9534
2021-10-31 17:54:40 +02:00
Nadav Har'El
6ae0ea0c48 alternator: return the correct Content-Type header
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.

Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).

Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries will need to differentiate between the two
reply formats, Alternator better return the right one.

This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.

Fixes #9554.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
2021-10-31 10:50:25 +01:00
Raphael S. Carvalho
2bf47c902e cql: set configurable restriction of DateTieredCompactionStrategy to warn by default
Setting a value of "warn" will still allow the create or alter commands,
but will warn the user with a message that appears both in the log
and in cqlsh, for example.

This is another step towards deprecating DTCS. Users need to know we're
moving towards this direction, and setting the default value to warn
is needed for this. Next step is to set it to false, and finally remove
it from the code base.

Refs #8914.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211029184503.102936-1-raphaelsc@scylladb.com>
2021-10-31 09:28:17 +02:00
Nadav Har'El
034f79cfb4 alternator: make api_error an std::exception
Objects of type "api_error" are used in Alternator when throwing an
error which will be reported as-is to the user as part of the official
DynamoDB protocol.

Although api_error objects are often thrown, the api_error class was not
derived from std::exception, because that's not necessary in C++.
However, it is *useful* for this exception to derive from std::exception,
so this is what this patch does. It is useful for api_error to inherit
from std::exception because then our logging and debugging code knows
how to print this exception with all its details. All we need to do is
to implement a what() virtual function for api_error.

Before this patch, logging an api_error just logs the type's name
(i.e., the string "api_error"). After this patch, we get the full
information stored in the api_error - the error's type and its message.
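
The effect can be sketched as follows. This is a minimal illustration, not
Alternator's actual class: the name `api_error_sketch`, its member names and
the "type: message" format are assumptions made for the example.

```cpp
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-in for Alternator's api_error; the field names and the
// "type: message" format are illustrative assumptions.
class api_error_sketch : public std::exception {
    std::string _type;
    std::string _msg;
    std::string _what;  // built once so that what() can stay noexcept
public:
    api_error_sketch(std::string type, std::string msg)
        : _type(std::move(type))
        , _msg(std::move(msg))
        , _what(_type + ": " + _msg) {}

    // Generic code that only knows std::exception now sees the full details.
    const char* what() const noexcept override { return _what.c_str(); }
};

// Mimics a generic logging path that accepts any std::exception.
inline std::string describe(const std::exception& e) {
    return e.what();
}
```

Any handler that catches `const std::exception&` can then report the error
type and message without knowing about the concrete class.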

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211017150555.225464-1-nyh@scylladb.com>
2021-10-29 10:23:55 +03:00
Nadav Har'El
666017f2f0 Merge 'Convert last uses of sprint() to fmt::format()' from Avi Kivity
sprint() uses the printf-style formatting language while most of our
code uses the Python-derived format language from fmt::format().

The last mass conversion of sprint() to fmt (in 1129134a4a)
missed some callers (principally those that were on multiple lines, and
so the automatic converter missed them). Convert the remainder to
fmt::format(), and some sprintf() and printf() calls, so we have just
one format language in the code base. Seastar::sprint() ought to be
deprecated and removed.

Test: unit (dev)

Closes #9529

* github.com:scylladb/scylla:
  utils: logalloc: convert debug printf to fmt::print()
  utils: convert fmt::fprintf() to fmt::print()
  main: convert fprint() to fmt::print()
  compress: convert fmt::sprintf() to fmt::format()
  tracing: replace seastar::sprint() with fmt::format()
  thrift: replace seastar::sprint() with fmt::format()
  test: replace seastar::sprint() with fmt::format()
  streaming: replace seastar::sprint() with fmt::format()
  storage_service: replace seastar::sprint() with fmt::format()
  repair: replace seastar::sprint() with fmt::format()
  redis: replace seastar::sprint() with fmt::format()
  locator: replace seastar::sprint() with fmt::format()
  db: replace seastar::sprint() with fmt::format()
  cql3: replace seastar::sprint() with fmt::format()
  cdc: replace seastar::sprint() with fmt::format()
  auth: replace seastar::sprint() with fmt::format()
2021-10-28 22:33:23 +03:00
Piotr Sarna
0b11771731 alternator: decouple auth from CQL query processor
Alternator auth module used to piggy-back on top of CQL query processor
to retrieve authentication data, but it's no longer the case.
Instead, storage proxy is used directly.

Closes #9538
2021-10-28 21:55:56 +03:00
Jan Ciolek
fd1596171e cql3: expr: Add evaluate_IN_list(expression, options)
evaluate_IN_list was only defined for a term,
but now we are removing term so it should be also defined for an expression.
The internal code is the same - this function used to convert the term to expression
and then did all operations on expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 20:55:09 +02:00
Jan Ciolek
805ba145d7 cql3: Remove term in column_condition
Replace all uses of term with expression in cql3/column_condition

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 20:55:09 +02:00
Jan Ciolek
a24d06c195 cql3: Remove term in select_statement
Replace all uses of term with expression in cql3/statements/select_statement

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 20:55:09 +02:00
Jan Ciolek
d36847801b cql3: Remove term in update_statement
Replace all uses of term with expression in cql3/statements/update_statement

There was some trouble with extracting values from json.
The original code worked this way on a map example:
> There is a json string to parse: {'b': 1, 'a': 2, 'b': 3}
> The code parses the json and creates bytes where this map is serialized
  but without removing duplicates, sorting etc.
> Then a maps::delayed_value is created from these bytes.
  During creation map elements are extracted, sorted and duplicates are removed.
  This map value is then used in setter

Now when maps::delayed_value is changed to expr::constant the step where elements are sorted is lost.
Because of this we need to do this earlier, the best place is during original json parsing.

Additionally I suspect that removing duplicated elements used to work only on the first level, in case of map of maps it wouldn't work.
Now it will work no matter how many layers of maps there are.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 20:55:03 +02:00
Jan Ciolek
ba202cd8bd cql3: Use internal cql format in insert_prepared_json_statement cache
expr::constant is always serialized using the internal cql serialization format,
but currently the code keeps values in the cache in another format.

As preparation for moving from term to expression change
so that values kept in the cache are serialized using
the internal format.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 20:42:43 +02:00
Benny Halevy
a2fc3345bd storage_service: futurize storage_service::describe_ring
Convert storage_service::describe_ring to a coroutine
to prevent reactor stalls as seen in #9280.

Fixes #9280
Closes #9282

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #9282
2021-10-28 16:51:57 +03:00
Avi Kivity
1e1e4f4934 Update abseil submodule
* abseil 9c6a50f...f70eada (122):
  > Fix over-aligned layout test with older gcc compilers (#1049)
  > Export of internal Abseil changes
  > Initial support for Haiku (#1045)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Remove bazelbuild/rules_cc dependency (#1038)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Use FreeBSD macro definition for ElfW macro for compatibility. (#1037)
  > Export of internal Abseil changes
  > Fix hashing on big endian platforms (#1028)
  > Fix typedef of sig_t on AIX (#1030)
  > Export of internal Abseil changes
  > Fixed typo `constuct` to `construct` in 3 places. (#1022)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Initial support for AIX (#1021)
  > Export of internal Abseil changes
  > Update from_chars documentation with regard to whitespace (#1020)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Include immintrin.h instead of wmmintrin.h (#1015)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add -Wno-unknown-warning-option to ABSL_LLVM_FLAGS to disable warnings on unknown warning flags. (#1008)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing ABSL_DLL for a few functions (#1002)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Simplifies the construction of the value returned by GenerateRealFromBits() (#994)
  > CMake: option to use cxx_std_11 (minimum) that propagates. (#986)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix Bazel build on aarch64 (#984)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > CMake: add option to use Google Test already installed on system (#969)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Use CMAKE_INSTALL_FULL_{LIBDIR,INCLUDEDIR}. (#963)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Uses alignas for portability in dynamic_annotations.h (#947)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Call FailureSignalHandlerOptions.writenfn with nullptr at the end (#938)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing `add_subdirectory()` call for "cleanup" (#925)
  > Allowing to change the MSVC runtime (#921)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix C++/CLI build problem (#916)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for more Linux architectures (#904)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for m68k (#900)
  > Add support for sparc and sparc64 (#899)
  > Fix uc_mcontext register access on 32-bit PowerPC (#898)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
2021-10-28 16:22:18 +03:00
Jan Ciolek
e5391f1eed types: Add map_type_impl::serialize(range of <bytes, bytes>)
Adds two functions that take a range over pairs of serialized values
and return a serialized map value.

There are 2 functions - one operating on bytes and one operating on managed_bytes.
The version with managed_bytes is used in expression.cc, used to be a local static function.
The bytes version will be used in type_json.cc in the next commit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Jan Ciolek
1502abaca1 cql3: Remove term in cql3/attributes
Replace all uses of term with expression in cql3/attributes

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Jan Ciolek
a82351dc79 cql3: expr: Add constant::view() method
Add a method that returns raw_value_view to expr::constant.

It's added for convenience - without it in many places
we would have to write my_value.value.to_view().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Jan Ciolek
c2eb3a58b8 cql3: expr: Implement fill_prepare_context(expression)
Adds a new function - expr::fill_prepare_context.
This function has the same functionality as term::fill_prepare_context, which will be removed soon.

fill_prepare_context used to take its argument with a const qualifier, but it turns out that the argume>
It sets the cache ids of function calls corresponding to partition key restrictions.
New function doesn't have const to make this clear and avoid surprises.

Added expr::visit that takes an argument without const qualifier.

There were some problems with cache_ids in function_call.
prepare_context used to collect ::shared_ptr<functions::function_call>
of some function call, and then this allowed it to clear
cache ids of all involved functions on demand.

To replicate this prepare_context now collects
shared pointers to expr::function_call cache ids.

It currently collects both, but functions::function_call will be removed soon.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Jan Ciolek
edaa3b5dc2 cql3: expr: add expr::visit that takes a mutable expression
Currently expr::visit can only take a const expression as an argument.
For cases where we want to visit the expression and modify it a new function is needed.
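
A rough sketch of the idea, using `std::variant` as a stand-in for the real
expression type. All names here (`expression_sketch`, `visit_sketch`,
`negate_constants`) and the argument order are hypothetical and are not the
actual expr:: API.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical, tiny expression type; the real one has many more alternatives.
struct constant_sketch { int value; };
struct column_sketch { std::string name; };
using expression_sketch = std::variant<constant_sketch, column_sketch>;

// Read-only visit: the visitor receives const references to the node.
template <typename Visitor>
decltype(auto) visit_sketch(Visitor&& v, const expression_sketch& e) {
    return std::visit(std::forward<Visitor>(v), e);
}

// Mutable overload: the visitor receives non-const references and may modify
// the visited node in place.
template <typename Visitor>
decltype(auto) visit_sketch(Visitor&& v, expression_sketch& e) {
    return std::visit(std::forward<Visitor>(v), e);
}

// Example of a modifying visitor: flips the sign of constants, leaves
// other node kinds untouched.
inline void negate_constants(expression_sketch& e) {
    visit_sketch([](auto& node) {
        using T = std::decay_t<decltype(node)>;
        if constexpr (std::is_same_v<T, constant_sketch>) {
            node.value = -node.value;
        }
    }, e);
}
```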

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Jan Ciolek
9c40516071 cql3: expr: Add receiver to expr::bind_variable
bind_variable used to have only the type of bound value.
Now this type is replaced with receiver, which describes information about column corresponding to this value.
A receiver contains type, column name, etc.

Receiver is needed in order to implement fill_prepare_context in the next commit.
It's an argument of prepare_context::add_variable_specification.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-10-28 15:14:52 +02:00
Benny Halevy
531e32957d compaction: time_window_compaction_strategy: get_reshaping_job: consider disjointness only when trimming
With 062436829c,
we return all input sstables in strict mode
if they are disjoint even if they don't need reshaping at all.

This leads to an infinite reshape loop when uploading sstables
with TWCS.

The optimization for disjoint sstables is worth it
also in relaxed mode, so this change first makes sorting of the input
sstables by first_key order independent of reshape_mode,
and then it adds a check for sstable_set_overlapping_count
before trimming either the multi_window vector or
any single_window bucket such that we don't trim the list
if the candidates are disjoint.

Adjust twcs_reshape_with_disjoint_set_test accordingly.

And also add some debug logging in
time_window_compaction_strategy::get_reshaping_job
so one can figure out what's going on there.

Test: unit(dev)
DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>
2021-10-28 14:35:51 +03:00
Pavel Emelyanov
3c0a48ac38 Merge 'cdc stream generation: dont abandon stream description rewrite future' from Avi Kivity
The cdc stream rewrite code launches work in the background. It's careful
to make sure all the dependencies are preserved via shared pointers, but
there are implicit dependencies (system_keyspace) that are not. To fix this
problem, this series changes the lifetime of the background work from
unbounded to be bounded by the lifetime of storage_service.

[ xemul: Not the storage_service, but the cdc_generation_service, now
  all the dances with rewriting and its futures happen inside the cdc
  part. Storage_service coroutinization and call to .stop() from main
  are here for several reasons
   - They are awesome
   - Splitting a PR into two is frustrating (as compared to ML threads)
   - Shia LaBeouf is a good motivational speaker
]

As explained in the patches, I don't expect this to be a real problem and the
series just paves the way for making system_keyspace an explicit
dependency.

Test: unit (dev)

Closes #9356

* github.com:scylladb/scylla:
  cdc: don't allow background streams description rewrite to delay too far
  storage_service: coroutinize stop()
  main: stop storage_service on shutdown
2021-10-28 12:36:36 +03:00
Botond Dénes
7c95bd3343 Merge 'Rename 'system.status' and 'system.describe_ring' virtual tables' from Avi Kivity
'system.status' and 'system.describe_ring' are imperfect names for
what they do, so rename them. Fortunately they aren't exposed in any
released version so there is no compatibility concern.

Closes #9530

* github.com:scylladb/scylla:
  system_keyspace: rename 'system.describe_ring' to 'system.token_ring'
  system_keyspace: rename 'system.status' to 'system.cluster_status'
2021-10-28 11:46:20 +03:00
Avi Kivity
c30be50252 utils: logalloc: convert debug printf to fmt::print()
Standardize on one format language.
2021-10-28 10:48:08 +03:00
Takuya ASADA
13ffe3c094 scylla_util.py: detect ephemeral/EBS disks correctly on Nitro System
Currently, aws_instance.ephemeral_disks() returns both ephemeral disks
and EBS disks on Nitro System.
Because both are attached as NVMe disks, we need to add disk
type detection to the NVMe handling logic.

Fixes #9440

Closes #9462
2021-10-28 08:58:25 +03:00
Piotr Sarna
f4cb8191fa cql3: include system distributed tables in system stats
Some time ago we started gathering stats for system tables in a separate
class in order to be able to distinguish which queries come from the
user - e.g. if the unpaged queries are internal or not.
Originally, only local system tables were moved into this class,
i.e. system and system_schema. It would make sense, however, to also
include other internal keyspaces in this separate class - which includes
system_distributed, system_traces, etc.

Fixes #9380

Closes #9490
2021-10-28 08:58:25 +03:00
Avi Kivity
5e6e4aed53 Merge 'Add Scylla Sphinx Theme 1.0' from David Garcia
Replaces https://github.com/scylladb/scylla/pull/9477

Related issue https://github.com/scylladb/sphinx-scylladb-theme/issues/133

Sphinx ScyllaDB Theme 1.0 is now released 🥳

We’ve made a number of updates to the look and feel of the theme to improve the overall user experience.
You can read more about all notable changes [here](https://sphinx-theme.scylladb.com/stable/CHANGELOG#september-2021).

This PR also cleans the file ``conf.py``, removing several unused options.

1. Clone this PR. For more information, see [Cloning pull requests locally](https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally).

2. Enter the docs folder, and run:

```
make preview
```

3. Open http://127.0.0.1:5500/ with your favorite browser. You will see the docs with the new look and feel.

Closes #9515

* github.com:scylladb/scylla:
  Review docs config
  fix runtime errors
  upgrade theme to v1.x
2021-10-28 08:58:25 +03:00
Raphael S. Carvalho
affa1d9b04 utils/estimated_histogram.hh: fix division-by-zero in mean()
if mean() is called when there are no elements in the histogram,
a runtime error will happen due to division-by-zero.
approx_exponential_histogram::mean() handles it but for some
reason we forgot to do the same for estimated_histogram.

this problem was found when adding a unit test which calls
mean() on an empty histogram.
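
A minimal sketch of the kind of guard described above. The struct below is a
simplified, hypothetical histogram, not the real estimated_histogram, and
returning 0 for the empty case is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified histogram: parallel vectors holding a
// representative value and an element count for each bucket.
struct histogram_sketch {
    std::vector<int64_t> bucket_values;
    std::vector<int64_t> bucket_counts;

    int64_t mean() const {
        int64_t elements = 0;
        int64_t sum = 0;
        for (std::size_t i = 0; i < bucket_counts.size(); ++i) {
            elements += bucket_counts[i];
            sum += bucket_counts[i] * bucket_values[i];
        }
        // Guard: with no elements, return 0 (an assumed convention) instead
        // of dividing by zero.
        if (elements == 0) {
            return 0;
        }
        return sum / elements;
    }
};
```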

Fixes #9531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211027142813.56969-1-raphaelsc@scylladb.com>
2021-10-28 08:58:25 +03:00
Benny Halevy
b79e9b7396 tools: scylla-sstable: improve error reporting when loading schema from file
Throw a proper exception from do_load_schemas if parse_statements
fails to parse the schema cql.

Catch it in scylla-sstable main() function so
it won't be reported as seastar - unhandled exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027124032.1787347-1-bhalevy@scylladb.com>
2021-10-28 08:58:25 +03:00
Avi Kivity
5ea0940ca9 system_keyspace: rename 'system.describe_ring' to 'system.token_ring'
Table names are usually nouns, so SELECT/INSERT statements
sound natural: "SELECT * FROM pets". 'system.describe_ring' defies
this convention. Rename it to 'system.token_ring' so selects are natural.

The name is not in any released version, so we can safely rename it.
2021-10-27 17:32:37 +03:00
Avi Kivity
5b21e4eb83 system_keyspace: rename 'system.status' to 'system.cluster_status'
'system.status' is too generic, it doesn't explain the status of what.
'system.node_status' is also ambiguous (this node? all nodes?) so I
picked 'system.cluster_status'.

The internal name, nodetool_status_table, was even worse (we're
not querying the status of nodetool!) but fortunately wasn't exposed.

The name is not in any released version, so we can safely rename it.
2021-10-27 17:31:45 +03:00
Avi Kivity
379454c235 utils: convert fmt::fprintf() to fmt::print()
Standardizing on a common format language.
2021-10-27 17:02:00 +03:00
Avi Kivity
7bcc0b8d8b main: convert fprint() to fmt::print()
fprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
0131ae6b5d compress: convert fmt::sprintf() to fmt::format()
Standardize on one format language.
2021-10-27 17:02:00 +03:00
Avi Kivity
3a0f2091d7 tracing: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
d1616a7643 thrift: replace seastar::sprint() with fmt::format()
sprint() is obsolete. Note InvalidRequestException used sprint() with
runtime format, so both it and its callers were updated.
2021-10-27 17:02:00 +03:00
Avi Kivity
27a2c74b64 test: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
7abd105d79 streaming: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
16f2eadfd0 storage_service: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
2fb406138c repair: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
bfa4535ba5 redis: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
36919a4ed7 locator: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
d9d03383fa db: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Avi Kivity
9424f6e12f cql3: replace seastar::sprint() with fmt::format()
sprint() is obsolete. Note some calls were to helper functions that
use sprint(), not to sprint() directly, so both the helpers and
the callers were modified.
2021-10-27 17:02:00 +03:00
Avi Kivity
6b02aa72e2 cdc: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 14:30:06 +03:00
Avi Kivity
b9cc9bad4c auth: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 14:29:32 +03:00
Botond Dénes
9ec55e054d treewide: distinguish truncated frame errors
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
  transport/server.cc;

When such an exception is thrown, it is impossible to tell where it was
thrown from and it doesn't have any further information contained in it
(beyond the basic information it being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).
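
As a rough illustration of the approach (the helper names, the location labels
and the message wording below are hypothetical, not the strings used by the
patch):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds an error whose message names the throw site
// and records the expected and actual frame sizes.
inline std::runtime_error truncated_frame_error(const std::string& where,
                                                std::size_t expected,
                                                std::size_t actual) {
    return std::runtime_error(
        "Truncated frame in " + where + ": expected " +
        std::to_string(expected) + " bytes, got " + std::to_string(actual));
}

// Throws if fewer bytes arrived than the frame header promised.
inline void check_frame(const std::string& where,
                        std::size_t expected,
                        std::size_t actual) {
    if (actual < expected) {
        throw truncated_frame_error(where, expected, actual);
    }
}
```

With a distinct `where` label per call site, a log line alone is enough to
tell which of the two code paths rejected the frame.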

Ref: #9482

Closes #9520
2021-10-27 12:27:16 +02:00
Alejo Sanchez
0a63e72fa4 api: (minor) fix typo bool instead of boolean
In definition for /column_family/major_compaction/{name} there is an
incorrect use of "bool" instead of "boolean".

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #9516
2021-10-27 12:25:59 +02:00
Benny Halevy
a21b1fbb2f large_data_handle: add sstable name to log messages
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.

Fixes #9524

Test: unit(dev)
DTest: wide_rows_test.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
2021-10-27 10:53:11 +03:00
Botond Dénes
6a76e12768 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakiness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2021-10-26 20:40:31 +02:00
Benny Halevy
5f513ed28b view_builder: consumer: flush_fragments: close reader on error
Make sure to close the reader created by flush_fragments
if an exception occurs before it's moved to `populate_views`.

Note that it is also ok to close the reader _after_ it has been
moved, in case populate_views itself throws after closing the
reader that was moved into it. For convenience, flat_mutation_reader::close
supports close-after-move.

Fixes #9479

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024164138.1100304-1-bhalevy@scylladb.com>
2021-10-24 19:53:31 +03:00
Benny Halevy
4062cd17e0 test: hashers_test: mutation_fragment_sanity_check: stop semaphore
To stop the semaphore as required we need to run
the test in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024053402.990142-1-bhalevy@scylladb.com>
2021-10-24 11:29:23 +03:00
David Garcia
ff56b7e43e Review docs config 2021-10-22 13:34:56 +01:00
Botond Dénes
0d744fd3fa test: mutation_writer_test: add exception safety test for segregate_by_partition() 2021-10-21 06:50:22 +03:00
Botond Dénes
2ca6552909 mutation_writer: segregate_by_partition(): make exception safe
Close reader if feed_writer() fails in the setup phase.
2021-10-21 06:50:22 +03:00
Botond Dénes
6c8e98e33d mutation_reader: queue_reader_handle: make abandoned() exception safe
Allocating the exception might fail, terminating the application as
`abandoned()` is called in a noexcept context. Handle this case by
catching the bad-alloc and aborting the reader with that instead when
this happens.
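
The pattern can be sketched like this. It is a generic illustration under
assumptions, not the code of `queue_reader_handle::abandoned()`: the function
name and the callable parameter are hypothetical.

```cpp
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical sketch: obtain the "abandoned" reason inside a noexcept
// function. `make` is any callable returning std::exception_ptr; if it
// throws std::bad_alloc, that bad_alloc becomes the reason instead of
// escaping and terminating the program.
template <typename MakeException>
std::exception_ptr make_abandon_reason(MakeException make) noexcept {
    try {
        return make();
    } catch (const std::bad_alloc&) {
        // Fall back to the allocation failure itself.
        return std::current_exception();
    }
}
```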
2021-10-21 06:50:22 +03:00
Botond Dénes
de55ab571b mutation_writer: feed_writers(): make it a coroutine
The current code leaks exceptional futures. Instead of attempting to
fix, just convert to cleaner and exception-safe coroutines.
2021-10-21 06:50:22 +03:00
Botond Dénes
40ca728a20 mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement
So we don't attempt to close already closed bucket again in
`partition_based_splitting_writer::close()`.
2021-10-21 06:50:22 +03:00
Michał Radwański
9caf85f64a partition_snapshot_reader: do not accidentally copy schema
Functions `upper_bound` and `lower_bound` had signatures:
```
template<typename T, typename... Args>
static rows_iter_type lower_bound(const T& t, Args... args);
```
This caused a decay from `const schema&` to `schema` as one of the args,
which in turn copied the schema in a fair number of the queries. Fix
that by setting the parameter type to `Args&&`, which doesn't discard
the reference.
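
The decay can be reproduced in isolation. The sketch below is only an
illustration: `counted` is a hypothetical stand-in for `schema`, and
`by_value` / `by_forwarding_ref` are hypothetical helpers, not the actual
`lower_bound` / `upper_bound` overloads. It counts how many copies each
signature makes when handed a const lvalue.

```cpp
#include <utility>

// Hypothetical stand-in for `schema`: counts copy constructions.
struct counted {
    static inline int copies = 0;
    counted() = default;
    counted(const counted&) { ++copies; }
};

// Mirrors the old signature: `Args...` deduces `counted` from a
// `const counted&` argument, so the argument is copied.
template <typename... Args>
void by_value(Args... args) { ((void)args, ...); }

// Mirrors the fixed signature: `Args&&...` deduces `const counted&`,
// so the reference is preserved and nothing is copied.
template <typename... Args>
void by_forwarding_ref(Args&&... args) { ((void)args, ...); }

// Returns {copies made by by_value, copies made by by_forwarding_ref}.
inline std::pair<int, int> count_copies() {
    const counted s{};
    counted::copies = 0;
    by_value(s);
    int value_copies = counted::copies;
    counted::copies = 0;
    by_forwarding_ref(s);
    int ref_copies = counted::copies;
    return {value_copies, ref_copies};
}
```

Under these assumptions the by-value variant copies once per call and the
forwarding-reference variant does not copy at all.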

Fixes #9502

Closes #9507
2021-10-20 19:09:08 +03:00
Avi Kivity
a9951588b4 Update seastar submodule
* seastar 994b4b5a0c...083898a172 (24):
  > Revert "memory: always allocate buf using "malloc" for non reactor"
  > Revert dpdk update to 21.08.
  > tutorial: Fix typos
  > queue: add back template requirement for element type to be nothrow move-constructible
  > Revert "queue: require element type to be nothrow move-constructible"
  > build: add the closing "-Wl,--no-whole-archive" to the ldflags
  > build: add -Wno-error=volatile to CXX_FLAGS
  > build: Include dpdk as a single object in libseastar.a
  > Merge: queue: cleanup exception handling
  > build: drop dpdk-specific machine architecture names
  > reactor: call memory::configure() before initialize dpdk
  > core/loop: parallel_for_each(): make entire function critical alloc section
  > Merge 'scheduling groups: Add compile parameter for setting max scheduling groups count at compile time' from Eliran Sinvani
  > test: coroutines_test: assign spinner lambda to local variable
  > shared_ptr: mark shared_from_this functions noexcept
  > lw_shared_ptr: mark shared_from_this functions noexcept
  > build: update download URL for Boost
  > Merge "build: build with dpdk v21.08" from Kefu
  > cpu_stall_detector: handle wraparounds in Linux perf_event ring buffer
  > entry_point.cc: default-initialize sigaction struct
  > reactor: s/gettid()/syscall(SYS_gettid)/
  > memory: always allocate buf using "malloc" for non reactor
  > Revert "memory: always allocate buf using "malloc" for non reactor"
  > memory: always allocate buf using "malloc" for non reactor
2021-10-20 18:38:18 +03:00
Benny Halevy
0746b5add6 storage_service: replicate_to_all_cores: update all keyspaces
Currently we update the effective_replication_map
only on non-system keyspace, leaving the system keyspace,
that uses the local replication strategy, with the empty
replication_map, as it was first initialized.

This may lead to a crash when get_ranges is called later
as seen in #9494 where get_ranges was called from the
perform_sstable_upgrade path.

This change updates the effective_replication_map on all
keyspaces rather than just on the non-system ones
and adds a unit test that reproduces #9494 without
the fix and passes with it.

Fixes #9494

Test: unit(dev), database_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>
2021-10-20 17:54:23 +03:00
Calle Wilund
940058d25a transport::server: Handle nested exceptions in cql execution/query
Fixes #9491

CQL server, when encountering a "general" exception (i.e. not thrown by
cql error checks), reports a wire error with simply the what() part of
exception. However, if we have nested exceptions, we will most likely
lose info here (hello encryption).

General exception case should unwind exception and give back full,
concatenated message to avoid confusion.
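
A minimal sketch of such unwinding, using only the standard library. The
helper name `full_message` and the ": " separator are assumptions made for
illustration; this is not the code used by the CQL server.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: joins what() of an exception with what() of every
// exception nested inside it, outermost first.
inline std::string full_message(const std::exception& e) {
    std::string msg = e.what();
    try {
        std::rethrow_if_nested(e);   // does nothing if nothing is nested
    } catch (const std::exception& inner) {
        msg += ": " + full_message(inner);
    } catch (...) {
        msg += ": <unknown>";
    }
    return msg;
}

// Builds a two-level nested exception and returns its concatenated message.
inline std::string demo_nested_message() {
    try {
        try {
            throw std::runtime_error("key not found");
        } catch (...) {
            std::throw_with_nested(std::runtime_error("encryption failed"));
        }
    } catch (const std::exception& e) {
        return full_message(e);
    }
    return {};
}
```

Reporting only `what()` of the outer exception would give "encryption failed"
here and drop the inner cause.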

Closes #9492
2021-10-20 17:54:17 +03:00
Nadav Har'El
e4a6569258 config: experimental flag UNUSED_CDC shouldn't be distinct from UNUSED
When an experimental feature graduates from being experimental, we want
to continue to allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.

When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.

So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
2021-10-20 17:54:17 +03:00
Nadav Har'El
88afcc7fe3 Merge 'cql-pytest: Forbid deletions based on secondary index' from Piotr Sarna
This series fixes a bug which allowed using a secondary index in a restriction for a DELETE statement, which resulted in generating incorrect slices and deleting the whole partition instead. Secondary indexes are not meant to be used for deletes, which this series enforces by marking the indexes as not queriable.

It also comes with a reproducing test case, originally provided by @fee-mendes (thanks!).

Fixes #9495
Tests: unit(release)

Closes #9496

* github.com:scylladb/scylla:
  cql-pytest: add reproducer for deleting based on secondary index
  cql3: forbid querying indexes for deletions
2021-10-20 17:54:17 +03:00
Botond Dénes
995a41d422 test/perf/perf_sstable: add support for compaction strategies
So the compaction perf of different compaction strategies can be
compared. Data timestamps are diversified such that they fall into four
different buckets if TWCS is used, in order to be able to stress the
timestamp based splitting code path.

Closes #9488
2021-10-20 17:54:17 +03:00
Benny Halevy
dc091fc952 effective_replication_map, abstract_replication_strategy: get_ranges: call on_internal_error in empty sorted_tokens case
Accessing tm.sorted_tokens().back() causes undefined behavior
if tm.sorted_tokens is empty.

Check that first and throw/abort using on_internal_error
in this case.
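
The shape of the check, as a sketch. `report_internal_error` is a
hypothetical stand-in that always throws (the real on_internal_error may
throw or abort, as noted above), and `long` stands in for the token type.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for on_internal_error(): here it always throws.
[[noreturn]] inline void report_internal_error(const char* msg) {
    throw std::logic_error(msg);
}

// Returns the last token, refusing to call back() on an empty vector,
// which would be undefined behavior.
inline long last_token(const std::vector<long>& sorted_tokens) {
    if (sorted_tokens.empty()) {
        report_internal_error("sorted_tokens is empty");
    }
    return sorted_tokens.back();
}
```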

This will prevent the segfault but it doesn't fix the root cause
which is getting here with empty token_metadata.  That will be fixed
by the following patch.

Refs #9494

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211019075710.1626808-1-bhalevy@scylladb.com>
2021-10-19 18:52:59 +03:00
Piotr Sarna
7c35d47690 cql3: make column names readable for invalid delete statements
This commit makes the column names from an invalid delete statement human readable.
Before that, they were printed in their hex representation, which is not convenient
for debugging.

Before:
  InvalidRequest:
    Error from server: code=2200 [Invalid query]
    message="Invalid where clause contains non PRIMARY KEY columns: 76616c"
After:
  InvalidRequest:
    Error from server: code=2200 [Invalid query]
    message="Invalid where clause contains non PRIMARY KEY columns: val"
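
For reference, the hex form is just the bytes of the column name. A small
sketch of the decoding, with `from_hex` as a hypothetical helper that is not
part of this commit:

```cpp
#include <cstddef>
#include <string>

// Decodes a string of hex digit pairs into the bytes they encode.
// Assumes well-formed input: even length, valid hex digits only.
inline std::string from_hex(const std::string& hex) {
    std::string out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2) {
        out += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    }
    return out;
}
```

With this helper, `from_hex("76616c")` yields `val`, matching the two
messages above.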
Message-Id: <52923335e8837295fd5ba2dfd0921196e21f7f16.1634626777.git.sarna@scylladb.com>
2021-10-19 10:13:43 +03:00
Piotr Sarna
83722b5563 cql-pytest: add reproducer for deleting based on secondary index
This commit adds a test case for a bug reported by Felipe
<felipemendes@scylladb.com>. The bug involves trying to delete
an entry from a partition based on a secondary index created
on a column which is part of the compound clustering key,
and the unfortunate result is that the whole partition gets wiped.
Cassandra's behavior is in this case correct - deletion based
on a secondary index column is not allowed.

Refs #9495
2021-10-19 08:50:20 +02:00
Piotr Sarna
7e3649202e cql3: forbid querying indexes for deletions
Using secondary indexes for the purpose of a DELETE statement
was never expected to be well-defined, but an edge case in #9495
showed that the index may sometimes be inadvertently used, which
causes the whole partition to be deleted.
In order to prevent such errors, it's now explicitly defined
that an index is not queriable if it's going to be used for the
purpose of a DELETE statement.
2021-10-19 08:49:58 +02:00
Raphael S. Carvalho
4271c4edcd sstables: Fix metric currently_open_for_writing
The metric currently_open_for_writing, used to report the number of sstables
opened for writing, holds the same value as total_open_for_writing. That means
we aren't actually decreasing the counter, so it is bogus.

Moved to sstable_writer, because sstable is used by writer to open files,
which are then extracted from sstable object, and later the same object is
reused for read-only mode.

Fixes #9455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>
2021-10-18 18:29:33 +03:00
Avi Kivity
e44057d5e1 cdc: don't allow background streams description rewrite to delay too far
If we're upgrading from an older version with the previous CDC streams
format, we'll upgrade it in the background. Background update is needed
since we need the cluster to be available when performing the upgrade,
but at this point we're just starting a node, and may not succeed in
forming a cluster before we shut down.

However, running in the background is dangerous since the objects we
use may stop existing. The code is careful to use reference counting,
but this does not guarantee that other dependencies are still alive,
especially since not all dependencies are expressed via constructor
parameters.

Fix by waiting for the rewrite work in generation_service::stop(). As
long as generation_service is up, the required dependencies should be
working too.

Note that there is another change here besides limiting the background
work: checks that were previously done in the foreground (limited to
local tables) are now also done in the background. I don't think
this has any impact.

Note: I expect this to have no real impact. Any CDC users will have
long since upgraded. This is just preparing for other patches that
bring in other dependencies, which cannot be passed via reference
counted pointers, so they expose the existing problem.
2021-10-18 16:56:59 +03:00
Kamil Braun
22061831c1 Merge 'cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy' from Benny Halevy
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.

Fixes #9302.

Closes #9304.

* 'prepare_options' of https://github.com/bhalevy/scylla:
  cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy
  abstract_replication_strategy: add to_qualified_class_name
2021-10-18 16:40:57 +03:00
Raphael S. Carvalho
ec1a55ffae compaction/TWCS: reduce write amp for reshape of sstables spanning multiple windows
TWCS can reshape at most 32 sstables spanning multiple windows, in a
single compaction round. Which sstables are compacted together, when
there are more than 32 sstables, is random.

If sstables with overlapping windows are compacted together, then
write amplification can be reduced because we may be able to push
all the data to a window W in a single compaction round, so we'll
not have to perform another compaction round later in W, to reduce
its number of files. This is also very good to reduce the amount
of transient file descriptors opened, because TWCS reshape
first reshapes all sstables spanning multiple windows, so if
all windows temporarily grow large in number of files, then
there's a risk that file descriptors can be exhausted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211013203046.233540-3-raphaelsc@scylladb.com>
2021-10-18 16:40:57 +03:00
Raphael S. Carvalho
062436829c compaction/TWCS: optimize reshape for disjoint sstables spanning multiple windows
After a4053dbb72, data segregation is postponed to offstrategy, so reshape
procedure is called with disjoint sstables which belong to different
windows, so let's extend the optimization for disjoint sstables which
span more than one window. In this way, write amplification is reduced
for offstrategy compaction, as all disjoint sstables will be compacted
at once.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>
2021-10-18 16:40:57 +03:00
Raphael S. Carvalho
aa4aba40aa sstables: sstable_run: introduce estimate_droppable_tombstone_ratio
Make it possible to estimate droppable tombstones for sstable runs.
The result is averaged by number of fragments composing the run.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211014143424.353357-1-raphaelsc@scylladb.com>
2021-10-18 12:24:08 +03:00
Benny Halevy
b9aa92edd4 cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.

Fixes #9302

Test: cql_query_test.test_rf_expand(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-18 12:18:07 +03:00
Benny Halevy
e4dc81ec04 abstract_replication_strategy: add to_qualified_class_name
And use it from cql3 check_restricted_replication_strategy and
keyspace_metadata ctor that defined their own `replication_class_strategy`.
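
A sketch of what such a helper might look like. The rule used here (a name
without a '.' is a short name and gets the Cassandra locator package
prefixed) is an assumption for illustration, and `qualify` and
`is_network_topology` are hypothetical names, not the functions added by
this commit.

```cpp
#include <string>

// Hypothetical sketch: a name without a '.' is treated as a short name and
// gets the Cassandra locator package prefixed; qualified names pass through.
inline std::string qualify(const std::string& name) {
    static const std::string prefix = "org.apache.cassandra.locator.";
    return name.find('.') == std::string::npos ? prefix + name : name;
}

// Comparing qualified names makes the short and the fully qualified
// spelling behave the same.
inline bool is_network_topology(const std::string& name) {
    return qualify(name) == "org.apache.cassandra.locator.NetworkTopologyStrategy";
}
```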

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-18 12:13:25 +03:00
Piotr Sarna
4bfaa7d9fc Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it will not fix
anything really...), the two problems were:
1. When certain operations are called on a service level controller
   which doesn't have its data accessor set, it can lead to a crash,
   since some operations will still try to dereference the accessor
   pointer.
2. The cql environment test initialized the accessor with a
   sharded<system_distributed_data>&, however this sharded class
   itself is not initialized (sharded::start wasn't called), so the
   same operations that were unsafe for null dereference will now crash
   trying to access an uninitialized sharded instance.

Closes #9468

* github.com:scylladb/scylla:
  CQL test environment: Fix bad initialization order
  Service Level Controller: Fix possible dereference of a null pointer
2021-10-18 08:53:53 +02:00
Nadav Har'El
1d751491a3 test/alternator: recognize when Scylla crashes
Before this patch, if Scylla crashes during some test in test/alternator,
all tests after it will fail because they can't connect to Scylla - and we
can get a report on hundreds of failures without a clear sign of where the
real problem was.

This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing health-check request
after each test. If this health-check request fails, we conclude that
Scylla crashed and report the test in which this happened - and exit
pytest instead of failing a hundred more tests.

The failure report looks something like this:
```
! _pytest.outcomes.Exit: Scylla appears to have crashed in test test_batch.py::test_batch_get_item !
```
And the entire test run fails.

These extra health checks are not free, but they come fairly close to
being free: In my tests I measured less than 0.1 seconds slowdown of
the entire test suite (which has 618 tests) caused by the extra health
checks.

Fixes #9489

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211017123222.217559-1-nyh@scylladb.com>
2021-10-17 20:45:30 +03:00
Avi Kivity
4b9a34051c storage_service: coroutinize stop()
In preparation for adding more stuff, convert stop() to a coroutine
to avoid an unreadable chain of continuations.

The code uses a finally() block which might not be needed (since .join()
should not fail). Rather than risking getting it wrong I wrapped it in
a try/catch and added logging.
2021-10-17 18:02:08 +03:00
Avi Kivity
e6b34527c1 main: stop storage_service on shutdown
Just like other services, storage_service needs to be stopped on
shutdown. cql_test_env already stops it, so there is some precedent
for it working. I tested a shutdown while cassandra-stress was
running and it worked okay for a few trials.
2021-10-17 18:02:08 +03:00
Nadav Har'El
86e8979ff2 test/alternator, test/cql-pytest: enable specific experimental features
Issue #9467 deprecated the blanket "--experimental" option which we
used to enable all experimental Scylla features for testing, and
suggests that individual experimental features should be enabled
instead.

So this is what we do in this patch for the Scylla-running scripts
in test/alternator and test/cql-pytest: We need to enable UDF for
the CQL tests, and to enable Alternator Streams and Alternator TTL
for the Alternator tests.

Refs #9467

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-2-nyh@scylladb.com>
2021-10-15 16:36:35 +03:00
Nadav Har'El
ddba510e64 config: add name for the experimental Alternator TTL feature
Earlier we added experimental (and very incomplete) support for
Alternator's TTL feature, but forgot to set a *name* for this
experimental feature. As a result, this feature can be enabled only with
the blanket "--experimental" option and not with a specific
"--experimental-features=..." option.

Since issue #9467 deprecated the blanket "--experimental" option
and users are encouraged to only enable specific experimental
features, it is important that we have a name for it.

So the name chosen in this patch is "alternator-ttl".
Eventually this feature might evolve beyond Alternator-only,
but for now, I think it's a good name and we'll probably
graduate the experimental Alternator TTL feature before
supporting CQL, so it will be a new experimental feature
anyway.

Refs #9467.

db/config.cc

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-1-nyh@scylladb.com>
2021-10-15 16:36:23 +03:00
Avi Kivity
acfe0a3803 build: reinstate -Wunknown-attributes
The warning was disabled during the migration to clang, but now it
appears unnecessary (perhaps clang added support for the attributes
it did not have then). It is valuable for detecting misspelled
attributes, so enable it again.

Closes #9480
2021-10-14 14:26:56 +03:00
Tomasz Grabiec
cc56a971e8 database, treewide: Introduce partition_slice::is_reversed()
Cleanup, reduces noise.

Message-Id: <20211014093001.81479-1-tgrabiec@scylladb.com>
2021-10-14 12:39:16 +03:00
Nadav Har'El
cad039421a config: automate help-string listing experimental features
The help string from the "--experimental-features" command-line option
lists the available experimental features, to help a user who might
want to enable them. But this help string was manually written, and has
since drifted from reality:

* Two of the listed "experimental" features, cdc and lwt, have actually
  graduated from being experimental long ago. Although technically a user
  may still use the words "cdc" and "lwt" in the "experimental-features"
  parameter, doing so is pointless, and worse: This text in the help
  string can mislead a user into thinking that these two features are
  still experimental - while they are not!

* One experimental feature - alternator-ttl - is missing from this list.

Instead of updating the help string text now - and needing to do this
again and again in the future as we change experimental features - what
this patch does is to construct the list of features automatically from
the map of supported feature names - excluding any features which map
to UNUSED.
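
The idea can be sketched as follows. The enum, the map contents, and the
function names are simplified assumptions for illustration and do not
reproduce db::experimental_features_t.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the feature enum.
enum class feature { UNUSED, UDF, ALTERNATOR_STREAMS, ALTERNATOR_TTL };

// Hypothetical name map: graduated features still parse, but map to UNUSED.
inline std::map<std::string, feature> feature_map() {
    return {
        {"lwt", feature::UNUSED},
        {"cdc", feature::UNUSED},
        {"udf", feature::UDF},
        {"alternator-streams", feature::ALTERNATOR_STREAMS},
        {"alternator-ttl", feature::ALTERNATOR_TTL},
    };
}

// Joins the names of all features that do not map to UNUSED.
inline std::string current_features_help() {
    std::string out;
    for (const auto& [name, value] : feature_map()) {
        if (value == feature::UNUSED) continue;
        if (!out.empty()) out += ", ";
        out += name;
    }
    return out;
}
```

Because the list is derived from the map, graduating a feature only requires
remapping its name to UNUSED; the help text follows automatically.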

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013122635.132582-1-nyh@scylladb.com>
2021-10-14 10:39:58 +03:00
Avi Kivity
4f3b8f38e2 Merge "Add effective_replication_map" from Benny
"
The current api design of abstract_replication_strategy
provides a can_yield parameter to calls that may stall
when traversing the token metadata in O(n^2) and even
in O(n) for a large number of token ranges.

But, to use this option the caller must run in a seastar thread.
It can't be used if the caller runs a coroutine or plain
async tasks.

Rather than keep adding threads (e.g. in storage_service::load_and_stream
or storage_service::describe_ring), the series offers an infrastructure
change: precalculating the token->endpoints map once, using an async task,
and keeping the results in an `effective_replication_map` object.
The latter can be used for efficient and stall-free calls, like
get_natural_endpoints, or get_ranges/get_primary_range, replacing their
equivalents in abstract_replication_strategy, and dropping the public
abstract_replication_strategy::calculate_natural_endpoints and its
internal cached_endpoints map.

Other than the performance benefits of:
1. The current calls require running a thread to yield.
Precalculating the map (using async task) allows us to use synchronous calls
without stalling the reactor.

2. The replication maps can and should be shared
between keyspaces that use the same replication strategy.
(Will be sent as a follow-up to the series)

The bigger benefits (courtesy of Avi Kivity) are laying the groundwork for:
1. atomic replication metadata - an operation can capture a replication map once, and then use consistent information from the map without worrying that it changes under its feet. We may even be able to s/inet_address/replica_ptr/ later.

2. establish boundaries on the use of replication information - by making a replication map not visible, and observing when its reference count drops to zero, we can tell when the new replication map is fully in use. When we start writing to a new node we'll be able to locate a point in time where all writes that were not aware of the new node were completed (this is the point where we should start streaming).

Notes:
* The get_natural_endpoints method that uses the effective_replication_map
  is still provided as an abstract_replication_strategy virtual method
  so that local_strategy can override it and provide natural endpoints
  for any search token, even in the absence of token_metadata, when
  called early-on, before token_metadata has been established.

  The effective_replication_map materializes the replication strategy
  over a given replication strategy options and token_metadata.
  Whenever either of those change for a keyspace, we make a new
  effective_replication_map and keep it in the keyspace for later use.

  Methods that depend on an ad-hoc token_metadata (e.g. during
  node operations like bootstrap or replace) are still provided
  by abstract_replication_strategy.

TODO:
- effective_replication_map registry
- Move pending ranges from token_metadata to replication map
- get rid of abstract_replication_strategy::get_range_addresses(token_metadata&)
  - calculate replication map and use it instead.

Test: unit(dev, debug)
Dtest: next-gating, bootstrap_test.py update_cluster_layout_tests.py alternator_tests.py -a 'dtest-full,!dtest-heavy' (release)
"

* tag 'effective_replication_strategy-v6' of github.com:bhalevy/scylla: (44 commits)
  effective_replication_map: add get_range_addresses
  abstract_replication_strategy: get rid of shared_token_metadata member and ctor param
  abstract_replication_strategy: recognized_options: pass const topology&
  abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map
  token_metadata: get rid of now-unused sync methods
  abstract_replication_strategy: get rid of do_calculate_natural_endpoints
  abstract_replication_strategy: futurize get_*address_ranges
  abstract_replication_strategy: futurize get_range_addresses
  abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr)
  abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map
  compaction_manager: pass owned_ranges via cleanup/upgrade options
  abstract_replication_strategy: get rid of cached_endpoints
  all replication strategies: get rid of do_get_natural_endpoints
  storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints
  abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map
  storage_service: bootstrap: add log messages
  storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings
  shared_token_metadata: set: check version monotonicity
  token_metadata: use static ring version
  token_metadata: get rid of copy constructor and assignment operator
  ...
2021-10-13 20:28:30 +03:00
Tomasz Grabiec
d8832b9fd8 Merge 'Memtable make reversing reader' from Michał Radwański
Make a reader that reads from memtable in reverse order.

This draft PR includes two commits, out of which only the second is
relevant for review.

Described in #9133.
Refs #1413.

Closes #9174

* github.com:scylladb/scylla:
  partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible.
  memtable: enable native reversing
  partition_snapshot_reader: reverse ck_range when needed by Reversing
  memtable, partition_snapshot_reader: read from partition in reverse
  partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration
  partition_snapshot_reader: split responsibility of ck_range
  partition_snapshot_reader: separate _schema into _query_schema and _partition_schema
  query: reverse clustering_range
  test: cql_query_test: fix test_query_limit for reversed queries
2021-10-13 20:24:02 +03:00
Nadav Har'El
ee8dc6847c scylla.yaml: refresh list of experimental features
Our scylla.yaml contains a comment listing the available experimental
features, supposedly helping a user who might want to enable them.
I think the usefulness of this comment is dubious, but as long as we
have one, let's at least make it accurate:

* Two of the listed "experimental" features, cdc and lwt, have actually
  graduated from being experimental long ago. Although technically a user
  may still use the words "cdc" and "lwt" in the "experimental-features"
  list, doing so is pointless, and worse: This comment suggests that these
  two features are still experimental - while they are not!

* One experimental feature - alternator-ttl - is missing from this list.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013083247.13223-1-nyh@scylladb.com>
2021-10-13 20:24:02 +03:00
Benny Halevy
17296cba4b effective_replication_map: add get_range_addresses
Equivalent to abstract_replication_strategy get_range_addresses,
yet synchronous, as it uses the precalculated map.

Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.

Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.

Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
8c85197c6c abstract_replication_strategy: get rid of shared_token_metadata member and ctor param
It is not used any more.

Methods either use the token_metadata_ptr in the
effective_replication_map, or receive an ad-hoc
token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
91f2fd5f2c abstract_replication_strategy: recognized_options: pass const topology&
Prepare for deleting the _shared_token_metadata member.
All we need for recognized_options is the topology
(for network_topology_strategy).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
4d2561ff75 abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
d953e7b01a token_metadata: get rid of now-unused sync methods
Now that abstract_replication_strategy methods are all async,
clone_only_token_map_sync and update_normal_tokens_sync
are unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
bdce6f93ca abstract_replication_strategy: get rid of do_calculate_natural_endpoints
It is no longer in use.

And with it, the virtual calculate_natural_endpoint_sync method
of which it was the only caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
cbe58345b9 abstract_replication_strategy: futurize get_*address_ranges
Remaining callers of get_address_ranges and get_pending_address_ranges
are all either from a seastar thread or from a coroutine
so we can make the methods always async and drop the
can_yield param.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
91581ba23a abstract_replication_strategy: futurize get_range_addresses
All remaining use sites are called in a seastar thread
so we drop the can_yield param and make get_range_addresses
always async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
3040e0a038 abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr)
It is called only from repair, in a thread,
so it can be made always async and the need_preempt param
can be dropped.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
dfdc8d4ddb abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map
Provide a sync get_ranges method by effective_replication_map
that uses the precalculated map to get all token ranges owned by or
replicated on a given endpoint.

Reuse do_get_ranges as common infrastructure for all
3 cases: get_ranges, get_primary_ranges, and get_primary_ranges_within_dc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:09:51 +03:00
Laura Novich
23886b2219 fix runtime errors 2021-10-13 15:08:24 +03:00
Laura Novich
d3e4b15530 upgrade theme to v1.x 2021-10-13 14:56:27 +03:00
Benny Halevy
5483269dfb compaction_manager: pass owned_ranges via cleanup/upgrade options
So they can be easily computed using an async task
before constructing the compaction object
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:17:46 +03:00
Benny Halevy
0e5bb94e84 abstract_replication_strategy: get rid of cached_endpoints
Now that do_get_natural_endpoints is gone, the cached
endpoints are no longer in use.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:15:34 +03:00
Benny Halevy
25227ab5ea all replication strategies: get rid of do_get_natural_endpoints
Now that all flavors of get_natural_endpoints methods
were moved to effective_replication_map,
do_get_natural_endpoints and its overrides are unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:13:51 +03:00
Benny Halevy
facd5035f1 storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints
Use the same token_metadata used for get_natural_endpoints_without_node_being_replaced
where used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:11:43 +03:00
Benny Halevy
aab363753f abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map
Use the precalculated endpoints map there
as well as the token_metadata_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:10:01 +03:00
Benny Halevy
548719aac1 storage_service: bootstrap: add log messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:07:59 +03:00
Benny Halevy
08fef2a702 storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings
We should invalidate the cached rings every time the
token metadata changes, not only on topology changes
to invalidate cached token/replication mappings
when the modified token_metadata is committed.

Currently we can do without it (apparently)
but this will become a requirement for keeping
versions of the effective_replication_map
in a registry, indexed by the token_metadata ring version,
among other things.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:05:57 +03:00
Benny Halevy
bb0ea0b1c0 shared_token_metadata: set: check version monotonicity
Setting the ring version backwards means it got out of sync.
Possibly concurrent updates weren't serialized properly
using token_metadata_lock / mutate_token_metadata.
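
A minimal Python sketch of such a check, not the actual C++ code. The
class name, the exception type, and the choice to accept an unchanged
version are assumptions made for illustration:

```python
class SharedTokenMetadata:
    """Hypothetical holder that rejects ring versions going backwards."""

    def __init__(self, ring_version=0):
        self.ring_version = ring_version

    def set(self, new_ring_version):
        # A version lower than the current one suggests that concurrent
        # updates were not serialized properly.
        if new_ring_version < self.ring_version:
            raise RuntimeError(
                f"ring version went backwards: "
                f"{self.ring_version} -> {new_ring_version}")
        self.ring_version = new_ring_version
```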

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:03:51 +03:00
Benny Halevy
43160abaec token_metadata: use static ring version
For generating unique _ring_version.

Currently when we clone a mutable token_metadata_ptr
it remains with the same _ring_version
and the ring version is updated only when the topology changes.

To be able to distinguish these transient copies
from the ones that got applied, be stricter about
the ring version and change it to a unique number
using a static counter.

Next patch will update the ring version
(and consequently invalidate the cached_endpoints
on the replication strategy) every time the token_metadata
changes, not only when the topology changes.

Note that the _cached_endpoints will go away
once the transition to effective_replication_map
is finished, so this will not degrade performance.
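
The idea of a static counter can be sketched in Python roughly as
follows. This is an illustration only: the class layout, the
`update_token` method, and the point at which the version is bumped are
assumptions, not the real token_metadata interface:

```python
import copy
import itertools


class TokenMetadata:
    # Shared by all instances, playing the role of a C++ static member.
    _next_ring_version = itertools.count(1)

    def __init__(self):
        self.tokens = {}
        self.ring_version = next(TokenMetadata._next_ring_version)

    def clone(self):
        # A clone starts out with the ring version of its source.
        return copy.deepcopy(self)

    def update_token(self, token, endpoint):
        self.tokens[token] = endpoint
        # Any change takes a version that no other instance has used,
        # so a modified copy can be told apart from the applied one.
        self.ring_version = next(TokenMetadata._next_ring_version)
```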

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:03:17 +03:00
Benny Halevy
685f5e7704 token_metadata: get rid of copy constructor and assignment operator
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:00:55 +03:00
Benny Halevy
d74ecfbc29 abstract_replication_strategy: get rid of legacy get_natural_endpoints
implementation

Now that all users of it were converted to use the
effective_replication_map, the legacy
abstract_replication_strategy::get_natural_endpoints method
can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:58:18 +03:00
Benny Halevy
4afe8cad3c repair: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:57:16 +03:00
Benny Halevy
cddd16f22d db: view: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:55:50 +03:00
Benny Halevy
96aa6161d8 db: hints manager: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:54:52 +03:00
Benny Halevy
c10a439f6c storage_service: optimize get_effective_replication_map multi-usage
Currently, we call find_keyspace and then
get_effective_replication_map on the _same_ keyspace
to get_natural_endpoints for multiple tokens.

Get the effective_replication_map once in these cases
and use it for each token.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:53:18 +03:00
Benny Halevy
fdaa891332 storage_service, sstables_loader: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:50:27 +03:00
Benny Halevy
4b838197e2 storage_service: update keyspaces effective_replication_map on token_metadata change
Every time the token_metadata changes we need to update the
effective_replication_map on all non-system keyspaces.

Do that in replicate_to_all_cores after the updated token_metadata
has been replicated to all cores.

We first prepare and clone the token_metadata, then prepare
and clone the new effective_replication_maps.  Any failure
at this stage is recoverable: it is handled via rollback and the
exception is returned.

Note that any failure to _apply_ the pending token_metadata or the
effective_replication_map will cause scylla to abort.
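
The prepare-then-apply flow can be sketched in Python as below. The
function name, the dict-based shard representation, and the `clone`
callable are hypothetical and only illustrate the two phases:

```python
def replicate_to_all_shards(shards, new_value, clone):
    """shards: list of dicts, each holding a 'current' entry.
    clone: callable that may raise while preparing a per-shard copy."""
    pending = []
    try:
        # Phase 1: prepare. Nothing visible changes yet.
        for _ in shards:
            pending.append(clone(new_value))
    except Exception:
        # Roll back: drop the prepared copies and report the error.
        pending.clear()
        raise
    # Phase 2: apply. Plain assignments that are not expected to fail.
    for shard, value in zip(shards, pending):
        shard["current"] = value
```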

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:05:28 +03:00
Benny Halevy
3393df45eb token_metadata, storage_service: unify token_metadata_lock and merge_lock.
Serialize the metadata changes with
keyspace create, update, or drop.

This will become necessary in the following patch
when we update the effective_replication_map
on all keyspaces and we want instances on all shards
to end up with the same replication map.

Note that storage_service::keyspace_changed is called
from the schema merge path so it already holds
the merge_lock.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:01:25 +03:00
Benny Halevy
4cba7195ee storage_service: coroutinize mutate_token_metadata
And fold with_token_metadata_lock into it, as it's
its only caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:59:58 +03:00
Benny Halevy
045806cae7 storage_service: replicate_to_all_cores: use local pending_token_metadata_ptr
Rather than a _pending_token_metadata_ptr member in the storage_service
class.  This is much easier now that the function was converted to a
coroutine.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:58:30 +03:00
Benny Halevy
52f48f47f6 storage_service: coroutinize replicate_to_all_cores
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:57:05 +03:00
Benny Halevy
991a6a8664 keyspace: update_effective_replication_map
And use it to get_natural_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:55:34 +03:00
Benny Halevy
970b0a50b5 keyspace: futurize create_replication_strategy
And functions that use it, like:
keyspace::update_from
database::update_keyspace
database::create_in_memory_keyspace

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:53:41 +03:00
Benny Halevy
eb752c3f69 test: network_topology_strategy_test: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:53:09 +03:00
Benny Halevy
1e1d7d7df5 abstract_replication_strategy: introduce effective_replication_map
effective_replication_map holds the full replication_map
resulting from applying the effective replication strategy
over the given token_metadata and replication_strategy_config_options.

It is calculated once, in make_effective_replication_map(), and then it
can be used for retrieving the endpoints/token_ranges synchronously
from the precalculated map.

A new virtual get_natural_endpoints(const token&, const effective_replication_map&)
method has been added to abstract_replication_strategy so that
local_strategy and everywhere_replication_strategy can override it as they may be
needed before the token_metadata is established.
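
A rough Python sketch of the precalculate-once, look-up-synchronously
idea. The signatures are simplified assumptions (a sorted list of
integer tokens and a `calculate_endpoints` callable) and do not mirror
the C++ interface:

```python
import bisect


def make_effective_replication_map(sorted_tokens, calculate_endpoints):
    """Precompute the endpoints of every token in the ring, once."""
    return {t: calculate_endpoints(t) for t in sorted_tokens}


def get_natural_endpoints(search_token, sorted_tokens, replication_map):
    """Synchronous lookup: use the first ring token >= search_token,
    wrapping around to the first token past the end of the ring."""
    i = bisect.bisect_left(sorted_tokens, search_token)
    if i == len(sorted_tokens):
        i = 0
    return replication_map[sorted_tokens[i]]
```

After the map is built, lookups never call `calculate_endpoints` again.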

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:53:03 +03:00
Benny Halevy
d96a67eb57 abstract_replication_strategy: use shared_ptr in registry
Enable creating shared_ptr<BaseClass> in nonstatic_class_registry
using BaseClass::ptr_type and use that for
abstract_replication_strategy.

While at it, also clean up compressor in that respect
to define compressor::ptr_type as shared_ptr<compressor>
thus simplifying compressor_registry.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
4511c9acdb database.hh: convert ifdef block to pragma once
Besides being more modern and more efficient for
the compiler, this #ifndef block confuses my editor
that greys out the whole block.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
a1c573e6d3 abstract_replication_strategy: make calculate_natural_endpoints_sync private
And with that rename calculate_natural_endpoints(const token& search_token, const token_metadata&, can_yield)
to do_calculate_natural_endpoints and make it protected.

With this patch, all its external users call the async version, so
rename it back to calculate_natural_endpoints, and make
calculate_natural_endpoints_sync private since it's being called
only within abstract_replication_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
a1098c0094 replication strategies: calculate_natural_endpoints: split into sync and async variants
calculate_natural_endpoints_sync and _async are both provided
temporarily until all users of them are converted to use
the async version which will remain.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
32c7314b80 network_topology_strategy: refactor calculate_natural_endpoints
Extract natural_endpoints_tracker out of calculate_natural_endpoints
so we can easily split the function into sync and async variants.

Test: network_topology_strategy_test(dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
416531cce7 network_topology_strategy: use rslogger to debug-log configuration
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
330d9772d4 abstract_replication_strategy: move logger to locator namespace
To be used by network_topology_strategy and later, by
effective_replication_map_registry.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
7401d03e8c abstract_replication_strategy: define replication_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Benny Halevy
5001d261d4 abstract_replication_strategy: define replication_strategy_config_options
To be used for searching effective replication strategy instances.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 12:39:36 +03:00
Eliran Sinvani
56981f2259 CQL test environment: Fix bad initialization order
The service level controller was initialized with a data
accessor that uses the system distributed keyspace before
the latter has been initialized. If this accessor is used
(for example by calling
service_level_controller::get_distributed_service_levels()),
it will fail miserably and crash.
Leaving the data accessor uninitialized does not have the same
effect, since we can deal with such a call when the accessor is
not initialized.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-12 13:27:59 +03:00
Eliran Sinvani
6d3e8055f9 Service Level Controller: Fix possible dereference of a null pointer
If the service level controller doesn't have its data accessor set,
calls for getting distributed information might dereference this
unset pointer for the accessor. Here we add code that will return
a result as if there is no data available to the accessor (a behaviour
which is roughly equivalent to a null data accessor).
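
The guard can be pictured with the following Python sketch. Only the
method name get_distributed_service_levels comes from the code base;
the accessor's `get_service_levels` method and the dict result are
assumptions made for the example:

```python
class ServiceLevelController:
    def __init__(self, data_accessor=None):
        self._data_accessor = data_accessor

    def get_distributed_service_levels(self):
        # With no accessor set, answer as if no data were available
        # instead of dereferencing the unset accessor.
        if self._data_accessor is None:
            return {}
        return self._data_accessor.get_service_levels()
```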

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-12 13:27:50 +03:00
Pavel Solodovnikov
8b917f7c99 db: mark --experimental option deprecated
The documentation for --experimental config option states
that it enables all experimental features, but this is no
longer true, e.g. the raft feature is not enabled with it and
should be explicitly enabled via `--experimental-features=raft`
switch (we don't want to enable it by default alongside
other features).

Since the flag doesn't do what it's intended to, we should
mark it as "deprecated", because documenting each exception
(there could be more than only raft in the future) will be
a burden and docs will constantly go out-of-sync with the
code.

Adjust the description for the option to reflect that, mark
it "deprecated" and suggest using --experimental-features, instead.

Fixes: #9467

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211012093005.20871-2-pa.solodovnikov@scylladb.com>
2021-10-12 13:22:12 +03:00
Pavel Solodovnikov
162f1899e8 db: update the list of supported experimental features
`raft` and `alternator-streams` features were missing
from the description for `experimental-features` config
flag.

Update `scylla.yaml` template comments to reflect that, too.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211012093005.20871-1-pa.solodovnikov@scylladb.com>
2021-10-12 13:22:11 +03:00
Avi Kivity
0d48c39cb3 Merge 'tools/scylla-sstable: allow opening sstables from any path' from Botond Dénes
Currently it is required that sstables (in particular la/mx ones) are located at a valid path. This is required because `sstables::entry_descriptor::make_descriptor()` extracts the keyspace and table names from the sstable dir components. This PR relaxes this by using a newly introduced `sstables::entry_descriptor::make_descriptor()` overload which allows the caller to specify keyspace and table names, not necessitating these to be extracted from the path.

Tests: unit(dev), manual(testing that `scylla-sstable` can indeed load sstables from an invalid path)

Closes #9466

* github.com:scylladb/scylla:
  tools/scylla-sstable: allow loading sstables from any path
  sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf
2021-10-12 12:50:11 +03:00
Takuya ASADA
06c28585f9 dist: raise fs.file-max and fs.nr_open to enough size for scylla
Currently, we configure LimitNOFILE on scylla-server.service, but we
don't configure fs.nr_open and fs.file-max.
When fs.nr_open or fs.file-max are smaller than LimitNOFILE, we may fail
to allocate FDs.
To fix this issue, raise fs.file-max and fs.nr_open to enough size for
scylla.

Fixes #9461

Closes #9461
2021-10-12 12:47:35 +03:00
Botond Dénes
cc65c9d0da compaction: scrub/segregate: adjust partition-estimate as buckets accumulate
Scrub compaction in segregate mode can split the input sstable into as
many as hundreds or even thousands of output sstables in the extreme
case. But even at a few dozen output sstables, most of these will only
have a few partitions with a few rows. These sstables however will still
have their bloom filter allocated according to the original
partition-count estimate, causing memory bloat or even OOM in the
extreme case.
This patch solves this by aggressively adjusting the partition count
downwards after the second bucket has been created. Each subsequent
bucket will halve the partition estimate, which will quickly reach 1.
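
The halving schedule can be illustrated with a small Python sketch. The
function name is hypothetical, and having the first two buckets keep
the original estimate is an assumption derived from the description
above, not a copy of the actual logic:

```python
def bucket_partition_estimates(original_estimate, bucket_count):
    """Return the partition estimate used for each successive bucket."""
    estimates = []
    estimate = original_estimate
    for bucket_index in range(bucket_count):
        # From the third bucket on, halve the estimate, never going below 1.
        if bucket_index >= 2:
            estimate = max(1, estimate // 2)
        estimates.append(estimate)
    return estimates
```

For instance, an original estimate of 1000 over six buckets gives
1000, 1000, 500, 250, 125, 62.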

Fixes: #9463

Closes #9464
2021-10-12 12:44:42 +03:00
Botond Dénes
d535346a6e tools/scylla-sstable: allow loading sstables from any path
Currently it is required that sstables (in particular la/mx ones) are
located at a valid path. This is required because
`sstables::entry_descriptor::make_descriptor()` extracts the keyspace
and table names from the sstable dir components.
This patch relaxes this by using the freshly introduced
`sstables::entry_descriptor::make_descriptor()` overload which allows
the caller to specify keyspace and table names.
2021-10-12 11:47:58 +03:00
Botond Dénes
1b7b3a81e6 sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf
Not necessitating these to be extracted from the sstable dir path. This
practically allows for la/mx sstables at non-standard paths to be
opened. This will be used by the `scylla-sstable` tool which wants to be
flexible about where the sstables it opens are located.
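
The difference between the two overloads can be sketched in Python as
follows. This is an illustration only: the directory layout
`<keyspace>/<table>-<uuid>/<file>`, the returned dict, and the use of
optional arguments in place of a C++ overload are assumptions:

```python
from pathlib import PurePosixPath


def make_descriptor(sstable_path, ks=None, cf=None):
    path = PurePosixPath(sstable_path)
    if (ks is None) != (cf is None):
        raise ValueError("provide both ks and cf, or neither")
    if ks is None:
        # Assumed layout: .../<keyspace>/<table>-<uuid>/<sstable file>
        table_dir = path.parent
        cf = table_dir.name.rsplit("-", 1)[0]
        ks = table_dir.parent.name
    # With ks/cf provided, the directory components are not consulted.
    return {"ks": ks, "cf": cf, "file": path.name}
```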
2021-10-12 11:43:23 +03:00
Nadav Har'El
e4bc97349c cql-pytest: XFAILing test was fixed by a Python driver fix
Issue #8203 describes a bug in a long scan which returns a lot of empty
pages (e.g., because most of the results are filtered out). We have two
cql-pytest test cases that reproduced this bug - one for a whole-table
scan and one for a single-partition scan.

It turned out that the bug was not in the Scylla server, but actually in
the Python driver which incorrectly stopped the iteration after an empty
page even though this page did contain the "more pages" flag.
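
The correct iteration can be sketched in Python like this. The
`fetch_page` callable and its return shape are hypothetical stand-ins
and are not the driver's API:

```python
def iterate_all_rows(fetch_page):
    """fetch_page(paging_state) -> (rows, next_paging_state).
    next_paging_state is None when the server reports no more pages."""
    paging_state = None
    while True:
        rows, paging_state = fetch_page(paging_state)
        yield from rows
        # Stop only when there are no more pages,
        # not merely because this page happened to be empty.
        if paging_state is None:
            break
```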

This driver bug was already fixed in the Datastax driver (see
6ed53d9f70) and in the Scylla fork of the driver (see
1d9077d3f4).

So in this patch we drop the XFAIL, and if the driver is not new enough
to contain this fix - the test is skipped.

Since our Jenkins machines have the latest Scylla fork of the driver and
it already contains this fix, these tests will not be skipped - and will
run and should pass. Developers who run these tests on their development
machine will see these tests either passing or skipped - depending on
which version of the driver they have installed.

Closes #8203

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211011113848.698935-1-nyh@scylladb.com>
2021-10-12 10:04:02 +02:00
Nadav Har'El
33f8ec09df Merge 'treewide: improve compatibility with gcc 11' from Avi Kivity
Our source base drifted away from gcc compatibility; this mostly
restores the ability to build with gcc. An important exception is
coroutines that have an initializer list [1]; this still doesn't work.

We aim to switch back to gcc 11 if/when this gives us better
C++ compatibility and performance.

Test: unit (dev)

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056

Closes #9459

* github.com:scylladb/scylla:
  test: radix_tree_printer: avoid template specialization in class context
  test: raft: avoid ignored variable errors
  test: reader_concurrency_semaphore_test: isolate from namespace of source_location
  test: cql_query_test: drop unused lambda assert_replication_not_contains
  test: commitlog_test: don't use deprecated seastar::unaligned_cast
  test: adjust signed/unsigned comparisons in loops and boost tests
  build: silence some gcc 11 warnings
  sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
  managed_bytes: avoid compile-time loop in converting constructor
  service: service_level_controller: drop unused variable sl_compare
  raft: disambiguate promise name in raft::active_read
  locator: azure_snitch: use full type name in definition of globals
  cql3: statements: create_service_level_statement: don't ignore replace_defaults()
  cql3: statement_restrictions: adjust call to std::vector deduction guide
  types: remove recursive constraint in deserialize_value
  cql3: restrictions: relax constraint on visitor_with_binary_operator_content
  treewide: handle switch statements that return
  cql3: expr: correct type of captured map value_type
  cdc: adjust type of streams_count
  alternator: disambiguate attrs_to_get in table_requests
2021-10-11 16:54:01 +03:00
Nadav Har'El
5e4c60e19a Merge: Unload storage service from irrelevant APIs
Merged patch series from Pavel Emelyanov:

There's a long-term (well, likely mid-term already) goal to keep a
single role for the storage_service, namely -- managing the state of
a node in the ring. Then rename it once it happens to stop people
from loading new stuff into storage_service. There are at least three
REST API endpoints that stand in the way.

1. load_new_ss_tables. This part is moved to a new sharded sstables
   loader that wraps existing distributed_loader

2. view_build_statuses. Statuses are maintained by view_builder so must
   be retrieved from the same place

3. enable_|disable_auto_compaction. This is purely a database knob, as
   it used to be some time ago

This change also removes view_update_generator from storage_service list
of dependencies and leaves the system_distributed_keyspace as a
start-only one (another not yet published branch makes use of it and
removes s.d.ks from storage service altogether).

branch: https://github.com/xemul/scylla/tree/br-unload-storage-service-api-3
tests: unit(dev)
refs: #5489

* 'br-unload-storage-service-api-3' of github.com:xemul/scylla:
  storage_service, api: Move set-tables-autocompaction back into API
  api: Fix indentation after previous patch
  api, database, storage_service: Unify auto-compaction toggle
  api: Remove storage service from new APIs
  view_builder: Accept view_build_statuses
  storage_service: Move view_build_statuses code
  api, storage_service: Keep view builder API handlers separate
  storage_service: Remove view update generator from
  sstables_loader: Accept the sstables loading code
  storage_service: Move the sstables loading code
  storage_service, api: Keep sstables loading API handlers separate
  sstables_loader: Introduce
  distributed_loader, utils: Move verify_owner_and_mode
  distributed_loader: Fix methods visibility
2021-10-11 15:22:06 +03:00
Kamil Braun
339b9bc38a sstables: mx: partition_reversing_data_source: close internal data consumers
`partition_reversing_data_source` uses `continuous_data_consumer`s
internally (`partition_header_context`, `row_body_skipping_context`)
which hold `input_stream`s opened to sstable data files. These
`input_stream`s must be closed before destruction. Right now they would
sometimes cause "Assertion `_reads_in_progress == 0' failed" on
destruction.

Close the `continuous_data_consumer`s before they are destroyed so they
can close their `input_stream`s.

Fixes #9444.

Closes #9451
2021-10-11 12:35:54 +02:00
Pavel Emelyanov
f0b5ab1c61 storage_service, api: Move set-tables-autocompaction back into API
The global autocompaction toggle is no longer tied to the storage
service. It naturally belongs to the database, but is small and
tidy enough not to pollute database methods and can be placed into
the api/ dir itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:13:59 +03:00
Pavel Emelyanov
fece1a2f9f api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:13:56 +03:00
Pavel Emelyanov
c5128eea67 api, database, storage_service: Unify auto-compaction toggle
There are two knobs here -- global and per-table one. Both were added
without any synchronisation, but the former one was later fixed to
become serialized and not to be available "too early".

This patch unifies both toggles to be serialized with each-other and
not be enabled too early.

The justification for this change is to move the global toggle from out
of the storage service, as it really belongs to the database, not the
storage service. Respectively, the current synchronization, that depends
on storage service internals, should be replaced with something else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:12:39 +03:00
Pavel Emelyanov
c53c74258a api: Remove storage service from new APIs
The APIs that had been recently switched to using relevant services no
longer need the storage service reference capture, so remove it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:11:52 +03:00
Pavel Emelyanov
c504361c15 view_builder: Accept view_build_statuses
The code itself is already in the relevant .cc file, now move it to the
relevant class.

The only significant change is where to get token metadata from.
In its old location tokens were provided by the storage service
itself, now when it's in the view builder there's no "native" place
to get them from, however the rest of the view building code gets
tokens from global storage proxy, so do the same here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:11:40 +03:00
Pavel Emelyanov
3b6e8c7d93 storage_service: Move view_build_statuses code
This code belongs to view builder, so put it into its .cc. No changes,
just move. This needs some ugly namespace breakage, but they will
be patched away with the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:11:29 +03:00
Pavel Emelyanov
540c6fa5ae api, storage_service: Keep view builder API handlers separate
There's the 'storage_service/view_build_statuses' endpoint. Its
handler code sits in the storage_service, but the functionality
belongs purely to view_builder. Same as with sstables loader,
detach the endpoint's API set/unset code, next patches will fix
the handler to use view_builder.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:09:07 +03:00
Pavel Emelyanov
99d8994835 storage_service: Remove view update generator from
It's not used by storage service any longer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:09:02 +03:00
Pavel Emelyanov
68ecec0197 sstables_loader: Accept the sstables loading code
The code was moved in the relevant .cc file by previous patch, now
make it sit in the relevant class. One "significant" change is that
the messaging service is available by local reference already, not
by the sharded one. Other dependencies are already satisfied by the
patch that introduced the sstables_loader class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:08:21 +03:00
Pavel Emelyanov
42f83f6669 storage_service: Move the sstables loading code
Just cut-n-paste the code into sstables_loader.cc. No other
changes but replace storage service logger with its own one.
For now the code stays in storage_service class, but next
patch will relocate the code into the sstables_loader one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:07:39 +03:00
Pavel Emelyanov
7e49359720 storage_service, api: Keep sstables loading API handlers separate
Right now the handlers sit in one boat with the rest of the storage
service APIs. Next patches will switch this particular endpoint to
use previously introduced sstables_loader, before doing so here's
the respective API set/unset stubs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:05:45 +03:00
Pavel Emelyanov
13ab22d3c7 sstables_loader: Introduce
It's a sharded service that will be responsible for loading
sstables via the respective REST API (the endpoint in question
is in turn handling the nodetool refresh command). This patch
adds the loader, equips with the needed dependencies and
starts/stops one from main. Next patches will move the loader
code from storage_service into this new one. The list of
dependencies that are introduced in this patch is exactly
what's needed by the mentioned code move.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:03:54 +03:00
Pavel Emelyanov
581382edad distributed_loader, utils: Move verify_owner_and_mode
This method sits in dist.loader, but really belongs to util/ as it
just works on an "abstract" path and doesn't need to know what this
path is about. Another sign of layering violation is the inclusion
of dist.loader code into util/ stuff.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:03:51 +03:00
Pavel Emelyanov
e106e0571a distributed_loader: Fix methods visibility
Most of the methods are marked public, but only a few of them should be.
Test needs a bit more, however, so the distributed_loader_for_tests
is declared as friend class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:03:29 +03:00
Michał Radwański
c04dffbc01 partition_snapshot_reader: pop_range_tombstone returns reference
(instead of value) when possible.
2021-10-10 20:38:18 +02:00
Michał Radwański
771f3b12bd memtable: enable native reversing
This commit consists of changes, which need to reside in a single
commit, so that the tests pass on each of the commits.

1. Remove do_make_flat_reader which disabled reverse reads by making the
   slice a forward one. Remove call to get_ranges which would do
   superfluous reversal of clustering ranges.

2. test: cql_query_test: remove expectation that the test_query_limit
   fails for reversed queries, since reversed queries no longer require
   linear memory wrt. the result size, when paginated.
2021-10-10 20:38:18 +02:00
Michał Radwański
cc5ea66957 partition_snapshot_reader: reverse ck_range when needed by Reversing
Previous commits made it possible to split the responsibility of two
kinds of clustering key ranges in read_next and next_range_tombstone.
Here, the actual reversal takes place and we start passing the actually
reversed ck_range, if Reversing. This reversed ck_range is stored as a
class member, so that the reversal happens just once for each range.
2021-10-10 20:38:18 +02:00
Michał Radwański
5449982a0b memtable, partition_snapshot_reader: read from partition in reverse
In this commit, I add the ability to read from partition snapshots in
reverse order. Before these changes, a reverse read from memtable has
been handled as follows:
 - A reader higher in the hierarchy of readers performs a read from
   memtable in the forward order, which is not aware of the intention to
   read in reverse.
 - Later, some reader reverses the received mutation fragments.
Memtable decides based on options in `slice`, whether to read forward
or in reverse. Note that previous commit creates a killswitch which
clears the `reverse` option from slice before running the logic of
whether to reverse or not. This is due to the fact that this commit
doesn't contain all the required code changes.

The reversing partition snapshot reader maintains two schemas - one that
is the reversed schema (called _query_schema) for the output, and the
other one (forward one, called _snapshot_schema), which is used to
access the memtable tree (which needs to be the same as the schema used
to create memtable).

The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not.
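
To illustrate the difference, here is a small Python sketch. Modelling
clustering ranges as plain `(start, end)` tuples and ignoring bound
inclusiveness is a simplifying assumption, and both function names are
hypothetical:

```python
def half_reverse(ranges):
    """Reverse the order of the ranges, leaving each range as it is."""
    return list(reversed(ranges))


def fully_reverse(ranges):
    """Reverse the order of the ranges and swap the bounds of each one."""
    return [(end, start) for (start, end) in reversed(ranges)]
```

For the forward ranges `[(1, 3), (5, 8)]`, the half-reversed form is
`[(5, 8), (1, 3)]`, while the fully reversed form is `[(8, 5), (3, 1)]`.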
2021-10-10 20:38:18 +02:00
Michał Radwański
6813c39927 partition_snapshot_reader: rows_position and rows_iter_type supporting
reverse iteration

Iterating in reverse is useful for native reverse memtable reader.
2021-10-10 20:38:18 +02:00
Avi Kivity
ef45a208ef test: radix_tree_printer: avoid template specialization in class context
gcc complains that it's illegal. It's unnecessary too - we can replace
it with a simple overload.
2021-10-10 18:17:53 +03:00
Avi Kivity
11cc772388 test: raft: avoid ignored variable errors
Avoid instantiating unused variables, and in one case ignore it,
to avoid a gcc warning.
2021-10-10 18:17:53 +03:00
Avi Kivity
cdb50b1972 test: reader_concurrency_semaphore_test: isolate from namespace of source_location
More modern gcc uses std::source_location instead of
std::experimental::source_location. Rely on seastar::compat to get it
right for us.
2021-10-10 18:17:53 +03:00
Avi Kivity
a08bcc0528 test: cql_query_test: drop unused lambda assert_replication_not_contains
gcc complains that it exists.
2021-10-10 18:17:53 +03:00
Avi Kivity
9166d1ab1d test: commitlog_test: don't use deprecated seastar::unaligned_cast
unaligned_cast is deprecated, and gcc complains that it violates
strict aliasing rules. Switch to std::copy_n() instead.
2021-10-10 18:17:53 +03:00
Avi Kivity
9907303bf5 test: adjust signed/unsigned comparisons in loops and boost tests
gcc complains about comparing a signed loop induction variable
with an unsigned limit, or comparing an expected value and measured
value. Fix by using unsigned throughout, except in one case where
the signed value was needed for the data_value constructor.
2021-10-10 18:16:50 +03:00
Avi Kivity
15ffd84473 build: silence some gcc 11 warnings
These warnings are valuable, but limit the noise for now by disabling
them.
2021-10-10 18:16:50 +03:00
Avi Kivity
029560c232 sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
clang implements the coroutine technical specification, in namespace
std::experimental. gcc implements C++20 coroutines, in namespace std.
Detect which one is in use and select the namespace accordingly.
2021-10-10 18:16:50 +03:00
Avi Kivity
c38f18163e managed_bytes: avoid compile-time loop in converting constructor
managed_bytes_basic_view is a template with a constructor that
converts from one instantiation of the template to another.
Unfortunately when gcc encounters the associated constraint, it
instantiates the template which forces it to evaluate the constraint
again, sending it into a loop.

Fix that by making the converting constructor a template itself,
delaying instantiation. The constraint is strengthened so the set
of types on which the constructor works is unchanged.
2021-10-10 18:16:50 +03:00
Avi Kivity
f6d59c33ff service: service_level_controller: drop unused variable sl_compare
Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Avi Kivity
cd4af0c722 raft: disambiguate promise name in raft::active_read
gcc complains that the name 'promise' changes meaning (from type
to variable) within active_read. Help it by disambiguating the
use as type.
2021-10-10 18:16:50 +03:00
Avi Kivity
3f9ec5302a locator: azure_snitch: use full type name in definition of globals
Some globals in azure_snitch use std::string in the declaration
and auto in the definition. gcc 11 complains. I don't know if it's
correct, but it's easy to use the type in both declaration and
definition.
2021-10-10 18:16:50 +03:00
Avi Kivity
d83b565938 cql3: statements: create_service_level_statement: don't ignore replace_defaults()
We call replace_defaults on an object named 'slo', but then ignore it.

Use the new object that replace_defaults() returned.

Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Avi Kivity
25f8e9c078 cql3: statement_restrictions: adjust call to std::vector deduction guide
gcc 11 has a hard time parsing a deduction guide use with
braced initializer. The bug [1] was already fixed in gcc 12,
and I've requested a backport, but reduce friction meanwhile
by switching to a form that works in gcc 11.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89062
2021-10-10 18:16:50 +03:00
Avi Kivity
df73d12272 types: remove recursive constraint in deserialize_value
deserialize_value() has a constraint that depends on another
deserialize_value() implementation. Apparently gcc wants to
instantiate the deserialize_value() instance we're constraining
while evaluating the constraint, leading to a loop.

Since this deserialize_value() is just an internal helper, drop
the constraint rather than fighting it.
2021-10-10 18:16:50 +03:00
Avi Kivity
58a0e80021 cql3: restrictions: relax constraint on visitor_with_binary_operator_content
We require that v.current_binary_operator is a 'const binary_operator*',
but it's really a 'const binary_operator*&'. Relax the constraint so it
works with both gcc and clang.
2021-10-10 18:16:50 +03:00
Avi Kivity
fd8beeaea9 treewide: handle switch statements that return
A switch statement where every case returns triggers a gcc
warning if the surrounding function doesn't return/abort.

Fix by adding an abort(). The abort() will never trigger since we
have a warning on unhandled switch cases.
2021-10-10 18:16:50 +03:00
Michał Radwański
a672b8b86f partition_snapshot_reader: split responsibility of ck_range
Previously, next_range_tombstone took as an argument a clustering key
range, which served two purposes. One was for accessing only specified
key ranges from the partition, the other was for deciding in which order
the mutation fragments should be emitted. This commit separates these
responsibilities, since with the advent of the native memtable reader,
these two responsibilities are no longer common. The split is propagated
to the rest of partition_snapshot_reader.hh to avoid confusion.
2021-10-07 17:04:44 +02:00
Michał Radwański
fc51d2cc8c partition_snapshot_reader: separate _schema into _query_schema and _partition_schema
After memtable starts supporting reverse order queries,
the schema provided to the readers will be reversed (reverse clustering
order). Reading from memtable in reverse requires two schemas - one to
access the memtable internal data structures (_partition_schema), and
the other one (_query_schema), the schema imposing clustering order on
returned mutation fragments. This commit prepares for introduction of
native reverse queries for memtable, by separating these responsibilities.
For now, they are still initialized with the schema passed from query.
2021-10-07 17:04:44 +02:00
Mikołaj Sielużycki
235c38e78f sstables, gdb: Retire usage of sstable_tracker
sstables_manager supersedes the previous implementation of sstables_tracker
for tracking lifetime of the tables. Update scylla-gdb.py to use
sstables_manager in a backwards compatible way, as sstables_manager is
not available in Scylla Enterprise 2020.1. Add explicit test for
"scylla sstables" command, as previously only "scylla active-sstables"
was tested.

Closes #9439
2021-10-07 14:40:47 +02:00
Piotr Sarna
59bd25d1ea transport: respond with overloaded exception during shedding
This commit makes shedding always respond - with overloaded exception,
instead of ignoring the request.

Fixes #9442

Closes #9443
2021-10-07 15:38:40 +03:00
Nadav Har'El
d1505762df cql-pytest: add to README an example of repeating a test
pytest supports - if the "repeat" extension is installed - a convenient
and efficient way to repeat the same test (or all of them) multiple times.
Since it's very useful, let's document it in cql-pytest/README.md.

By the way, our test.py also has a "--repeat" option, but it can only run
all cql-pytest tests, not just repeat a single small test, and it is also
slower (and arguably, different) because it restarts Scylla instead of
running a test 100 times on the same Scylla.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211007122146.624210-1-nyh@scylladb.com>
2021-10-07 15:30:41 +03:00
Michael Livshin
e88891a8af avoid race between compaction and table stop
Also add a debug-only compaction-manager-side assertion that tests
that no new compaction tasks were submitted for a table that is being
removed (debug-only because not constant-time).

Fixes #9448.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>
2021-10-07 14:36:39 +03:00
Kamil Braun
96f18c4bb0 test: test_sstable_reversing_reader_random_schema: fix the workaround for #9352
The test generates random mutations and eliminates mutations whose keys
tokenize to 0, in particular it eliminates mutations with empty
partition keys (which should not end up in sstables).

However it would do that after using the randomly generated mutations to
create their reversed versions. So the reversed versions of mutations
with empty partition keys would stay.

Fix by placing the workaround earlier in the test.

Closes #9447
2021-10-07 14:01:43 +03:00
Raphael S. Carvalho
59693e6da3 compaction_manager: make rewrite_sstables() bail out when asked to stop
rewrite_sstables() can be asked to stop either on shutdown or by a
user-triggered compaction which forces all ongoing compactions to stop,
like scrub.

It turns out we weren't actually bailing out from do_until() when the
task cannot proceed. So rewrite_sstables() potentially runs into an
infinite loop, which in turn causes shutdown or anything else waiting
on it to hang forever.

Found this while auditing code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>
2021-10-07 10:46:22 +03:00
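The hang fixed by the commit above can be illustrated with a loop that must check for a stop request on every iteration. This is only a minimal Python sketch of the idea; the real code is C++ built on Seastar's do_until(), and the `Task` class and its fields here are hypothetical stand-ins, not Scylla types.

```python
class Task:
    """Hypothetical stand-in for a compaction task (not Scylla's real type)."""

    def __init__(self, pending, stopping=False):
        self.pending = list(pending)  # sstables still waiting to be rewritten
        self.stopping = stopping      # set on shutdown or when e.g. scrub stops compactions
        self.rewritten = []

    def can_proceed(self):
        return not self.stopping


def rewrite_sstables(task):
    """Rewrite pending sstables one by one; return False if stopped early."""
    while task.pending:
        # Bail out as soon as the task is asked to stop. Without this
        # check the loop would never terminate for a stopped task.
        if not task.can_proceed():
            return False
        task.rewritten.append(task.pending.pop(0))
    return True
```

A stopped task returns immediately and leaves its pending list untouched, so anything waiting on it can make progress.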
Benny Halevy
90fd4d5ed7 test: sstable_conforms_to_mutation_source_test: test_sstable_reversing_reader_random_schema: auto-close reader on exception
I stumbled upon this failure in dev mode:
```
test/boost/sstable_conforms_to_mutation_source_test.cc(0): Entering test case "test_sstable_reversing_reader_random_schema"
sstable_conforms_to_mutation_source_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
```

Since dev mode has no debug symbols I can't
decode the stack trace so I'm not 100% sure about
the root cause and I couldn't reproduce it in release
or debug modes yet.

One vulnerability in the current code is that r1
won't be closed if an exception is thrown before r1 and r2
are moved to `compare_readers` so this change adds
a deferred close of r1 in this case.

Test: sstable_conforms_to_mutation_source_test(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211006144009.696412-1-bhalevy@scylladb.com>
2021-10-06 17:53:49 +03:00
Avi Kivity
b08c299713 cql3: expr: correct type of captured map value_type
A map's value_type has const key, but in two places we omitted
the const. This causes construction of a new value, plus gcc
complaining that we're referring to a temporary.

Fix by using the correct type.
2021-10-06 14:57:43 +03:00
Avi Kivity
eac95e2370 cdc: adjust type of streams_count
streams_count has signed type, but it's compared against an unsigned
type, annoying gcc. Since a count should be positive, convert it to
an unsigned type.
2021-10-06 14:56:00 +03:00
Avi Kivity
5a5a47c4c7 alternator: disambiguate attrs_to_get in table_requests
There is a table_requests::attrs_to_get type, and also a type
named attrs_to_get used in the same struct, and gcc doesn't
like this. Disambiguate the type by fully qualifying it.
2021-10-06 14:55:48 +03:00
Takuya ASADA
3b798afc1e scylla_io_setup: handle nr_disks on GCP correctly
nr_disks is int, should not be string.

Fixes #9429

Closes #9430
2021-10-06 12:31:38 +03:00
Nadav Har'El
0f8d3ea459 cql-pytest: translate Cassandra's tests for ORDER BY
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectOrderByTest.java into our cql-pytest
framework.

This test file includes 17 tests for various features and corners of
SELECT's "ORDER BY" feature. All these tests pass on Cassandra, but
three fail on Scylla and are marked as xfail:

One previously-unknown Scylla bug:

  Refs #9435: SELECT with IN, ORDER BY and function call does not obey
              the ORDER BY

And two new reproducers for already known bugs:

  Refs #2247: ORDER BY should allow skipping equality-restricted
              clustering columns
  Refs #7751: Allow selecting map values and set elements, like in
              Cassandra 4.0

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005174140.571056-1-nyh@scylladb.com>
2021-10-06 12:31:38 +03:00
Avi Kivity
0ea79559a6 Merge 'IDL: support generating boilerplate code for RPC verbs' from Pavel Solodovnikov
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:

```
        verb [[attr1, attr2...]] my_verb (args...) -> return_type;
```

`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.

For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.

These are the methods being created for each RPC verb:

```
        static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
        static future<> unregister_my_verb(netw::messaging_service* ms);
        static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```

Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.

There is also a method to unregister all verbs at once:

```
        static future<> unregister(netw::messaging_service* ms);
```

The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
  an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
  to the handler function and `send*` method uses `send_message_*_timeout`
  variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
  `future<rpc::no_wait_type>` return type to designate that a client
  doesn't need to wait for an answer.

The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.

No existing code is affected.

Ref: #1456

Closes #9359

* github.com:scylladb/scylla:
  idl: support generating boilerplate code for RPC verbs
  idl: allow specifying multiple attributes in the grammar
  message: messaging_service: extract RPC protocol details and helpers into a separate header
2021-10-05 18:05:24 +03:00
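The naming scheme described in the merge above (verb name to enumeration value, registrator class and per-verb methods) can be summarised in a few lines. The helper below is a hypothetical illustration written in Python; it is not part of the IDL compiler and only restates the names given in the message.

```python
def generated_names(idl_module, verb):
    """Return the identifiers the message says are derived from an RPC verb.

    `idl_module` is the IDL file name without the `.idl.hh` suffix.
    This helper is illustrative only; the real generator emits C++.
    """
    return {
        "enum_value": f"netw::messaging_verb::{verb.upper()}",
        "registrator_class": f"{idl_module}_rpc_verbs",
        "methods": [f"register_{verb}", f"unregister_{verb}", f"send_{verb}"],
    }
```

For example, `generated_names("idl_module", "my_verb")` yields the `MY_VERB` enumeration value and the `idl_module_rpc_verbs` class named in the message.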
Michał Radwański
dac2509a7f query: reverse clustering_range 2021-10-05 16:47:04 +02:00
Tzach Livyatan
bd87c7d362 Update docker-hub text
Mention aarch64 support

Closes #9436
2021-10-05 17:35:02 +03:00
Raphael S. Carvalho
342bfbd65a compaction: Make major compaction on keyspace resilient if low on space
Let's major compact the smallest tables first, increasing chances of
success if low on disk space.

parallel_for_each() didn't have any effect on space requirement as
compaction_manager serializes major compaction in a shard.
As parallel_for_each() is no longer used, find_column_family() is now
used before each compact_all_sstables() to avoid a race with table drop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005135257.31931-1-raphaelsc@scylladb.com>
2021-10-05 17:04:34 +03:00
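The ordering introduced by the commit above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual C++ change: `table_sizes`, `table_exists` and `compact` are hypothetical stand-ins, and skipping a dropped table is an assumption about how the lookup failure is handled.

```python
def major_compaction_order(table_sizes):
    """Return table names ordered from smallest to largest on-disk size."""
    return sorted(table_sizes, key=table_sizes.get)


def compact_keyspace(table_sizes, table_exists, compact):
    """Compact tables one at a time, smallest first.

    `table_exists` plays the role of the lookup done right before each
    compaction, so a table dropped in the meantime is not compacted.
    Skipping such a table is an assumption made for this sketch.
    """
    done = []
    for name in major_compaction_order(table_sizes):
        if not table_exists(name):
            continue
        compact(name)
        done.append(name)
    return done
```

Compacting small tables first frees space early, which gives the larger tables a better chance of fitting when the disk is nearly full.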
Nadav Har'El
77bd4afda7 test/alternator: avoid client-side validation
Ever since we started testing Alternator with tests written in Python
and using Amazon's "boto3" library, one limitation kept annoying us:

Boto3 verifies the validity of the request parameters before passing
them on to the server. It verifies that mandatory parameters are not
missing, that parameters have the right types, and sometimes even the
right ranges - all in the library before ever sending the request.
This meant that in many cases, we couldn't get good test coverage for
Alternator's server-side handling of *wrong* parameters.

As it turns out, it is trivial to tell boto3 to *not* do its client-side
request validation, with the `parameter_validation=False` config flag.
We just never noticed that such a flag existed :-)

So this patch adds this flag. It then fixes a few tests which expected
ParameterValidationError - this error is the client-side validation
failure, but should now be replaced by checking the server-side error.

The patch also adds a couple of invalid parameter checks that we
couldn't do before because of boto3's eagerness to check them on the
client side. We can add a lot more of these error tests in the future,
now that we got rid of client-side validation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005095514.537226-1-nyh@scylladb.com>
2021-10-05 13:26:51 +02:00
Nadav Har'El
6dee86eade test/alternator: another test for adding a GSI to an existing table
This patch adds yet another test for Alternator's unimplemented feature
of adding a GSI to an already existing table (issue #5022), but this
test is for a very specific corner case - tables which contain string
attributes with an empty value - the corner case described in
issue #9424:

DynamoDB used to forbid any string attributes from being set to an empty
string, but this changed in May 2020, and since then empty strings are
allowed - but NOT as keys. So although it is legal to set a string
attribute to an empty string, if this table has a GSI whose key is that
specific attribute, the update command is refused. We already had a
test for this - test_gsi_empty_value.

However, the case in this patch is the case where a GSI is added to a
table *after* the table already has data. In this case (as this test
demonstrates), we are supposed to drop the items which have the empty
string key from the GSI.

Even once #5022 (the ability to add GSIs to existing tables) is done,
this test will continue to fail. The unique problem of this test is that
Scylla's materialized views *do* allow empty strings as clustering keys
(right now) and even partition keys (once #9375 is solved), while
we don't want them to enter the GSI. We will probably need to add to the
view's filter, which right now contains (as required) "x IS NOT NULL"
also the filter "x != ''" (when x's type is a string or binary) so that
items with empty-string keys will be dropped.

Refs #5022
Refs #9375
Refs #9424

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003170636.477582-1-nyh@scylladb.com>
2021-10-05 13:26:43 +02:00
Nadav Har'El
b136104298 alternator/test: test for invalid numeric values
DynamoDB has a rather baroque definition of numbers, and in particular
it does *not* allow numeric attributes to be set to infinity or NaN.

Although I checked invalid numbers manually in the past, I was never
able to write a unit test for this - because the boto3
library catches such errors on the client side, and prevents the test from
sending broken requests to the server. So in this patch, I finally came up
with a solution - a context manager client_no_transform() which
yields a client which does NOT do any transformation or validation
on the request's parameters, allowing us to use boto3 to create
improper requests - and test the server's handling of them.

The test in this patch passes - it did not discover a new bug, but
it is a useful regression test and the client_no_transform() trick
can be used in more error-case tests which until now we were unable
to write.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211004161809.520236-1-nyh@scylladb.com>
2021-10-05 13:13:45 +02:00
Avi Kivity
2d25705db0 cql3: deinline non-trivial methods in selection.hh
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.

Includes that were lost due to the change are restored in individual
translation units.

Closes #9434
2021-10-05 12:58:55 +02:00
Avi Kivity
d3f8148807 utils: untie rjson.hh from base64.hh
base64.hh pulls in the huge rjson.hh, so if someone just wants
a base64 codec they have to pull in the entire rapidjson library.

Move the json related parts of base64.hh to rjson.hh and adjust
includes and namespaces.

In practice it doesn't make much difference, as all users of base64
appear to want json too. But it's cleaner not to mix the two.

Closes #9433
2021-10-05 12:57:54 +02:00
Kamil Braun
cab3f2e2d2 test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test
We use a dedicated thread (similarly to the nemesis thread)
to periodically perform reconfigurations.
2021-10-05 11:59:29 +02:00
Kamil Braun
fde26eb476 test: raft: randomized_nemesis_test: improve the bouncing algorithm
The bouncing algorithm tries to send a request to other servers in the
configuration after it receives a `not_a_leader` response.

Improve the algorithm so it doesn't try the same server twice.
2021-10-05 11:54:16 +02:00
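The improved bouncing described above can be sketched as a loop that remembers which servers it has already tried. This is a simplified Python illustration, not the test's C++ code: `send` is a hypothetical callable returning either a result or the string "not_a_leader".

```python
def bounce(servers, first, send):
    """Send a request, moving to an untried server after each not_a_leader.

    Returns the final reply and the list of servers contacted, in order.
    """
    tried = []
    current = first
    while current is not None:
        tried.append(current)
        reply = send(current)
        if reply != "not_a_leader":
            return reply, tried
        # Pick the next server we have not contacted yet, if any.
        current = next((s for s in servers if s not in tried), None)
    return "not_a_leader", tried
```

Because each server appears in `tried` at most once, the number of attempts is bounded by the size of the configuration.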
Kamil Braun
3ac8216a7b test: raft: randomized_nemesis_test: handle more error types
With reconfigurations the `commit_status_unknown` error may start
appearing.
2021-10-05 11:54:16 +02:00
Kamil Braun
98add5a4fc test: raft: randomized_nemesis_test put variant and monostate ostream operator<<s into std namespace
As a preparation for the following commits.
Otherwise the definitions wouldn't be found during argument-dependent lookup
(I don't understand why it worked before but won't after the next commit).
2021-10-05 11:54:16 +02:00
Kamil Braun
4956217341 test: raft: randomized_nemesis_test: reconfiguration operation
The operation sends a reconfiguration request to a Raft cluster. It
bounces a few times in case of `not_a_leader` results.

A side effect of the operation is modifying a `known` set of nodes which
the operation's state has a reference to. This `known` set can then be
used by other operations (such as `raft_call`s) to find the current
leader.

For now we assume that reconfigurations are performed sequentially. If a
reconfiguration succeeds, we change `known` to the new configuration. If
it fails, we change `known` to be the set sum of the previous
configuration and the current configuration (because we don't know what
the configuration will eventually be - the old or the attempted one - so
any member of the set sum may eventually become a leader).
2021-10-05 11:54:16 +02:00
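The update rule for the `known` set described above is small enough to write out. A minimal Python sketch, with server identifiers as plain strings and a hypothetical function name:

```python
def update_known(previous, attempted, succeeded):
    """Return the new `known` set after a reconfiguration attempt.

    On success the cluster is in the attempted configuration. On failure
    we cannot tell which configuration will win, so any member of either
    one may become a leader and must stay in the set.
    """
    previous, attempted = set(previous), set(attempted)
    if succeeded:
        return attempted
    return previous | attempted
```

Keeping the union after a failure is the conservative choice: later operations may contact a server that is no longer a member, but they will not miss the real leader.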
Avi Kivity
3a67c661d4 Merge "Improve parallelizm of mutation source tests" from Pavel E
"
There's a run_mutation_source_tests lib helper that runs a bunch of
tests sequentially. The problem is that it runs 4 different flavors
of them, each being a certain decoration over the provided reader.
This amplification makes some test cases run for an enormous amount
of time without any chance for parallelism.

The simplest way to help running those cases in parallel is to teach
the slowest cases to run different flavors of mutation source tests
in dedicated cases. This patch makes it so.

The resulting timings are
                                  dev   debug
                 sequential run:  2m1s  53m50s
--parallel-cases (+ this patch):  1m3s  31m15s

tests: unit(dev, debug)
"

* 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla:
  test: Split multishard combining reader case
  test: Split database test case
  test: Split run_mutation_source_tests
2021-10-05 12:22:52 +03:00
Kamil Braun
0c24c18d0c test: cql_query_test: fix test_query_limit for reversed queries
(Single-partition) reversed queries are no longer unlimited but some
places still treat them as such. This causes, for example, shorter pages
for such queries, which breaks a test that expects certain results to
come in a single page.
2021-10-05 11:22:39 +02:00
Tomasz Grabiec
17430795e8 Merge "test: raft: randomized_nemesis_test: handle missing snapshot in rpc::send_snapshot" from Kamil
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case by throwing an
exception.

As a preparation we also fix the code in `server_impl::send_snapshot` so
it works correctly when `rpc::send_snapshot` throws or returns a ready
future.

Refs #9407.

* kbr/snapshot-handle-errors:
  test: raft: randomized_nemesis_test: remove an obsolete comment
  test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot`
  raft: server: handle `rpc::send_snapshot` returning instantly
2021-10-05 11:19:14 +02:00
Kamil Braun
c9a7778497 test: raft: randomized_nemesis_test: remove an obsolete comment 2021-10-05 11:04:11 +02:00
Kamil Braun
961f5a904c test: raft: randomized_nemesis_test: handle missing snapshot in rpc::send_snapshot
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case.

Refs #9407.
2021-10-05 11:04:11 +02:00
Kamil Braun
36f3e26374 raft: server: handle rpc::send_snapshot returning instantly
If `rpc::send_snapshot` returned immediately with a ready future, or if
it threw, the code in `server_impl::send_snapshot` would not update
`_snapshot_transfers` correctly.

The code assumed that the continuation attached to `rpc::send_snapshot`
(with `then_wrapped`) was executed after `_snapshot_transfers` below the
`rpc::send_snapshot` call was updated. That would not necessarily be true
(the continuation may even not have been executed at all if
`rpc::send_snapshot` threw).

Fix that by wrapping the `rpc::send_snapshot` call into a continuation
attached to `later()`.

Originally authored by Gleb <gleb@scylladb.com>, I added a comment.
2021-10-05 11:04:11 +02:00
Pavel Emelyanov
b742e6cbb6 test: Split multishard combining reader case
All the cases in this test also run mutation source tests, and the
case with a single-fragment buffer takes several times longer to
execute than the others.

Splitting this single case so that it runs mutation source test
flavours in different cases improves the test parallelism.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:57:02 +03:00
Pavel Emelyanov
30075094ac test: Split database test case
The test_database_with_data_in_sstables_is_a_mutation_source case runs
the mutation source tests in one go. The problem is that on each step
a whole new ks:cf is created, which takes the majority of the test's
time. As a result this case is the slowest one in the suite, taking
up to two times longer (depending on mode) than the #2 on this list.

This patch splits the case into 4 so that each mutation source flavor
is run in separate case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:53:18 +03:00
Pavel Emelyanov
1e09a2c925 test: Split run_mutation_source_tests
There are 4 flavours of mutation source tests that are all run
sequentially -- plain, reversed and upgrade/downgrade ones that
check v1<->v2 conversions.

This patch splits them all into individual calls so that some
tests may want to have dedicated cases for each. "By default" they
are all run as they were.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:51:43 +03:00
Pavel Emelyanov
4b4ce015aa system-keyspace: Keep UUID value when saving
set_local_host_id() accepts a UUID by reference and starts saving
it in the local keyspace and in all shards' local caches.
Before the function was coroutinized, the UUID was copied into
captures and survived; after that it remains a reference. The
problem is that callers pass local variables as arguments, and
those go away "really soon".

Fix it by accepting the UUID by value; it is short enough to be
copied safely and painlessly.

fixes: #9425
tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211004145421.32137-1-xemul@scylladb.com>
2021-10-04 18:21:44 +03:00
Tomasz Grabiec
cd9b4d95fc Merge "test: raft: randomized_nemesis_test: better liveness check at the end of generator test" from Kamil
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.

* kbr/nemesis-liveness-check:
  test: raft: randomized_nemesis_test: better liveness check at the end of generator test
  test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader`
2021-10-04 16:05:37 +02:00
Kamil Braun
17e771c5f5 test: raft: randomized_nemesis_test: better liveness check at the end of generator test
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.

The commit also removes an incorrect assertion inside `wait_for_leader`.
2021-10-04 15:57:54 +02:00
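The retrying liveness check described above can be sketched as a loop bounded by a single deadline. This is a Python illustration under assumptions, not the test's C++ code: `find_leader` and `send_request` are hypothetical callables, and the deadline is a point in time on the same clock as `clock`, mirroring the switch from a duration to a `time_point`.

```python
import time


def check_liveness(find_leader, send_request, deadline,
                   clock=time.monotonic, pause=lambda: None):
    """Retry 'find a leader, then send a request' until it works or time is up.

    `find_leader(deadline)` returns a server or None; `send_request(server)`
    returns True on success. The same deadline is passed on every attempt,
    so the total time stays bounded however many retries are needed.
    """
    while clock() < deadline:
        leader = find_leader(deadline)
        if leader is not None and send_request(leader):
            return True
        pause()  # hypothetical hook, e.g. a short sleep between attempts
    return False
```

Passing a point in time rather than a duration avoids restarting the time budget on each iteration of the loop.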
Kamil Braun
478a58e86d test: raft: randomized_nemesis_test: take time_point instead of duration in wait_for_leader
To be used in the next commit, where we call `wait_for_leader` in a loop
with the same deadline `time_point`.
2021-10-04 15:56:54 +02:00
Tomasz Grabiec
e89b9799b8 Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.

The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.

We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.

We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.

We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.

The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.

Details in commit messages. Read the commits in topological order
for best review experience.

Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)

Closes #9281

* github.com:scylladb/scylla:
  table: add option to automatically bypass cache for reversed queries
  test: reverse sstable reader with random schema and random mutations
  sstables: mx: implement reversed single-partition reads
  sstables: mx: introduce partition_reversing_data_source
  sstables: index_reader: add support for iterating over clustering ranges in reverse
  clustering_key_filter: clustering_key_filter_ranges owning constructor
  flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
  clustering_key_filter: document clustering_key_filter_ranges::get_ranges
2021-10-04 15:37:34 +02:00
Kamil Braun
703aed3277 table: add option to automatically bypass cache for reversed queries
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.

If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
2021-10-04 15:24:12 +02:00
Kamil Braun
9bf6be5509 test: reverse sstable reader with random schema and random mutations
The test generates a random set of mutations and creates two readers:
- one by reversing the mutations, creating an sstable out of the result,
  and querying it in reverse,
- one by creating an sstable directly from the mutations and querying it
  in forward mode.

It checks that the readers give equal results.

The test already managed to find a bug where offsets returned by the
sstable index were interpreted incorrectly as absolute instead of
relative. It also helped find another bug unrelated to reversing (#9352).

Surprisingly few tests use the random schema and random mutation
utilities which seem to be quite powerful.
2021-10-04 15:24:12 +02:00
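The shape of the randomized test described above can be shown with a toy model. In this Python sketch an "sstable" is just a list of rows sorted under the schema's clustering order; all names are hypothetical, and it models only the comparison logic, not Scylla's random schema utilities.

```python
import random


def make_sstable(rows, reverse_schema=False):
    """Toy 'sstable': rows sorted by the schema's clustering order."""
    return sorted(rows, reverse=reverse_schema)


def read(sstable, reverse=False):
    """Toy reader: return rows in stored order, or reversed on request."""
    return list(reversed(sstable)) if reverse else list(sstable)


def check_reversing_reader(seed, n=50):
    """Compare a forward read with a reverse read of the reversed data."""
    rng = random.Random(seed)
    rows = [(rng.randrange(1000), rng.random()) for _ in range(n)]
    forward = read(make_sstable(rows))
    via_reverse = read(make_sstable(rows, reverse_schema=True), reverse=True)
    return forward == via_reverse
```

Seeding the generator makes any failing input reproducible, which matters when a random test uncovers a bug.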
Kamil Braun
27238eaa0f sstables: mx: implement reversed single-partition reads
We use partition_reversing_data_source and the new `index_reader` methods
to implement single-partition reads in `mx_sstable_mutation_reader`.

The parsing logic does not need to change: the buffers returned by the
source already contain rows in reversed clustering order.

Some changes were required in `mp_row_consumer_m` which processes the
parsed rows and emits appropriate mutation fragments. The consumer uses
`mutation_fragment_filter` underneath to decide whether a fragment
should be ignored or not (e.g. the parsed fragment may come from outside
the requested clustering range), among other things. Previously
`mutation_fragment_filter` was provided a `partition_slice`. If the
slice was reversed, the filter would use
`clustering_key_filter_ranges::get_ranges` to obtain the clustering
ranges from the slice in unreversed order (they were reversed in the
slice) since we didn't perform any reversing in the reader. Now the
reader provides the ranges directly instead of the slice; furthermore,
the ranges are provided in native-reversed format (the order of ranges
is reversed and the ranges themselves are also reversed), and the schema
provided to the filter is also reversed. Thus to the filter everything
appears as if it was used during a non-reversed query but on a table
with reversed schema, which works correctly given the fact that the
reader is feeding parsed rows into the consumer in reversed order.

During reversed queries the reader uses alternative logic for skipping
to a later range (or, speaking in non-reversed terms, to an earlier range),
which happens in `advance_context`. It asks the index to advance its
upper bound in reverse so that the reversing_data_source notices the
change of the index end position and returns following buffers with rows
from the new range.

There is a slight difference in behavior of the reader from
`mp_row_consumer_m`'s point of view. For non-reversed reads, after
the consumer obtains the beginning of a row (`consume_row_start`)
- which contains the row's position but not the columns - and tells the
reader that the row won't be emitted because we need to skip to a later
range, the reader would tell the data source (the 'context') immediately
to skip to a later range by calling `skip_to`. This caused the source
not to return the rest of the row, and the rest of the row would not
be fed to the consumer (`consume_row_end`). However, for reversed reads,
the data source performs skipping 'on its own', after it notices that
the index end position has changed. This may happen 'too late', causing
the rest of the row to be returned anyway. We are prepared for this
situation inside `mp_row_consumer_m` by consulting the mutation fragment
filter again when the rest of the row arrives.

Fast forwarding is not supported at this point, which is fine given that
the cache is disabled for reversed queries for now (and the cache is the
only user of fast forwarding).

The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not. This means we need to modify
the slice sometimes: for non-single-partition queries the mx reader must
use a non-reversed slice, and for single-partition queries the mx reader
must use a native-reversed slice (where the clustering ranges themselves
are reversed as well). The modified slice must be stored somewhere; we
store it inside the mx reader itself so we don't need to allocate more
intermediate readers at the call sites.  This causes the interface of
`mx::make_reader` to be a bit weird: for non-single-partition queries
where the provided slice is reversed the reader will actually return a
non-reversed stream of fragments, telling the user to reverse the stream
on their own. The interface has been documented in detail with
appropriate comments.
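
The slice formats described above can be illustrated with a minimal
sketch. It models clustering ranges as plain `(start, end)` integer
pairs in forward order; the helper names `half_reversed` and
`native_reversed` are hypothetical and are not part of the reader's
interface.

```python
# Illustrative model only: clustering ranges as (start, end) pairs given
# in forward (schema) order. The function names are hypothetical.

def half_reversed(ranges):
    """Reverse the order of the ranges, but leave each range untouched."""
    return list(reversed(ranges))


def native_reversed(ranges):
    """Reverse the order of the ranges and swap the bounds of each range."""
    return [(end, start) for (start, end) in reversed(ranges)]


forward = [(1, 3), (5, 8), (10, 12)]
half = half_reversed(forward)      # what callers pass for reversed queries
native = native_reversed(forward)  # what the single-partition reader uses
```

In this model, un-reversing a half-reversed slice is just another
`half_reversed` call, which corresponds to the non-single-partition case.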
2021-10-04 15:24:12 +02:00
Wojciech Mitros
64e703bb54 sstables: mx: introduce partition_reversing_data_source
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.

As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.

In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index' upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.

An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity in the input stream
is that we would have two kinds of skips: a single row skip,
and a skip to a clustering range. We know the offset of the next row,
so we could check that to differentiate them. We would also need to add
an information about the position of first clustering row and end of
the last one in the index_reader. Skipping by checking the index seems
to be overall simpler.

For simplicity, the intermediary stream always starts with
parsing the partition header and (if present) the static row,
and returning the corresponding bytes as a result of the first
read.

After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.

After finding the data range of the last row, we parse rows
consecutively in reversed order.  We must parse the rows partially
to learn their lengths and the positions of previous rows. We're
using similar constructs as in the sstable parser, but it only
contains a small part of the parsing coroutine and doesn't perform
any correctness checks.  The parser for rows still turned out rather
big mostly because we can't always deduce the size of the clustering
blocks without reading the block header.

The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.

The intermediary data source has one more utility: reversing range
tombstones.  When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
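
As a rough illustration of the tombstone reversal described above, the
sketch below uses a simplified model with only three marker kinds and
integer timestamps. The `Marker` type and `reverse_marker` helper are
hypothetical; the real on-disk format distinguishes more bound kinds
(e.g. inclusive and exclusive variants), which are ignored here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Marker:
    """Simplified range tombstone marker (hypothetical model)."""
    kind: str                        # "start", "end" or "boundary"
    ts_before: Optional[int] = None  # timestamp of the tombstone ending here
    ts_after: Optional[int] = None   # timestamp of the tombstone starting here


_REVERSED_KIND = {"start": "end", "end": "start", "boundary": "boundary"}


def reverse_marker(m):
    """Flip the bound kind and swap the before/after timestamps."""
    return Marker(_REVERSED_KIND[m.kind], ts_before=m.ts_after, ts_after=m.ts_before)


forward = [
    Marker("start", ts_after=5),
    Marker("boundary", ts_before=5, ts_after=7),
    Marker("end", ts_before=7),
]
# Reading the partition backwards visits the markers in opposite order.
backward = [reverse_marker(m) for m in reversed(forward)]
```

The reversed stream again opens with a start and closes with an end, so
a consumer never sees an end before its matching start.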
2021-10-04 15:24:12 +02:00
Wojciech Mitros
8385f3eb21 sstables: index_reader: add support for iterating over clustering ranges in reverse
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.

To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.

We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.

To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.

When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.

For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition).  For this
we introduce the `index_reader::advance_reverse` function.
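
The reverse iteration scheme can be sketched on a toy model. Here a
partition is an in-memory sorted list of positions and the index bounds
are list offsets; `read_reversed` is a hypothetical helper and does not
reflect the actual `index_reader` interface.

```python
import bisect


def read_reversed(rows, ranges):
    """Toy model of a reversed single-partition read.

    rows:   sorted list of clustering positions in one partition.
    ranges: non-overlapping inclusive (start, end) pairs in forward order.
    """
    out = []
    upper = len(rows)  # the upper bound starts after the end of the slice
    for start, end in reversed(ranges):  # start with the last range
        # "Advance the upper bound in reverse": move it closer to the
        # beginning of the partition, just past the end of this range.
        upper = min(upper, bisect.bisect_right(rows, end))
        lower = bisect.bisect_left(rows, start)
        out.extend(reversed(rows[lower:upper]))
        upper = lower  # the range is exhausted
    return out


result = read_reversed([1, 2, 3, 4, 5, 6, 7, 8, 9], [(2, 3), (6, 8)])
```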
2021-10-04 15:24:12 +02:00
Avi Kivity
3cb865103d Update seastar submodule
* seastar 0ba6c36cc3...994b4b5a0c (2):
  > Merge "Improve smp::invoke_on_others" from Pavel E
  > file: keep hint pointer alive when calling fcntl()
2021-10-04 15:36:45 +03:00
Avi Kivity
148a12f3da Merge "Keep storage_service less aware of cdc internals" from Pavel E
"
The storage_service is involved in the cdc_generation_service guts
more than needed.

 - the bool _for_testing bit is cdc-only
 - there's API-only cdc_generation_service getter
 - cdc_g._s. startup code partially sits in s._s. one

This patch cleans most of the above leaving only the startup
_cdc_gen_id on board.

tests: unit(dev)
refs: #2795

"

* 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla:
  api: Use local sharded<cdc::generation_service> reference
  main: Push cdc::generation_service via API
  storage_service: Ditch for_testing boolean
  cdc: Replace db::config with generation_service::config
  cdc: Drop db::config from description_generator
  cdc: Remove all arguments from maybe_rewrite_streams_descriptions
  cdc: Move maybe_rewrite_streams_descriptions into after_join
  cdc: Squash two methods into one
  cdc: Turn make_new_cdc_generation a service method
  cdc: Remove ring-delay arg from make_new_cdc_generation
  cdc: Keep database reference on generation_service
2021-10-04 14:56:05 +03:00
Piotr Dulikowski
6093c2378b hints: assign _last_written_rp in ep manager's move constructor
The end_point_hints_manager's field _last_written_rp is initialized in
its regular constructor, but is not copied in the move constructor.
Because the move constructor is always involved when creating a new
endpoint manager, the _last_written_rp field is effectively always
initialized with the zero-argument constructor, and is set to the zero
value.

This can cause the following erroneous situation to occur:

- Node A accumulates hints towards B.
- Sync point is created at A.
  It will be used later to wait for currently accumulated hints.
- Node A is restarted.
  The endpoint manager A->B is created which has bogus value in the
  _last_written_rp (it is set to zero).
- Node A replays its hints but does not write any new ones.
- A hint flush occurs.
  If there are no hint segments on disk after flush, the endpoint
  manager sets its last sent position to the last written position,
  which is by design. However, the last written position has incorrect
  value, so the last sent position also becomes incorrect and too low.
- Try to wait for the sync point created earlier.
  The sync point waiting mechanism waits until last sent hint position
  reaches or goes past the position encoded in the sync point, but it
  will not happen because the last sent position is incorrect.

The above bug can be (sometimes) reproduced in
hintedhandoff_sync_point_api_test dtest.

Now, the _last_written_rp field is properly initialized in the move
constructor, which prevents the bug described above.
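
The scenario above can be modelled with a small sketch. Replay
positions are reduced to integers and the class below is a hypothetical
stand-in for the endpoint manager; the value `REGULAR_CTOR_VALUE` is an
assumption standing for whatever the regular constructor would have set.

```python
class EndpointHintsManagerModel:
    """Toy model of one A->B endpoint manager; positions are integers."""

    def __init__(self, last_written_rp):
        self.last_written_rp = last_written_rp
        self.last_sent_rp = 0

    def flush_with_no_segments_left(self):
        # By design: with no hint segments on disk after a flush, the
        # last sent position becomes the last written position.
        self.last_sent_rp = self.last_written_rp

    def sync_point_reached(self, sync_point_rp):
        return self.last_sent_rp >= sync_point_rp


SYNC_POINT = 42            # position encoded in the earlier sync point
REGULAR_CTOR_VALUE = 100   # assumption: value set by the regular constructor

# Before the fix: the field is lost in the move and ends up as zero.
buggy = EndpointHintsManagerModel(last_written_rp=0)
buggy.flush_with_no_segments_left()

# After the fix: the move constructor carries the field over.
fixed = EndpointHintsManagerModel(last_written_rp=REGULAR_CTOR_VALUE)
fixed.flush_with_no_segments_left()
```

In this model, waiting on `SYNC_POINT` never completes for `buggy`
unless new hints are written, while `fixed` satisfies it right after the
flush.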

Fixes: #9320
Closes #9426
2021-10-04 13:21:34 +02:00
Kamil Braun
b2f33b3e0b test: raft: randomized_nemesis_test: abort environment before ticker
We must abort the environment before the ticker as the environment may
require time to keep advancing during abort in order for all operations
to finish, e.g. operations that can finish only due to timeout.
Currently such operations may cause the test to hang indefinitely
at the end.

The test requires a small modification to ensure that
`delivery_queue::push` is not called after the queue was aborted.
Message-Id: <20210930143539.157727-1-kbraun@scylladb.com>
2021-10-04 12:31:26 +02:00
Avi Kivity
1bac93e075 Merge "simplifications and layer violation fix for compaction manager" from Raphael
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."

* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
  compaction: split compaction info and data for control
  compaction_manager: use task when stopping a given compaction type
  compaction: remove start_size and end_size from compaction_info
  compaction_manager: introduce helpers for task
  compaction_manager: introduce explicit ctor for task
  compaction: kill sstables field in compaction_info
  compaction: kill table pointer in compaction_info
  compaction: simplify procedure to stop ongoing compactions
  compaction: move management of compaction_info to compaction_manager
  compaction: move output run id from compaction_info into task
2021-10-04 13:09:31 +03:00
Avi Kivity
93b765f655 scripts/pull_github_pr.sh: don't guess git remote name
The script assumes the remote name is "origin", a fair
assumption, but not universally true. Read it from configuration
instead of guessing it.

Closes #9423
2021-10-04 12:32:39 +03:00
Nadav Har'El
414b672e22 test/alternator: verify that empty-string keys are NOT allowed
Since May 2020 empty strings are allowed in DynamoDB as attribute values
(see announcement in [1]). However, they are still not allowed as keys.

We had tests that they are not allowed in keys of LSI or GSI, but missed
tests that they are not allowed as keys (partition or sort key) of base
tables. This patch adds these missing tests.

These tests pass - we already had code that checked for empty keys and
generated an appropriate error.

Note that for compatibility with DynamoDB, Alternator will forbid empty
strings as keys even though Scylla *does* support this possibility
(Scylla always supported empty strings as clustering key, and empty
partition keys will become possible with issue #9352).

[1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003122842.471001-1-nyh@scylladb.com>
2021-10-04 08:40:43 +02:00
Botond Dénes
61e7d3de90 Merge 'Cleanup compaction_stop_exception' from Benny Halevy
The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception`
and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown
or api event vs. compaction aborting due to scrub validation error.

While at it, cleanup the existing retry logic related to `compaction_stop_exception`.

Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)

Closes #9321

* github.com:scylladb/scylla:
  compaction: split compaction_aborted_exception from compaction_stopped_exception
  compaction_manager: maybe_stop_on_error: rely on retry=false default
  compaction_manager: maybe_stop_on_error: sync return value with error message.
  compaction: drop retry parameter from compaction_stop_exception
  compaction_manager: move errors stats accounting to maybe_stop_on_error
2021-10-04 07:27:11 +03:00
Takuya ASADA
9c830297ac scylla_util.py: add persistent disk support for GCE
Just like EBS disks for EC2, we want to use persistent disk on GCE.
We don't recommend using it, but we still need to support it.

Related scylladb/scylla-machine-image#215

Closes #9395
2021-10-03 17:58:18 +03:00
Takuya ASADA
d87b80ad14 scylla_util.py: add persistent disk support for Azure
Just like EBS disks for EC2, we want to use persistent disk on Azure.
We don't recommend using it, but we still need to support it.

Related https://github.com/scylladb/scylla-machine-image/issues/218

Closes #9417
2021-10-03 17:56:31 +03:00
Avi Kivity
adcd5a69d6 Update seastar submodule
* seastar e6db0cd587...0ba6c36cc3 (6):
  > semaphore: add try_get_units
  > build: adjust compilation for libfmt 8+
  > alloc_failure_injector: add explicit zero-initialization
  > Change wakeup() from private to public in reactor.hh
  > app-template: separate seastar options into --seastar-help
  > files: Don't ignore FS info for read-only files
2021-10-03 13:14:43 +03:00
Piotr Sarna
1d353bd6e7 docs: mention scripts/pull_github_pr.sh
The pull_github_pr.sh script is preferred over colorful github
buttons, because it's designed to always assign proper authors.
It also works for both single- and multi-patch series, which
makes the merging process more universal.
Message-Id: <b982b650442456b988e1cea59aa5ad221207b825.1633101849.git.sarna@scylladb.com>
2021-10-03 10:19:26 +03:00
Michał Radwański
0d5a2067ad test/lib/failure_injecting_allocation_strategy: remove UB...
by setting _alloc_count initially to 0.

The _alloc_count hasn't been explicitly initialized. As the allocator has
usually been an automatic variable, _alloc_count initially had some
unspecified contents. This probably means that cases where the first few
allocations passed and a later one failed might never have been
tested. The good thing is that most of the users have been transferred to
the Seastar failure injector, which (by accident) has been correct.

Closes #9420
2021-10-01 13:25:05 +02:00
Pavel Emelyanov
85d86cc85f scripts: Fix origin repo URL parsing
It assumes that the origin URL is git@ one while it can be
the https:// one as well.
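
A minimal sketch of accepting both URL forms is shown below. It is
written in Python rather than shell, and it assumes that the goal is to
extract the `owner/repo` part of a GitHub URL; the function name
`project_from_origin_url` is hypothetical.

```python
import re

# Accept both "git@github.com:owner/repo(.git)" and
# "https://github.com/owner/repo(.git)".
_ORIGIN_RE = re.compile(
    r"^(?:git@github\.com:|https://github\.com/)"
    r"(?P<project>[^/]+/[^/]+?)(?:\.git)?/?$"
)


def project_from_origin_url(url):
    """Return "owner/repo" for either supported URL form."""
    match = _ORIGIN_RE.match(url)
    if match is None:
        raise ValueError(f"unsupported origin URL: {url}")
    return match.group("project")


ssh_form = project_from_origin_url("git@github.com:scylladb/scylla.git")
https_form = project_from_origin_url("https://github.com/scylladb/scylla")
```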

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211001082116.7214-1-xemul@scylladb.com>
2021-10-01 13:22:06 +02:00
Piotr Sarna
ec52e05eab tracing: unify prepared statement info into a single struct
The tracing code assumes that query_option_names and query_option_values
vectors always have the same length as the prepared_statements vector,
but it's not true. E.g. if one of the statements in a batch is
incorrect, it will create a discrepancy between the number of prepared
statements and the number of bound names and values, which currently
leads to a segmentation fault.
To overcome the problem, all three vectors are integrated into a single
vector, which makes size mismatches impossible.
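
The difference between the two layouts can be sketched as follows. This
is only a model of the idea in Python; `StatementInfo` and the two
`render_*` helpers are hypothetical names, not the tracing code's types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatementInfo:
    """One entry that carries a statement together with its bound data."""
    query: str
    option_names: List[str] = field(default_factory=list)
    option_values: List[str] = field(default_factory=list)


def render_parallel(queries, names, values):
    """Old shape: three lists that are assumed to have equal length."""
    return [(queries[i], names[i], values[i]) for i in range(len(queries))]


def render_unified(infos):
    """New shape: one list, so a length mismatch cannot be expressed."""
    return [(s.query, s.option_names, s.option_values) for s in infos]


# Two statements, but bound names and values for only one of them.
try:
    render_parallel(["q1", "q2"], [["a"]], [["1"]])
    mismatch_detected = False
except IndexError:
    mismatch_detected = True

unified = render_unified([StatementInfo("q1", ["a"], ["1"]), StatementInfo("q2")])
```

In Python the mismatch surfaces as an `IndexError`; the out-of-bounds
access in the original code is what led to the segmentation fault.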

Tested manually with code that triggers a failure while executing
a batch statement, because the Python driver performs driver-side
validation and thus it's hard to create a test case which triggers
the problem.

closes: #9221
2021-10-01 10:57:38 +03:00
Eliran Sinvani
c38ceafdcf Service Level Controller: Add an extention point to the API (#9374)
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other than the service level options) to the interface in a
structure. This will allow easy expansion in the future if more
parameters need to be sent to the observer.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-01 10:20:28 +03:00
Raphael S. Carvalho
9067a13eac compaction: split compaction info and data for control
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:57 -03:00
Raphael S. Carvalho
87ce0c5d43 compaction_manager: use task when stopping a given compaction type
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:52 -03:00
Raphael S. Carvalho
cbd78be2dd compaction: remove start_size and end_size from compaction_info
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:45 -03:00
Raphael S. Carvalho
18f703e94b compaction_manager: introduce helpers for task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:41 -03:00
Raphael S. Carvalho
d4572a1bb5 compaction_manager: introduce explicit ctor for task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:37 -03:00
Raphael S. Carvalho
38df9c68f8 compaction: kill sstables field in compaction_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:33 -03:00
Raphael S. Carvalho
90cfe895d4 compaction: kill table pointer in compaction_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:29 -03:00
Raphael S. Carvalho
4ce745e0b6 compaction: simplify procedure to stop ongoing compactions
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refers to actual ongoing compaction tasks,
whereas _tasks refers to manager tasks, which are responsible for
spawning new compactions, retrying them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.

So stopping a task becomes simpler, and duplication is naturally
gone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:21 -03:00
Raphael S. Carvalho
efed06e2e4 compaction: move management of compaction_info to compaction_manager
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.

This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.

From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.

This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:15:00 -03:00
Raphael S. Carvalho
1f5b17fdc5 compaction: move output run id from compaction_info into task
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:13:20 -03:00
Raphael S. Carvalho
52302c3238 compaction_manager: prevent unbounded growth of pending tasks
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
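
The deduplication idea can be sketched as below. This is a simplified
model, not the compaction manager's code: tables are identified by
strings and the `PendingTasks` class is hypothetical.

```python
from collections import deque


class PendingTasks:
    """Pending list that holds at most one entry per table."""

    def __init__(self):
        self._queue = deque()
        self._queued_tables = set()

    def submit(self, table):
        """Queue a task for `table` unless one is already pending."""
        if table in self._queued_tables:
            return False
        self._queued_tables.add(table)
        self._queue.append(table)
        return True

    def retire(self):
        """Take the oldest pending task off the list."""
        table = self._queue.popleft()
        self._queued_tables.discard(table)
        return table

    def __len__(self):
        return len(self._queue)


pending = PendingTasks()
accepted = sum(pending.submit("T") for _ in range(1000))
```

However often table `T` is submitted, the pending list is bounded by the
number of distinct tables.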

Refs #9331.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
2021-09-30 16:49:52 +03:00
Pavel Emelyanov
037135316e api: Use local sharded<cdc::generation_service> reference
And remove the getter from storage_service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
5d8e05e7ae main: Push cdc::generation_service via API
This is not to mess with storage service in this API call. Next
patch will make use of the passed reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
f669fbd230 storage_service: Ditch for_testing boolean
Nowadays it purely controls whether or not to inject delays into
timestamps generation by cdc. The same effect can be achieved by
configuring the cdc::generation_service directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
db623c5f64 cdc: Replace db::config with generation_service::config
This is to push the service towards general idea that each
component should have its own config and db::config to stay
in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
b879d3f3a5 cdc: Drop db::config from description_generator
It only needs one for murmur3_partitioner_ignore_msb_bits value,
provide it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
2e7364b94f cdc: Remove all arguments from maybe_rewrite_streams_descriptions
All of them are references taken from 'this', since the function is
the generation_service method it can use 'this' directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 16:04:12 +03:00
Pavel Emelyanov
6fe31d8eac cdc: Move maybe_rewrite_streams_descriptions into after_join
The generation service already has all it needs to do it. This
keeps storage_service smaller and less aware about cdc internals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 15:34:03 +03:00
Pavel Emelyanov
3b51c5c96a cdc: Squash two methods into one
The recently introduced make_new_generation() method just calls
another one by passing more this->... stuff as arguments. Relax
the flow by teaching the latter to use 'this' directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 15:34:03 +03:00
Pavel Emelyanov
7a7a87f24a cdc: Turn make_new_cdc_generation a service method
It has everything needed onboard. Only two arguments are required -- the
bootstrap tokens and whether or not to inject a delay.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 15:34:03 +03:00
Pavel Emelyanov
b867a19da1 cdc: Remove ring-delay arg from make_new_cdc_generation
It already has the db::config from where to get one (and even
this will change soon).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 15:34:03 +03:00
Pavel Emelyanov
5e2a049266 cdc: Keep database reference on generation_service
The service effectively depends on it when it rewrites streams
descriptions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 15:34:03 +03:00
Piotr Sarna
e2fe8559ca configure: temporarily disable wasm support for aarch64
There seems to be a problem with libwasmtime.a dependency on aarch64,
causing occasional segfaults during tests - specifically, tests
which exercise the path for halting wasm execution due to fuel
exhaustion. As a temporary measure, wasm is disabled on this
architecture to unblock the flow.

Refs #9387

Closes #9414
2021-09-30 14:57:04 +03:00
Pavel Emelyanov
bbcf671276 config: Remove unused replacing options
The --replace-token and --replace-node options were added some time ago, but
have never been used since then, just parsed and immediately aborted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210930102222.16294-1-xemul@scylladb.com>
2021-09-30 14:56:04 +03:00
Kamil Braun
5b011b1c2f clustering_key_filter: clustering_key_filter_ranges owning constructor
2021-09-30 12:10:52 +02:00
Kamil Braun
43dac07253 flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
2021-09-30 12:10:52 +02:00
Kamil Braun
1777d5de46 clustering_key_filter: document clustering_key_filter_ranges::get_ranges
2021-09-30 12:10:52 +02:00
Piotr Jastrzebski
79de151158 cache_tracker: remove unused parameter from on_remove
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f66ad391d86963b43b2a01e957887ea597e591e8.1632992165.git.piotr@scylladb.com>
2021-09-30 13:03:13 +03:00
Avi Kivity
83237894b7 Merge "Keep local_host_id in local_cache" from Pavel E
"
Most of the code gets local_host_id by querying system keyspace.
There's one place (counters) that want future-less getter and
that caches host id on database for that (it used to be cached on
storage_service some time ago).

This set relocates the value on local cache and frees the starting
code from the need to mess with database for setting it. Also
this cuts hints->qctx hidden dependency.

tests: unit(dev)
"

* 'br-host-id-in-local-cache' of https://github.com/xemul/scylla:
  storage_proxy: Use future-less local_host_id getting
  database: Get local host id from system_keyspace
  system_keyspace: Keep local_host_id on local_cache
  code: Rename get_local_host_id() into load_...()
  system_keyspace: Coroutinize get_/set_local_host_id
2021-09-30 12:38:52 +03:00
Pavel Emelyanov
8a93a6de78 storage_proxy: Use future-less local_host_id getting
The methods in question are called from the API handlers which
are registered after start (and after host id load and cache),
so they can safely be switched.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 10:55:20 +03:00
Pavel Emelyanov
e9002e1e61 database: Get local host id from system_keyspace
It's now cached on database itself, and it can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 10:55:20 +03:00
Pavel Emelyanov
9f5fd8b5c0 system_keyspace: Keep local_host_id on local_cache
Some places in the code want to have future-less access to the
host id, now they do it all by themselves. Local cache seems to
be a better place (for the record -- some time ago the "better
place" argument justified cached host id relocation from the
storage_service onto the database).

While at it -- add the future-less getter for the host_id to be
used further.
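
The split between an asynchronous loader and a future-less getter can
be sketched roughly as follows. The model is in Python with `asyncio`
standing in for futures; the class name is hypothetical, and the method
names only mirror the `load_`/`get_` naming used in this series.

```python
import asyncio
import uuid


class SystemKeyspaceModel:
    """Toy model: the host id is loaded once and then served from a cache."""

    def __init__(self, stored_host_id):
        self._stored_host_id = stored_host_id  # stands in for the system table
        self._cached_host_id = None

    async def load_local_host_id(self):
        await asyncio.sleep(0)  # stands in for the asynchronous query
        self._cached_host_id = self._stored_host_id
        return self._cached_host_id

    def get_local_host_id(self):
        """Future-less getter; only valid after the id has been loaded."""
        if self._cached_host_id is None:
            raise RuntimeError("host id requested before it was loaded")
        return self._cached_host_id


host_id = uuid.UUID(int=1)
ks = SystemKeyspaceModel(host_id)
loaded = asyncio.run(ks.load_local_host_id())
```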

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 10:54:38 +03:00
Nadav Har'El
1edcc3a218 test/alternator: add test for reverse queries
This patch adds a reproducer for issue #7586 - that Alternator queries
(Query) operating in reverse order (ScanIndexForward = false) are
artificially limited to 100 MB partitions because of their memory use.

This test generates a partition over 100 MB in size and then tries various
reverse queries on it - with or without Limit, starting at the end or
the middle of the partition. The test currently fails when a reverse query
refuses to operate on such a large partition - the log reports this:

  ERROR ... Memory usage of reversed read exceeds hard limit of 104857600
  (configured via max_memory_for_unlimited_query_hard_limit), while reading
  partition K1H6ON3A1C

With yet-uncommitted reverse-scan improvements, the test proceeds further,
but still fails where we test that a reverse query with Limit not
explicitly specified should still be limited to a certain size (e.g. 1MB)
and cannot return the entire 100 MB partition in one response.

Please note that this is not a comprehensive test for Scylla's reverse
scan implementation: In particular we do not have separate tests for
reverse scan's implementation on different sources - memtables, sstables,
or the cache. Nor do we check all sorts of edge cases. We assume that
Scylla's reverse scan implementation will have its own unit tests
elsewhere that will check these things - and this test can focus on the
Alternator use case.

This test is marked "xfail" because it still fails on Alternator. It is
marked "veryslow" because it's a (relatively) slow test, taking multiple
seconds to set up the 100 MB partition. So run the test with the
pytest options "--runxfail --runveryslow" to see how it fails.

Refs #7586

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210930063700.407511-1-nyh@scylladb.com>
2021-09-30 09:34:39 +02:00
Pavel Emelyanov
beb345c00a code: Rename get_local_host_id() into load_...()
A future-less method will appear which better deserves the get_ prefix,
so give the existing method the load_ one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 10:33:57 +03:00
Pavel Emelyanov
e49dc4ed0d system_keyspace: Coroutinize get_/set_local_host_id
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-30 10:33:57 +03:00
Pavel Emelyanov
e6b920017a main: Replace cql_config_updater with updateable_value
The cql_config_updater is a sharded<> service that exists in main and
whose goal is to make sure some db::config's values are propagated into
cql_config. There's a more handy updateable_value<> glue for that.

tests: unit(dev)
refs: #2795

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210927090402.25980-1-xemul@scylladb.com>
2021-09-30 07:23:43 +03:00
Pavel Solodovnikov
88f9f2e9d0 idl: support generating boilerplate code for RPC verbs
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:

        verb [[attr1, attr2...]] my_verb (args...) -> return_type;

`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.

For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.

These are the methods being created for each RPC verb:

        static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
        static future<> unregister_my_verb(netw::messaging_service* ms);
        static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);

Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.

There is also a method to unregister all verbs at once:

        static future<> unregister(netw::messaging_service* ms);

The following attributes are supported when declaring an RPC verb
in the IDL:
* [[with_client_info]] - the handler will contain a const reference to
  an `rpc::client_info` as the first argument.
* [[with_timeout]] - an additional `time_point` parameter is supplied
  to the handler function and `send*` method uses `send_message_*_timeout`
  variant of internal function to actually send the message.
* [[one_way]] - the handler function is annotated by
  `future<rpc::no_wait_type>` return type to designate that a client
  doesn't need to wait for an answer.

The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.

No existing code is affected.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-30 02:21:57 +03:00
Michał Radwański
b68a6c63e9 flat_mutation_reader: remove unused reserve_one method
Closes #9410
2021-09-29 17:22:29 +02:00
Nadav Har'El
43b3c1b75d CODEOWNERS: some fixes and additions
Fixed some errors in .github/CODEOWNERS (which is used by Github to
recommend who should review which pull request), and also add a few
additional ownerships I thought of.

This file could still use more work - if you can think of specific
files or directories you'd like to review changes in, please send a
patch for this file to add yourself to the appropriate paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929141118.378930-1-nyh@scylladb.com>
2021-09-29 18:07:07 +03:00
Botond Dénes
970fe9a339 mutation_writer: partition_based_splitting_writer: limit number of max buckets
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reaches this limit, the largest bucket is closed and replaced by
a new bucket. This will end up creating more output sstables during scrub
overall, but now they won't all be written at the same time causing
insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit the TWCS's
timestamp based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
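
The capping behaviour described above can be sketched with a toy Python
model. This is not the actual C++ writer: the class name `SplittingWriter`,
the use of plain integers as partition keys, and the rule that a key may
only extend a bucket whose last key is smaller are assumptions made for
illustration.

```python
class SplittingWriter:
    """Toy model of a splitting writer with a cap on live buckets."""

    def __init__(self, max_buckets):
        self.max_buckets = max_buckets
        self.live = []    # each live bucket is an increasing list of keys
        self.closed = []  # closed buckets; each would become one output sstable

    def write(self, key):
        # Reuse the first live bucket that this key can extend in order.
        for bucket in self.live:
            if bucket[-1] < key:
                bucket.append(key)
                return
        # No bucket fits. If we are at the cap, close the largest bucket
        # so that the new bucket replaces it.
        if len(self.live) >= self.max_buckets:
            largest = max(range(len(self.live)), key=lambda i: len(self.live[i]))
            self.closed.append(self.live.pop(largest))
        self.live.append([key])

    def close(self):
        # Flush whatever is still live and return all output buckets.
        self.closed.extend(self.live)
        self.live = []
        return self.closed
```

Feeding fully reversed input (the worst case for this model) with
`max_buckets=2` still never keeps more than two buckets open at once; the
cost is a larger number of closed output buckets.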

Fixes: #9400

Tests: unit(dev)

Closes #9401
2021-09-29 16:31:29 +03:00
Avi Kivity
b3c95a1fc6 commitlog: reduce inclusions of commitlog.hh due to db::commitlog::force_sync (#9379)
There are now 231 translation units that indirectly include commitlog.hh
due to the need to have access to db::commitlog::force_sync.

Move that type to a new file commitlog_types.hh and make it available
without access to the commitlog class.

This reduces the number of translation units that depend on commitlog.hh
to 84, improving compile time.
2021-09-29 16:13:44 +03:00
Nadav Har'El
5cbe9178fd alternator: add missing BatchGetItem metric
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.

In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.

Fixes #9406

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
2021-09-29 14:16:54 +02:00
Tomasz Grabiec
11a3b411c5 Merge 'mutation_source_test: test reverse reads' from Botond Dénes
Currently no mutation-source supports reading in reverse natively but
we are working on changing that, adding native reverse read support to
memtable, cache and sstable readers. To ensure that all mutation
sources work in a correct and uniform manner when reading in reverse,
we add a reverse test to the mutation source test suite. This test
reverses the data that it passes to `populate()`, then reads in
forward order (in reverse compared to the data order). For this we use
the currently established reverse read API: reverse schema (schema
order == query order) and half-reversed (legacy) slice.  All mutation
sources are prepared to work with reversed reads, using the
`make_reversing_reader()` adapter. As we progress with our native
reverse support, we will replace these adapters with native reversing
support. As part of this, we push down the reversing reader adapter
currently existing on the `query::consume_page()` level, to the
individual mutation sources.

Closes #9384

* github.com:scylladb/scylla:
  test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set
  querier: consume_page(): remove now unused max_size parameter
  test/lib: mutation_source_test: test reading in reverse
  test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
  test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
  test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
  treewide: move reversing to the mutation sources
  mutation_query: reconcilable_result_builder: document reverse query preconditions
  sstable_set: time_series_sstable_set: reverse mode
  mutlishard_mutation_query: set max result size on used permits
  db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
  flat_mutation_reader: make_reversing_reader(): add convenience stored slice
  mutation_reader: evictable_reader: add reverse read support
  flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
  flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
  flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions
  query-request: introduce `half_reverse_slice`
  flat_mutation_reader_assertions: log what's expected
2021-09-29 12:57:57 +02:00
Avi Kivity
d4aa6c2746 Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that the tracker loses track (pun intended) of changes applied to the set.

The first patch will fix the leak when strategy is changed, and the third
patch will make sure that tracker is updated atomically with sstable set,
so these kinds of problems will not happen anymore.

Fixes #9157
"

* 'fixes_to_backlog_tracker_v4' of github.com:raphaelsc/scylla:
  compaction: Update backlog tracker correctly when schema is updated
  compaction: Don't leak backlog of input sstable when compaction strategy is changed
  compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
  compaction: simplify removal of monitors
2021-09-29 13:55:37 +03:00
Kamil Braun
075a894a89 test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set 2021-09-29 12:15:48 +03:00
Botond Dénes
42b677ef6f querier: consume_page(): remove now unused max_size parameter 2021-09-29 12:15:48 +03:00
Botond Dénes
bc49c27a06 test/lib: mutation_source_test: test reading in reverse
To ensure all mutation sources uniformly support the current API of
reverse reading: reversed schema and half-reversed slice. This test will
also ensure that once we switch to native-reverse slice, all
mutation-sources will keep on working.
2021-09-29 12:15:48 +03:00
Kamil Braun
7d5273b044 test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
For reversed reads we must adjust the lower/upper bounds used by the
`position_reader_queue` and `clustering_combined_reader`. The bounds are
calculated using the mutation schema, but we need bounds calculated
using the query schema which is reversed.
2021-09-29 12:15:48 +03:00
Botond Dénes
9399f379ec test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
The mutation source test suite will soon test reads in reverse. Prepare
for this by checking the reversed flag on the slice and not reversing
the data when set. The test will have two modes effectively:
* Forward mode: data is reversed before read, then reversed again during
  read.
* Reverse mode: data is already reversed and it is reversed back during
  read.
2021-09-29 12:15:48 +03:00
Botond Dénes
c048d854d9 test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
The two might not be the same in case the schema was upgraded or if we
are reading in reverse. It is important to use the passed-in query
schema consistently during a read.
2021-09-29 12:15:48 +03:00
Botond Dénes
41facb3270 treewide: move reversing to the mutation sources
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused but is not
removed in this patch, it will be removed in a follow-up to reduce
churn.
2021-09-29 12:15:45 +03:00
Nadav Har'El
88177d7be7 test/alternator: add test for too many items in BatchWriteItem
DynamoDB limits the number of items that a BatchWriteItem call can write
to 25. As noted in issue #5057, in Alternator we don't have this limit
or any limit on the number of items in a BatchWriteItem - which probably
isn't wise.

This patch adds a simple xfailing test for this.

Refs #5057

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912140736.76995-1-nyh@scylladb.com>
2021-09-29 10:48:58 +02:00
Nadav Har'El
a1bab2c4c9 Merge 'cql3: improve expression ergonomics' from Avi Kivity
The `expression` type (an std::variant) suffers from bad ergonomics:
 - std::variant has poor/no constraints, so compiler error messages are long and uninformative
 - it cannot be forward-declared (since std::variant does not support incomplete types)
 - the type name is long, polluting compiler error messages and debug symbols
 - it requires an artificial `nested_expression` when one expression is nested inside another

This series fixes those drawbacks by wrapping the variant in a class, adding constraints, and adding an extra indirection.

Test: unit (dev)

Closes #9402

* github.com:scylladb/scylla:
  cql3: expr: drop nested_expression
  cql3: expr: make expression forward declarable, easier to use
  cql3: expr: construct column_value explicitly
  cql3: expr: introduce as/as_if/is
  cql3: expr: introduce expr::visit, replacing std::visit
2021-09-29 10:47:39 +03:00
Takuya ASADA
cd7fe9a998 scylla_cpuscaling_setup: disable ondemand.service on Ubuntu
On Ubuntu, scaling_governor becomes powersave after a reboot, even if we configured cpufrequtils.
This is because ondemand.service unconditionally changes scaling_governor to ondemand or powersave.
cpufrequtils starts before ondemand.service, so its scaling_governor setting is overwritten by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.

Fixes #9324

Closes #9325
2021-09-29 10:32:34 +03:00
Avi Kivity
c72906a2ee cql3: expr: drop nested_expression
Now that expression can be nested in its component types
directly, we can remove nested_expression. Most of the patch
adjusts uses to drop the dereference that was needed for
nested_expression.
2021-09-28 23:49:21 +03:00
Avi Kivity
448c06f150 cql3: expr: make expression forward declarable, easier to use
Make expression a class, holding a unique_ptr to a variant,
instead of just a variant.

This has some advantages:
 - the constructor can be properly constrained
 - the type can be forward-declared
 - the type name is just "expression", rather than
   a huge variant. This makes compiler error messages easier
   to read.
 - the internal indirection allows removal of nested_expression
   (later in the series)
2021-09-28 23:49:21 +03:00
Avi Kivity
d43e72a747 cql3: expr: construct column_value explicitly
We have a few cases where a column_definition* is converted
directly to an expression without an explicit call to column_value{}.

The new expression implementation will not allow this, so make
these cases explicit. IMO this is better form than to rely
on the compiler picking the right expression subtype.
2021-09-28 23:49:21 +03:00
Avi Kivity
be44b579a1 cql3: expr: introduce as/as_if/is
Simple wrappers for std::get, std::get_if, std::holds_alternative.

The new names are shorter and IMO more readable.

Call sites are updated.

We will later replace the implementation.
2021-09-28 23:49:11 +03:00
Avi Kivity
e7db3def4f cql3: expr: introduce expr::visit, replacing std::visit
The new expr::visit() is just a wrapper around std::visit(),
but has better constraints. A call to expr::visit() with a
visitor that misses an overload will produce an error message
that points at the missing type. This is done using the new
invocable_on_expression concept. Note it lists the expression
types one by one rather than using template magic, since
otherwise we won't get the nice messages.

Later, we will change the implementation when expression becomes
our own type rather than std::variant.

Call sites are updated.
2021-09-28 23:48:42 +03:00
Botond Dénes
c7619de929 mutation_query: reconcilable_result_builder: document reverse query preconditions 2021-09-28 17:03:57 +03:00
Kamil Braun
7dc4ee35c9 sstable_set: time_series_sstable_set: reverse mode
`time_series_sstable_set` uses `clustering_combined_reader` to implement
efficient single-partition reads. It provides a `position_reader_queue`
to the reader. This queue returns readers to the sstables from the set
in order of the sstables' lower bounds, and with each reader it provides
an upper bound for the positions-in-partition returned by the reader.

Until now we would assume non-reversed queries only. Reversed queries
were implemented by performing forward query in the lower layers
and reversing the results at the upper-most layer of the reader stack.
Before pushing the reversing down to the sources (in particular,
to sstable readers), we need to support the reverse mode in
`time_series_sstable_set` and the queue it provides to
`clustering_combined_reader`.

This requires using different lower and upper bounds in the queue.
For non-reversed reads we used `sstable::min_position()` as the lower
bound and `sstable::max_position()` as the upper bound. For reversed
reads all comparisons performed by `clustering_combined_reader` will be
reversed, as it will use a reversed schema. We can then use
`sstable::max_position().reversed()` for the lower bound and
`sstable::min_position().reversed()` for the upper bound.
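
A small Python sketch of the bound swap described above. Positions are
modelled as integers and `reversed_position()` as negation, which is only a
stand-in for `position_in_partition::reversed()` under a reversed schema;
the `SSTable` dataclass and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    min_position: int
    max_position: int


def reversed_position(pos):
    # Toy stand-in: negation turns forward order into reversed order.
    return -pos


def queue_bounds(sst, reverse):
    """Return (lower, upper) bounds as the queue would hand them to the reader."""
    if reverse:
        return (reversed_position(sst.max_position),
                reversed_position(sst.min_position))
    return (sst.min_position, sst.max_position)


def reader_order(sstables, reverse):
    """Names of sstables in the order the queue would return their readers."""
    ordered = sorted(sstables, key=lambda s: queue_bounds(s, reverse)[0])
    return [s.name for s in ordered]
```

In this model the lower bound never exceeds the upper bound in either mode,
and in reverse mode the sstable with the largest maximum position is
returned first.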
2021-09-28 17:03:57 +03:00
Botond Dénes
22e216563a mutlishard_mutation_query: set max result size on used permits
08042c1688 added the query max result size
to the permit but only set it for single partition queries. This patch
does the same for range-scans in preparation of `query::consume_page()`
not propagating max size soon.
2021-09-28 17:03:57 +03:00
Botond Dénes
dec282e050 db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
The two might not be the same in case the schema was upgraded (unlikely
for virtual tables) or if we are reading in reverse. It is important to
use the passed-in query schema consistently during a read.
2021-09-28 17:03:57 +03:00
Botond Dénes
f5ef88c0c5 flat_mutation_reader: make_reversing_reader(): add convenience stored slice
This serves as a convenience slice storage for reads that have to
store an edited slice somewhere. This is common for reads that work
with a native-reversed slice and so have to convert the one used in the
query -- which is in half-reversed format.
2021-09-28 17:03:57 +03:00
Botond Dénes
2bd295ee80 mutation_reader: evictable_reader: add reverse read support
Evictable reader has to be made aware of reverse reads as it checks/edits the
slice. This shouldn't require reverse awareness normally, it is only
required because we still use the half-reversed (legacy) slice format
for reversed reads. Once we switch to the native format this commit can
be reverted.
2021-09-28 17:03:57 +03:00
Botond Dénes
eeebe4ab63 flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
Implemented with the `make_reversing_reader()` adaptor.
2021-09-28 17:03:57 +03:00
Botond Dénes
cc222e5332 flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
Implemented with the `make_reversing_reader()` adaptor.
2021-09-28 17:03:57 +03:00
Botond Dénes
1a2bdba25f flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions 2021-09-28 17:03:57 +03:00
Kamil Braun
4bd601c6fd query-request: introduce half_reverse_slice
A utility function for converting between forward and half-reversed (or
'legacy'-reversed) slices to be used in the next commit.
2021-09-28 17:03:57 +03:00
Kamil Braun
270093b251 flat_mutation_reader_assertions: log what's expected 2021-09-28 17:03:57 +03:00
Tomasz Grabiec
c4328ffc4d tests: mutation_test: Add test for position_in_partition::reversed()
Message-Id: <20210927154942.44236-1-tgrabiec@scylladb.com>
2021-09-28 13:09:39 +02:00
Tomasz Grabiec
6bf873b663 Merge "raft: misc documentation edits" from Kostja
* scylla-dev/raft-misc-v4-docedit:
  raft: document pre-voting and protection against disruptive leaders
  raft: style edits of README.md.
  raft: document snapshot API
2021-09-28 12:12:46 +02:00
Konstantin Osipov
0adff23c21 raft: document pre-voting and protection against disruptive leaders 2021-09-27 22:04:18 +03:00
Konstantin Osipov
0e63e99b5a raft: style edits of README.md. 2021-09-27 22:04:04 +03:00
Konstantin Osipov
de2beac6ca raft: document snapshot API 2021-09-27 22:03:38 +03:00
Raphael S. Carvalho
9718173598 compaction: Update backlog tracker correctly when schema is updated
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contain A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN

This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
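
The failure sequence and the coupled update can be illustrated with a toy
Python model. The tracker here only keeps a running total of sizes, which is
a simplification of the real backlog computation; the class and method names
are hypothetical.

```python
class BacklogTracker:
    """Toy tracker: a running total of the sizes of tracked sstables."""

    def __init__(self, sstables):
        self.total = sum(sstables.values())

    def add(self, size):
        self.total += size

    def remove(self, size):
        self.total -= size


def decoupled_update():
    """Steps 1-4 above, with the set and the tracker updated independently."""
    sstables = {"A": 100}
    tracker = BacklogTracker(sstables)  # 1) set and tracker both contain A
    del sstables["A"]                   # 2) compaction replaces A by B
    sstables["B"] = 60
    tracker = BacklogTracker(sstables)  # 3) schema update: rebuilt without A
    tracker.remove(100)                 # 4) compaction still removes A
    return tracker.total


class Table:
    """The table owns both structures and updates them together."""

    def __init__(self):
        self.sstables = {}  # name -> size
        self.tracker = BacklogTracker(self.sstables)

    def replace_sstables(self, removed, added):
        for name in removed:
            self.tracker.remove(self.sstables.pop(name))
        for name, size in added.items():
            self.sstables[name] = size
            self.tracker.add(size)

    def on_schema_change(self):
        # Rebuilding from the current set is safe: nobody else touches the tracker.
        self.tracker = BacklogTracker(self.sstables)
```

In the decoupled sequence the total ends up negative, while the `Table`
variant keeps the tracker total equal to the sum of the sstable set across
the same events.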

Fixes #9157

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-27 14:15:29 -03:00
Raphael S. Carvalho
afd45b9f49 compaction: Don't leak backlog of input sstable when compaction strategy is changed
The generic backlog formula is: ALL + PARTIAL - COMPACTING

With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.

But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.

With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.

By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.

Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
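
As a rough worked example of the duplicate backlog, the sketch below treats
each term of the formula as a plain byte count, which is a simplification.
The helper `all_after_compaction()` is hypothetical and only models whether
the input sstable was removed from the tracker.

```python
def backlog(all_bytes, partial_bytes, compacting_bytes):
    """Generic formula from above: ALL + PARTIAL - COMPACTING."""
    return all_bytes + partial_bytes - compacting_bytes


def all_after_compaction(input_bytes, output_bytes, input_removed):
    """ALL term once an ongoing compaction has finished and added its output."""
    remaining_input = 0 if input_removed else input_bytes
    return remaining_input + output_bytes


# Input sstable of 100 bytes compacted into an output of 90 bytes.
leaked = backlog(all_after_compaction(100, 90, input_removed=False), 0, 0)
expected = backlog(all_after_compaction(100, 90, input_removed=True), 0, 0)
```

Here `leaked - expected` equals the size of the input sstable, i.e. the
amount of backlog that stays in the tracker when the input is never removed.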

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-27 14:03:28 -03:00
Raphael S. Carvalho
05126cfe29 compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
This new function makes it easier to remove monitor of exhausted
sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-27 14:01:40 -03:00
Raphael S. Carvalho
35050a8217 compaction: simplify removal of monitors
By switching to unordered_map, removal of generated monitors is
made easier. This is a preparatory change for the patch which will
remove the monitors of all exhausted sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-27 13:59:30 -03:00
Tomasz Grabiec
2b3ae6aca4 position_in_partition: Introduce reversed() transformation
It transforms the position from a forward-clustering-order schema
domain to a reversed-clustering-order schema domain.

The object still refers to the same element of the space of keys under this
transformation. However, the identification of the position,
the position_in_partition object, is schema-dependent, it is always
interpreted relative to some schema. Hence the need to transform it
when switching schema domains.
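
A toy Python model of the idea, under stated assumptions: a position is a
(key, weight) pair where the weight says whether the position is just
before, at, or just after the key, and the transformation flips the weight.
This representation is an assumption for illustration and is not taken from
the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    key: int
    weight: int  # -1: just before key, 0: at key, +1: just after key

    def reversed(self):
        # Same point in the key space, expressed for the reversed schema.
        return Position(self.key, -self.weight)


def forward_order(pos):
    """Sort key under the forward-clustering-order schema."""
    return (pos.key, pos.weight)


def reversed_order(pos):
    """Sort key under the reversed-clustering-order schema (keys descend)."""
    return (-pos.key, pos.weight)
```

Sorting the transformed positions with the reversed-schema comparator yields
exactly the forward order read backwards, which is the property one would
expect if each object still refers to the same element of the key space.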

Message-Id: <20210917102612.308149-1-tgrabiec@scylladb.com>
2021-09-27 14:23:09 +03:00
Gleb Natapov
78774a485a raft: drop local snapshot if it cannot be installed
If a locally taken snapshot cannot be installed because a newer one was
received in the meantime, it should be dropped; otherwise it will take
up space needlessly.

Message-Id: <YUrWXxVfBjEio1Ol@scylladb.com>
2021-09-27 13:03:23 +02:00
Asias He
1657e7be14 gossiper: Send generation number with shutdown message
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes to handle the shutdown message
- n1 will incorrectly mark n2 as shut down until n2 restarts again

To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.

Since we use the rpc::optional to send the generation number, it works
in a mixed cluster.
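
The check can be sketched in Python as follows. The `Gossiper` class, its
method names and the status strings are hypothetical; treating a missing
generation (an older sender) as "accept the message" is an assumption based
on the mixed-cluster remark above.

```python
class Gossiper:
    """Toy model of the receiving side (n1 in the scenario above)."""

    def __init__(self):
        self.generations = {}  # node -> last known gossip generation
        self.status = {}       # node -> "NORMAL" or "SHUTDOWN"

    def on_gossip_state(self, node, generation):
        # Learn a (possibly new) generation for the remote node.
        self.generations[node] = generation
        self.status[node] = "NORMAL"

    def on_shutdown_message(self, node, generation=None):
        """Return True if the shutdown was applied, False if it was ignored.

        generation is None when the sender does not send the field.
        """
        if generation is not None and generation != self.generations.get(node):
            return False  # stale message from a previous incarnation
        self.status[node] = "SHUTDOWN"
        return True
```

Replaying the scenario: n1 learns generation 1, the shutdown message
carrying generation 1 is delayed, n2 restarts with generation 2, and the
late message is then ignored instead of marking n2 as shut down.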

Fixes #8597

Closes #9381
2021-09-27 11:08:43 +03:00
Avi Kivity
d7ac699a55 Revert "Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael"
This reverts commit b5cf0b4489, reversing
changes made to e8493e20cb. It causes
segmentation faults when sstable readers are closed.

Fixes #9388.
2021-09-26 18:31:49 +03:00
Avi Kivity
bf94c06fc7 Revert "Merge "simplifications and layer violation fix for compaction manager" from Raphael"
This reverts commit 7127c92acc, reversing
changes made to 88480ac504. We need to
revert b5cf0b4489 to fix #9388, and this stands
in the way.

Ref #9388.
2021-09-26 18:30:36 +03:00
Piotr Sarna
06f724857f transport: remove unused map of stream_id->query states
The map is never touched, so it only occupies precious space
for each connection.

Closes #9383
2021-09-26 13:41:58 +03:00
Avi Kivity
936de92876 Merge 'cql3: Add evaluate(expression) and use instead of term::bind()' from Jan Ciołek
This PR adds the function:
```c++
constant evaluate(const expression&, const query_options&);
```
which evaluates the given expression to a constant value.
It binds all the bound values, calls functions, and reduces the whole expression to just raw bytes and `data_type`, just like `bind()` and `get()` did for `term`.

The code is often similar to the original `bind()` implementation in `lists.cc`, `sets.cc`, etc.

* For some reason in the original code, when a collection contains `unset_value`, then the whole collection is evaluated to `unset_value`. I'm not sure why this is the case, considering it's impossible to have `unset_value` inside a collection, because we forbid bind markers inside collections. For example here: cc8fc73761/cql3/lists.cc (L134)
This seems to have been introduced by Pekka Enberg in 50ec81ee67, but he has left the company.
I didn't change the behaviour, maybe there is a reason behind it, although maybe it would be better to just throw `invalid_request_exception`.
* There was a strange limitation on map key size, it seems incorrect: cc8fc73761/cql3/maps.cc (L150), but I left it in.
* When evaluating a `user_type` value, the old code tolerated `unset_value` in a field, but it was later converted to NULL. This means that `unset_value` doesn't work inside a `user_type`, I didn't change it, will do in another PR.
* We can't fully get rid of `bind()` yet, because it's used in `prepare_term` to return a `terminal`. It will be removed in the next PR, where we finally get rid of `term`.

Closes #9353

* github.com:scylladb/scylla:
  cql3: types: Optimize abstract_type::contains_collection
  cql3: expr: Convert evaluate_IN_list to use evaluate(expression)
  cql3: expr: Use only evaluate(expression) to evaluate term
  cql3: expr: Implement evaluate(expr::function_call)
  cql3: expr: Implement evaluate(expr::usertype_constructor)
  cql3: expr: Implement evaluate(expr::collection_constructor)
  cql3: expr: Implement evaluate(expr::tuple_constructor)
  cql3: expr: Implement evaluate(expr::bind_variable)
  cql3: Add contains_collection/set_or_map to abstract_type
  cql3: expr: Add evaluate(expression, query_options)
  cql3: Implement term::to_expression for function_call
  cql3: Implement term::to_expression for user_type
  cql3: Implement term::to_expression for collections
  cql3: Implement term::to_expression for tuples
  cql3: Implement term::to_expression for marker classes
  cql3: expr: Add data_type to *_constructor structs
  cql3: Add term::to_expression method
  cql3: Reorganize term and expression includes
2021-09-26 12:58:11 +03:00
Eliran Sinvani
0b2861d014 Prepare for inheriting from reader_concurrency_semaphore
Some future and enterprise features require us to inherit from
reader_concurrency_semaphore. This might require additional
"wrap up" operations to be done on stop, which serves as a barrier
for the semaphore. Here we simply make stop virtual so it can be
inherited and augmented.
This change has no significant impact on performance since stop
gets called only once in the lifetime of a semaphore.
The approach is to add two extension points to the
reader_concurrency_semaphore class, one just before the stop code is
executed and one just after.
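
The two extension points amount to a template-method pattern, sketched here
in Python. The hook names `_before_stop()` and `_after_stop()` and the
`events` list are hypothetical and are not taken from the patch.

```python
class ReaderConcurrencySemaphore:
    """Toy base class with two overridable hooks around stop()."""

    def __init__(self):
        self.events = []

    def stop(self):
        self._before_stop()
        self.events.append("stop")  # stands in for the existing stop code
        self._after_stop()

    def _before_stop(self):
        pass  # default: nothing extra to do

    def _after_stop(self):
        pass  # default: nothing extra to do


class WrappingSemaphore(ReaderConcurrencySemaphore):
    """Derived class adding its own wrap-up work around the base stop."""

    def _before_stop(self):
        self.events.append("before")

    def _after_stop(self):
        self.events.append("after")
```

Calling `stop()` on the derived class runs its hooks on either side of the
base stop code, while the base class on its own behaves as before.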

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #9373
2021-09-26 12:57:48 +03:00
Avi Kivity
2d352820f4 Update tools/java and tools/jmx submodules
* tools/java 9c5c0ad1fd...05ec511bbb (2):
  > reloc/build_reloc.sh: Add missing space
  > reloc: stop removing entire BUILDDIR

* tools/jmx 658818b...5c383b6 (1):
  > reloc: stop removing entire $BUILDDIR
2021-09-26 12:33:55 +03:00
Pavel Emelyanov
88e5b7c547 database: Shutdown in tests
There's a circular dependency:

  query processor needs database
  database owns large_data_handler and compaction_manager
  those two need qctx
  qctx owns a query_processor

Respectively, the latter hidden dependency is not "tracked" by
constructor arguments -- the query processor is started after
the database and is deferred to be stopped before it. This works
in scylla, because query processor doesn't really stop there,
but in cql_test_env it's problematic as it stops everything,
including the qctx.

Recent database start-stop sanitation revealed this problem --
on database stop either l.d.h. or compaction manager tries to
start (or continue) messing with the query processor. One problem
was faced immediately and plugged with the 75e1d7ea safety check
inside l.d.h., but still cql_test_env tests continue suffering
from use after free on stopped query processor.

The fix is to partially revert the 4b7846da by making the tests
stop some pieces of the database (including l.d.h. and compaction
manager) as they used to before. In scylla this is, probably, not
needed, at least now -- the database shutdown code was and still
is run right before the stopping one.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210924080248.11764-1-xemul@scylladb.com>
2021-09-26 11:09:01 +03:00
Benny Halevy
7498ac4869 dht: boot_strapper: bootstrap: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923144206.1690576-2-bhalevy@scylladb.com>
2021-09-26 11:09:01 +03:00
Benny Halevy
798aee6747 dht: boot_strapper: coroutinize bootstrap
Prepare for futurizing get_pending_address_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923144206.1690576-1-bhalevy@scylladb.com>
2021-09-26 11:09:01 +03:00
Kamil Braun
bf823e34a4 raft: disable sticky leadership rule
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".

More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.

Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but never mind.
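
The majority requirement for a server in a joint configuration can be
illustrated with a small sketch. It is not taken from the Raft
implementation; the names `has_majority` and `has_joint_quorum` and the
set-based representation are assumptions made for illustration only.

```python
def has_majority(votes, config):
    """True if `votes` contains a strict majority of the servers in `config`."""
    return len(votes & config) * 2 > len(config)


def has_joint_quorum(votes, c_old, c_new):
    """In a joint configuration a candidate needs a majority of both halves."""
    return has_majority(votes, c_old) and has_majority(votes, c_new)


c_old = {"A", "B", "C", "D"}
c_new = {"B", "C", "D"}

# A plus one server from C_new: 2 of 4 in C_old and 1 of 3 in C_new.
print(has_joint_quorum({"A", "B"}, c_old, c_new))
# A plus two servers from C_new: 3 of 4 in C_old and 2 of 3 in C_new.
print(has_joint_quorum({"A", "B", "C"}, c_old, c_new))
```

In this model A cannot win without a majority of C_new, which is why a
majority of C_new holding a longer log is enough to stop it.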

In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
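
A minimal sketch of the sticky leadership rule, under simplifying
assumptions: the function `grants_vote` and its parameters are
hypothetical, and the bookkeeping of which candidate a server already
voted for in a term is left out.

```python
def grants_vote(candidate_term, current_term, candidate_log_up_to_date,
                leader_seen_recently, sticky_leadership=True):
    """Decide whether a server grants its vote to a candidate."""
    if sticky_leadership and leader_seen_recently:
        # The rule described above: refuse while the known leader looks
        # alive, even if the candidate's term is higher.
        return False
    # Simplified vanilla rule: a higher term and a log that is at least
    # as up to date as ours.
    return candidate_term > current_term and candidate_log_up_to_date


# A server in C_new that still hears from D refuses A's higher term.
print(grants_vote(5, 3, True, leader_seen_recently=True))
# With the extension disabled the same request is granted.
print(grants_vote(5, 3, True, leader_seen_recently=True,
                  sticky_leadership=False))
```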

In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.

This pretty much invalidates the original argument, as seen on
the above example: A will still consider D alive, thus it won't become
a candidate.

Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.

Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.

In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply with a shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
2021-09-26 11:09:01 +03:00
Jan Ciolek
e9f24edc9b cql3: types: Optimize abstract_type::contains_collection
contains_collection() and contains_set_or_map() used to be calculated on each call().
Now the result is calculated only once during type creation.
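
The idea of computing the answer once at construction can be sketched as
follows. This is an illustration in Python, not the C++ code of
abstract_type; the class `TypeNode` and its fields are hypothetical.

```python
class TypeNode:
    """Toy type descriptor with nested element types."""

    def __init__(self, name, children=(), is_collection=False):
        self.name = name
        self.children = tuple(children)
        self.is_collection = is_collection
        # Computed once here instead of walking the children on every call.
        self._contains_collection = is_collection or any(
            child.contains_collection() for child in self.children
        )

    def contains_collection(self):
        return self._contains_collection


int_type = TypeNode("int")
list_of_int = TypeNode("list<int>", [int_type], is_collection=True)
tuple_type = TypeNode("tuple<int, list<int>>", [int_type, list_of_int])

print(tuple_type.contains_collection())
print(int_type.contains_collection())
```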

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 13:45:38 +02:00
Jan Ciolek
c672c0b42d cql3: expr: Convert evaluate_IN_list to use evaluate(expression)
evaluate_IN_list used term::bind(), but now it's possible
to make it use term::to_expression() and then evaluate(expression)

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
7ab14ca9c1 cql3: expr: Use only evaluate(expression) to evaluate term
Finally we don't need term::bind() to evaluate a term.
We can just convert the term to expression and call evaluate(expression).

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
ea02fd82bc cql3: expr: Implement evaluate(expr::function_call)
function_call can be evaluated now.
The code matches the one from functions::function_call::bind.

I needed to add cache id to function_call in order for it to work properly.
See the blurb in struct function_call for more information.

New code corresponds to bind() in cql3/functions/functions.cc.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
4a035b07d3 cql3: expr: Implement evaluate(expr::usertype_constructor)
usertype_constructor can now be evaluated.

To evaluate a usertype_constructor we need to know the type,
because the fields have to be in the correct order.
Type has been added to usertype_constructor.

New code corresponds to old bind() of user_types::delayed_value in cql3/user_types.cc.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
f7ee40aa01 cql3: expr: Implement evaluate(expr::collection_constructor)
collection_constructor can now be evaluated.
There is a bit of a problem, because we don't know the type of an empty collection_constructor,
but luckily empty collection constructors get converted to constants during preparation.

For some reason in the original code when a collection contains unset_value,
the whole collection is automatically evaluated to unset_value. I didn't change this behaviour.

New code corresponds to old bind() of lists::delayed_value in cql3/lists.cc, sets::delayed_value etc.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
0f20d301d8 cql3: expr: Implement evaluate(expr::tuple_constructor)
Tuple constructors can now be evaluated.
New code corresponds to old bind() of tuples::delayed_value::marker in cql3/tuples.cc

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
5589f348e7 cql3: expr: Implement evaluate(expr::bind_variable)
Implement evaluating a bind_variable.
To be able to evaluate a bind_variable we need to know the type of the bound value.
This is why a data_type has been added to the bind_variable struct.

There are some quirks when evaluating a bind_variable.
The first problem occurs when the variable has been sent with an older cql serialization format and contains collections.
In that case the value has to be reserialized to use the newest cql serialization format.

The second problem occurs when there is a set or a map in the value.
The set value sent by the driver might have its elements in the wrong order, contain duplicates, etc.
When a set or map is detected in the value it is reserialized as well.

collection_type_impl::reserialize doesn't work for this purpose, because it uses data_value which does not perform sorting or removal.
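
What "sorting and removal" means for a set can be shown with a tiny
sketch. It works on plain Python values rather than serialized CQL
elements, and the helper name `normalize_set_elements` is hypothetical;
the real code has to compare elements according to the element type.

```python
def normalize_set_elements(elements):
    """Return the elements sorted and with duplicates removed."""
    return sorted(set(elements))


# A driver may send set elements unordered and with repeats.
print(normalize_set_elements([3, 1, 2, 3, 1]))
```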

New code corresponds to old bind() of lists::marker in cql3/lists.cc, sets::marker etc.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
e621cbaa32 cql3: Add contains_collection/set_or_map to abstract_type
Sometimes we need to know whether some type contains
some collection, set, or map inside.
Introduce two functions that provide this information.

Information about collection is useful for reserializing
values with old serialization format.

Information about set/map is useful for reserializing
sets and maps to remove duplicates.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
f0e238f0a6 cql3: expr: Add evaluate(expression, query_options)
Add a function that takes an expression and evaluates it to a constant.
Evaluating specific expression variants will be implemented in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
4ee4dc10ed cql3: Implement term::to_expression for function_call
Each functions::function_call can now be converted to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
abd11b6fb4 cql3: Implement term::to_expression for user_type
Each user_type::delayed_value can now be converted to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
d61b2dbf8a cql3: Implement term::to_expression for collections
Each collection delayed_value can now be converted to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
f17d003808 cql3: Implement term::to_expression for tuples
Each tuples::delayed_value can now be converted to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
c40f227c14 cql3: Implement term::to_expression for marker classes
Implement to_expression for non terminals that represent a bind marker.
For now each bind marker has a shape describing where it is used, but hopefully this can be removed in the future.

In order to evaluate a bind_variable we need to know its type.
The type is needed to pass to constant and to validate the value.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
499c9235fc cql3: expr: Add data_type to *_constructor structs
It is useful to have a data_type in *_constructor structs when evaluating.
The resulting constant has a data_type, so we have to find it somehow.

For tuple_constructor we don't have to create a separate tuple_type_impl instance.
For collection_constructor we know what the type is even in case of an empty collection.
For usertype_constructor we know the name, type and order of fields in the user type.

Additionally without a data_type we wouldn't know whether the type is reversed or not.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
f86a1270b0 cql3: Add term::to_expression method
Add a method that converts given term to the matching expression.
It will be used as an intermediate step when implementing evaluate(expression).
evaluate(term) will convert the term to the expression and then call evaluate(expression).

For terminals this is simply calling get() to serialize the value.
For non-terminals the implementation is more complicated and will be implemented in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
746e9c620f cql3: Reorganize term and expression includes
Make term.hh include expression.hh instead of the other way around.
expression can't be forward declared.
expression is needed in term.hh to declare term::to_expression().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Tomasz Grabiec
f582bfd453 Merge "test: raft: randomized_nemesis_test: generator test with linearizability checking" from Kamil
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).

The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.

This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
  and appends to it,
- clients send append operations to the server; the server returns the
  state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
  returning the sequence to the client, and only one instance (the
  server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
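
The sharing scheme described above can be sketched in Python as follows.
This is an illustration of the idea, not the C++ `append_seq` from the
test; the class name `AppendSeq` and its methods are assumptions.

```python
class AppendSeq:
    """Immutable view of a prefix of a list that other views may share."""

    def __init__(self, shared=None, length=0):
        self._shared = shared if shared is not None else []
        self._length = length

    def append(self, value):
        """Return a new sequence with `value` appended; self is unchanged."""
        if len(self._shared) == self._length:
            # Nobody has appended past our prefix: extend the shared list.
            shared = self._shared
        else:
            # Another view already extended the shared list: copy our prefix.
            shared = self._shared[:self._length]
        shared.append(value)
        return AppendSeq(shared, self._length + 1)

    def to_list(self):
        return self._shared[:self._length]


s1 = AppendSeq().append(1)
s2 = s1.append(2)    # extends the shared list in place
s3 = s1.append(3)    # the list already moved on, so this one copies
print(s1.to_list(), s2.to_list(), s3.to_list())
```

Only the second append from the same prefix pays for a copy, which is
what keeps appends cheap when a single instance does all the appending.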

We use AppendReg instead of ExReg in `basic_generator_test`
with a generator which generates a sequence of append operations with
unique integers.

This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line serializability checker with O(1) amortized
complexity on each operation completion.
We also enforce linearizability by checking that every
completed operation was previously invoked.
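
A simplified sketch of such a checker is shown below. It is not the
checker from the test and it is not O(1) per completion: it compares
whole prefixes for clarity. It also does not check real-time ordering
beyond requiring that every observed integer was invoked first. The
class name `AppendRegisterChecker` and its methods are hypothetical.

```python
class AppendRegisterChecker:
    """Checks that results of append operations form one consistent history."""

    def __init__(self):
        self.invoked = set()   # integers whose append has been invoked
        self.state = []        # longest register state observed so far

    def invoke(self, value):
        self.invoked.add(value)

    def complete(self, value, previous_state):
        """Record a completed append of `value` that returned `previous_state`."""
        after = list(previous_state) + [value]
        for element in after:
            if element not in self.invoked:
                raise AssertionError(f"{element} observed before being invoked")
        # The state after this operation and the known state must agree:
        # one has to be a prefix of the other.
        shorter, longer = sorted((self.state, after), key=len)
        if longer[:len(shorter)] != shorter:
            raise AssertionError(f"{after} conflicts with {self.state}")
        self.state = longer


checker = AppendRegisterChecker()
checker.invoke(1)
checker.invoke(2)
checker.complete(2, [1])   # implies that append(1) happened before append(2)
checker.complete(1, [])    # consistent with the state [1, 2]
print(checker.state)
```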

We also perform a simple liveness check at the end of the test by
ensuring that a leader is eventually elected and that we can
successfully execute a call.

* kbr/linearizability-v2:
  test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
  test: raft: randomized_nemesis_test: introduce append register
2021-09-23 23:55:13 +02:00
Benny Halevy
7e9ca101ae storage_service: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-31-bhalevy@scylladb.com>
2021-09-23 17:36:43 +03:00
Benny Halevy
ecbe9f1ef6 storage_service: coroutinize rebuild
Prepare for futurizing get_ranges_for_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-30-bhalevy@scylladb.com>
2021-09-23 17:36:42 +03:00
Benny Halevy
c8b12afe1b storage_service: effective_ownership: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-29-bhalevy@scylladb.com>
2021-09-23 17:35:32 +03:00
Benny Halevy
add78a8cc0 storage_service: coroutinize effective_ownership
Prepare for futurizing get_ranges_for_endpoint.

Dtest: nodetool_additional_test:TestNodetool.status_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-28-bhalevy@scylladb.com>
2021-09-23 17:34:56 +03:00
Avi Kivity
7127c92acc Merge "simplifications and layer violation fix for compaction manager" from Raphael
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."

* 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla:
  compaction: split compaction info and data for control
  compaction_manager: use task when stopping a given compaction type
  compaction: remove start_size and end_size from compaction_info
  compaction_manager: introduce helpers for task
  compaction_manager: introduce explicit ctor for task
  compaction: kill sstables field in compaction_info
  compaction: kill table pointer in compaction_info
  compaction: simplify procedure to stop ongoing compactions
  compaction: move management of compaction_info to compaction_manager
  compaction: move output run id from compaction_info into task
2021-09-23 17:29:19 +03:00
Raphael S. Carvalho
5bf51ced14 compaction: split compaction info and data for control
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:56:18 -03:00
Raphael S. Carvalho
6e7729fa21 compaction_manager: use task when stopping a given compaction type
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:53:53 -03:00
Raphael S. Carvalho
6d1170ac94 compaction: remove start_size and end_size from compaction_info
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:41:13 -03:00
Raphael S. Carvalho
2353f40f63 compaction_manager: introduce helpers for task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:39 -03:00
Raphael S. Carvalho
6820fbf460 compaction_manager: introduce explicit ctor for task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:36 -03:00
Raphael S. Carvalho
d73a241a4e compaction: kill sstables field in compaction_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:32 -03:00
Raphael S. Carvalho
b6b4042faf compaction: kill table pointer in compaction_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:11 -03:00
Raphael S. Carvalho
98f8673d4e compaction: simplify procedure to stop ongoing compactions
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refers to actual ongoing compaction tasks,
whereas _tasks refers to manager tasks, which are responsible for
spawning new compactions, retrying them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.

So stopping a task becomes simpler, and duplication is naturally
gone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:25:51 -03:00
Raphael S. Carvalho
0885376a85 compaction: move management of compaction_info to compaction_manager
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.

This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.

From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.

This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:00:49 -03:00
Raphael S. Carvalho
7688d0432c compaction: move output run id from compaction_info into task
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 09:56:01 -03:00
Piotr Sarna
88480ac504 cql-pytest: relax another condition for a failed wasm execution
The previous commit already relaxed the condition for test_fib,
but the same should be done for test_fib_called_on_null
for an identical reason - more than 1 error can be expected
in the case of calling a heavily recursive function, and either
fuel exhaustion, or hitting the stack limit, or any other
InvalidRequest exception should be accepted.

Closes #9363
2021-09-23 14:11:02 +03:00
Benny Halevy
ad46ff8e5e database: coroutinize create_keyspace
Prepare for futurizing on create_in_memory_keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-10-bhalevy@scylladb.com>
2021-09-23 14:05:44 +03:00
Benny Halevy
91091e9d89 database: update_keyspace: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-9-bhalevy@scylladb.com>
2021-09-23 14:05:18 +03:00
Benny Halevy
c71cd2bed3 database: coroutinize update_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210923093200.1559734-8-bhalevy@scylladb.com>
2021-09-23 14:05:18 +03:00
Piotr Sarna
62948b7404 Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
    cql3::raw_value value;
    data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.

This struct will eventually replace all classes deriving from
`terminal`, but for now `terminal` can't be removed completely.

We can't get rid of terminal yet, because sometimes `terminal` is
converted back to `term`, which `constant` can't do. This won't be a
problem once we replace term with expression.

`bool` is removed from `expression`; `constant` is now used instead.

This is a redesign of PR #9203, there is some discussion about the
chosen representation there.

Closes #9371

* github.com:scylladb/scylla:
  cql3: term: Remove get_elements and multi_item_terminal from terminals
  cql3: Replace most uses of terminal with expr::constant
  cql3: expr: Remove repetition from expr::get_elements
  cql3: expr: Add expr::get_elements(constant)
  cql3: term: remove term::bind_and_get
  cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
  cql3: expr: Add evaluate_IN_list
  cql3: tuples: Implement tuples::in_value::get
  cql3: Move data_type to terminal, make get_value_type non-virtual
  cql3: user_types: Implement get_value_type in user_types.hh
  cql3: tuples: Implement get_value_type in tuples.hh
  cql3: maps: Implement get_value_type in maps.hh
  cql3: sets: Implement get_value_type in sets.hh
  cql3: lists: Implement get_value_type in lists.hh
  cql3: constants: Implement get_value_type in constants.hh
  cql3: expr: Add expr::evaluate
  cql3: Make collection term get() use the internal serialization format
  cql3: values: Add unset value to raw_value_view::make_temporary
  cql3: expr: Add constant to expression
2021-09-23 13:02:29 +02:00
Avi Kivity
369afe3124 treewide: use coroutine::maybe_yield() instead of co_await make_ready_future()
The dedicated API shows the intent, and may be a tiny bit faster.

Closes #9382
2021-09-23 12:28:56 +02:00
Avi Kivity
6702711d9c Merge "Gossiper start-stop sanitation (+ bonus track)" from Pavel E
"
The main challenge here is to move messaging_service.start_listen()
call out of gossiper and into main. Other changes are pretty minor
compared to that and include

- patch gossiper API towards a standard start-shutdown-stop form
- gossiping "sharder info" in initial state
- configure cluster name and seeds via gossip_config

tests: unit(dev)
       dtest.bootstrap_test.start_stop_test_node(dev)
       manual(dev): start+stop, nodetool enable-/disablegossip

refs: #2737
refs: #2795
refs: #5489

"

* 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla:
  code: Expell gossiper.hh from other headers
  storage_service: Gossip "sharder" in initial states
  gossiper: Relax set_seeds()
  gossiper, main: Turn init_gossiper into get_seeds_from_config
  storage_service: Eliminate the do-bind argument from everywhere
  gossiper: Drop ms-registered manipulations
  messaging, main, gossiper: Move listening start into main
  gossiper: Do handlers reg/unreg from start/stop
  gossiper: Split (un)init_messaging_handler()
  gossiper: Relocate stop_gossiping() into .stop()
  gossiper: Introduce .shutdown() and use where appropriate
  gossiper: Set cluster_name via gossip_config
  gossiper, main: Straighten start/stop
  tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
  tests/cql_test_env: De-global most of gossiper
  gossiper: Merge start_gossiping() overloads into one
  gossiper: Use is_... helpers
  gossiper: Fix do_shadow_round comment
  gossiper: Dispose dead code
2021-09-23 12:18:38 +03:00
Avi Kivity
bae9c042c2 Merge 'Add compaction stats to tracing data' from Botond Dénes
Too many tombstones (row or range) are a common source of query performance problems, yet currently we have no visibility into the amount of tombstones a query has to process while constructing the results. This series addresses this by collecting stats about the compacted data in `compact_mutation_state`. This contains the number of partitions, static rows (live and dead), clustering rows (live and dead) and range tombstones. This data is then added to tracing on each query path.
Example trace:
```
 activity                                                                                                                              | timestamp                  | source    | source_elapsed | client
---------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                    Execute CQL3 query | 2021-09-22 12:06:24.089000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                         Parsing a statement [shard 0] | 2021-09-22 12:06:24.089552 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                      Processing a statement [shard 0] | 2021-09-22 12:06:24.089674 | 127.0.0.1 |            122 | 127.0.0.1
      Creating read executor for token -4069959284402364209 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] | 2021-09-22 12:06:24.089724 | 127.0.0.1 |            173 | 127.0.0.1
                                                                                                 read_data: querying locally [shard 0] | 2021-09-22 12:06:24.089727 | 127.0.0.1 |            175 | 127.0.0.1
                                                    Start querying singular range {{-4069959284402364209, pk{000400000001}}} [shard 0] | 2021-09-22 12:06:24.089732 | 127.0.0.1 |            181 | 127.0.0.1
                                Querying cache for range {{-4069959284402364209, pk{000400000001}}} and slice {(-inf, +inf)} [shard 0] | 2021-09-22 12:06:24.089751 | 127.0.0.1 |            199 | 127.0.0.1
 Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 4 clustering row(s) (3 live, 1 dead) and 1 range tombstone(s) [shard 0] | 2021-09-22 12:06:24.089838 | 127.0.0.1 |            286 | 127.0.0.1
                                                                                                            Querying is done [shard 0] | 2021-09-22 12:06:24.089847 | 127.0.0.1 |            295 | 127.0.0.1
                                                                                        Done processing - preparing a result [shard 0] | 2021-09-22 12:06:24.089862 | 127.0.0.1 |            311 | 127.0.0.1
                                                                                                                      Request complete | 2021-09-22 12:06:24.089326 | 127.0.0.1 |            326 | 127.0.0.1

```
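
The "Page stats" line in the trace above can be reproduced with a small
sketch. The dataclass `CompactionStats`, its field names and the method
`trace_message` are hypothetical and only mirror the counters listed in
the description; they are not the members of `compact_mutation_state`.

```python
from dataclasses import dataclass


@dataclass
class CompactionStats:
    partitions: int = 0
    static_rows_live: int = 0
    static_rows_dead: int = 0
    clustering_rows_live: int = 0
    clustering_rows_dead: int = 0
    range_tombstones: int = 0

    def trace_message(self):
        """Format the counters the way the example trace line shows them."""
        static_total = self.static_rows_live + self.static_rows_dead
        clustering_total = self.clustering_rows_live + self.clustering_rows_dead
        return (
            f"Page stats: {self.partitions} partition(s), "
            f"{static_total} static row(s) "
            f"({self.static_rows_live} live, {self.static_rows_dead} dead), "
            f"{clustering_total} clustering row(s) "
            f"({self.clustering_rows_live} live, {self.clustering_rows_dead} dead) "
            f"and {self.range_tombstones} range tombstone(s)"
        )


stats = CompactionStats(partitions=1, clustering_rows_live=3,
                        clustering_rows_dead=1, range_tombstones=1)
print(stats.trace_message())
```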

Tests: unit(dev)

Fixes: https://github.com/scylladb/scylla/issues/5471

Closes #9372

* github.com:scylladb/scylla:
  multishard_mutation_query: add tracepoint with compaction stats
  querier: add tracepoint with compaction stats
  mutation_compactor: collect stats about compacted data
2021-09-22 19:24:19 +03:00
Kamil Braun
ea172fe531 test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
Use AppendReg instead of ExReg for the state machine.
Use a generator which generates a sequence of append operations with
unique integers.

This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line linearizability checker with O(1) amortized
complexity on each operation completion.

We also perform a simple liveness check at the end of the test by
ensuring that a leader is eventually elected and that we can
successfully execute a call.
2021-09-22 17:56:23 +02:00
Avi Kivity
c0afdf3f15 Update seastar submodule
* seastar c04a12edbd...e6db0cd587 (13):
  > Merge "Add kernel stack trace reporting for stalls" from Avi
Ref #8828
  > Merge "Keep XFS' dioattr cached" from Pavel E
  > coroutines: de-template maybe_yield()
  > sharded: Add const versions of map_reduce's
  > apps/io_tester: remove unused lambda capture
  > doc: exclude seastar::coroutine::internal namespace
  > deprecate unaligned_cast<> from unaligned.hh
  > reactor: adjust max_networking_aio_io_control_blocks to lower size when fs.aio-max-nr is small
  > build: clarify choice of C++ dialect, and change default to C++20
  > coding_style: update concepts style to snake_case
  > Merge "Teach io_tester to submit requests-per-second flow" from Pavel E
  > cmake: find and link against Boost::filesystem
  > coroutine: add maybe_yield
2021-09-22 18:55:25 +03:00
Nadav Har'El
92570ea7d9 cql-pytest: add tests on behavior of empty-string keys
We know (verified by existing tests) that null keys are not allowed -
neither as partition keys nor clustering keys.
In issue #9352 a question was raised of whether an *empty string* is
allowed as a key on a base table (not a materialized view or index).
The following tests confirm that the current situation is as follows:

1. An empty string is perfectly legal as a clustering key.
2. An empty string is NOT ALLOWED as a partition key - the error
   "Key may not be empty" is reported if this is attempted.
3. If the partition key is compound (multiple partition-key columns)
   then any or all of them may be empty strings.
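
The three rules can be summarized in a small model. This is only a
restatement of the observed behavior for string columns, not the
server's validation code; the function names are hypothetical and
ValueError stands in for the request error.

```python
def check_partition_key(components):
    """Raise ValueError if a partition key made of `components` is rejected."""
    if len(components) == 1 and components[0] == "":
        raise ValueError("Key may not be empty")


def is_allowed(components):
    try:
        check_partition_key(components)
        return True
    except ValueError:
        return False


print(is_allowed([""]))        # single-column partition key, empty string
print(is_allowed(["", ""]))    # compound partition key, all components empty
print(is_allowed(["a"]))       # ordinary non-empty key
```

Clustering keys are not modeled here because an empty string is always
accepted for them.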

These tests pass the same on both Cassandra and Scylla, showing that
this bizarre (and undocumented) behavior is identical in both.

Refs #9352.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922131310.293846-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Avi Kivity
083279d9ab Merge "Generalize sstable creation for tests" from Pavel E
"
There's a whole lot of places that create an sstable for tests
like this

    auto sst = env.make_sstable(...);
    sst->write_components(...);
    sst->load();

Some of them are already generalized with the make_sstable_easy
helper, but there are several instances of them.

Found while hunting down the places that use default IO sched
class behind the scenes.

tests: unit(dev)
"

* 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla:
  test: Generalize make_sstable() and make_sstable_easy()
  test: Use now existing helpers elsewhere
  test: Generalize all make_sstable_easy()-s
  test: Set test change estimation to 1
  test: Generalize make_sstable_easy in mutation tests
  test: Generalize make_sstable_easy in set tests
  test: Reuse make_sstable_easy in datafile tests
  test: Relax make_sstable_easy in compaction tests
2021-09-22 18:55:25 +03:00
Nadav Har'El
a99a774731 cql-pytest: test for secondary-index on empty-string value
When a string column is indexed with a secondary index, the empty value
for this column (an empty string '') is perfectly legal, and should be
indexed as well. This is not the same as an unset (null) value which
isn't indexed.

The following test demonstrates that this case works in Cassandra, but
does not in Scylla (so the test is marked "xfail"). In Scylla, a query
that returns the expected results with ALLOW FILTERING suddenly returns
a different (and wrong) result when an index is added on the table.

This test reproduces issue #9364.

Refs #9364.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922121510.291826-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Avi Kivity
b5cf0b4489 Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that the tracker loses track (pun intended) of changes applied to the set.

The first patch will fix the leak when strategy is changed, and the third
patch will make sure that tracker is updated atomically with sstable set,
so these kinds of problems will not happen anymore.

Fixes #9157

test: mode(debug)
"

* 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla:
  compaction: Update backlog tracker correctly when schema is updated
  compaction: Don't leak backlog of input sstable when compaction strategy is changed
  compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
  compaction: simplify removal of monitors
2021-09-22 18:55:25 +03:00
Nadav Har'El
e8493e20cb cql-pytest: test for empty-string as partition key in materialized view
Scylla and Cassandra do not allow an empty string as a partition key,
but a materialized view might "convert" a regular string column into a
partition key, and an empty string is a perfectly valid value for this
column. This can result in a view row which has an empty string as a
partition key. This case works in Cassandra, but doesn't in Scylla (the
row with the empty string as a partition key doesn't appear). The
following test demonstrates this difference between Scylla and Cassandra
(it passes on Cassandra, fails on Scylla, and is accordingly marked
"xfail").

Refs #9375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922115000.290387-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Piotr Jastrzebski
56888c8954 docs: clean up codeowners
Recently we had to say goodbye to our dear friend Pekka.
He orphaned a few subsystems that can't call for his help in code
reviews anymore.

This patch makes sure no one will bother Pekka in his afterlife.
It also cleans up HACKING.md a little bit by removing Pekka and Duarte
from the maintainer/reviewer lists.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <98ba1aed9ee8a87b9037b5032b82abc5bfddbd66.1632301309.git.piotr@scylladb.com>
2021-09-22 18:55:25 +03:00
Botond Dénes
3f4f408bcf schema: add get_reversed()
A variant of make_reversed() which goes through the schema registry,
teaching the schema to the registry if necessary. This effectively
caches the result of the reversing and as an added bonus double
reversing yields the very same schema C++ object that was the starting
point.

Closes #9365
2021-09-22 18:55:25 +03:00
Kamil Braun
81b7ed23bb test: raft: randomized_nemesis_test: introduce append register
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).

The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.

This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
  and appends to it,
- clients send append operations to the server; the server returns the
  state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
  returning the sequence to the client, and only one instance (the
  server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
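
The sharing scheme described above can be sketched as follows. This is
a minimal Python model written for illustration, not the C++
`append_seq` from the test: the class name `AppendSeq` and its methods
are assumptions. Each instance sees only its own prefix of a shared
list, and an append copies only when another instance has already
extended that list.

```python
class AppendSeq:
    """Immutable sequence view over a list that may be shared (illustrative)."""

    def __init__(self, buf=None, length=0):
        self._buf = [] if buf is None else buf
        self._len = length  # this instance only sees buf[:length]

    def append(self, x):
        if len(self._buf) == self._len:
            # Nobody extended the shared list past our prefix: append in place.
            self._buf.append(x)
            return AppendSeq(self._buf, self._len + 1)
        # Another instance already appended: copy our prefix first.
        copy = self._buf[:self._len]
        copy.append(x)
        return AppendSeq(copy, self._len + 1)

    def to_list(self):
        return self._buf[:self._len]
```

In this model the "previous state" handed back to a client is simply
the old instance, which stays valid because later appends never touch
its prefix.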
2021-09-22 17:54:07 +02:00
Botond Dénes
922295dd8e multishard_mutation_query: add tracepoint with compaction stats
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
2021-09-22 14:00:24 +03:00
Botond Dénes
eba46e353d querier: add tracepoint with compaction stats
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
2021-09-22 14:00:05 +03:00
Botond Dénes
f0ead81250 mutation_compactor: collect stats about compacted data
Stats contain the number of partitions, static rows, clustering rows and
range tombstones. For rows dead/live are counted separately.
2021-09-22 13:59:19 +03:00
Pavel Emelyanov
598841a5dd code: Expel gossiper.hh from other headers
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
6875a4b292 storage_service: Gossip "sharder" in initial states
Right now the number of shards and ignore-msb-bits are gossiped
with a separate call. It's simpler to include this data into
the initial gossiping state.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
968e117315 gossiper: Relax set_seeds()
It's much shorter and simpler to pass the seeds, obtained from the
config, into gossiper via gossip_config rather than with the help
of a special call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
2b63c4c16f gossiper, main: Turn init_gossiper into get_seeds_from_config
Looking into the init_gossiper() helper makes it clear that what it
does is get the seeds, provider and listen_address from the config and
generate a set of seeds for the gossiper. It then calls
gossiper.set_seeds().

This patch renames the helper to get_seeds_from_config(), removes
all arguments but db::config& from it and moves the call to set_seeds()
into main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
7680274e02 storage_service: Eliminate the do-bind argument from everywhere
The same as in the previous patch -- the gossiper doesn't need to know
if it should call messaging.start_listen() or not, and neither does
the storage_service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
0607a2b84f gossiper: Drop ms-registered manipulations
Now it's no-op and can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
ca316f32f0 messaging, main, gossiper: Move listening start into main
Before preparing the cluster join process the messaging should be
put into listening state. Right now it's done "on-demand" by the
call to the do_shadow_round(), also there's a safety call in the
start_gossiping(). Tests, however, should not start listening, so
the do_bind boolean exists and is passed all the way around.

Make the main() code explicitly call the messaging.start_listen()
and leave tests without it. This change makes messaging start
listening a bit earlier, but in between these old and new places
there's nothing that needs messaging to stay deaf.

As the do_bind becomes useless, the wait_for_gossip_to_settle() is
also moved into main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
f644eb1cf7 gossiper: Do handlers reg/unreg from start/stop
On start, handlers can be registered any time before the messaging
starts to listen. On stop, handlers can remain registered for any
length of time, since the messaging service stops early in
drain_on_shutdown().

One tricky place is API start_/stop_gossiping(). The latter calls
gossiper::stop() thus unregistering the handlers. So to make the
start_gossiping() work it must call gossiper::start() in advance.

Overall the gossiper start/stop becomes this:

   gossiper.start()
    `- registers handlers

   gossiper.start_gossiping()
    `- // starts gossiping

   gossiper.shutdown()
    `- // stops gossiping

   gossiper.stop()
    `- calls shutdown() // re-entrant
    `- unregisters handlers
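
The sequence above can be modelled with a few lines of Python. This is
a toy stand-in for illustration only, not the Seastar-based C++
service: the class, its flags and the `api_start_gossiping` helper are
assumptions.

```python
class Gossiper:
    """Toy stand-in for the gossiper service; not the real C++ class."""

    def __init__(self):
        self.handlers_registered = False
        self.gossiping = False

    def start(self):
        # Registers the messaging handlers.
        self.handlers_registered = True

    def start_gossiping(self):
        if not self.handlers_registered:
            raise RuntimeError("handlers must be registered first")
        self.gossiping = True

    def shutdown(self):
        # Safe to call more than once: a second call changes nothing.
        self.gossiping = False

    def stop(self):
        self.shutdown()
        self.handlers_registered = False


def api_start_gossiping(g):
    # The API path registers handlers again, because an earlier API stop
    # went through stop() and unregistered them.
    g.start()
    g.start_gossiping()
```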

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
9aba3e6f9f gossiper: Split (un)init_messaging_handler()
As a preparation for the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
dfe54207cb gossiper: Relocate stop_gossiping() into .stop()
The helper in question is called in two places:

1. In main() as a fuse against early exception before creating the
   drain_on_shutdown() defer
2. In the stop_gossiping() API call

Both can be replaced with the stop_gossiping() call from the .stop()
method, here's why:

1. In main the gossiper::stop() call is already deferred right after
   the gossiper is started. So this change moves it above. It may
   happen that an exception pops up before the old fuse was deferred,
   but that's OK -- stop_gossiping() is safe against early and
   repeated calls

2. The stop_gossiping() change is effectively a rename -- it calls the
   stop_gossiping() as it did before, but with the help of the .stop()
   method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
e24c5034b5 gossiper: Introduce .shutdown() and use where appropriate
The start/stop sequence we're moving towards assumes a shutdown (or
drain) method that will be called early on stop to notify the service
that the system is going down so it could prepare.

For gossiper it already means calling stop_gossiping() on the shard-0
instance. So by and large this patch renames a few stop_gossiping()
calls into .shutdown() ones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
25210334b6 gossiper: Set cluster_name via gossip_config
It's taken purely from the db::config and thus can be set up early.

Right now the empty name is converted into "Test Cluster" one, but
remains empty in the config and is later used by the system_keyspace
code. This logic remains intact.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
084abb824e gossiper, main: Straighten start/stop
Turn the gossiper start/stop sequence into the canonical form

    gossiper.start(std::ref(dependencies)...).get();
    auto stop_gossiper = defer([&gossiper] {
        gossiper.invoke_on_all(&gossiper::stop).get();
    });
    gossiper.invoke_on_all(&gossiper::start).get();

The deferred call should be gossiper.stop(); but for now keep
the instances memory alive.

This trick is safe at this point, because .start() and .stop()
methods are both empty (still).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:05 +03:00
Jan Ciolek
3d23d6f9dd cql3: term: Remove get_elements and multi_item_terminal from terminals
terminal now isn't used as a final value anywhere.
Remove things that are no longer needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:33:00 +02:00
Jan Ciolek
2523c9ba48 cql3: Replace most uses of terminal with expr::constant
constant is now ready to replace terminal as a final value representation.
Replace bind() with evaluate and shared_ptr<terminal> with constant.

We can't get rid of terminal yet. Sometimes terminal is converted back
to term, which constant can't do. This won't be a problem once we
replace term with expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:28:15 +02:00
Jan Ciolek
c859ec2bdf cql3: expr: Remove repetition from expr::get_elements
There was some repeating code in expr::get_elements family
of functions. It has been reduced into one function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:28:15 +02:00
Jan Ciolek
2cbed7a679 cql3: expr: Add expr::get_elements(constant)
We need to be able to access elements of a constant.
Adds functions to easily do it.

Those functions check all preconditions required to access elements
and then use partially_deserialize_* or similar.

It's much more convenient than using partially_deserialize directly.

get_list_of_tuples_elements is useful with IN restrictions like
(a, b) IN [(1, 2), (3, 4)].

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:28:15 +02:00
Jan Ciolek
d39b085428 cql3: term: remove term::bind_and_get
term::bind_and_get is not needed anymore, remove it.

Some classes use bind_and_get internally, those functions are left intact
and renamed to bind_and_get_internal.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:28:14 +02:00
Jan Ciolek
221ed38e94 cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
Start using evaluate_to_raw_view instead of bind_and_get.
This is a step towards using only evaluate.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:20:30 +02:00
Jan Ciolek
adaf6e5eec cql3: expr: Add evaluate_IN_list
A list representing IN values might contain NULLs before evaluation.
We can remove them during evaluation, because nothing equals NULL.
If we don't remove them, there will be errors, because a list can't contain NULLs.
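
As a rough illustration of the idea (a Python sketch, not the C++
implementation; the function names are hypothetical and `None` stands
in for NULL):

```python
def evaluate_in_list(values):
    """Drop NULLs (None) from the candidate values of an IN restriction."""
    return [v for v in values if v is not None]


def matches_in(column_value, values):
    # Nothing equals NULL, so filtering NULLs out cannot change the result.
    return column_value in evaluate_in_list(values)
```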

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:20:29 +02:00
Jan Ciolek
33882cc716 cql3: tuples: Implement tuples::in_value::get
To convert a terminal to expr::constant we need to be able to serialize it.
tuples::in_value didn't have serialization implemented, do it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:20:29 +02:00
Jan Ciolek
2936adc570 cql3: Move data_type to terminal, make get_value_type non-virtual
Every class now has implementation of get_value_type().
We can simply make base class keep the data_type.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:20:28 +02:00
Jan Ciolek
e683bf0379 cql3: user_types: Implement get_value_type in user_types.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in user_types.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
0ac0f11d64 cql3: tuples: Implement get_value_type in tuples.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in tuples.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
48e5277b2f cql3: maps: Implement get_value_type in maps.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in maps.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
5aae370928 cql3: sets: Implement get_value_type in sets.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in sets.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
6bf6b03d12 cql3: lists: Implement get_value_type in lists.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in lists.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
da7ca5a760 cql3: constants: Implement get_value_type in constants.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in constants.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:36 +02:00
Jan Ciolek
a964827696 cql3: expr: Add expr::evaluate
Adds the functions:
constant evaluate(term*, const query_options&);
raw_value_view evaluate(term*, const query_options&);

These functions take a term, bind it and convert the terminal
to constant or raw_value_view.

In the future these functions will take expression instead of term.
For that to happen bind() has to be implemented on expression,
this will be done later.

Also introduces terminal::get_value_type().
In order to construct a constant from terminal we need to know the type.
It will be implemented in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:13:34 +02:00
Jan Ciolek
561e6b0a59 cql3: Make collection term get() use the internal serialization format
A term should always be serialized using the internal cql serialization format.
A term represents a value received from the driver,
but for every use we are going to need it in the internal serialization format.

Other places in the code already do this, for example see list_prepare_term,
it calls value.bind(query_options::DEFAULT) to evaluate a collection_constructor.
query_options::DEFAULT has the latest cql serialization format.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:05:09 +02:00
Jan Ciolek
2b9a9c8ff5 cql3: values: Add unset value to raw_value_view::make_temporary
When unset_value is passed to make_temporary it gets converted to null_value.
This looks like a mistake, it should be just unset_value.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:05:09 +02:00
Jan Ciolek
ad3d2ee47d cql3: expr: Add constant to expression
Adds constant to the expression variant:
struct constant {
    raw_value value;
    data_type type;
};

This struct will be used to represent constant values with known bytes and type.
This corresponds to the terminal from current design.

bool is removed from expression, now constant is used instead.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-21 16:05:09 +02:00
Pavel Emelyanov
c4d1022943 tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
The helper is called once. Keeping this code in the caller packs the
code, helps it look more like main() and facilitates further patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 12:54:23 +03:00
Botond Dénes
c53c50e6f1 scylla-gdb.py: scylla memory: exclude too small object sizes
Sizes too small to fit a ::seastar::memory::free_object won't contain
any objects at all so they don't contribute anything to the listing
beyond noise.

Closes #9366
2021-09-21 11:21:10 +02:00
Pavel Emelyanov
83902f43ab tests/cql_test_env: De-global most of gossiper
Gossiper is still global and cql_test_env heavily exploits this fact.
Clean that by getting the gossiper once and using the local reference
everywhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 11:19:16 +03:00
Pavel Emelyanov
89adb0df90 gossiper: Merge start_gossiping() overloads into one
There are two of them and one is only called from the API with the
do_bind always set to "yes". This fact makes it possible to remove
it by adding relevant defaults for the other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 11:19:16 +03:00
Pavel Emelyanov
e71bd23b3d gossiper: Use is_... helpers
There are several state booleans on the service and some helpers to
manipulate/check those. Make the code consistent by always using these
helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 11:19:16 +03:00
Pavel Emelyanov
efb0ddff21 gossiper: Fix do_shadow_round comment
Shadow round is used during each boot, not only during node replacement

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 11:19:16 +03:00
Pavel Emelyanov
f7ab1aa876 gossiper: Dispose dead code
The debug_show() is unused, as well as the advertise_myself().
The _features_condvar used to be listened on before f32f08c9,
now it's signal-only.
Feature friendship with gossiper is not required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-21 11:19:16 +03:00
Piotr Sarna
d3edca4b43 Merge 'alternator: add stub implementation of TTL's API operations'
... from Nadav Har'El

This small series adds a stub implementation of Alternator's
UpdateTimeToLive and DescribeTimeToLive operations. These operations can
enable, disable, or inquire about, the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved,
with no actual expiration of any items taking place.

Because this is an incomplete implementation of this feature, it is not
enabled unless an experimental flag is enabled on all nodes in the
cluster.

See the individual patches for more information on what this series
does.

Refs #5060.

Closes #9345

* github.com:scylladb/scylla:
  test/alternator: rename utility function test_table_name()
  alternator: stub TTL operations
  alternator: make three utility functions in executor.cc non-static
  test/alternator: test another corner case of TTL
2021-09-21 09:58:17 +02:00
Raphael S. Carvalho
ff38f59f67 compaction: Update backlog tracker correctly when schema is updated
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contain A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN

This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
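
The five steps above can be replayed with a toy model. This Python
sketch is illustrative only: it assumes the backlog is simply the total
size of the tracked sstables (the real tracker is more involved), and
the class and method names are hypothetical.

```python
class BacklogTracker:
    """Toy tracker: backlog modelled as total size of tracked sstables."""

    def __init__(self, sstables=None):
        self.total = sum((sstables or {}).values())

    def add(self, size):
        self.total += size

    def remove(self, size):
        self.total -= size


def decoupled_scenario():
    sstable_set = {"A": 100}
    tracker = BacklogTracker(sstable_set)  # 1) set and tracker contain A
    del sstable_set["A"]
    sstable_set["B"] = 60                  # 2) compaction replaces A by B
    tracker = BacklogTracker(sstable_set)  # 3) schema change: tracker rebuilt
    tracker.remove(100)                    # 4) compaction removes A anyway
    return tracker.total                   # 5) negative backlog


class Table:
    """Coupled variant: every set update is mirrored in the tracker."""

    def __init__(self, sstables):
        self.sstables = dict(sstables)
        self.tracker = BacklogTracker(self.sstables)

    def replace(self, old, new, new_size):
        self.tracker.remove(self.sstables.pop(old))
        self.sstables[new] = new_size
        self.tracker.add(new_size)

    def on_schema_change(self):
        self.tracker = BacklogTracker(self.sstables)
```

In the coupled variant the tracker total always equals the sum over the
sstable set, regardless of when the schema change arrives.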

Fixes #9157

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-20 15:54:41 -03:00
Raphael S. Carvalho
0a3049908c compaction: Don't leak backlog of input sstable when compaction strategy is changed
The generic backlog formula is: ALL + PARTIAL - COMPACTING

With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.

But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.

With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.

By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.

Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-20 15:36:05 -03:00
Raphael S. Carvalho
3dc1821287 compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
This new function makes it easier to remove monitor of exhausted
sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-20 15:16:41 -03:00
Raphael S. Carvalho
28ba8bde80 compaction: simplify removal of monitors
By switching to unordered_map, removal of generated monitors is
made easier. This is a preparatory change for the patch which will
remove the monitors of all exhausted sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-20 15:06:37 -03:00
Pavel Emelyanov
1cb2b65205 test: Generalize make_sstable() and make_sstable_easy()
The former constructs a memtable from the vector of mutations and
then does exactly the same steps as the latter one -- creates an
sstable corresponding to the memtable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
843dac0b8a test: Use now existing helpers elsewhere
There are several places in other tests that can make use of
the new make_sstable_easy() helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
a2590368ce test: Generalize all make_sstable_easy()-s
There are already four of them. Those working with the mutation reader
can be folded into one with some default args.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
e45f81ceb4 test: Set test change estimation to 1
The test's intention is not to test how zero estimated partitions
work, there's another case for that (in another test). Also it
looks like 0 doesn't flow anywhere far, it's std::max-ed into
1 early inside the mc::writer constructor.

This change significantly simplifies the unification of the set
of make_sstable_easy()-s in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
96feafabd4 test: Generalize make_sstable_easy in mutation tests
The same trick as in the previous patch, but the new helper
accepts a memtable instead of a mutation reader and makes the
reader from the memtable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
ee91a8334c test: Generalize make_sstable_easy in set tests
There are a bunch of places in the test that do the same sequence
of steps to create an sstable. Generalize them into a helper
that resembles the one from previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
28e5307ce2 test: Reuse make_sstable_easy in datafile tests
This patch is two-fold. First it changes the signature of the
local helper to facilitate next patching. Second, it makes more
relevant places in the test use this helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
44294accb6 test: Relax make_sstable_easy in compaction tests
The version argument can be omitted, the env.make_sstable will
default it to highest version. The generation argument is left
and defaulted to 1.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Piotr Sarna
dd9d6c081e cql-pytest: relax error conditions for a failed wasm execution
Originally, the expected failure for a recursive invocation
test case was to expect that fuel gets exhausted, but it's also
possible to hit a stack limit first. All errors are equally
expected here as long as the execution is halted, so let's relax
the condition and accept any wasm-related InvalidRequest errors.

Closes #9361
2021-09-20 15:20:52 +03:00
Avi Kivity
8c0f2f9e3d Revert "Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek"
This reverts commit e9343fd382, reversing
changes made to 27138b215b. It causes a
regression in v2 serialization_format support:

collection_serialization_with_protocol_v2_test fails with: marshaling error: read_simple_bytes - not enough bytes (requested 1627390306, got 3)

Fixes #9360
2021-09-20 15:15:09 +03:00
Avi Kivity
15819e0304 Merge "Database start/stop code sanitation" from Pavel E
"
Currently database start and stop code is quite dispersed and
exists in two slightly different forms -- one in main and the
other one in cql_test_env. This set unifies both and makes
them look almost the perfect way:

    sharded<database> db;
    db.start(<dependencies>);
    auto stop = defer([&db] { db.stop().get(); });
    db.invoke_on_all(&database::start).get();

with all (well, most) other mentions of the "db" variable
being arguments for other services' dependencies.

tests: unit(dev, release), unit.cross_shard_barrier(debug)
       dtest.simple_boot_shutdown(dev)
refs: #2737
refs: #2795
refs: #5489

"

* 'br-database-teardown-unification-2' of https://github.com/xemul/scylla: (26 commits)
  main: Log when database starts
  view_update_generator: Register staging sstables in constructor
  database, messaging: Delete old connection drop notification
  database, proxy: Relocate connection-drop activity
  messaging, proxy: Notify connection drops with boost signal
  database, tests: Rework recommended format setting
  database, sstables_manager: Sow some noexcepts
  database: Eliminate unused helpers
  database: Merge the stop_database() into database::stop()
  database: Flatten stop_database()
  database: Equip with cross-shard-barrier
  database: Move starting bits into start()
  database: Add .start() method
  main: Initialize directories before database
  main, api: Detach set_server_config from database and move up
  main: Shorten commitlog creation
  database: Extract commitlog initialization from init_system_keyspace
  repair: Shutdown without database help
  main: Shift iosched verification upward
  database: Remove unused mm arg from init_non_system_keyspaces()
  ...
2021-09-20 10:26:13 +03:00
Nadav Har'El
58078a3f84 test/alternator: rename utility function test_table_name()
We have a utility function test_table_name() to create a unique name for
a test table. The funny thing is that, because this function's name
starts with the string "test_", pytest believes it's a test. This doesn't
cause any problems (it's considered a *passing* test), but it's
nevertheless strange to see it listed on the list of tests.

So in this patch, we trivially rename this function to unique_table_name(),
a name which pytest doesn't think is the name of a test.
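
The naming issue can be illustrated with a short sketch. The body of
`unique_table_name()` below is a hypothetical stand-in rather than the
helper from test/alternator, and `looks_like_test()` only approximates
pytest's default collection rule (functions whose name starts with
"test").

```python
import itertools

_counter = itertools.count()


def unique_table_name():
    # Hypothetical stand-in: any scheme that yields distinct names will do.
    return f"table_{next(_counter)}"


def looks_like_test(name):
    # Approximation of pytest's default rule for collecting test functions.
    return name.startswith("test")
```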

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Nadav Har'El
4ffd8c1f2b alternator: stub TTL operations
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or inquire
about, the chosen expiration-time attribute.

Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.

Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.

Because this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.

The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Nadav Har'El
7404c7a9c1 alternator: make three utility functions in executor.cc non-static
Make three of the utility functions in alternator/executor.cc, which
until now were static (local to the source file), external symbols
(in the alternator namespace). This will allow using them in other
Alternator source files - like the one in the next patch for TTL
support, which we'll want to put in a separate source file.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Nadav Har'El
82d2942ac8 test/alternator: test another corner case of TTL
Usually the TTL feature's expiration-time attribute is a schema-less
attribute, implemented in Alternator as a JSON-serialized item in a
bigger map column. However, key attributes are a special case because
they are implemented as separate columns. We already had test cases
showing that this case works too - for the case of hash and range keys.

In this test we test another possibility of an attribute that is
implemented as a schema column - the case of an LSI key.

Like the other TTL tests, this test passes on DynamoDB but xfails on
Alternator because the TTL feature is not yet implemented.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Beni Peled
e873bdbfe9 docker: fix entrypoint issue
This commit fixes [0], in which an extra (redundant) keyword was added to
the `--entrypoint`, causing scylla-server to fail to start.

[0] https://github.com/scylladb/scylla-pkg/issues/2395

Closes #9350
Fixes #9355
2021-09-19 15:39:08 +03:00
Kamil Braun
e3f1667744 sstables: remove use_binary_search_in_promoted_index
This was a global variable that was potentially modified from a
performance benchmark. It would modify the behavior of `index_reader`
in certain scenarios.

Remove the variable so we can specify the behavior of `index_reader`
functions without relying on anything other than what's passed into the
constructor and the function parameters.
2021-09-19 13:59:25 +03:00
Kamil Braun
28193805e5 mutation_partition: fix exception message in append_clustered_row 2021-09-19 13:47:19 +03:00
Benny Halevy
fa46bf3499 compaction: split compaction_aborted_exception from compaction_stopped_exception
Indicate whether the compaction job should be aborted
due to an error using a new, compaction_aborted_exception type,
vs. compaction_stopped_exception that indicates
the task should be stopped due to some external event that
doesn't indicate an error (like shutdown or api call).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
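
The split described in this commit can be illustrated with a minimal standalone sketch. The two exception types below only borrow the names from the commit message; the `outcome` enum and `run_compaction` helper are hypothetical and are not taken from the Scylla sources.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for the two exception types named in the commit.
struct compaction_stopped_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct compaction_aborted_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class outcome { finished, stopped, failed };

// Runs a job and reports whether it finished, was stopped by an external
// event (not an error), or was aborted because of an error.
template <typename Job>
outcome run_compaction(Job&& job) {
    try {
        job();
        return outcome::finished;
    } catch (const compaction_stopped_exception&) {
        return outcome::stopped;   // e.g. shutdown or an API call
    } catch (const compaction_aborted_exception&) {
        return outcome::failed;    // counts as an error
    }
}
```

With separate types, a caller can decide per catch clause whether to account an error, without inspecting the exception message.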
2021-09-19 12:20:30 +03:00
Benny Halevy
eebe14e7bc compaction_manager: maybe_stop_on_error: rely on retry=false default
No need to set retry to false again in various catch
clauses.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-19 12:20:30 +03:00
Benny Halevy
ca2bb89180 compaction_manager: maybe_stop_on_error: sync return value with error
message.

It is misleading to set retry to true in the following statement
and return it later on when the `will_stop` parameter is true.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-19 12:20:30 +03:00
Benny Halevy
a1fe40278b compaction: drop retry parameter from compaction_stop_exception
Drop the retry parameter from compaction_stop_exception
as it is always false.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-19 12:20:30 +03:00
Benny Halevy
9800dbe871 compaction_manager: move errors stats accounting to maybe_stop_on_error
Currently, _stats.errors is not accounted for non-retryable errors
like storage_io_error.

Fixes #9354

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-19 12:20:22 +03:00
Benny Halevy
ce3fcc121e paxos_state: prepare: handle exception getting data or digest
This exception is ignored by design, but if it's left unhandled,
it generates `Exceptional future ignored` warnings, like the following.

Also, ignore f2 if f1 failed since we return early in this case.

```
[shard 5] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem), backtrace: 0x431689e 0x4316d40 0x43170e8 0x3f35486 0x218d14a 0x3f8002f 0x3f81217 0x3f9f868 0x3f4b76a /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012
N7seastar12continuationINS_8internal22promise_base_with_typeISt7variantIJN5utils4UUIDEN7service5paxos7promiseEEEEEZZZZNS7_11paxos_state7prepareEN7tracing15trace_state_ptrENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandERK13partition_keyS5_bNSI_16digest_algorithmENSt6chrono10time_pointINS_12lowres_clockENSQ_8durationIlSt5ratioILl1ELl1000EEEEEEENK3$_0clEvENUlvE_clEvENKUlSB_E_clESB_EUlT_E_ZNS_6futureISt5tupleIJNS13_IvEENS13_IS14_IJNSE_INSI_6resultEEE17cache_temperatureEEEEEEE14then_impl_nrvoIS12_NS13_IS9_EEEET0_OS11_EUlOSA_RS12_ONS_12future_stateIS1B_EEE_S1B_EE#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, unsigned long, seastar::lowres_clock::duration, std::result_of&&)::{lambda(seastar::basic_semaphore)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::basic_semaphore)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::future<std::variant<utils::UUID, service::paxos::promise> >&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::key_lock_map::with_locked_key<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(dht::token const&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::result_of)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}>(service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}::operator()() const::{lambda(std::variant<utils::UUID, service::paxos::promise>)#1}, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_impl_nrvo<{lambda()#1}, {lambda()#1}<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > > >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, seastar::future<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >::finally_body<seastar::smp::submit_to<service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}>(unsigned int, se
```

Refs #7779
Refs #9331

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210919053007.13960-1-bhalevy@scylladb.com>
2021-09-19 11:58:21 +03:00
Takuya ASADA
5ab7fb7f10 reloc: stop removing entire BUILDDIR
We found that a user can mistakenly break the system with the --builddir
option, with something like './reloc/build_deb.sh --builddir /'.
To avoid that, stop removing the entire $BUILDDIR and remove only the
directories we have to clean up before building the deb package.

See: https://github.com/scylladb/scylla-python3/pull/23#discussion_r707088453

Closes #9351
2021-09-19 10:33:33 +03:00
Pavel Emelyanov
8d7a907a65 main: Log when database starts
Just to be consistent with other "services".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
0de69136d4 view_update_generator: Register staging sstables in constructor
First, it fixes the discarded future during registration. The
future is not actually a problem, as it's always a ready no-op one:
at that stage the view_update_generator is neither aborted nor in
the throttling state.

Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
a4118a70ee database, messaging: Delete old connection drop notification
Database no longer needs it. Since the only user of the old-style
notification is gone -- remove it as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
bfd91d7b81 database, proxy: Relocate connection-drop activity
On start, the database subscribes to the messaging-service connection-drop
notification in order to drop the hit-rate from column families. However,
the updater and reader of those hit-rates is the storage_proxy, so it
must be the _proxy_ that drops the hit-rate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
dd498273dc messaging, proxy: Notify connection drops with boost signal
The messaging_service keeps track of a list of connection-drop
listeners. This list is not auto-removing and is thus not safe
on stop (fortunately there's only 1 non-stopping client of it
so far).

This patch adds a safer notification based on boost/signals.
Also, storage_proxy is subscribed to it in advance to demonstrate
what it looks like altogether and to make the next patch shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
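
To show why an auto-removing subscription is safer than a plain listener list, here is a minimal standard-library sketch. `toy_signal` and its `connection` handle are hypothetical and only imitate the idea of a scoped connection; they are not the Boost API nor the messaging_service code.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical toy signal: listeners are removed when their handle dies.
class toy_signal {
    using slot_map = std::map<int, std::function<void(int)>>;
    std::shared_ptr<slot_map> _slots = std::make_shared<slot_map>();
    int _next_id = 0;
public:
    // RAII handle: destroying it unregisters the listener.
    class connection {
        std::weak_ptr<slot_map> _slots;
        int _id;
    public:
        connection(std::weak_ptr<slot_map> slots, int id)
            : _slots(std::move(slots)), _id(id) {}
        connection(const connection&) = delete;
        connection& operator=(const connection&) = delete;
        ~connection() {
            // The signal may already be gone; only erase if it still exists.
            if (auto slots = _slots.lock()) {
                slots->erase(_id);
            }
        }
    };

    std::unique_ptr<connection> connect(std::function<void(int)> listener) {
        int id = _next_id++;
        _slots->emplace(id, std::move(listener));
        return std::make_unique<connection>(_slots, id);
    }

    // Calls every listener that is still connected.
    void notify(int endpoint_id) const {
        for (auto& entry : *_slots) {
            entry.second(endpoint_id);
        }
    }
};
```

A listener that keeps the returned handle as a member is unregistered automatically when it is destroyed, so the signal never calls into a stopped service.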
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
b78e9b51b7 database, tests: Rework recommended format setting
Tests don't have an sstable format selector and enforce the needed
format by hand with the help of a special database:: method. It's
more natural to provide it via config. Doing this makes database
initialization in main and cql_test_env closer to each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
a42383b127 database, sstables_manager: Sow some noexcepts
Setting the sstables format into database and into sstables_manager is
all plain assignments. Mark them as noexcept; the next patch becomes
evidently exception safe after that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
9a76df96e3 database: Eliminate unused helpers
There are some large-data-handler-related helpers left after previous
patches; they can be removed altogether.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
4b7846da86 database: Merge the stop_database() into database::stop()
After stop_database() became shard-local, it's possible to merge
it with database::stop() as they are both called one after another
on scylla stop. In cql-test-env there are a few more steps in
between, but they don't rely on the database being partially
stopped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
469c734155 database: Flatten stop_database()
The method needs to perform four steps cross-shard synchronously:
first stop compaction manager, then close user and, after it,
system tables, finally shutdown the large data handler.

This patch reworks this synchronization with the help of cross-shard
barrier added to the database previously. The motivation is to merge
.stop_database() with .stop().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
b1013e09b4 database: Equip with cross-shard-barrier
Make sure a node-wide barrier exists on a database when scylla starts.
Also provide a barrier for cql_test_env. In all other cases keep a
solo-mode barrier so that single-shard db stop doesn't get blocked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
634ea4b543 database: Move starting bits into start()
These include large_data_handler::start, compaction_manager::enable
and database::init_commitlog.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:48:48 +03:00
Pavel Solodovnikov
02f27260cc idl: allow specifying multiple attributes in the grammar
This patch extends the IDL grammar to allow multiple `[[...]]`
attribute clauses, as well as more than one attribute inside
a single attribute clause, e.g.:
`[[attr1, attr2]]` will be parsed correctly now.

For now, in all existing use cases only the first attribute
is taken into account and the rest is ignored.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-15 17:47:27 +03:00
Pavel Solodovnikov
7a8cadcca8 message: messaging_service: extract RPC protocol details and helpers into a separate header
Introduce a new header `message/rpc_protocol_impl.hh`, move here
the following things from `message/messaging_service.cc`:

* RPC protocol wrappers implementation
* Serialization thunks
* `register_handler` and `send_message*` functions

This code will be used later for IDL-generated RPC verbs
implementation.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-15 17:47:11 +03:00
Pavel Emelyanov
e2308034ff database: Add .start() method
Called right after the sharded::start(). For now empty, to be populated
by next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:44:48 +03:00
Pavel Emelyanov
80983951fb main: Initialize directories before database
This is to keep all database start (and stop) code together. Right
now directories startup breaks this into two pieces.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:42:20 +03:00
Pavel Emelyanov
c05c58d2b1 main, api: Detach set_server_config from database and move up
The api::set_server_config() depends on sharded database to start, but
really doesn't need it -- it needs only the db::config object, which is
available earlier.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:37:10 +03:00
Pavel Emelyanov
127e4fe8de main: Shorten commitlog creation
This does three things in one go:

- converts

    db.invoke_on_all([] (database& db) {
        return db.init_commitlog();
    });

  into a one-line version

    db.invoke_on_all(&database::init_commitlog);

- removes the shard-0 pre-initialization for tests, because
  tests don't have the problem this pre- solves

- makes init_commitlog() re-entrant so that the regular start
  does not need to check for shard-0 explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
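
The lambda-to-member-pointer conversion in the first bullet works for any dispatcher built on `std::invoke`. The sketch below is a synchronous toy under that assumption: `toy_sharded` and this simplified `database` are hypothetical, and the real `invoke_on_all` in Seastar is asynchronous and returns a future.

```cpp
#include <functional>
#include <vector>

// Hypothetical simplified service with a counter instead of a commitlog.
struct database {
    int commitlog_inits = 0;
    void init_commitlog() { ++commitlog_inits; }
};

// Hypothetical toy container owning one service instance per "shard".
template <typename Service>
struct toy_sharded {
    std::vector<Service> instances;
    explicit toy_sharded(unsigned shards) : instances(shards) {}

    // std::invoke accepts both callables taking Service& and
    // pointers to member functions of Service.
    template <typename Func>
    void invoke_on_all(Func&& func) {
        for (auto& instance : instances) {
            std::invoke(func, instance);
        }
    }
};
```

Both `db.invoke_on_all([] (database& d) { return d.init_commitlog(); })` and `db.invoke_on_all(&database::init_commitlog)` run the same method on every instance; the second form simply drops the wrapper lambda.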
2021-09-15 17:37:07 +03:00
Pavel Emelyanov
f6ab69b7f8 database: Extract commitlog initialization from init_system_keyspace
The intention is to keep all database initialization code in one place.
The init_system_keyspace() is one of the obstacles -- it initializes db's
commitlog as first step.

This patch moves the commitlog initialization out of the mentioned
helper. The result looks clumsy, but it's temporary, next patches will
brush it up.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:36:42 +03:00
Pavel Emelyanov
d156a8993f repair: Shutdown without database help
The sharded database reference is passed into repair_shutdown() just
to have something to call .invoke_on_all() onto. There's the more
appropriate sharded repair_service for this, so use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:48 +03:00
Pavel Emelyanov
6c54c868b8 main: Shift iosched verification upward
There's a block of CLI option sanity checks at the beginning of the
main starting lambda; it's better to have the iosched validation
in this block.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:39 +03:00
Pavel Emelyanov
bd2b7dca0e database: Remove unused mm arg from init_non_system_keyspaces()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:37 +03:00
Pavel Emelyanov
dc92f220e4 database: Drop get_available_memory() helper
It's only used on start to provide the total_memory() value to the
repair configuration code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:32 +03:00
Pavel Emelyanov
7e5abb5096 main, scylla-gdb, cql-test-env: Unify debug::the_database
All the debug:: inhabitants have their names look like "the_<classname>"
This patch brings the database piece to this standard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:30 +03:00
Pavel Emelyanov
e69969b6c7 scylla-gdb: Use find_db helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:26 +03:00
Pavel Emelyanov
75e1d7ea74 large_data_handler: Prepare for stopped qctx
All the large data handler methods rely on global qctx thing to
write down its notes. This creates circular dependency:

query processor -> database -> large_data_handler -> qctx -> qp

In scylla this is not a technical problem, as neither qctx nor the
query processor is stopped. It is a problem in cql_test_env, which
stops everything, including resetting qctx to null. To keep tests
from stepping on a nullptr qctx, add an explicit check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:24 +03:00
Pavel Emelyanov
bb23986826 wasm: Localize it to database usage
The wasm::engine exists as a sharded<> service in main, but it's only
passed by local reference into database on start. There's not much profit
in keeping it at main scope; things get much simpler if the engine is
kept purely on the database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:17 +03:00
Pavel Emelyanov
e324230648 utils: Introduce cross-shard barrier (with test)
Add a synchronization facility to let shards wait for each
other to pass through certain points in the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:12 +03:00
Calle Wilund
dcc73c5d4e commitlog: Recalculate footprint on delete_segment exceptions
Fixes #9348

If we get exceptions in delete_segments, we can, and probably will, lose
track of footprint counters. We need to recompute the used disk footprint,
otherwise we will flush too often, and even block indefinitely on new_seg
iff using hard limits.
2021-09-15 11:53:03 +00:00
Calle Wilund
8a326638af commitlog_test: Add test for exception in alloc w. deleted underlying file
Tests that we can handle exception-in-alloc cleanup if the file actually
does not exist. This however uncovers another weakness (addressed in next
patch) - that we can lose track of disk footprint here, and w. hard limits
end up waiting for disk space that never comes. Thus test does not use hard
limit.
2021-09-15 11:51:05 +00:00
Calle Wilund
21152a2f5a commitlog: Ensure failed-to-create-segment is re-deleted
Fixes #9343

If we fail in allocate_segment_ex, we should push the file opened/created
to the delete set to ensure we reclaim the disk space. We should also
ensure that if we did not recycle a file in delete_segments, we still
wake up any recycle waiters iff we made a file delete instead.

Included a small unit test.
2021-09-15 11:40:34 +00:00
Calle Wilund
f3a9f361b9 commitlog::allocate_segment_ex: Don't re-throw out of function
Fixes #9342

commitlog_error_handler rethrows, but here we do not want that: we want
to run the post-handler cleanup (co_await) instead.
2021-09-15 11:40:34 +00:00
Avi Kivity
cc8fc73761 Merge 'hints: fix bugs in HTTP API for waiting for hints found by running dtest in debug mode' from Piotr Dulikowski
This series of commits fixes a small number of bugs in the current implementation of the HTTP API that allows waiting until hints are replayed. The bugs were found by running the `hintedhandoff_sync_point_api_test` dtest in debug mode.

Refs: #9320

Closes #9346

* github.com:scylladb/scylla:
  commitlog: make it possible to provide base segment ID
  hints: fill up missing shards with zeros in decoded sync points
  hints: propagate abort signal correctly in wait_for_sync_point
  hints: fix use-after-free when dismissing replay waiters
2021-09-15 12:55:54 +03:00
Avi Kivity
daf028210b build: enable -Winconsistent-missing-override warning
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.

Enable the warning and fix numerous violations.

Closes #9347
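
A small illustration of the class of bug that consistent `override` usage guards against. The class names are hypothetical. As far as Clang's behavior goes, `-Winconsistent-missing-override` flags overriding functions that lack `override` in a class that uses the keyword elsewhere; once the keyword is present, a signature mismatch becomes a compile error.

```cpp
struct reader {
    virtual ~reader() = default;
    virtual int next(int /*max_items*/) const { return 0; }
};

struct broken_reader : reader {
    // Missing `const`: this declares a new virtual function that hides
    // reader::next instead of overriding it. Adding `override` here would
    // turn the mistake into a compile error.
    virtual int next(int max_items) { return max_items; }
};

struct fixed_reader : reader {
    // Signature matches the base, and `override` makes the compiler check it.
    int next(int max_items) const override { return max_items; }
};

// Calls through the base interface, as most callers do.
int call_through_base(const reader& r) { return r.next(5); }
```

Calling `call_through_base` with a `broken_reader` silently runs the base implementation, which is exactly the situation the commit message describes.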
2021-09-15 12:55:54 +03:00
Botond Dénes
bd8e2e6691 tools/utils.hh: make self-sufficient
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210915055910.167091-1-bdenes@scylladb.com>
2021-09-15 12:55:54 +03:00
Michał Radwański
7c8b895285 utils/small_vector: remove noexcept from the copy constructor, which potentially throws
The copy constructor of small vector has a noexcept specifier, however
it calls `reserve(size_t)`, which can throw `std::bad_alloc`. This
causes issues when using it inside tests that use
alloc_failure_injector, but could potentially also surface in
production.

Closes #9338
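
For context: when an exception escapes a function declared `noexcept`, `std::terminate` is called, so a `noexcept` copy constructor that allocates turns an allocation failure into a crash. The sketch below uses a hypothetical `toy_vector`, not `utils::small_vector`, and shows how type traits can document the intended guarantees.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical toy container: copying allocates, so the copy constructor
// must not promise noexcept.
class toy_vector {
    int* _data = nullptr;
    std::size_t _size = 0;
public:
    toy_vector() = default;
    explicit toy_vector(std::size_t n) : _data(new int[n]()), _size(n) {}

    // May throw std::bad_alloc from `new`; deliberately not noexcept.
    toy_vector(const toy_vector& o)
        : _data(o._size ? new int[o._size] : nullptr), _size(o._size) {
        for (std::size_t i = 0; i < _size; ++i) {
            _data[i] = o._data[i];
        }
    }

    // Moving only steals the pointer, so noexcept is accurate here.
    toy_vector(toy_vector&& o) noexcept : _data(o._data), _size(o._size) {
        o._data = nullptr;
        o._size = 0;
    }

    toy_vector& operator=(const toy_vector&) = delete;
    ~toy_vector() { delete[] _data; }
    std::size_t size() const noexcept { return _size; }
};

// Compile-time documentation of the exception guarantees.
static_assert(!std::is_nothrow_copy_constructible_v<toy_vector>);
static_assert(std::is_nothrow_move_constructible_v<toy_vector>);
```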
2021-09-15 12:55:54 +03:00
Piotr Dulikowski
91163fcfa5 commitlog: make it possible to provide base segment ID
Adds a configuration option to the commitlog: base_segment_id. When
provided, the commitlog uses this ID as a base of its segment IDs
instead of calculating it based on the number of milliseconds between
the epoch and boot time.

This is needed for the feature that allows waiting for hints to be
replayed - it relies on the replay positions monotonically increasing.
An endpoint manager periodically re-creates its commitlog
instance - if it is re-created when there are no segments on disk,
currently it will choose the number of milliseconds between the epoch
and boot time, which might result in segments being generated with the
same IDs as some segments previously created and deleted during the same
runtime.
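
The collision described above can be reproduced with a toy allocator. This is a sketch under simplifying assumptions: `segment_id_allocator` is hypothetical and only models "start from boot time unless a base is given"; the real commitlog ID scheme is more involved.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical toy: hands out increasing IDs starting from either an
// explicit base or a value derived from boot time.
class segment_id_allocator {
    std::uint64_t _next;
public:
    segment_id_allocator(std::uint64_t boot_millis,
                         std::optional<std::uint64_t> base_segment_id)
        : _next(base_segment_id.value_or(boot_millis)) {}

    std::uint64_t allocate() { return ++_next; }
    std::uint64_t last() const { return _next; }
};
```

Re-creating the allocator with only the boot time restarts the sequence and repeats earlier IDs, while seeding it with the previous instance's `last()` keeps the IDs increasing across re-creations.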
2021-09-15 11:04:34 +02:00
Piotr Dulikowski
486421c58c hints: fill up missing shards with zeros in decoded sync points
Between encoding and decoding of a sync point, the node might have been
restarted and resharded with increased shard count. During resharding,
existing hints segments might have been moved to new shards. Because of
that, we need to make sure that we wait for foreign segments to be
replayed on the new shards too.

This commit modifies the sync point decoding logic so that it places a
zero replay position for new shards. Additionally, an (incorrect) shard
count check is removed from `storage_proxy::wait_for_hint_sync_point`
because now the shard count in decoded sync point is guaranteed to be
not less than the node's current shard count.
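
A minimal sketch of the padding step, assuming a sync point can be modeled as one replay position per shard. The `replay_position` alias and `pad_sync_point` function are hypothetical simplifications, not the actual decoding code.

```cpp
#include <cstdint>
#include <vector>

// Simplified assumption: a replay position is a single integer.
using replay_position = std::uint64_t;

// Extends a decoded sync point with zero positions for shards that did not
// exist when it was encoded; never shrinks it.
std::vector<replay_position> pad_sync_point(std::vector<replay_position> decoded,
                                            unsigned current_shard_count) {
    if (decoded.size() < current_shard_count) {
        decoded.resize(current_shard_count, 0);
    }
    return decoded;
}
```

After padding, the decoded vector always has at least as many entries as the node has shards, which is the guarantee that makes the removed shard count check unnecessary.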
2021-09-15 11:04:34 +02:00
Avi Kivity
08042c1688 Merge 'reader_permit: make query max result size accessible from the permit' from Kamil Braun
This will make it easier, for example, to enforce memory limits in lower
levels of the `flat_mutation_reader` stack.

By default, the query result size is unlimited. However, for specific queries it is
possible to store a different value (e.g. obtained from a `read_command` object)
through a setter. An example of this can be seen in the last commit of this PR,
where we set the limit to `cmd.max_result_size` if engaged, or to the 'unlimited
query' limit (using `database::get_unlimited_query_max_result_size()`) if not.

Refs: #9281. The v2 version of the reverse sstable reader PR will be based on this PR:
we'll use the query max result size parameter in one of the readers down the stack
where `read_command` is not available but `reader_permit` is.

Closes #9341

* github.com:scylladb/scylla:
  table, database: query, mutation_query: remove unnecessary class_config param
  reader_permit: make query max result size accessible from the permit
  reader_concurrency_semaphore: remove default parameter values from constructors
  query_class_config: remove query::max_result_size default constructor
2021-09-14 16:17:18 +03:00
Piotr Dulikowski
77f2448b2c hints: propagate abort signal correctly in wait_for_sync_point
When `manager::wait_for_sync_point` is called, the abort source from the
arguments (`as`) might have already been triggered. In such case, the
subscription which was supposed to trigger the `local_as` abort source
won't be run, and the code will wait indefinitely for hints to be
replayed instead of checking the replay status and returning
immediately.

This commit fixes the problem by manually triggering `local_as` if `as`
has already been triggered.
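
The failure mode can be sketched with a toy abort source. `toy_abort_source` and `link_abort` are hypothetical; the toy assumes, as the commit message describes, that a callback subscribed after the abort was requested is never run.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy abort source: late subscribers are never called.
class toy_abort_source {
    bool _requested = false;
    std::vector<std::function<void()>> _subscribers;
public:
    void subscribe(std::function<void()> cb) {
        if (!_requested) {
            _subscribers.push_back(std::move(cb));
        }
    }
    void request_abort() {
        _requested = true;
        for (auto& cb : _subscribers) {
            cb();
        }
        _subscribers.clear();
    }
    bool abort_requested() const { return _requested; }
};

// Links local_as to as. The explicit check covers the case where `as`
// was triggered before the subscription was made.
void link_abort(toy_abort_source& as, toy_abort_source& local_as) {
    as.subscribe([&local_as] { local_as.request_abort(); });
    if (as.abort_requested()) {
        local_as.request_abort();
    }
}
```

Without the explicit check at the end of `link_abort`, a caller passing an already-triggered source would wait on `local_as` forever.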
2021-09-14 14:27:01 +02:00
Piotr Dulikowski
8e29ebc5d5 hints: fix use-after-free when dismissing replay waiters
When the promise waited on in the `wait_until_hints_are_replayed_up_to`
function is resolved, a continuation runs which prints a log line with
information about this event. The continuation captures a pointer to the
hints sender and uses it to get information about the endpoint whose
hints are waited for. However, at this point the sender might have been
deleted - for example, when the node is being stopped and everybody
waiting for hints is dismissed.

This commit fixes the use-after-free by getting all necessary
information while the sender is guaranteed to be alive and captures it
in the continuation's capture list.
2021-09-14 13:46:16 +02:00
Kamil Braun
c12e265eb8 table, database: query, mutation_query: remove unnecessary class_config param
The semaphore inside was never accessed and `max_memory_for_unlimited_query`
was always equal to `*cmd.max_result_size` so the parameter was completely
redundant.

`cmd.max_result_size` is supposed to be always set in the affected
functions - which are executed on the replica side - as soon as the
replica receives the `read_command` object, in case the parameter was
not set by the coordinator. However, we don't have a guarantee at the
type level (it's still an `optional`). Many places used
`*cmd.max_result_size` without even an assertion.

We make the code a bit safer: we check for `cmd.max_result_size` and if
it's indeed engaged, store it in `reader_permit`. We then access it from
`reader_permit` where necessary. If `cmd.max_result_size` is not set, we
assume this is an unlimited query and obtain the limit from
`get_unlimited_query_max_result_size`.
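
The fallback logic amounts to resolving an `optional` once, at the boundary, instead of dereferencing it unchecked in many places. A minimal sketch with hypothetical simplified types (the real `read_command` and `max_result_size` carry more state):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical simplified types.
struct toy_max_result_size {
    std::uint64_t soft_limit;
    std::uint64_t hard_limit;
};
struct toy_read_command {
    std::optional<toy_max_result_size> max_result_size;
};

// Uses the command's limit when engaged, otherwise the unlimited-query limit.
toy_max_result_size effective_max_result_size(const toy_read_command& cmd,
                                               toy_max_result_size unlimited_query_limit) {
    return cmd.max_result_size.value_or(unlimited_query_limit);
}
```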
2021-09-14 13:39:56 +02:00
Kamil Braun
e8824986dd reader_permit: make query max result size accessible from the permit
This will make it easier, for example, to enforce memory limits in lower
levels of the flat_mutation_reader stack.

By default the size is unlimited. However, for specific queries it is
possible to store a different value (for example, obtained from a
`read_command` object) through a setter.
2021-09-14 13:27:25 +02:00
Kamil Braun
fbb83dd5ca reader_concurrency_semaphore: remove default parameter values from constructors
It's easy to forget about supplying the correct value for a parameter
when it has a default value specified. It's safer if 'production code'
is forced to always supply these parameters manually.

The default values were mostly useful in tests, where some parameters
didn't matter that much and where the majority of uses of the class are.
Without default values adding a new parameter is a pain, forcing one to
modify every usage in the tests - and there are a bunch of them. To
solve this, we introduce a new constructor which requires passing the
`for_tests` tag, marking that the constructor is only supposed to be
used in tests (and the constructor has an appropriate comment). This
constructor uses default values, but the other constructors - used in
'production code' - do not.
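
The tag-constructor pattern described above, as a minimal sketch. `toy_semaphore`, its parameters, and its default values are hypothetical; only the `for_tests` tag name comes from the commit message.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class toy_semaphore {
    int _count;
    std::size_t _memory;
    std::string _name;
public:
    // Tag type: passing it states at the call site that this is test code.
    struct for_tests {};

    // Production constructor: every parameter must be spelled out.
    toy_semaphore(int count, std::size_t memory, std::string name)
        : _count(count), _memory(memory), _name(std::move(name)) {}

    // Test-only constructor: defaults are allowed because the tag makes
    // the intent explicit.
    explicit toy_semaphore(for_tests, int count = 10,
                           std::size_t memory = 1 << 20,
                           std::string name = "test")
        : toy_semaphore(count, memory, std::move(name)) {}

    int count() const { return _count; }
    std::size_t memory() const { return _memory; }
    const std::string& name() const { return _name; }
};
```

Adding a new parameter then only requires touching the production call sites and the single test constructor, not every test.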
2021-09-14 12:20:28 +02:00
Kamil Braun
8386b55e9c query_class_config: remove query::max_result_size default constructor
The default values for the fields of this class didn't make much sense,
and the default constructor was used only in a single place so removing
it is trivial.

It's safer when the user is forced to supply the limits.
2021-09-14 12:20:28 +02:00
Avi Kivity
3f2c680b70 Merge 'Add initial support for WebAssembly in user-defined functions (UDF)' from Piotr Sarna
This series adds very basic support for WebAssembly-based user-defined functions.

This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation.

Example usage:
```cql
CREATE FUNCTION ks.fibonacci (n int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE xwasm
    AS ' (module
  (func $fibonacci (param $n i32) (result i32)
    (if
      (i32.lt_s (local.get $n) (i32.const 2))
      (return (local.get $n))
    )
    (i32.add
      (call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
      (call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
    )
  )
  (export "fibonacci" (func $fibonacci))
) '
```

Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future.

Closes #9108

* github.com:scylladb/scylla:
  docs: add a WebAssembly entry
  cql-pytest: add wasm-based tests for user-defined functions
  main: add wasm engine instantiation
  treewide: add initial WebAssembly support to UDF
  wasm: add initial WebAssembly runtime implementation
  db: add wasm_engine pointer to database
  lang: add wasm_engine service
  import wasmtime.hh
  lua: move to lang/ directory
  cql3: generalize user-defined functions for more languages
2021-09-14 11:34:20 +03:00
Avi Kivity
e9ae9279e8 system_keyspace: reindent after conversion to class
Conversion to class left indentation in ruins, but that can be easily
fixed. 'git diff -w' reports no changes.

Closes #9339
2021-09-14 08:49:24 +03:00
Avi Kivity
64537beb38 Update tools/java submodule (nodetool stop reshape)
* tools/java 3b378f7095...9c5c0ad1fd (1):
  > nodetool stop: Support Reshape
2021-09-13 21:17:01 +03:00
Piotr Sarna
6c4a71cdea docs: add a WebAssembly entry
The doc briefly describes the state of WASM support
for user-defined functions.
2021-09-13 19:03:58 +02:00
Piotr Sarna
41b94d3cf3 cql-pytest: add wasm-based tests for user-defined functions
A first set of wasm-based test cases is added.
The tests include verifying that supported types work
and that validation of the input wasm is performed.
2021-09-13 19:03:58 +02:00
Piotr Sarna
4959136afd main: add wasm engine instantiation
Once the engine is up, it can be used to execute user-defined
functions.
2021-09-13 19:03:58 +02:00
Piotr Sarna
62e8c89a9c treewide: add initial WebAssembly support to UDF
This commit adds very basic support for user-defined functions
coded in wasm. The support is very limited (only a few types work)
and was not tested against reactor stalls and performance in general.
2021-09-13 19:03:58 +02:00
Piotr Sarna
78afd518a8 wasm: add initial WebAssembly runtime implementation
The engine is based on wasmtime and is able to:
 - compile wasm text format to bytecode
 - run a given compiled function with custom arguments

This implementation is missing crucial features, like operating
on any types other than 32-bit integers. It serves as a skeleton
for a future full implementation.
2021-09-13 19:03:58 +02:00
Avi Kivity
e9343fd382 Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
    cql3::raw_value value;
    data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.

This struct will eventually replace all classes deriving from `terminal`, but for now `terminal` can't be removed completely.

We can't get rid of terminal yet, because sometimes `terminal` is converted back to `term`, which `constant` can't do. This won't be a problem once we replace term with expression.

`bool` is removed from `expression`, now `constant` is used instead.

This is a redesign of PR #9203, there is some discussion about the chosen representation there.

Closes #9244

* github.com:scylladb/scylla:
  cql3: term: Remove get_elements and multi_item_terminal from terminals
  cql3: Replace most uses of terminal with expr::constant
  cql3: expr: Remove repetition from expr::get_elements
  cql3: expr: Add expr::get_elements(constant)
  cql3: term: remove term::bind_and_get
  cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
  cql3: expr: Add evaluate_IN_list
  cql3: tuples: Implement tuples::in_value::get
  cql3: Move data_type to terminal, make get_value_type non-virtual
  cql3: user_types: Implement get_value_type in user_types.hh
  cql3: tuples: Implement get_value_type in tuples.hh
  cql3: maps: Implement get_value_type in maps.hh
  cql3: sets: Implement get_value_type in sets.hh
  cql3: lists: Implement get_value_type in lists.hh
  cql3: constants: Implement get_value_type in constants.hh
  cql3: expr: Add expr::evaluate
  cql3: values: Add unset value to raw_value_view::make_temporary
  cql3: expr: Add constant to expression
2021-09-13 19:26:09 +03:00
Nadav Har'El
27138b215b Merge 'system_keyspace: convert from namespace to class' from Avi Kivity
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided dependencies.

Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.

Closes #9335

* github.com:scylladb/scylla:
  system_keyspace: convert from namespace to class
  system_keyspace: prepare forward-declared members
  system_keyspace: rearrange legacy subnamespace
  system_keyspace: remove outdated java code
2021-09-13 19:01:42 +03:00
Avi Kivity
1b75e9312d Update tools/java and tools/jmx submodules (load-and-stream support)
* tools/java a2fe67fd42...3b378f7095 (1):
  > nodetool: add `--load-and-stream` option to `refresh`

* tools/jmx 70b19e6...658818b (1):
  > Support `--load-and-stream` option from `nodetool refresh`
2021-09-13 18:48:11 +03:00
Tomasz Grabiec
890b861d20 Merge 'query::reverse_slice(): toggle reversed bit instead of setting it' from Botond Dénes
The above mentioned method is supposed to work both ways: reversed <-> forward, so setting the reversed bit is not correct: it should be toggled, which is what this mini-series does.
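
The difference between setting and toggling can be illustrated with a small
sketch. It is written in Python rather than the C++ of enum_set, and the
`Option` flags are hypothetical names chosen for the illustration.

```python
from enum import Flag


class Option(Flag):
    """Hypothetical slice options; only REVERSED matters here."""
    REVERSED = 1
    DISTINCT = 2


def reverse_by_setting(options: Option) -> Option:
    # Buggy variant: the result is always reversed, so applying it to an
    # already reversed slice does not bring it back to forward order.
    return options | Option.REVERSED


def reverse_by_toggling(options: Option) -> Option:
    # XOR flips the bit: forward -> reversed and reversed -> forward.
    return options ^ Option.REVERSED
```

Toggling makes the operation its own inverse, which is the property a
reversed <-> forward conversion needs.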

Closes #9327

* github.com:scylladb/scylla:
  reverse_slice(): toggle reversed bit instead of setting it
  partition_slice_builder(): add with_option_toggled()
  enum_set: add toggle()
2021-09-13 18:48:11 +03:00
Takuya ASADA
f93793da7e configure.py: remove $builddir/release/{scylla_product}-python3-{arch}-package.tar.gz from dist-python3 target
'$builddir/release/{scylla_product}-python3-package.tar.gz' on
dist-python3 target is for compat-python3, we forgot to remove at 35a14ab.

Fixes #9333

Closes #9334
2021-09-13 18:48:10 +03:00
Jan Ciolek
fd98d40b75 cql3: term: Remove get_elements and multi_item_terminal from terminals
terminal now isn't used as a final value anywhere.
Remove things that are no longer needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:47:17 +02:00
Jan Ciolek
a0ec2113ae cql3: Replace most uses of terminal with expr::constant
constant is now ready to replace terminal as a final value representation.
Replace bind() with evaluate and shared_ptr<terminal> with constant.

We can't get rid of terminal yet. Sometimes terminal is converted back
to term, which constant can't do. This won't be a problem once we
replace term with expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:47:17 +02:00
Jan Ciolek
b67f72037f cql3: expr: Remove repetition from expr::get_elements
There was some repeating code in expr::get_elements family
of functions. It has been reduced into one function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:47:17 +02:00
Jan Ciolek
8b475a966c cql3: expr: Add expr::get_elements(constant)
We need to be able to access elements of a constant.
Adds functions to easily do it.

Those functions check all preconditions required to access elements
and then use partially_deserialize_* or similar.

It's much more convenient than using partially_deserialize directly.

get_list_of_tuples_elements is useful with IN restrictions like
(a, b) IN [(1, 2), (3, 4)].

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:47:17 +02:00
Jan Ciolek
134b76f5d9 cql3: term: remove term::bind_and_get
term::bind_and_get is not needed anymore, remove it.

Some classes use bind_and_get internally, those functions are left intact
and renamed to bind_and_get_internal.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:47:17 +02:00
Avi Kivity
ed396a31f3 Merge "Remove global storage proxy from cdc" from Pavel E
"
There's a single call to get_local_storage_proxy in cdc
code, used only to get the database. Fortunately, the
database can be easily provided there via a call argument.

tests: unit(dev)
"

* 'br-remove-proxy-from-cdc' of https://github.com/xemul/scylla:
  cdc: Add database argument to is_log_for_some_table
  client_state: Pass database into has_access()
  client_state: Add database argument to has_schema_access
  client_state: Add database argument to has_keyspace_access()
  cdc: Add database argument to check_for_attempt_to_create_nested_cdc_log
2021-09-13 18:45:46 +03:00
Avi Kivity
f3712d4767 Merge "Avoid nested seastar::async in tests" from Pavel E
"
There's a bunch of explicit and implicit async contexts nested
in sstables tests. This set turns them into a single level of async
(mostly with an awk script).

The indentation in first two patches is deliberately left as it
was before patching, i.e. -- slightly broken. As a consolation,
after the third patch it suddenly becomes fixed as the unneeded
intermediate call with broken indent is removed.

tests: unit(dev)
"

* 'br-sst-tests-no-nested-async' of https://github.com/xemul/scylla:
  test: Don't nest seastar::async calls (2nd cont)
  test: Don't nest seastar::async calls (cont)
  test: Don't nest seastar::async calls
2021-09-13 18:45:46 +03:00
Takuya ASADA
f928dced0c scylla_cpuscaling_setup: add --force option
To build an Ubuntu AMI with CPU scaling configuration, we need a force
mode for scylla_cpuscaling_setup, which runs the setup without
checking scaling_governor support.

See scylladb/scylla-machine-image#204

Closes #9326
2021-09-13 18:45:46 +03:00
Jan Ciolek
c3fb2f2b57 cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
Start using evaluate_to_raw_view instead of bind_and_get.
This is a step towards using only evaluate.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:44:06 +02:00
Botond Dénes
6b5936812f reverse_slice(): toggle reversed bit instead of setting it
reverse_slice() works in both directions: reverse <-> forward, so it
cannot unconditionally set the reversed bit, instead it should toggle
it.
2021-09-13 18:05:11 +03:00
Botond Dénes
e16c388437 partition_slice_builder(): add with_option_toggled() 2021-09-13 18:05:11 +03:00
Botond Dénes
96c95119f9 enum_set: add toggle() 2021-09-13 18:05:11 +03:00
Jan Ciolek
25caa1950d cql3: expr: Add evaluate_IN_list
A list representing IN values might contain NULLs before evaluation.
We can remove them during evaluation, because nothing equals NULL.
If we don't remove them, there are gonna be errors, because a list can't contain NULLs.
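
A minimal sketch of the idea, in Python instead of the C++ of the commit,
with `None` standing in for NULL. The function names are hypothetical and
do not mirror the signature of evaluate_IN_list.

```python
from typing import List, Optional, Sequence


def evaluate_in_list(values: Sequence[Optional[int]]) -> List[int]:
    """Drop NULLs (None) from an IN list; no value ever equals NULL."""
    return [v for v in values if v is not None]


def matches_in(column_value: Optional[int],
               in_values: Sequence[Optional[int]]) -> bool:
    # A NULL column value matches nothing either.
    if column_value is None:
        return False
    return column_value in evaluate_in_list(in_values)
```

Dropping the NULLs does not change which rows match, since a comparison
against NULL never succeeds.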

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
5a90fd097a cql3: tuples: Implement tuples::in_value::get
To convert a terminal to expr::constant we need to be able to serialize it.
tuples::in_value didn't have serialization implemented, do it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
9b6b2899ed cql3: Move data_type to terminal, make get_value_type non-virtual
Every class now has an implementation of get_value_type().
We can simply make the base class keep the data_type.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
9b3478e1cd cql3: user_types: Implement get_value_type in user_types.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in user_types.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
68b65771a7 cql3: tuples: Implement get_value_type in tuples.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in tuples.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
319b6608b0 cql3: maps: Implement get_value_type in maps.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in maps.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
5a755cda2b cql3: sets: Implement get_value_type in sets.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in sets.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
0b3436598a cql3: lists: Implement get_value_type in lists.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in lists.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
60a34236ee cql3: constants: Implement get_value_type in constants.hh
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in constants.hh.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
844bf2d472 cql3: expr: Add expr::evaluate
Adds the functions:
constant evaluate(term*, const query_options&);
raw_value_view evaluate(term*, const query_options&);

These functions take a term, bind it and convert the terminal
to constant or raw_value_view.

In the future these functions will take expression instead of term.
For that to happen bind() has to be implemented on expression,
this will be done later.

Also introduces terminal::get_value_type().
In order to construct a constant from terminal we need to know the type.
It will be implemented in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
da099dd922 cql3: values: Add unset value to raw_value_view::make_temporary
When unset_value is passed to make_temporary it gets converted to null_value.
This looks like a mistake, it should be just unset_value.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:23 +02:00
Jan Ciolek
79cb268ada cql3: expr: Add constant to expression
Adds constant to the expression variant:
struct constant {
    raw_value value;
    data_type type;
};

This struct will be used to represent constant values with known bytes and type.
This corresponds to the terminal from current design.

bool is removed from expression, now constant is used instead.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-13 17:03:21 +02:00
Avi Kivity
e70b9d4835 system_keyspace: convert from namespace to class
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided dependencies.

Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.

Indentation is a mess, but can be easily fixed later.
2021-09-13 15:14:14 +03:00
Avi Kivity
115d6d8d4c system_keyspace: prepare forward-declared members
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.

The result isn't lovely, but can be adjusted later.
2021-09-13 15:11:26 +03:00
Avi Kivity
c6ce81d6a0 system_keyspace: rearrange legacy subnamespace
Merge two fragments together, in anticipation of making 'legacy'
a struct instead of a namespace (when system_keyspace is a class,
we can't nest a namespace inside it).
2021-09-13 15:10:15 +03:00
Avi Kivity
6d379ae6f9 system_keyspace: remove outdated java code
This code has either been rewritten without removing the original, or is no longer needed.
Remove it to reduce clutter.
2021-09-13 15:08:57 +03:00
Michał Chojnowski
7df9deb628 service: storage_proxy: don't compute probabilistic read repair decisions when probability is 0
On ARM, the code in libstdc++ responsible for computing random
floating-point numbers is very slow, because it uses `long double`
arithmetic, which is emulated in software on this architecture.

The performance effect on read queries is noticeable – about 6% of the
total work of a read from cache. Since probabilistic read repair is almost
always disabled (and under consideration for removal) let's just optimize
the case when it's disabled.
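
The optimization amounts to a short-circuit in front of the random number
generator. A minimal Python sketch of that shape follows; the function
name and the injectable `rng` parameter are assumptions made for the
illustration, not the storage_proxy code.

```python
import random
from typing import Callable


def should_read_repair(chance: float,
                       rng: Callable[[], float] = random.random) -> bool:
    """Decide whether to trigger a probabilistic read repair."""
    # With a chance of 0 the answer is always "no", so the (possibly
    # slow) random number generator is never invoked.
    if chance <= 0.0:
        return False
    return rng() < chance
```

The common disabled case then costs a single comparison instead of a
floating-point random draw.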

Fixes #9107

Closes #9329
2021-09-13 12:31:14 +03:00
Piotr Sarna
83f46e6e6f db: add wasm_engine pointer to database
WASM engine needs to be used from two separate contexts:
 - when a user-defined function is created via CQL
 - when a user-defined function is received during schema migration

The object that these two contexts have in common is the database
object, so that's where the reference is stored.
2021-09-13 11:01:33 +02:00
Piotr Sarna
5e6fa47198 lang: add wasm_engine service
WASM engine stores the wasm runtime engine for user-defined functions.
2021-09-13 11:01:33 +02:00
Piotr Sarna
4caf57f730 import wasmtime.hh
Courtesy of https://github.com/bytecodealliance/wasmtime-cpp .
Taken as is, with a small licensing blurb added on top.
2021-09-13 11:01:33 +02:00
Piotr Sarna
4e952df470 lua: move to lang/ directory
Support for more languages is coming, so let's group them
in a separate directory.
2021-09-13 11:01:33 +02:00
Piotr Sarna
46c6603fe0 cql3: generalize user-defined functions for more languages
In order to support more languages than just Lua in the future,
Lua-specific configuration is now extracted to a separate
structure.
2021-09-13 11:01:33 +02:00
Avi Kivity
61c9df4bd2 Merge "Split sstable_conforms_to_mutation_source" from Pavel E
"
The test contains a single case that runs 6 different
cases inside and is one of the longest tests out there.
Splitting it improves parallel-cases suite run time.

tests: unit(dev, debug, release)
"

* 'br-split-sst-conforms-to-ms' of https://github.com/xemul/scylla:
  tests: Fix indentation after previous patch
  tests: Split sstable_conforms_to_mutation_source
2021-09-13 11:27:44 +03:00
Avi Kivity
1fd701e709 test: cql-pytest: skip tests depending on timeuuid monotonicity
timeuuid is not monotonic when now() is called on different connections,
so when running tests that depend on that property, we get failures if
using the Scylla driver (which became standard in 729d0fe).

Skip the tests for now, until we figure out what to do. We probably
can't make now() globally monotonic, and there isn't much to gain
by making it monotonic only per connection, since clients are allowed
to switch connections (and even nodes) at will.

Ref #9300

Closes #9323

[avi: committing my own patch to unblock master]
2021-09-12 19:30:40 +03:00
Nadav Har'El
1d4474d543 test/alternator/run: don't run Scylla if "--aws" option
The test/alternator/run script runs Scylla and then runs pytest against
it. But when passing the "--aws" option, the intention is that these
tests be run against AWS DynamoDB, not a local Scylla, so there is no
point in starting Scylla at all - so this is what we do in this patch.

This doesn't really add a new feature - "test/alternator/run --aws"
will now be nothing more than "cd test/alternator; pytest --aws".
But it adds the convenience that you can run the same tests on Scylla
and AWS with exactly the same "run" command, just adding the "--aws"
option, and don't need to sometimes use "run" and sometimes "pytest".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912133239.75463-1-nyh@scylladb.com>
2021-09-12 16:50:38 +03:00
Avi Kivity
c5f52f9d97 schema_tables: don't flush in tests
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.

before:
real	8m51.347s
user	7m5.743s
sys	5m11.185s

after:
real	7m4.249s
user	5m14.085s
sys	2m11.197s

Note real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.
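
For reference, the relative savings can be derived from the figures above
with a small helper. This is only a convenience sketch in Python; the
helper names are made up here, and the parser handles only the `NmS.SSSs`
form printed by `time`.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '8m51.347s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def reduction_percent(before: str, after: str) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (b - a) / b


# Figures copied from the measurements above.
for label, before, after in [("real", "8m51.347s", "7m4.249s"),
                             ("user", "7m5.743s", "5m14.085s"),
                             ("sys", "5m11.185s", "2m11.197s")]:
    print(f"{label}: {reduction_percent(before, after):.1f}% less")
```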

Closes #9319
2021-09-12 11:32:13 +03:00
Raphael S. Carvalho
acba3bd3c4 sstables: give a more descriptive name to compaction_options
The name compaction_options is confusing as it overlaps in meaning
with compaction_descriptor. It's hard to tell what the exact
difference between them is without digging into the implementation.

compaction_options is intended to only carry options specific to
a given compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.

[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
2021-09-12 11:21:33 +03:00
Benny Halevy
389ef9316f compaction: scrub/validate: prevent printing non-utf8 partition keys
Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)

For example:
```
Traceback (most recent call last):
  File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
    self.cleanUpCluster()
  File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
    matches = node.grep_log(expr)
  File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
    for line in f:
  File "/usr/lib64/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```
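
The failure is easy to reproduce on the reader side. The sketch below shows
the strict decode failing on such a byte, plus two ways around it: hex
rendering on the writer side and lenient decoding on the reader side. How
this commit actually renders the keys is not shown here; hex output is an
assumption of the sketch, and both helper names are hypothetical.

```python
def key_for_log(key: bytes) -> str:
    """Render raw key bytes as ASCII hex, which is always valid UTF-8."""
    return key.hex()


def read_log_leniently(raw: bytes) -> str:
    """Reader-side fallback: undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


corrupt_key = b"\xb3\x00abc"  # 0xb3 cannot start a UTF-8 sequence

try:
    corrupt_key.decode("utf-8")  # strict decoding, as in the traceback
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
```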

Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>
2021-09-12 10:52:18 +03:00
Tomasz Grabiec
83113d8661 Merge "raft: new schema for storing raft snapshots" from Pavel Solodovnikov
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
   documentation in the future.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
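
To make the "exploded form" concrete, here is a sketch of how one
configuration could map to `raft_config` rows, one row per member and
disposition. It is plain Python for illustration: the `Member` type and
the `explode` helper are hypothetical, and only the column names come
from the schema above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Member:
    """Hypothetical view of one server in a raft configuration."""
    server_id: str
    ip_addr: str
    can_vote: bool


def explode(group_id: str, my_server_id: str, current: List[Member],
            previous: List[Member]) -> List[Dict[str, object]]:
    """Turn a configuration into one row per (member, disposition)."""
    rows: List[Dict[str, object]] = []
    for disposition, members in (("CURRENT", current),
                                 ("PREVIOUS", previous)):
        for member in members:
            rows.append({
                "group_id": group_id,
                "my_server_id": my_server_id,
                "server_id": member.server_id,
                "disposition": disposition,
                "can_vote": member.can_vote,
                "ip_addr": member.ip_addr,
            })
    return rows
```

Each row is readable on its own via CQL, which is what makes this layout
easier to inspect than a single serialized blob.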

Tests: unit(dev)

* manmanson/raft_snapshots_new_schema_v2:
  test: adjust `schema_change_test` to include new `system.raft_config` table
  raft: new schema for storing raft snapshots
  raft: pass server id to `raft_sys_table_storage` instance
2021-09-10 20:41:59 +02:00
Avi Kivity
16116ac631 interval: constrain comparator parameters
The interval template member functions mostly accept
tri-comparators but a few functions accept less-comparators.
To reduce the chance of error, and to provide better error
messages, constrain comparator parameters to the expected
signature.
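
The hazard being closed can be shown with a loose analogue. The commit
constrains C++ template parameters at compile time; the Python sketch
below checks the comparator's result at run time instead, and all names
in it are hypothetical.

```python
from functools import cmp_to_key
from typing import Callable, List


def tri_compare(a: int, b: int) -> int:
    """Three-way comparator: negative, zero or positive."""
    return (a > b) - (a < b)


def less(a: int, b: int) -> bool:
    """Less-comparator: only answers whether a sorts before b."""
    return a < b


def checked_tri(cmp: Callable[[int, int], object]) -> Callable[[int, int], int]:
    """Reject comparators that do not return a plain int."""
    def wrapper(a: int, b: int) -> int:
        result = cmp(a, b)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(result, bool) or not isinstance(result, int):
            raise TypeError("expected a three-way comparator returning int")
        return result
    return wrapper


def sort_with(values: List[int], cmp: Callable[[int, int], object]) -> List[int]:
    """Sort using a three-way comparator, refusing less-comparators."""
    return sorted(values, key=cmp_to_key(checked_tri(cmp)))
```

Without such a check, a less-comparator passed where a tri-comparator is
expected is silently misread, because "not less" and "equal" both look
like zero.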

In one case (db/size_estimates_virtual_reader.cc) the caller
had to be adjusted. The comparator supported comparisons
of the interval value type against other types, but not
against itself. To simplify things, we add that signature too,
even though it will never be called.

Closes #9291
2021-09-10 16:43:16 +02:00
Avi Kivity
7a798b44a2 cql3: expr: replace column_value_tuple by a composition of tuple_constructor and column_value
column_value_tuple overlaps both column_value and tuple_constructor
(in different respects) and can be replaced by a combination: a
tuple_constructor of column_value. The replacement is more expressive
(we can have a tuple of column_value and other expression types), though
the code (especially the grammar) does not allow it yet.

So remove column_value_tuple and replace it everywhere with
tuple_constructor. Visitors get the merged behavior of the existing
tuple_constructor and column_value_tuple, which is usually trivial
since tuple_constructor and column_value_tuple came from different
hierarchies (term::raw and relation), so usually one of the types
just calls on_internal_error().

The change results in awkward casts in two areas: WHERE clause
filtering (equal() and related), and clustering key range evaluations
(limits() and related). When equal() is replaced by recursive
evaluate(), the casts will go away (to be replaced by the evaluate()
visitor). Clustering key range extraction will remain limited
to tuples of column_value, so the prepare phase will have to vet
the expressions to ensure the casts don't fail (and use the
filtering path if they will).

Tests: unit (dev)

Closes #9274
2021-09-10 10:43:29 +02:00
Piotr Sarna
234c2b9f6d Merge 'Scrub compaction serialization' from Benny Halevy
Currently scrub compaction filters out sstables that are undergoing
(regular) compaction.  This is surprising to the user and we would like
scrub (in validate mode or otherwise) to examine all sstables in the
table.

Scrub in VALIDATE mode is read-only, therefore it can run in parallel to
regular compaction. However, this series makes sure it selects all
sstables in the table, without filtering sstables undergoing compaction.

For scrub in non-validation mode, we would like to ensure that it
examined all sstables that were sealed when it started and it fixed any
corruption (based on the scrub mode). Therefore, we stop ongoing
compactions when running scrub in non-validation modes. Otherwise
compaction might just copy the corrupt data onto new sstables, requiring
scrub to run again.

Also, acquire _compaction_locks write lock for the table to serialize
with other custom compaction jobs like major compaction, reshape, and
reshard.

Fixes #9256

Test: unit(dev) DTest:
nodetool_additional_test.py:TestNodetool.{validate_sstable_with_invalid_fragment_test,
  validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test}

Closes #9258

* github.com:scylladb/scylla:
  compaction_manager: rewrite_sstables: acquire _compaction_locks
  compaction_manager: perform_sstable_scrub: run_with_compaction_disabled
  compaction: don't rule out compacting sstables in validate-mode scrub
2021-09-09 18:33:43 +02:00
Avi Kivity
219fdcd8da Merge 'tools: introduce scylla-sstable' from Botond Dénes
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
  the sstable(s);
* custom - a hackable operation for the expert user (until scripting
  support is implemented);
* validate - validate the content of the sstable(s) with the mutation
  fragment stream validator, same as scrub in validate mode;

The sstables to-be-examined are passed as positional command line
arguments. Sstables will be processed by the selected operation
one-by-one (can be changed with `--merge`). Any number of sstables can
be passed but mind the open file limits. Pass the full path to the data
component of the sstables (*-Data.db). For now it is required that the
sstable is found at a valid data path:

    /path/to/datadir/{keyspace_name}/{table_name}-{table_id}/

The schema to read the sstables is read from a `schema.cql` file. This
should contain the keyspace and table definitions, as well as any UDTs
used.
Filtering the sstable(s) to process only certain partition(s) is supported
via the `--partition` and `--partitions-file` command line flags.
Partition keys are expected to be in the hexdump format used by scylla
(hex representation of the raw buffer).
Operations write their output to stdout, or file(s). The tool logs to
stderr, with a logger called `scylla-sstable-crawler`.

Examples:

    # dump the content of the sstable
    $ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db

    # dump the content of the two sstable(s) as a unified stream
    $ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

    # generate a joint histogram for the specified partition
    $ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db

    # validate the specified sstables
    $ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

Future plans:
* JSON output for dump.
* A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output.
* Relax sstable path checks, so sstables can be loaded from any path.
* Add scripting support (Lua), allowing custom operations to be written
  in a scripting language.

Refs: #9241

Closes #9271

* github.com:scylladb/scylla:
  tools: remove scylla-sstable-index
  tools: introduce scylla-sstable
  tools: extract finding selected operation (handler) into function
  tools: add schema_loader
  cql3: query_processor: add parse_statements()
  cql3: statements/create_type: expose create_type()
  cql3: statements/create_keyspace: add get_keyspace_metadata()
2021-09-09 19:24:06 +03:00
Avi Kivity
c1028de22a Merge 'Introduce native reversed format' from Botond Dénes
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass along them a reversed schema and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
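
The bound swapping described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: `BoundKind`, `RangeTombstone` and `reverse_stream` are hypothetical stand-ins, not the actual C++ types, and integer clustering keys are assumed for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class BoundKind(Enum):
    INCL_START = "incl_start"
    EXCL_START = "excl_start"
    INCL_END = "incl_end"
    EXCL_END = "excl_end"


# Assumed mapping: a start bound becomes an end bound with the same
# inclusiveness, and vice versa.
_REVERSED_KIND = {
    BoundKind.INCL_START: BoundKind.INCL_END,
    BoundKind.EXCL_START: BoundKind.EXCL_END,
    BoundKind.INCL_END: BoundKind.INCL_START,
    BoundKind.EXCL_END: BoundKind.EXCL_START,
}


def reverse_kind(kind):
    return _REVERSED_KIND[kind]


@dataclass(frozen=True)
class RangeTombstone:
    start: int
    start_kind: BoundKind
    end: int
    end_kind: BoundKind
    timestamp: int

    def reverse(self):
        # The old end becomes the new start (and the old start the new end),
        # so the tombstone looks as if it came from a table with reversed
        # clustering order.
        return RangeTombstone(
            self.end, reverse_kind(self.end_kind),
            self.start, reverse_kind(self.start_kind),
            self.timestamp,
        )


def reverse_stream(tombstones):
    # Emit the tombstones in reverse order, each with swapped bounds.
    return [rt.reverse() for rt in reversed(tombstones)]
```

In this model, reversing twice yields the original tombstone, which is what lets the middle layers treat a reversed stream like any other stream read with the reversed schema.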

This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
  reversed streams, this layer will create a reversed reader to handle
  the transition.
* The mutation source proper: which will obtain the underlying schema
  and will emit the data in reverse order.

Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.

Refs: #1413

Tests: unit(dev)

TODO:
* v2
* more testing

Also on: https://github.com/denesb/scylla.git reverse-reads/v3

Changelog

v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
  in the schema itself, instead make reversing deterministic w.r.t.
  schema version, such that:
  `s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
  `reconcilable_results` sent to the coordinator during read repair
  still use the old reverse format;

v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
    - `schema::underlying_type()`: return this when unengaged;
    - `schema::make_transformed()`: noop when applying the same
      transformation again;
* Generalize reversed into transformation. Add support to transferring
  to remote nodes and shards by way of making `schema_tables` aware of
  the transformation;
* Use reverse schema everywhere in reverse reader;

Closes #9184

* github.com:scylladb/scylla:
  range_tombstone_accumulator: drop _reversed flag
  test/boost/mutation_test: add test for mutation::consume() monotonicity
  test/boost/flat_mutation_reader_test: more reversed reader tests
  flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
  flat_mutation_reader: make_reversing_reader(): take ownership of the reader
  test/lib/mutation_source_test: add consistent log to all methods
  mutation: introduce reverse()
  mutation_rebuilder: make it standalone
  mutation: make copy constructor compatible with mutation_opt
  treewide: switch to native reversed format for reverse reads
  mutation: consume(): add native reverse order
  mutation: consume(): don't include dummy rows
  query: add slice reversing functions
  partition_slice_builder: add range mutating methods
  partition_slice_builder: add constructor with slice
  query: specific_ranges: add non-const ranges accessor
  range_tombstone: add reverse()
  clustering_bounds_comparator: add reverse_kind()
  schema: introduce make_reversed()
  schema: add a transforming copy constructor
  utils: UUID_gen: introduce negate()
  types: add reversed(data_type)
  docs: design-notes: add reverse-reads.md
2021-09-09 15:50:22 +03:00
Botond Dénes
f02632aeb0 range_tombstone_accumulator: drop _reversed flag 2021-09-09 15:42:15 +03:00
Botond Dénes
f07805c3ef test/boost/mutation_test: add test for mutation::consume() monotonicity
In both forward and reverse modes.
2021-09-09 15:42:15 +03:00
Botond Dénes
3cc882f6a8 test/boost/flat_mutation_reader_test: more reversed reader tests
Check that the reverse reader emits a stream identical to that emitted
by a reader reading in native order from a table with reversed
clustering order.
2021-09-09 15:42:15 +03:00
Botond Dénes
bf38e204af flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range) 2021-09-09 15:42:15 +03:00
Botond Dénes
350440b418 flat_mutation_reader: make_reversing_reader(): take ownership of the reader
Makes for much simpler client code.
2021-09-09 15:42:15 +03:00
Botond Dénes
c71a281e6b test/lib/mutation_source_test: add consistent log to all methods
Most test methods log their own name either via testlog.info() or
BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do
however. This commit fixes this and also converts all those using
BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.
2021-09-09 15:42:15 +03:00
Botond Dénes
1d6896c14f mutation: introduce reverse()
Which reverses the mutation as if it was created with a schema with
reversed clustering order.
2021-09-09 15:42:15 +03:00
Botond Dénes
74a22a706b mutation_rebuilder: make it standalone
Not requiring a wrapper object to become usable.
2021-09-09 15:42:15 +03:00
Botond Dénes
16b9d19e50 mutation: make copy constructor compatible with mutation_opt
Currently `_data` is assumed to be engaged by the copy constructor which
is not necessarily the case with `mutation_opt` objects (which is an
`optimized_optional<mutation>`). Fix this by only copying `_data` if
non-null.
2021-09-09 15:42:15 +03:00
Botond Dénes
502a45ad58 treewide: switch to native reversed format for reverse reads
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
2021-09-09 15:42:15 +03:00
Botond Dénes
0af5a8add0 mutation: consume(): add native reverse order
The existing consume_in_reverse::yes is renamed to
consume_in_reverse::legacy_half_reverse and consume_in_reverse::yes now
means native reverse order. This is because we expect the legacy order
to die out at one point and when that happens we can just remove that
ugly third option and will be left with yes and no as before.
2021-09-09 14:18:32 +03:00
Botond Dénes
38ef80d4d2 mutation: consume(): don't include dummy rows 2021-09-09 14:18:32 +03:00
Botond Dénes
5d33d76cfd query: add slice reversing functions 2021-09-09 14:18:32 +03:00
Botond Dénes
4fc39721a2 partition_slice_builder: add range mutating methods 2021-09-09 14:16:21 +03:00
Botond Dénes
a2eb0f7d7e partition_slice_builder: add constructor with slice
Intended to be used to modify an existing slice. We want to move the
slice into the direction where the schema is at: make it completely
immutable, all mutations happening through the slice builder class.
2021-09-09 14:15:42 +03:00
Benny Halevy
40a6049ac2 compaction_manager: rewrite_sstables: acquire _compaction_locks
Take write lock for cf to serialize cleanup/upgrade sstables/scrub
with major compaction/reshape/reshard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-09 14:13:45 +03:00
Benny Halevy
44348b3080 compaction_manager: perform_sstable_scrub: run_with_compaction_disabled
since we might potentially have ongoing compactions, and we
must ensure that all sstables created before we run are scrubbed,
we need to barrier out any previously running compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-09-09 14:13:40 +03:00
Raphael S. Carvalho
a145ffcf52 compaction: don't rule out compacting sstables in validate-mode scrub
Even sstables being compacted must be validated, otherwise scrub
validate may return a false negative.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-09 14:06:50 +03:00
Raphael S. Carvalho
a23057edce Update CODEOWNERS for compaction subsystem
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210904003731.67134-1-raphaelsc@scylladb.com>
2021-09-09 12:50:06 +03:00
Botond Dénes
34abbe82fe query: specific_ranges: add non-const ranges accessor 2021-09-09 12:09:08 +03:00
Botond Dénes
30f6f676b8 range_tombstone: add reverse()
Reversing the range-tombstone, as-if it was emitted from a table with
reversed clustering order.
2021-09-09 11:49:05 +03:00
Botond Dénes
d0351eaaed clustering_bounds_comparator: add reverse_kind()
Hiding the tricky reversing of a bound_kind.
2021-09-09 11:49:05 +03:00
Botond Dénes
f200c8104a schema: introduce make_reversed()
`make_reversed()` creates a schema identical to the schema instance it is
called on, with clustering order reversed. To distinguish the reverse
schema from the original one, the node-id part of its version UUID is
bit-flipped. This ensures that reversing a schema twice will result in
the identical schema to the original one (although a different C++
object).

This reversed schema will be used in reversed reads, so intermediate
layers can be ignorant of the fact that the read happens in reverse.
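
Why bit-flipping makes double reversal an identity can be shown with a short Python sketch. It assumes the node-id part corresponds to the low 48 bits of the UUID (the `node` field of Python's `uuid.UUID`); `NODE_MASK` and `reversed_version` are hypothetical names, not the actual implementation.

```python
import uuid

# Assumption: the "node-id part" is modelled as the low 48 bits of the UUID.
NODE_MASK = (1 << 48) - 1


def reversed_version(version: uuid.UUID) -> uuid.UUID:
    # XOR with an all-ones mask flips every bit of the node part and leaves
    # the rest of the UUID untouched. XOR is its own inverse, so applying
    # the function twice restores the original version.
    return uuid.UUID(int=version.int ^ NODE_MASK)
```

Because the flip is deterministic, no extra "is reversed" flag has to be stored in the schema: the reversed version is always derivable from the original and vice versa.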
2021-09-09 11:49:05 +03:00
Botond Dénes
9a9b58e67b schema: add a transforming copy constructor
Taking a transform functor, which is executed after the raw schema is
copied, but before the derived fields are computed (rebuild()).
2021-09-09 11:49:05 +03:00
Botond Dénes
65913f4cfa utils: UUID_gen: introduce negate() 2021-09-09 11:49:05 +03:00
Botond Dénes
183ac6981a types: add reversed(data_type)
Reversing the sort order of a type.
2021-09-09 11:49:05 +03:00
Botond Dénes
0cc00b5d17 docs: design-notes: add reverse-reads.md
Explaining how reverse reads work, in particular the difference between
the legacy and native formats.
2021-09-09 11:49:02 +03:00
Tzach Livyatan
eba2ea9907 scylla.yaml: remove comment for num_tokens
The comment is less relevant for Scylla, and points to an irrelevant Apache Cassandra doc page.

Closes #9284
2021-09-09 11:45:40 +03:00
Nadav Har'El
e4bafe7dc7 Merge 'Split view builder shutdown procedure to drain + stop' from Piotr Sarna
In order to be able to avoid a deadlock when CQL server cannot be started,
the view builder shutdown procedure is now split into two parts:
drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
 - view builder creates a reader permit in order to be able
   to read from system tables
 - CQL server fails to start, shutdown procedure begins
 - view builder stop() is not called (because it was not scheduled
   yet), so it holds onto its reader permit
 - database shutdown procedure waits for all permits to be destroyed,
   and it hangs indefinitely because view builder keeps holding
   its permit.

Fixes #9306

Closes #9308

* github.com:scylladb/scylla:
  main: schedule view builder stopping earlier
  db,view: split stopping view builder to drain+stop
2021-09-09 11:38:15 +03:00
Dejan Mircevski
6afdc6004c cql3/modification_statement: Replace empty-range check with null check
The empty-range check causes more bugs than it fixes.  Replace it with
an explicit check for =NULL (see #7852).

Fixes #9311.
Fixes #9290.

Tests: unit (dev), cql-pytest on Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9314
2021-09-09 10:56:13 +03:00
Avi Kivity
595f1fe802 storage_proxy: digest_read_resolver: use small_vector for holding digests
There are typically just 1-2 digests per query, so we can allocate
space for them in digest_read_resolver using small_vector, saving
an allocation.

Results: (perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10)
    before: median 215301.75 tps ( 75.1 allocs/op,  12.1 tasks/op,   45238 insns/op)
    after:  median 221121.37 tps ( 74.1 allocs/op,  12.1 tasks/op,   45186 insns/op)

While the throughput numbers are not reliable due to frequency throttling,
it's clear there are fewer allocations and instructions executed.

Closes #9296
2021-09-09 10:24:39 +03:00
Piotr Sarna
e93585e66c main: schedule view builder stopping earlier
In order to avoid a deadlock described in the previous commit,
view builder stopping is registered earlier, so that its destructor
is called and its reader permit is released before the database starts
shutting down.
Note that draining the view builder is still scheduled later,
because it needs to happen before storage proxy drain
to keep the existing deinitialization order.
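
The ordering argument can be sketched with a toy Python model, assuming deferred stop actions run in reverse registration order (modelled here with `contextlib.ExitStack`). All names are hypothetical, and the real database would hang rather than log a message.

```python
from contextlib import ExitStack


def run_startup(register_stop_early):
    """Toy startup sequence; returns the log of shutdown events."""
    log = []
    permits = set()

    def database_shutdown():
        # The real database waits for all permits to be destroyed;
        # this toy only records whether that wait could ever finish.
        log.append("database shutdown: " + ("would hang" if permits else "clean"))

    def view_builder_stop():
        permits.discard("view_builder")
        log.append("view builder stop")

    def start_cql_server():
        raise RuntimeError("CQL server failed to start")

    try:
        with ExitStack() as deferred:
            deferred.callback(database_shutdown)
            permits.add("view_builder")  # the view builder takes a reader permit
            if register_stop_early:
                deferred.callback(view_builder_stop)
            start_cql_server()
            if not register_stop_early:
                deferred.callback(view_builder_stop)  # never reached on failure
    except RuntimeError:
        pass
    return log
```

With early registration the stop action is already on the stack when the CQL server fails, so it runs before the database shutdown and the permit is released in time.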

Fixes #9306
2021-09-08 10:53:08 +02:00
Piotr Sarna
5d7c765422 db,view: split stopping view builder to drain+stop
In order to be able to avoid a deadlock when CQL server cannot be started,
the view builder shutdown procedure is now split into two parts:
drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
 - view builder creates a reader permit in order to be able
   to read from system tables
 - CQL server fails to start, shutdown procedure begins
 - view builder stop() is not called (because it was not scheduled
   yet), so it holds onto its reader permit
 - database shutdown procedure waits for all permits to be destroyed,
   and it hangs indefinitely because view builder keeps holding
   its permit.
2021-09-08 10:52:40 +02:00
Dejan Mircevski
58a9a24ff0 cql3: Allow indexed query to select static columns
We previously forbade selecting a static column when an index is
used.  But Cassandra allows it, so we should, too -- see #8869.

After removing the static-column check, the existing code gets the
correct result without any further changes (though it may read
multiple rows from the same partition).

Fixes #8869.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9307
2021-09-08 08:22:59 +02:00
Tomasz Grabiec
9a77a03ea1 Merge "Remove most uses of gms::get_gossiper(), gms::get_local_gossiper()" from Avi
In the quest to have explicit dependencies and the ability to run
multiple nodes in one process, remove some uses of get_gossiper() and
get_local_gossiper() and replace them with parameters passed from main()
or its equivalents.

Some uses still remain, mostly in snitch, but this series removes a
majority.

* https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1
  alternator: remove uses of get_local_gossiper()
  storage_service: remove stray get_gossiper(), get_local_gossiper() calls
  migration_manager: remove use of get_gossiper() from passive_announce()
  storage_proxy: start_hints_manager(): don't require caller to provide gossiper
  migration_manager: remove uses of get_local_gossiper()
  storage_proxy: remove uses of get_local_gossiper()
  gossiper: remove get_local_gossiper() from some inline helpers
  gossiper: remove get_gossiper() from stop_gossiping()
  gossiper: remove uses of get_local_gossiper for its rpc server
  api: remove use of get_local_gossiper()
  gossiper: remove calls to global get_gossiper from within the gossiper itself
2021-09-07 20:02:30 +02:00
Avi Kivity
4aaddd8609 alternator: remove uses of get_local_gossiper()
Replace with a gossiper parameter passed from the controller.
2021-09-07 20:08:15 +03:00
Avi Kivity
1ece156de6 storage_service: remove stray get_gossiper(), get_local_gossiper() calls
storage_service already has a reference to gossiper, so just use it.
2021-09-07 20:08:15 +03:00
Avi Kivity
1a8f4937ca migration_manager: remove use of get_gossiper() from passive_announce()
migration_manager already has a reference to _gossiper, but
passive_announce is static and so can't use it. Luckily the only
caller (in storage_service) uses it as if it wasn't static, so
we can just unstaticify it.
2021-09-07 20:08:15 +03:00
Avi Kivity
37818170d8 storage_proxy: start_hints_manager(): don't require caller to provide gossiper
storage_proxy now maintains a reference to gossiper, so it can simplify
its callers.
2021-09-07 20:08:15 +03:00
Avi Kivity
d8f7903f60 migration_manager: remove uses of get_local_gossiper()
Pass gossiper as a constructor parameter instead. cql_test_env
gains a use of get_gossiper() instead, but at least these uses
are concentrated in one place.
2021-09-07 20:08:11 +03:00
Avi Kivity
71081be99c storage_proxy: remove uses of get_local_gossiper()
Pass the gossiper as a constructor parameter instead.
2021-09-07 17:14:09 +03:00
Botond Dénes
6e78e6c97f tools: remove scylla-sstable-index
It is replaced by scylla-sstable --dump-index.
2021-09-07 17:10:44 +03:00
Botond Dénes
2c600e34aa tools: introduce scylla-sstable
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* index-dump - dumps the content of the sstable index(es), similar to
  scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
  the sstable(s);
* custom - a hackable operation for the expert user (until scripting
  support is implemented);
* validate - validate the content of the sstable(s) with the mutation
  fragment stream validator, same as scrub in validate mode;
2021-09-07 17:10:44 +03:00
Avi Kivity
aa68927873 gossiper: remove get_local_gossiper() from some inline helpers
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternators)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
2021-09-07 17:03:37 +03:00
Avi Kivity
9ce1af9fcb gossiper: remove get_gossiper() from stop_gossiping()
Have the callers pass it instead, and they all have a reference
already except for cql_test_env (which will be fixed later).

The checks for initialization it does are likely unnecessary, but
we'll only be able to prove it when get_gossiper() is completely
removed.
2021-09-07 16:20:04 +03:00
Avi Kivity
fcd5376585 gossiper: remove uses of get_local_gossiper for its rpc server
Initialization happens in the gossiper itself, so we can capture
'this'. If we need to move to shard 0, use sharded::invoke_on() to
get the local instance.
2021-09-07 16:06:11 +03:00
Avi Kivity
9fb9299d95 api: remove use of get_local_gossiper()
Pass down gossiper from main, converting it to a shard-local instance
in calls to register_api() (which is the point that broadcasts the
endpoint registration across shards).

This helps remove gossiper as a global variable.
2021-09-07 15:53:39 +03:00
Botond Dénes
e86073c703 tools: extract finding selected operation (handler) into function
We want to use the pattern of having a command line flag for each
operation in more tools, so extract the logic which finds the selected
operation from the command line arguments into a function.
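
A minimal Python sketch of the "one command line flag per operation" pattern follows. The flag names mirror the operations listed for the tool, while `find_selected_operation` and the handler table are hypothetical and not the tool's actual code.

```python
def find_selected_operation(args, operations):
    """Return the handler of the single operation flag present in args.

    `operations` maps an operation name (e.g. "dump") to its handler.
    Exactly one "--<name>" flag must be present on the command line.
    """
    selected = [name for name in operations if "--" + name in args]
    if not selected:
        raise ValueError("no operation specified")
    if len(selected) > 1:
        raise ValueError("multiple operations specified: " + ", ".join(selected))
    return operations[selected[0]]
```

Keeping this lookup in one function means every tool that adopts the pattern reports a missing or ambiguous operation in the same way.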
2021-09-07 15:47:22 +03:00
Botond Dénes
23a56beccc tools: add schema_loader
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
2021-09-07 15:47:22 +03:00
Avi Kivity
61f02ece39 gossiper: remove calls to global get_gossiper from within the gossiper itself
gossiper is a peering_sharded_service, so it has access to sharded<gossiper>.
Remove the global call.
2021-09-07 15:15:09 +03:00
Botond Dénes
64dce2a59e cql3: query_processor: add parse_statements() 2021-09-07 11:13:30 +03:00
Felipe Mendes
1b8dff63c3 iotune - Fix i3en.xlarge check
i3en.xlarge is currently not getting tuned properly. A quick test using
Scylla AMI ( ami-07a31481e4394d346 ) reveals that the storage
capabilities under this instance are greatly reduced:

$ grep iops /etc/scylla.d/io_properties.yaml
  read_iops: 257024
  write_iops: 174080

This patch corrects this typo, in such a way that iotune now properly
tunes this instance type.

Closes #9298
2021-09-07 10:44:39 +03:00
Botond Dénes
68f5277e52 cql3: statements/create_type: expose create_type() 2021-09-07 10:37:25 +03:00
Botond Dénes
6b224b76b9 cql3: statements/create_keyspace: add get_keyspace_metadata() 2021-09-07 10:37:25 +03:00
Benny Halevy
e613c0d287 abstract_replication_strategy: remove commented out keyspace* member
It is not needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210906133840.3307279-2-bhalevy@scylladb.com>
2021-09-06 16:51:22 +03:00
Benny Halevy
b7eaa22ce6 abstract_replication_strategy: create_replication_strategy: drop keyspace name parameter
It is not used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210906133840.3307279-1-bhalevy@scylladb.com>
2021-09-06 16:51:21 +03:00
Benny Halevy
56e063ce93 keyspace: get rid of set_replication_strategy
It's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210906133905.3307397-1-bhalevy@scylladb.com>
2021-09-06 16:48:35 +03:00
Avi Kivity
69275f02fd Merge "cmake: fix sources and out of source builds" from Pavel S
"
This is a set of random patches trying to fix broken cmake build:
* `-fcoroutines` flag is now used only for GCC, but not for Clang
* `SCYLLA-VERSION-GEN` invocation is adjusted to work correctly with
  out-of-source builds
* Auxiliary targets are adjusted to support out-of-source builds
* Removed extra source files and added missing ones to the scylla
  target

Scylla still doesn't build successfully with CMake. But now,
at least, it passes the configuration step, which is a prerequisite
to loading the solution in IDEs.
"

* 'cmake_improvements' of github.com:ManManson/scylla:
  cmake: fix out-of-source builds
  cmake: don't use `-fcoroutines` for clang
  cmake: update and sort source files and idl:s
2021-09-06 14:17:23 +03:00
Pavel Emelyanov
ad2e63aaf4 test: Don't nest seastar::async calls (2nd cont)
The 2nd continuation of the previous patch fixes the places
with run_with_async() nested inside explicit seastar::async
calls.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-06 08:26:09 +03:00
Pavel Emelyanov
5fdc82bad7 test: Don't nest seastar::async calls (cont)
The continuation of the previous patch for the cases when the
sstables::test_env::run_with_async sits lower in the stack from
the SEASTAR_THREAD_TEST_CASE. The patching is similar but also
requires some care about reference-captured variables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-06 08:26:09 +03:00
Pavel Emelyanov
8c786937d5 test: Don't nest seastar::async calls
The SEASTAR_THREAD_TEST_CASE runs the provided lambda in async
context. The sstables::test_env::run_with_async does the same.

This (script-generated) patch makes all of the found cases be
SEASTAR_TEST_CASE and, respectively, return the async future
from the run_with_async().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-06 08:26:09 +03:00
Avi Kivity
dfc135dbd1 Merge "Keep range_tombstone apart from list linkage" from Pavel E
"
There's a landmine buried in range_tombstone's move constructor.
Whoever tries to use it risks detaching the tombstone from the
containing list, thus leaking it and possibly invalidating an
iterator pointing at it. There's a safe without_link move
constructor out there, but still.

To keep this place safe it's better to separate range_tombstone
from its linkage into anywhere. In particular to keep the range
tombstones in a range_tombstone_list here's the entry that keeps
the tombstone _and_ the list hook (which is a boost set hook).

The approach resembles the rows_entry::deletable_row pair.

tests: unit(dev, debug, patch from #9207)
fixes: #9243
"

* 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla:
  range_tombstone: Drop without-link constructor
  range_tombstone: Drop move_assign()
  range_tombstone: Move linkage into range_tombstone_entry
  range_tombstone_list: Prepare to use range_tombstone_entry
  range_tombstone, code: Add range_tombstone& getters
  range_tombstone_list: Factor out tombstone construction
  range_tombstone_list: Simplify (maybe) pop_front_and_lock()
  range_tombstone_list: De-templatize pop_as<>
  range_tombstone_list: Conceptualize erase_where()
  range_tombstone(_list): Mark some bits noexcept
  mutation: Use range_tombstone_list's iterators
  mutation_partition: Shorten memory usage calculation
  mutation_partition: Remove unused local variable
2021-09-05 17:26:13 +03:00
Raphael S. Carvalho
6849ec46b8 compaction: Don't purge tombstones in scrub
Scrub is supposed to not remove anything from input, write it as is
while fixing any corruption it might have. It shouldn't have any
assumption on the input. Additionally, a data shadowed by a tombstone
might be in another corrupted sstable, so expired tombstones should
not be purged in order to prevent data resurrection from occurring.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210904165908.135044-1-raphaelsc@scylladb.com>
2021-09-05 17:10:34 +03:00
Dejan Mircevski
1fdaeca7d0 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.
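
How a single empty-range check covers both NULL keys and impossible restrictions can be sketched in Python. This is a simplified model over one numeric column; `is_empty_range` and the `(operator, value)` restriction format are hypothetical, not the actual cql3 code.

```python
import math


def is_empty_range(restrictions):
    """Return True if no value can satisfy all (operator, value) restrictions."""
    lo, lo_incl = -math.inf, False
    hi, hi_incl = math.inf, False
    for op, value in restrictions:
        if value is None:
            return True  # a comparison with NULL matches nothing
        if op in ("=", ">", ">="):
            incl = op != ">"
            # Keep the tightest lower bound seen so far.
            if value > lo or (value == lo and not incl):
                lo, lo_incl = value, incl
        if op in ("=", "<", "<="):
            incl = op != "<"
            # Keep the tightest upper bound seen so far.
            if value < hi or (value == hi and not incl):
                hi, hi_incl = value, incl
    return lo > hi or (lo == hi and not (lo_incl and hi_incl))
```

Here `[("=", None)]` and `[(">", 0), ("<", 0)]` both yield an empty range, so a statement built on either would be rejected by the same check.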

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286
2021-09-05 10:23:28 +03:00
Pavel Emelyanov
7a0e56d7c1 range_tombstone: Drop without-link constructor
The thing was used to move a range tombstone without detaching it
from the containing list (well, intrusive set). Now when the linkage
is gone this facility is no longer needed (and actually no longer
used).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:50 +03:00
Pavel Emelyanov
f82b5f30f6 range_tombstone: Drop move_assign()
The helper was in use by move-assignment operator and by the .swap()
method. Since now the operator equals the helper, the code can be
merged and the .swap() can be prettified.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:50 +03:00
Pavel Emelyanov
d6af441eaa range_tombstone: Move linkage into range_tombstone_entry
Now it's time to remove the boost set's hook from the range_tombstone
and keep it wrapped into another class if the r._tombstone's location
is the range_tombstone_list.

Also the added previously .tombstone() getters and the _entry alias
can be removed -- all the code can work with the new class.

Two places in the code that made use of without_link{} move-constructor
are patched to get the range_tombstone part from the respective _entry
with the same result.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
b8c585c54d range_tombstone_list: Prepare to use range_tombstone_entry
A continuation of the previous patch. The range_tombstone_list works
with the range_tombstone very actively, kicking every single line
doing this to call .tombstone() seems excessive. Instead, declare
the range_tombstone_entry alias. When the entry will appear for real,
the alias would go away and the range_tombstone_list will be switched
into new entity right at once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introducing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
ae8a5bd046 range_tombstone_list: Factor out tombstone construction
Just add a helper for constructing the managed range
tombstone object. This will also help further patch
have less duplicating hunks in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
8f061b9b1c range_tombstone_list: Simplify (maybe) pop_front_and_lock()
The method returns a pointer on the left-most range tombstone
and expects the caller to "dispose" it. This is not very nice
because the callers thus need to mess with the relevant deleter.

A nicer approach is the pop-like one (former pop_as<>-like)
which is in returning the range tombstone by value. This value
is move-constructed from the original object which is disposed
by the container itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
2e1b21d72b range_tombstone_list: De-templatize pop_as<>
The method pops the range tombstone from the containing list
and transparently "converts" it into some other type. Nowadays
all callers of it need range tombstone as-is, so the template
can be relaxed down to a plain call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
e4965b1662 range_tombstone_list: Conceptualize erase_where()
Just while at this code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
fcc02c6bed range_tombstone(_list): Mark some bits noexcept
The range_tombstone's .empty() and .operator bool are trivially such.

The swap()'s noexceptness comes from what it calls -- the without-link
move constructor (noexcept) and .move_assign(). The latter is noexcept
because it's already called from noexcept move-assign operator and
because it calls noexcept move operators of tombstones' fields. The
update_node() is noexcept for the same reason.

The range_tombstone_list's clear() is noexcept because both -- set
clear and disposer lambda are both such.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:31:43 +03:00
Pavel Solodovnikov
9dc4e35e89 cmake: fix out-of-source builds
Don't use relative paths, construct absolute paths to sources
wherever needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-03 17:51:04 +03:00
Pavel Solodovnikov
334c982697 cmake: don't use -fcoroutines for clang
This gcc flag is not supported. `-fcoroutines-ts` also
cannot be used, so just don't supply anything, similar to
what `configure.py` does.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-03 17:50:57 +03:00
Pavel Emelyanov
87ce46d1c6 mutation: Use range_tombstone_list's iterators
The consume_clustering_fragments declares several auxiliary symbols
to work with rows' and range-tombstones' iterators. For the range
tombstones it relies on what container is declared inside the
range tombstone itself. Soon the container declaration will move
from range_tombstone class into a new entity and this place should
be prepared for that. The better place to get iterator types from
is the range-tombstones container itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 12:56:13 +03:00
Pavel Emelyanov
ac473a9e67 mutation_partition: Shorten memory usage calculation
The range_tombstone_list's replacer runs exactly the same loop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 12:56:13 +03:00
Pavel Emelyanov
f173be29d9 mutation_partition: Remove unused local variable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 12:56:13 +03:00
Nadav Har'El
b3f4a37a75 test/alternator: verify that nulls are valid inside string and bytes
The tests in this patch verify that null characters are valid characters
inside string and bytes (blob) attributes in Alternator. The tests
verify this for both key attributes and non-key attributes (since those
are serialized differently, it's important to check both cases).

The tests pass on both DynamoDB and Alternator - confirming that we
don't have a bug in this area.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824163442.186881-1-nyh@scylladb.com>
2021-09-03 08:49:06 +02:00
Avi Kivity
a81057b2e1 Merge "sstables: introduce crawling reader" from Botond
"
A special-purpose reader which doesn't use the index at all, designed to
be used in circumstances where the index is not reliable. The use-case
is scrub and validate which often have to work with corrupt indexes and
it is especially important that they don't further any existing
corruption.

Tests: unit(dev)
"

* 'crawling-sstable-reader/v2' of https://github.com/denesb/scylla:
  compaction: scrub/validate: use the crawling sstable reader
  sstables: wire in crawling reader
  sstables: mx/reader: add crawling reader
  sstables: kl/reader: add crawling reader
2021-09-02 16:26:35 +03:00
Nadav Har'El
068c4283b7 test/cql-pytest: add tests for some undocumented cases of string types
This patch adds tests for two undocumented (as far as I can tell) corner
cases of CQL's string types:

1. The types "text" and "varchar" are not just similar - they are in
   fact exactly the same type.

2. All CQL string and blob types ("ascii", "text" or "varchar", "blob")
   allow the null character as a valid character inside them. They are
   *not* C strings that get terminated by the first null.

These tests pass on both Cassandra and Scylla, so did not expose any
bug, but having such tests is useful to understand these (so-far)
undocumented behaviors - so we can later document them.
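
The second point can be illustrated with a small Python sketch. It is not
taken from the patch: `as_c_string` is a hypothetical helper that only
emulates C-string truncation, to contrast it with length-aware values that
keep an embedded null and everything after it.

```python
def as_c_string(value: str) -> str:
    """Emulate C-string semantics: keep only what precedes the first NUL."""
    return value.split("\x00", 1)[0]


# Length-aware values, analogous to CQL text and blob values.
text_value = "abc\x00def"
blob_value = b"abc\x00def"

# The null is an ordinary character, so nothing is lost.
assert len(text_value) == 7
assert len(blob_value) == 7

# A C string would silently drop the tail after the first null.
assert as_c_string(text_value) == "abc"
```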

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824225641.194146-1-nyh@scylladb.com>
2021-09-02 15:45:47 +03:00
Pavel Solodovnikov
ebee744590 idl-compiler: make the script work with python 3.8
Python 3.8 doesn't allow using built-in collection types
in type annotations (such as `list` or `dict`).
This feature is implemented starting from 3.9.

Replace `list[str]` type annotation with an old-style
`List[str]`, which uses `List` from the `typing` module.
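
A minimal sketch of the difference, not the actual idl-compiler code
(`join_names` and `builtin_generics_supported` are made-up names): the
`typing.List` spelling works on 3.8 and later, while subscripting the
built-in `list` raises `TypeError` before 3.9.

```python
import sys
from typing import List


def join_names(names: List[str]) -> str:
    """Old-style annotation: accepted by Python 3.8 and newer."""
    return ", ".join(names)


def builtin_generics_supported() -> bool:
    """Probe whether the interpreter accepts `list[str]`."""
    try:
        list[str]
    except TypeError:
        # Python < 3.9: 'type' object is not subscriptable
        return False
    return True


assert join_names(["a", "b"]) == "a, b"
assert builtin_generics_supported() == (sys.version_info >= (3, 9))
```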

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210901131436.35231-1-pa.solodovnikov@scylladb.com>
2021-09-02 15:38:44 +03:00
Pavel Solodovnikov
4cfec099b9 cmake: update and sort source files and idl:s
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-02 14:23:37 +03:00
Raphael S. Carvalho
3263c1d5f1 Make shutdown clean when stopping sstable reshard
After aa7cdc0392, run_custom_job() propagates stop exception.
The problem is that we fail to handle stop exception in the procedure
which stops ongoing compactions, so the exception will be propagated
all the way to init, which causes scylla to abort.

To fix this, let's swallow stop_exception in stop_ongoing_compactions(),
which is correct because compactions are stopped by triggering that
exception if signalled to stop.

When reshard is stopped, scylla init will fail as follows instead:
ERROR 2021-08-16 20:13:13,770 [shard 0] init - Startup failed: std::runtime_error
(Exception while populating keyspace 'keyspace5' with column family 'standard1'
from file ...

Fixes #9158.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210816232434.78375-1-raphaelsc@scylladb.com>
2021-09-02 13:50:24 +03:00
Benny Halevy
33f579f783 distributed_loader: distributed_loader::get_sstables_from_upload_dir: do not copy vector containing foreign shared sstables
lw_shared_ptr must not be copied on a foreign shard.
Copying the vector on shard 0 tries to increase the reference count of
lw_shared_ptr<sstable> elements that were created on other shards,
as seen in https://github.com/scylladb/scylla/issues/9278.

Fixes #9278

DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>
2021-09-02 13:49:06 +03:00
Avi Kivity
9c17f75f52 cql3: reduce noise in grammar when using cql3::expr types
The CQL grammar is obviously about cql3 and mostly about cql3
expressions, so add using namespace statements so we don't
have to specify it over and over again.

These statements are present in the headers, but only in the
cql_parser namespace, so it doesn't pollute other translation
units.

Closes #9255
2021-09-02 13:39:42 +03:00
Michał Radwański
9a1e82bb92 .gitignore: add compile_commands.json
compile_commands.json is a format of compilation database info for use
with several editors, such as VSCode (with official C++ extension) and
Vim (with youcompleteme). It can be generated with ninja:
```
ninja -t compdb > compile_commands.json
```

I propose this addition, so that this file won't be committed by
accident.
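
For readers unfamiliar with the file, here is a rough sketch of its shape
and of how a tool might read it. The sample entry (paths, compiler flags)
is invented for illustration, and `source_files` is a hypothetical helper,
not part of any editor plugin.

```python
import json

# Invented sample in the JSON compilation database format: a list of
# objects with "directory", "command" and "file" keys.
SAMPLE = """
[
  {
    "directory": "/home/user/scylla",
    "command": "clang++ -std=c++20 -c main.cc -o build/dev/main.o",
    "file": "main.cc"
  }
]
"""


def source_files(compdb_text):
    """Return the source file named by each compilation database entry."""
    return [entry["file"] for entry in json.loads(compdb_text)]


assert source_files(SAMPLE) == ["main.cc"]
```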

Closes #9279
2021-09-02 13:37:35 +03:00
Pavel Solodovnikov
f8fe043b94 build: allow to run SCYLLA-VERSION-GEN utility out of source
This change allows to invoke the script in out-of-source
builds: `git log` now uses `-C` option with the directory
containing the script.

Also, the destination path can now be overridden by providing
`-o|--output-dir PATH` option. By default it's set to the
`build` directory relative to the script location.

Usage message is now shown, when '-h|--help' option is
specified.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210831120257.46920-1-pa.solodovnikov@scylladb.com>
2021-09-02 13:04:34 +03:00
Takuya ASADA
729d0feef0 install-dependencies.sh: add scylla-driver to relocatable python3
Pass --pip-packages option to tools/python3/reloc/build_reloc.sh,
add scylla-driver to relocatable python3 which is required for
fix_system_distributed_tables.py.

[avi: regenerate toolchain]

Ref #9040
2021-09-02 11:52:47 +03:00
Pavel Emelyanov
cfcea8fc33 storage_service: Replace is_local_dc() with vs db::is_local()
Both functions do the same -- get datacenters from given
endpoint and local broadcast address and compare them to
match (or not).

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210902080858.16364-1-xemul@scylladb.com>
2021-09-02 11:25:48 +03:00
Avi Kivity
403645f58c Merge "raft: miscellaneous fixes" from Gleb
* 'raft-misc-v3' of github.com:scylladb/scylla-dev:
  raft: rename snapshot into snapshot_descriptor
  raft: drop snapshot if is application failed
  raft: fix local snapshot detection
  raft: replication_test: store multiple snapshots in a state machine
  raft: do not wait for entry to become stable before replicate it
2021-09-02 11:25:06 +03:00
Avi Kivity
8a1d99a039 Update seastar submodule
* seastar 07758294ef...c04a12edbd (4):
  > core: add alien() getter to reactor
  > io_priority_class: add missing headers
  > Merge "require deferred action to be noexcept" from Benny
  > net: silence compiler warning in tls_connected_socket_impl.
2021-09-02 11:11:49 +03:00
Michael Livshin
fbb5802229 mf-stream-validator: add previous partition key to error messages
Only seems to make sense in mutation fragment validation where
validation level is >= `partition_key`.

Fixes #9269

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210901165641.340185-1-michael.livshin@scylladb.com>
2021-09-02 11:05:33 +03:00
Botond Dénes
7a78601b5d compaction: scrub/validate: use the crawling sstable reader
Sstables that are scrubbed or validated are typically problematic ones
that potentially have corrupt indexes. To avoid using the index
altogether use the recently added crawling reader. Scrub and validate
never skip in the sstable anyway.
2021-09-01 16:21:49 +03:00
Botond Dénes
1abf665d1d sstables: wire in crawling reader 2021-09-01 16:21:49 +03:00
Avi Kivity
705f957425 Merge "Generalize TLS creds builder configuration" from Pavel E
"
There are 4 places out there that do the same steps parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).

Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.

This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.

tests: unit(dev), dtest.internode_ssl_test(dev)
"

* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
  code: Generalize tls::credentials_builder configuration
  transport, redis: Do not assume fixed encryption options
  messaging: Move encryption options parsing to ms
  main: Open-code internode encryption misconfig warning
  main, config: Move options parsing helpers
2021-09-01 14:19:19 +03:00
Nadav Har'El
72bc37ddc1 README.md: update link to docker build instructions
The link to the docker build instructions was outdated - from the time
our docker build was based on a Redhat distribution. It no longer is,
it's now based on Ubuntu, and the link changed accordingly.

Fixes #9276.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210901083055.445438-1-nyh@scylladb.com>
2021-09-01 11:50:11 +03:00
Liu Lan
a5c54867f8 alternator: Exclusive start key must lie within the segment
...when using Segment/TotalSegments option.

The requirement is not specified in DynamoDB documents, but found
in DynamoDB Local:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"Exclusive start key must lie within the segment"}

Fixes #9272

Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #9270
2021-09-01 11:05:45 +03:00
Botond Dénes
9548200e85 sstables: mx/reader: add crawling reader
A special-purpose reader which doesn't use the index at all and hence
doesn't support skipping at all. It is designed to be used in conditions
in which the index is not reliable (scrub compaction).
2021-09-01 08:44:13 +03:00
Botond Dénes
4421929b25 sstables: kl/reader: add crawling reader
A special-purpose reader which doesn't use the index at all and hence
doesn't support skipping at all. It is designed to be used in conditions
in which the index is not reliable (scrub compaction).
2021-09-01 08:42:10 +03:00
Avi Kivity
8b59e3a0b1 Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski
Return the pre- 6773563d3 behavior of demanding ALLOW FILTERING when partition slice is requested but on potentially unlimited number of partitions.  Put it on a flag defaulting to "off" for now.

Fixes #7608; see comments there for justification.

Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9126

* github.com:scylladb/scylla:
  cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
  cql3: Track warnings in prepared_statement
  test: Use ALLOW FILTERING more strictly
  cql3: Add statement_restrictions::to_string
2021-08-31 18:05:26 +03:00
Dejan Mircevski
2f28f68e84 cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING.  Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.

Because we now reject queries that worked fine before, existing
applications can break.  Therefore, the behavior is controlled by a
flag currently defaulting to off.  We will default to "on" in the next
Scylla version.

Fixes #7608; see comments there for justification.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-31 10:45:41 -04:00
Nadav Har'El
9666921dbc Merge 'cql3: expr: introduce search_and_replace()' from Avi Kivity
Introduce a general-purpose search and replace function to manipulate
expressions, and use it to simplify replace_column_def() and
replace_token().

Closes #9259

* github.com:scylladb/scylla:
  cql3: expr: rewrite replace_token in terms of search_and_replace()
  cql3: expr: rewrite replace_column_def in terms of search_and_replace()
  cql3: expr: add general-purpose search-and-replace
2021-08-31 15:56:41 +03:00
Avi Kivity
6a0a5a17d7 Merge "Fix exception safety of btree::clone_from()" from Pavel E
"
When cloning throws in the middle it may leak some child
nodes triggering the respective assertion in node destructor.

Also there's a chance to mis-assert the linear node roll-back.

tests: unit(dev)
"

Fixes #9248
Backport: 4.5

* 'br-btree-clone-exceptions-2' of https://github.com/xemul/scylla:
  btree: Add commens in .clone() and .clear()
  btree, test: Test exception safety and non-leakness of btree::clone_from
  btree, test: Test key copy constructor may throw
  btree: Dont leak kids on clone roll-back
  btree: Destroy, not drop, node on clone roll-back
2021-08-31 14:34:14 +03:00
Pavel Emelyanov
e6d568b38e btree: Add commens in .clone() and .clear()
There are two tricky places about corner leaves pointers
managements. Add comments describing the magic.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:36:54 +03:00
Avi Kivity
542a8bc0f3 cql3: expr: rewrite replace_token in terms of search_and_replace()
Use search_and_replace() to simplify replace_token(). Note the conversion
does not have 100% fidelity - the previous implementation throws on
some impossible subexpression types, and the new one passes them
through. It should be the caller's responsibility anyway, not a
side effect of replacing tokens, and since these subexpressions are
impossible there is no real effect on execution.

Note that this affects only TOKEN() calls on the partition key
columns in the right order. Other uses of the token function
(say with constants) won't be translated to the token subexpression
type. So something like

    WHERE token(pk) = token(?)

would only see the left-hand side replaced, not the right-hand
side, even if it were an expression rather than a term.
2021-08-31 12:29:47 +03:00
Avi Kivity
10ca63128a cql3: expr: rewrite replace_column_def in terms of search_and_replace()
We won't introduce new expression types that are equivalent
to column_value, and search_and_replace() takes care of all
expressions that need to recurse, so we don't need std::visit()
for the search/replace lambda.
2021-08-31 12:29:47 +03:00
Avi Kivity
7a594bc42f cql3: expr: add general-purpose search-and-replace
Add a recursive search-and-replace function on expressions. The
caller provides a search/replace function to operate on subexpressions,
returning nullopt if they want the default behavior of recursively
copying, or a new expression to terminate the search (in the current
subtree) and replace the current node with the returned expression.

To avoid endlessly specifying the subexpression types that get the
common behavior (copying) since they don't contain any subexpressions,
we add a new concept LeafExpression to signify them.

Existing functions such as replace_token() can be reimplemented in
terms of search_and_replace, but that is left for later.
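
The contract described above can be sketched in Python on a toy expression
tree. This is only an illustration of the idea, not the C++ implementation:
the `Column`, `Constant` and `Call` classes are invented stand-ins, and
`None` plays the role of nullopt.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Column:          # leaf: contains no subexpressions
    name: str


@dataclass(frozen=True)
class Constant:        # leaf: contains no subexpressions
    value: int


@dataclass(frozen=True)
class Call:            # inner node: recursion descends into args
    func: str
    args: Tuple[object, ...]


def search_and_replace(expr, replace):
    """Copy `expr`, letting `replace` substitute whole subtrees.

    `replace(e)` returns None to request the default behavior
    (recursive copy), or a new expression that replaces `e` and
    ends the search in that subtree.
    """
    replaced = replace(expr)
    if replaced is not None:
        return replaced
    if isinstance(expr, Call):
        return Call(expr.func,
                    tuple(search_and_replace(a, replace) for a in expr.args))
    return expr  # leaf expressions are copied as-is


tree = Call("token", (Column("pk"), Constant(1)))
renamed = search_and_replace(
    tree, lambda e: Column("p2") if e == Column("pk") else None)
assert renamed == Call("token", (Column("p2"), Constant(1)))
```

Marking the leaf types once keeps the recursion step from having to list
every type that simply gets copied.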
2021-08-31 12:29:37 +03:00
Pavel Emelyanov
e26a6c1acc btree, test: Test exception safety and non-leakness of btree::clone_from
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
da38038222 btree, test: Test key copy constructor may throw
It calls the tree_test_key_base copy constructor which
is throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
d1a1a2dac2 btree: Dont leak kids on clone roll-back
When failed-to-be-cloned node cleans itself it must also clear
all its child nodes. Plain destroy() doesn't do it, it only
frees the provided node.

fixes: #9248

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
1d857d604a btree: Destroy, not drop, node on clone roll-back
The node in this place is not yet attached to its parent, so
in btree::debug::yes (tests only) mode the node::drop()'s parent
checks will access null parent pointer.

However, in non-testing runtime there's a chance that a linear
node fails to clone one of its keys and gets here. In this case
it will carry both leftmost and rightmost flags and the assertion
in drop will fire.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Dejan Mircevski
81f00d82cf cql3: Drop more dead code
This is some dead code that 44ca965ba missed.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9267
2021-08-31 12:06:19 +03:00
Beni Peled
4fe4aa190d dist-check: add podman support
...and use container term instead of docker

Closes #9265
2021-08-31 09:10:58 +03:00
Nadav Har'El
d7474ddff3 dist/docker: fix errors in README.md
The (oddly-placed) document dist/docker/debian/README.md explains how a
developer can build a Scylla docker image using a self-built Scylla
executable.

While the document begins by saying that you can "build your own
Scylla in whatever build mode you prefer, e.g., dev.", the rest of the
instructions don't fit this example mode "dev" - the second command does
"ninja dist-deb" which builds *all* modes, while the third command
forgets to pass the mode at all (and therefore defaults to "release").
The fourth command doesn't work at all, and became irrelevant during
a recent rewrite in commit e96ff3d.

This patch modifies the document to fix those problems.

It ends with an example of how to run the resulting docker image
(this is usually the purpose of building a docker image - to run it
and test it). I did this example using podman because I couldn't get
it to work in docker. Later we can hopefully add the corresponding
docker example.

Fixes #9263.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829182608.355748-1-nyh@scylladb.com>
2021-08-30 08:36:33 +03:00
Nadav Har'El
ed7106ebd7 docker: fix regression of docker image ignoring command-line arguments
Our docker image accepts various command-line arguments and translates
them into Scylla arguments. For example, Alternator's getting-started
document has the following example:

  ```
  docker run --name scylla -d -p 8000:8000 scylladb/scylla-nightly:latest \
  --alternator-port=8000 --alternator-write-isolation=always
  ```

Recently, this stopped working and the extra arguments at the end were
just ignored.

It turns out that this is a regression caused by commit
e96ff3d82d that changed our docker image
creation process from Dockerfile to buildah. While the entry point
specified in Dockerfile was a string, the same string in buildah has
a strange meaning (an entry point which can't take arguments) and to
get the original meaning, the entry point needs to be a JSON array.
This is kind-of explained in https://github.com/containers/buildah/issues/732.

So changing the entry point from a string to a JSON array fixes the
regression, and we can again pass arguments to Scylla's docker image.

Fixes #9247.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829180328.354109-1-nyh@scylladb.com>
2021-08-30 08:26:15 +03:00
Nadav Har'El
bd4552fd57 configure.py: fix build-mode-specific targets to not build all modes
We have in our Ninja build file various targets which ask to build just
a single build mode. For example, "ninja dev" builds everything in dev
mode - including Scylla, tests, and distribution artifacts - but
shouldn't build anything in other build modes (debug, release, etc.),
even if they were previously configured by configure.py.

However, we had a bug where these build-mode-specific targets
nevertheless compiled *all* configured modes, not just the requested
mode.

The bug was introduced in commit edd54a9463 -
targets "dist-server-compat" and "dist-unified-compat" were introduced,
but instead of having per-build-mode versions of these targets, only
one of each was introduced building all modes. When these new targets
were used in a couple of places in per-build-mode targets, it forced
these targets to build all modes instead of just the chosen one.

The solution is to split the dist-server-compat target into multiple
dist-server-compat-{mode}, and similarly split dist-unified-compat.
The unsplit target is also retained - for use in targets that really
want all build modes.
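
Roughly, the split could be generated like this. It is a sketch under
assumptions rather than the real configure.py code: `compat_targets` is a
made-up helper and the per-mode dependency name `dist-server-{mode}` is a
placeholder; only the Ninja `build <out>: phony <inputs>` syntax is taken
as given.

```python
def compat_targets(modes):
    """Emit one phony compat target per mode, plus an all-modes target."""
    lines = []
    for mode in modes:
        # Placeholder dependency: a per-mode target depends only on
        # artifacts of the same mode.
        lines.append(
            f"build dist-server-compat-{mode}: phony dist-server-{mode}")
    # The unsplit target is kept for users that really want every mode.
    all_modes = " ".join(f"dist-server-compat-{m}" for m in modes)
    lines.append(f"build dist-server-compat: phony {all_modes}")
    return lines


for line in compat_targets(["dev", "debug"]):
    print(line)
```

With this shape, a per-mode target such as "ninja dev" can depend on
dist-server-compat-dev alone and no longer pulls in the other modes.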

Fixes #9260.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829123418.290333-1-nyh@scylladb.com>
2021-08-29 15:38:27 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
0aa2e95475 raft: drop snapshot if is application failed
No need to keep a snapshot that was not applied.
2021-08-29 12:53:03 +03:00
Gleb Natapov
f9f859ac40 raft: fix local snapshot detection
The code assumes that the snapshot that was taken locally is never
applied. Currently the logic to detect that is flawed. It relies on the
id of the most recently applied snapshot (where a locally taken snapshot
is considered to be applied as well). But if another local snapshot is
taken between snapshot creation and the check, the ids will not match.

The patch fixes this by propagating locality information together with
the snapshot itself.
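
The race can be sketched as follows. This is a simplified model, not the
raft code: `SnapshotDescriptor`, `applies_by_id` and `applies_by_flag` are
invented names that only mirror the two detection strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotDescriptor:
    id: int


def applies_by_id(snp, last_applied_id):
    """Old scheme: a snapshot is treated as local only if its id equals
    the most recently applied (or locally taken) snapshot id."""
    return snp.id != last_applied_id


def applies_by_flag(snp, is_local):
    """New scheme: locality travels together with the snapshot."""
    return not is_local


first = SnapshotDescriptor(id=1)    # taken locally
second = SnapshotDescriptor(id=2)   # also local, taken before the check
last_applied_id = second.id

# The id comparison misclassifies the first local snapshot as one to apply.
assert applies_by_id(first, last_applied_id) is True
# The explicit flag gives the intended answer regardless of ordering.
assert applies_by_flag(first, is_local=True) is False
```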
2021-08-29 12:53:03 +03:00
Gleb Natapov
80a392a444 raft: replication_test: store multiple snapshots in a state machine
State machine should be able to store more than one snapshot at a time
(one may be the currently used one and another is transferred from a
leader but not applied yet).
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for entry to become stable before replicate it
Since io_fiber persists entries before sending out messages even non
stable entries will become stable before observed by other nodes.

This patch also moves generation of append messages into get_output()
call because without the change we will lose batching since each
advance of last_idx will generate new append message.
2021-08-29 12:48:15 +03:00
Avi Kivity
3de5b849e0 Update tools/java submodule (JAVA8_HOME)
* tools/java 0b6ecbeb90...a2fe67fd42 (1):
  > build_reloc.sh: set JAVA8_HOME if not already set
2021-08-29 12:27:17 +03:00
Pavel Emelyanov
60a7ca62f2 storage_service: Drop .enable_all_features()
This method has nothing to do with storage service and
is only needed to move feature service options from one
method to another. This can be done by the only caller
of it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210827133954.29535-1-xemul@scylladb.com>
2021-08-29 11:27:05 +03:00
Pavel Solodovnikov
998dadf479 keys: remove with_linearized uses
There is a variant of `to_hex` that works with `managed_bytes_view`,
no need to linearize.

Tests: unit(dev)

[avi: edit out unneeded std::ref()]

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210828093252.650928-1-pa.solodovnikov@scylladb.com>
2021-08-28 12:49:10 +03:00
Pavel Emelyanov
77a8fee513 tests: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 19:17:08 +03:00
Pavel Emelyanov
9a77ff1cf4 tests: Split sstable_conforms_to_mutation_source
The only case in this test effectively carries 6 of them. When run
as they are now (sequentially) the total test run time in debug mode
is ~35 minutes. When split each case takes ~6 minutes to complete.
In dev/release mode it's ~1 minute vs ~10 seconds respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 19:17:01 +03:00
Pavel Emelyanov
0fd00d7016 cdc: Add database argument to is_log_for_some_table
All callers have been patched already. This argument can now
be used to replace get_local_storage_proxy().get_db().local()
call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:26 +03:00
Pavel Emelyanov
2701a1ee28 client_state: Pass database into has_access()
All callers of it already have it, so just pass it along

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:26 +03:00
Pavel Emelyanov
de7761985c client_state: Add database argument to has_schema_access
The only caller is thrift that has database reference on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:26 +03:00
Pavel Emelyanov
36a4c1ddc1 client_state: Add database argument to has_keyspace_access()
Callers are cql3, that has database via proxy, and thrift that
has one by reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:18 +03:00
Pavel Emelyanov
fe8bc0757b cdc: Add database argument to check_for_attempt_to_create_nested_cdc_log
The only caller of it already has database argument, just
pass it a bit further

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-27 14:07:18 +03:00
Pavel Solodovnikov
b00443ab87 test: adjust schema_change_test to include new system.raft_config table
Check that the new table uses null sharder.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:30:17 +03:00
Pavel Solodovnikov
8d3c0ee9b6 raft: new schema for storing raft snapshots
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
   documentation in the future.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:46 +03:00
Pavel Solodovnikov
0a8faee660 raft: pass server id to raft_sys_table_storage instance
Preparations for changing raft snapshots schema.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:20 +03:00
Eliran Sinvani
7f44736939 Service Levels: do not notify stale service level removals
Before this commit, the service_level_controller will notify
the subscribers on stale deletes, meaning, deletes of locally
non-existent service_levels.
The code flow shouldn't ever get to such a state, but as long
as this condition is checked instead of being asserted it is
worthwhile to change the code to be safe.
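
In outline, the guard looks like the sketch below. The class and method
names are hypothetical and do not come from service_level_controller; the
point is only that a delete of an unknown level returns without notifying
anyone.

```python
class ServiceLevelRegistry:
    """Toy registry that records the notifications it would send."""

    def __init__(self):
        self._levels = {}
        self.notifications = []

    def add(self, name, config):
        self._levels[name] = config

    def remove(self, name):
        if name not in self._levels:
            # Stale delete: the level is not known locally, stay silent.
            return False
        del self._levels[name]
        self.notifications.append(("removed", name))
        return True


registry = ServiceLevelRegistry()
registry.add("gold", {"shares": 1000})
assert registry.remove("gold") is True
assert registry.remove("gold") is False          # second delete is stale
assert registry.notifications == [("removed", "gold")]
```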

Closes #9253
2021-08-26 18:27:52 +03:00
Nadav Har'El
389b866d33 Merge 'cql3: convert term::raw to expressions' from Avi Kivity
This series converts the `term::raw` objects the grammar produces to expressions.
For each grammar production, an expression type is either reused or created.
The term::raw methods are converted to free functions accepting expressions as
input (but not yet generating expressions as output).

There is some friction because the original code base had four different expression
domains: `term`, `term::raw`, `selectable`, and `selectable::raw`. They don't match
exactly, so in some cases we need to add additional state to distinguish between them.
There are also many run-time checks introduced (on_internal_error) since the union
of the domains is much larger than each individual domain.

The method used is to erect a bi-directional bridge between term::raw and expressions,
convert various term::raw subclasses one by one, and then remove the bridge and
term::raw.

Test: unit (dev).

Closes #9170

* github.com:scylladb/scylla:
  cql3: expr: eliminate column_specification_or_tuple
  cql3: expr: hide column_specification_or_tuple
  cql3: term::raw: remove term::raw and scaffolding
  cql3: grammar: collapse conversions between term::raw and expressions
  cql3: relation: convert to_term() to experssions
  cql3: functions: avoid intermediate conversions to term::raw
  cql3: create_aggregate_statement: convert term::raw to expression
  cql3: update_statement, insert_statement: convert term::raw to expression
  cql3: select_statement: convert term::raw to expression
  cql3: token_relation: convert term::raw to expressions
  cql3: operation: convert term::raw to expression
  cql3: multi_column_relation: convert term::raw to expressions
  cql3: single_column_relation: convert term::raw to expressions
  cql3: column_condition: convert term::raw to expressions
  cql3: expr: don't convert subexpressions to term::raw during the prepare phase
  cql3: attributes: convert to expressions
  cql3: expr: introduce test_assignment_all()
  cql3: expr: expose prepare_term, test_assignment in the expression domain
  cql3: expr: provide a bridge between expressions and assignment_testable
  cql3: expr, user types: convert user type literals to expressions
  cql3: selection: make selectable.hh not include expr/expresion.hh
  cql3: sets, user types: move user types raw functions around
  cql3: expr, sets, maps: convert set and map literals to collection_constructor
  cql3: sets, maps, expr: move set and map raw functions around
  cql3: expr, lists: convert lists::literal to new collection_constructor
  cql3: lists, expr: move list raw functions around
  cql3: tuples, expr: convert tuples::literal to expr::tuple_constructor
  cql3: expr, tuples: deinline and move tuple raw functions
  cql3: expr, constants: convert constants::literal to untyped_constant
  cql3: constants: move constants::literal implementation around
  cql3: expr, abstract_marker: convert to expressions
  cql3: column_condition: relax types around abstact_marker::in_raw
  cql3: tuple markers: deinline and rearrange
  cql3: abstract_marker, term_expr: rearrange raw abstract marker implementation
  cql3: expr, constants: convert cql3::constants::null_literal to new cql3::expr::null
  cql3: expr, constants: deinline null_literal
  cql3: constants: extricate cql3::constants::null_literal::null_value from null_literal
  cql3: term::raw, expr: convert type casts to expressions
  cql3: type_cast: deinline some methods
  cql3: expr: prepare expr::cast for unprepared types
  cql3: expr, functions: move raw function calls to expressions
  cql3: expr, term::raw: add conversions between the two types
  cql3: expr, term::raw: add reverse bridge
  cql3: term::raw, expr: add bridge between term::raw and expressions
  cql3: eliminate multi_column_raw
  cql3: term::raw, multi_column_raw: unify prepare() signatures
2021-08-26 17:29:40 +03:00
Michał Chojnowski
126baa7850 utils: compact-radix-tree: fix accidental cache line bouncing
Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is
overwritten. But since nil_root is shared between shards, this causes severe
cache line bouncing. (It was observed to reduce the total write throughput
of Scylla by 90% on a large NUMA machine).

This backreference is never read anyway, so fix this bug by not writing it.

Fixes #9252

Closes #9246
2021-08-26 17:22:22 +03:00
Avi Kivity
2da7b79e16 cql3: expr: eliminate column_specification_or_tuple
column_specification_or_tuple is now used only internally, wrapping
a column_specification or a vector and immediately unwrapping it in
the callee. The only exceptions are bind_variable and tuple_constructor,
which handle both cases.

Use the underlying types directly instead, and add dispatching
to prepare_term_multi_column() for the two cases it handles.
2021-08-26 16:30:47 +03:00
Avi Kivity
ad285c3c84 cql3: expr: hide column_specification_or_tuple
column_specification_or_tuple was introduced since some terms
were prepared using a single receiver (e.g. receiver = <term>) and
some using multiple receivers (e.g. (r1, r2) = <term>). Some
term types supported both.

To hide this complexity, the term->expr conversion used a single
interface for both variations (column_specification_or_tuple), but now
that we got rid of the term class and there are no virtual functions
any more, we can just use two separate functions for the two variants.

Internally we still use column_specification_or_tuple; it can be
removed later.
2021-08-26 16:17:49 +03:00
Avi Kivity
158822c1a6 cql3: term::raw: remove term::raw and scaffolding
Nothing now uses term::raw, remove it and the scaffolding used to
migrate it to expressions.
2021-08-26 16:14:47 +03:00
Avi Kivity
78b7af415f cql3: grammar: collapse conversions between term::raw and expressions
The grammar now talks to expression API:s solely, so it can be converted
internally to expressions too. Calls to as_term_raw() and as_expression()
are removed, and productions return expressions instead of term::raw:s.
2021-08-26 15:56:44 +03:00
Avi Kivity
cb2560728a cql3: relation: convert to_term() to expressions
Now that the entire relation hierarchy was converted to expressions,
also convert relation::to_term().
2021-08-26 15:56:44 +03:00
Avi Kivity
f652972b12 cql3: functions: avoid intermediate conversions to term::raw
Instead, use conversions to assignment_testable and native expression
prepare functions.
2021-08-26 15:56:44 +03:00
Avi Kivity
8cd505d191 cql3: create_aggregate_statement: convert term::raw to expression
Straightforward substitution.
2021-08-26 15:53:27 +03:00
Avi Kivity
dd30b7853b cql3: update_statement, insert_statement: convert term::raw to expression
Straightforward substitution.
2021-08-26 15:42:30 +03:00
Avi Kivity
b11ec1aeda cql3: select_statement: convert term::raw to expression
Straightforward substitution; using std::optional<> since those
expressions are indeed optional.
2021-08-26 15:41:14 +03:00
Avi Kivity
cf10df10f4 cql3: token_relation: convert term::raw to expressions
Change term::raw in token_relation to expressions.

to_term() is not converted, since it's part of the larger relation
hierarchy.
2021-08-26 15:39:43 +03:00
Avi Kivity
c2d49b50f4 cql3: operation: convert term::raw to expression
Straightforward substitution.
2021-08-26 15:37:52 +03:00
Avi Kivity
b6e17ed111 cql3: multi_column_relation: convert term::raw to expressions
Change term::raw in multi_column_relation to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.

to_term() is not converted, since it's part of the larger relation
hierarchy.
2021-08-26 15:36:42 +03:00
Avi Kivity
4809cf7ff3 cql3: single_column_relation: convert term::raw to expressions
Change term::raw in single_column_relation to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.

to_term() is not converted, since it's part of the larger relation
hierarchy.
2021-08-26 15:35:32 +03:00
Avi Kivity
793aca8e4e cql3: column_condition: convert term::raw to expressions
Change term::raw in column_condition::raw to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.

to_term() is not converted, since it's part of the larger relation
hierarchy.
2021-08-26 15:34:13 +03:00
Avi Kivity
8cdb6a102f cql3: expr: don't convert subexpressions to term::raw during the prepare phase
Now that we have the prepare machinery exposed as expression API:s
(not just term::raw) we can avoid conversions from expressions to
term::raw when preparing subexpressions.
2021-08-26 15:34:01 +03:00
Avi Kivity
c93731a6e9 cql3: attributes: convert to expressions
Convert the three variables in attributes::raw to expressions. Since
those attributes are optional, use std::optional to indicate it
(since we can't rely on shared_ptr<term::raw> being null).
2021-08-26 15:32:52 +03:00
Avi Kivity
3c6914c5bf cql3: expr: introduce test_assignment_all()
The test_assignment class has a test_all() helper to test
a vector of assignment_testable. But expressions are not
derived from assignment_testable, so introduce a new helper
that does the same for expressions.
2021-08-26 15:30:46 +03:00
Avi Kivity
55fd8e69ec cql3: expr: expose prepare_term, test_assignment in the expression domain
So far prepare (in the term domain) was called via term::raw. To be
able to prepare in the expression domain, expose functions prepare_term()
and test_assignment() that accept expressions as arguments.

prepare_term() was chosen rather than prepare() to differentiate wrt.
the other domain that can be prepared (selectables).
2021-08-26 15:29:10 +03:00
Avi Kivity
be335f4dee cql3: expr: provide a bridge between expressions and assignment_testable
While we have a bridge between expressions and term::raw, which is
derived from assignment_testable, we will soon get rid of term::raw
and so won't be able to interface with API:s that require an
assignment_testable. So add a bridge for that. The user is
function::get(), which uses assignment_testable to infer the
function overload from the argument types.
2021-08-26 15:26:38 +03:00
Avi Kivity
562e68835b cql3: expr, user types: convert user type literals to expressions
Convert the user_types::literal raw to a new expression type
usertype_constructor. I used "usertype" to convey that it is a
((user type) constructor), not a (user (type constructor)).
2021-08-26 15:26:35 +03:00
Avi Kivity
4d7e00d0f8 cql3: selection: make selectable.hh not include expr/expression.hh
We have this dependency now:

   column_identifier -> selectable -> expression

and want to introduce this:

   expression -> user types -> column_identifier

This leads to a loop, since expression is not (yet) forward
declarable.

Fix by moving any mention of expression from selectable.hh to a new
header selection-expr.hh.
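
The loop described above can be checked mechanically. The following is a minimal sketch, not part of the patch: each arrow is treated as an include edge, the node names are shorthand taken from the message, `has_cycle` is a hypothetical helper, and the assumption that selection-expr.hh includes both selectable.hh and the expression header is ours.

```python
def has_cycle(edges):
    """Return True if the directed graph {node: [successors]} contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    nodes = set(edges) | {m for succ in edges.values() for m in succ}
    colour = {n: WHITE for n in nodes}

    def visit(node):
        colour[node] = GREY  # node is on the current DFS path
        for nxt in edges.get(node, []):
            if colour[nxt] == GREY:
                return True  # back edge: we returned to the current path
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK  # fully explored, no cycle through this node
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


# Include graph if the new dependency were added without the fix.
before = {
    "column_identifier": ["selectable"],
    "selectable": ["expression"],
    "expression": ["user_types"],
    "user_types": ["column_identifier"],
}

# Include graph with the expression mentions moved to selection-expr
# (assumed to include both selectable and expression).
after = {
    "column_identifier": ["selectable"],
    "selection-expr": ["selectable", "expression"],
    "expression": ["user_types"],
    "user_types": ["column_identifier"],
}

print(has_cycle(before), has_cycle(after))
```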

database.cc lost access to timeout_config, so adjust its includes
to regain it.
2021-08-26 15:19:14 +03:00
Avi Kivity
9d6bc7eae6 cql3: sets, user types: move user types raw functions around
Move them closer to prepare related functions for modification.
2021-08-26 15:15:59 +03:00
Avi Kivity
06bca067f8 cql3: expr, sets, maps: convert set and map literals to collection_constructor
Add set and map styles to collection_constructor. Maps are implemented as
collection_constructor{tuple_constructor{key, value}...}. This saves
having a new expression type, and reduces the effort to implement
recursive descent evaluation for this omitted expression type.
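
As an illustration of the representation described above, here is a small Python sketch (the real code is C++). The class names mirror the expression types named in the message, while `map_literal` and `evaluate` are hypothetical helpers and plain Python values stand in for constants.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, List


class Style(Enum):
    LIST = auto()
    SET = auto()
    MAP = auto()


@dataclass
class TupleConstructor:
    elements: List[Any]


@dataclass
class CollectionConstructor:
    style: Style
    elements: List[Any]


def map_literal(pairs):
    """Represent {k1: v1, k2: v2} as a collection of (key, value) tuples."""
    return CollectionConstructor(
        Style.MAP, [TupleConstructor([k, v]) for k, v in pairs]
    )


def evaluate(expr):
    """Recursive-descent evaluation; no dedicated map node is needed."""
    if isinstance(expr, TupleConstructor):
        return tuple(evaluate(e) for e in expr.elements)
    if isinstance(expr, CollectionConstructor):
        items = [evaluate(e) for e in expr.elements]
        if expr.style is Style.LIST:
            return items
        if expr.style is Style.SET:
            return set(items)
        return dict(items)  # MAP: every item is a (key, value) tuple
    return expr  # a plain value standing in for a constant
```

Because a map is just a collection of two-element tuples, the evaluator only needs the tuple and collection cases it already has.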
2021-08-26 15:13:37 +03:00
Avi Kivity
658cd47d21 cql3: sets, maps, expr: move set and map raw functions around
Move them closer to prepare related functions for modification. Since
sets and maps share some implementation details in the grammar, they
are moved and converted as a unit.
2021-08-26 15:13:07 +03:00
Avi Kivity
d2ab7fc26d cql3: expr, lists: convert lists::literal to new collection_constructor
Introduce a collection_constructor (similar to C++'s std::initializer_list)
to hold subexpressions being gathered into a list. Since sets, maps, and
lists construction share some attributes (all elements must be of the
same type) collection_constructor will be used for all of them, so it
also holds an enum. I used "style" for the enum since it's a weak
attribute - an empty set is also an empty map. I chose collection_constructor
rather than plain 'collection' to highlight that it's not the only way
to get a collection (selecting a collection column is another, as an
example) and to hint at what it does - construct a collection from
more primitive elements.
2021-08-26 15:10:41 +03:00
Avi Kivity
4defb42c86 cql3: lists, expr: move list raw functions around
Move them closer to prepare related functions for modification.
2021-08-26 15:08:14 +03:00
Avi Kivity
5e448e4a2a cql3: tuples, expr: convert tuples::literal to expr::tuple_constructor
Introduce tuple_constructor (not a literal, since (?, ?) and (column_value,
column_value) are not literals) to represent a tuple constructed from
subexpressions. In the future we can replace column_value_tuple
with tuple_constructor(column_value, column_value, ...), but this is
not done now.

I chose the name 'tuple_constructor' since other expressions can represent
tuples (e.g. my_tuple_column, :bind_variable_of_tuple_type,
func_returning_tuple()). It also explains what the expression does.
2021-08-26 15:07:15 +03:00
Avi Kivity
41c532f19c cql3: expr, tuples: deinline and move tuple raw functions
Move them closer to prepare functions for modification.
2021-08-26 15:04:21 +03:00
Avi Kivity
2c42a65db1 cql3: expr, constants: convert constants::literal to untyped_constant
Introduce a new expression untyped_constant that corresponds to
constants::literal, which is removed. untyped_constant is rather
ugly in that it won't exist post-prepare. We should probably instead
replace it with typed constants that use the widest possible type
(decimal and varint), and select a narrower type during the prepare
phase when we perform type inference. The conversion itself is
straightforward.
2021-08-26 15:03:07 +03:00
Avi Kivity
4d9bde561a cql3: constants: move constants::literal implementation around
Move it closer to prepare functions for modification.
2021-08-26 15:01:06 +03:00
Avi Kivity
838bfbd3e0 cql3: expr, abstract_marker: convert to expressions
Convert the four forms of abstract_marker to expr::bind_variable (the
name was chosen since variable is the role of the thing, while "marker"
refers more to the grammar). Having four variants is unnecessary, but
this patch doesn't do anything about that.
2021-08-26 15:01:04 +03:00
Avi Kivity
218f4d87f8 cql3: column_condition: relax types around abstract_marker::in_raw
We can only convert expressions to term::raw, not the subclass
abstract_marker::in_raw, so relax the types. They will all be converted
to expressions. Relaxing types isn't good, but the structure is enforced
now by the grammar (and dynamically using variant casts), and in the future
by a typecheck pass (which will allow us to remove the many variations
of markers).
2021-08-26 14:55:17 +03:00
Avi Kivity
6dcc43d227 cql3: tuple markers: deinline and rearrange
Move raw methods near to the other prepare-related functions.
2021-08-26 14:54:15 +03:00
Avi Kivity
35db2b34e4 cql3: abstract_marker, term_expr: rearrange raw abstract marker implementation
Move raw methods near to the other prepare-related functions.
2021-08-26 14:53:58 +03:00
lauranovich
e78746e94d docs: fix removal of master from website drop-down
Closes #9251
2021-08-26 14:51:37 +03:00
Avi Kivity
aba205917d cql3: expr, constants: convert cql3::constants::null_literal to new cql3::expr::null
Introduce cql3::expr::null and use it to represent null_literal, which is
removed.
2021-08-26 14:49:46 +03:00
Avi Kivity
5b42cbf9e0 cql3: expr, constants: deinline null_literal
Deinline null_literal methods and place them near the other prepare-related
functions.
2021-08-26 14:45:56 +03:00
Avi Kivity
51f62d5953 cql3: constants: extricate cql3::constants::null_literal::null_value from null_literal
null_literal (which is in the term::raw domain) will be converted to an
expression, so unnest the nested class null_value (which is in the term
domain and is not converted now).
2021-08-26 14:44:21 +03:00
Avi Kivity
10e08dc87e cql3: term::raw, expr: convert type casts to expressions
We reuse the expr::cast type that was previously used for selectables.
When preparing, subexpressions are converted to term::raw; this will
be removed later.
2021-08-26 14:42:55 +03:00
Avi Kivity
6f8b6aef17 cql3: type_cast: deinline some methods
These methods will be converted to the expression variant, and
it's impossible to do this while inlined due to #include cycles. In
any case, deinlining is better.

Since there is no type_cast.cc, and since they'll become part of
expr_term call chain soon, they're moved there, even though it seems
odd for this patch. It's a waste to create type_cast.cc just for those
three functions.
2021-08-26 14:41:38 +03:00
Avi Kivity
3d30c161e4 cql3: expr: prepare expr::cast for unprepared types
The cast expression has two operands: the subexpression to cast and the
type to cast to. Since prepared and unprepared expressions are the
same type, we don't have to do anything, but prepared and unprepared
types are different. So add a variant to be able to support both.

The reason the selectable->expression transformation did not need to
do this is that casts in a selector cannot accept a user defined type.
Note those casts also have different syntax and different execution,
so we'll have to choose whether to unify the two semantics, or whether
to keep them separate. This patch does not force anything (but does hint
at unification by not including any discriminant beyond the type's
rawness). The string representation matches the part of the grammar
it was derived from (or conversion back to CQL will yield wrong
results).
2021-08-26 14:39:33 +03:00
Avi Kivity
b76395a410 cql3: expr, functions: move raw function calls to expressions
Remove cql3::functions::function_call::raw and replace it with
cql3::expr::function_call, which already existed from the selector
migration to expressions. The virtual functions implementing term::raw
are made free functions and remain in place, to ease migration and
review.

Note that preparing becomes more complicated as it needs to
account for anonymous functions, which were not representable
in the previous structure (and still cannot be created by the
parser for the term::raw path).

The parser now wraps all its arguments with the term::raw->expr
bridge, since that's what expr::function_call expects, and in
turn wraps the function call with an expr->term::raw bridge, since
that's what the rest of the parser expects. These will disappear
when the migration completes.
2021-08-26 14:38:16 +03:00
Avi Kivity
0d24af7775 cql3: expr, term::raw: add conversions between the two types
Add a way to convert between the old world and the new, and back. Note
that instead of blindly wrapping, we unwrap if we received a wrapped
object.
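
The unwrap-instead-of-wrap rule can be sketched as follows. This is a Python illustration only: `TermRawExpression` corresponds to the term_raw_expression bridge and `as_expression`/`as_term_raw` to the conversion calls named elsewhere in this series, while `WrappedTermRaw` is a hypothetical name for the reverse bridge.

```python
class TermRaw:
    """Stand-in for the old-world term::raw base class."""


class Expression:
    """Stand-in for the new-world expression type."""


class TermRawExpression(TermRaw):
    """A term::raw that holds an expression."""

    def __init__(self, expr):
        self.expr = expr


class WrappedTermRaw(Expression):
    """An expression that holds a term::raw (hypothetical name)."""

    def __init__(self, raw):
        self.raw = raw


def as_expression(raw):
    # If the term::raw is really a wrapped expression, hand back the original.
    if isinstance(raw, TermRawExpression):
        return raw.expr
    return WrappedTermRaw(raw)


def as_term_raw(expr):
    # If the expression is really a wrapped term::raw, hand back the original.
    if isinstance(expr, WrappedTermRaw):
        return expr.raw
    return TermRawExpression(expr)
```

With this rule a round trip returns the very same object, so repeated conversions during the migration do not build up layers of wrappers.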
2021-08-26 14:35:46 +03:00
Avi Kivity
a5031dd5bf cql3: expr, term::raw: add reverse bridge
Since expressions can nest, and since we won't convert everything at once,
add a way to store a term::raw as an expression. We can now have a
term::raw that is internally an expression, and an expression that is
implemented as term::raw.
2021-08-26 14:32:04 +03:00
Avi Kivity
725065b066 cql3: term::raw, expr: add bridge between term::raw and expressions
A term_raw_expression is a term::raw that holds an expression. It will
be used to incrementally convert the source base to expressions, while
still exposing the result to the common interface of shared_ptr<term::raw>.
2021-08-26 14:14:18 +03:00
Avi Kivity
9a158cd7b5 cql3: eliminate multi_column_raw
Now that the signatures of term::raw::prepare and multi_column_raw::prepare
are identical, we can eliminate multi_column_raw, replacing it with
term::raw where needed. In some cases we delete it from the inheritance chain
since we reach term::raw via a different base class.

Note that a dynamic_cast<> is eliminated, so we compensate for the addition
of runtime checks in the previous patch by the deletion of runtime checks
in this patch.
2021-08-26 14:11:42 +03:00
Avi Kivity
660be97028 cql3: term::raw, multi_column_raw: unify prepare() signatures
In order to replace the term::raw hierarchy with expressions,
we need to unify the signatures of term::raw::prepare() and
term::multi_column_raw::prepare(). This is because we'll only have
one expression type to represent both single values and tuples
(although different subexpression types may be used).

The difference in the two prepare() signatures is the
`receiver` parameter - which is a (type, name) pair used
to perform type inference on the expression being prepared,
with the name used to report errors. In a perfect world, this
would just be an expression - a tuple or a singular expression
as the case requires. But we don't have the needed expression
infrastructure yet - general tuples or name-annotated expressions.

Resolve the problem by introducing a variant for the single-value
and tuple. This is more or less creating a mini-expression type
used just for this. Once our expression type grows the needed
capabilities, it can replace this type.

Note that for some cases, this replaces compile-time checks by
runtime checks (which should never trigger). In other cases
the classes really needed both interfaces, so the new variant
is a better fit.
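
A rough Python analogue of the unified signature, offered as a sketch rather than the actual C++ interface: a `Union` stands in for the variant, `ColumnSpecification` is reduced to the (type, name) pair described above, and `prepare` is a hypothetical free function.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ColumnSpecification:
    name: str  # used for error reporting
    type: str  # used for type inference


# Stand-in for the variant: one receiver, or one receiver per tuple element.
Receiver = Union[ColumnSpecification, List[ColumnSpecification]]


def prepare(value, receiver: Receiver):
    """Single prepare() entry point that dispatches on the receiver's shape."""
    if isinstance(receiver, ColumnSpecification):
        return (receiver.name, receiver.type, value)
    # Tuple case: this is a runtime check where two separate signatures
    # would have rejected the mismatch at compile time.
    if not isinstance(value, tuple) or len(value) != len(receiver):
        raise TypeError("tuple receiver needs a tuple value of the same arity")
    return [prepare(v, r) for v, r in zip(value, receiver)]
```

The `TypeError` branch is the kind of runtime check the message says should never trigger in practice.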
2021-08-26 14:11:42 +03:00
Avi Kivity
9bf3b9f964 Merge 'Some IDL compiler cleanups' from Pavel Solodovnikov
This series incorporates various refactorings aimed mostly
at eliminating extra parameters to `serializer_*_impl` functions
for `EnumDef` and `ClassDef` AST classes.

Instead of carrying these parameters here and there over many
places, they are calculated on a preliminary run to collect
additional metadata, such as: namespaces and template parameters
from parent scopes. This metadata is used later to extend AST
classes.

The patchset does not introduce any changes in the generation
procedures, exclusively dealing with internal code structuring.

NOTE: although metadata collection involves an extra run through
the parse tree, the proper way should be to populate it instantly
while parsing the input. This is left to be adjusted later in a
follow-up series.
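
A minimal sketch of such a preliminary pass, assuming a toy AST; the `Node` class and the attribute names `ns_context` and `parent_template_params` are hypothetical and are not taken from the IDL compiler itself.

```python
class Node:
    """Toy AST node; `kind` is 'namespace', 'class' or 'enum'."""

    def __init__(self, kind, name, children=(), template_params=()):
        self.kind = kind
        self.name = name
        self.children = list(children)
        self.template_params = list(template_params)
        # Filled in by collect_metadata().
        self.ns_context = []
        self.parent_template_params = []

    def ns_qualified_name(self):
        return "::".join(self.ns_context + [self.name])


def collect_metadata(node, namespaces=(), parent_params=()):
    """Preliminary pass: cache enclosing namespaces and template params on each node."""
    node.ns_context = list(namespaces)
    node.parent_template_params = list(parent_params)
    if node.kind == "namespace":
        namespaces = namespaces + (node.name,)
    elif node.kind == "class":
        parent_params = parent_params + tuple(node.template_params)
    for child in node.children:
        collect_metadata(child, namespaces, parent_params)
```

After the pass, generation routines can read the cached values from the node instead of receiving them as extra arguments.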

Closes #8148

* github.com:scylladb/scylla:
  idl: add descriptions for the top-level generation routines
  idl: make ns_qualified name a class method
  idl: cache template declarations inside enums and classes
  idl: cache parent template params for enums and classes
  idl: rename misleading `local_types` to `local_writable_types`
  idl: remove remaining uses of `namespaces` argument
  idl: remove `is_final` function and use `.final` AST class property
  idl: remove `parent_template_param` from `local_types` set
  idl: cache namespaces in AST nodes
  idl: remove unused variables
2021-08-26 13:18:54 +03:00
Benny Halevy
4ffdafe6dc token_metadata: delete old java code
We no longer need to keep it for reference.
It's just causing confusion at this point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210826095457.994834-1-bhalevy@scylladb.com>
2021-08-26 13:03:59 +03:00
Pekka Enberg
a53c1949cd Update tools/jmx submodule
* tools/jmx 5311e9b...70b19e6 (1):
  > scrub: support scrubMode and deprecate skipCorrupted
2021-08-26 12:27:13 +03:00
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Pekka Enberg
bd8fa47d84 Update tools/java submodule
* tools/java 4ef8049e07...0b6ecbeb90 (1):
  > nodetool scrub: support --mode and deprecate --skip-corrupted
2021-08-26 11:07:14 +03:00
Dejan Mircevski
5a4ac002c1 cql3: Track warnings in prepared_statement
Preparation should be able to record warnings that make it back to the
user via the query response.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-25 17:51:05 -04:00
Avi Kivity
acf8da2bce Merge "flat_mutation_reader: keep timeout in permit" from Benny
"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit.  This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.

The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.

Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.

$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10

Before:
102500.38 tps ( 75.1 allocs/op,  12.1 tasks/op,   45620 insns/op)

After:
103957.53 tps ( 75.1 allocs/op,  12.1 tasks/op,   45372 insns/op)

Test: unit(dev)
DTest:
    repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
    materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
    materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
    migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"

* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
  flat_mutation_reader: get rid of timeout parameter
  reader_concurrency_semaphore: use permit timeout for admission
  reader_concurrency_semaphore: adjust reactivated reader timeout
  multishard_mutation_query: create_reader: validate saved reader permit
  repair: row_level: read_mutation_fragment: set reader timeout
  flat_mutation_reader: maybe_timed_out: use permit timeout
  test: sstable_datafile_test: add sstable_reader_with_timeout
  reader_permit: add timeout member
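
For reference, the relative change implied by the perf_simple_query figures quoted above can be recomputed directly. With these two samples it works out to roughly +1.4% tps and -0.5% insns/op; `percent_change` is a helper written for this note.

```python
before = {"tps": 102500.38, "insns_per_op": 45620}
after = {"tps": 103957.53, "insns_per_op": 45372}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


tps_gain = percent_change(before["tps"], after["tps"])
insns_change = percent_change(before["insns_per_op"], after["insns_per_op"])
print(f"tps: {tps_gain:+.2f}%  insns/op: {insns_change:+.2f}%")
```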
2021-08-25 17:51:10 +03:00
Raphael S. Carvalho
a4053dbb72 repair: Postpone data segregation to off-strategy compaction
With data segregation on repair, thousands of sstables are potentially
added to maintenance set which causes high latency due to stalls.

That's because N*M sstables are created by a repair,
	where N = # of ranges
	and M = # of segregations

For TWCS, M = # of windows.

Assuming N = 768 and M = 20, ~15k sstables end up in sstable set

To fix this problem, let's avoid performing data segregation in repair,
as offstrategy will already perform the segregation anyway.

So from now on, only N non-overlapping sstables will be added to set.
Read amplification isn't affected because a query will only touch one
sstable in maintenance set.
When offstrategy starts, it will pick all sstables from set and
compact them in a single step while performing data segregation,
so data is properly laid out before integrated into the main set.
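
The sstable counts above follow from simple arithmetic, sketched here with a hypothetical helper:

```python
def sstables_added_by_repair(num_ranges, num_segregations, segregate_in_repair):
    """Number of sstables a repair adds to the maintenance set."""
    if segregate_in_repair:
        return num_ranges * num_segregations  # N * M
    return num_ranges  # N non-overlapping sstables


old_count = sstables_added_by_repair(768, 20, segregate_in_repair=True)
new_count = sstables_added_by_repair(768, 20, segregate_in_repair=False)
print(old_count, new_count)
```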

tests:
	- sstable_compaction_test.twcs_reshape_with_disjoint_set_test
	- mode(dev)
	- manual test using repair-based bootstrap

Fixes #9199.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210824185043.76475-1-raphaelsc@scylladb.com>
2021-08-25 15:31:38 +03:00
Pavel Emelyanov
b012040a76 mutation: Keep range tombstone in tree when consuming
Current code std::move()-s the range tombstone into the consumer, thus
moving the tombstone's linkage to the containing list as well. As
a result, the original range tombstone itself leaks, as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.

The immediate fix is to keep the tombstone linked to the list while
moving.

fixes: #9207

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
2021-08-25 13:25:18 +03:00
Botond Dénes
6df77e350a mutation_fragment{_v2}: MutationFragmentConsumer: allow for abstract consumer
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210825083244.436274-1-bdenes@scylladb.com>
2021-08-25 13:12:41 +03:00
Avi Kivity
993f824cfd Merge "raft: implement linearisable reads on a follower" from Gleb and Kostja
"
This series implements section 6.4 of the Raft PhD. It allows
linearisable reads on a follower, bypassing the raft log entirely. After this
series, server::read_barrier can be executed on a follower as well as on the
leader, and after it completes the local user's state machine state can be
accessed directly.
"

* 'raft-read-v9' of github.com:scylladb/scylla-dev:
  raft: test: add read_barrier test to replication_test
  raft: test: add read_barrier tests to fsm_test
  raft: make read_barrier work on a follower as well as on a leader
  raft: add a function to wait for an index to be applied
  raft: (server) add a helper to wait through uncertainty period
  raft: make fsm::current_leader() public
  raft: add hasher for raft::internal::tagged_uint64
  serialize: add serialized for std::monostate
  raft: fix indentation in applier_fiber
2021-08-25 13:11:35 +03:00
Gleb Natapov
3ff6f76cef raft: test: add read_barrier test to replication_test
2021-08-25 08:57:13 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test
2021-08-25 08:57:13 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements a Raft extension that allows linearisable
reads by accessing the local state machine. The extension is described
in section 6.4 of the PhD. To sum it up: to perform a read barrier, a
follower needs to ask the leader for the last committed index that it
knows about. The leader must make sure that it is still the leader before
answering, by communicating with a quorum. When the follower gets the index
back, it waits for it to be applied, and that completes the read_barrier
invocation.

The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader for a safe index it can read. The first two are used by a
leader to communicate with a quorum.
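
The flow described above can be sketched as a synchronous toy model. This is an illustration under simplifying assumptions, not the implementation: there is no real RPC or log, the quorum check is reduced to counting peers that still regard the server as leader, and `apply_up_to` stands in for waiting on the applier.

```python
class NotALeader(Exception):
    pass


class Server:
    def __init__(self, name):
        self.name = name
        self.leader = None    # who this server believes is the leader
        self.peers = []       # every other server in the cluster
        self.commit_idx = 0   # highest index known to be committed
        self.applied_idx = 0  # highest index applied to the local state machine

    def execute_read_barrier_on_leader(self):
        # Leader side: confirm leadership with a quorum before answering.
        acks = 1 + sum(1 for p in self.peers if p.leader is self)
        cluster_size = len(self.peers) + 1
        if self.leader is not self or acks <= cluster_size // 2:
            raise NotALeader(self.name)
        return self.commit_idx

    def apply_up_to(self, idx):
        # Stand-in for waiting until the applier has caught up to idx.
        self.applied_idx = max(self.applied_idx, idx)

    def read_barrier(self):
        # Obtain a safe index from the leader, then wait for it to be applied.
        safe_idx = self.leader.execute_read_barrier_on_leader()
        self.apply_up_to(safe_idx)
        return safe_idx
```

Once `read_barrier()` returns, the local state machine reflects at least everything that was committed when the leader answered, which is what makes a local read safe.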
2021-08-25 08:57:13 +03:00
Gleb Natapov
73af7edc78 raft: add a function to wait for an index to be applied
2021-08-25 08:19:25 +03:00
Konstantin Osipov
0429196e06 raft: (server) add a helper to wait through uncertainty period
Add a helper to be able to wait until a Raft cluster
leader is elected. It can be used to avoid sleeps
when it's necessary to forward a request to the leader,
but the leader is yet unknown.
2021-08-25 08:19:25 +03:00
Gleb Natapov
376785042f raft: make fsm::current_leader() public
A later patch will call it from the server class.
2021-08-25 08:19:25 +03:00
Gleb Natapov
273f753815 raft: add hasher for raft::internal::tagged_uint64
Needed to be able to use tagged_uint64 as a key in an unordered map.
2021-08-25 08:19:25 +03:00
Gleb Natapov
4851d64c68 serialize: add serialized for std::monostate
2021-08-25 08:19:25 +03:00
Gleb Natapov
bd0fd579cf raft: fix indentation in applier_fiber
2021-08-25 08:19:25 +03:00
Nadav Har'El
cf06b7cd40 test/alternator: correct some typos in comments
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210729125317.1610573-1-nyh@scylladb.com>
2021-08-24 19:43:29 +03:00
Avi Kivity
4a42b69ba8 Merge "raft: testing: many nodes test" from Alejo
"
Factor out replication test, make it work with different clocks, add
some features, and add a many nodes test with steady_clock. Also
refactor common test helper.

Many nodes test passes for release and dev and normal tick of 100ms for
up to 1000 servers. For debug mode it's much fewer due to lack of
optimizations so it's only tested for smaller numbers.

Tests: unit ({dev}), unit ({debug}), unit ({release})
"

* 'raft-many-22-v12' of https://github.com/alecco/scylla: (21 commits)
  raft: candidate timeout proportional to cluster size
  raft: testing: many nodes test
  raft: replication test: remove unused tick_all
  raft: replication test: delays
  raft: replication test: packet drop rpc helper
  raft: replication test: connectivity configuration
  raft: replication test: rpc network map in raft_cluster
  raft: replication test: use minimum granularity
  raft: replication test: minor: rename local to int ids
  raft: replication test: fix restart_tickers when partitioning
  raft: replication test: partition ranges
  raft: replication test: isolate one server
  raft: replication test: move objects out of header
  raft: replication test: make dummy command const
  raft: replication test: template clock type
  raft: replication test: tick delta inside raft_cluster
  raft: replication test: style - member initializer
  raft: replication test: move common code out
  raft: testing: refactor helper
  raft: log election stages
  ...
2021-08-24 17:05:05 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
4e3dcfd7d6 reader_concurrency_semaphore: use permit timeout for admission
Now that the timeout is stored in the reader
permit, use it for admission rather than a timeout
parameter.

Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which propagated to
maybe_wait_readmission, but it seems to be
an oversight of the f_m_r api that doesn't
pass a timeout to next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
9b0b13c450 reader_concurrency_semaphore: adjust reactivated reader timeout
Update the reader's timeout where needed
after unregistering inactive_read.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
605a1e6943 multishard_mutation_query: create_reader: validate saved reader permit
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
eeab5f77d9 repair: row_level: read_mutation_fragment: set reader timeout
The timeout needs to be propagated to the reader's permit.
Reset it to db::no_timeout in repair_reader::pause().

Warn if set_timeout asks to change the timeout too far into the
past (100ms).  It is possible that it will be passed a
past timeout from the rpc path, where the message timeout
is applied (as duration) over the local lowres_clock time
and parallel read_data messages that share the query may end
up having close, but different timeout values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:40 +03:00
Benny Halevy
f25aabf1b2 flat_mutation_reader: maybe_timed_out: use permit timeout
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
46fb7fe68e test: sstable_datafile_test: add sstable_reader_with_timeout
Verify that the sstable reader (for the highest supported version)
times out properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
fe479aca1d reader_permit: add timeout member
To replace the timeout parameter passed
to flat_mutation_reader methods.
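
The idea can be sketched in Python as follows. The real classes are C++; everything here is a simplified stand-in, and the injectable `clock` parameter is an assumption made so the sketch can be exercised deterministically.

```python
import time


class TimedOut(Exception):
    pass


class ReaderPermit:
    def __init__(self, timeout=None):
        # Absolute deadline on the reader's clock; None means no timeout.
        self._timeout = timeout

    def timeout(self):
        return self._timeout

    def set_timeout(self, timeout):
        self._timeout = timeout


class Reader:
    def __init__(self, permit, fragments, clock=time.monotonic):
        self._permit = permit
        self._fragments = iter(fragments)
        self._clock = clock

    def _maybe_timed_out(self):
        deadline = self._permit.timeout()
        if deadline is not None and self._clock() >= deadline:
            raise TimedOut()

    def next_fragment(self):
        # No timeout argument: the deadline comes from the permit.
        self._maybe_timed_out()
        return next(self._fragments, None)
```

Since the deadline lives on the permit, a paused reader can be given a new deadline with `set_timeout()` when it is resumed, without changing any call signatures.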

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Alejo Sanchez
a5c74a6442 raft: candidate timeout proportional to cluster size
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.

Debug mode is too slow for a test of 1000 nodes so it's disabled, but
the test passes for release and dev modes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
7206eae16e raft: testing: many nodes test
Tests with many nodes and realistic timers and ticks.

Network delays are kept as a fraction of ticks. (e.g. 20/100)

Tests with 600 or more nodes hang in debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
87a03a3485 raft: replication test: remove unused tick_all
Tests now wait for normal ticks for election, remove deprecated tick_all
helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
14c214d73e raft: replication test: delays
Allow test supplied delays for rpc communication.

Allow supplying network delay, local delay (nodes within the same
server), how many nodes are local, and an extra small delay simulating
local load.

Modify rpc class to support delays. If delays are enabled, it no longer
directly calls the other node's server code but it schedules it to be
called later. This makes the test more realistic as in the previous
version the first candidate was always going to get to all followers
first, preventing a dueling candidates scenario.

Previously, tickers were all scheduled at the same time, so there was no
spread of them across the tick time. Now these tickers are scheduled
with a uniform spread across this time (tick delta).
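
A minimal sketch of one possible reading of the spread, assuming "uniform"
means evenly spaced start offsets (it could equally mean uniformly random
ones). `ticker_offsets` is a hypothetical helper, not code from the test:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: first-fire offsets for n tickers, evenly spaced over
// one tick delta, so that ticker i starts at i * tick_delta / n.
inline std::vector<std::chrono::milliseconds>
ticker_offsets(std::size_t n, std::chrono::milliseconds tick_delta) {
    using rep = std::chrono::milliseconds::rep;
    std::vector<std::chrono::milliseconds> offsets;
    offsets.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Integer arithmetic on the raw count; sub-millisecond parts are truncated.
        offsets.emplace_back(tick_delta.count() * static_cast<rep>(i) / static_cast<rep>(n));
    }
    return offsets;
}

inline void ticker_offsets_example() {
    auto offsets = ticker_offsets(4, std::chrono::milliseconds(100));
    assert(offsets.size() == 4);
    assert(offsets[1].count() == 25);
}
```

With four tickers and a 100ms tick delta the offsets are 0, 25, 50 and 75ms.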

Also previously, custom free elections used tick_all(), which
traversed _in_configuration sequentially and ticked each server. This,
combined with rpc outbound directly calling methods in the other server
without yielding, made free elections unrealistic: the order was
predetermined and the first candidate always won. This patch changes this
behavior. The free election uses normal tickers (now uniformly
distributed in tick delay time) and its loop waits for tick delay time
(yielding) and checks if there's a new leader. Also note the order might
not be the same in debug mode if more than one tick is scheduled.

As rpc messages are sent delayed, network connectivity needs to be
checked again before calling the function on the remote side.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:05:53 +02:00
Alejo Sanchez
db23823c77 raft: replication test: packet drop rpc helper
Add a helper to check if a packet should be dropped.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
497af3167f raft: replication test: connectivity configuration
Pass packet drops within connectivity configuration struct.
Default to no packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4d5428e8a raft: replication test: rpc network map in raft_cluster
Move rpc network map to raft cluster, no longer as static in rpc class.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
192ac5be4c raft: replication test: use minimum granularity
seastar lowres_clock minimum granularity is 10ms, not 1ms.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
5cfe6c1ca2 raft: replication test: minor: rename local to int ids
For clarity, name 0-based integer ids as int ids not local.
This is in contrast with 1-based UUID ids.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
27d90f0165 raft: replication test: fix restart_tickers when partitioning
When partitioning, elect_new_leader restarts tickers, so don't
re-restart them in this case.

When leader is dropped and no new leader is specified, restart tickers
before free election.

If no change of leader, restart tickers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4262291f2 raft: replication test: partition ranges
Allow specifying ranges within partition to handle large number of
nodes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
56a110d42f raft: replication test: isolate one server
Support disconnection of one server with the rest.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6b3327c753 raft: replication test: move objects out of header
Use a separate cc file for definitions and objects.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cea18e6830 raft: replication test: make dummy command const
Make dummy command const in header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
2db3192ac3 raft: replication test: template clock type
Templatize clock type.

Use a struct for run_test to work around
https://bugs.llvm.org/show_bug.cgi?id=50345

With help from @kbr-

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cb35588fb1 raft: replication test: tick delta inside raft_cluster
Store tick delta inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
49cb040037 raft: replication test: style - member initializer
Fix raft_cluster constructor member initializer list.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6e2ab657b3 raft: replication test: move common code out
Common replication test code moved to header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
a6cd35c512 raft: testing: refactor helper
Move definitions to helper object file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
466972afb0 raft: log election stages
Add logging for election tracing.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
617d6df42c raft: log with method name
Use standard log format function[id] for log entries in fsm.cc

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Takuya ASADA
2a8b48b6fa configure.py: add --dist-only for packaging development
Add --dist-only option to disable compiling code, just build packages.
It significantly speeds up rebuilding packages and makes packaging
development easier.

Closes #9227
2021-08-23 18:38:35 +03:00
Avi Kivity
22d2a815c9 transport: server.hh: trim unneeded cql3 includes
query_processor.hh can be replaced with a forward declaration, and
result-message headers, and values.hh is unneeded.

Closes #9238
2021-08-23 18:09:22 +03:00
Avi Kivity
115d14028b Merge 'Allow multi-parameter user-defined aggregates' from Piotr Sarna
Due to an overzealous assertion put in the code (in one of the last iterations, by the way!) it was impossible to create an aggregate which accepts multiple arguments. The behavior is now fixed, and a test case is provided for it.

Tests: unit(release)

Closes #9211

* github.com:scylladb/scylla:
  cql-pytest: add test case for UDA with multiple args
  cql3: fix aggregates with > 1 argument
2021-08-23 17:45:58 +03:00
Pavel Solodovnikov
22794efc22 db: add experimental option for raft
Introduce `raft` experimental option.
Adjust the tests accordingly to accommodate the new option.

It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.

Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.

Later, other raft-related things, such as raft schema
changes, will also use this flag.

Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.

This will be done in a separate series, probably related to
implementing the feature itself.

Tests: unit(dev)

Ref #9239.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
2021-08-23 17:45:58 +03:00
Nadav Har'El
49aea3b301 Merge 'database: coroutinize schema load functions' from Avi Kivity
Simple coroutinization of the schema load functions, leaving the code tidier.

Test: unit (dev)

Closes #9217

* github.com:scylladb/scylla:
  database: adjust indentation after coroutinization of schema table parsing code
  database: convert database::parse_schema_tables() to a coroutine
  database: remove unneeded temporary in do_parse_schema_tables()
  database: convert do_parse_schema_tables() to a coroutine
2021-08-23 17:45:58 +03:00
Nadav Har'El
d598a94b43 Merge: everywhere: mark deferred actions noexcept
Merged patch series by Benny Halevy:

Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).

Test: unit(dev, debug)

* tag 'deferred_action-noexcept-v1' of github.com:bhalevy/scylla:
  everywhere: make deferred actions noexcept
  cql3: prepare_context: mark methods noexcept
  commitlog: segment, segment_manager: mark methods noexcept
  everywhere: cleanup defer.hh includes
2021-08-23 11:16:17 +03:00
Avi Kivity
1b492396c1 stream_session.cc: trim unneeded includes
stream_session.cc doesn't need storage_proxy, or sstables, or the
system keyspace. Remove them.

Closes #9230
2021-08-23 10:57:04 +03:00
Eliran Sinvani
b33479f731 Micro Benchmark: Fix division by zero in 'perf_fast_forward'
Commit 8d6e575 introduced a new stat, instructions per fragment.
Computing this new stat can end with a division by zero when
the number of fragments read is 0. Here we fix it by reporting
0 ins/f when no fragments were read.
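
The guard can be sketched as follows. `instructions_per_fragment` is a
hypothetical name chosen for illustration, not the function in
perf_fast_forward:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: instructions per fragment, defined as 0 when no
// fragments were read, so the division is never attempted with a zero divisor.
inline double instructions_per_fragment(std::uint64_t instructions, std::uint64_t fragments) {
    if (fragments == 0) {
        return 0.0;
    }
    return static_cast<double>(instructions) / static_cast<double>(fragments);
}

inline void instructions_per_fragment_example() {
    assert(instructions_per_fragment(100, 0) == 0.0);
    assert(instructions_per_fragment(100, 4) == 25.0);
}
```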

Fixes #9231

Closes #9232
2021-08-23 10:55:44 +03:00
Avi Kivity
6221b90b89 secondary_index_manager: stop including expression.hh
Use a forward declaration of cql3::expr::oper_t to reduce the
number of translation units depending on expression.hh.

Before:

    $ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
    272

After:

    $ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
    154

Some translation units adjust their includes to restore access
to required headers.

Closes #9229
2021-08-22 21:21:46 +03:00
Benny Halevy
e9aff2426e everywhere: make deferred actions noexcept
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
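
A minimal sketch of what such a requirement looks like, assuming C++17.
`scope_guard` is a hypothetical stand-in written for this note; it is not
seastar's deferred_action and makes no claim about how seastar enforces
the rule:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical scope guard: rejects, at compile time, any callable that may
// throw or that returns a value.
template <typename Func>
class scope_guard {
    static_assert(std::is_nothrow_invocable_v<Func&>,
                  "deferred action must be noexcept");
    static_assert(std::is_void_v<std::invoke_result_t<Func&>>,
                  "deferred action must return void");
    Func _func;
    bool _armed = true;
public:
    explicit scope_guard(Func func) noexcept(std::is_nothrow_move_constructible_v<Func>)
        : _func(std::move(func)) {}
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
    // Runs the action unless it was cancelled; safe in a destructor because
    // the action cannot throw.
    ~scope_guard() {
        if (_armed) {
            _func();
        }
    }
    void cancel() noexcept { _armed = false; }
};

// Returns how many times the cleanup ran once the guard left scope.
inline int run_with_guard() {
    int cleanups = 0;
    {
        scope_guard guard([&cleanups]() noexcept { ++cleanups; });
    }
    return cleanups;
}
```

Dropping `noexcept` from the lambda turns the mistake into a compile error
rather than a potential std::terminate during stack unwinding.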

Test: unit(dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:52 +03:00
Benny Halevy
eba4191223 cql3: prepare_context: mark methods noexcept
Prepare for marking deferred actions noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:40 +03:00
Benny Halevy
ef8ec54970 commitlog: segment, segment_manager: mark methods noexcept
Prepare for marking deferred actions noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:40 +03:00
Benny Halevy
4439e5c132 everywhere: cleanup defer.hh includes
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:39 +03:00
Vlad Zolotarov
7bd1bcd779 loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead
The iterators interface of loading_shared_values/loading_cache is
dangerous/fragile because an iterator doesn't "lock" the entry it points to.
If there is a preemption point between acquiring a non-end() iterator and
dereferencing it, the corresponding cache entry may have already been evicted
(for whatever reason, e.g. cache size constraints or expiration).
Dereferencing may then end up in a use-after-free, and we don't have any
protection against it in the value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface this patch kills two
birds in a single shot: we are ditching the iterators interface
completely and return value_ptr from find(...) instead - the same one we
are returning from loading_cache::get_ptr(...) asynchronous APIs.

A similar rework is done to the loading_shared_values that loading_cache is
based on: we drop iterators interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "lock"ing the returned value so that it
would remain readable even if it's evicted from the cache by the time
one tries to read it. And of course it also takes care of updating the
last read time stamp and moving the corresponding item to the top of the
MRU list.
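
The pinning idea can be illustrated with a toy cache, assuming C++17.
`toy_cache` and its `std::shared_ptr`-based `value_ptr` are hypothetical
stand-ins; the real loading_cache::value_ptr also updates the read timestamp
and the MRU list, which this sketch leaves out:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Toy cache, not the real loading_cache: find() hands out an owning pointer,
// so the value stays readable for as long as the caller holds that pointer.
class toy_cache {
public:
    using value_ptr = std::shared_ptr<const std::string>;

    void put(int key, std::string value) {
        _entries[key] = std::make_shared<const std::string>(std::move(value));
    }
    // Returns a null pointer when the key is absent.
    value_ptr find(int key) const {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return value_ptr{};
        }
        return it->second;
    }
    void evict(int key) { _entries.erase(key); }

private:
    std::unordered_map<int, value_ptr> _entries;
};

// Reads a value after its entry was evicted; safe because ptr pins the value.
inline std::string read_after_evict() {
    toy_cache cache;
    cache.put(1, "value");
    auto ptr = cache.find(1);
    cache.evict(1);
    return *ptr;
}
```

An iterator into `_entries` would have been invalidated by `evict()`, which
is the hazard the commit message describes.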

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
2021-08-22 16:49:40 +03:00
Takuya ASADA
5b62bebbb6 scylla_io_setup: check root privilege on root mode
This is a side effect of allowing scylla_io_setup to run in nonroot mode:
the script can now run as a non-root user even when the installation is
not in nonroot mode.

As a result, the script eventually fails to write io_properties.yaml
with a permission denied error. Since the evaluation takes a long time,
we should run the permission check before starting it.

We need to add the root privilege check again, but skip it in nonroot mode.

Fixes #8915

Closes #8984
2021-08-22 16:49:40 +03:00
Botond Dénes
714ff8b758 docs/guides/debugging.md: mention the debuginfo package pitfall
Add a note to the "Obtaining the relocatable packages" section and
a separate entry to Troubleshooting.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210819110459.159733-1-bdenes@scylladb.com>
2021-08-22 16:49:40 +03:00
Botond Dénes
13080794d6 docs/guides/debugging.md: recommend symlinking instead of installing
When setting up the env. Install no longer works as it depends on
systemd.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210819110419.159351-1-bdenes@scylladb.com>
2021-08-22 16:49:40 +03:00
Tomasz Grabiec
865d072756 Merge 'sstables: convert parse(high level types) to a coroutine' from Avi Kivity
The parse() functions of high-level sstable metadata types are
trivial straight-line code and can be easily simplified by
conversion to coroutines.

Test: unit (dev)

Closes #9224

* github.com:scylladb/scylla:
  sstables: parse(*): adjust indentation after coroutine conversion
  sstables: parse(compression&): eliminate unnecessary indirection
  sstables: convert parse(compression&) to a coroutine
  sstables: convert parse(commitlog_interval&) to a coroutine
  sstables: parse(streaming_histogram&): eliminate unnecessary indirection
  sstables: convert parse(streaming_histogram&) to a coroutine
  sstables: convert parse(estimated_histogram&) to a coroutine
  sstables: convert parse(statistics&) to a coroutine
  sstables: convert parse(summary&) to a coroutine
2021-08-22 16:49:40 +03:00
Pavel Emelyanov
e02b39ca3d code: Generalize tls::credentials_builder configuration
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.

The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.

Also make the new helper coroutinized from the beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 18:05:41 +03:00
Pavel Emelyanov
35209e7500 transport, redis: Do not assume fixed encryption options
On start main() brushes up the client_encryption_options option
so that any user of it sees it in some "clean" state and can
avoid using get_or_default() to parse.

This patch removes this assumption (and the cleaning code itself).
Next patch will make use of it and relax the duplicated parsing
complexity back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 17:59:33 +03:00
Pavel Emelyanov
2f5941ca6f messaging: Move encryption options parsing to ms
Main collects a bunch of local variables from config and passes
them as arguments to messaging service initialization helper.
This patch replaces all these args with const config reference.

The motivation is to facilitate next patching by providing the
server encryption options k:v set right in the m.s. init code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 17:56:16 +03:00
Pavel Emelyanov
33c70e54bb main: Open-code internode encryption misconfig warning
There's a warning message printed when internode encryption is
set up "incorrectly". The incorrectness "if" uses local variables
that soon will be moved away. This patch makes the check rely
purely on the config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 17:54:40 +03:00
Pavel Emelyanov
aa88527375 main, config: Move options parsing helpers
The get_or_default and is_true are two aux bits that are used
to parse the config options. The former is duplicated in the
alternator code as well.

Put both in utils namespace for future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 17:53:41 +03:00
Avi Kivity
4078d55961 sstables: parse(*): adjust indentation after coroutine conversion
Verified with "git diff -w"
2021-08-18 19:10:26 +03:00
Avi Kivity
6f823eba5f sstables: parse(compression&): eliminate unnecessary indirection
lw_shared_ptr<> was used to keep the addresses of two integers
stable, but this is now unnecessary.
2021-08-18 19:09:55 +03:00
Avi Kivity
bd05bc40f4 sstables: convert parse(compression&) to a coroutine 2021-08-18 19:09:55 +03:00
Avi Kivity
3626a76c53 sstables: convert parse(commitlog_interval&) to a coroutine 2021-08-18 19:09:55 +03:00
Avi Kivity
8699bda724 sstables: parse(streaming_histogram&): eliminate unnecessary indirection
The pre-coroutine code used a unique_ptr to keep the address of a
disk_array stable, but this is now unnecessary.
2021-08-18 19:09:20 +03:00
Avi Kivity
ad27824623 sstables: convert parse(streaming_histogram&) to a coroutine 2021-08-18 19:08:53 +03:00
Avi Kivity
f8b2f0449c sstables: convert parse(estimated_histogram&) to a coroutine 2021-08-18 19:08:53 +03:00
Avi Kivity
71c69fb9e2 sstables: convert parse(statistics&) to a coroutine 2021-08-18 19:08:53 +03:00
Avi Kivity
bd6460f00a sstables: convert parse(summary&) to a coroutine 2021-08-18 18:21:33 +03:00
Pavel Solodovnikov
f98cb96506 raft: raft_sys_table_storage_test: don't use initializer lists inside loops and coroutines
Workaround for Clang bug: https://bugs.llvm.org/show_bug.cgi?id=51515

When compiled on aarch64 with ASAN support and -Og/-Oz/-Os optimization
level, `raft_sys_table_storage::do_store_log_entries` crashes during the
tests. ASAN incorrectly reports `stack-use-after-return` on
`std::vector` list initialization after initial coroutine suspension
(initializer list's data pointer starts to point to garbage).

The workaround is simple: don't use initializer lists in such case
and replace with a series of `emplace_back` calls.
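
The shape of the replacement can be sketched as below. `make_row` and its
parameters are hypothetical; the real code is a coroutine in
`raft_sys_table_storage`, while this sketch is a plain function that only
shows the two ways of filling the vector:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row builder used only to show the pattern.
inline std::vector<std::string> make_row(const std::string& group,
                                         const std::string& term,
                                         const std::string& index) {
    // Instead of: std::vector<std::string> row{group, term, index};
    // build the vector element by element, so that no initializer list
    // backing array is involved.
    std::vector<std::string> row;
    row.reserve(3);
    row.emplace_back(group);
    row.emplace_back(term);
    row.emplace_back(index);
    return row;
}

inline void make_row_example() {
    auto row = make_row("g", "1", "2");
    assert(row.size() == 3);
    assert(row[0] == "g");
}
```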

Tests: unit(debug, aarch64)

Fixes #9178

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210818102038.92509-1-pa.solodovnikov@scylladb.com>
2021-08-18 13:32:55 +03:00
Takuya ASADA
462961ca51 unified: fix handling --supervisor option
We need to pass the --supervisor option only to the scylla-server module,
and also pass the --packaging option to the scylla-jmx module to avoid
running the systemctl command, since the installer may run in a container,
and the container may not have systemd.

Fixes #9141

Closes #9142
2021-08-18 13:17:08 +03:00
Avi Kivity
5450af8e1b database: coroutinize stop()
Make the code tidier.

The conversion is not mechanical: the finally block is converted
to straight line code. stop()/close() must not fail anyway, and we
cannot recover from such failures. The when_all_succeed() for stopping
the semaphores is also converted to straight-line code - there is no
advantage to stopping them in parallel, as we're just waiting for
running tasks to complete and clean up.

Test: unit (dev)

Closes #9218
2021-08-18 10:57:44 +02:00
Avi Kivity
bae67dcce6 Merge "mutation_fragment: Specialize appending_hash for it" from Pavel E
"
The mutation_fragments hashing code sitting in row-level repair
upsets clang and makes it spend 20 minutes compiling itself. This
set speeds this up greatly by moving the hashing code into the
mutation_fragment.cc and turning it into the appending_hash<>
specialisation. A simple sanity checking test makes sure this
doesn't change resulting hash values.

tests: unit.hashers_test(dev, release) // hash values matched, phew
       dtest.repair_additional_test.repair_large_partition_existing_rows_test(release)
"

* 'br-row-level-comp-speedup-2.2' of https://github.com/xemul/scylla:
  mutation_fragment: Specialize appending_hash for it
  tests: Add sanity check for hashing mutation_fragments
2021-08-18 11:25:48 +03:00
Pavel Emelyanov
b5fee07527 mutation_fragment: Specialize appending_hash for it
Row-level repair hashes the mutation fragment and wraps this into a
private fragment_hasher class. For some reason it takes ~20 minutes
for clang to compile the row_level.o with -O3 level (release mode).
Putting the whole fragment_hasher into a dedicated file reduces the
compilation times ~9 times.

However, it seems more natural not to move the fragment_hasher around
but to specialize the appending_hash<> for mutation_fragment and make
row_level.cc code just call feed_hash().

Compilation times (release mode):

                       before     after
row_level.o            19m34s      2m4s
mutation_fragment.o       13s       17s

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-18 09:17:40 +03:00
Pavel Emelyanov
34f8f10123 tests: Add sanity check for hashing mutation_fragments
Next patch is going to change the way row-level repair code hashes
mutation_fragment objects. This patch adds a sanity check that the
hash values are not accidentally changed, by hashing some simple
fragments and comparing them against known expected values.

The hash_mutation_fragment_for_test helper is added for this patch
only and will be removed really soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-18 09:17:40 +03:00
Raphael S. Carvalho
3a1cf3aa88 database: document database::get_keyspace_local_ranges()
Documentation was extracted from abstract_replication_strategy::get_ranges(),
which says:
    // get_ranges() returns the list of ranges held by the given endpoint.
    // The list is sorted, and its elements are non overlapping and non wrap-around.

That's important because users of get_keyspace_local_ranges() expect
that the returned list is both sorted and non overlapping, so let's
document it to prevent someone from removing any of these properties.
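
The documented properties can be expressed as a small check. This is a
sketch over a simplified stand-in type: `simple_range` is a hypothetical
pair of integers, not Scylla's token range type:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for a token range: (start, end) with start < end,
// i.e. a range that does not wrap around.
using simple_range = std::pair<long, long>;

// True when the list is sorted, non-overlapping and free of wrap-around ranges.
inline bool sorted_and_disjoint(const std::vector<simple_range>& ranges) {
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (ranges[i].first >= ranges[i].second) {
            return false; // wrap-around or empty range
        }
        if (i > 0 && ranges[i - 1].second > ranges[i].first) {
            return false; // overlaps the previous range or is out of order
        }
    }
    return true;
}

inline void sorted_and_disjoint_example() {
    assert(sorted_and_disjoint({{0, 10}, {10, 20}}));
    assert(!sorted_and_disjoint({{0, 10}, {5, 20}}));
}
```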

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210805140628.537368-1-raphaelsc@scylladb.com>
2021-08-17 21:44:24 +03:00
Asias He
eaf4d2afb4 storage_service: Generate view update for load and stream
Currently, views will not be updated because the streaming reason is set
to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
view update.

Change the stream reason to repair to trigger view update for load and
stream. This makes load_and_stream behave the same as nodetool refresh.

Note: this is not very efficient, though.

Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.

Fixes #9205

Closes #9213
2021-08-17 21:44:24 +03:00
Avi Kivity
73d6f2798d database: adjust indentation after coroutinization of schema table parsing code 2021-08-17 21:05:05 +03:00
Avi Kivity
4ca856157d database: convert database::parse_schema_tables() to a coroutine
In one case we have f = f.then(...), but we can just wait
for the first future where it's created.
2021-08-17 21:00:15 +03:00
Avi Kivity
4f91953ebf database: remove unneeded temporary in do_parse_schema_tables()
The coroutine can keep the cf_name parameter alive, provided we
pass it by value.
2021-08-17 20:45:41 +03:00
Avi Kivity
b2d5820d75 database: convert do_parse_schema_tables() to a coroutine 2021-08-17 20:44:28 +03:00
Tomasz Grabiec
9fe3e86368 db: Print more fields of read_command
Message-Id: <20210810143752.420988-1-tgrabiec@scylladb.com>
2021-08-17 12:24:40 +03:00
Piotr Sarna
88238c7c2a cql-pytest: add test case for UDA with multiple args
A test case for an aggregate which works on multiple parameters
is added.
2021-08-16 19:52:50 +02:00
Piotr Sarna
d83d212ee5 cql3: fix aggregates with > 1 argument
It was impossible to use an aggregate with more than 1 argument
due to an overzealous assert, which is now removed.
2021-08-16 19:49:03 +02:00
Pavel Emelyanov
9c7bcd1d85 bound_view: Rewrite tri_compare() tail
The new implementation is shorter and allows compiler to
produce nicer assembly. In particular clang-11 and -O3 flag:

Was:
    if (d1 == d2) {
        return w1 - w2;
    }
    return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));

    89 f0           mov    %esi,%eax
    39 d7           cmp    %edx,%edi
    74 13           je     403f69 <_Z7cmp_intiiii+0x19>
    7d 0a           jge    403f62 <_Z7cmp_intiiii+0x12>
    31 c9           xor    %ecx,%ecx
    85 c0           test   %eax,%eax
    0f 9e c1        setle  %cl
    29 c8           sub    %ecx,%eax
    c3              retq
    31 c0           xor    %eax,%eax
    85 c9           test   %ecx,%ecx
    0f 9e c0        setle  %al
    29 c8           sub    %ecx,%eax
    c3              retq

14 instructions, 2 cond jumps, 2 cond sets

Now:
    return ((d1 <= d2) ? w1 << 1 : 1) - ((d2 <= d1) ? w2 << 1 : 1);

    8d 04 36        lea    (%rsi,%rsi,1),%eax
    39 d7           cmp    %edx,%edi
    be 01 00 00 00  mov    $0x1,%esi
    0f 4f c6        cmovg  %esi,%eax
    01 c9           add    %ecx,%ecx
    39 fa           cmp    %edi,%edx
    0f 4f ce        cmovg  %esi,%ecx
    29 c8           sub    %ecx,%eax
    c3              retq

9 instructions, 0 cond jumps, 2 cond movs
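
A quick way to see that the two tails are interchangeable is to compare
their signs, which is all a three-way comparison result is used for. The
sketch below copies both expressions from this message; the function names,
the parameter order and the assumption that weights lie in {-1, 0, 1} are
mine. `* 2` stands in for `<< 1` because left-shifting a negative value is
only well defined from C++20 on:

```cpp
#include <cassert>

// Old tail, as quoted above.
inline int tail_old(int d1, int w1, int d2, int w2) {
    if (d1 == d2) {
        return w1 - w2;
    }
    return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}

// New tail, with `* 2` in place of `<< 1`.
inline int tail_new(int d1, int w1, int d2, int w2) {
    return ((d1 <= d2) ? w1 * 2 : 1) - ((d2 <= d1) ? w2 * 2 : 1);
}

inline int sign(int v) { return (v > 0) - (v < 0); }

// True when both tails agree in sign for all small inputs with weights in {-1, 0, 1}.
inline bool tails_agree() {
    for (int d1 = 0; d1 < 3; ++d1) {
        for (int d2 = 0; d2 < 3; ++d2) {
            for (int w1 = -1; w1 <= 1; ++w1) {
                for (int w2 = -1; w2 <= 1; ++w2) {
                    if (sign(tail_old(d1, w1, d2, w2)) != sign(tail_new(d1, w1, d2, w2))) {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
```

The magnitudes differ (for example -2 versus -3 when d1 < d2 and w1 is -1),
so this only holds for callers that look at the sign.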

tests: unit(dev), perf(simple_query, release)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210730092629.18940-1-xemul@scylladb.com>
2021-08-16 17:17:27 +03:00
Tomasz Grabiec
09b575474b Merge "test: raft: generators infrastructure with an actual random nemesis test" from Kamil
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.

We implement a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
  operation type - `either_of` - which is a "union" of the original
  operation types. Executing `either_of` operation means executing an
  operation of one of the original types, but the specific type
  can be chosen at runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
  limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
  new generator which spreads the operations from `g` roughly every `d`
  ticks. (The actual definition in code is more general and complex but
  the idea is similar.)

Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).

Finally, we implement a test that uses this new infrastructure.

Two `Executable` operations are implemented:
- `raft_call` is for calling to a Raft cluster with a given state
  machine command,
- `network_majority_grudge` partitions the network in half,
  putting the leader in the minority.

We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).

For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.

* kbr/raft-test-generator-v4:
  test: raft: randomized_nemesis_test: a basic generator test
  test: raft: generator: a library of basic generators
  test: raft: introduce generators
  test: raft: introduce `future_set`
  test: raft: randomized_nemesis_test: handle `raft::stopped_error` in timeout futures
2021-08-16 15:55:25 +02:00
Kamil Braun
3344ac8a6c test: raft: randomized_nemesis_test: a basic generator test
The previous commits introduced the basic generator concept and a
library of the most common composition patterns.

In this commit we implement a test that uses this new infrastructure.

Two `Executable` operations are implemented:
- `raft_call` is for calling to a Raft cluster with a given state
  machine command,
- `network_majority_grudge` partitions the network in half,
  putting the leader in the minority.

We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).

For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
2021-08-16 13:07:08 +02:00
Kamil Braun
66ec484730 test: raft: generator: a library of basic generators
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.

This commit introduces a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
  operation type - `either_of` - which is a "union" of the original
  operation types. Executing `either_of` operation means executing an
  operation of one of the original types, but the specific type
  can be chosen at runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
  limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
  new generator which spreads the operations from `g` roughly every `d`
  ticks. (The actual definition in code is more general and complex but
  the idea is similar.)

And so on.

Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
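
To make the `op_limit(n, g)` pattern concrete, here is a heavily simplified
sketch, assuming C++17. It keeps only the purely functional shape (fetching
an operation also returns the generator's next state). The types
`counter_gen` and `op_limit_gen` and the helper `drain` are hypothetical,
operations are plain integers, and there is no execution context or timing
as in the real `Generator` concept:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical unbounded generator: yields 0, 1, 2, ... as "operations".
struct counter_gen {
    int next = 0;
    // Returns the operation together with the generator's next state.
    std::optional<std::pair<int, counter_gen>> op() const {
        return std::make_pair(next, counter_gen{next + 1});
    }
};

// Wraps a generator and stops after `remaining` operations.
template <typename G>
struct op_limit_gen {
    std::size_t remaining;
    G inner;
    std::optional<std::pair<int, op_limit_gen>> op() const {
        if (remaining == 0) {
            return std::nullopt;
        }
        auto res = inner.op();
        if (!res) {
            return std::nullopt; // the wrapped generator ran dry first
        }
        return std::make_pair(res->first, op_limit_gen{remaining - 1, std::move(res->second)});
    }
};

template <typename G>
op_limit_gen<G> op_limit(std::size_t n, G g) {
    return op_limit_gen<G>{n, std::move(g)};
}

// Collects every operation of a finite generator; never call it on an
// unbounded one such as a bare counter_gen.
template <typename G>
std::vector<int> drain(G g) {
    std::vector<int> out;
    while (auto res = g.op()) {
        out.push_back(res->first);
        g = std::move(res->second);
    }
    return out;
}
```

`drain(op_limit(3, counter_gen{}))` collects the operations 0, 1 and 2, and
limits compose: wrapping that generator in `op_limit(5, ...)` still yields
three operations.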
2021-08-16 13:07:08 +02:00
Kamil Braun
d8863c5a7b test: raft: introduce generators
We introduce the concepts of "operations" and "generators", basic
building blocks that will allow us to declaratively write randomized
tests for torturing simulated Raft clusters.

An "operation" is a data structure representing a computation which
may cause side effects such as calling a Raft cluster or partitioning
the network, represented in the code with the `Executable` concept.
It has an `execute` function performing the computation and returns
a result of type `result_type`. Different computations of the same type
share state of type `state_type`. The state can, for example, contain
database handles.

Each execution is performed on an abstract "thread" (represented by a `thread_id`)
and has a logical starting time point. The thread and start point together form
the execution's `context` which is passed as a reference to `execute`.

Two operations may be called in parallel only if they are on different threads.

A generator, represented through the `Generator` concept, produces a
sequence of operations. An operation can be fetched from a generator
using the `op` function, which also returns the next state of the
generator (generators are purely functional data structures).

The generator concept is inspired by the generators in the Jepsen
testing library for distributed systems.

We also implement `interpreter` which "interprets", or "runs", a given
generator, by fetching operations from the generator and executing them
with concurrency controlled by the abstract threads.

The algorithm used in the interpreter is also similar to the interpreter
algorithm in Jepsen, although there are differences. Most notably we don't
have a "worker" concept - everything runs on a single shard; but we use
"abstract threads" combined with futures for concurrency.
There is also no notion of "process". Finally, the interpreter doesn't
keep an explicit history, but instead uses a callback `Recorder` to notify
the user about operation invocations and completions. The user can
decide to save these events in a history, or perhaps they can analyze
them on the fly using constant memory.
2021-08-16 13:07:08 +02:00
Kamil Braun
421b1b9494 test: raft: introduce future_set
A set of futures that can be polled.

Polling the set (`poll` function) returns the value of one of
the futures which became available or `std::nullopt` if the given
logical duration passes (according to the given timer), whichever
event happens first.  The current implementation assumes sequential
polling.

New futures can be added to the set with `add`.
All futures can be removed from the set with `release`.
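
The interface described above can be sketched in Python with `asyncio`. This
is only an analogy under stated assumptions: the class `FutureSet` is
hypothetical, it uses wall-clock timeouts instead of a logical timer, and it
returns `None` where the C++ code returns `std::nullopt`.

```python
import asyncio


class FutureSet:
    """A set of futures that can be polled for the next available result."""

    def __init__(self):
        self._pending = set()

    def add(self, awaitable):
        # Must be called while an event loop is running.
        self._pending.add(asyncio.ensure_future(awaitable))

    async def poll(self, timeout):
        """Return the result of one completed future, or None on timeout."""
        if not self._pending:
            await asyncio.sleep(timeout)
            return None
        done, _ = await asyncio.wait(
            self._pending, timeout=timeout,
            return_when=asyncio.FIRST_COMPLETED)
        if not done:
            return None
        fut = done.pop()
        self._pending.discard(fut)
        return fut.result()

    def release(self):
        """Remove and return all futures still in the set."""
        released, self._pending = self._pending, set()
        return released


async def demo():
    async def delayed(value, delay):
        await asyncio.sleep(delay)
        return value

    fs = FutureSet()
    fs.add(delayed("fast", 0.01))
    fs.add(delayed("slow", 10))
    first = await fs.poll(timeout=1.0)    # "fast" becomes available first
    second = await fs.poll(timeout=0.05)  # "slow" is not ready in time
    for fut in fs.release():
        fut.cancel()
    return first, second


result = asyncio.run(demo())
```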
2021-08-16 13:07:08 +02:00
Kamil Braun
a5e92e1c45 test: raft: randomized_nemesis_test: handle raft::stopped_error in timeout futures
The timeout futures in `call` and `reconfigure` may be discarded after
Raft servers were `abort()`ed which would result in
`raft::stopped_error` and the test complained about discarded
exceptional futures. Discard these errors explicitly.
2021-08-16 13:07:08 +02:00
Takuya ASADA
cb19048186 docker: use dist/common/supervisor script for docker
The supervisor scripts for Docker and the supervisor scripts for the offline
installer are almost the same; drop the Docker ones and share the same code
to deduplicate them.

Closes #9143

Fixes #9194
2021-08-16 13:36:14 +03:00
Avi Kivity
0ba697d515 Merge 'Add service level config change subscription API' from Eliran Sinvani
In order to decouple the service level controller from the system's logic, we introduce an API for subscribing to configuration changes. The timing of the calls was determined with resource creation and destruction in mind. An API subscriber can create
resources that will be available from the very start of the service level's existence; it can also destroy them, since the service level
is guaranteed not to exist anymore at the time of the call to the deletion notification callback.

Testing:
unit tests - all + a newly added one.
dtests - next-gating (dev mode)

Closes #9097

* github.com:scylladb/scylla:
  service level controller: Subscriber API unit test
  Service Level Controller: Add a listener API for service level config changes
2021-08-16 11:47:33 +03:00
Eliran Sinvani
403db8e943 service level controller: Subscriber API unit test
Here we add a very simple unit test for the configuration
change API.
2021-08-16 11:38:59 +03:00
Eliran Sinvani
47d3862b63 Service Level Controller: Add a listener API for service level config
changes

This change adds an API for registering a listener for service_level
configuration changes. It notifies about removal, addition and change of
a service level.
The underlying assumption is that some listeners are going to create and/or
manage service level specific resources, and this is what guided the
timing of the calls to the subscriber.
Addition and change of a service level are notified before the actual
change takes place; this guarantees that resource creation can take
place before the service level or new config starts to be used.
The deletion notification is called only after the deletion took place,
and this guarantees that the service level can't be active and the
resources created can be safely destroyed.
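
The notification ordering can be illustrated with a small Python sketch. All
names here (`ServiceLevelController`, `ResourceTracker`, the `on_*` callbacks)
are hypothetical and do not reflect the actual C++ interface; the sketch only
shows add/change callbacks firing before the change and the removal callback
firing after it.

```python
class ServiceLevelController:
    """Toy controller notifying subscribers about service level changes."""

    def __init__(self):
        self.levels = {}
        self.subscribers = []

    def register_subscriber(self, subscriber):
        self.subscribers.append(subscriber)

    def set_level(self, name, config):
        # Addition and change: notify BEFORE the new config takes effect.
        for s in self.subscribers:
            if name in self.levels:
                s.on_before_change(name, self.levels[name], config)
            else:
                s.on_before_add(name, config)
        self.levels[name] = config

    def remove_level(self, name):
        # Deletion: notify only AFTER the level is gone.
        del self.levels[name]
        for s in self.subscribers:
            s.on_after_remove(name)


class ResourceTracker:
    """Subscriber that manages one resource per service level."""

    def __init__(self, controller):
        self.controller = controller
        self.resources = {}
        self.log = []  # (event, name, observation about controller state)

    def on_before_add(self, name, config):
        self.log.append(("add", name, name in self.controller.levels))
        self.resources[name] = f"resource-for-{name}"

    def on_before_change(self, name, old, new):
        self.log.append(("change", name, self.controller.levels[name] == old))

    def on_after_remove(self, name):
        self.log.append(("remove", name, name in self.controller.levels))
        self.resources.pop(name, None)


controller = ServiceLevelController()
tracker = ResourceTracker(controller)
controller.register_subscriber(tracker)
controller.set_level("gold", {"shares": 1000})
controller.set_level("gold", {"shares": 500})
controller.remove_level("gold")
```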
2021-08-16 11:38:59 +03:00
Pavel Emelyanov
6dd67012bb main: Fix internode encryption warning check
It should check for dc || rack, not dc || dc. The correct behavior
is described in both -- the warning message and the commit that
introduced it (a0745f94).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210730094549.19477-1-xemul@scylladb.com>
2021-08-16 11:14:20 +03:00
Calle Wilund
3633c077be commitlog/config: Make hard size enforcement false by default + add config opt
Refs #9053

Flips the default for commitlog disk footprint hard limit enforcement to off
due to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to actually enforce it.

A tape-and-string fix until we can properly tweak the balance between
cl & sstable flush rate.

Closes #9195
2021-08-15 15:10:27 +03:00
Asias He
97bb2e47ff storage_service: Enable Repair Based Node Operations (RBNO) by default for replace
We decided to enable repair based node operations by default for replace
node operations.

To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.

The operations can be bootstrap, replace, removenode, decommission and rebuild.

By default, --allowed-repair-based-node-ops is set to contain "replace".

Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.

Examples:

- To enable bootstrap and replace node ops:

```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```

- To disable any repair based node ops:

```
scylla --enable-repair-based-node-ops false
```
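
How the two options combine can be sketched as follows. This is a Python
model of the rule described above, not Scylla's code; the function names are
made up for the example.

```python
def parse_allowed_ops(value):
    """Parse a comma-separated list such as "replace,bootstrap"."""
    return {op.strip() for op in value.split(",") if op.strip()}


def uses_repair_based_node_ops(operation, enabled, allowed):
    """Model of the rule: the global switch must be on AND the operation
    must be listed in the allowed set."""
    return enabled and operation in allowed


default_allowed = parse_allowed_ops("replace")
custom_allowed = parse_allowed_ops("replace,bootstrap")
```

With the global switch off, no operation uses repair regardless of the
allowed list.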

Closes #9197
2021-08-15 13:30:46 +03:00
Nadav Har'El
b53eeb8a6c Merge 'Enable user-defined aggregates' from Piotr Sarna
It turns out that user-defined aggregates did not need any elaborate coding in order to make them exposed to the users. The whole infrastructure is already there, including system schema tables and support for running aggregate queries, so this series simply adds lots and lots of boilerplate glue code to make UDA usable.

It also comes with a simple test which shows that it's possible to define and use such an aggregate.

Performance not tested, since user-defined functions are still experimental, so nothing really changes in this matter.

Tests: unit(release)

Fixes #7201

Closes #9165

* github.com:scylladb/scylla:
  cql-pytest: add a test suite for user-defined aggregates
  cql-pytest: add context managers for functions and aggregates
  cql3: enable user-defined aggregates in CQL grammar
  cql3: add statements for user-defined aggregates
  cql3,functions: add checking if a function is used in UDA
  gms: add UDA feature
  migration_manager: add migrating user-defined aggregates
  db,schema_tables: add handling user-defined aggregates
  pagers: make a lambda mutable in fetch_page
  cql3: wrap handling paging result with with_thread_if_needed
  cql3: correctly mark function selectors as needing threads
  cql3: add user-defined aggregate representation
2021-08-14 12:14:12 +03:00
Piotr Sarna
38c1fd0762 cql-pytest: add a test suite for user-defined aggregates
The test suite now consists of a single user aggregate:
a custom implementation for existing avg() built-in function,
as well as a couple of cases for catching incorrect operations,
like using wrong function signatures or dropping used functions.
2021-08-13 11:16:52 +02:00
Piotr Sarna
5f773d04d2 cql-pytest: add context managers for functions and aggregates
These context managers can be used to create temporary
user-defined functions and user-defined aggregates.
2021-08-13 11:16:52 +02:00
Piotr Sarna
2ebf018e74 cql3: enable user-defined aggregates in CQL grammar
Statements for creating and dropping user-defined aggregates
are now accepted by the grammar and can be used by the users.
2021-08-13 11:16:52 +02:00
Piotr Sarna
ec25cf965e cql3: add statements for user-defined aggregates
The following statements are added:
 - CREATE AGGREGATE
 - DROP AGGREGATE
2021-08-13 11:16:52 +02:00
Piotr Sarna
a9ae753cd6 cql3,functions: add checking if a function is used in UDA
If a function is used by a user-defined aggregate, it must not
be dropped or the aggregate would be left with a dangling function.
2021-08-13 11:16:47 +02:00
Piotr Sarna
da67c594c8 gms: add UDA feature
UDA stands for user-defined aggregates and the feature implies
that the whole cluster supports them.
2021-08-13 11:14:12 +02:00
Piotr Sarna
e1be04852b migration_manager: add migrating user-defined aggregates
User-defined aggregate creation and deletion can now be announced.
2021-08-13 11:14:12 +02:00
Piotr Sarna
84876a165b db,schema_tables: add handling user-defined aggregates
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
2021-08-13 11:14:11 +02:00
Piotr Sarna
ad2093539b pagers: make a lambda mutable in fetch_page
The lambda passed to with_thread_if_needed helper function
relies on moving its captured parameters, so it's made mutable
in order to avoid copying.
2021-08-13 11:13:43 +02:00
Piotr Sarna
260604d053 cql3: wrap handling paging result with with_thread_if_needed
One of the pagers did not spawn a Seastar thread even if it was
required by its underlying selectors - the behavior is now fixed.
2021-08-13 11:13:43 +02:00
Piotr Sarna
cac321cd12 cql3: correctly mark function selectors as needing threads
Function call selectors correctly checked if their arguments
are required to run in threaded context, but forgot to check
the function itself - which is now done.
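
A minimal Python model of the corrected check, with hypothetical class names
(`ColumnSelector`, `FunctionCallSelector`) standing in for the C++ selectors:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnSelector:
    """A plain selector that never needs a threaded context."""

    def requires_thread(self):
        return False


@dataclass
class FunctionCallSelector:
    function_needs_thread: bool
    args: List[object] = field(default_factory=list)

    def requires_thread(self):
        # Check the function itself as well as all of its arguments.
        return self.function_needs_thread or any(
            arg.requires_thread() for arg in self.args)


plain = ColumnSelector()
udf_call = FunctionCallSelector(function_needs_thread=True, args=[plain])
builtin_call = FunctionCallSelector(function_needs_thread=False, args=[plain])
nested_call = FunctionCallSelector(function_needs_thread=False, args=[udf_call])
```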
2021-08-13 11:13:43 +02:00
Piotr Sarna
ee81453596 cql3: add user-defined aggregate representation
A user-defined aggregate is represented as an aggregate
which calls its state function on each input row
and then finalizes its execution by calling its final function
on the final state, after all rows were already processed.
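
The evaluation model can be sketched in Python with a custom average. The
class `UserDefinedAggregate` and the explicit `initial_state` field are
assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class UserDefinedAggregate:
    state_function: Callable[[Any, Any], Any]
    initial_state: Any  # assumption: the starting state is given explicitly
    final_function: Callable[[Any], Any]

    def aggregate(self, rows):
        state = self.initial_state
        for value in rows:
            # The state function is called once per input row.
            state = self.state_function(state, value)
        # The final function runs once, after all rows were processed.
        return self.final_function(state)


def avg_state(state, value):
    total, count = state
    return total + value, count + 1


def avg_final(state):
    total, count = state
    return total / count if count else None


custom_avg = UserDefinedAggregate(avg_state, (0, 0), avg_final)
average = custom_avg.aggregate([1, 2, 3, 6])
```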
2021-08-13 11:13:41 +02:00
Piotr Sarna
58196e8ea6 db,view: avoid ignoring failed future in background view updates
The code for handling background view updates used to propagate
exceptions unconditionally, which leads to "exceptional future
ignored" warnings if the update was put to background.
From now on, the exception is only propagated if its future
is actually waited on.

Fixes #6187

Tested manually, the warning was not observed after the patch

Closes #9179
2021-08-12 17:32:35 +03:00
Piotr Sarna
ea0e0c924d configure,install-dependencies: add wasmtime dependency
If the wasmtime library is available for download, it will be
set up by install-dependencies and prepared for linking.

Closes #9151

[avi: regenerate toolchain, which also updates clang to 12.0.1]
2021-08-12 12:33:43 +03:00
Asias He
cc44edb4e2 database: Detemplate run_async
I initially tried to use a noncopyable_function to avoid the unnecessary
template usage.

However, since database::apply_in_memory is a hot function, it is better
to use with_gate directly. The run_async function does nothing but call
with_gate anyway.

Closes #9160
2021-08-12 07:53:10 +03:00
Takuya ASADA
e5bb88b69a scylla_cpuscaling_setup: change scaling_governor path
On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even though CPU scaling is supported.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available in both kinds of environment, so we should switch to it.

Fixes #9191

Closes #9193
2021-08-11 15:31:14 +03:00
Nadav Har'El
89724533f8 test/cql-pytest: CREATE INDEX IF NOT EXISTS vs. Cassandra
What should the following pair of statements do?

    CREATE INDEX xyz ON tbl(a)
    CREATE INDEX IF NOT EXISTS xyz ON tbl(b)

There are two reasonable choices:
1. An index with the name xyz already exists, so the second command should
   do nothing, because of the "IF NOT EXISTS".
2. The index on tbl(b) does *not* yet exist, so the command should try to
   create it. And when it can't (because the name xyz is already taken),
   it should produce an error message.

Currently, Cassandra went with choice 1, and Scylla went with choice 2.

After some discussions on the mailing list, we agreed that Scylla's
choice is the better one and Cassandra's choice could be considered a
bug: The "IF NOT EXISTS" feature is meant to allow idempotent creation of
an index - and not to make it easy to make mistakes without noticing.
The second command listed above is most likely a mistake by the user,
not anything intentional: The command intended to ensure that an index
on column b exists, but after the silent success of the command, no such
index exists.
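
The two choices can be contrasted with a simplified Python model. The
functions and the `AlreadyExists` exception are hypothetical; indexes are
modelled as a plain mapping from index name to indexed column, and the case
of the same column being indexed under a different name is not modelled.

```python
class AlreadyExists(Exception):
    pass


def create_index_choice_1(indexes, name, column, if_not_exists=False):
    """Choice 1: IF NOT EXISTS only looks at the index name."""
    if name in indexes:
        if if_not_exists:
            return False  # silently do nothing
        raise AlreadyExists(name)
    indexes[name] = column
    return True


def create_index_choice_2(indexes, name, column, if_not_exists=False):
    """Choice 2: IF NOT EXISTS is a no-op only if this very index exists."""
    if if_not_exists and indexes.get(name) == column:
        return False  # idempotent re-creation of the same index
    if name in indexes:
        raise AlreadyExists(name)  # name taken by a different index
    indexes[name] = column
    return True


indexes_1 = {}
create_index_choice_1(indexes_1, "xyz", "a")
silent = create_index_choice_1(indexes_1, "xyz", "b", if_not_exists=True)

indexes_2 = {}
create_index_choice_2(indexes_2, "xyz", "a")
try:
    create_index_choice_2(indexes_2, "xyz", "b", if_not_exists=True)
    raised = False
except AlreadyExists:
    raised = True
```

Under choice 1 the second statement succeeds silently although no index on
`b` exists afterwards; under choice 2 the user gets an error.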

So this patch doesn't change any Scylla code (it just adds a comment),
and rather it adds a test which "enshrines" the current behavior.
The test passes on Scylla and fails on Cassandra so we tag it
"cassandra_bug", meaning that we consider this difference to be
intentional and we consider Cassandra's behavior in this case to be wrong.

Fixes #9182.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210811113906.2105644-1-nyh@scylladb.com>
2021-08-11 13:41:58 +02:00
Asias He
ce8fd051c9 storage_service: Fix argument in send_meta_data::do_receive
The extra status print is not needed in the log.

Fixes the following error:

ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)

Fixes #9183

Closes #9189
2021-08-11 11:35:30 +02:00
Asias He
040b626235 table: Fix is_shared assert for load and stream
The reader is used by load and stream to read sstables from the upload
directory which are not guaranteed to belong to the local shard.

Use make_range_sstable_reader instead of
make_local_shard_sstable_reader.

Tests:

backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test

Fixes #9173

Closes #9185
2021-08-11 12:18:40 +03:00
Piotr Jastrzebski
db4c9199f5 sstables: remove unused uppermost_bound from clustering_ranges_walker and mutation_fragment_filter
Those methods are never used, so it's better not to keep dead code
around.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9188
2021-08-11 10:54:59 +02:00
Nadav Har'El
49ca1f86b2 Merge 'hints: error injection for pausing hint replay' from Piotr Dulikowski
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.

This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.

The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).

Refs: #6649

Closes #8801

* github.com:scylladb/scylla:
  hints: fix indentation after previous patch
  hints: error injection for pausing hint replay
  hints: coroutinize lambda inside send_one_file
2021-08-11 11:42:29 +03:00
Piotr Dulikowski
f2e1339f38 hints: use an abort_source with sleep_abortable in flush+send loop
Each hint sender runs an asynchronous loop with tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source - it should abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does it by disabling the
`auto_handle_sigint_sigterm` option in reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by at most 10 seconds.

To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
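
The idea of the fix can be shown with a Python threading analogy. This is
not Seastar code: the `AbortSource` class below is a hypothetical stand-in
built on `threading.Event`, and the loop body only records attempts.

```python
import threading
import time


class AbortSource:
    """Minimal stand-in for an abort source shared by a sender's loop."""

    def __init__(self):
        self._event = threading.Event()

    def request_abort(self):
        self._event.set()

    def abort_requested(self):
        return self._event.is_set()

    def sleep(self, seconds):
        """Sleep up to `seconds`; return True if aborted early."""
        return self._event.wait(timeout=seconds)


def flush_and_send_loop(abort_source, interval, attempts):
    while not abort_source.abort_requested():
        attempts.append(time.monotonic())  # placeholder for flush + send
        abort_source.sleep(interval)       # wakes up early on abort


abort_source = AbortSource()
attempts = []
worker = threading.Thread(
    target=flush_and_send_loop, args=(abort_source, 10.0, attempts))
started = time.monotonic()
worker.start()
time.sleep(0.05)
abort_source.request_abort()  # the sender is requested to stop
worker.join()
elapsed = time.monotonic() - started
```

Without the abort source the `join` would have to wait for the full
10-second sleep to finish.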

Fixes: #9176

Closes #9177
2021-08-11 10:32:53 +02:00
Tomasz Grabiec
e177cd382b db: Remove superfluous } from read_command printout
Message-Id: <20210810131429.407903-1-tgrabiec@scylladb.com>
2021-08-10 17:32:34 +03:00
Michał Chojnowski
2aa0a2e6a1 test: perf: perf_collection: use the optimized version of bptree
Since key_compare does not conform to SimpleLessCompare, the benchmark
tests the non-optimized version of bptree (without SIMD key search).
We want to test the optimized version.

Closes #9180
2021-08-10 17:04:34 +03:00
Nadav Har'El
65381bd155 test/alternator: add tests for expression length limits
The DynamoDB documentation
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
describes several hard limits on the size of expressions
(ProjectionExpression, ConditionExpression, UpdateExpression,
FilterExpression) and various elements they contain.

In this patch we begin testing those limits with a comprehensive test for
the *length* of each of these four expressions: we test that lengths up to
(and including) 4096 bytes are allowed but longer expressions are rejected.
We also add TODOs for additional documented limits that should be tested
in the future.
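
The rule being tested can be written down as a short Python check. The
function below is a hypothetical sketch of the documented limit, not code
from the test or from Alternator; measuring the length in UTF-8 bytes is an
assumption.

```python
MAX_EXPRESSION_LENGTH = 4096  # bytes, the limit described above


def check_expression_length(expression):
    """Raise ValueError if the expression exceeds the length limit."""
    size = len(expression.encode("utf-8"))
    if size > MAX_EXPRESSION_LENGTH:
        raise ValueError(
            f"expression size {size} exceeds {MAX_EXPRESSION_LENGTH} bytes")


def is_allowed(expression):
    try:
        check_expression_length(expression)
        return True
    except ValueError:
        return False


at_limit = is_allowed("a" * 4096)
over_limit = is_allowed("a" * 4097)
```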

Currently, this test passes on DynamoDB but xfails on Alternator because
Alternator does *not* enforce any limits on the expression length. I don't
think this is a real problem, and we may consider keeping it this way,
but we should at least be aware that this difference exists and an
xfailing test will remind us.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-2-nyh@scylladb.com>
2021-08-10 12:06:21 +02:00
Nadav Har'El
9d49a32486 test/alternator: add tests for attribute name limits
DynamoDB limits attribute names in items to lengths of up to 65535 bytes,
but in some cases (such as key attributes) the limit is lower - 255.
This patch adds tests for many of these cases.

All the new tests pass on DynamoDB, but some still xfail on Alternator
because Alternator is too lenient - sometimes allowing longer attribute
names than DynamoDB allows. While this may sound great, it also has
downsides: The oversized attribute names perform badly, and as they
grow, Alternator's internal limits will be reached as well, and result
in an unsightly "internal server error" being reported instead of the
expected user-friendly error.

Refs #9169.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-1-nyh@scylladb.com>
2021-08-10 12:06:13 +02:00
Avi Kivity
112cee4960 Merge "make sstable::make_reader() return flat_mutation_reader_v2" from Michael
"
* Make `sstable::make_reader()` return `flat_mutation_reader_v2`,
  retain the old one as `sstable::make_reader_v1()`

* Start weaning tests off `sstable::make_reader_v1()` (done all the
  easy ones, i.e. those not involving range tombstones)
"

* tag 'sstable-make-reader-v2-v1' of github.com:cmm/scylla:
  tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test
  tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2
  tests: get rid of sstable::make_reader_v1() in broken_sstable_test
  tests: get rid of sstable::make_reader_v1() in the trivial cases
  sstables: make sstable::make_reader() return flat_mutation_reader_v2
2021-08-10 12:57:10 +03:00
Avi Kivity
a7ef826c2b Merge "Fold validation compaction into scrub" from Botond
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.

To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against using the also
suggested `--dry-mode` flag as I feel that a new mode is a more natural
choice, we don't have to define how it interacts with all the other
modes, unlike with a `--dry-mode` flag.

Fixes: #7736

Tests: unit(dev), manual(REST API)
"

* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
  compaction/compaction_descriptor: add comment to Validation compaction type
  compaction/compaction_descriptor: compaction_options: remove validate
  api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
  compaction/compaction_manager: hide perform_sstable_validation()
  compaction: validation compaction -> scrub compaction (validate mode)
  compaction/compaction_descriptor: compaction_options: add options() accessor
  compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
2021-08-10 12:18:35 +03:00
Michael Livshin
c0ba657a86 tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test
That is, anything not involving range tombstones.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
7c2854a094 tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
a4c43eda3a tests: get rid of sstable::make_reader_v1() in broken_sstable_test
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
37c9f8f137 tests: get rid of sstable::make_reader_v1() in the trivial cases
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
f07306d75c sstables: make sstable::make_reader() return flat_mutation_reader_v2
Rename the old version to `sstable::make_reader_v1()`, to have a
nicely searchable eradication target.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Piotr Dulikowski
68cac2eab7 hints: fix indentation after previous patch
2021-08-09 16:16:14 +02:00
Piotr Dulikowski
20cbe7fa2f hints: error injection for pausing hint replay
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.

This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.

The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).

Refs: #6649
2021-08-09 16:16:14 +02:00
Piotr Dulikowski
29993f7745 hints: coroutinize lambda inside send_one_file
Converts the lambda invoked for every commitlog entry in a hints file
into a coroutine.
2021-08-09 16:16:14 +02:00
Asias He
4ae6eae00a table: Get rid of table::run_compaction helper
The table::run_compaction is a trivial wrapper for
table::compact_sstables.

We have lots of similar {start, trigger, run}_compaction functions.
Dropping the run_compaction wrapper to reduce confusion.

Closes #9161
2021-08-09 14:02:54 +03:00
Tomasz Grabiec
e115fce8f7 Merge "raft: sometimes become a candidate even if outside the configuration" from Kamil
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.

* kbr/candidate-outside-config-v3:
  raft: sometimes become a candidate even if outside the configuration
  raft: fsm: update _commit_idx when applying snapshot
2021-08-09 12:29:03 +02:00
Avi Kivity
1b618921be Merge 'hinted handoff: introduce HTTP API for waiting for hint replay (stateless version)' from Piotr Dulikowski
This PR introduces a new feature to hinted handoff: ability to wait until hints from given node are replayed towards a chosen set of nodes.

It replaces the old mechanism which waits for hints to be replayed before repair and exposes it through an HTTP API. The implementation is completely different, so this PR begins with a revert of the old functionality and then introduces the new implementation.

Waiting for hints is made possible with the help of "hint sync points". A sync point is a collection of positions in some hint queues from one node - those positions are encoded into the sync point's description as a hexadecimal string. The sync point consists only of its hexadecimal description - there is no state kept on any of the nodes.

Two operations are available through the HTTP API:

- `/hints_manager/waiting_point` (POST) - _Create a sync point_. Given a set of `target_hosts`, creates a sync point which encodes positions currently at the end of all queues pointing to any of the `target_hosts`.
- `/hints_manager/waiting_point` (GET) - _Wait or check the sync point_. Given a description of a sync point, checks if the sync point was already reached. If you provide a non-zero `timeout` parameter and the sync point is not reached yet, this endpoint will wait until the point is reached or the timeout expires.

Hinted handoff uses the commitlog framework in order to store and replay hints. Each entry (here, a serialized hint) can be identified by a "replay position", which contains the ID of the segment containing the hint, and its position in the file. Replay positions are ordered with respect to segment ID and then position in the file; because segment IDs are assigned sequentially and entries are also written sequentially, this order corresponds to the chronological order in which hints were written. This order also corresponds to the order in which hints are replayed, provided that hint segments are processed starting with the one with the smallest ID first.

The main idea is to track the positions of both the most recently written hint, and the most recently replayed hint. When creating a hint sync point, the position of the last written hint is encoded; when the sync point is waited on, the hints manager waits until the last replayed position reached the position encoded in the sync point. The description of the sync point encodes positions on a per-hint queue basis - separately for each shard, destination endpoint and hint type (regular or MV).
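
A simplified Python model of this idea follows. The names `ReplayPosition`,
`create_sync_point` and `sync_point_reached` are hypothetical, and a queue is
identified here by a plain `(shard, endpoint, hint type)` tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ReplayPosition:
    # Field order matters: compare by segment ID first, then by position.
    segment_id: int
    position: int


def create_sync_point(last_written):
    """Snapshot the last written position of every hint queue."""
    return dict(last_written)


def sync_point_reached(sync_point, last_replayed):
    """True once every queue has replayed up to its recorded position."""
    start = ReplayPosition(0, 0)
    return all(
        last_replayed.get(queue, start) >= rp
        for queue, rp in sync_point.items())


queue = (0, "10.0.0.2", "regular")
sync_point = create_sync_point({queue: ReplayPosition(5, 100)})

behind = sync_point_reached(sync_point, {queue: ReplayPosition(5, 40)})
exact = sync_point_reached(sync_point, {queue: ReplayPosition(5, 100)})
later_segment = sync_point_reached(sync_point, {queue: ReplayPosition(6, 0)})
```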

Note: although hints manager destroys and re-creates commitlog instances, the ordering described above still works - the ID of the first segment assigned by the commitlog instance corresponds to the number of milliseconds since the epoch, so segments created by newer commitlog instances will have larger IDs.

Before the hints manager is enabled, it performs segment _rebalancing_: for a given endpoint, it makes sure that each shard gets roughly the same number of hint segments. For example, if there are 3 shards and shard 1 has 7 segments, then shard 0 will get 2 segments, shard 1: 3 segments, and shard 2: 2 segments. Apart from distributing the work evenly between shards on startup, it also handles the case when the node is resharded - if the number of shards is reduced, segments from removed shards will be redistributed to lower shards.
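
The target distribution from the example can be computed with a few lines of
Python. This is only a sketch: the function name is made up, and the choice
to leave leftover segments on the shards that already hold the most of them
is an assumption that happens to reproduce the numbers above.

```python
def rebalance_targets(segments_per_shard):
    """Return how many segments each shard should hold after rebalancing."""
    shards = len(segments_per_shard)
    base, extra = divmod(sum(segments_per_shard), shards)
    # Assumption: leftover segments stay on the most loaded shards.
    by_load = sorted(range(shards), key=lambda s: -segments_per_shard[s])
    targets = [base] * shards
    for shard in by_load[:extra]:
        targets[shard] += 1
    return targets


example = rebalance_targets([0, 7, 0])
```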

Because of the possibility of segments being moved between shards on restart, this makes accurate tracking of hint replay harder. In order to simplify the problem, this PR changes the order in which hint segments are replayed - segments from other shards (here called "foreign" segments) are replayed first, before any "local" segment from this shard. Foreign segments are treated as if they were placed before the 0 replay position - when waiting for a hint sync point, we will __always__ wait for foreign segments to be replayed.
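
The resulting replay order can be sketched as below. The `Segment` type and
`replay_order` function are hypothetical, and ordering the foreign segments
among themselves by ID is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    shard: int       # shard that originally created the segment
    segment_id: int


def replay_order(segments, local_shard):
    """Foreign segments first, then local ones; each group sorted by ID."""
    foreign = sorted(
        (s for s in segments if s.shard != local_shard),
        key=lambda s: s.segment_id)
    local = sorted(
        (s for s in segments if s.shard == local_shard),
        key=lambda s: s.segment_id)
    return foreign + local


segments = [Segment(0, 30), Segment(2, 50), Segment(0, 10), Segment(1, 40)]
order = replay_order(segments, local_shard=0)
```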

This behavior makes sure that hints generated before the sync point was created will be replayed - and, if segment rebalancing happened in the meantime, we will potentially replay some more segments which were moved across shards.

This PR starts with a revert of the "hints: delay repair until hints are replayed" (#8452) functionality. Some infrastructure introduced in the original PR started to be used by other places in the code, so this is not a simple revert of the merge commit - instead, commits of the old PR are reverted separately and modified in order to make the code compile.

The following commits from the original PR were omitted from the revert because the code introduced by them became used by other logic in repair:

- 0db45d1df5 (repair: introduce abort_source for repair abort)
- 3a2d09b644 (repair: introduce abort_source for shutdown)
- 49f4a2f968 (repair: plug in waiting for hints to be sent before repair)

Refs: #8102
Fixes: #8727

Tests: unit(dev)

Closes #8982

* github.com:scylladb/scylla:
  api: add HTTP API for hint sync points
  api: register hints HTTP API outside set_server_done
  storage_proxy: add functions for creating and waiting for hint sync pts
  hints: add functions for creating and waiting for sync points
  hints: add hint sync point structure
  utils,alternator: move base64 code from alternator to utils
  hints: make it possible to wait until hints are replayed
  hints: track the RP of the last replayed position
  hints: track the RP of the last written hint
  hints: change last_attempted_rp to last_succeeded_rp
  hints: rearrange error handling logic for hint sending
  hints: sort segments by ID, divide into foreign and local
  Revert "db/hints: allow to forcefully update segment list on flush"
  Revert "db/hints: add a metric for counting processed files"
  Revert "db/hints: make it possible to wait until current hints are sent"
  Revert "storage_proxy: add functions for syncing with hints queue"
  Revert "messaging_service: add verbs for hint sync points"
  Revert "storage_proxy: implement verbs for hint sync points"
  Revert "config: add wait_for_hint_replay_before_repair option"
  Revert "storage_proxy: coordinate waiting for hints to be sent"
  Revert "repair: plug in waiting for hints to be sent before repair"
  Revert "hints: dismiss segment waiters when hint queue can't send"
  Revert "storage_proxy: stop waiting for hints replay when node goes down"
  Revert "storage_proxy: add abort_source to wait_for_hints_to_be_replayed"
2021-08-09 10:59:07 +03:00
Piotr Dulikowski
7e3966c03e api: add HTTP API for hint sync points
Adds HTTP endpoints for manipulating hint sync points:

- /hinted_handoff/sync_point (POST) - creates a new sync point for
  hints towards nodes listed in the `target_hosts` parameter
- /hinted_handoff/sync_point (GET) - checks the status of the sync
  point. If a non-zero `timeout` parameter is given, it waits until the
  sync point is reached or the timeout expires.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
9091ce5977 api: register hints HTTP API outside set_server_done
Registration of the currently unused hinted handoff endpoints is moved
out from the set_server_done function. They are now explicitly
registered in main.cc by calling api::set_hinted_handoff and also
uninitialized by calling api::unset_hinted_handoff.

Setting and unsetting the HTTP API separately makes it possible to pass
a reference to the sync_point_service without polluting the
set_server_done function.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
14b00610b2 storage_proxy: add functions for creating and waiting for hint sync pts
Adds functions in storage_proxy which allow creating sync points and
waiting for them.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
d41d39bbcd hints: add functions for creating and waiting for sync points
Adds functions which allow creating per-shard sync points and waiting
for them.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
e18b29765a hints: add hint sync point structure
Adds a sync_point structure. A sync point is a (possibly incomplete)
mapping from hint queues to a replay position in each of them. Users
will be able to create sync points consisting of the last written
positions of some hint queues, and then wait until hint replay in all of
those queues reaches that point.

The sync point supports serialization - first it is serialized with the
help of IDL to a binary form, and then converted to a hexadecimal
string. Deserialization is also possible.
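
As an illustration only (this is not the actual Scylla structure), a sync
point can be modelled as a dictionary from a hint queue name to a replay
position. The names `ReplayPosition`, `SyncPoint`, `encode_sync_point`,
`decode_sync_point` and `is_reached` are hypothetical, JSON stands in for
the IDL-based binary form, and "reached" is assumed to mean that each
listed queue has replayed at least up to its recorded position:

```python
import json
from typing import Dict, Tuple

# Hypothetical model: a replay position is (segment_id, offset_in_segment).
ReplayPosition = Tuple[int, int]
# Hypothetical model: a sync point maps a hint queue name to a replay position.
SyncPoint = Dict[str, ReplayPosition]


def encode_sync_point(sp: SyncPoint) -> str:
    """Serialize to bytes (JSON here, standing in for IDL), then to a hex string."""
    raw = json.dumps(sp, sort_keys=True).encode("utf-8")
    return raw.hex()


def decode_sync_point(text: str) -> SyncPoint:
    """Inverse of encode_sync_point."""
    decoded = json.loads(bytes.fromhex(text).decode("utf-8"))
    return {queue: (rp[0], rp[1]) for queue, rp in decoded.items()}


def is_reached(sp: SyncPoint, replayed: Dict[str, ReplayPosition]) -> bool:
    """True if every queue listed in the sync point has replayed up to its position.

    Queues absent from the sync point are ignored, since the mapping may be
    incomplete. A listed queue with no known progress counts as not reached.
    """
    return all(replayed.get(queue, (-1, -1)) >= rp for queue, rp in sp.items())
```

Tuples compare lexicographically, so a position in a later segment is
always greater than any position in an earlier one.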
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
5a0942a0f8 utils,alternator: move base64 code from alternator to utils
The base64 encoding/decoding functions will be used for serialization of
hint sync point descriptions. Base64 format is not specific to
Alternator, so it can be moved to utils.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
70df9973f3 hints: make it possible to wait until hints are replayed
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.

If the endpoint manager is stopped, current waiters are dismissed with
an exception.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
93f244426d hints: track the RP of the last replayed position
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.

In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.
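
A minimal sketch of the idea, under simplifying assumptions that are not
taken from the source: positions are plain integers, hints are handed to
the sender in increasing position order, and every send eventually
succeeds. The class and method names are hypothetical, and a Python set
stands in for the std::map:

```python
class ReplayBoundTracker:
    """Tracks a bound such that every hint strictly below it has been replayed."""

    def __init__(self, start: int = 0) -> None:
        # Positions of hints which are currently being sent.
        self._in_flight = set()
        # One past the highest position handed to the sender so far.
        self._next_unstarted = start

    def on_send_started(self, pos: int) -> None:
        self._in_flight.add(pos)
        self._next_unstarted = max(self._next_unstarted, pos + 1)

    def on_send_succeeded(self, pos: int) -> None:
        self._in_flight.discard(pos)

    def replayed_bound(self) -> int:
        # The oldest in-flight hint caps the bound. With nothing in flight,
        # everything handed to the sender so far has been replayed.
        if self._in_flight:
            return min(self._in_flight)
        return self._next_unstarted
```

An ordered map gives the smallest in-flight position cheaply; the
`min()` over a set used here is linear and only keeps the sketch short.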
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
03e2e671cd hints: track the RP of the last written hint
The position of the last written hint is now tracked by the endpoint
hints manager.

When the manager is constructed and no hints are replayed yet, the last
written hint position is initialized to the beginning of a fake segment
with ID corresponding to the current number of milliseconds since the
epoch. This choice makes sure that, in case a new hint sync point is
created before any hints are written, the position recorded for that
hint queue will be larger than all replay positions in segments
currently stored on disk.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
27d0d598fd hints: change last_attempted_rp to last_succeeded_rp
Instead of tracking the last position for which hint sending is
attempted, the last successfully replayed position is tracked.

The previous variable was used to calculate the position from which hint
replay should restart in case of an error, in the following way:

    _last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(
        ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp));

Now, this formula uses the last_succeeded_rp in place of
last_attempted_rp. This change does not have an effect on the choice of
the starting position of the next retry:

- If the hint at `last_attempted_rp` has succeeded, in the new algorithm
  the same position will be recorded in `last_succeeded_rp`, and the
  formula will yield the same result.
- If the hint at `last_attempted_rp` has failed, it will be accounted
  into `first_failed_rp`, so the formula will yield the same result.

The motivation for this change is that in the next commits of this PR we
will start tracking the position of the last replayed hint per hint
queue, and the meaning of the new variable makes it more useful - when
there are no failed hints in the hint sending attempt, last_succeeded_rp
gives us information that hints _up to this position_ were replayed; the
last_attempted_rp variable can only tell us that hints _before that
position_ were replayed successfully.
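
The equivalence argued above can be checked with a small model of the
formula. This is only a sketch: replay positions are modelled as plain
integers, and `value_or` and `next_restart_rp` are hypothetical helper
names, not functions from the code base:

```python
from typing import Optional


def value_or(opt: Optional[int], default: int) -> int:
    """Python stand-in for std::optional::value_or."""
    return default if opt is None else opt


def next_restart_rp(first_failed_rp: Optional[int],
                    last_rp: Optional[int],
                    last_not_complete_rp: int) -> int:
    """The restart formula quoted above.

    `last_rp` is either last_attempted_rp (old variant) or
    last_succeeded_rp (new variant).
    """
    return value_or(first_failed_rp, value_or(last_rp, last_not_complete_rp))


# Case 1: the hint at position 7 was attempted and succeeded, nothing failed.
# Both variants record 7, so the result is the same.
old_ok = next_restart_rp(None, 7, 3)   # last_attempted_rp == 7
new_ok = next_restart_rp(None, 7, 3)   # last_succeeded_rp == 7

# Case 2: the hint at position 7 was attempted and failed. The variants
# record different values (7 vs. 6), but first_failed_rp wins in both.
old_failed = next_restart_rp(7, 7, 3)  # last_attempted_rp == 7
new_failed = next_restart_rp(7, 6, 3)  # last_succeeded_rp == 6
```

In both cases the old and the new variant pick the same restart position.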
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
08a7d79ffc hints: rearrange error handling logic for hint sending
Instead of calling the `on_hint_send_failure` method inside the hint
sending task in places where an error occurs, we now let the exceptions
be returned and handle them inside a single `then_wrapped` attached to
the hint sending task.

Apart from the `then_wrapped`, there is one more place which calls
`on_hint_send_failure` - in the exception handler for the future which
spawns the asynchronous hint sending task. It needs to be kept separate
because it is a part of a separate task.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
45b04c94e0 hints: sort segments by ID, divide into foreign and local
Endpoint hints manager keeps a commitlog instance which is used to write
hints into new segments. This instance is re-created every 10 seconds,
which causes the previous instance to leave its segments on disk.

On the other hand, hints sender keeps a list of segments to replay which
is updated only after it becomes empty. The list is repopulated with
segments returned by the commitlog::get_segments_to_replay() method
which does not specify the order of the segments returned.

As a preparation for the upcoming hint sync points feature, this commit
changes the order in which segments are replayed:

- First, segments written by other shards are replayed. Such segments
  may appear in the queue because of segment rebalancing which is done
  at startup.
  The purpose of replaying "foreign" segments first is that they are
  problematic for hint sync points. For each hint queue, a hint sync
  point encodes a replay position of the last written hint on the local
  shard. Accounting foreign segments precisely would make the
  implementation more complicated. To make things simpler, waiting for
  sync points will always make sure that all foreign segments are
  replayed. This might sometimes cause more hints to be waited on than
  necessary if a restart occurs in the meantime.
- Segments written by the local shard are replayed later, in order of
  their IDs. This makes sure that local hints are replayed in the order
  they were written to segments, and will make it possible to use replay
  positions to track progress of hint replay.
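
The resulting replay order can be sketched as follows. The `Segment`
descriptor and its fields are hypothetical, and the order among foreign
segments is simply left as given, since the description above only
requires that they come first:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    shard: int       # hypothetical: shard that wrote the segment
    segment_id: int  # hypothetical: ID that grows with creation order


def replay_order(segments: List[Segment], local_shard: int) -> List[Segment]:
    """Foreign segments first, then local segments sorted by ID."""
    foreign = [s for s in segments if s.shard != local_shard]
    local = sorted((s for s in segments if s.shard == local_shard),
                   key=lambda s: s.segment_id)
    return foreign + local
```

Because local segments come out in ID order, a replay position in a later
local segment is only reached after all earlier local segments are done.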
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
f83699bb7c Revert "db/hints: allow to forcefully update segment list on flush"
This reverts commit e48739a6da.

This commit removes the functionality from endpoint hints manager which
allowed to flush hints immediately and forcefully update the list of
segments to replay.

The new implementation of waiting for hints will be based on replay
positions returned by the commitlog API and it won't be necessary to
forcefully update the segment list when creating a sync point.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
9c1d4e7e6c Revert "db/hints: add a metric for counting processed files"
This reverts commit 5a49fe74bb.

This commit removes a metric which tracks how many segments were
replayed during current runtime. It was necessary for current "wait for
hints" mechanism which is being replaced with a different one -
therefore we can remove the metric.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
3b851a5ebd Revert "db/hints: make it possible to wait until current hints are sent"
This reverts commit 427bbf6d86.

This commit removes the infrastructure which allows to wait until
current hints are replayed in a given hint queue.

It will be replaced with a different mechanism in later commits.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
4a35d138f6 Revert "storage_proxy: add functions for syncing with hints queue"
This reverts commit 244738b0d5.

This commit removes create_hint_queue_sync_point and
check_hint_queue_sync_point functions from storage_proxy, which were
used to wait until local hints are sent out to particular nodes.

Similar methods will be reintroduced later in this PR, with a completely
different implementation.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
0d74dee683 Revert "messaging_service: add verbs for hint sync points"
This reverts commit 82c419870a.

This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.

The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
4604bb21c3 Revert "storage_proxy: implement verbs for hint sync points"
This reverts commit 485036ac33.

This commit removes the handlers for HINT_SYNC_POINT_CREATE and
HINT_SYNC_POINT_CHECK verbs.

The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
ff453d80ff Revert "config: add wait_for_hint_replay_before_repair option"
This reverts commit 86d831b319.

This commit removes the wait_for_hints_before_repair option. Because a
previous commit in this series removes the logic from repair which
caused it to wait for hints to be replayed, this option is now useless.

We can safely remove this option because it is not present in any
release yet.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
6c5d2fe0bf Revert "storage_proxy: coordinate waiting for hints to be sent"
This reverts commit 46075af7c4.

This commit removes the logic responsible for waiting for other nodes to
replay their hints. The upcoming HTTP API for waiting for hint replay
will be restricted to waiting for hints on the node handling the
request, so there is no need for coordinating multiple nodes.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
ecf854affc Revert "repair: plug in waiting for hints to be sent before repair"
This reverts commit 49f4a2f968.

The idea to wait for hints to be replayed before repair is not always a
good one. For example, someone might want to repair a small token range
or just one table - but hinted handoff cannot selectively replay hints
like this.

The fact that we are waiting for hints before repair caused a small
number of regressions (#8612, #8831).

This commit removes the logic in repair which caused it to wait for
hints. Additionally, the `storage_proxy.hh` include, which was
introduced in the commit being reverted is also removed and smaller
header files are included instead (gossiper.hh and fb_utilities.hh).
2021-08-09 09:22:26 +02:00
Piotr Dulikowski
e3c32c897a Revert "hints: dismiss segment waiters when hint queue can't send"
This reverts commit 9d68824327.

First, we are reverting existing infrastructure for waiting for hints in
order to replace it with a different one, therefore this commit needs to
be reverted as well.

Second, errors during hint replay can occur naturally and don't
necessarily indicate that no progress can be made - for example, the
target node is heavily loaded and some hints time out. The "waiting for
hints" operation becomes a user-issued command, so it's not as vital to
ensure liveness.
2021-08-09 09:06:23 +02:00
Piotr Dulikowski
afb4c85662 Revert "storage_proxy: stop waiting for hints replay when node goes down"
This reverts commit 22e06ace2c.

The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so we are
removing all infrastructure related to coordinating hint waiting -
therefore this commit needs to be reverted.
2021-08-09 09:06:23 +02:00
Piotr Dulikowski
035da96161 Revert "storage_proxy: add abort_source to wait_for_hints_to_be_replayed"
This reverts commit 958a13577c.

The `wait_for_hints_to_be_replayed` function is going to be completely
removed in this PR, so this commit needs to be reverted, too.
2021-08-09 09:06:23 +02:00
Takuya ASADA
b822c642e5 docker: fix housekeeping --repo-files to apt repository
Even though we switched to an Ubuntu-based container image, housekeeping
is still using the yum repository.
It should be switched to the apt repository.

Fixes #9144

Closes #9147
2021-08-09 07:47:03 +03:00
Avi Kivity
31dcb0d1d0 Update seastar submodule
* seastar ce3cc2687f...07758294ef (12):
  > perftune.py: change hwloc-calc parameters order
Fixes perftune on Fedora 34 based hwloc
  > resource: pass configuration to nr_processing_units()
  > semaphore: semaphore_timed_out: derive from timed_out_error
  > Merge "resource: use hwloc_topology_holder" from Benny
  > Merge "file: ioctl, fcntl and lifetime_hint interfaces in seastar::file" from Arun George
  > pipe: mark pipe_reader and pipe_writer ctors as noexcept
  > test: pipe: add simple unit test
  > test: source_location_test: relax function name check for gcc 11
  > http: add 429 too_many_requests status code
  > Added [[nodiscard]] to abort-source's subscribe
  > io_queue: Use on_internal_error in io_queue
  > reactor: Remove unused epoll poller from reactor
2021-08-08 14:42:54 +03:00
Avi Kivity
3b5e312800 db: schema_tables: clean up read_schema_partition_for_keyspace() coroutine captures
read_schema_partition_for_keyspace() copies some parameters to capture them
in a coroutine, but the same can be achieved more cleanly by changing the
reference parameters to value parameters, so do that.

Test: unit (dev)

Closes #9154
2021-08-08 12:55:10 +03:00
Nadav Har'El
61bcc0ad29 Merge 'compaction: Move compaction_strategy.hh and compaction_garbage_collector.hh to compaction directory ' from Asias He
This trial patch set moves compaction_strategy.hh and compaction_garbage_collector.hh to compaction directory and drops two unused compact_for_mutation_query_state and compact_for_data_query_state.

Closes #9156

* github.com:scylladb/scylla:
  compaction: Move compaction_garbage_collector.hh to compaction dir
  compaction: Move compaction_strategy.hh to compaction dir
  mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state
2021-08-08 11:58:41 +03:00
Dejan Mircevski
ba55769f80 test: Use ALLOW FILTERING more strictly
Prepare for the upcoming strict ALLOW FILTERING check by modifying
unit-test queries that need it.  Current code allows such queries both
with and without ALLOW FILTERING; future code will reject them without
ALLOW FILTERING.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-08 08:01:19 +02:00
Dejan Mircevski
5da846a4a8 cql3: Add statement_restrictions::to_string
Useful for error messages and debugging.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-08 07:16:55 +02:00
Asias He
4c1f8c2f83 compaction: Move compaction_garbage_collector.hh to compaction dir
The top dir is a mess. Move compaction_garbage_collector.hh to
the new home.
2021-08-07 08:07:09 +08:00
Asias He
6350a19f73 compaction: Move compaction_strategy.hh to compaction dir
The top dir is a mess. Move compaction_strategy.hh and
compaction_strategy_type.hh to the new home.
2021-08-07 08:06:37 +08:00
Asias He
47aae83185 mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state
They are not used.
2021-08-07 07:21:48 +08:00
Tomasz Grabiec
0af2c2b1cb Merge "raft: store cluster configuration when taking snapshots" from Kamil
The cluster would forget its configuration when taking a snapshot,
making it unable to reelect a leader.

We fix the problem and introduce a regression test.

The last commit introduces some additional assertions for safety.

* kbr/snapshot-preserve-config-v4:
  raft: sanity checking of apply index
  test: raft: regression test for storing cluster configuration when taking snapshots
  raft: store cluster configuration when taking snapshots
2021-08-06 18:34:53 +02:00
Kamil Braun
7533c84e62 raft: sometimes become a candidate even if outside the configuration
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
2021-08-06 13:18:32 +02:00
Kamil Braun
907672622f raft: fsm: update _commit_idx when applying snapshot
All entries up to snapshot.idx must obviously be committed, so why not
update _commit_idx to reflect that.

With this we get a useful invariant:
`_log.get_snapshot().idx <= _commit_idx`.
For example, when checking whether the latest active configuration is
committed, it should be enough to compare the configuration index to the
commit index. Without the invariant we would need a special case if the
latest configuration comes from a snapshot.
2021-08-06 12:43:07 +02:00
Kamil Braun
1ca4d30cc3 raft: sanity checking of apply index
Check that entries are applied in the correct order.
2021-08-06 12:21:19 +02:00
Kamil Braun
93822b0ee7 test: raft: regression test for storing cluster configuration when taking snapshots
Before the fix introduced in the previous patch, the cluster would
forget its configuration when taking a snapshot, making it unable to
reelect a leader. This regression test catches that.
2021-08-06 12:17:22 +02:00
Kamil Braun
c6563220b0 raft: store cluster configuration when taking snapshots
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.

This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.

In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
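
A rough model of the lookup, for illustration only. The `Log` and
`LogEntry` types and the method name `last_conf_for` are hypothetical,
and it is an assumption of this sketch that a configuration entry located
exactly at the given index counts as well:

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical model: a configuration is a set of server IDs.
Config = FrozenSet[str]


@dataclass
class LogEntry:
    idx: int
    config: Optional[Config] = None  # set only for configuration entries


@dataclass
class Log:
    snapshot_idx: int
    snapshot_config: Config
    entries: List[LogEntry]  # entries after the snapshot, in ascending idx order

    def last_conf_for(self, idx: int) -> Config:
        """Configuration in effect at `idx`."""
        if idx < self.snapshot_idx:
            # Entries covered by the snapshot are gone, so the answer
            # cannot be reconstructed from the log.
            raise ValueError("index is smaller than the last snapshot index")
        conf = self.snapshot_config
        for entry in self.entries:
            if entry.idx > idx:
                break
            if entry.config is not None:
                conf = entry.config
        return conf
```

The `ValueError` branch corresponds to the situation the strengthened
snapshot condition is meant to rule out.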
2021-08-06 12:00:32 +02:00
Avi Kivity
52364b5da0 Merge 'cql3: Use expressions to calculate the local-index clustering ranges' from Jan Ciołek
Calculating clustering ranges on a local index has been rewritten to use the new `expression` variant.

This allows us to finally remove the old `bounds_ranges` function.

Closes #9080

* github.com:scylladb/scylla:
  cql3: Remove unused functions like bounds_ranges
  cql3: Use expressions to calculate the local-index clustering ranges
  statement_restrictions_test: tests for extracting column restrictions
  expression: add a function to extract restrictions for a column
2021-08-05 18:32:11 +03:00
Tomasz Grabiec
4bfff86ba5 gdb: Print disengaged optionals as std::nullopt to reduce noise
Message-Id: <20210805113409.75394-1-tgrabiec@scylladb.com>
2021-08-05 14:42:31 +02:00
Kamil Braun
f050d3682c raft: fsm: stronger check for outdated remote snapshots
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
2021-08-05 14:29:50 +02:00
Tomasz Grabiec
8fe06ad681 storage_proxy: Fix result reconciliation for memory-limiter induced short reads
This applies to the case when pages are broken by replicas based on
memory limits (not row or partition limits).

If replicas stop pages in the following places:

  replica1 = {
     row 1,
     <end-of-page>
     row 2
  }

  replica2 = {
    row 3
  }

The coordinator will reconcile the first page as:

  {
    row 1,
    row 3
  }

and row 2 will not be emitted at all in the following pages.

The coordinator should notice that replica1 returned a short read and
ignore everything past row 1 from other replicas, but it doesn't.

There is logic to do this trimming, but it is done in
got_incomplete_information_across_partitions() which is executed only
for the partition for which row limits were exhausted.

Fix by running the logic unconditionally.
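
The trimming rule can be sketched with a toy model in which rows are just
sortable keys. `ReplicaPage` and `reconcile` are hypothetical names, and
the real reconciliation merges mutations rather than bare keys:

```python
from typing import List, NamedTuple


class ReplicaPage(NamedTuple):
    rows: List[int]   # row keys in clustering order (hypothetical model)
    short_read: bool  # True if the replica cut the page early


def reconcile(pages: List[ReplicaPage]) -> List[int]:
    """Merge replica pages, trimming past the last row of any short read."""
    merged = sorted({row for page in pages for row in page.rows})
    # A replica that stopped early may hold more rows after its last one,
    # so nothing past that row can be trusted for this page.
    cutoffs = [page.rows[-1] for page in pages if page.short_read and page.rows]
    if cutoffs:
        limit = min(cutoffs)
        merged = [row for row in merged if row <= limit]
    return merged


# The scenario from the description: replica1 stops after row 1.
first_page = reconcile([ReplicaPage([1], True), ReplicaPage([3], False)])
```

With the trimming applied, `first_page` contains only row 1, so rows 2
and 3 are both left for the following pages.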

Fixes #9119

Tests:
  - unit (dev)
  - manual (2 node cluster, manual reproducer)

Message-Id: <20210802231539.156350-1-tgrabiec@scylladb.com>
2021-08-05 11:28:52 +03:00
Nadav Har'El
ae51fef57c cql-pytest: add tests for estimated partition count
In issue #9083 a user noted that whereas Cassandra's partition-count
estimation is accurate, Scylla's (rewritten in commit b93cc21) is very
inaccurate. The tests introduced here, which all xfail on Scylla, confirm
this suspicion.

The most important tests are the "simple" tests, involving a workload
which writes N *distinct* partitions and then asks for the estimated
partition count. Cassandra provides accurate estimates, which grow
more accurate with more partitions, so it passes these tests, while
Scylla provides bad estimates and fails them.

Additional tests demonstrate that neither Scylla nor Cassandra
can handle anything beyond the "simple" case of distinct partitions.
Two tests which xfail on both Cassandra and Scylla demonstrate that
if we write the same partitions to multiple sstables - or also delete
partitions - the estimated partition counts will be way off.

Refs #9083

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210726211315.1515856-1-nyh@scylladb.com>
2021-08-05 08:50:19 +02:00
Asias He
9903eecc0f storage_service: Close reader in load_and_stream
We forgot to call reader.close() on the reader when the close API was
introduced.

Fixes #9146

Closes #9148
2021-08-05 09:27:19 +03:00
Botond Dénes
76f2790c24 compaction/compaction_descriptor: add comment to Validation compaction type
Add a note explaining what Origin uses this for, to deter future
attempts at reusing this for something else.
2021-08-05 07:36:45 +03:00
Botond Dénes
ab7a2cabb3 compaction/compaction_descriptor: compaction_options: remove validate
It is unused now.
2021-08-05 07:36:45 +03:00
Botond Dénes
c1203618eb api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
Fold validate keyspace into scrub keyspace (validate mode).
2021-08-05 07:36:45 +03:00
Botond Dénes
5f6468d7d7 compaction/compaction_manager: hide perform_sstable_validation()
We are folding validation compaction into scrub (at least on the
interface level), so remove the validation entry point accordingly and
have users go through `perform_sstable_scrub()` instead.
2021-08-05 07:36:44 +03:00
Botond Dénes
a258f5639b compaction: validation compaction -> scrub compaction (validate mode)
Fold validation compaction into scrub compaction (validate mode). Only
on the interface level though: to initiate validation compaction one now
has to use `compaction_options::make_scrub(compaction_options::scrub::mode::validate)`.
The implementation code stays as-is -- separate.
2021-08-05 07:32:05 +03:00
Raphael S. Carvalho
154e8959f9 compaction: Optimize partition filtering for cleanup compaction
Realized that the overall complexity of partition filtering in
cleanup is O(N * log(M)), where
	N is # of tokens
	M is # of ranges owned by the node

Assuming N=10,000,000 for a table and M=257, N*log(M) ~= 80,056,245
checks performed during the whole cleanup.

This can be optimized by taking advantage of the fact that owned ranges
are both sorted and non-wrapping, so an incremental iterator-oriented
checker is introduced to reduce complexity from O(N * log(M)) to
O(N + M) or just O(N).
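
A sketch of both approaches, for illustration only. Ranges are modelled
as inclusive integer pairs, all names are hypothetical, and the
incremental checker assumes tokens arrive in non-decreasing order:

```python
import bisect
import math
from typing import List, Tuple

Range = Tuple[int, int]  # hypothetical model: inclusive [start, end] tokens


def filter_binary_search(ranges: List[Range], tokens: List[int]) -> List[int]:
    """O(N * log(M)): one binary search over the sorted ranges per token."""
    starts = [start for start, _ in ranges]
    kept = []
    for token in tokens:
        i = bisect.bisect_right(starts, token) - 1
        if i >= 0 and token <= ranges[i][1]:
            kept.append(token)
    return kept


class IncrementalOwnedRangesChecker:
    """O(N + M) overall: the cursor over the ranges only moves forward."""

    def __init__(self, ranges: List[Range]) -> None:
        self._ranges = ranges
        self._pos = 0

    def belongs(self, token: int) -> bool:
        # Ranges ending before this token cannot match any later token either.
        while self._pos < len(self._ranges) and token > self._ranges[self._pos][1]:
            self._pos += 1
        return self._pos < len(self._ranges) and token >= self._ranges[self._pos][0]


def estimated_binary_search_checks(n: int, m: int) -> float:
    """N * log2(M); base 2 matches the roughly 80 million quoted above."""
    return n * math.log2(m)
```

The incremental version relies on the input order, which is why it fits
an ordered scan but not arbitrary lookups.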

BEFORE

240MB to 237MB (~98% of original) in 3239ms = 73MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9649ms = 74MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 15231ms = 74MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 15244ms = 74MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 15263ms = 74MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 15216ms = 74MB/s. ~4536832 total partitions merged to 4536812.

AFTER

240MB to 237MB (~98% of original) in 3169ms = 74MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9444ms = 76MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 14882ms = 76MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 14918ms = 76MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 14919ms = 76MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 14894ms = 76MB/s. ~4536832 total partitions merged to 4536812.

Fixes #6807.

test: mode(dev).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802213159.182393-1-raphaelsc@scylladb.com>
2021-08-04 20:35:44 +03:00
Jan Ciolek
44ca965ba0 cql3: Remove unused functions like bounds_ranges
Finding clustering ranges has been rewritten to use the new
expression variant.
Old bounds_ranges() and other similar ones are no longer needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-08-04 17:12:44 +02:00
Jan Ciolek
da54c9e2fb cql3: Use expressions to calculate the local-index clustering ranges
Removes old code used to calculate local-index clustering range
and replaces it with new code based on the expression variant.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-08-04 17:12:40 +02:00
Asias He
6230bd4b5a locator: Add yield in do_get_ranges and friends
Not all calculate_natural_endpoints implementations respect the
can_yield flag, for example, everywhere_replication_strategy.

This patch adds yield at the caller site to fix stalls we saw in
do_get_ranges.
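
The pattern of yielding at the caller can be illustrated with an asyncio
analogy (Scylla itself uses Seastar, not asyncio). Both function names
below are hypothetical stand-ins:

```python
import asyncio
from typing import Dict, List


def calculate_endpoints(token: int) -> List[str]:
    """Stand-in for a strategy that does all its work without yielding."""
    return [f"node-{token % 3}"]


async def get_ranges(tokens: List[int]) -> Dict[int, List[str]]:
    """Caller-side yield: give other tasks a turn after each token."""
    result = {}
    for token in tokens:
        result[token] = calculate_endpoints(token)
        # The callee never yields, so the caller does it between iterations.
        await asyncio.sleep(0)
    return result
```

Yielding in the loop bounds the time between scheduling points by the
cost of a single call, regardless of how many tokens are processed.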

Fixes #8943

Closes #9139
2021-08-04 15:52:37 +03:00
Laura Novich
54f0b1556d Update Conf.py to remove master from drop-down
Still allows master to be built, but users will not be able to select it.

Closes #9140
2021-08-04 15:24:47 +03:00
Laura Novich
4d7835d635 Update docs/conf.py
Co-authored-by: David Garcia <hi@davidgarcia.dev>
2021-08-04 15:24:47 +03:00
Laura Novich
3533d5ec15 Update docs/conf.py
Co-authored-by: David Garcia <hi@davidgarcia.dev>
2021-08-04 15:24:47 +03:00
lauranovich
79f0dc64cb add multiversion control to scylla 2021-08-04 15:24:47 +03:00
Benny Halevy
3ad0067272 date_tiered_manifest: get_now: fix use after free of sstable_list
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.

Fixes #9138

Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
2021-08-04 15:24:47 +03:00
Nadav Har'El
d640998ca8 test/cql-pytest: add test for another ALLOW FILTERING case
In this patch we add another test case for a case where ALLOW FILTERING
should not be required (and Cassandra doesn't require it) but Scylla
does.

This problem was introduced by pull request #9122. The pull request
fixed an incorrect query (see issue #9085) involving both an index and
a multi-column restriction on a compound clustering key - and the fix is
using filtering. However, in one specific case involving a full prefix,
it shouldn't require filtering. This test reproduces this case.

The new test passes on Cassandra (as it should in theory),
but fails on Scylla - the check_af_optional() call fails because Scylla
made the ALLOW FILTERING mandatory for that case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210803092046.1677584-1-nyh@scylladb.com>
2021-08-04 15:24:47 +03:00
Nadav Har'El
dba184039a test/alternator: another test for Query's ExclusiveStartKey
We already have tests for Query's ExclusiveStartKey option, but we
only exercised it as a way for paging linearly through all the results.
Now we add a test that confirms that ExclusiveStartKey can be used not
just for paging through all the results - but also for jumping directly to
the middle of a partition after any clustering key (existing or
non-existing clustering key). The new test also for the first time verifies
that ExclusiveStartKey with a specific format works (previous tests just
copied LastEvaluatedKey to ExclusiveStartKey, so any opaque cookie could
have worked).

The test passes on both DynamoDB and Alternator so it did not find a new
bug. But it's useful to have as a regression test, in case in the future
we want to improve paging performance (see #6278) - and need to keep in
mind that ExclusiveStartKey is not just for paging.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210729114703.1609058-1-nyh@scylladb.com>
2021-08-04 15:24:47 +03:00
Kamil Braun
4165045356 test: raft: randomized_nemesis_test: handle timeouts in rpc::send_snapshot
They were already correctly returned to the caller, but we had a
leftover discarded future that would sometimes end up with a
broken_promise exception. Ignore the exception explicitly.
Message-Id: <20210803122207.78406-1-kbraun@scylladb.com>
2021-08-04 15:24:47 +03:00
Nadav Har'El
9662de85f5 Merge 'Azure snitch support' from Pekka Enberg
This adds support for Azure snitch. The work is an adaptation of
AzureSnitch for Apache Cassandra by Yoshua Wakeham:

https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java

Also change `production_snitch_base` to protect against
a snitch implementation setting DC and rack to an empty string,
which Lubos says can happen on Azure.

Fixes #8593

Closes #9084

* github.com:scylladb/scylla:
  scylla_util: Use AzureSnitch on Azure
  production_snitch_base: Fallback for empty DC or rack strings
  azure_snitch: Azure snitch support
2021-08-03 22:52:05 +03:00
Pavel Solodovnikov
ce330d11af cql3: create_view_statement: validate bound variables at prepare step
Variables specification is already known at prepare step, so
it's safe to move the check to happen as early as possible.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210802090852.253469-1-pa.solodovnikov@scylladb.com>
2021-08-03 22:52:05 +03:00
Tomasz Grabiec
cd56a4ec09 service: query_pagers: Reuse query_uuid across pages when paging locally
Query pager was reusing query_uuid only when it had no local state (no
_last_pkey), so querier cache was not used when paging locally.

This bug affects performance of aggregate queries like count(*).

Fixes #9127
Message-Id: <20210803003941.175099-1-tgrabiec@scylladb.com>
2021-08-03 22:52:05 +03:00
Botond Dénes
8b64a6caa7 compaction/compaction_descriptor: compaction_options: add options() accessor 2021-08-03 09:34:17 +03:00
Botond Dénes
f01b799a30 compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
To replace compaction_type::Validation.
2021-08-03 09:34:15 +03:00
Avi Kivity
885ca2158e db: schema_tables: reindent
Following conversion to coroutines in fc91e90c59, remove extra
indents and braces left to make the change clearer.

One variable had to be renamed since without the braces it
duplicated another variable in the same block.

Test: unit (dev)

Closes #9125
2021-08-02 22:36:57 +02:00
Raphael S. Carvalho
a869d61c89 tests: Move compaction-related tests into its own unit
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into its own test file. This also makes things much more
organized.

sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to only
sstable_data_test, as the test also contains tests involving
other components, not only the data one.

BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105

AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
2021-08-02 22:26:26 +03:00
Avi Kivity
0f50f8ec5f Merge "Allow reshape to be aborted" from Raphael
"
Now reshape can be aborted on either boot or refresh.

The workflow is:
    1) reshape starts
    2) user notices it's taking too long
    3) nodetool stop RESHAPE

The good thing is that completed reshape work isn't lost, allowing
the table to enjoy the benefits of all reshaping done up to the point
of the abort.

Fixes #7738.
"

* 'abort_reshape_v1' of https://github.com/raphaelsc/scylla:
  compaction: Allow reshape to be aborted
  api: make compaction manager api available earlier
2021-08-02 21:59:42 +03:00
Raphael S. Carvalho
aa7cdc0392 compaction: Allow reshape to be aborted
Now reshape can be aborted on either boot or refresh.

The workflow is:
    1) reshape starts
    2) user notices it's taking too long
    3) nodetool stop RESHAPE

The good thing is that completed reshape work isn't lost, allowing
the table to enjoy the benefits of all reshaping done up to the point
of the abort.

Fixes #7738.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-08-02 13:54:51 -03:00
Raphael S. Carvalho
33404b9169 api: make compaction manager api available earlier
That will be needed for aborting reshape on boot.

Refs #7738.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-08-02 13:54:44 -03:00
Raphael S. Carvalho
f75154afca compaction: Remove overhead of merging reader for cleanup compaction
When profiling cleanup with perf, the merging reader showed up as significant.
Given that cleanup is performed on a single sstable at a time,
merging reader becomes an extra layer doing useless work.

     1.71%     1.71%  scylla     scylla               [.] merging_reader<mutation_reader_merger>::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}::operator()

The mutation compactor still consumes the data retrieved by the
sstable reader (to get rid of purgeable expired data and so on),
so there is no semantic change.

With the overhead removed, cleanup becomes ~9% faster, see:

BEFORE

real	1m15.240s
user	0m2.648s
sys	0m0.128s

240MB to 237MB (~98% of original) in 3301ms = 71MB/s.
719MB to 719MB (~99% of original) in 9761ms = 73MB/s.
1GB to 1GB (~100% of original) in 15372ms = 73MB/s.
1GB to 1GB (~100% of original) in 15343ms = 74MB/s.
1GB to 1GB (~100% of original) in 15329ms = 74MB/s.
1GB to 1GB (~100% of original) in 15360ms = 73MB/s.

AFTER

real	1m9.154s
user	0m2.428s
sys	0m0.123s

240MB to 237MB (~98% of original) in 3010ms = 78MB/s.
719MB to 719MB (~99% of original) in 8997ms = 79MB/s.
1GB to 1GB (~100% of original) in 14114ms = 80MB/s.
1GB to 1GB (~100% of original) in 14145ms = 80MB/s.
1GB to 1GB (~100% of original) in 14106ms = 80MB/s.
1GB to 1GB (~100% of original) in 14053ms = 80MB/s.

With a 1TB data set, ~20m would have been saved instead.
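
The figures above can be cross-checked with a little arithmetic. The sketch below is not part of the patch: it takes the steady-state throughputs reported above (73MB/s before, 80MB/s after), assumes 1TB = 1024 * 1024 MB, and uses a made-up helper name, `seconds_to_process`.

```python
def seconds_to_process(size_mb, throughput_mb_s):
    """Time needed to clean up size_mb of data at a constant throughput."""
    return size_mb / throughput_mb_s

# Steady-state throughputs taken from the BEFORE/AFTER runs above.
before_mb_s = 73
after_mb_s = 80

# Relative throughput gain (close to the ~9% quoted above).
speedup = after_mb_s / before_mb_s - 1

# Assumption: 1TB is counted as 1024 * 1024 MB.
one_tb_mb = 1024 * 1024
saved_minutes = (
    seconds_to_process(one_tb_mb, before_mb_s)
    - seconds_to_process(one_tb_mb, after_mb_s)
) / 60

print(f"throughput gain: {speedup:.1%}, saved on 1TB: {saved_minutes:.0f} min")
```

The wall-clock times (1m15.240s versus 1m9.154s) give a similar ratio, roughly 8.8%.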

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210730190713.462135-1-raphaelsc@scylladb.com>
2021-08-02 19:22:41 +03:00
Michael Livshin
0eb2eb1b44 rename coarse_clock to coarse_steady_clock
Also add a comment to explain why it exists.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9123
2021-08-02 17:41:21 +03:00
Jan Ciolek
a7d1dab066 statement_restrictions_test: tests for extracting column restrictions
Add unit tests for the function
extract_single_column_restrictions_for_column()

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-08-02 15:43:42 +02:00
Jan Ciolek
43ab3d6831 expression: add a function to extract restrictions for a column
Add a function, which given an expression and a column,
extracts all restrictions involving this column.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-08-02 15:43:33 +02:00
Tomasz Grabiec
3e47f28c65 Merge "raft: use the correct term when storing a snapshot" from Kamil
We should not use the current term; we should use the term of the
snapshot's index, which may be lower.

* https://github.com/kbr-/scylla/tree/snapshot-right-term-fix:
  test: raft: regression test for using the correct term when taking a snapshot
  test: raft: randomized_nemesis_test: server configuration parameter
  raft: use the correct term when storing a snapshot
2021-08-02 15:33:52 +02:00
Eduardo Benzecri
f196a4131a scylla_setup: Fix outdated message
Message changed according to what 'scylla_bootparam_setup' currently does
(set a clock source at boot time) instead of what it used to do in
the past (setting huge pages).

Closes #9116.
2021-08-02 16:04:38 +03:00
Dejan Mircevski
debf65e136 cql3: Filter regular-index results on multi-column
When a WHERE clause contains a multi-column restriction and an indexed
regular column, we must filter the results.  It is generally not
possible to craft the index-table query so it fetches only the
matching rows, because that table's clustering key doesn't match up
with the column tuple.

Fixes #9085.

Tests: unit (dev, debug)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9122
2021-08-02 14:15:43 +03:00
Nadav Har'El
fc91e90c59 Merge 'db: schema_tables: coroutinize' from Avi Kivity
schema_tables is quite hairy, but can be easily simplified with coroutines.

In addition to switching future-returning functions to coroutines, we also
switch Seastar threads to coroutines. This is less of a clear-cut win; the
motivation is to reduce the chances of someone calling a function that
expects to run in a thread from a non-thread context. This sometimes works
by accident, but when it doesn't, it's pretty bad. So a uniform calling convention
has some benefit.

I left the extra indents in, since the indent-fixing patch is hard to rebase in case
a rebase is needed. I will follow up with an indent fix post merge.

Test: unit (dev, debug, release)

Closes #9118

* github.com:scylladb/scylla:
  db: schema_tables: drop now redundant #includes
  db: schema_tables: coroutinize drop_column_mapping()
  db: schema_tables: coroutinize column_mapping_exists()
  db: schema_tables: coroutinize get_column_mapping()
  db: schema_tables: coroutinize read_table_mutations()
  db: schema_tables: coroutinize create_views_from_schema_partition()
  db: schema_tables: coroutinize create_views_from_table_row()
  db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
  db: schema_tables: coroutinize create_tables_from_tables_partition()
  db: schema_tables: coroutinize create_table_from_name()
  db: schema_tables: coroutinize read_table_mutations()
  db: schema_tables: coroutinize merge_keyspaces()
  db: schema_tables: coroutinize do_merge_schema()
  db: schema_tables: futurize and coroutinize merge_functions()
  db: schema_tables: futurize and coroutinize user_types_to_drop::drop
  db: schema_tables: futurize and coroutinize merge_types()
  db: schema_tables: futurize and coroutinize merge_tables_and_views()
  db: schema_tables: coroutinize store_column_mapping()
  db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
  db: schema_tables: coroutinize read_table_names_of_keyspace()
  db: schema_tables: coroutinize recalculate_schema_version()
  db: schema_tables: coroutinize merge_schema()
  db: schema_tables: introduce and use with_merge_lock()
  db: schema_tables: coroutinize update_schema_version_and_announce()
  db: schema_tables: coroutinize read_keyspace_mutation()
  db: schema_tables: coroutinize read_schema_partition_for_table()
  db: schema_tables: coroutinize read_schema_partition_for_keyspace()
  db: schema_tables: coroutinize query_partition_mutation()
  db: schema_tables: coroutinize read_schema_for_keyspaces()
  db: schema_tables: coroutinize convert_schema_to_mutations()
  db: schema_tables: coroutinize calculate_schema_digest()
  db: schema_tables: coroutinize save_system_schema()
2021-08-02 13:43:53 +03:00
Kamil Braun
ac5121a016 test: raft: regression test for using the correct term when taking a snapshot 2021-08-02 11:48:35 +02:00
Kamil Braun
63fdc718d4 test: raft: randomized_nemesis_test: server configuration parameter 2021-08-02 11:47:19 +02:00
Kamil Braun
e9632ee986 raft: use the correct term when storing a snapshot
We should not use the current term; we should use the term of the
snapshot's index, which may be lower.
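
A minimal sketch of the distinction, in Python rather than the C++ of the actual fix. The log layout (a dict from index to term) and the names `term_at`, `log_terms` and `snapshot_idx` are assumptions made for illustration only.

```python
def term_at(log_terms, idx):
    """Return the term recorded for the log entry at index idx."""
    return log_terms[idx]

# Hypothetical log: index -> term in which the entry was appended.
log_terms = {1: 1, 2: 1, 3: 2, 4: 3}
current_term = 3

# The snapshot covers entries up to index 2, which was appended in term 1.
snapshot_idx = 2
snapshot_term = term_at(log_terms, snapshot_idx)

# Storing current_term here would label the snapshot with term 3,
# even though the entry at snapshot_idx belongs to term 1.
print(snapshot_term, current_term)
```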
2021-08-02 11:46:04 +02:00
Avi Kivity
e4d0af808d Merge 'repair: Log improvement and cleanup' from Asias He
This series improves the repair logging by removing the unused sub_ranges_nr counter, adding peer node ip in the log, removing redundant logs in case of error.

Closes #9120

* github.com:scylladb/scylla:
  repair: Remove redudnary error log in tracker::run
  repair: Do not log errors in repair_ranges
  repair: Move more repair single range code into repair_info::repair_range
  repair: Use the same uuid from the repair_info
  repair: Drop sub_ranges_nr counter
2021-08-02 12:04:39 +03:00
Avi Kivity
ebda2fd4db test: cql_test_env: increase file descriptor limit
It was observed that since fce124bd90 ('Merge "Introduce
flat_mutation_reader_v2" from Tomasz') database_test takes much longer.
This is expected since it now runs the upgrade/downgrade reader tests
on all existing tests. It was also observed that in a similar time frame
database_test sometimes times out on test machines, taking much
longer than usual, even with the extra work for testing reader
upgrade/downgrade.

In an attempt to reproduce, I noticed it failing on EMFILE (too many
open file descriptors). I saw that tests usually use ~100 open file
descriptors, while the default limit is 1024.

I suspect we have runaway concurrency, but I was not able to pinpoint the
cause. It could be compaction lagging behind, or cleanup work for
deleting tables (the test
test_database_with_data_in_sstables_is_a_mutation_source creates and
deletes many tables).

As a stopgap solution to unblock the tests, this patch raises the file
descriptor limit in the way recommended by [1]. While tests shouldn't
use so many descriptors, I ran out of ideas about how to plug the hole.
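
For readers unfamiliar with the recommendation in [1], the idea is to raise the soft RLIMIT_NOFILE limit up to the hard limit. The sketch below shows that idea with Python's `resource` module on Linux; it is not the code of this patch, and the function name `raise_nofile_soft_limit` is made up.

```python
import resource

def raise_nofile_soft_limit():
    """Raise the soft open-file limit to the hard limit and return both."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

new_soft, new_hard = raise_nofile_soft_limit()
print(new_soft, new_hard)
```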

Note that main() does something similar, though more elaborate since
it needs to communicate to users. See ec60f44b64 ("main: improve
process file limit handling").

[1] http://0pointer.net/blog/file-descriptor-limits.html

Closes #9121
2021-08-02 11:57:14 +03:00
Asias He
1f86d5a870 repair: Use a timeout for reading fragments
We recently saw repairs blocking and not making progress for a prolonged
time (days). One of the primary suspects is reads belonging to several
repairs deadlocking on the streaming read concurrency semaphore. This is
something that we've seen in the past during internal testing, and
although theoretically we have fixed it, such deadlocks are notoriously
hard to reliably reproduce so not seeing them in recent testing doesn't
mean they definitely cannot happen.

The main reason these deadlocks can happen in the first place is that
reads belonging to repairs don't use timeouts. This means that if there
happens to be a deadlock, neither of the participating repairs will give
up and the only way to release the deadlock is restarting the node.

This patch proposes a workaround for these recently seen repair
problems by introducing a timeout for reads belonging to row-level
repair, building on the premise that a failed repair is better than a
stuck repair. A timeout allows one of the participants to give up,
releasing the deadlock, allowing the others to proceed.

The timeout value chosen by this patch is 30m. Note that this applies to
reading a single mutation fragment, not for the entire read. Thirty
minutes should be more than enough for producing a single mutation
fragment.
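
The effect of a per-fragment timeout can be sketched with an asyncio semaphore standing in for the streaming read concurrency semaphore. This is an illustration only, not the repair code: `read_fragment` and the tiny timeout are assumptions, and a read that cannot get a permit fails instead of waiting forever.

```python
import asyncio

async def read_fragment(semaphore, timeout_s):
    """Wait for a permit for at most timeout_s seconds, then 'read'."""
    try:
        await asyncio.wait_for(semaphore.acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Giving up lets other participants make progress.
        return "failed"
    semaphore.release()
    return "ok"

async def demo():
    exhausted = asyncio.Semaphore(0)  # no permits: models a stuck read
    free = asyncio.Semaphore(1)       # one permit: models a healthy read
    return (
        await read_fragment(exhausted, 0.01),
        await read_fragment(free, 0.01),
    )

result = asyncio.run(demo())
print(result)
```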

Refs: #5359

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>

Closes #9098
2021-08-02 11:55:47 +03:00
Pavel Solodovnikov
b1a3b59a08 test: test_materialized_view: test_mv_select_stmt_bound_values: improve error handling
Restrict expected exception message to filter only relevant exception,
matching both for scylla and cassandra.

For example, the former has this message:

    Cannot use query parameters in CREATE MATERIALIZED VIEW statements

While the latter throws this:

    Bind variables are not allowed in CREATE MATERIALIZED VIEW statements

Also, place cleanup code in try-finally clause.

Tests: cql-pytest:test_materialized_view.py(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210802083912.229886-1-pa.solodovnikov@scylladb.com>
2021-08-02 11:49:50 +03:00
Nadav Har'El
e8fe1817df cql-pytest: translate Cassandra's tests for timestamps
This is a translation of Cassandra's CQL unit test source file
validation/entities/TimestampTest.java into our cql-pytest framework.

This test file has a few tests (8) on various features of
cell timestamps. All these tests pass on Cassandra and on Scylla - i.e.,
no new Scylla bug was detected by these tests :-)

Two of the new tests are very slow (6 seconds each) and check a trivial
feature that was already checked elsewhere more efficiently (the fact
that TTL expiration works), so I marked them "skip" after verifying they
really pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210801142738.1633126-1-nyh@scylladb.com>
2021-08-02 09:25:49 +02:00
Asias He
bd9447370f repair: Remove redudnary error log in tracker::run
The caller of tracker::run will log the error. Remove the log inside
tracker::run in case of error to reduce redundancy.

Before:

   WARN  2021-07-28 15:24:32,325 [shard 0] repair - repair id [id=1,
   uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed: std::runtime_error
   ({shard 0: std::runtime_error (repair id [id=1,
   uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair
   513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
   uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to
   repair 513 out of 513 ranges)})

   WARN  2021-07-28 15:24:32,325 [shard 0] repair - repair_tracker run for
   repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed:
   std::runtime_error ({shard 0: std::runtime_error (repair id [id=1,
   uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair
   513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
   uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to
   repair 513 out of 513 ranges)})

After:

   WARN  2021-07-28 15:33:17,038 [shard 0] repair - repair_tracker run for
   repair id [id=1, uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] failed:
   std::runtime_error ({shard 0: std::runtime_error (repair id [id=1,
   uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 0 failed to repair
   513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
   uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 1 failed to
   repair 513 out of 513 ranges)})
   ERROR 2021-07-28 15:33:17,453 [shard 0] rpc - client 127.0.0.2:7000:
   fail to connect: Connection refused
2021-08-02 10:11:52 +08:00
Gleb Natapov
15d34d9f96 raft: do not let follower's commit_idx to go backwards
append_reply packets can be reordered and thus reply.commit_idx may be
smaller than the one in the tracker. The tracker's commit index is used
to check if a follower needs to be updated with potentially empty
append message, so the bug may theoretically cause unneeded packets to
be sent.
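
A minimal sketch of the idea, assuming the fix amounts to never lowering the tracked value when a stale reply arrives. The class and method names (`FollowerProgress`, `on_append_reply`) are hypothetical and written in Python for brevity.

```python
class FollowerProgress:
    """Tracks what the leader knows about one follower."""

    def __init__(self):
        self.commit_idx = 0

    def on_append_reply(self, reply_commit_idx):
        # Replies may arrive out of order, so keep the largest value seen.
        self.commit_idx = max(self.commit_idx, reply_commit_idx)

progress = FollowerProgress()
progress.on_append_reply(5)
progress.on_append_reply(3)  # reordered, older reply
print(progress.commit_idx)
```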

Message-Id: <YQZZ/6nlNb5nQyXp@scylladb.com>
2021-08-02 01:25:55 +02:00
Tomasz Grabiec
c3ada1a145 Merge "count row (sstables/row cache/memtables) and range (memtables) tombstone reads" from Michael
Fixes #7749.
2021-08-01 23:13:18 +02:00
Avi Kivity
343b98d9b5 Merge "Print memory reclamation diagnostics on stalls" from Michael Livshin
"
Refs #4186 but does not fix it, because I punted on the "number (and
kinds) of objects migrated and evicted" part.
"

* tag 'gh-4186-reclamation-diagnostics-on-stalls-v6' of github.com:cmm/scylla:
  logalloc: add on-stall memory reclaim diagnostics
  utils: add a coarse clock
  logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked
  logalloc: metrics: remove unneeded captures and a pleonasm
  logalloc: add metrics for evicted and freed memory
  logalloc: count evicted memory
  logalloc: count freed memory
2021-08-01 22:48:55 +03:00
Michael Livshin
71d721a97e logalloc: add on-stall memory reclaim diagnostics
Reuse the existing `reclaim_timer` for stall detection.

* Since a timer is now set around every reclaim and compaction, use a
  coarse one for speed.
* Set log level according to conditions (stalls deserve a warning).
* Add compaction/migration/eviction/allocation stats.

Refs #4186.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 21:51:08 +03:00
Michael Livshin
68ab3948f8 utils: add a coarse clock
Implement a millisecond-resolution `std::chrono`-style clock using
`CLOCK_MONOTONIC_COARSE`.  The use cases are those where you care
about clock sampling latency more than about accuracy.

Assuming non-ancient versions of the kernel & libc, all clock types
recognized by `clock_gettime()` are implemented through a vDSO, so
`clock_gettime()` is not an actual system call.  That means that even
`CLOCK_MONOTONIC` (which is what `std::chrono::steady_clock` uses) is
not terribly expensive in practice.

But `CLOCK_MONOTONIC_COARSE` is still 3.5 times faster than that (on
my machine the latencies are 4ns versus 14ns) and is also supposed to
be easier on the cache.

The actual granularity of `CLOCK_MONOTONIC_COARSE` is one tick (on x86-64,
anyway) -- but `clock_getres()` says it has millisecond resolution,
so we use that.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 21:51:08 +03:00
Avi Kivity
ca59754e68 db: schema_tables: drop now redundant #includes 2021-08-01 20:13:15 +03:00
Avi Kivity
40fdbf9558 db: schema_tables: coroutinize drop_column_mapping() 2021-08-01 20:13:15 +03:00
Avi Kivity
7d46300af2 db: schema_tables: coroutinize column_mapping_exists() 2021-08-01 20:13:15 +03:00
Avi Kivity
74b2200f4d db: schema_tables: coroutinize get_column_mapping() 2021-08-01 20:13:15 +03:00
Avi Kivity
f19ca7aaaa db: schema_tables: coroutinize read_table_mutations() 2021-08-01 20:13:15 +03:00
Avi Kivity
81a2be17b6 db: schema_tables: coroutinize create_views_from_schema_partition() 2021-08-01 20:13:15 +03:00
Avi Kivity
15f2fd2a23 db: schema_tables: coroutinize create_views_from_table_row() 2021-08-01 20:13:15 +03:00
Avi Kivity
0843d441ff db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
The tables local is a lw_shared_ptr which is created and then dereferenced
before returning. It can be unpeeled to the pointed-to type, resulting in
one less allocation.
2021-08-01 20:13:15 +03:00
Avi Kivity
66054d24c4 db: schema_tables: coroutinize create_tables_from_tables_partition() 2021-08-01 20:13:15 +03:00
Avi Kivity
82ba3c5f4a db: schema_tables: coroutinize create_table_from_name() 2021-08-01 20:13:15 +03:00
Avi Kivity
862f491605 db: schema_tables: coroutinize read_table_mutations() 2021-08-01 20:13:15 +03:00
Avi Kivity
91c1a29808 db: schema_tables: coroutinize merge_keyspaces() 2021-08-01 20:13:15 +03:00
Avi Kivity
78fc05922b db: schema_tables: coroutinize do_merge_schema()
It is now using an internal thread, so unpeel it and replace
future::get() with co_await.
2021-08-01 20:13:15 +03:00
Avi Kivity
9680d9e76c db: schema_tables: futurize and coroutinize merge_functions()
Right now, merge_functions() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00
Avi Kivity
9cbae212bf db: schema_tables: futurize and coroutinize user_types_to_drop::drop
user_types_to_drop::drop is a function object returning void, and expecting
to be called in a thread. Make it return a future and convert the
only value it is initialized to to a coroutine.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00
Avi Kivity
e5f28fc746 db: schema_tables: futurize and coroutinize merge_types()
Right now, merge_types() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.

The [[nodiscard]] attribute is moved from the function to the
return type, since the function now returns a future which is
nodiscard anyway.

The lambda returned is not coroutinized (yet) since it's part
of the user_types_to_drop inner function that still returns void
and expects to be called in a thread.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00
Avi Kivity
c9584d50ee db: schema_tables: futurize and coroutinize merge_tables_and_views()
Right now, merge_tables_and_views() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00
Avi Kivity
80fe158387 db: schema_tables: coroutinize store_column_mapping() 2021-08-01 20:13:15 +03:00
Avi Kivity
ee8b02f437 db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
Right now, read_tables_for_keyspaces() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00
Avi Kivity
cd1003daad db: schema_tables: coroutinize read_table_names_of_keyspace() 2021-08-01 20:13:15 +03:00
Avi Kivity
000f7eabd5 db: schema_tables: coroutinize recalculate_schema_version() 2021-08-01 20:13:15 +03:00
Avi Kivity
95d33e9e86 db: schema_tables: coroutinize merge_schema() 2021-08-01 20:13:15 +03:00
Avi Kivity
25548f46dd db: schema_tables: introduce and use with_merge_lock()
Rather than open-coding merge_lock()/merge_unlock() pairs, introduce
and use a helper. This helps in coroutinization, since coroutines
don't support RAII with destructors that wait.
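
The shape of such a helper can be sketched as follows. This is a Python asyncio analogue, not the Seastar code: the lock is passed in explicitly, and `merge` is a stand-in for the real work. Python has `async with` for this purpose; the explicit acquire/release is kept only to mirror a helper that wraps a lock/unlock pair around a callable.

```python
import asyncio

async def with_merge_lock(lock, func):
    """Run the coroutine function func() while holding lock."""
    await lock.acquire()
    try:
        return await func()
    finally:
        # Released on both success and failure.
        lock.release()

async def demo():
    lock = asyncio.Lock()

    async def merge():
        # Stand-in for the real work: report whether the lock is held.
        return lock.locked()

    held_during = await with_merge_lock(lock, merge)
    return held_during, lock.locked()

held_during, held_after = asyncio.run(demo())
print(held_during, held_after)
```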
2021-08-01 20:13:15 +03:00
Avi Kivity
7b731ae2c6 db: schema_tables: coroutinize update_schema_version_and_announce() 2021-08-01 20:13:15 +03:00
Avi Kivity
385e0dcc2e db: schema_tables: coroutinize read_keyspace_mutation() 2021-08-01 20:13:15 +03:00
Avi Kivity
ef5df86b1f db: schema_tables: coroutinize read_schema_partition_for_table() 2021-08-01 20:13:15 +03:00
Avi Kivity
8841c2ba10 db: schema_tables: coroutinize read_schema_partition_for_keyspace()
Two reference parameters are copied rather than changing the signature,
to avoid a compile-the-world. It can be cleaned up post-merge.
2021-08-01 20:09:00 +03:00
Michael Livshin
5f9695c1b2 sstables: count read row tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
64dca1fef9 memtables: count read row tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
f364666d4a row_cache: count read row tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
d4a5508d47 memtables: rename partition_snapshot_accounter for consistency
It is actually `partition_snapshot_flush_accounter`, as opposed to
`partition_snapshot_read_accounter`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
69ade155be partition_snapshot_reader: rename MemoryAccounter to just Accounter
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
4d8f99df25 remove the newly-unused partition_snapshot_reader_dummy_accounter
(along with the `make_partition_snapshot_flat_reader` overload that
used it)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
2ee9f1b951 memtables: add metric and accounter for range tombstone reads
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
20c760e638 logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked
Similarly to compact_and_evict().

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:34:13 +03:00
Michael Livshin
a96aed3973 logalloc: metrics: remove unneeded captures and a pleonasm
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:34:13 +03:00
Michael Livshin
aa6c8ef582 logalloc: add metrics for evicted and freed memory
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:34:13 +03:00
Michael Livshin
a6283b322b logalloc: count evicted memory
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:34:13 +03:00
Michael Livshin
4bcd91a09a logalloc: count freed memory
(On the individual free() request level, i.e. similarly to allocs)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:34:13 +03:00
Avi Kivity
d1876488f7 db: schema_tables: coroutinize query_partition_mutation() 2021-08-01 19:17:13 +03:00
Avi Kivity
35f9caf6a9 db: schema_tables: coroutinize read_schema_for_keyspaces() 2021-08-01 19:17:09 +03:00
Avi Kivity
7c0476251a db: schema_tables: coroutinize convert_schema_to_mutations() 2021-08-01 19:16:55 +03:00
Avi Kivity
921216e8e6 db: schema_tables: coroutinize calculate_schema_digest() 2021-08-01 19:16:50 +03:00
Avi Kivity
3dab308ddf db: schema_tables: coroutinize save_system_schema() 2021-08-01 19:16:40 +03:00
Nadav Har'El
6c27000b98 Merge 'Propagate exceptions without throwing' from Piotr Sarna
NOTE: this series depends on a Seastar submodule update, currently queued in next: 0ed35c6af052ab291a69af98b5c13e023470cba3

In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. `make_exception_future<>` for futures
 2. `co_return coroutine::exception(...)` for coroutines
    which return `future<T>` (the mechanism does not work for `future<>`
    without parameters, unfortunately)

Tests: unit(release)

Closes #9079

* github.com:scylladb/scylla:
  system_keyspace: pass exceptions without throwing
  sstables: pass exceptions without throwing
  storage_proxy: pass exceptions without throwing
  multishard_mutation_query: pass exceptions without throwing
  client_state: pass exceptions without throwing
  flat_mutation_reader: pass exceptions without throwing
  table: pass exceptions without throwing
  commitlog: pass exceptions without throwing
  compaction: pass exceptions without throwing
  database: pass exceptions without throwing
2021-08-01 16:47:47 +03:00
Pavel Solodovnikov
d07f681a95 test: test_non_deterministic_functions: add lwt to test cases names
The tests are related to LWT so add the corresponding prefix
to all the tests cases to emphasize that.

Tests: cql-pytest(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210801131820.164480-1-pa.solodovnikov@scylladb.com>
2021-08-01 16:23:30 +03:00
Avi Kivity
48860b135a Merge "cql3: fix current*() functions to be non-deterministic" from Pavel S
"
Previously, the following functions were
incorrectly marked as pure, meaning that the
function is executed at "prepare" step:

* `currenttimestamp()`
* `currenttime()`
* `currentdate()`
* `currenttimeuuid()`

For functions that possibly depend on timing and random seed,
this is clearly a bug. Cassandra doesn't have a notion of pure
functions, so they are lazily evaluated.

Make Scylla match Cassandra's behavior for these functions.

Add a unit-test for the fix (excluding `currentdate()` function,
because there is no way to use synthetic clock with query
processor and sleeping for a whole day to demonstrate correct
behavior is clearly not an option).

Also, extend the cql-pytest for #8604 since there are now more
non-deterministic CQL functions, they are all subject to the test
now.

Fixes: #8816
"

* 'timeuuid_function_pure_annotation_v3' of https://github.com/ManManson/scylla:
  test: test_non_deterministic_functions: test more non-pure functions
  cql3: change `current*()` CQL functions to be non-pure
2021-08-01 12:35:36 +03:00
Pavel Solodovnikov
a130921120 test: test_non_deterministic_functions: test more non-pure functions
Check that all existing non-pure functions (except for
`currentdate()`) work correctly with or without prepared
statements.

Tests: cql-pytest/test_non_deterministic_functions.py(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-01 12:18:26 +03:00
Pavel Solodovnikov
21d758020a cql3: change current*() CQL functions to be non-pure
These include the following:
* `currenttimestamp()`
* `currenttime()`
* `currentdate()`
* `currenttimeuuid()`

Previously, they were incorrectly marked as pure, meaning
that the function is executed at "prepare" step.

For functions that possibly depend on timing and random seed,
this is clearly a bug. Cassandra doesn't have a notion of pure
functions, so they are lazily evaluated.

Make Scylla match Cassandra's behavior for these functions.

Add a unit-test for the fix (excluding `currentdate()` function,
because there is no way to use synthetic clock with query
processor and sleeping for a whole day to demonstrate correct
behavior is clearly not an option).
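
The difference between the two evaluation strategies can be sketched with a toy prepared statement. Everything here is hypothetical (the `Prepared` class, the counter used as a clock) and only models the behavior described above: a function marked pure is evaluated once at prepare time, a non-pure one on every execution.

```python
import itertools

_ticks = itertools.count(1)

def current_timestamp():
    """Toy clock: returns a new value on every call."""
    return next(_ticks)

class Prepared:
    def __init__(self, fn, pure):
        self._fn = fn
        self._pure = pure
        # A pure function is folded into a constant at prepare time.
        self._cached = fn() if pure else None

    def execute(self):
        return self._cached if self._pure else self._fn()

as_pure = Prepared(current_timestamp, pure=True)
as_non_pure = Prepared(current_timestamp, pure=False)

pure_results = (as_pure.execute(), as_pure.execute())
non_pure_results = (as_non_pure.execute(), as_non_pure.execute())
print(pure_results, non_pure_results)
```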

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-01 12:17:23 +03:00
Avi Kivity
18d9ad1d78 Merge "cql3: create_view_statement: fix wrong check for bound variables" from Pavel S
"
`create_view_statement::announce_migration()` has an incorrect
check to verify that no bound variables were supplied to a
select statement of a materialized view.

This used `prepare_context::empty()` static method,
which doesn't check the current instance for emptiness but
constructs a new empty instance instead.

The following bit of code actually checked that the pointer
to the new empty instance is not null:

    if (!_variables.empty()) {
        throw exceptions::invalid_request_exception(format("Cannot use query parameters in CREATE MATERIALIZED VIEW statements"));
    }

Use `get_variable_specifications().empty()` instead to fix the semantics of
the `if` statement.

This series also removes this `empty()` method, because it's
not used anymore. The corresponding non-default constructor
is also removed due to being unused.

Tests: unit(dev), cql-pytest:test_materialized_view.py(scylla dev, cassandra trunk)
"

Fixes #9117

* 'create_view_stmt_check_bound_vars_v3' of https://github.com/ManManson/scylla:
  test: add a test checking that bind markers within MVs SELECT statement don't lead to a crash
  cql3: prepare_context: remove unused methods
  cql3: create_view_statement: fix check for bound variables
  cql3: make `prepare_context::get_variable_specifications()` return const-ref for lvalue overload
2021-08-01 12:04:36 +03:00
Avi Kivity
3089558f8d tools: toolchain: update to Fedora 34 with clang 12 and libstdc++ 11.2 2021-07-31 15:25:13 +03:00
Pavel Solodovnikov
1ca7825cf6 test: add a test checking that bind markers within MVs SELECT statement don't lead to a crash
The request should fail with `InvalidRequest` exception and shouldn't
crash the database.

Don't check for actual error messages, because they are different
between Scylla and Cassandra.

The former has this message:

        Cannot use query parameters in CREATE MATERIALIZED VIEW statements

While the latter throws this:

        Bind variables are not allowed in CREATE MATERIALIZED VIEW statements

Tests: cql-pytest/test_materialized_view.py(scylla dev, cassandra trunk)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 17:57:24 +03:00
Pavel Solodovnikov
1694f5f66f cql3: prepare_context: remove unused methods
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 17:57:10 +03:00
Pavel Solodovnikov
9f0dc99627 cql3: create_view_statement: fix check for bound variables
The code for checking that an MV's select statement doesn't
have any bind markers uses the wrong method and always returns
`false` even when it should not.

`prepare_context::empty()` is a misleading name because
it doesn't check if the current instance is empty, but creates
an empty instance wrapped in a `lw_shared_ptr` instead.
Thus, the code in `create_view_statement::announce_migration()`
checks that the pointer is not empty, which is always false.

Use `get_variable_specifications().empty()` to check that the
specifications vector inside the `prepare_context`
instance is not empty.
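
The pitfall can be reproduced in isolation. The following is a minimal
sketch, not the actual Scylla code: `context`, `add()` and `specs()` are
hypothetical stand-ins for `prepare_context` and
`get_variable_specifications()`, and `std::shared_ptr` stands in for
`lw_shared_ptr`.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for prepare_context.
class context {
    std::vector<std::string> _specs;
public:
    // Misleadingly named factory: builds a NEW empty instance and
    // says nothing about whether *this is empty.
    static std::shared_ptr<context> empty() {
        return std::make_shared<context>();
    }
    void add(std::string name) { _specs.push_back(std::move(name)); }
    const std::vector<std::string>& specs() const { return _specs; }
};

// Buggy check: tests the freshly created pointer, which is never null,
// so the result is always false.
inline bool has_bind_markers_buggy(const context& c) {
    return !c.empty();
}

// Fixed check: inspects the specifications held by this instance.
inline bool has_bind_markers_fixed(const context& c) {
    return !c.specs().empty();
}
```

With one recorded bind marker, the buggy check still reports none, while
the fixed check reports that markers are present.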

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 17:55:43 +03:00
Pavel Solodovnikov
0edf975bf7 cql3: make prepare_context::get_variable_specifications() return const-ref for lvalue overload
There's no point in copying the `_specs` vector by value in such
case, just return a const reference. All existing uses create
a copy either way.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 17:54:43 +03:00
Piotr Sarna
1c7af8d46f cql-pytest: adjust a test case for Cassandra 4
One of the test cases stopped working against Cassandra 4, but that's
just because it returns a slightly different error type.
The test case is adjusted to work on both Scylla and new Cassandra.

Message-Id: <222a7f63a3e9739c6fc646173306fcdb3da25890.1627655555.git.sarna@scylladb.com>
2021-07-30 17:36:23 +03:00
Avi Kivity
0876248c2b Merge "cql3: cache function calls evaluation for non-deterministic functions" from Pavel S
"
`function_call` AST nodes are created for each function
with side effects in a CQL query, i.e. non-deterministic
functions (`uuid()`, `now()` and some other timeuuid-related ones).

These nodes are evaluated either when a query itself is executed
or query restrictions are computed (e.g. partition/clustering
key ranges for LWT requests).

We need to cache the calls since otherwise when handling a
`bounce_to_shard` request for an LWT query, we can possibly
enter an infinite bouncing loop (in case a function is used
to calculate partition key ranges for a query), since the
results can be different each time.

Furthermore, we don't support bouncing more than one time.
Returning `bounce_to_shard` message more than one time
will result in a crash.

Caching works only for LWT statements and only for the function
calls that affect partition key range computation for the query.

`variable_specifications` class is renamed to `prepare_context`
and generalized to record information about each `function_call`
AST node and modify them, as needed:
* Check whether a given function call is a part of partition key
  statement restriction.
* Assign ids for caching if above is true and the call is a part
  of an LWT statement.

Function calls are indexed by the order in which they appear
within a statement while parsing. There is no need to
include any kind of statement identifier in the cache key
since `query_options` (which holds the cache) is limited
to a single statement, anyway.
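
The indexing scheme can be sketched as follows. This is only an
illustration under assumptions: `call_cache`, `get_or_eval()` and
`fake_uuid()` are hypothetical names, and the real series keeps the cached
values in `query_options` rather than in a standalone class.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical per-statement cache. Keys are function-call ids assigned
// in parse order, so no statement identifier is needed in the key.
class call_cache {
    std::map<std::size_t, std::string> _values;
public:
    // Return the cached value for call `id`, evaluating `fn` only the
    // first time that id is seen.
    const std::string& get_or_eval(std::size_t id,
                                   const std::function<std::string()>& fn) {
        auto it = _values.find(id);
        if (it == _values.end()) {
            it = _values.emplace(id, fn()).first;
        }
        return it->second;
    }
};

// Toy non-deterministic function: yields a different value on every call.
inline std::string fake_uuid() {
    static int counter = 0;
    return "uuid-" + std::to_string(++counter);
}
```

Evaluating the same call id twice through the cache yields the same value,
which is the property needed when a statement is re-executed after a
`bounce_to_shard`.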

Note that `function_call::raw` AST nodes are not created
for selection clauses of a SELECT statement, hence they
can accept only one of the following things as parameters:
* Other function calls.
* Literal values.
* Parameter markers.

In other words, only parameters that can be immediately reduced
to a byte buffer are allowed and we don't need to handle
database inputs to non-pure functions separately since they
are not possible in this context. Anyhow, we don't even have
a single non-pure function that accepts arguments, so precautions
are not needed at the moment.

Add a test written in `cql-pytest` framework to verify
that both prepared and unprepared lwt statements handle
`bounce_to_shard` messages correctly in such scenario.

Fixes: #8604

Tests: unit(dev, debug)

NOTE: the patchset uses `query_options` as a container for
cached values. This doesn't look clean and `service::query_state`
seems to be a better place to store them. But it's not
forwarded to most of the CQL code and would mean that a huge number
of places would have to be amended.
The series presents a trade-off to avoid forwarding `query_state`
everywhere (but maybe it's the thing that needs to be done, nonetheless).
"

* 'lwt_bounce_to_shard_cached_fn_v6' of https://github.com/ManManson/scylla:
  cql-pytest: add a test for non-pure CQL functions
  cql3: cache function calls evaluation for non-deterministic functions
  cql3: rename `variable_specifications` to `prepare_context`
2021-07-30 14:21:11 +03:00
Pekka Enberg
21cfd090f7 Update tools/python3 submodule
* tools/python3 afe2e7f...279aae1 (1):
  > Drop filename start with '..' in pip modules
2021-07-30 13:58:45 +03:00
Avi Kivity
c3c82415c3 cql3: term: make term::raw, term::multi_column_raw forward declarable
As preparation for converting term::raw to an expression, make it
forward declarable so that we can have a term::raw that is an
expression, and an expression that is a term::raw, without driving
the compiler insane.

Closes #9101
2021-07-30 13:50:28 +03:00
Pavel Emelyanov
4f4b863e6a test.py: Always disable boost colored output
Tests' output is always redirected to a log file. Enabling colored
output makes it very hard to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210730083731.17813-1-xemul@scylladb.com>
2021-07-30 12:22:31 +03:00
Piotr Sarna
60072045db Merge 'cql3: replace cql3::selection::selectable::raw ...
hierarchy with expressions' from Avi Kivity

Currently, the grammar has two parallel hierarchies. One hierarchy is
used in the WHERE clause, and is based on a combination of `term`
and expressions. The other is used in the SELECT clause, and is
using the cql3::selection::selectable hierarchy. There is some overlap
between the hierarchies: both can name columns. Logically, however,
they overlap completely - in SQL anything you can select you can
filter on, and vice versa. So merging the two hierarchies is important if
we want to enrich CQL. This series does that, partially (see below),
converting the SELECT clause to expressions.

There is another hierarchy split: between the "raw", pre-prepare object
hierarchy, and post-prepare non-raw. This series limits itself to converting
the raw hierarchy and leaves the non-raw hierarchy alone.

An important design choice is not to have this raw/non-raw split in expressions.
Note that most of the hierarchy is completely parallel: addition is addition
both before prepare and after prepare (but see [1]). The main difference
is around identifiers - before preparation they are unresolved, and after
preparation they become `column_definition` objects. We resolve that by
having two separate types: `unresolved_identifier` for the pre-prepare phase,
and the existing `column_value` for post-prepare phase.
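
A rough sketch of that design choice, using `std::variant`. The struct
layouts, the `schema` alias and the `prepare()` function are hypothetical
simplifications for illustration; only the names `unresolved_identifier`,
`column_value` and `column_definition` come from the series.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <variant>

// Pre-prepare: just a name as typed by the user.
struct unresolved_identifier { std::string name; };
// Hypothetical stand-in for the real column_definition.
struct column_definition { std::string name; int id; };
// Post-prepare: refers to a resolved column.
struct column_value { const column_definition* col; };

// One variant serves both phases; there is no separate "raw" hierarchy.
using expression = std::variant<unresolved_identifier, column_value>;

using schema = std::map<std::string, column_definition>;

// Resolve identifiers against a schema; resolved nodes pass through.
inline expression prepare(const expression& e, const schema& s) {
    if (auto* id = std::get_if<unresolved_identifier>(&e)) {
        auto it = s.find(id->name);
        if (it == s.end()) {
            throw std::invalid_argument("unknown column: " + id->name);
        }
        return column_value{&it->second};
    }
    return e;
}
```

Everything other than identifiers would keep the same variant element
before and after preparation, which is why a parallel raw hierarchy adds
little.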

Alternative choices would be to keep a separate expression::raw variant, or
to template the expression variant on whether it is raw or not. I think it would
cause undue bloat and confusion.

Note the series introduces many on_internal_error() calls. This is because
there is not a lot of overlap in the hierarchies today; you can't have a cast in
the WHERE clause, for example. These on_internal_error() calls cannot be
triggered since the grammar does not yet allow such expressions to be
expressed. As we expand the grammar, they will have to be replaced with
working implementations.

Lastly, field selection is expressible in both hierarchies. This series does not yet
merge the two representations (`column_value.sub` vs `field_selection`), but it
should be easy to do so later.

[1] the `+` operator can also be translated to list concatenation, which we may
  choose to represent by yet another type.

Test: unit(dev)

Closes #9087

* github.com:scylladb/scylla:
  cql3: expression: update find_atom, count_if for function_call, cast, field_selection
  cql3: expressions: fix printing of nested expressions
  cql3: selection: replace selectable::raw with expression
  cql3: expression: convert selectable::with_field_selection::raw to expression
  cql3: expression: convert selectable::with_cast::raw to expression
  cql3: expression: convert selectable::with_anonymous_function::raw to expression
  cql3: expression: convert selectable::with_function_call::raw to expressions
  cql3: selectable: make selectable::raw forward-declarable
  cql3: expressions: convert writetime_or_ttl::raw to expression
  cql3: expression: add convenience constructor from expression element to nested expression
  utils: introduce variant_element.hh
  cql3: expression: use nested_expression in binary_operator
  cql3: expression: introduce nested_expression class
  Convert column_identifier_raw's use as selectable to expressions
  make column_identifier::raw forward declarable
  cql3: introduce selectable::with_expression::raw
2021-07-30 09:57:39 +02:00
Pavel Solodovnikov
eaf70df203 cql-pytest: add a test for non-pure CQL functions
Introduce a test using `cql-pytest` framework to assert that
both prepared and unprepared LWT statements (insert with
`IF NOT EXISTS`) with a non-deterministic function call
work correctly in case its evaluation affects partition
key range computation (hence the choice of `cas_shard()`
for lwt query).

Tests: cql-pytest/test_non_deterministic_functions.py

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 01:22:50 +03:00
Pavel Solodovnikov
3b6adf3a62 cql3: cache function calls evaluation for non-deterministic functions
And reuse these values when handling `bounce_to_shard` messages.

Otherwise such a function (e.g. `uuid()`) can yield a different
value when a statement is re-executed on the other shard.

It can lead to an infinite number of `bounce_to_shard` messages
being sent in case the function value is used to calculate partition
key ranges for the query. That, in turn, causes a crash, since we
don't support bouncing more than one time.

Caching works only for LWT statements and only for the function
calls that affect partition key range computation for the query.

`variable_specifications` class is renamed to `prepare_context`
and generalized to record information about each `function_call`
AST node and modify them, as needed:
* Check whether a given function call is a part of partition key
  statement restriction.
* Assign ids for caching if above is true and the call is a part
  of an LWT statement.

There is no need to include any kind of statement identifier
in the cache key since `query_options` (which holds the cache)
is limited to a single statement, anyway.

Note that `function_call::raw` AST nodes are not created
for selection clauses of a SELECT statement, hence they
can accept only one of the following things as parameters:
* Other function calls.
* Literal values.
* Parameter markers.

In other words, only parameters that can be immediately reduced
to a byte buffer are allowed and we don't need to handle
database inputs to non-pure functions separately since they
are not possible in this context. Anyhow, we don't even have
a single non-pure function that accepts arguments, so precautions
are not needed at the moment.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-30 01:22:39 +03:00
Tomasz Grabiec
7c28f77412 Merge 'Convert all remaining int tri-compares to std::strong_ordering' from Avi Kivity
Convert all known tri-compares that return an int to return std::strong_ordering.
Returning an int is dangerous since the caller can treat it as a bool, and indeed
this series uncovered a minor bug (#9103).

Test: unit (dev)

Fixes #1449

Closes #9106

* github.com:scylladb/scylla:
  treewide: remove redundant "x <=> 0" compares
  test: mutation_test: convert internal tri-compare to std::strong_ordering
  utils: int_range: change to std::strong_ordering
  test: change some internal comparators to std::strong_ordering
  utils: big_decimal: change to std::strong_ordering
  utils: fragment_range: change to std::strong_ordering
  atomic_cell: change compare_atomic_cell_for_merge() to std::strong_ordering
  types: drop scaffolding erected around lexicographical_tri_compare
  sstables: keys: change to std::strong_ordering internally
  bytes: compare_unsigned(): change to std::strong_ordering
  uuid: change comparators to std::strong_ordering
  types: convert abstract_type::compare and related to std::strong_ordering
  types: reduce boilerplate when comparing empty value
  serialized_tri_compare: change to std::strong_ordering
  compound_compat: change to std::strong_ordering
  types: change lexicographical_tri_compare, prefix_equality_tri_compare to std::strong_ordering
2021-07-29 21:43:54 +02:00
Takuya ASADA
3ecdd15777 dist/debian: keep sysconfdir.conf for scylla-housekeeping on 'remove'
Same as 4309785: dpkg does not re-install conffiles once they are removed
by the user, so sysconfdir.conf for scylla-housekeeping is missing on
rollback. To prevent this, we need to stop removing the drop-in file
directory on 'remove'.

Fixes #9109

Closes #9110
2021-07-29 12:32:21 +03:00
Avi Kivity
e44d3cc0ea Merge "Remove global storage service instance" from Pavel E
"
There are a few places that call the global storage service, but all
are easily fixable without significant changes.

1. alternator -- needs token metadata, switch to using proxy
2. api -- calls methods from storage service, all handlers are
   registered in main and can capture storage service from there
3. thrift -- calls methods from storage service, can carry the
   reference via controller
4. view -- needs tokens, switch to using (global) proxy
5. storage_service -- (surprisingly) can use "this"

tests: unit(dev), dtest(simple_boot_shutdown, dev)
"

* 'br-unglobal-storage-service' of https://github.com/xemul/scylla:
  storage_service: Make it local
  storage_service: Remove (de)?init_storage_service()
  storage_service: Use container() in run_with(out)_api_lock
  storage_service: Unmark update_topology static
  storage_service: Capture this when appropriate
  view: Use proxy to get token metadata from
  thrift: Use local storage service in handlers
  thrift: Carry sharded<storage_service>& down to handler
  api: Capture and use sharded<storage_service>& in handlers
  api: Carry sharded<storage_service>& down to some handlers
  alternator: Take token metadata from server's storage_proxy
  alternator: Keep storage_proxy on server
2021-07-29 11:47:16 +03:00
Avi Kivity
8d2255d82c Merge "Parallelize multishard_combining_reader_as_mutation_source test" from Pavel E
"
This is the 3rd slowest test in the set. There are 3 cases out
there that are hard-coded to be sequential. However, splitting
them into boost test cases helps running this test faster in
--parallel-cases mode. Timings for debug mode:

         Total before the patch: 25 min
     Sequential after the patch: 25 min
                     Basic case:  5 min
      Evict-paused-readers case:  5 min
    Single-mutation-buffer case: 15 min

tests: unit.multishard_combining_reader_as_mutation_source(debug)
"

* 'br-parallel-mcr-test' of https://github.com/xemul/scylla:
  test: Split test_multishard_combining_reader_as_mutation_source into 3
  test: Fix indentation after previous patch
  test: Move out internals of test_multishard_combining_reader_as_mutation_source
2021-07-29 11:39:02 +03:00
Raphael S. Carvalho
c399601833 table: kill move_sstables_from_staging()
not used anywhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210728175403.86867-1-raphaelsc@scylladb.com>
2021-07-29 10:42:36 +03:00
Raphael S. Carvalho
eb16268768 table: Guarantee serialization of every sstable set updates
Continuing the work from e4eb7df1a1, let's guarantee
serialization of sstable set updates by making all sites acquire
the mutation permit. The table then no longer relies on the serialization
mechanism of the row cache's update functions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210728174740.78826-1-raphaelsc@scylladb.com>
2021-07-29 10:42:18 +03:00
Asias He
0b2359b45b repair: Do not log errors in repair_ranges
The exception will be logged by the caller of repair_ranges. Do not log
it here to reduce redundancy.
2021-07-29 15:05:16 +08:00
Asias He
e3c4f2d54f repair: Move more repair single range code into repair_info::repair_range
The benefit is that inside repair_info::repair_range we have the peer
node information. It is useful to log peer nodes.

In addition, we can avoid logging similar messages twice in case the
range is skipped. For example:

INFO  2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513
ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1,
keyspace=keyspace1, table={standard1}, range=(5380136883876426790,
5406788998747631705]

WARN  2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513
ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1,
keyspace=keyspace1, table={standard1}, range=(5380136883876426790,
5406788998747631705], peers={127.0.0.2}, live_peers={}, status=skipped
2021-07-29 15:05:16 +08:00
Asias He
c8e5572cf0 repair: Use the same uuid from the repair_info
It is a regression introduced by d92d404629 (repair: Turn
repair_range a repair_info method).
2021-07-29 15:05:16 +08:00
Asias He
c72cc3eb9d repair: Drop sub_ranges_nr counter
It was used to count the number of sub ranges divided by partition level
repair. We do not use it anymore in row level repair.
2021-07-29 15:05:16 +08:00
Pavel Emelyanov
f9132b582b storage_service: Make it local
There are 3 places that can now declare local instance:

- main
- cql_test_env
- boost gossiper test

The global pointer is saved in debug namespace for debugging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
055025eaa9 storage_service: Remove (de)?init_storage_service()
One of them just re-wraps arguments in std::ref and calls for
global storage service. The other one is dead code which also
calls the global s._s. Remove both and fix the only caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
2ffbe894b9 storage_service: Use container() in run_with(out)_api_lock
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
cd44a808be storage_service: Unmark update_topology static
And use container() to reshard to shard 0. This removes one
more call for global storage service instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
39db19191f storage_service: Capture this when appropriate
Some storage_service methods call for global storage service instance
while they can enjoy "this" pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
689a4c1e54 view: Use proxy to get token metadata from
The mutate_MV() call needs token metadata and it gets them from
global storage service. Fixing it not to use globals is a huge
refactoring, so for now just get the tokens from global storage
proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
5a13031ce8 thrift: Use local storage service in handlers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
f2992f4e32 thrift: Carry sharded<storage_service>& down to handler
The thrift_handler class' methods need storage service. This
patch makes sure this class has sharded storage service
reference on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
df285fca7a api: Capture and use sharded<storage_service>& in handlers
The reference in question is already there, handlers that need
storage service can capture it and use. These handlers are not
yet stopped, but neither is the storage service itself, so the
potentially dangling reference is not being set up here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
2e50ba7079 api: Carry sharded<storage_service>& down to some handlers
Both set_server_storage_service and set_server_storage_proxy set up
API handlers that need storage service to work. Now they all call for
global storage service instance, but it's better if they receive one
from main. This patch carries the sharded storage service reference
down to handlers setting function, next patch will make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
a965a742fc alternator: Take token metadata from server's storage_proxy
There's a local_nodelist_handler serving API requests that calls
for global storage service to get token metadata from. Now it
can get storage proxy reference from server upon construction
and use it for tokens.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Pavel Emelyanov
ba10e96c75 alternator: Keep storage_proxy on server
It's already available on controller and will be needed by
API handlers in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Gleb Natapov
4764028cb3 raft: Remove leader_id from append_request
The field is not used anywhere.

Message-Id: <YP0khmjK2JSp77AG@scylladb.com>
2021-07-28 20:30:07 +02:00
Avi Kivity
42e1f318d7 Merge "Respect "bypass cache" in sstable index caching" from Tomasz
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"

* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
  sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
  sstables: Do not populate partition index cache for "bypass cache" reads
2021-07-28 18:45:39 +03:00
Avi Kivity
331eb57e17 Revert "compression: define 'class' attribute for compression and deprecate 'sstable_compression'"
This reverts commit 5571ef0d6d. It causes
rolling upgrade failures.

Fixes #9055.

Reopens #8948.
2021-07-28 14:14:22 +03:00
Pekka Enberg
ef5b2934e8 scylla_util: Use AzureSnitch on Azure
Fixes #8593
2021-07-28 14:07:42 +03:00
Pekka Enberg
42e32566f6 production_snitch_base: Fallback for empty DC or rack strings
Lubos Kosco points out that on Microsoft Azure, for example, it is
possible for the "zone metadata" (which we use as rack information) to
be empty, as shown in:

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=windows#instance-metadata

Therefore, protect against empty DC or rack strings in
`production_snitch_base` to keep the behavior consistent across
different snitches.
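
A minimal sketch of such a guard. The placeholder strings and the
`value_or()` helper are hypothetical; the defaults actually used by
`production_snitch_base` are not stated in this message.

```cpp
#include <string>

// Hypothetical placeholder values, chosen only for illustration.
inline const std::string fallback_dc = "UNKNOWN_DC";
inline const std::string fallback_rack = "UNKNOWN_RACK";

// Return `value` unless it is empty, in which case return `fallback`.
inline std::string value_or(const std::string& value,
                            const std::string& fallback) {
    return value.empty() ? fallback : value;
}
```

An empty zone string would then map to the rack placeholder, while a
non-empty one is passed through unchanged.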
2021-07-28 14:07:42 +03:00
Pekka Enberg
e44fa8d806 azure_snitch: Azure snitch support
This adds support for Azure snitch. The work is an adaptation of
AzureSnitch for Apache Cassandra by Yoshua Wakeham:

https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java

As per Lubos' suggestion, we switched to a later API version.
2021-07-28 14:07:42 +03:00
Avi Kivity
0909e3c17d treewide: remove redundant "x <=> 0" compares
If x is of type std::strong_ordering, then "x <=> 0" is equivalent to
x. These no-ops were inserted during #1449 fixes, but are now unnecessary.
They have potential for harm, since they can hide an accidental change of the
type of x to an arithmetic type, so remove them.

Ref #1449.
2021-07-28 13:30:32 +03:00
Avi Kivity
70f481a1f0 test: mutation_test: convert internal tri-compare to std::strong_ordering
Drop the temporary merge_container() overload we had to support
tri-compares that returned int.
2021-07-28 13:30:07 +03:00
Avi Kivity
14fd886c72 utils: int_range: change to std::strong_ordering
Ref #1449.
2021-07-28 13:29:50 +03:00
Avi Kivity
11fa402ecc test: change some internal comparators to std::strong_ordering
Ref #1449.
2021-07-28 13:28:51 +03:00
Avi Kivity
89bd7737f3 utils: big_decimal: change to std::strong_ordering
Ref #1449.
2021-07-28 13:28:21 +03:00
Avi Kivity
59941c536c utils: fragment_range: change to std::strong_ordering
Ref #1449.
2021-07-28 13:27:49 +03:00
Avi Kivity
a180cd240f atomic_cell: change compare_atomic_cell_for_merge() to std::strong_ordering
The implementation is in database.cc for some reason.

Ref #1449.
2021-07-28 13:26:27 +03:00
Avi Kivity
b866c12bc5 types: drop scaffolding erected around lexicographical_tri_compare
With no more users, the int-returning variant can be dropped.

Ref #1449.
2021-07-28 13:25:19 +03:00
Avi Kivity
9a2f3ac288 sstables: keys: change to std::strong_ordering internally
The signature already returned std::strong_ordering, but an
internal comparator returned int. Switch it, so it now uses
the strong_ordering overload of lexicographical_tri_compare().

Ref #1449.
2021-07-28 13:23:13 +03:00
Avi Kivity
1b64b1a628 bytes: compare_unsigned(): change to std::strong_ordering
Note that the previous implementation was broken for blobs
larger than 4GB. Luckily that can't happen.

Ref #1449.
2021-07-28 13:21:01 +03:00
Avi Kivity
7729ff03ad uuid: change comparators to std::strong_ordering
Ref #1449.
2021-07-28 13:20:32 +03:00
Avi Kivity
e52ebe2da5 types: convert abstract_type::compare and related to std::strong_ordering
Change comparators around types to std::strong_ordering.

Ref #1449.
2021-07-28 13:19:24 +03:00
Avi Kivity
b7160b74ea types: reduce boilerplate when comparing empty value
Some types have boilerplate code to check if one or both values
are empty. Consolidate it in a helper to reduce noise.
2021-07-28 13:19:09 +03:00
Avi Kivity
d86e529239 serialized_tri_compare: change to std::strong_ordering
Also convert a user in mutation_test.

Ref #1449.
2021-07-28 13:19:00 +03:00
Avi Kivity
3653518d9e compound_compat: change to std::strong_ordering
Ref #1449.
2021-07-28 13:16:05 +03:00
Avi Kivity
1bbabb5ccc types: change lexicographical_tri_compare, prefix_equality_tri_compare to std::strong_ordering
The original signatures with `int` are retained (by calling the new
signatures), until the callers are converted. Constraints are used to
disambiguate.

Ref #1449.
2021-07-28 13:14:46 +03:00
Avi Kivity
12f9a5462d Merge 'repair: Drop unused partition level repair related code' from Asias He
This series removes unused partition level repair related code.

Closes #9105

* github.com:scylladb/scylla:
  repair: Drop stream plan related code
  locator: Add missing file.hh include in production_snitch_base
  repair: Drop request_transfer_ranges and do_streaming
  repair: Drop parallelism_semaphore
2021-07-28 11:24:30 +03:00
Benny Halevy
67d5addc09 test: mutation_reader_test: clustering_order_merger_test_generator: use explicit type for num_ranges
gcc 10.3.1 spews the following error:
```
_test_generator::generate_scenario(std::mt19937&) const’:
test/boost/mutation_reader_test.cc:3731:28: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
 3731 |         for (auto i = 0; i < num_ranges; ++i) {
      |                          ~~^~~~~~~~~~~~
```
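
One general way to avoid this class of warning is to give both operands of
the comparison the same type. The sketch below gives the loop index the
type of the bound; the patch itself gives `num_ranges` an explicit type,
and its exact form is not shown here. `count_iterations()` is a
hypothetical function.

```cpp
#include <cstddef>
#include <vector>

// `auto i = 0` deduces int, so comparing it with a size_t bound
// triggers -Wsign-compare. Using size_t for the index avoids that.
inline std::size_t count_iterations(const std::vector<int>& ranges) {
    const std::size_t num_ranges = ranges.size();
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < num_ranges; ++i) {
        ++iterations;
    }
    return iterations;
}
```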

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210728073538.2467040-1-bhalevy@scylladb.com>
2021-07-28 11:22:59 +03:00
Asias He
91bcfba3f7 repair: Drop stream plan related code
We do not use stream plan to sync data in repair anymore.
2021-07-28 11:23:42 +08:00
Asias He
860109daca locator: Add missing file.hh include in production_snitch_base
```
clang++

build/dev/locator/production_snitch_base.o
locator/production_snitch_base.cc
In file included from locator/production_snitch_base.cc:41:
In file included from ./locator/production_snitch_base.hh:41:
In file included from
/usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/unordered_map:38:
/usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/type_traits:1329:23:
error: incomplete type 'seastar::file' used in type trait expression
                    __bool_constant<__has_trivial_destructor(_Tp)>>
		                                        ^
```

This code compiles now due to indirect include from repair.hh.
2021-07-28 11:22:37 +08:00
Asias He
f274ed6512 repair: Drop request_transfer_ranges and do_streaming
They are used by partition level repair. With row level repair, we do
not need them anymore.
2021-07-28 10:54:18 +08:00
Asias He
f1c08f121a repair: Drop parallelism_semaphore
It is used by partition level repair. With row level repair, we do not
need it anymore.
2021-07-28 10:54:15 +08:00
Avi Kivity
77a2b4b520 test: perf: perf_simple_query: add instructions_per_op to the json-result output
It's in text output, but 863b49af03 forgot to add it to the machine
readable results.

Closes #9017
2021-07-27 20:26:19 +02:00
Calle Wilund
59555fa363 cdc: fix broken function signature in maybe_back_insert_iterator
Fixes #9103

The compare overload was declared as returning "bool" even though it is a
tri-compare. As a result the speed-up shortcut (narrowing the search set)
was never used, which in turn meant more overhead for collections.

Closes #9104
2021-07-27 20:37:30 +03:00
Avi Kivity
0a4884e87a Merge "Expose immutable rows and tombstones collections" from Pavel E
"
While working on evicting range tombstones one of the nastiest
difficulties is that mutation_partition has very loose control
over adding and removing of rows and range tombstones. This is
because it exposes both collections via public methods, so it's
pretty easy to grab a non-const reference on either of it and
modify the collection.

At the same time restricting the API with returning only const
reference on the collection is not possible either, since finding
in or iterating over a const-referenced collection would expose
the const-reference element as well, while it can be perfectly
valid to modify the single row/tombstone without touching the
whole collection.

In other words there's the need for an access method that both
guarantees that no new elements are added to the collection,
nor existing ones are removed from it, AND doesn't impose const
on the obtained elements.

The solution proposed here is the immutable_collection<> template
that wraps a non-const collection reference and gives the caller
only reading methods (find, lower_bound, begin, etc) so that it's
guaranteed that the external user of mutation_partition won't be
able to modify the collections. Those that already use the const
reference on the mutation_partition itself are OK, they also use
const-referenced everything. The places that do need to modify
the partition's collections are thus made explicit.
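
The idea can be sketched roughly as follows. This is a simplified,
hypothetical version: the real `immutable_collection<>` forwards more
methods (`find`, `lower_bound`, etc.), and `row` and `double_rows()` exist
only for this example.

```cpp
#include <list>

// Wrap a non-const reference to a collection and forward only reading
// methods. Elements stay mutable, but nothing can be inserted or erased
// through the wrapper.
template <typename Collection>
class immutable_collection {
    Collection& _col;
public:
    using iterator = typename Collection::iterator;
    explicit immutable_collection(Collection& col) noexcept : _col(col) {}
    iterator begin() const { return _col.begin(); }
    iterator end() const { return _col.end(); }
    bool empty() const { return _col.empty(); }
    auto size() const { return _col.size(); }
};

struct row { int value; };

// Doubles every row in place; it has no way to add or remove rows.
inline void double_rows(immutable_collection<std::list<row>> rows) {
    for (row& r : rows) {
        r.value *= 2;
    }
}
```

A caller holding only the wrapper can update individual rows, while code
that needs to change the collection's membership has to ask for mutable
access explicitly.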

tests: unit(dev)
"

* 'br-mutation-partition-collection-view-3' of https://github.com/xemul/scylla:
  mutation_partition: Return immutable collection for range tombstones
  mutation_partition: Pin mutable access to range tombstones
  mutation_partition: Return immutable collection for rows
  mutation_partition: Pin mutable access to rows
  utils: Introduce immutable_collection<>
  btree: Generalize some iterator methods
  btree: Make iterators not modify the tree itself
  btree tests: Dont use iterator erase
  mutation_partition: Shuffle declarations
  range_tombstone_list: Mark more methods noexcept
  range_tombstone_list, code: Mark external_memory_usage noexcept
2021-07-27 20:34:04 +03:00
Avi Kivity
a38b1006d1 cql3: expression: update find_atom, count_if for function_call, cast, field_selection
The combination of the new types and these functions cannot happen yet,
but as they are generic functions it is better to implement them in
case it becomes possible later.
2021-07-27 20:16:43 +03:00
Avi Kivity
2b7b9bb469 cql3: expressions: fix printing of nested expressions
Now that we eliminated cql3::selectable::raw, we can print
nested expressions.
2021-07-27 20:16:29 +03:00
Avi Kivity
98c4f0dfb3 cql3: selection: replace selectable::raw with expression
Now that all selectable::raw subclasses have been converted to
cql3::selectable::with_expression::raw, the class structure is
just a wrapper around expressions. Peel it, converting the
virtual member functions to free functions, and replacing
object instances with expression or nested_expression as the
case allows.
2021-07-27 20:16:15 +03:00
Avi Kivity
979010a1e5 cql3: expression: convert selectable::with_field_selection::raw to expression
Add a field_selection variant element to expression. Like function_call
and cast, the structure from which a field is selected cannot yet be
an expression, since not all selectable::raw:s are converted. This will
be done in a later pass. This is also why printing a field selection now
does not print the selected expression; this will also be corrected later.
2021-07-27 20:16:12 +03:00
Avi Kivity
714b812212 cql3: expression: convert selectable::with_cast::raw to expression
Add a cast variant element to expression. Like function_call, the
argument being converted cannot yet be an expression, since not
all selectable::raw:s are converted. This will be done in a later
pass.  This is also why printing a cast now does not print the
casted expression; this will also be corrected later.
2021-07-27 20:14:52 +03:00
Avi Kivity
5adae5837e cql3: expression: convert selectable::with_anonymous_function::raw to expression
Rather than creating a new variant element in expression, we extend
function_call to handle both named and anonymous functions, since
most of the processing is the same.
2021-07-27 20:13:55 +03:00
Avi Kivity
3e392d2513 cql3: expression: convert selectable::with_function_call::raw to expressions
Add a function_call variant element to hold function calls. Note
that because not all selectables are yet converted, function call
arguments are still of type selectable::raw. They will be converted
to expressions later. This is also why printing a function now
does not print its arguments; this will also be corrected later.
2021-07-27 20:13:51 +03:00
Avi Kivity
a56787d95e cql3: selectable: make selectable::raw forward-declarable
As temporary scaffolding while we're converting selectable::raw
subclasses to expressions, we'll need expressions to refer to
selectable::raw (specifically, function call arguments, which will
end up as expressions as well). To avoid a #include loop, make
selectable::raw forward-declarable by moving it to namespace scope.
2021-07-27 20:10:54 +03:00
Avi Kivity
ff65c54316 cql3: expressions: convert writetime_or_ttl::raw to expression
Create a new element in the expression variant, column_mutation_attribute,
signifying we're picking up an attribute of a column mutation (not a
column value!). We use an enum rather than a bool to choose between
writetime and ttl (the two mutation attributes) for increased
explicitness.

Although there can only be one type for the column we're operating
on (it must be an unresolved_identifer), we use a nested_expression.
This is because we'll later need to also support a column_value
as the column type after we prepare it. This is somewhat similar
to the address of operator in C, which syntactically takes any
expression but semantically operates only on lvalues.
2021-07-27 20:10:52 +03:00
Avi Kivity
294f0f35b1 cql3: expression: add convenience constructor from expression element to nested expression
It is convenient to initialize a nested_expression variable from
one of the types that compose the expression variant, but C++ doesn't
allow it. Add a constructor that does this. Use the new variant_element
concept to constrain the input to be one of the variant's elements.
2021-07-27 20:08:48 +03:00
Avi Kivity
636b133cbc utils: introduce variant_element.hh
A type trait (is_variant_element) and a concept (VariantElement)
that tell if a type T is a member of a variant or not. It can be
used even if the variant's elements are not yet defined (just
forward-declared).
2021-07-27 20:08:47 +03:00
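Illustrative note (not part of the commit): a minimal sketch of how a trait like the one described in 636b133cbc could be written. The names below are assumptions for illustration; the actual contents of utils/variant_element.hh may differ. The point shown is that only the variant's type list is inspected, so forward-declared alternatives are enough.

```cpp
#include <type_traits>
#include <variant>

// Hypothetical re-creation of the idea; not copied from the real header.
template <typename T, typename Variant>
struct is_variant_element : std::false_type {};

// Matching on std::variant<Elements...> only names the specialization and
// never instantiates it, so the alternatives may be incomplete types.
template <typename T, typename... Elements>
struct is_variant_element<T, std::variant<Elements...>>
    : std::disjunction<std::is_same<T, Elements>...> {};

#if defined(__cpp_concepts)
// Concept form, available when the compiler supports C++20 concepts.
template <typename T, typename Variant>
concept VariantElement = is_variant_element<T, Variant>::value;
#endif

// Forward declarations only: neither type is defined in this sketch.
struct column_value;
struct function_call;
using element_types = std::variant<column_value, function_call>;

static_assert(is_variant_element<column_value, element_types>::value,
              "column_value is one of the alternatives");
static_assert(!is_variant_element<int, element_types>::value,
              "int is not one of the alternatives");
```

A constructor template constrained this way accepts exactly the alternative types and nothing else, which is how the next commit uses it.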
Avi Kivity
ac3b093e3c cql3: expression: use nested_expression in binary_operator
binary_operator::lhs is implementing the pattern in nested_expression.
Use nested_expression instead to reduce code size.
2021-07-27 20:08:34 +03:00
Avi Kivity
b07a0867b3 cql3: expression: introduce nested_expression class
The expression type cannot be a member of a struct that is an
element of the expression variant. This is because it would then
be required to contain itself. So introduce a nested_expression
type to indirectly hold an expression, but keep the value semantics
we expect from expressions: it is copyable and a copy has separate
identity and storage.

In fact binary_operator had to resort to this trick, so it's converted
to nested_expression in the next patch.
2021-07-27 20:08:21 +03:00
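Illustrative note (not part of the commit): the pattern described in b07a0867b3 can be sketched as below. This is a simplified miniature under assumed names (`constant`, `negation` and the single `v` member are hypothetical), not the real cql3 code. It shows an indirect holder that still copies deeply, so a copy has its own storage.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>

struct expression;  // still incomplete at this point

// Holds an expression on the heap but behaves like a value: copying a
// nested_expression copies the expression it owns. A moved-from object is
// empty and must not be copied or dereferenced in this sketch.
class nested_expression {
    std::unique_ptr<expression> _e;
public:
    nested_expression(expression e);
    nested_expression(const nested_expression& o);
    nested_expression(nested_expression&& o) noexcept;
    nested_expression& operator=(nested_expression o) noexcept;
    ~nested_expression();
    expression& operator*() noexcept;
    const expression& operator*() const noexcept;
};

struct constant { int value; };
struct negation { nested_expression arg; };  // `expression arg` would not compile

struct expression {
    std::variant<constant, negation> v;
};

// Member definitions come after expression is complete.
nested_expression::nested_expression(expression e)
    : _e(std::make_unique<expression>(std::move(e))) {}
nested_expression::nested_expression(const nested_expression& o)
    : _e(std::make_unique<expression>(*o._e)) {}
nested_expression::nested_expression(nested_expression&&) noexcept = default;
nested_expression& nested_expression::operator=(nested_expression o) noexcept {
    _e = std::move(o._e);
    return *this;
}
nested_expression::~nested_expression() = default;
expression& nested_expression::operator*() noexcept { return *_e; }
const expression& nested_expression::operator*() const noexcept { return *_e; }

// Helper for the example: reaches the constant nested inside a negation.
int& nested_constant(expression& e) {
    return std::get<constant>((*std::get<negation>(e.v).arg).v).value;
}

void example_nested_expression() {
    expression outer{negation{nested_expression(expression{constant{5}})}};
    expression copy = outer;        // deep copy through nested_expression
    nested_constant(copy) = 7;      // changing the copy...
    assert(nested_constant(outer) == 5);  // ...leaves the original alone
}
```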
Avi Kivity
8a518e9c78 Convert column_identifier_raw's use as selectable to expressions
Introduce unresolved_identifer as an unprepared counterpart to column_value.
column_identifier_raw no longer inherits from selectable::raw, but
keeps its methods for now to reduce churn.
2021-07-27 20:08:15 +03:00
Pavel Emelyanov
b3c89787be mutation_partition: Return immutable collection for range tombstones
Patch the .row_tombstones() to return the range_tombstone_list
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
1bf643d4fd mutation_partition: Pin mutable access to range tombstones
Some callers of mutation_partition::row_tombstones() don't want
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
05b8cdfd24 mutation_partition: Return immutable collection for rows
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
ad27bf40e6 mutation_partition: Pin mutable access to rows
Some callers of mutation_partition::clustered_rows() don't want
(and shouldn't) modify the tree of rows, while they may want to
modify the rows themselves. This patch explicitly locates those
that need to modify the collection, because the next patch will
return immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
c2a36f5668 utils: Introduce immutable_collection<>
Working with collections can be done via const- and non-const
references. In the former case the collection can only be read
from (find, iterate, etc); in the latter it's possible to alter
the collection (erase elements from or insert them into). Also
the const-ness of the collection reference is transparently
inherited by the returned _elements_ of the collection, so when
having a const reference on a collection it's impossible to
modify the found element.

This patch introduces an immutable_collection -- a wrapper over
an arbitrary collection that makes sure the collection itself is not
modified, but the elements obtained from it can be non-const.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
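Illustrative note (not part of the commit): a minimal sketch of the wrapper idea from c2a36f5668, using std::map as a stand-in collection. The member list and the `bump` helper are assumptions for illustration; the real utils class may expose a different set of methods. The sketch forwards only reading operations, on the non-const collection, so elements stay modifiable while the set of elements cannot change.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified stand-in for the wrapper described above.
template <typename Collection>
class immutable_collection {
    Collection* _col;
public:
    explicit immutable_collection(Collection& col) noexcept : _col(&col) {}

    // Calls are forwarded to the non-const collection, so the iterators
    // handed out are non-const and the elements can be modified.
    auto begin() const { return _col->begin(); }
    auto end() const { return _col->end(); }
    template <typename Key>
    auto find(const Key& k) const { return _col->find(k); }
    auto size() const { return _col->size(); }
    bool empty() const { return _col->empty(); }

    // Deliberately no insert(), erase() or clear().
};

using row_map = std::map<int, int>;

// Increments the value stored under `key`; returns false if it is absent.
bool bump(immutable_collection<row_map> rows, int key) {
    auto it = rows.find(key);
    if (it == rows.end()) {
        return false;
    }
    ++it->second;  // modifying a single element is allowed
    return true;
}

void example_immutable_collection() {
    row_map rows{{1, 10}, {2, 20}};
    immutable_collection<row_map> view(rows);
    assert(bump(view, 1));
    assert(rows.at(1) == 11);
    assert(view.size() == 2);  // element count is unchanged
}
```

Code that genuinely needs to add or remove elements would have to go through a separate, explicitly named accessor, which is what the "Pin mutable access" patches prepare.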
Pavel Emelyanov
d1c693473a btree: Generalize some iterator methods
The non-const iterator has constructor from key pointer and
the tree_if_singular method. There's no reason why these
two are absent in the const_iterator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
6ef27c9fa1 btree: Make iterators not modify the tree itself
The const_iterator cannot modify anything, but the plain
iterator has public methods to remove the key from the tree.
To control how the tree is modified this method must be
marked private and modification by iterator should come
from somewhere else.

This somewhere else is the existing key_grabber that's
already used to move keys between trees. Generalize this
ability to move a key out of a tree (i.e. -- erase).

Once done -- mark the iterator::erase_and_dispose private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
e652b03b4e btree tests: Dont use iterator erase
Next patches will mark btree::iterator methods that modify
the tree itself as private, so stop using them in tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
a9b4fa9db3 mutation_partition: Shuffle declarations
Its methods that provide access to enclosed collections of rows
and range tombstones are intermixed, so group them for smoother
next patching and mark noexcept while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
9122f4129d range_tombstone_list: Mark more methods noexcept
Those returning iterators and size for the underlying
collection of range tombstones are all non-throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
0f53e83a8e range_tombstone_list, code: Mark external_memory_usage noexcept
The range_tombstone_list's method is at the top of the
stack of calls each not throwing anything, so do the
deep-dive noexcept marking.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Avi Kivity
d3e8c05bed make column_identifier::raw forward declarable
Otherwise we run into a #include loop when trying to have an expression
with column_identifier::raw: expression.hh -> column_identifier.hh
-> selectable.hh -> expression.hh.
2021-07-27 20:00:48 +03:00
Avi Kivity
0e30a78573 cql3: introduce selectable::with_expression::raw
Prepare to migrate selectable::raw sub-classes to expressions by
creating a bridge between the two types. with_expression::raw
is a selectable::raw and implements all its methods (right now,
trivially), and its contents is an expression. The methods are
implemented using the usual visitor pattern.
2021-07-27 20:00:48 +03:00
Benny Halevy
3a4e4f9914 compaction: to_string: handle invalid values as internal error
Although the switch in `to_string(compaction_options::scrub::mode)`
covers all possible cases, gcc 10.3.1 warns about:
```
    sstables/compaction.cc: In function ‘std::string_view sstables::to_string(sstables::compaction_options::scrub::mode)’:
    sstables/compaction.cc:95:1: error: control reaches end of non-void function [-Werror=return-type]
```

Adding __builtin_unreachable(), as in `to_string(compaction_type)`
does calm the compiler down, but it might cause undefined behavior
in the future in case the switch won't cover all cases, or
the passed value is corrupt somehow.

Instead, call on_internal_error_noexcept to report the
error and abort if configured to do so, otherwise,
just return an "(invalid)" string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210727130251.2283068-1-bhalevy@scylladb.com>
2021-07-27 16:04:09 +03:00
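Illustrative note (not part of the commit): the approach described in 3a4e4f9914 can be sketched as below. The enum and `report_internal_error` are hypothetical stand-ins (the latter for Seastar's on_internal_error_noexcept, which this sketch does not call), so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdio>
#include <string_view>

// Hypothetical enum standing in for compaction_options::scrub::mode.
enum class scrub_mode { abort, skip, segregate, validate };

// Stand-in for the real error hook: this sketch only logs to stderr.
void report_internal_error(const char* msg) noexcept {
    std::fprintf(stderr, "internal error: %s\n", msg);
}

std::string_view to_string(scrub_mode m) noexcept {
    switch (m) {
    case scrub_mode::abort:     return "abort";
    case scrub_mode::skip:      return "skip";
    case scrub_mode::segregate: return "segregate";
    case scrub_mode::validate:  return "validate";
    }
    // Reached only if the value is outside the enumerators, e.g. after
    // memory corruption or when a new enumerator was added without a case.
    // Returning a placeholder keeps the behavior defined, unlike
    // __builtin_unreachable().
    report_internal_error("invalid scrub mode");
    return "(invalid)";
}

void example_to_string() {
    assert(to_string(scrub_mode::skip) == "skip");
    assert(to_string(static_cast<scrub_mode>(42)) == "(invalid)");
}
```

Leaving out a `default:` label keeps the compiler's missing-case warning useful, while the code after the switch satisfies -Werror=return-type.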
Avi Kivity
df4d77e857 table: simplify generate_and_propagate_view_updates exception handling
We have both try/catch and handle_exception() to ignore exceptions.
Try/catch is enough, so remove handle_exception().

Closes #9011
2021-07-27 14:08:30 +02:00
Avi Kivity
f86e65b4e7 Merge "Fix quadratic behavior in memtable/row_cache with lots of range tombstones" from Tomasz
"
This series fixes two issues which cause very poor efficiency of reads
when there is a lot of range tombstones per live row in a partition.

The first issue is in the row_cache reader. Before the patch, all range
tombstones up to the next row were copied into a vector, and then put
into the buffer until it's full. This would get quadratic if there are
many more range tombstones than fit in a buffer.

The fix is to avoid the accumulation of all tombstones in the vector
and invoke the callback instead, which stops the iteration as soon as
the buffer is full.

Fixes #2581.

The second, similar issue was in the memtable reader.

Tests:

  - unit (dev)
  - perf_row_cache_update (release)
"

* tag 'no-quadratic-rt-in-reads-v1' of github.com:tgrabiec/scylla:
  test: perf_row_cache_update: Uncomment test case for lots of range tombstones
  row_cache: Consume range tombstones incrementally
  partition_snapshot_reader: Avoid quadratic behavior with lots of range tombstones
  tests: mvcc: Relax monotonicity check
  range_tombstone_stream: Introduce peek_next()
2021-07-27 14:39:13 +03:00
Avi Kivity
05d22d27a8 Merge "Cut repair->storage-service link" from Pavel E
"
It exists in the node-ops handler which is registered by repair code,
but is handled by storage service. Probably, the whole node-ops handler
should instead be moved into repair, but this looks like rather huge
rework. So instead -- put the node-ops verb registration inside the
storage-service.

This removes some more calls for global storage service instance and
allows slight optimization of node-ops cross-shards calls.

tests: unit(dev), start-stop
"

* 'br-remove-storage-service-from-nodeops' of https://github.com/xemul/scylla:
  storage_service: Replace globals with locals
  storage_service: Remove one extra hop of node-ops handler
  storage_service: Fix indentation after previous patch
  storage_service: Move cross-shard hop up the stack
  repair: Drop empty verbs reg/unreg methods
  repair, storage_service: Move nodeops reg/unreg to storage service
  repair: Coroutinize row-level start/stop
2021-07-27 13:27:27 +03:00
Takuya ASADA
fdc786b451 install.sh: add supervisor support
Bring supervisor support from dist/docker to install.sh, make it
installable from relocatable package.
This enables using supervisor in nonroot / offline environments,
and also makes the relocatable package able to run in a Docker environment.

Related #8849

Closes #8918
2021-07-27 12:51:29 +03:00
Takuya ASADA
42fd73d033 scylla_setup: add RAID5 support
This adds optional RAID5 support to scylla_setup.

Fixes #9076

Closes #9093
2021-07-27 12:49:29 +03:00
Avi Kivity
2cca461652 Merge 'sstables: merge row consumer interfaces with implementations' from Wojciech Mitros
This patch follows #9002, further reducing the complexity of the sstable readers.
The split between row consumer interfaces and implementations has been first added in 2015, and there is no reason to create new implementations anymore. By merging those classes, we achieve a sizeable reduction in sstable reader length and complexity.
Refs #7952
Tests: unit(dev)

Closes #9073

* github.com:scylladb/scylla:
  sstables: merge row_consumer into mp_row_consumer_k_l
  sstables: move kl row_consumer
  sstables: merge consumer_m into mp_row_consumer_m
  sstables: move mp_row_consumer_m
2021-07-27 12:23:29 +03:00
Benny Halevy
424c53d5b1 mutation_fragment_stream_validator: disambiguate schema member definition
gcc 10.3.1 complains that:
```
./mutation_fragment_stream_validator.hh:39:21: error: declaration of ‘const schema& mutation_fragment_stream_validator::schema() const’ changes meaning of ‘schema’ [-fpermissive]
   39 |     const ::schema& schema() const { return _schema; }
      |                     ^~~~~~
```

Defining the _schema member as `::schema` rather than just `schema`
calms the compiler down.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210727073941.1999909-1-bhalevy@scylladb.com>
2021-07-27 11:55:42 +03:00
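Illustrative note (not part of the commit): a reduced example of the situation described in 424c53d5b1. The class below is a hypothetical miniature, not the real validator. It shows the pattern that avoids the "changes meaning" diagnostic: the type is always spelled `::schema` inside a class that also declares a member function named `schema`.

```cpp
#include <cassert>

struct schema { int version; };

class stream_validator {
    // Writing plain `schema` here and then declaring schema() below would
    // make the unqualified name refer to two different things within the
    // class, which gcc rejects. The qualified name is unambiguous.
    const ::schema& _schema;
public:
    explicit stream_validator(const ::schema& s) noexcept : _schema(s) {}
    const ::schema& schema() const noexcept { return _schema; }
};

void example_schema_member() {
    ::schema s{3};
    stream_validator v(s);
    assert(&v.schema() == &s);
    assert(v.schema().version == 3);
}
```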
Pavel Emelyanov
ca2dfac7d7 test: Split test_multishard_combining_reader_as_mutation_source into 3
There are 3 independent cases in this test that benefit
from running in parallel.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 09:29:20 +03:00
Pavel Emelyanov
e184ed2b9c test: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 09:29:12 +03:00
Pavel Emelyanov
3e979a20ea test: Move out internals of test_multishard_combining_reader_as_mutation_source
Preparation. They will be called from 3 independent cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 09:28:12 +03:00
Nadav Har'El
8030461a2c cql-pytest: translate Cassandra's misc. type tests
This is a translation of Cassandra's CQL unit test source file
validation/entities/TypeTest.java into our cql-pytest framework.

This is a tiny test file, with only four tests which apparently didn't
find their place in other source files. All four tests pass on Cassandra,
and all but one pass on Scylla - the test marked xfail discovered one
previously-unknown incompatibility with Cassandra:

Refs #9082: DROP TYPE IF EXISTS shouldn't fail on non-existent keyspace

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210726140934.1479443-1-nyh@scylladb.com>
2021-07-27 08:28:16 +03:00
Tomasz Grabiec
7578cef0a4 test: perf_row_cache_update: Uncomment test case for lots of range tombstones 2021-07-26 21:38:00 +02:00
Gleb Natapov
56d0f711e8 serialize: allow use non copyable types with std::variant
Message-Id: <20210720120935.710549-3-gleb@scylladb.com>
2021-07-26 19:09:19 +03:00
Gleb Natapov
63025a75b2 serialize: allow use non copyable types with std::optional
Message-Id: <20210720120935.710549-2-gleb@scylladb.com>
2021-07-26 19:09:19 +03:00
Avi Kivity
8a80e455fb sstables: keys: convert trichotomic comparisons to std::strong_ordering
Prevent accidental conversions to bool from yielding the wrong results.
Unprepared users (that converted to bool, or assigned to int) are adjusted.

Ref #1449

Test: unit (dev)

Closes #9088
2021-07-26 19:09:19 +03:00
Nadav Har'El
d3a715e0ff Update seastar submodule
* seastar 93d053cd...ce3cc268 (4):
  > doc: update coroutine exception paragraph with make_exception
  > coroutine: add make_exception helper
  > coroutine: use std::move for forwarding exception_ptr
  > doc: tutorial: document direct exception propagation

With the new throw-less coroutine exception support, we can modify
some of Scylla's new coroutine code to generate exceptions a bit more
efficiently, without actually throwing an exception.
2021-07-26 19:09:19 +03:00
Tomasz Grabiec
2d18360157 row_cache: Consume range tombstones incrementally
Before the patch, all range tombstones up to the next row were copied
into a vector, and then put into the buffer until it's full. This
would get quadratic if there are many more range tombstones than fit in
a buffer.

The fix is to avoid the accumulation of all tombstones in the vector
and invoke the callback instead, which stops the iteration as soon as
the buffer is full.

Fixes #2581.
2021-07-26 17:48:05 +02:00
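Illustrative note (not part of the commit): the incremental consumption described in 2d18360157 can be sketched as below. Everything here is a hypothetical model: tombstones are plain ints, and `consume_from` and `drain` are invented names. The sketch shows a callback that stops the iteration once the buffer is full, so each item is visited once instead of being re-collected on every refill.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class stop_iteration { no, yes };

// Feeds items to `consume` one at a time, starting at `pos`, and stops as
// soon as the callback asks to. Returns the position to resume from.
template <typename Consumer>
std::size_t consume_from(const std::vector<int>& tombstones, std::size_t pos,
                         Consumer consume) {
    while (pos < tombstones.size()) {
        if (consume(tombstones[pos++]) == stop_iteration::yes) {
            break;
        }
    }
    return pos;
}

// Repeatedly fills a buffer of at most `capacity` items until the source is
// drained. Returns how many items were visited in total.
std::size_t drain(const std::vector<int>& tombstones, std::size_t capacity) {
    std::size_t visited = 0;
    std::size_t pos = 0;
    while (pos < tombstones.size()) {
        std::vector<int> buffer;
        pos = consume_from(tombstones, pos, [&](int rt) {
            buffer.push_back(rt);
            ++visited;
            return buffer.size() >= capacity ? stop_iteration::yes
                                             : stop_iteration::no;
        });
    }
    return visited;
}

void example_incremental_consume() {
    std::vector<int> tombstones(1000, 0);
    // Total work is linear: every tombstone is visited exactly once.
    assert(drain(tombstones, 7) == tombstones.size());
}
```

By contrast, copying all remaining items into a temporary vector before each buffer fill would touch O(n) items per fill and O(n^2 / capacity) overall.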
Tomasz Grabiec
e74c3c885e partition_snapshot_reader: Avoid quadratic behavior with lots of range tombstones
next_range_tombstone() was populating _rt_stream on each invocation
from the current iterator ranges in _range_tombstones. If there is a
lot of range tombstones, all would be put into _rt_stream. One problem
is that this can cause a reactor stall. Fix by more incremental
approach where we populate _rt_stream with minimal amount on each
invocation of next_range_tombstone().

Another problem is that this can get quadratic. The iterators in
_range_tombstones are advanced, but if lsa invalidates them across
calls they can revert back to the front since they go back to
_last_rt, which is the last consumed range tombstone, and if the
buffer fills up, not all tombstones from _rt_stream could be
consumed. The new code doesn't have this problem because everything
which is produced out of the iterators in _range_tombstones is
produced only once. What we put into _rt_stream is consumed first
before we try to feed the _rt_stream with more data.
2021-07-26 17:48:05 +02:00
Tomasz Grabiec
0d7b3f9463 tests: mvcc: Relax monotonicity check
Consecutive range tombstones can have the same position. They will, in
one of the test cases, after the range tombstone merger in
partition_snapshot_flat_reader no longer uses range_tombstone_list to
merge data from multiple versions, which deoverlaps, but rather merges
the streams corresponding to each version, which interleaves range
tombstones from different versions.
2021-07-26 17:27:03 +02:00
Piotr Sarna
ac7e6028a5 system_keyspace: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:05:52 +02:00
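Illustrative note (not part of the commit): the general idea behind these "pass exceptions without throwing" patches, shown with standard C++ only. This is an analogy, not Seastar code: `result<T>` is a hypothetical stand-in for a resolved future, and std::make_exception_ptr plays the role that make_exception_future<> or coroutine::exception(...) play in the real patches. The error is packaged and returned with no throw statement at the call site.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for a future's resolved state: either a value or a
// stored exception.
template <typename T>
class result {
    std::variant<T, std::exception_ptr> _state;
public:
    result(T value) : _state(std::move(value)) {}
    result(std::exception_ptr ex) : _state(std::move(ex)) {}
    bool failed() const noexcept { return _state.index() == 1; }
    const T& value() const { return std::get<0>(_state); }
    std::exception_ptr error() const { return std::get<1>(_state); }
};

// The failure path builds an exception object and hands it back directly;
// the caller can inspect or forward it without a try/catch block.
result<std::string> lookup(bool exists) {
    if (!exists) {
        return std::make_exception_ptr(std::runtime_error("no such entry"));
    }
    return std::string("entry");
}

void example_exception_passing() {
    assert(!lookup(true).failed());
    assert(lookup(true).value() == "entry");
    result<std::string> r = lookup(false);
    assert(r.failed());
    assert(r.error() != nullptr);
}
```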
Piotr Sarna
55cd46154c sstables: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:05:51 +02:00
Piotr Sarna
4de751c8c8 storage_proxy: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:05:15 +02:00
Piotr Sarna
776ab4bcb1 multishard_mutation_query: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:05:14 +02:00
Piotr Sarna
101eb26171 client_state: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:04:28 +02:00
Piotr Sarna
e5925d4980 flat_mutation_reader: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:04:20 +02:00
Piotr Sarna
26ae74524a table: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:04:18 +02:00
Piotr Sarna
3b37d75956 commitlog: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:03:41 +02:00
Piotr Sarna
6e994ce7c2 compaction: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:03:06 +02:00
Piotr Sarna
66c4d58a8c database: pass exceptions without throwing
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
 1. make_exception_future<> for futures
 2. co_return coroutine::exception(...) for coroutines
    which return future<T> (the mechanism does not work for future<>
    without parameters, unfortunately)
2021-07-26 17:02:36 +02:00
Tomasz Grabiec
91868cf0cd range_tombstone_stream: Introduce peek_next() 2021-07-26 13:33:34 +02:00
Pavel Emelyanov
11a2709f10 storage_service: Replace globals with locals
The node-ops verb handler is the lambda of storage-service and it
can stop using global storage service instance for no extra charge.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:30 +03:00
Pavel Emelyanov
6e56671d9e storage_service: Remove one extra hop of node-ops handler
It's now clear that the verb handler goes to some "random"
shard, then immediately switches to shard-0 and then does
the handling. Avoid the extra hop and go to shard-0 right
at once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:30 +03:00
Pavel Emelyanov
b6315d3af7 storage_service: Fix indentation after previous patch
And, while at it, s/ss/this/g and drop the ss variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:30 +03:00
Pavel Emelyanov
f5fad311cf storage_service: Move cross-shard hop up the stack
The storage_service::node_ops_cmd_handler runs inside a huge
invoke_on(0, ...) lambda. Make it be called on shard-0. This
is the preparation for next two patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:30 +03:00
Pavel Emelyanov
eb55c252c9 repair: Drop empty verbs reg/unreg methods
Those in repair.cc's are now noops, so remove them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:30 +03:00
Pavel Emelyanov
a09586a237 repair, storage_service: Move nodeops reg/unreg to storage service
The storage service is the verb sender, so it must be the verb
registrator. Another goal of this patch is to allow removal of
repair -> storage_service dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:21 +03:00
Pavel Emelyanov
18397a5e0a repair: Coroutinize row-level start/stop
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-26 14:21:21 +03:00
Piotr Jastrzebski
90a607e844 api: use proper type to reduce partition count
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.

Fixes #9090

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9074
2021-07-26 11:53:06 +03:00
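Illustrative note (not part of the commit): the arithmetic from 90a607e844 can be reproduced with the small sketch below. `reduce_counts` is a hypothetical helper, not the API code. The narrowing shown is guaranteed to wrap since C++20 and does so in practice on two's-complement platforms before that, where it is formally implementation-defined.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Adds two partition counts with a caller-chosen reducer type, mimicking a
// reduction that applies std::plus<...> to per-table values.
template <typename Reducer>
auto reduce_counts(std::uint64_t a, std::uint64_t b) {
    // The reducer's operator() takes its own argument type, so with
    // std::plus<int> both 64-bit counts are narrowed to int before adding.
    return Reducer{}(a, b);
}

void example_reduce_counts() {
    // 2147483648 does not fit in int and becomes -2147483648 when narrowed.
    assert(reduce_counts<std::plus<int>>(2147483648ULL, 1) == -2147483647);
    // With the matching type the sum is what the caller expects.
    assert(reduce_counts<std::plus<std::uint64_t>>(2147483648ULL, 1)
           == 2147483649ULL);
}
```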
Nadav Har'El
b503ec36c2 cql-pytest: translate Cassandra's tests for tuples
This is a translation of Cassandra's CQL unit test source file
validation/entities/TupleTypeTest.java into our cql-pytest framework.

This test file has a few tests on various features of tuples.
Unfortunately, some of the tests could not be easily translated into
Python so were left commented out: Some tests try to send invalid input
to the server which the Python driver "helpfully" forbids; Two tests
used an external testing library "QuickTheories" and are the only two
tests in the Cassandra test suite to use this library - so it's not
worthwhile to translate it to Python.

11 tests remain, all of them pass on Cassandra, and just one fails on
Scylla (so marked xfail for now), reproducing one known issue:

Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Actually, += is not supposed to be supported on tuple columns anyway, but
should print the appropriate error - not the syntax error we get now as
the "+=" feature is not supported at all.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210722201900.1442391-1-nyh@scylladb.com>
2021-07-26 08:20:12 +03:00
Benny Halevy
8674746fdd flat_mutation_reader: detach_buffer: mark as noexcept
Since detach_buffer is used before closing and
destroying the reader, we want to mark it as noexcept
to simplify the caller error handling.

Currently, although it does construct a new circular_buffer,
none of the constructors used may throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210617114240.1294501-2-bhalevy@scylladb.com>
2021-07-25 12:02:27 +03:00
Benny Halevy
0e31cdf367 flat_mutation_reader: detach_buffer: clarify buffer constructor
detach_buffer exchanges the current _buffer with
a new buffer constructed using the circular_buffer(Alloc)
constructor. The compiler implicitly constructs a
tracking_allocator(reader_permit) and passes it
to the circular_buffer constructor.

This patch just makes that explicit so it would be
clearer to the reader what's going on here.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210617114240.1294501-1-bhalevy@scylladb.com>
2021-07-25 11:59:37 +03:00
Pavel Solodovnikov
bcbcc18aa1 raft: raft_sys_table_storage: fix broken load_snapshot and load_term_and_vote
Loading snapshot id and term + vote involve selecting static
fields from the "system.raft" table, constrained by a given
group id.

The code incorrectly assumes that, for example,
`SELECT snapshot_id FROM raft WHERE group_id=?` in
`load_snapshot` always returns  only one row.
This is not true, since this will return a row
for each (pk, ck) combination, which is (group_id, index)
for "system.raft" table.

The same applies for the `load_term_and_vote`, which selects
static `vote_term` and `vote` from "system.raft".

This results in a crash at node startup when there is
a non-empty raft log containing more than one entry
for a given `group_id`.

Restrict the selection to always return one row by applying
`LIMIT 1` clause.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210723183232.742083-1-pa.solodovnikov@scylladb.com>
2021-07-25 02:01:34 +02:00
Pavel Solodovnikov
49ddd269ea cql3: rename variable_specifications to prepare_context
The class is repurposed to be more generic and also be able
to hold additional metadata related to function calls within
a CQL statement. Rename all methods appropriately.

Visitor functions in AST nodes (`collect_marker_specification`)
are also renamed to a more generic `fill_prepare_context`.

The name `prepare_context` designates that this metadata
structure is a byproduct of `stmt::raw::prepare()` call and
is needed only for "prepare" step of query execution.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-24 14:33:33 +03:00
Nadav Har'El
ec5e4c338b cql: fix undefined behavior in timestamp verification
Commit 2150c0f7a2 proposed by issue #5619
added a limitation that USING TIMESTAMP cannot be more than 3 days into
the future. But the actual code used to check it,

     timestamp - now > MAX_DIFFERENCE

only makes sense for *positive* timestamps. For negative timestamps,
which are allowed in Cassandra, the difference "timestamp - now" might
overflow the signed integer and the result is undefined - leading the
undefined-behavior sanitizer to complain, as reported in issue #8895.
Beyond the sanitizer, in practice, on my test setup, the timestamp -2^63+1
causes such overflow, which causes the above if() to make the nonsensical
statement that the timestamp is more than 3 days into the future.

This patch assumes that negative timestamps of any magnitude are still
allowed (as they are in Cassandra), and fixes the above if() to only
check timestamps which are in the future (timestamp > now).
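
A sketch of the two forms of the check, in Python. The names and the
`wrap_int64` helper are illustrative assumptions: signed overflow is
undefined in C++, and two's-complement wraparound is only one possible
outcome, although it matches the behavior observed above.

```python
INT64_MIN = -2**63
MAX_DIFFERENCE = 3 * 24 * 60 * 60 * 1_000_000  # 3 days, in microseconds


def wrap_int64(x):
    """Emulate two's-complement wraparound of a 64-bit signed integer."""
    return (x + 2**63) % 2**64 - 2**63


def too_far_in_future_old(timestamp, now):
    # Original form of the check: the subtraction may overflow.
    return wrap_int64(timestamp - now) > MAX_DIFFERENCE


def too_far_in_future_fixed(timestamp, now):
    # Only timestamps in the future are compared, so for now >= 0 the
    # difference is always positive and fits in a signed 64-bit integer.
    return timestamp > now and timestamp - now > MAX_DIFFERENCE


now = 1_627_000_000_000_000   # roughly July 2021, in microseconds
negative = INT64_MIN + 1      # the timestamp -2^63+1 quoted above
```

With these definitions `too_far_in_future_old(negative, now)` is true,
which is the nonsensical result described above, while
`too_far_in_future_fixed(negative, now)` is false.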

We also add a cql-pytest test for negative timestamps, passing on both
Cassandra and Scylla (after this patch - it failed before, and also
reported sanitizer errors in the debug build).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210621141255.309485-1-nyh@scylladb.com>
2021-07-24 11:01:08 +03:00
Tomasz Grabiec
b044db863f Merge 'db/virtual_table: Streaming tables for large data + describe_ring example table' from Juliusz Stasiewicz
This is the 2nd PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). This one introduces a new implementation of the virtual tables, the streaming tables, which are suitable for large amounts of data.

This PR was created by @jul-stas and @StarostaGit

Closes #8961

* github.com:scylladb/scylla:
  test/boost: run_mutation_source_tests on streaming virtual table
  system_keyspace: Introduce describe_ring table as virtual_table
  storage_service: Pass the reference down to system_keyspace
  endpoint_details: store `_host` as `gms::inet_address`
  queue_reader: implement next_partition()
  virtual_tables: Introduce streaming_virtual_table
  flat_mutation_reader: Add a new filtering reader factory method
2021-07-23 18:05:51 +02:00
Gleb Natapov
f0047bd749 raft: apply snapshots in applier_fiber
We want to serialize snapshot application with command application,
otherwise a command may be applied after a snapshot that already contains
the result of its application (this is not necessarily a problem, since
raft by itself does not guarantee apply-once semantics, but it is better to
prevent it when possible). This also moves all interactions with user's
state machine into one place.

Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>
2021-07-23 18:05:38 +02:00
Avi Kivity
aaf35b5ac2 Merge "Remove storage-service from transport (and a bit more)" from Pavel E
"
The cql-server -> storage-service dependency comes from the server's
event_notifier which (un)subscribes on the lifecycle events that come
from the storage service. To break this link the same trick as with
migration manager notifications is used -- the notification engine
is split out of the storage service and then is pushed directly into
both -- the listeners (to (un)subscribe) and the storage service (to
notify).

tests: unit(dev), dtest(simple_boot_shutdown, dev)
       manual({ start/stop,
                with/without started transport,
	        nodetool enable-/disablebinary
	      } in various combinations, dev)
"

* 'br-remove-storage-service-from-transport' of https://github.com/xemul/scylla:
  transport.controller: Brushup cql_server declarations
  code: Remove storage-service header from irrelevant places
  storage_service: Remove (unlifecycle) subscribe methods
  transport: Use local notifier to (un)subscribe server
  transport: Keep lifecycle notifier sharded reference
  main: Use local lifecycle notifier to (un)subscribe listeners
  main, tests: Push notifier through storage service
  storage_service: Move notification core into dedicated class
  storage_service: Split lifecycle notification code
  transport, generic_server: Remove no longer used functionality
  transport: (Un)Subscribe cql_server::event_notifier from controller
  tests: Remove storage service from manual gossiper test
2021-07-22 19:27:45 +03:00
Pavel Emelyanov
b1bb00a95c transport.controller: Brushup cql_server declarations
The controller code sits in the cql_transport namespace and
can omit the namespace qualifier. Also, seastar::distributed<>
is replaced with the modern seastar::sharded<> while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:57 +03:00
Pavel Emelyanov
c39f04fa6f code: Remove storage-service header from irrelevant places
Some .cc files across the code include the storage service
header without really needing it. Drop the header and include
(in some of them) what's really needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:19 +03:00
Pavel Emelyanov
e711bfbb7e storage_service: Remove (unlifecycle) subscribe methods
All the listeners now use main-local notifier instance directly
and these methods become unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:49:35 +03:00
Pavel Emelyanov
65b1bb8302 transport: Use local notifier to (un)subscribe server
Now the controller has the lifecycle notifier reference and
can stop using storage service to manage the subscription.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:48:58 +03:00
Pavel Emelyanov
5f99eeb35e transport: Keep lifecycle notifier sharded reference
It's needed to (un)subscribe server on it (next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:48:20 +03:00
Pavel Emelyanov
2a30cb1664 main: Use local lifecycle notifier to (un)subscribe listeners
The storage proxy and sl-manager get subscribed on lifecycle
events with the help of storage service. Now when the notifier
lives in main() they can use it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:47:15 +03:00
Pavel Emelyanov
8248bc9e33 main, tests: Push notifier through storage service
Now it's time to move the lifecycle notifier from storage
service to the main's scope. Next patches will remove the
$lifecycle-subscriber -> storage_service dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:45:51 +03:00
Pavel Emelyanov
6b3b01d9a6 storage_service: Move notification core into dedicated class
Introduce the endpoint_lifecycle_notifier class that's in
charge of keeping track of subscribers and notifying them.
The subscribers will thus be able to set and unset their
subscription without the need to mess with storage service
at all.

The storage_service for now keeps the notifier on board, but
this is going to change in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:44:02 +03:00
Pavel Emelyanov
7e8a032013 storage_service: Split lifecycle notification code
This prepares the ground for moving the notification engine
into own class like it was done for migration_notifier some
time ago.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:43:14 +03:00
Pavel Emelyanov
c7b0b25494 transport, generic_server: Remove no longer used functionality
After subscription management was moved onto controller level
a bunch of code can be dropped:

- passing migration notifier beyond controller
- event_notifier's _stopped bit
- event_notifier .stop() method
- event_notifier empty constructor and destructor
- generic_server's on_stop virtual method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:41:32 +03:00
Pavel Emelyanov
1acef41626 transport: (Un)Subscribe cql_server::event_notifier from controller
There's a migration notifier that's carried through cql_server
_just_ to let event-notifier (un)subscribe on it. Also there's
a call for global storage-service in there which will need to
be replaced with yet another pass-through argument which is not
great.

It's easier to establish this subscription outside of cql_server
like it's currently done for proxy and sl-manager. In case of
cql_server the "outside" is the controller.

This patch just moves the subscription management from cql_server
to controller, next two patches will make more use of this change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:37:23 +03:00
Pavel Emelyanov
b57fb0aa9a tests: Remove storage service from manual gossiper test
It's not needed there, gossiper starts and works without it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:36:28 +03:00
Yaron Kaikov
a004b1da30 scylla_util:add AWS arm based instance to supported list
Today we have a Scylla AMI image based on the x86 architecture only.
Following the work we did in https://github.com/scylladb/scylla-machine-image/pull/153 we can build
an ARM-based AMI image.

Let's add ARM-based instances to the supported list.

Closes #9064
2021-07-22 15:48:29 +03:00
Avi Kivity
d0d42891e9 Merge 'Harden batchlog_manager stop and call from main in deferred action' from Benny Halevy
This PR contains the parts relevant to batchlog_manager stop in #8998 without adding a gate to the storage_proxy for synchronization with on-going queries in storage_proxy::drain_on_shutdown.

As explained in #9009, we see that the batchlog_manager isn't stopped if scylla shuts down during startup, e.g. when waiting for gossip to settle, since currently the batchlog_manager is stopped only from `storage_service::do_drain`, while `storage_service::drain_on_shutdown` deferred shutdown is installed only later on:
222ef17305/main.cc (L1419-L1421)

Fixes #9009

Test: unit(dev)
DTest: compact_storage_tests.py:TestCompactStorage.wide_row_test paging_test:TestPagingDatasetChanges.test_cell_TTL_expiry_during_paging update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test (dev)

Closes #9010

* github.com:scylladb/scylla:
  main: add deferred stop of batchlog_manager
  batchlog_manager: refactor drain out of stop
  batchlog_manager: stop: break _sem on shard 0
  batchlog_manager: stop: use abort_source to abort batchlog_replay_loop
  batchlog_manager: do_batch_log_replay: hold _gate
2021-07-22 15:47:29 +03:00
Piotr Sarna
ea3d9baa5a Update seastar submodule
* seastar 388ee307...93d053cd (5):
  > doc: tutorial: document seastar::coroutine::all()
  > doc: tutorial: nest "exceptions in coroutines" under "coroutines"
  > coroutine: add a way of propagating exceptions without throwing
  > input_stream: Fix read_exactly(n) incorrectly skipping data
  > coroutines: introduce all() template for waiting for multiple futures
2021-07-22 12:29:28 +02:00
Piotr Sarna
e9d26dd7ed utils/coroutine: wrap a helper in utils namespace
The class name `coroutine` became problematic since seastar
introduced it as a namespace for coroutine helpers.
To avoid a clash, the class from scylla is wrapped in a separate
namespace.

Without this patch, Seastar submodule update fails to compile.
Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>
2021-07-22 13:28:43 +03:00
Piotr Sarna
526ad2a151 Merge 'secondary_index: Fix TOKEN() restrictions in indexed SELECTs' from Jan Ciołek
This is a rewrite of an old PR: #7582

`TOKEN()` restrictions don't work properly when a query uses an index.
For example this returns both rows:
```cql
CREATE TABLE t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ON t(v);
INSERT INTO t (pk, ck, v) VALUES (0, 0, 0);
INSERT INTO t (pk, ck, v) VALUES (1, 0, 0);
SELECT token(pk), pk, ck, v FROM t WHERE v = 0 AND token(pk) = token(0) ALLOW FILTERING;
```

This functionality is supported on both old and new indexes.  In old
indexes the type of the token column was `blob`.  This causes problems,
because `blob` representation of tokens is ordered differently. Tokens
represented as blobs are ordered like this:
```
0, 1, 2, 3, 4, 5, ..., bigint_max, bigint_min, ...., -5, -4, -3, -2, -1
```
Because of that clustering range for `token()` restrictions needs to be
translated to two clustering ranges on the `blob` column.
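
A minimal sketch of that translation, assuming the old `blob` form is the
big-endian two's-complement encoding of the 64-bit token (the ordering shown
above is consistent with this). The function names are hypothetical and not
taken from the patch:

```python
def token_to_blob(token):
    """Big-endian two's-complement encoding of a signed 64-bit token."""
    return token.to_bytes(8, "big", signed=True)


def blob_ranges(lo, hi):
    """Translate the inclusive token range [lo, hi] into inclusive blob ranges."""
    if lo > hi:
        return []
    if lo >= 0 or hi < 0:
        # Same sign: blob order agrees with numeric order inside the range.
        return [(token_to_blob(lo), token_to_blob(hi))]
    # The range crosses zero: in blob order the non-negative part comes
    # first and the negative part last, so two ranges are needed.
    return [
        (token_to_blob(0), token_to_blob(hi)),
        (token_to_blob(lo), token_to_blob(-1)),
    ]


def in_blob_ranges(token, ranges):
    """Check membership by comparing blobs bytewise, as a blob column would."""
    blob = token_to_blob(token)
    return any(start <= blob <= end for start, end in ranges)


BIGINT_MAX, BIGINT_MIN = 2**63 - 1, -2**63
tokens = [-5, 3, BIGINT_MIN, 0, -1, BIGINT_MAX, 1]
blob_order = sorted(tokens, key=token_to_blob)
```

Here `blob_order` places the non-negative tokens first and the negative ones
last, and `blob_ranges(-2, 2)` returns two ranges that together select the
same tokens as the numeric range.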

To create old indexes disable the feature called:
`CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX` or run scylla version from branch
[`cvybhu/si-token2-old-index`](https://github.com/cvybhu/scylla/commits/si-token2-old-index)

I'm not sure if it's possible to create automatic tests with old
indexes. I ran `dev-test` manually on the `si-token2-old-index` branch,
and the only tests that failed were the ones testing row ordering. Rows
should be ordered by `token`, but because in old indexes the token is
represented as a `blob` this ordering breaks. This is a known issue
(#7443), that has been fixed by introducing new indexes.

To sum up:
* `token()` restrictions are fixed on both new and old indexes.
* When using old indexes, the rows are not properly ordered by token.
* With new indexes the rows are properly ordered by token.

Fixes #7043

Closes #9067

* github.com:scylladb/scylla:
  tests: add secondary index tests with TOKEN clause
  secondary_index_test: extract test data
  secondary_index: Fix TOKEN() restrictions in indexed SELECTs
  expression: Add replace_token function
2021-07-22 10:22:45 +02:00
Wojciech Mitros
7f41af0916 sstables: merge row_consumer into mp_row_consumer_k_l
The row_consumer interface has only one implementation:
mp_row_consumer_k_l; and we're not planning other ones,
so to reduce the number of inheritances, and the number
of lines in the sstable reader, these classes may be
combined.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 18:19:49 +02:00
Wojciech Mitros
1ff72ca0a6 sstables: move kl row_consumer
In preparation for the next patch combining row_consumer and
mp_row_consumer_k_l, move mp_row_consumer_k_l next to row_consumer.

Because row_consumer is going to be removed, we retire some
old tests for different implementations of the row_consumer
interface; as a result, we don't need to expose internal
types of kl sstable reader for tests, so all classes from
reader_impl.hh are moved to reader.cc, and the reader_impl.hh
file is deleted, and the reader.cc file has an analogous
structure to the reader.cc file in sstables/mx directory.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 18:04:22 +02:00
Wojciech Mitros
fc17c48bc9 sstables: merge consumer_m into mp_row_consumer_m
The consumer_m interface has only one implementation:
mp_row_consumer_m; and we're not planning other ones,
so to reduce the number of inheritances, and the number
of lines in the sstable reader, these classes may be
combined.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 17:36:10 +02:00
Wojciech Mitros
fbb56e930c sstables: move mp_row_consumer_m
To make next patch combining consumer_m and mp_row_consumer_m
more readable, move mp_row_consumer_m next to consumer_m.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 17:36:04 +02:00
Piotr Grabowski
e06102aed9 tests: add secondary index tests with TOKEN clause
Add tests of SELECTs with TOKEN clauses on tables with secondary
indexes (both global and local).

test_select_with_token_range_cases checks all possible token range
combinations (inclusive/exclusive/infinity start/end) on tables without
index, with local or with global index.

test_select_with_token_range_filtering checks whether TOKEN restrictions
combined with column restrictions work properly. As different code paths
are taken if index is created on clustering key (first or non-first) or
non-primary-key column, the test checks scenarios when index is created
on different columns.
2021-07-21 16:12:55 +02:00
Piotr Grabowski
e2bd1cdb9d secondary_index_test: extract test data
Extract test data to separate variables, allowing it to be easily
reused by other tests. The tokens are hard-coded, because calculating
their value brought too much complexity to this code.
2021-07-21 16:12:55 +02:00
Jan Ciolek
694d62a567 secondary_index: Fix TOKEN() restrictions in indexed SELECTs
When using an index, restrictions like token(p) <= x were ignored.
Because of this a query like this would select all rows where r = 0:
SELECT * FROM tab WHERE r = 0 and token(p) > 0;

Adds proper handling of token restrictions to queries that use indexes.

Old indexes represented token as a blob, which complicates
clustering bounds. Special code is included, which translates
token clustering bounds to blob clustering bounds.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-07-21 16:12:49 +02:00
Raphael S. Carvalho
e4eb7df1a1 table: Make correctness of concurrent sstable list update robust
Today, table relies on row_cache::invalidate() serialization for
concurrent sstable list updates to produce correct results.
That's very error prone because table is relying on an implementation
detail of invalidate() to get things right.
Instead, let's make table itself take care of serialization on
concurrent updates.
To achieve that, sstable_list_builder is introduced. Only one
builder can be alive for a given table, so serialization is guaranteed
as long as the builder is kept alive throughout the update procedure.
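
The idea can be sketched with asyncio in Python. This is a hypothetical
analogue and not the actual C++ class: `Table`, `_ListBuilder` and
`add_sstable` are illustrative names, and a lock held for the builder's
lifetime stands in for whatever mechanism the patch uses.

```python
import asyncio


class Table:
    def __init__(self):
        self.sstables = []
        self._update_lock = asyncio.Lock()   # at most one builder alive at a time

    def list_builder(self):
        return _ListBuilder(self)


class _ListBuilder:
    """Hypothetical stand-in for sstable_list_builder: holds the lock while alive."""

    def __init__(self, table):
        self._table = table

    async def __aenter__(self):
        await self._table._update_lock.acquire()
        self.new_list = list(self._table.sstables)   # work on a private copy
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._table.sstables = self.new_list     # publish only on success
        self._table._update_lock.release()
        return False


async def add_sstable(table, name):
    async with table.list_builder() as builder:
        await asyncio.sleep(0)   # yield, as a real update would while touching the cache
        builder.new_list.append(name)


async def demo():
    table = Table()
    await asyncio.gather(*(add_sstable(table, f"sst-{i}") for i in range(5)))
    return table.sstables


result = asyncio.run(demo())
```

Without the lock, every concurrent updater would copy the same initial list
and the last one to publish would silently discard the others' additions.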

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721001716.210281-1-raphaelsc@scylladb.com>
2021-07-21 16:45:30 +03:00
Botond Dénes
84c9bf2b63 tools/scylla-sstable-index: remove global reader concurrency semaphore
Use a local one instead and make sure to stop it before it is destroyed.

Message-Id: <20210721133754.356229-1-bdenes@scylladb.com>
2021-07-21 16:41:01 +03:00
Raphael S. Carvalho
aad72289e2 table: Kill load_sstable()
That function is dangerously used by distributed loader, as the latter
was responsible for invalidating cache for new sstable.
load_sstable() is an unsafe alternative to
add_sstable_and_update_cache() that should never have been used by
the outside world. Instead, let's kill it and make loader use
the safe alternative instead.
This will also make it easier to make sure that all concurrent updates
to sstable set are properly serialized.

Additionally, this may potentially reduce the amount of data evicted
from the cache, when the sstables being imported have a narrow range,
like high level sstables imported from a LCS table. Unlikely but
possible.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721131949.26899-1-raphaelsc@scylladb.com>
2021-07-21 16:21:42 +03:00
Botond Dénes
a819f013f6 compaction/compaction: create_compaction_info(): take const compaction_descriptor&
Don't copy the descriptor.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210721120219.326972-1-bdenes@scylladb.com>
2021-07-21 16:19:03 +03:00
Pavel Solodovnikov
718977e2b7 idl: add descriptions for the top-level generation routines
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:39:20 +03:00
Pavel Solodovnikov
fe6b0e8bbf idl: make ns_qualified name a class method
Introduce ASTBase and move `combine_ns` and `ns_qualified_name`
there.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:31:47 +03:00
Pavel Solodovnikov
c584bbf841 idl: cache template declarations inside enums and classes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
82cb2dc3e3 idl: cache parent template params for enums and classes
Also remove `parent_template_param` argument for
`handle_enum` and `handle_class` functions.

`setup_namespace_bindings` is renamed to `setup_additional_metadata`
since it now also sets parent template arguments for each object.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
6dd3bf9d6d idl: rename misleading local_types to local_writable_types
Rename `add_to_types` function to `register_local_writable_type`
and `local_types` set to `local_writable_types`.
Also rename other related functions accordingly, by adding
`writable` to the names.

Previous names were misleading since `local_types` refers not
to all local types but only to those which are marked with
`[[writable]]` attribute.

Nonetheless, we are going to need a mapping of all local types
to resolve type references from `BasicType` AST node
instances. So the `local_types` set is retained, but now it
corresponds to the list of all local types.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
d17a6a5e5a idl: remove remaining uses of namespaces argument
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
8aeaba5eb6 idl: remove is_final function and use .final AST class property
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
5ec8aeb74c idl: remove parent_template_param from local_types set
Previously local types set contained items, which are lists
of `[cls, parent_template_param]`. The second element is never
used, so remove it and move `cls` from the list.

All uses are adjusted accordingly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:40 +03:00
Pavel Solodovnikov
4a66e07cb1 idl: cache namespaces in AST nodes
Do a pre-processing pass to cache namespaces info in each
type declaration AST node (`ClassDef` and `EnumDef`) and
store it in the `ns_context` field of a node.

Switch to `ns_context` and eliminate `namespaces` parameter
carried over through all writer functions.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:39 +03:00
Pavel Solodovnikov
3218a952b9 idl: remove unused variables
This patch removes unused `parent_template_param` and `namespaces`
variables obtained from unpacking values from the `local_types` set.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-07-21 15:17:39 +03:00
Jan Ciolek
51ee9adeec expression: Add replace_token function
Adds replace_token function which takes an expression
and replaces all left hand side occurrences of token()
with the given column definition.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-07-21 12:25:12 +02:00
Gleb Natapov
7261c2c93e raft: return a correct leader when leaving leader state
When a leader moves to a follower state it aborts all requests that are
waiting on an admission semaphore with not_a_leader exception. But
currently it specifies itself as a new leader since abortion happens
before the fsm state changes to a follower. The patch fixes this by
destroying leader state after fsm state already changed to be a
follower.

Message-Id: <YPbI++0z5ZPV9pKb@scylladb.com>
2021-07-21 00:42:39 +02:00
Nadav Har'El
c4f20f1641 Update seastar submodule
* seastar ef320940...388ee307 (4):
  > Merge 'Add a stall analyser tool' from Benny Halevy
  > compat: implement coroutine_handle<void> for <experimental/coroutine> header
  > Merge "Make app_template::run noexcept" from Pavel E
  > perftune.py: make RPS CPU set to be a full CPU set

The stall analyser tool was requested by the SCT team to help make
sense of Scylla's stall reports and find more stall bugs!
2021-07-21 00:47:11 +03:00
Benny Halevy
c5e08eb6e7 main: add deferred stop of batchlog_manager
Stop the batchlog manager using a deferred
action in main to make sure it is stopped after
its start() method has been called, even
if we bail out of main early due to an exception.

Change the bm.stop() calls in storage_service
to just stop the replay loop using drain().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-20 20:24:11 +03:00
Benny Halevy
5165780d81 batchlog_manager: refactor drain out of stop
drain() aborts the replay loop fiber
and returns its future.

It's grabbing _gate so stop() will wait on it.

The intention is to call stop_replay_loop from
storage_service::decommission and do_drain rather
than stop, so we can stop the batchlog manager once,
using a deferred action in main.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-20 20:23:06 +03:00
Benny Halevy
c47fbda076 batchlog_manager: stop: break _sem on shard 0
Abort do_batch_log_replay if waiting on the semaphore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-20 19:35:23 +03:00
Benny Halevy
deef1b4f59 batchlog_manager: stop: use abort_source to abort batchlog_replay_loop
Harden start/stop by using an abort_source to abort from
the replay loop.

Extract the loop into batchlog_replay_loop() coroutine,
with the _stop abort source as a stop condition,
plus use it for sleep_abortable to be able to promptly
stop while sleeping.

start() stores batchlog_replay_loop's future in a newly added
_started member, which is waited on in stop() to synchronize
with the start process at any stage.
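
The start/stop pattern described above can be sketched with asyncio in
Python. This is only an analogue of the Seastar code, under stated
assumptions: an `asyncio.Event` plays the role of the abort source, a timed
wait on it plays the role of the abortable sleep, and all names are
hypothetical.

```python
import asyncio


class BatchlogManager:
    """Hypothetical asyncio analogue; names only loosely mirror the patch."""

    def __init__(self, interval):
        self._interval = interval
        self._stop = asyncio.Event()   # plays the role of the abort source
        self._started = None           # task running the replay loop
        self.replays = 0

    async def _do_replay(self):
        self.replays += 1

    async def _replay_loop(self):
        while not self._stop.is_set():
            await self._do_replay()
            try:
                # Abortable sleep: returns early as soon as stop is requested.
                await asyncio.wait_for(self._stop.wait(), timeout=self._interval)
            except asyncio.TimeoutError:
                pass                   # interval elapsed, run the next replay

    def start(self):
        self._started = asyncio.ensure_future(self._replay_loop())

    async def stop(self):
        self._stop.set()
        if self._started is not None:
            await self._started        # synchronize with the loop before returning


async def demo():
    bm = BatchlogManager(interval=3600)           # long sleep; stop() must not wait for it
    bm.start()
    await asyncio.sleep(0.01)                     # let the first replay run
    await asyncio.wait_for(bm.stop(), timeout=1)  # returns promptly despite the sleep
    return bm.replays


replays = asyncio.run(demo())
```

Because the sleep waits on the stop signal instead of a plain timer, `stop()`
completes promptly even though the loop was sleeping for an hour.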

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-20 19:32:55 +03:00
Benny Halevy
976b517f55 batchlog_manager: do_batch_log_replay: hold _gate
So we can wait on do_batch_log_replay on stop().

Note that do_batch_log_replay is called both from
batchlog_replay_loop and from the storage_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-20 19:30:55 +03:00
Juliusz Stasiewicz
38b8a6ce2c test/boost: run_mutation_source_tests on streaming virtual table
Tests that require inter-partition forwarding are excluded.
2021-07-20 14:19:17 +02:00
Juliusz Stasiewicz
65c87e2c74 system_keyspace: Introduce describe_ring table as virtual_table
This change adds "system.describe_ring" table using the new
streaming_virtual_table infrastructure.
2021-07-20 14:19:17 +02:00
Juliusz Stasiewicz
f8067d938d storage_service: Pass the reference down to system_keyspace
According to the policy of avoiding globals.
2021-07-20 14:18:24 +02:00
Juliusz Stasiewicz
a8b741efe2 endpoint_details: store _host as gms::inet_address
In an upcoming commit I will add "system.describe_ring" table which uses
endpoint's inet address as a part of CK and, therefore, needs to keep them
sorted with `inet_addr_type::less`.
2021-07-20 14:00:54 +02:00
Juliusz Stasiewicz
2b802711c2 queue_reader: implement next_partition()
2021-07-20 14:00:54 +02:00
Piotr Wojtczak
9a77751c6b virtual_tables: Introduce streaming_virtual_table
This change adds another implementation of the virtual_table interface,
useful for cases where there are larger amounts of data.
2021-07-20 14:00:54 +02:00
Piotr Wojtczak
cb2a0ab858 flat_mutation_reader: Add a new filtering reader factory method
Introduce a new function creating a filtering reader using query slice
and partition range.
2021-07-20 14:00:47 +02:00
Tomasz Grabiec
dcd05f77b1 lsa: Avoid excessive eviction if region is not compactible
Introduced in d72b91053b.

If region was not compactible, for example because it has dense
segments, we would keep evicting even though the target for reclaimed
segments was met. In the worst case we may have to evict whole cache.

Refs #9038 (unlikely to be the cause though)
Message-Id: <20210720104039.463662-1-tgrabiec@scylladb.com>
2021-07-20 14:36:14 +03:00
dgarcia360
8d51482ffe docs: moved latest_version to conf.py
Related issues: scylladb/sphinx-scylladb-theme#87

All the variables related to the multiversion extension are now defined in conf.py instead of using the GitHub Actions file.

How to test this PR
Run make multiversionpreview on docs folder. When you open https://0.0.0.0:5500, the browser should render the documentation site.

Closes #7957
2021-07-20 14:31:46 +03:00
Avi Kivity
05fcf11557 Merge 'Coroutinize commit log' from Calle Wilund
No real refactoring, just move the various methods to coroutines. Because coroutines are neat.
Broken down into one method per change to make review easier. And hoping I get tipped per change.

Grand idea being that using coroutines will eventually make real refactoring easier.
Unit tests + relevant dtest.

As discussed below, simply coroutinizing the code will, at least in the fast path, cause the slightly naive
compiler to generate multiple unused coroutine frames, dropping raw performance a bit.
The last two patches in this series address this, by breaking the fast path into non-coroutine
subroutines (no futures involved) and one coroutine main loop.

Results, as collected by `perf_simple_query` are:
Master (before changes):
```
{
        "parameters" :
        {
                "concurrency" : 100,
                "concurrency,partitions,cpus,duration" : "100,10000,1,30",
                "cpus" : 1,
                "duration" : 30,
                "partitions" : 10000
        },
        "stats" :
        {
                "allocs_per_op" : 52.237303521776113,
                "instructions_per_op" : 47403.34422198555,
                "mad tps" : 670.12528706749436,
                "max tps" : 140817.0800358199,
                "median tps" : 139391.58369995767,
                "min tps" : 133663.0095463676,
                "tasks_per_op" : 13.189605506751203
        },
        "test_properties" :
        {
                "type" : "write"
        },
        "versions" :
        {
                "scylla-server" :
                {
                        "commit_id" : "1f51bc67fd",
                        "date" : "20210712",
                        "run_date_time" : "2021-07-13 10:26:46",
                        "version" : "4.6.dev"
                }
        }
}
```

This PR (coroutines + fast path optimization patches):
```
{
        "parameters" :
        {
                "concurrency" : 100,
                "concurrency,partitions,cpus,duration" : "100,10000,1,30",
                "cpus" : 1,
                "duration" : 30,
                "partitions" : 10000
        },
        "stats" :
        {
                "allocs_per_op" : 52.208628061750559,
                "instructions_per_op" : 47300.501878330339,
                "mad tps" : 707.70233700674726,
                "max tps" : 139618.0661493362,
                "median tps" : 137891.11290420164,
                "min tps" : 127551.83433347062,
                "tasks_per_op" : 13.172121395660733
        },
        "test_properties" :
        {
                "type" : "write"
        },
        "versions" :
        {
                "scylla-server" :
                {
                        "commit_id" : "1d4b6f50bd",
                        "date" : "20210719",
                        "run_date_time" : "2021-07-19 09:27:09",
                        "version" : "4.6.dev"
                }
        }
}
```

I.e. both allocations/op and instruction count seem to be on par.
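
For a quick comparison, the relative change of a few of the stats reported
above can be computed as follows. The numbers are copied from the two JSON
reports; the helper name is illustrative.

```python
# Values copied from the "stats" sections of the two reports above.
master = {
    "allocs_per_op": 52.237303521776113,
    "instructions_per_op": 47403.34422198555,
    "median tps": 139391.58369995767,
}
this_pr = {
    "allocs_per_op": 52.208628061750559,
    "instructions_per_op": 47300.501878330339,
    "median tps": 137891.11290420164,
}


def relative_change(before, after):
    """Change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


changes = {name: relative_change(master[name], this_pr[name]) for name in master}
for name, percent in changes.items():
    print(f"{name}: {percent:+.2f}%")
```

All three values change by roughly one percent or less between the two runs.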

Closes #8954

* github.com:scylladb/scylla:
  commitlog: Make allocate_when_possible a template
  commitlog: break fast path alloc into non-fut/corout + outer loop
  commitlog: Drop stream/subscription from replayer
  commitlog: coroutinize commitlog::read_log_file
  commitlog: coroutinize commitlog::create_commitlog
  commitlog: coroutinize commitlog::add_entries
  commitlog: coroutinize commitlog::add_entry
  commitlog: coroutinize commitlog::add
  commitlog: change entry_writer usage to reference
  commitlog: coroutinize segment_manager::clear
  commitlog: coroutinize segment_manager::do_pending_deletes
  commitlog: coroutinize segment_manager::delete_file
  commitlog: coroutinize segment_manager::shutdown
  commitlog: coroutinize segment_manager::shutdown_all_segments
  commitlog: coroutinize segment_manager::sync_all_segments
  commitlog: coroutinize segment_manager::clear_reserve_segments
  commitlog: coroutinize segment_manager::active_segment
  commitlog: coroutinize segment_manager::new_segment
  commitlog: coroutinize segment_manager::allocate_segment
  commitlog: coroutinize segment_manager::rename_file
  commitlog: coroutinize segment_manager::init
  commitlog: coroutinize segment_manager::list_descriptors
  commitlog: coroutinize segment_manager::replenish_reserve
  commitlog: coroutinize segment::shutdown
  commitlog: coroutinize segment::close
  commitlog: coroutinize segment::batch_cycle
  commitlog: coroutinize segment::do_flush
  commitlog: coroutinize segment::flush
  commitlog: coroutinize segment::cycle
  commitlog: coroutinize allocate_when_possible
  commitlog: coroutinize segment::allocate
2021-07-20 14:14:13 +03:00
Tomasz Grabiec
50ec3ea295 lsa: Fix misaccounting of used space when allocating lsa_buffers
lsa_buffer allocations are aligned to 4K. If smaller size is
requested, whole 4K is used. However, only requested size was used in
accounting segment occupancy. This can confuse reclaimer which may
think the segment is sparse while it is actually dense, and compacting
it will yield no or little gain. This can cause inefficient memory
reclamation or lack of progress.
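
The accounting mismatch described above can be illustrated with a minimal sketch. The 4K alignment comes from the commit message; the `SegmentOccupancy` class, its method names, and the 128 KiB segment size are hypothetical and are not taken from the LSA code.

```python
LSA_BUFFER_ALIGNMENT = 4096  # 4K alignment, as stated in the commit message


def aligned_size(requested: int, alignment: int = LSA_BUFFER_ALIGNMENT) -> int:
    """Round a requested size up to the next multiple of the alignment."""
    return (requested + alignment - 1) // alignment * alignment


class SegmentOccupancy:
    """Hypothetical stand-in for a segment's used-space counter."""

    def __init__(self, segment_size: int):
        self.segment_size = segment_size
        self.used = 0

    def record_alloc_requested_only(self, requested: int) -> None:
        # Pre-fix behaviour as described: only the requested size is counted.
        self.used += requested

    def record_alloc_aligned(self, requested: int) -> None:
        # Post-fix behaviour as described: the whole aligned block is counted.
        self.used += aligned_size(requested)

    def used_fraction(self) -> float:
        return self.used / self.segment_size


SEGMENT_SIZE = 128 * 1024  # assumed segment size, for illustration only
before, after = SegmentOccupancy(SEGMENT_SIZE), SegmentOccupancy(SEGMENT_SIZE)
for _ in range(32):  # 32 small buffers physically fill the assumed segment
    before.record_alloc_requested_only(100)
    after.record_alloc_aligned(100)
```

With these assumed numbers the first counter reports a nearly empty segment while the second reports a full one, which is the "sparse while actually dense" confusion the message refers to.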

Refs #9038
Message-Id: <20210720104110.463812-1-tgrabiec@scylladb.com>
2021-07-20 14:08:06 +03:00
Pavel Solodovnikov
d2b53bc0ca configure: simplify raft tests dependencies management
There's no need for extended `scylla_raft_dependencies`,
which includes the entire `scylla_core` target list.

Revert the tests which don't need the extended list to use
a minimal set of dependencies and switch to using
`scylla_core` as a dependency for
`raft_sys_table_storage_test` and `raft_address_map_test`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705104712.295499-1-pa.solodovnikov@scylladb.com>
2021-07-20 12:32:10 +03:00
Botond Dénes
8fc55fa5bf reader_concurrency_semaphore: get rid of struct permit_list
struct permit_list exists so the intrusive list declaration which needs
the definition of reader_permit can be hidden in the .cc. But it turns
out that if the hook type is fully spelled out, the intrusive list
declaration doesn't need T to be defined. Exploit this to get rid of
this extra indirection.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210720073121.63027-2-bdenes@scylladb.com>
2021-07-20 10:35:12 +03:00
Botond Dénes
11b39cbc23 reader_concurrency_semaphore: merge permit_stats into stats
If there was any reason to have them separate when permit_stats was
conceived, it is gone now, so merge the two.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210720073121.63027-1-bdenes@scylladb.com>
2021-07-20 10:35:12 +03:00
Tomasz Grabiec
a8528cb24d lsa: Fix uninitialized field access resulting in hangs during segment compaction
_free_space may be initialized with garbage so the kind() getter should
only look at the bit which corresponds to the kind. Misclassification
of a segment as being of a different kind may result in a hang during
segment compaction.

Surfaced in debug mode build where the field is filled with 0xbebebebe.
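
A minimal sketch of why the getter should mask a single bit. The bit position, the kind names `"regular"` and `"bufs"`, and the fragile getter are assumptions made for illustration; only the 0xbebebebe fill pattern comes from the message above.

```python
KIND_BIT = 1 << 31        # hypothetical position of the kind bit
DEBUG_FILL = 0xBEBEBEBE   # debug-mode fill pattern mentioned in the commit


def kind_assuming_clean_field(field: int) -> str:
    # Fragile: correct only if every other bit of the field is zero.
    return "bufs" if field == KIND_BIT else "regular"


def kind_masked(field: int) -> str:
    # Robust: looks only at the bit that encodes the kind.
    return "bufs" if field & KIND_BIT else "regular"
```

For a field filled with the debug pattern, the two getters disagree: the masked one reads the kind bit as set, while the fragile one is misled by the garbage in the remaining bits.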

Introduced in b5ca0eb2a2.

Fixes #9057
Message-Id: <20210719232734.443964-1-tgrabiec@scylladb.com>
2021-07-20 02:33:21 +03:00
Tomasz Grabiec
393b90112f gdb: segment-descs: Support debug mode builds
Debug mode builds have a different implementation of segment_store in LSA.
Message-Id: <20210719232125.442458-1-tgrabiec@scylladb.com>
2021-07-20 02:33:18 +03:00
Gleb Natapov
aa8c6b85fb raft: do not apply empty command list
Do not call user's state machine apply() if there is nothing to apply.
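
As a rough sketch of the guard described above (the class and function names are hypothetical and not the raft module's API):

```python
class CountingStateMachine:
    """Hypothetical user state machine that records how often apply() runs."""

    def __init__(self):
        self.apply_calls = 0
        self.applied = []

    def apply(self, commands):
        self.apply_calls += 1
        self.applied.extend(commands)


def apply_committed(state_machine, commands):
    # Skip the user callback entirely when there is nothing to apply.
    if not commands:
        return
    state_machine.apply(commands)
```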

Message-Id: <YO1dMitXnZhZlmra@scylladb.com>
2021-07-19 18:26:18 +02:00
Nadav Har'El
36ec1d792e Merge 'cql-pytest: Test selecting from indexed table using only clustering key' from Jan Ciołek
Add examples from issue #8991 to tests
Both of these tests pass on `cassandra 4.0` but fail on `scylla 4.4.3`

The first test checks that selecting values from an indexed table using only the clustering key returns correct values.
The second test checks that performing this operation requires filtering.

The filtering test looks similar to [the one for #7608](1924e8d2b6/test/cql-pytest/test_allow_filtering.py (L124)) but there are some differences - here the table has two clustering columns and an index, so it could test different code paths.

Contains a quick fix for the `needs_filtering()` function to make these tests pass.
It returns `true` for this case and the one described in #7708.

This implementation is a bit conservative - it might sometimes return `true` where filtering isn't actually needed, but at least it prevents scylla from returning incorrect results.

Fixes #8991.
Fixes #7708.

Closes #8994

* github.com:scylladb/scylla:
  cql3: Fix need_filtering on indexed table
  cql-pytest: Test selecting using only clustering key requires filtering
  cql-pytest: Test selecting from indexed table using clustering key
2021-07-19 18:23:08 +03:00
Tomasz Grabiec
049a1ef729 Merge 'flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler' from enedil
The downgrade_to_v1 didn't reset the state of range tombstone assembler
in case of the calls to next_partition or fast_forward_to, which caused
a situation where the closing range tombstone change is cleared from the
buffer before being emitted, without notifying the assembler. This patch
fixes the behaviour in fast_forward_to as well.

Fixes #9022

Closes #9023

* github.com:scylladb/scylla:
  flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler
  flat_mutation_reader: introduce public method returning the default size of internal buffer.
2021-07-19 17:10:23 +02:00
Jan Ciolek
54149242b4 cql3: Fix need_filtering on indexed table
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-07-19 16:22:17 +02:00
Michał Radwański
67d99e02a7 flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler
The downgrade_to_v1 didn't reset the state of range tombstone assembler
in case of the calls to next_partition or fast_forward_to, which caused
a situation where the closing range tombstone change is cleared from the
buffer before being emitted, without notifying the assembler. This patch
fixes the behaviour in fast_forward_to as well.

Fixes #9022
2021-07-19 15:54:26 +02:00
Michał Radwański
c4089007a2 flat_mutation_reader: introduce public method returning the default size
of internal buffer.

This method is useful in tests that examine behaviour after the buffer
has been filled up.
2021-07-19 15:54:13 +02:00
Nadav Har'El
4c6dc5fce2 Merge 'continuous_data_consumer: properly skip bytes at the end of a range' from Wojciech Mitros
When skipping bytes at the end of a continuous_data_consumer range,
the position of the consumer is moved after the skipped bytes, but
the position of the underlying input_stream is not.

This patch adds skipping of the underlying input_stream, to make
its position consistent with the position of the consumer.
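
The position mismatch can be sketched as follows. `RangeConsumer` and its methods are hypothetical stand-ins, and `io.BytesIO` plays the role of the underlying input stream; this is not the continuous_data_consumer interface.

```python
import io


class RangeConsumer:
    """Hypothetical consumer that tracks its own position over a stream."""

    def __init__(self, stream, start: int, end: int):
        self.stream = stream
        self.position = start
        self.end = end
        stream.seek(start)

    def read(self, n: int) -> bytes:
        n = min(n, self.end - self.position)
        data = self.stream.read(n)
        self.position += len(data)
        return data

    def skip_consumer_only(self, n: int) -> None:
        # Behaviour described as the bug: only the consumer position moves.
        n = min(n, self.end - self.position)
        self.position += n

    def skip_both(self, n: int) -> None:
        # Behaviour described as the fix: the underlying stream is skipped too.
        n = min(n, self.end - self.position)
        self.stream.seek(n, io.SEEK_CUR)
        self.position += n


stale = RangeConsumer(io.BytesIO(b"0123456789"), 0, 10)
stale.read(4)
stale.skip_consumer_only(6)   # consumer is at the end, stream is not

synced = RangeConsumer(io.BytesIO(b"0123456789"), 0, 10)
synced.read(4)
synced.skip_both(6)           # both positions agree
```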

Fixes #9024

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #9039

* github.com:scylladb/scylla:
  tests: add test for skipping bytes at end of consumer
  continuous_data_consumer: properly skip bytes at the end of a range
2021-07-19 15:57:26 +03:00
Botond Dénes
27fbca84f6 reader_concurrency_semaphore: remove prethrow_action
The semaphore accepts a functor in its constructor which is run just
before throwing on wait queue overload. This is used exclusively to bump
a counter in the database::stats, which counts queue overloads. However,
there is now an identical counter in
reader_concurrency_semaphore::stats, so the database can just use that
directly and we can retire the now unused prethrow action.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210716111105.237492-1-bdenes@scylladb.com>
2021-07-19 15:47:37 +03:00
Wojciech Mitros
507bdfc36a tests: add test for skipping bytes at end of consumer
The new test confirms that the regression, where
we didn't correctly skip bytes at the end of a
continuous_data_consumer range, is fixed.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-19 14:42:38 +02:00
Wojciech Mitros
7107e32390 continuous_data_consumer: properly skip bytes at the end of a range
When skipping bytes at the end of a continuous_data_consumer range,
the position of the consumer is moved after the skipped bytes, but
the position of the underlying input_stream is not.

This patch adds skipping of the underlying input_stream, to make
its position consistent with the position of the consumer.

Fixes #9024

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-19 11:43:30 +02:00
Piotr Sarna
38afef71b9 Merge 'Service Level Controller: Stop polling distributed data..
... when decommissioned (reworked)' from Eliran Sinvani

This is a rework of #8916. The polling loop of the service level
controller queries a distributed table in order to detect configuration
changes. If a node gets decommissioned, this loop continues to run until
shutdown. If a node stays in the decommissioned mode without being shut
down, the loop will fail to query the table, which results in
warnings and eventually errors in the log. This is not really harmful,
but it adds unnecessary noise to the log. The series below lays the
infrastructure for observing storage service state changes, which is
eventually used to break the loop upon preparation for
decommissioning. Tests: Unit test (dev) Failing tests in jenkins.

Fixes #8836

The previous merge (possibly due to conflict resolution) contained a
misplaced get that caused an abort on shutdown.

Closes #9035

* github.com:scylladb/scylla:
  Service Level Controller: Stop configuration polling loop upon leaving the cluster
  main: Stop using get_local_storage_service in main
2021-07-19 10:52:42 +02:00
Benny Halevy
3700702e90 cmake: update compaction source files location
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210718120906.701185-1-bhalevy@scylladb.com>
2021-07-19 11:47:35 +03:00
Botond Dénes
5aa733f933 sstables/mx/writer: initialize _range_tombstones at the end of the ctor
We need a permit to initialize said object, which marks the semaphore
as used and hence triggers an error if an exception is thrown in the
constructor. Move the initialization to the end of the constructor to
prevent this.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210719040449.9202-1-bdenes@scylladb.com>
2021-07-19 11:43:00 +03:00
Calle Wilund
4990ba2769 commitlog: Make allocate_when_possible a template
And call it by-value with the polymorphic writers. This
eliminates outer coroutine frame and ensures we use only one
for fast-case allocation.
2021-07-19 08:27:30 +00:00
Calle Wilund
69ead0e658 commitlog: break fast path alloc into non-fut/corout + outer loop
Removes 2 coroutine frames in fast path (as long as segment + space is
avail). Puts IPS back on track with master.
2021-07-19 08:27:30 +00:00
Calle Wilund
62acc84e58 commitlog: Drop stream/subscription from replayer
Change args to values so stays on coroutine frame.
Remove pointless subscription/stream usage, just iterate.
2021-07-19 08:27:30 +00:00
Calle Wilund
5e8af28da7 commitlog: coroutinize commitlog::read_log_file 2021-07-19 08:27:30 +00:00
Calle Wilund
b3c35f9ec0 commitlog: coroutinize commitlog::create_commitlog 2021-07-19 08:27:30 +00:00
Calle Wilund
ef471d0a93 commitlog: coroutinize commitlog::add_entries 2021-07-19 08:27:30 +00:00
Calle Wilund
96434b1b12 commitlog: coroutinize commitlog::add_entry 2021-07-19 08:27:30 +00:00
Calle Wilund
e16cff6952 commitlog: coroutinize commitlog::add 2021-07-19 08:27:30 +00:00
Calle Wilund
da360fb841 commitlog: change entry_writer usage to reference
Calling frames keeps object alive in all paths. Use references in
allocate()/allocate_when_possible()
2021-07-19 08:27:30 +00:00
Calle Wilund
42bfae513a commitlog: coroutinize segment_manager::clear 2021-07-19 08:27:30 +00:00
Calle Wilund
554a09baab commitlog: coroutinize segment_manager::do_pending_deletes 2021-07-19 08:27:30 +00:00
Calle Wilund
9e18cf3f5f commitlog: coroutinize segment_manager::delete_file 2021-07-19 08:27:30 +00:00
Calle Wilund
ca65387c53 commitlog: coroutinize segment_manager::shutdown 2021-07-19 08:27:30 +00:00
Calle Wilund
4678d1fbec commitlog: coroutinize segment_manager::shutdown_all_segments 2021-07-19 08:27:30 +00:00
Calle Wilund
2f048e658b commitlog: coroutinize segment_manager::sync_all_segments 2021-07-19 08:27:30 +00:00
Calle Wilund
ad4e4e9ee4 commitlog: coroutinize segment_manager::clear_reserve_segments 2021-07-19 08:27:30 +00:00
Calle Wilund
ec430807fc commitlog: coroutinize segment_manager::active_segment 2021-07-19 08:27:30 +00:00
Calle Wilund
13bba1ef39 commitlog: coroutinize segment_manager::new_segment 2021-07-19 08:27:30 +00:00
Calle Wilund
ccd34203dc commitlog: coroutinize segment_manager::allocate_segment 2021-07-19 08:27:30 +00:00
Calle Wilund
f5de830f0c commitlog: coroutinize segment_manager::rename_file 2021-07-19 08:27:30 +00:00
Calle Wilund
011bc68209 commitlog: coroutinize segment_manager::init 2021-07-19 08:27:30 +00:00
Calle Wilund
04c725b29c commitlog: coroutinize segment_manager::list_descriptors 2021-07-19 08:27:30 +00:00
Calle Wilund
d514fc5822 commitlog: coroutinize segment_manager::replenish_reserve 2021-07-19 08:27:30 +00:00
Jan Ciolek
9bd62a07c9 cql-pytest: Test selecting using only clustering key requires filtering
Adds test that creates a table with primary key (p, c1, c2)
with a global index on c2 and then selects where c1 = 1 and c2 = 1.

This should require filtering, but doesn't.
Refs #8991.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-07-19 10:24:48 +02:00
Jan Ciolek
a041767aa3 cql-pytest: Test selecting from indexed table using clustering key
Adds test that creates a table with primary key (p, c1, c2)
with a global index on c2 and then selects where c1 = 1 and c2 = 1.

This currently fails.
Refs #8991.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-07-19 10:24:46 +02:00
Calle Wilund
d4bd17d577 commitlog: coroutinize segment::shutdown 2021-07-19 08:17:33 +00:00
Calle Wilund
e9820827e3 commitlog: coroutinize segment::close 2021-07-19 08:17:33 +00:00
Calle Wilund
999701a8ee commitlog: coroutinize segment::batch_cycle 2021-07-19 08:17:33 +00:00
Calle Wilund
cef7ee2014 commitlog: coroutinize segment::do_flush 2021-07-19 08:17:33 +00:00
Calle Wilund
1a76d735f2 commitlog: coroutinize segment::flush 2021-07-19 08:17:33 +00:00
Calle Wilund
0b1e2084ce commitlog: coroutinize segment::cycle 2021-07-19 08:17:33 +00:00
Calle Wilund
79b9cb1e5c commitlog: coroutinize allocate_when_possible 2021-07-19 08:17:33 +00:00
Calle Wilund
e545b382bd commitlog: coroutinize segment::allocate 2021-07-19 08:17:33 +00:00
Avi Kivity
2cfc517874 main, test: adjust number of networking iocbs
Seastar's default limit of 10,000 iocbs per shard is too low for
some workloads (it places an upper bound on the number of idle
connections, above which a crash occurs). Use the new Seastar
feature to raise the default to 50,000.

Also multiply the global reservation by 5, and round it upwards
so the number is less weird. This prevents io_setup() from failing.

For tests, the reservation is reduced since they don't create large
numbers of connections. This reduces surprise test failures when they
are run on machines that haven't been adjusted.

Fixes #9051

Closes #9052
2021-07-18 14:38:44 +03:00
Avi Kivity
9c3f8028f1 Update tools/java submodule (SLES 15)
* tools/java 79a441972d...4ef8049e07 (1):
  > dist/redhat: change PyYAML filepath to allow installing on SLES15

Fixes #9045.
2021-07-18 14:24:42 +03:00
Raphael S. Carvalho
841e9227f9 table: Document the serialization requirement on sstable set rebuild
In order to avoid data loss bugs, that could come due to lack of
serialization when using the preemptable build_new_sstable_list(),
let's document the serialization requirement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210714201301.188622-1-raphaelsc@scylladb.com>
2021-07-17 18:09:00 +03:00
Avi Kivity
df822e09e0 Merge "Run test cases in parallel" from Pavel E
"
The debug-mode tests nowadays take ~1 hour to complete on a
24-cores threadripper machine. This is mostly because of a bunch
of individual test cases that run sequentially (since they sit
in one test) each taking half-an-hour and longer.

The previous attempt was to break the longest tests into pieces,
and to update the list of long-running tests in the suite.yaml file,
but the concern was that the linkage time and disk space would
grow without limits if this continued. Also, the long-running tests
list needs to be revisited every so often.

So the new attempt is to resurrect Avi's patch that ran test
cases in parallel for boost tests. This set applies parallelism
to all tests and allows blacklisting those that shouldn't be
parallelized (the logalloc test needs the very first case to
prime_segment_pools so that other cases run smoothly, thus it
cannot be parallelized).

Although this wild parallelism adds an overhead for _each_ test
case, it is good enough even for short dev-mode tests (saves
25% of runtime), and it greatly relaxes the maintenance of the
"parallelizable list of tests".

For debug tests the problem is not 100% solved. There are 6 cases
that run longer than 30min, while all the others complete much,
much faster. So, excluding those slow 6 cases, the full parallel
run saves 50+% of the runtime: 60+m now vs 25m with the patch.
Those 6 slowest cases will need more incremental care.

The --parallel-cases mode is not yet default, because it requires
larger max-aio-nr value to be set, which is not (yet?) automatic.
Also it sometimes hits nr-open-files limit, which also needs more
work.

tests: unit(dev), unit(debug)
"

* 'br-parallel-testpy-3' of https://github.com/xemul/scylla:
  tests: Update boost long tests list
  test.py: Parallelize test-cases run (for boost tests)
  test.py: Prepare BoostTest for running individual cases
  test.py: Prepare TestSuite::create_test() for parallelism
  test.py: Treat shortname as composite
  test.py: Reformat tabular output
2021-07-17 13:57:56 +03:00
Pavel Emelyanov
1ed582304d memtable_list: Shorten flush coalescing codeflow
The memtable_list::flush() maintains a shared_promise object
to coalesce the flushers until the get_flush_permit() resolves.
Also it needs to keep the extraneous flushes counter bumped
while doing the flush itself.

All this can be coded in a shorter form and without the need
to carry shared_promise<> around.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210716164237.10993-1-xemul@scylladb.com>
2021-07-17 00:42:20 +02:00
Avi Kivity
3058c42171 Update seastar submodule
* seastar 8ed9771ae9...ef320940c2 (6):
  > reactor: reactor_backend_aio: allow tuning number of network iocbs
Ref #9051.
  > aio_general_context: flush: handle io_submit short return
  > aio_general_context: prevent overflow
  > file: Do not assume nowait_works by default
  > Merge "reactor: use sched_clock consistently" from Michael
  > testing: Lazily create seastar::app thread
2021-07-16 18:07:10 +03:00
Pavel Emelyanov
9d59f1daf3 tests: Update boost long tests list
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:25:07 +03:00
Pavel Emelyanov
cbb4837b77 test.py: Parallelize test-cases run (for boost tests)
The parallelism is achieved by listing the content of each (boost)
test and by adding a test for each case found, appending the
'--run_test={case_name}' option.
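
A minimal sketch of the expansion step, assuming the case names have already been obtained by listing the test binary's content. The helper name and the binary path in the usage lines are hypothetical; only the `--run_test=` option comes from the message above.

```python
def per_case_commands(binary, case_names, base_args=()):
    """Build one command line per test case of a single boost test binary."""
    return [[binary, *base_args, f"--run_test={case}"] for case in case_names]


# Hypothetical binary path; case names taken from the sample output
# shown in the "Reformat tabular output" commit.
commands = per_case_commands(
    "build/dev/test/raft/etcd_test",
    ["test_progress_leader", "test_vote_from_any_state"],
)
```

Each resulting command can then be scheduled as an independent test entry, which is what lets the cases of one binary run concurrently.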

Also, a few tests (logalloc and memtable) have cases that depend on
each other (the former explicitly states this in the head comment),
so these are marked as "no_parallel_cases" in the suite.yaml file.

In dev mode tests need 2m 5s to run by default. With parallelism
(and updated long-running tests list) -- 1m 35s.

In debug mode there are 6 slow _cases_ that overrun 30 minutes.
They finish last and deserve some special (incremental) care. All
the other tests run ~1h by default vs ~25m in parallel.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:25:07 +03:00
Pavel Emelyanov
3cac5173b7 test.py: Prepare BoostTest for running individual cases
This means adding the casename argument to its describing class
and handling it:

1. appending to the shortname
2. adding the --run_test= argument to boost args

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:25:07 +03:00
Pavel Emelyanov
0baee5d423 test.py: Prepare TestSuite::create_test() for parallelism
The method in question is in charge of creating a single
entry in the list of tests to be run. The BoostTestSuite's
method is about to create several entries and this patch
prepares it for this:

- makes it distinguish individual arguments
- lets it select the test.id value itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:25:07 +03:00
Pavel Emelyanov
a547677502 test.py: Treat shortname as composite
When running tests in parallel-cases mode the test.uname must
include the case name to make different log and xml files for
different runs and to show which exact case is run when shown
by the tabular-output. At the same time the test shortname
identifies the binary with the whole test.

This patch makes class Test treat the shortname argument as
a dot-separated string where the 0th component is the binary
with the test and the rest is how test identifies itself.
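
The composite shortname could be split along these lines. This is a sketch, and `split_shortname` is a hypothetical helper rather than the code in test.py.

```python
def split_shortname(shortname: str):
    """Split 'binary.rest' into the test binary name and the remaining identity."""
    binary, _, rest = shortname.partition(".")
    return binary, rest
```

For example, `split_shortname("etcd_test.test_progress_leader")` separates the binary `etcd_test` from the case name, while a plain `fsm_test` yields an empty remainder.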

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:25:07 +03:00
Pavel Emelyanov
f188dd3396 test.py: Reformat tabular output
This change solves several issues that would arise with the
case-by-case run.

First, the currently printed name is "$binary_name.$id". For a
case-by-case run the binary name would coincide for many cases
and it would be inconvenient to identify the test case. So
the test's uname is printed instead.

Second, the test's uname doesn't contain the suite name (unlike the
test binary name, which does), so this patch also adds the
explicit suite name back as a separate column (like MODE).

Third, the testname + casename string length will be far above
the allocated 50 characters, so the test name is moved at the
tail of the line.

Fourth, the total number of cases is 2100+, the field of 7
characters is not enough to print it, so it's extended.

Finally the test.py output would look like this for parallel run:
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/2108]     raft     dev   [ PASS ] etcd_test.test_progress_leader.40 0.06s
[2/2108]     raft     dev   [ PASS ] etcd_test.test_vote_from_any_state.45 0.03s
[3/2108]     raft     dev   [ PASS ] etcd_test.test_progress_flow_control.43 0.04s
[4/2108]     raft     dev   [ PASS ] etcd_test.test_progress_resume_by_append_resp.41 0.05s
[5/2108]     raft     dev   [ PASS ] etcd_test.test_leader_election_overwrite_newer_logs.44 0.04s
[6/2108]     raft     dev   [ PASS ] etcd_test.test_progress_paused.42 0.05s
[7/2108]     raft     dev   [ PASS ] etcd_test.test_log_replication_2.47 0.06s
...

or like this for regular:
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/184]      raft     dev   [ PASS ] fsm_test.41 0.06s
[2/184]      raft     dev   [ PASS ] etcd_test.40 0.06s
[3/184]      cql      dev   [ PASS ] cassandra_cql_test.2 1.87s
[4/184]      unit     dev   [ PASS ] btree_stress_test.30 1.82s
...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-16 17:24:36 +03:00
Tomasz Grabiec
97aa335a60 Merge "test: raft: randomized_nemesis_test: refactors and improvements" from Kamil
A couple of improvements to prepare for the next patchset.

We move `logical_timer` and `ticker` to their own headers due to the
generality of these data structures. They are not very specific to the
test.

`logical_timer` is extended with a `schedule` function, allowing any
given function to be scheduled for a call at the given time point.
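
The idea of scheduling a callback at a logical time point can be sketched as below. `LogicalTimer` here is a hypothetical Python analogue built on a heap; the method names and the argument order are assumptions, not the C++ `logical_timer` interface.

```python
import heapq
import itertools


class LogicalTimer:
    """Hypothetical logical clock that runs callbacks when their time arrives."""

    def __init__(self):
        self._now = 0
        self._queue = []                 # heap of (time_point, sequence, callback)
        self._seq = itertools.count()    # tie-breaker keeps insertion order stable

    def now(self) -> int:
        return self._now

    def schedule(self, time_point: int, fn) -> None:
        heapq.heappush(self._queue, (time_point, next(self._seq), fn))

    def tick(self) -> None:
        # Advance logical time by one step and run everything that became due.
        self._now += 1
        while self._queue and self._queue[0][0] <= self._now:
            _, _, fn = heapq.heappop(self._queue)
            fn()


events = []
timer = LogicalTimer()
timer.schedule(2, lambda: events.append("second"))
timer.schedule(1, lambda: events.append("first"))
timer.tick()
timer.tick()
```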

The interface of `network` in `randomized_nemesis_test` is extended by
`add_grudge` and `remove_grudge` functions for implementing network
partitioning nemeses.
Furthermore `network` can be now constructed with an arbitrary network
delay, which was previously hardcoded.

`with_env_and_ticker` is now generic w.r.t. return values (previously
`future<>` was assumed).

`environment` exposes a reference to the `network` through a getter.

The `not_a_leader` exception now shows the leader's ID in the exception
message. Useful for logging.

In `logical_timer::with_timeout`, when we time out, we don't just return
`timed_out_error`. The returned exception now actually contains the
original future... well almost; in any case, the user can now do
something different to the future other than simply discarding it.

We also fix some `broken_promise` exceptions appearing in discarded
futures in certain scenarios. See the corresponding commit for detailed
explanation.

We handle `raft::dropped_entry` in the `call` function.

`persistence` is fixed to avoid creating gaps in the log when storing
snapshots and to support complex state types.

Waiting for leader was refactored into a separate function and
generalized (we wait for a set of nodes to elect a leader instead of a
single node to elect itself) to be useful in more situations.

Finally, we introduce `reconfigure`, a higher-level version of
`set_configuration` which performs error handling and supports timeouts.

* kbr/raft-nemesis-improvements-v4:
  test: raft: randomized_nemesis_test: `reconfigure` function
  test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function
  test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots
  test: raft: randomized_nemesis_test: persistence: handle complex state types
  test: raft: randomized_nemesis_test: `call`: handle `raft::dropped_entry`
  test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels
  test: raft: randomized_nemesis_test: environment: expose the network
  test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold
  test: raft: randomized_nemesis_test: generalize `with_env_and_ticker`
  test: raft: randomized_nemesis_test: network: `add_grudge`, `remove_grudge` functions
  test: raft: randomized_nemesis_test: move `ticker` to its own header
  test: raft: randomized_nemesis_test: ticker: take `logger` as a constructor parameter
  test: raft: logical_timer: handle immediate timeout
  test: raft: logical_timer: on timeout, return the original future in the exception
  test: raft: logical_timer: add `schedule` member function
  test: raft: randomized_nemesis_test: move `logical_timer` to its own header
  test: raft: include the leader's ID in the `not_a_leader` exception's message
2021-07-16 16:12:05 +02:00
Benny Halevy
a44c06d776 storage_proxy: query: log also errors
If trace-level logging is enabled, also log errors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210712070509.24102-1-bhalevy@scylladb.com>
2021-07-16 16:12:05 +02:00
Nadav Har'El
5183e0cbe9 Merge 'Fix artificial view update size limit' from Piotr Sarna
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

The series comes with a cql-pytest which failed before the fix and succeeds now. This bug is also covered by `wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view` dtest, but it needs over a minute to run, as opposed to cql-pytest's <1 second.

Fixes #9047

Tests: unit(release), dtest(wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view)

Closes #9048

* github.com:scylladb/scylla:
  cql-pytest: add a materialized views suite with first cases
  db,view: drop the artificial limit on view update mutation size
2021-07-15 17:03:07 +03:00
Piotr Sarna
c05340c4bf cql-pytest: add a materialized views suite with first cases
cql-pytest did not have a suite for materialized views, so one is
created. At the same time, test cases for building/updating a view on
a base table with large cells are added as a regression test for #9047.
2021-07-15 15:40:38 +02:00
Piotr Sarna
697e2fc66d db,view: drop the artificial limit on view update mutation size
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

Tests: unit(release), dtest(wide_rows_test)
2021-07-15 14:09:37 +02:00
Tomasz Grabiec
1f255c420e flat_mutation_reader_v2: Make is_end_of_stream() reflect consumer-side state of the stream
Currently, flat_mutation_reader_v2::is_end_of_stream() returns
flat_mutation_reader_v2::impl::_end_of_stream, which means the producer
is done. The stream may not yet be fully consumed even if the
producer is done, due to internal buffering. So consumers need to make
a more elaborate check:

  rd.is_end_of_stream() && rd.is_buffer_empty()

It would be cleaner if flat_mutation_reader_v2::is_end_of_stream()
returned the state of the consumer-side of the stream, since it
belongs to the consumer-side of the API. The consumption will be as
simple as:

  while (!rd.is_end_of_stream()) {
    consume_fragment(rd());
  }
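
  The difference between the two semantics can be sketched in Python. The
  `BufferedReader` class below is a hypothetical model in which the producer
  has already pushed all fragments into the internal buffer; it is not the
  flat_mutation_reader_v2 interface.

```python
from collections import deque


class BufferedReader:
    """Hypothetical reader whose producer finished with fragments still buffered."""

    def __init__(self, fragments):
        self._buffer = deque(fragments)
        self._producer_done = True

    def producer_is_done(self) -> bool:
        # Old semantics: reflects only the producer side.
        return self._producer_done

    def is_buffer_empty(self) -> bool:
        return not self._buffer

    def is_end_of_stream(self) -> bool:
        # New semantics: true only when nothing is left for the consumer.
        return self._producer_done and not self._buffer

    def pop(self):
        return self._buffer.popleft()


rd = BufferedReader(["a", "b", "c"])
done_before_consuming = rd.producer_is_done()
consumed = []
while not rd.is_end_of_stream():
    consumed.append(rd.pop())
```

  Under the old semantics the reader already reports completion before any
  fragment has been consumed, which is why the extra buffer check was needed.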

This patch makes the change on the v2 of the reader interface. v1 is
not changed to avoid problems which could happen when backporting code
which assumes new semantics into a version with the old semantics. v2
is not in any old branch yet so it doesn't have this problem and it's
a good time to make the API change.

Note that it's always safe to use the new semantics in the context
which assumes the old semantics, so v1 users can be safely converted
to v2 even if they are unaware of the change.

Fixes #3067

Message-Id: <20210715102833.146914-1-tgrabiec@scylladb.com>
2021-07-15 14:00:48 +03:00
Calle Wilund
b8b5f69111 messaging_service: Bind to listen address, not broadcast
Refs #8418

Broadcast can (apparently) be an address not actually on the machine, but
on the other side of NAT. Thus binding the local side of an outgoing
connection there will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast).
This requires routing + NAT to make the connection appear, to the
node being connected to, as coming from the broadcast address, so that
the connection is allowed (if using partial encryption).

Note: this has been verified only to a limited extent. I would suggest
verifying various multi rack/dc setups before relying on it.

Closes #8974
2021-07-15 13:18:10 +03:00
Tomasz Grabiec
21f1a7be8b sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
Reads which bypass cache will use a private temporary instance of cached_file
which dies together with the index cursor.

The cursor still needs a cached_file with a caching layer. Binary searching needs
caching for performance, since some of the pages will be reused. Another reason to still
use cached_file is to work with a common interface, and reusing it requires minimal changes.
2021-07-15 12:14:28 +02:00
Tomasz Grabiec
f4227c303b sstables: Do not populate partition index cache for "bypass cache" reads
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.

Promoted index scanner (ka/la format) will not go through the page cache.
2021-07-15 12:13:20 +02:00
Avi Kivity
ed6c01a9fa test: increase timeout to account for flat_mutation_reader_v2 tests
Since fce124bd90 ("Merge "Introduce flat_mutation_reader_v2" from
Tomasz") tests involving mutation_reader are a lot slower due to
the new API testing. On slower machines it's enough to time out.

Work is underway to improve the situation, and it will also revert back
to the original timing once the flat_mutation_reader_v2 work is done,
but meanwhile, increase the timeout.

Closes #9046
2021-07-15 12:33:43 +03:00
Avi Kivity
1643549d08 Merge 'Coroutinize the sstable reader' from Wojciech Mitros
This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one.

This patch makes the main sstable reader process a coroutine,
which allows simplifying it by:

- using the state saved in the coroutine instead of most of the states saved in the _state variable
- removing the switch statement and moving the code of former switch cases, resulting in reduced number of jumps in code
- removing repetitive ifs for read statuses, by adding them to the coroutine implementation

The coroutine is saved in a new class ```processing_result_generator```, which works like a generator: using its ```generate()``` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine (```do_process_state()```).

Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines.

However, usage of c++ coroutines has a non-negligible effect on the performance of the sstable reader.
In the test cases from ```perf_fast_forward``` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this loss is seen in cases where we're reading many subsequent rows, without any skips.
Thanks to finding an optimization during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement.
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt)

Test: unit(dev)
Refs: #7952

Closes #9002

* github.com:scylladb/scylla:
  mx sstable reader: reduce code blocks
  mx sstable reader: make ifs consistent
  sstable readers: make awaiter for read status
  mx sstable reader: don't yield if the data buffer is not empty
  mx sstable reader: combine FLAGS and FLAGS_2 states
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace non_consuming states with a bool
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace unnecessary states with a placeholder
  mx sstable reader: remove false if case
  mx sstable reader: remove row_body_missing_columns_label
  mx sstable reader: remove row_body_deletion_label
  mx sstable reader: remove column_end_label
  mx sstable reader: remove column_cell_path_label
  mx sstable reader: remove column_ttl_label
  mx sstable reader: remove column_deletion_time_label
  mx sstable reader: remove complex_column_2_label
  mx sstable reader: remove row_body_missing_columns_read_columns_label
  mx sstable reader: remove row_body_marker_label
  mx sstable reader: remove row_body_shadowable_deletion_label
  mx sstable reader: remove row_body_prev_size_label
  mx sstable reader: remove ck_block_label
  mx sstable reader: remove ck_block2_label
  mx sstable reader: remove clustering_row_label and complex_column_label
  mx sstable reader: remove labels with only one goto
  mx sstable reader: replace the switch cases with gotos and a new label
  mx sstable reader: remove states only reached consecutively or from goto
  mx sstable reader: remove switch breaks for consecutive states
  mx sstable reader: convert readers main method into a coroutine
  kl sstable reader: replace states for ending with one state, simplify non_consuming
  kl sstable reader: remove unnecessary states
  kl sstable reader: remove unnecessary yield
  kl sstable reader: remove unnecessary blocks
  kl sstable reader: fix indentation
  kl sstable reader: replace switch with standard flow control
  kl sstable reader: remove state::CELL case
  kl sstable reader: move states code only reachable from one place
  kl sstable reader: remove states only reached consecutively
  kl sstable reader: remove switch breaks for consecutive states
  kl sstable reader: remove unreachable case
  kl sstable reader: move testing hack for fragmented buffers outside the coroutine
  kl sstable reader: convert readers main method into a coroutine
  sstable readers: create a generator class for coroutines
2021-07-15 12:06:14 +03:00
Wojciech Mitros
45058776c2 mx sstable reader: reduce code blocks
Some blocks of code were surrounded by curly braces, because
a variable was declared inside a switch case. After changes,
some of the variable declarations are in if/else/while cases,
and no longer need to be in separate code blocks, while other
blocks can be extended to entire labels for simplicity.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
9b333908e4 mx sstable reader: make ifs consistent
In several places we're checking the return value of our
consumers' consume_* calls. Because the behaviour in all cases
is the same, let us use the same notation as well.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
dc38605f75 sstable readers: make awaiter for read status
After each read* call of the primitive_consumer we need to check
if the entire primitive was in our current buffer. We can check it
in the proceed_generator object by yielding the returned read status:
if the yielded status is ready, the yield_value method returns
a structure whose await_ready() method returns true. Otherwise it
returns false.
The returned structure is co_awaited by the coroutine (due to co_yield).
If await_ready() returns true, the coroutine isn't stopped.
Conversely, if it returns false (technical: and because its await_suspend
method returns void), the coroutine stops, and a proceed::yes value
is saved, indicating that we need more buffers.
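
The conditional suspension described above can be approximated with a
Python generator. This is only an analogue, not the C++ awaiter from the
patch: `await_status`, `parse`, and the string constants are hypothetical
names, and Python has no direct equivalent of await_ready().

```python
READY = "ready"              # stand-in: the whole primitive was in the buffer
WAITING = "waiting"          # stand-in: the primitive spans into the next buffer
PROCEED_YES = "proceed::yes"


def await_status(status):
    """Suspend (by yielding proceed::yes) only when the read was incomplete."""
    if status != READY:
        yield PROCEED_YES


def parse(statuses):
    for status in statuses:
        # One line replaces a repeated "if not ready: yield proceed::yes" check.
        yield from await_status(status)
    yield "done"
```

When the status is ready, `await_status` yields nothing and `parse`
continues without being suspended, which mirrors await_ready() returning
true.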
2021-07-14 20:50:30 +02:00
Wojciech Mitros
09a0cd7c05 mx sstable reader: don't yield if the data buffer is not empty
The skip() method returns a skip_bytes object if we want to
skip the entire buffer, otherwise it returns a proceed::yes
and trims the buffer.

If the buffer is only trimmed we don't need to interrupt
the coroutine, we simply continue instead.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
5dc64532bd mx sstable reader: combine FLAGS and FLAGS_2 states
We don't differentiate between FLAGS and FLAGS_2 in
verify_end_state(), so we can merge them into one state.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
ab1e6f4211 mx sstable reader: reduce placeholder state usage
After the changes to non_consuming states, we can
remove some state::OTHER assignments again.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
c904ab12c8 mx sstable reader: replace non_consuming states with a bool
The non_consuming() method is only used after assuring that
primitive_consumer::active() (in continuous_data_consumer::process())
so we don't need states where primitive_consumer::active(), which
is most of them.

We still need to make sure that the states change when they need to,
so we replace all the concerned states with the placeholder state,
and for the few states from the non_consuming() OR, where the
primitive_consumer::active() returns true, we set the value of
_consuming to false, changing it back when the state is no longer
non_consuming.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
b05d3eefed mx sstable reader: reduce placeholder state usage
We can remove state assignments that we know are
changing a state to itself.

Similarly, if a state is changed in the same way
in an if and an else, it can be changed before the
if/else instead.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
b2e3fbffd0 mx sstable reader: replace unnecessary states with a placeholder
After removing the switch, the state is only used for
verify_end_state() and non_consuming(), so we can
replace states that are not used there with a single
one, so that the state still stops being one of the
appearing states when it needs to.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
9a7a8fa86c mx sstable reader: remove false if case
consume_row_marker_and_tombstone does not return proceed::no in the
mp_row_consumer_m implementation, and even if it did, we would most
likely want to yield proceed::no in that case as well.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
2262aac11a mx sstable reader: remove row_body_missing_columns_label
row_body_missing_columns_label is only reached from one
goto, or consecutively, so the code omitted by goto can be
omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
99b5a332db mx sstable reader: remove row_body_deletion_label
row_body_deletion_label is only reached from one
goto, or consecutively, so the code omitted by goto can be
omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
cbce22a88b mx sstable reader: remove column_end_label
column_end_label is only reached from one goto, or consecutively,
so the code omitted by goto can be omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
925d921cb4 mx sstable reader: remove column_cell_path_label
column_cell_path_label is only reached from two gotos, both
at the end of an if/else block, or consecutively, so the code
after the if/else block can be omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
e85987a439 mx sstable reader: remove column_ttl_label
column_ttl_label is only reached from two gotos, both
at the end of an if/else block, or consecutively, so the code
after the if/else block can be omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
4b3607e97b mx sstable reader: remove column_deletion_time_label
column_deletion_time_label is only reached from one
goto, or consecutively, so the code omitted by goto can be
omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
8cf23c3b01 mx sstable reader: remove complex_column_2_label
complex_column_2_label is only reached from one goto, or consecutively,
so the code omitted by goto can be omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
fbe28d18f3 mx sstable reader: remove row_body_missing_columns_read_columns_label
row_body_missing_columns_read_columns_label is only reached
consecutively, or from a goto after the label. This is changed to a
while loop starting at the label and ending at the goto.

The code executed in the only case we do not reach the goto (so
when exiting the loop) is moved after the while.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
3b512ea2c2 mx sstable reader: remove row_body_marker_label
row_body_marker_label is only reached from one goto inside an else
case, or consecutively, so the code omitted by goto can be moved
inside the corresponding if case.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
0bcde69319 mx sstable reader: remove row_body_shadowable_deletion_label
row_body_shadowable_deletion_label is only reached from one
goto, or consecutively, so the code omitted by goto can be
omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
3d0fdf9f3b mx sstable reader: remove row_body_prev_size_label
row_body_prev_size_label is only reached consecutively, or from
a goto not far after the label. This is changed to a while loop
starting at the label and ending at the goto.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
b27166c36f mx sstable reader: remove ck_block_label
ck_block_label is only reached consecutively, or from
a few gotos not far after the label. This is changed
to a while loop with gotos replaced with continue's.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
ec6c2f0e07 mx sstable reader: remove ck_block2_label
ck_block2_label is only reached from one goto, or consecutively,
so the code omitted by goto can be omitted by an if instead (or else).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
1e59e249ec mx sstable reader: remove clustering_row_label and complex_column_label
clustering_row_label is only reached from one goto, or consecutively,
so the code omitted by goto can be omitted by an if instead (or else).

Also remove complex_column_label because it is next to
its only goto.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
440aba61a9 mx sstable reader: remove labels with only one goto
If a case is reached only after jumping with a single
goto, that goto may be replaced with the target code.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
65f7eb5ada mx sstable reader: replace the switch cases with gotos and a new label
Because the number of remaining cases is moderately low, and
after finishing a case we always enter another one, the switch
is removed completely, and the last remaining cases are handled
by 3 additional gotos and 1 new label.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
0398c68797 mx sstable reader: remove states only reached consecutively or from goto
If a state is never reached from the top of the switch, but only
by continuing from the previous case, we don't need to have a case:
for it.

Similarly, if there is a label that we goto, we don't need the
switch case.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
f87b27b9e4 mx sstable reader: remove switch breaks for consecutive states
If _state at the end of a switch case has the same value as the
next case, instead of breaking the switch, we can just fall through.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
32b996aca5 mx sstable reader: convert readers main method into a coroutine
(same as in kl sstable reader)
The function is converted to a coroutine simply by adding an
infinite loop around the switch, and starting another iteration
after yielding a value, instead of returning.

Because the coroutine resume() function does not take any arguments,
a new member is introduced to remember the "data" buffer, that was
previously an argument to the method.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
4816e8120b kl sstable reader: replace states for ending with one state, simplify non_consuming
After removing the switch, the only use for states in the sstable reader
are methods non_consuming() and verify_end_state().

The non_consuming() method is only used after assuring that
!primitive_consumer::active() (in continuous_data_consumer::process())
so we don't need states where primitive_consumer::active() is true for this
method, which is actually all of them.

We don't differentiate between ATOM_START and ATOM_START_2 in
verify_end_state(), so we can just merge them into one.

While we need to remember times when we enter states used in verify_end_state(),
we also need to remember when we exit them. For that reason we introduce a new
state "NOT_CLOSING", that fails all comparisons in verify_end_state(), and
replaces all states that aren't used in verify_end_state().
2021-07-14 20:50:30 +02:00
Wojciech Mitros
0c284a8b5e kl sstable reader: remove unnecessary states
After removing the switch, the state is only used for
verify_end_state() and non_consuming(), so we can
remove states that are not used there (and which do
not change them).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
35c30e6178 kl sstable reader: remove unnecessary yield
We don't need to yield row_consumer::proceed::yes if we are
not parsing a primitive using primitive_consumer, we can just
continue execution.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
97c7b5fe76 kl sstable reader: remove unnecessary blocks
Some blocks of code were surrounded by curly braces, because
a variable was declared inside a switch case. With standard
flow control, it's no longer needed.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
914e4f27e9 kl sstable reader: fix indentation
To simplify review, the code moved in previous commits
didn't change its indentation. This commit fixes it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
7a6729159f kl sstable reader: replace switch with standard flow control
We get rid of the switch by using the infinite loop around the
switch for jumping to the first case, adding an infinite loop
around the second case (one break from the switch with the
state of the first case becomes a break of the new while),
and adding an if around the first case (because we never break
in the first case).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
cfe6a46a60 kl sstable reader: remove state::CELL case
The CELL state is only set in the if/else block immediately
before the CELL case, so we don't need to have a case for it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
c41f49d2e5 kl sstable reader: move states code only reachable from one place
If a case is reached only after exiting a certain other case (or goto)
its code may as well be moved to that place.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
5f27413c1f kl sstable reader: remove states only reached consecutively
If a state is never reached from the top of the switch, but only
by continuing from the previous case, we don't need to have a case:
for it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
e226fc12c9 kl sstable reader: remove switch breaks for consecutive states
If _state at the end of a switch case has the same value as the
next case, instead of breaking the switch, we can just fall through.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
bc7ed3f596 kl sstable reader: remove unreachable case
The STOP_THEN_ATOM_START is never reached, so it can be
removed altogether.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
63d1a44d12 kl sstable reader: move testing hack for fragmented buffers outside the coroutine
The testing hack can't be done inside the coroutine, because
we don't have the original "data" buffer
2021-07-14 20:50:30 +02:00
Wojciech Mitros
6fff9aed3c kl sstable reader: convert readers main method into a coroutine
The function is converted to a coroutine simply by adding an
infinite loop around the switch, and starting another iteration
after yielding a value, instead of returning.

Because the coroutine resume() function does not take any arguments,
a new member is introduced to remember the "data" buffer, that was
previously an argument to the method.
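
The conversion can be pictured with a Python generator standing in for
the C++ coroutine. This is a simplified sketch, not the reader's actual
code: the `Reader` class, the `"need more"` marker, and the header/body
split are made up for illustration.

```python
class Reader:
    """Hypothetical reader; only the control flow mirrors the description."""

    def __init__(self):
        self._data = ""  # new member: remembers the current "data" buffer
        self._coroutine = self._do_process_state()

    def process(self, data):
        # Resuming a coroutine takes no arguments, so the buffer is stored
        # in a member before the coroutine is resumed.
        self._data = data
        return next(self._coroutine)

    def _do_process_state(self):
        while True:  # infinite loop: start another iteration after yielding
            header = self._data
            yield "need more"  # previously: save _state and return
            body = self._data
            yield ("row", header, body)
```

The local variable `header` survives the suspension, which is the kind of
state that previously had to be kept in explicit members.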
2021-07-14 20:50:30 +02:00
Wojciech Mitros
01c2f406df sstable readers: create a generator class for coroutines
The data_consume_rows_context and data_consume_rows_context_m are
classes, that use primitive_consumer read* methods to get primitives
from a streamed sstable, and using their corresponding consumers' (
mp_row_consumer_k_l and mp_row_consumer_m) consume* methods, they
fill the buffer of the corresponding flat_mutation_reader.

The main procedure where we decide which read* and consume* methods
to call, is do_process_state. We save the current state of the
procedure in the _state variable, to remember where to continue in
the next call. For each call, the do_process_state method returns
information about whether we can keep filling the buffer using
more buffers from the stream (proceed::yes), or not (proceed::no).

The saved state can be (mostly) removed by using a generator
coroutine, whose state is saved when its execution is halted,
and which yields the values, that do_process_state would return
before.

The processing_result_generator is a class for managing a generator
coroutine. When the coroutine halts, the proceed_generator saves the
value yielded by the coroutine, and returns it to the caller.
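
A rough Python analogue of such a wrapper is sketched below. It is not
the C++ class from the patch: Python generators stand in for C++
coroutines, and `ProcessingResultGenerator`, `do_process_state`, and the
every-second-fragment rule are illustrative assumptions.

```python
from typing import Iterator

PROCEED_YES = "proceed::yes"  # stand-in: more buffers can be consumed
PROCEED_NO = "proceed::no"    # stand-in: stop filling the buffer for now


class ProcessingResultGenerator:
    """Hypothetical wrapper that resumes a coroutine on demand."""

    def __init__(self, coroutine: Iterator[str]):
        self._coroutine = coroutine

    def generate(self) -> str:
        # Resume the coroutine until it yields the next processing result.
        return next(self._coroutine)


def do_process_state(fragments):
    # The loop position is kept by the generator itself, so no explicit
    # _state variable is needed to remember where to continue.
    for index, _fragment in enumerate(fragments):
        # Pretend every second fragment fills the reader's buffer.
        yield PROCEED_NO if index % 2 else PROCEED_YES
```

Each `generate()` call plays the role that a call to do_process_state
played before the change.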
2021-07-14 20:50:27 +02:00
Piotr Sarna
3d816b7c16 Merge 'Move the reader concurrency semaphore in front of the cache' from Botond
This patchset combines two important changes to the way reader permits
are created and admitted:
1) It switches admission to be up-front.
2) It changes the admission algorithm.

(1) Currently permits are created before the read is started, but they
only wait for admission when going to the disk. This leaves the
resources consumption of cache and memtables reads unbounded, possibly
leading to OOM (rare but happens). This series changes this so that permits
are admitted at the moment they are created, making admission up-front
-- at least for those reads that pass admission at all (some don't).

(2) Admission currently is based on availability of resources. We have a
certain amount of memory available, which is derived from the memory
available to the shard, as well as a hardcoded count resource. Reads are
admitted when a count and a certain amount (base cost) of memory is
available. This patchset adds a new aspect to this admission process
beyond the existing resource availability: the number of used/blocked
reads. Namely it only admits new reads if in addition to the necessary
amount of resources being available, all currently used readers are
blocked. In other words we only admit new reads if all currently
admitted reads require something other than CPU to progress. They are
either waiting on I/O, a remote shard, or attention from their consumers
(not used currently).
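
The admission rule in (2) can be modelled as a single predicate. The
sketch below is a toy model under stated assumptions, not the
semaphore's implementation: the `Semaphore` dataclass and its field
names are hypothetical, and queueing, timeouts and eviction are left out.

```python
from dataclasses import dataclass


@dataclass
class Semaphore:
    """Hypothetical model of the admission decision only."""
    available_count: int
    available_memory: int
    used_reads: int = 0     # admitted reads currently marked as used
    blocked_reads: int = 0  # used reads waiting on I/O, a remote shard, ...

    def can_admit(self, base_cost: int) -> bool:
        # Existing condition: a count unit and the base memory cost are free.
        has_resources = (self.available_count >= 1
                         and self.available_memory >= base_cost)
        # New condition: no used read could make progress with CPU alone.
        all_used_are_blocked = self.used_reads == self.blocked_reads
        return has_resources and all_used_are_blocked
```

With one used read that is not blocked, `can_admit()` returns False even
when plenty of resources are available.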

The reason for making these two changes at the same time is that
up-front admission means cache reads now need to obtain a permit too.
For cache reads the optimal concurrency is 1. Anything above that just
increases latency (without increasing throughput). So we want to make sure
that if a cache reader hits it doesn't get any competition for CPU and
it can run to completion. We admit new reads only if the read misses and
has to go to disk.

A side effect of these changes is that the execution stages from the
replica-side read path are replaced with the reader concurrency
semaphore as an execution stage. This is necessary due to bad
interaction between said execution stages and up-front admission. This
has an important consequence: read timeouts are more strictly enforced
because the execution stage doesn't have a timeout so it can execute
already timed-out reads too. This is not the case with the semaphore's
queue which will drop timed-out reads. Another consequence is that data
and mutation reads now share the same execution stage, which increases
its effectiveness; on the other hand, system and user reads no longer
do.

Fixes: #4758
Fixes: #5718

Tests: unit(dev, release, debug)

* 'reader-concurrency-semaphore-in-front-of-the-cache/v5.3' of https://github.com/denesb/scylla: (54 commits)
  test/boost/reader_concurrency_semaphore_test: add used/blocked test
  test/boost/reader_concurrency_semaphore_test: add admission test
  reader_permit: add operator<< for reader_resources
  reader_concurrency_semaphore: add reads_{admitted,enqueued} stats
  table: make_sstable_reader(): fix indentation
  table: clean up make_sstable_reader()
  database: remove now unused query execution stages
  mutation_reader: remove now unused restricting_reader
  sstables: sstable_set: remove now unused make_restricted_range_sstable_reader()
  reader_permit: remove now unused wait_admission()
  reader_concurrency_semaphore: remove now unused obtain_permit_nowait()
  reader_concurrency_semaphore: admission: flip the switch
  database: increase semaphore max queue size
  test: index_with_paging_test: increase semaphore's queue size
  reader_concurrency_semaphore: add set_max_queue_size()
  test: mutation_reader_test: remove restricted reader tests
  reader_concurrency_semaphore: remove now unused make_permit()
  test: reader_concurrency_semaphore_test: move away from make_permit()
  test: move away from make_permit()
  treewide: use make_tracking_only_permit()
  ...
2021-07-14 16:22:56 +02:00
Botond Dénes
e2dfb2df71 test/boost/reader_concurrency_semaphore_test: add used/blocked test
Make sure that releasing a bunch of used/blocked guards in random order
doesn't break the permit state.
2021-07-14 17:19:02 +03:00
Botond Dénes
0337d3ea4a test/boost/reader_concurrency_semaphore_test: add admission test
Checking every conceivable admission scenario (hopefully).
2021-07-14 17:19:02 +03:00
Botond Dénes
b81f39cec9 reader_permit: add operator<< for reader_resources
And use it in tests, it results in actually useful error messages.
2021-07-14 17:19:02 +03:00
Botond Dénes
1666ad078a reader_concurrency_semaphore: add reads_{admitted,enqueued} stats
Primarily for tests, but we could also export these, should we want to.
2021-07-14 17:19:02 +03:00
Botond Dénes
46c9106bdf table: make_sstable_reader(): fix indentation 2021-07-14 17:19:02 +03:00
Botond Dénes
7ddde9107e table: clean up make_sstable_reader()
Remove all the now unneeded mutation sources.
2021-07-14 17:19:02 +03:00
Botond Dénes
ae4df99e6b database: remove now unused query execution stages 2021-07-14 17:19:02 +03:00
Botond Dénes
16d3cb4777 mutation_reader: remove now unused restricting_reader
Move the now orphaned new_reader_base_cost constant to
database.hh/table.cc, as its main user is now
`table::estimate_read_memory_cost()`.
2021-07-14 17:19:02 +03:00
Botond Dénes
2bab76c80e sstables: sstable_set: remove now unused make_restricted_range_sstable_reader() 2021-07-14 17:19:02 +03:00
Botond Dénes
5b8d6f02eb reader_permit: remove now unused wait_admission() 2021-07-14 17:19:02 +03:00
Botond Dénes
c86573813f reader_concurrency_semaphore: remove now unused obtain_permit_nowait() 2021-07-14 17:19:02 +03:00
Botond Dénes
1b7eea0f52 reader_concurrency_semaphore: admission: flip the switch
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.

(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.

(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
require something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).

We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.

Another change made to accommodate this switch is the replacement of the
replica side read execution stages with the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other anymore. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
2021-07-14 17:19:02 +03:00
Botond Dénes
01a4bb33de database: increase semaphore max queue size
Queued reads haven't taken 10KB (not even 1KB) for years now. But the real
motivation of this patch is that due to a soon-to-come change to
admission we expect larger queues especially in tests, so be more
forgiving with queue sizes.
2021-07-14 17:19:02 +03:00
Botond Dénes
dcf49dcb67 test: index_with_paging_test: increase semaphore's queue size
To allow the flood of reads generated by this test to be queued up
during up-front admission without failing the test.
2021-07-14 17:19:02 +03:00
Botond Dénes
79fefc490c reader_concurrency_semaphore: add set_max_queue_size() 2021-07-14 17:19:02 +03:00
Botond Dénes
388da36bbb test: mutation_reader_test: remove restricted reader tests
Soon we will switch to up-front admission which will break these tests.
No point in trying to fix them as once the switch is done we'll retire
the restricted reader too. Remove these tests now so they are not in the
way of progress.
2021-07-14 17:19:02 +03:00
Botond Dénes
00511100a4 reader_concurrency_semaphore: remove now unused make_permit() 2021-07-14 17:19:02 +03:00
Botond Dénes
bacfaf9582 test: reader_concurrency_semaphore_test: move away from make_permit()
Migrate to the appropriate up-front admission variants.
2021-07-14 17:19:02 +03:00
Botond Dénes
c07db00b70 test: move away from make_permit()
Use the most appropriate up-front admission variant.
2021-07-14 17:19:02 +03:00
Botond Dénes
7bfa40a2f1 treewide: use make_tracking_only_permit()
For all those reads that don't (won't or can't) pass through admission
currently.
2021-07-14 17:19:02 +03:00
Nadav Har'El
8bdff97d8d Merge 'Fix propagating view update generation failures' from Piotr Sarna
When the generate-and-propagate-view-updates routine was rewritten
to allow partial results, one important validation got lost:
previously, an error which occurred during update *generation*
was propagated to the user - as an example, the indexed column
value must be smaller than 64kB, otherwise it cannot act as primary
key part in the underlying view. Errors on view update *propagation*
are however ignored in this layer, because it becomes a background
process.
During the rewrite these two got mixed up and so it was possible
to ignore an error that should have been propagated.
This behavior is now fixed.

Fixes #9013

Closes #9021

* github.com:scylladb/scylla:
  cql-pytest: add a case for too large value in SI
  table: stop ignoring view generation errors on write path
2021-07-14 15:49:48 +02:00
Piotr Sarna
91b4e24db5 Merge 'Tests for Alternator's TTL feature' from Nadav Har'El
This series includes a comprehensive test suite for the DynamoDB API's
TTL (item expiration) feature described in issue #5060. Because we have
not yet implemented the TTL feature in Alternator, all of the tests
still xfail, but they all pass on DynamoDB and demonstrate exactly how
the TTL feature works and how it interacts with other features such as
LSI, GSI and Streams. The patch which introduces these tests is heavily
commented to explain exactly what it tests, and why.

Because DynamoDB only expires items some 10-30 minutes after their
expiration time (the documentation even suggests it can be delayed by 24
hours!), some of these tests are extremely long (up to 30 minutes!), so
we also introduce in this series a new marker for "verylong" tests.
verylong tests are skipped by default, unless the "--runverylong" option
is given. In the future, when we implement the TTL feature in Alternator
and start testing it, we may be able to configure it with a much shorter
expiration timeout and then we might be able to run these tests in a
reasonable time and make them run by default.
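
An opt-in marker of this kind is commonly wired up in a `conftest.py`
along the following lines. This is a sketch using standard pytest hooks,
not necessarily how the series implements it; the `should_skip` helper
and the skip reason text are assumptions.

```python
def pytest_addoption(parser):
    # Registers the opt-in flag named in the text above.
    parser.addoption("--runverylong", action="store_true", default=False,
                     help="also run tests marked verylong")


def should_skip(marker_names, run_very_long):
    """Pure helper: skip verylong tests unless the flag was given."""
    return "verylong" in marker_names and not run_very_long


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above has no dependencies

    skip = pytest.mark.skip(reason="need --runverylong option to run")
    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        if should_skip(names, config.getoption("--runverylong")):
            item.add_marker(skip)
```

A test would then opt in with `@pytest.mark.verylong`, and a plain
`pytest` run would report it as skipped.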

Closes #8564

* github.com:scylladb/scylla:
  test/alternator: add tests for the Alternator TTL feature
  test/alternator: add marker for "veryslow" tests
  test/alternator: add new_test_table() utility function
2021-07-14 15:49:48 +02:00
Botond Dénes
0ced9c83b7 mutation_reader: evictable_reader: futurize resume_or_recreate_reader()
In preparation for waiting for readmission after eviction in a later
patch.
2021-07-14 16:48:43 +03:00
Botond Dénes
f37e26c73d querier: remove now unused cache_context 2021-07-14 16:48:43 +03:00
Botond Dénes
7f2813e3fa database: mutation_query(): handle querier lookup/save on the database level
Instead of passing down the querier_cache_ctx to table::mutation_query(),
handle the querier lookup/save on the level where the cache exists.

The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
2021-07-14 16:48:43 +03:00
Botond Dénes
f9d302bf49 database: mutation_query(): convert into coroutine
To facilitate further patching (and reading).
2021-07-14 16:48:43 +03:00
Botond Dénes
d2f5393a43 database: query(): handle querier lookup/save on the database level
Instead of passing down the querier_cache_ctx to table::query(),
handle the querier lookup/save on the level where the cache exists.

The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
2021-07-14 16:48:43 +03:00
Botond Dénes
c28a6e8537 database: query(): convert into coroutine
To facilitate further patching (and reading).
2021-07-14 16:48:43 +03:00
Botond Dénes
6efb278ea3 querier_cache: insert(): close refused queriers
The querier cache refuses to cache queriers that read in reverse. These
queriers are also not closed, with the caller having no way to determine
whether the querier it just moved into `insert()` needs a close
afterwards or not, requiring a `close()` on the moved-from querier just
to be sure.
Avoid this by consistently closing all passed-in queriers, including
those the cache refuses to save. For this, the internal
`insert_querier()` method has to be made a member to be able to use the
closing gate.
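
The ownership rule can be pictured with a small Python model. This is an
illustration only: the real code is C++, and `Querier`, `QuerierCache` and
their fields are hypothetical stand-ins, not the actual classes.

```python
class Querier:
    """Hypothetical stand-in for a querier; tracks only what the sketch needs."""

    def __init__(self, reversed_read=False):
        self.reversed_read = reversed_read
        self.closed = False

    def close(self):
        self.closed = True


class QuerierCache:
    """Toy cache: insert() always takes responsibility for the querier."""

    def __init__(self):
        self.saved = {}

    def insert(self, key, querier):
        if querier.reversed_read:
            # Refused: the cache closes it itself, so the caller never has to
            # guess whether a close() is still needed after the hand-off.
            querier.close()
            return False
        self.saved[key] = querier
        return True
```

With this shape the caller can move the querier in and forget about it,
whether or not the cache decided to keep it.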
2021-07-14 16:48:43 +03:00
Botond Dénes
5291494a50 mutation_reader: shard reader: use reader_lifecycle_policy::obtain_reader_permit()
Co-routinize the reader creation lambda in the process.
2021-07-14 16:48:43 +03:00
Botond Dénes
426b46c4ed mutation_reader: reader_lifecycle_policy: add obtain_reader_permit()
This method is both a convenience method to obtain the permit, as well
as an abstraction to allow different implementations to get creative.
For example, the main implementation, the one in multishard mutation
query, returns the permit of the saved reader if one was found. This
ensures that on a multi-paged read the same permit is used across as
many pages as possible. Much more importantly, it ensures the evictable
reader and the actual reader it wraps both use the same permit.
2021-07-14 16:48:43 +03:00
Botond Dénes
7fcf4a63c5 multishard_mutation_query: use the passed-in permit to create new reader
Ensure that when the reader has to be created anew the passed-in permit
is used to create it, instead of the one left over in remote-parts,
which is that of the already evicted reader.
This lays the groundwork to ensure the same permit is used across all
pages of a read, by a future patch which creates the wrapping reader
with the existing permit.
2021-07-14 16:48:43 +03:00
Botond Dénes
97a03f9027 database: make_multishard_streaming_reader: use external permit
As a preparation for up-front admission, add a permit parameter to
`make_multishard_streaming_reader()`, which will be the admitted permit
once we switch to up-front admission. For now it has to be a
non-admitted permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic
"multishard-streaming-reader" one.
2021-07-14 16:48:43 +03:00
Botond Dénes
5293bd21cf streaming/stream_session: use database::obtain_reader_permit() 2021-07-14 16:48:43 +03:00
Botond Dénes
292a8819ec repair/row_level: use database::obtain_reader_permit() 2021-07-14 16:48:43 +03:00
Botond Dénes
f28b5018f2 view/view_update_generator: use obtain_reader_permit() 2021-07-14 16:48:43 +03:00
Botond Dénes
999169e535 database: make_streaming_reader(): require permit
As a preparation for up-front admission, add a permit parameter to
`make_streaming_reader()`, which will be the admitted permit once we
switch to up-front admission. For now it has to be a non-admitted
permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic "streaming" one.
2021-07-14 16:48:43 +03:00
Botond Dénes
3ec149222d database: add obtain_reader_permit()
A convenience method for obtaining an admitted permit for a read on a
given table.
For now it uses the nowait semaphore obtaining method, as all normal
reads still use the old admission method. Migrating reads to this method
will make the switch easier, as there will be one central place to
replace the nowait method with the proper one.
2021-07-14 16:48:43 +03:00
Botond Dénes
a6b59f0d89 table: add estimate_read_memory_cost()
To be used for determining the base cost of reads used in admission. For
now it just returns the already used constant. This is a forward looking
change, to when this will be a real estimation, not just a hardcoded
number.
2021-07-14 16:48:43 +03:00
Botond Dénes
af8f39a775 reader_concurrency_semaphore: make it an execution stage
The execution stage functionality is exposed via two new member
functions, `with_permit()` and `with_ready_permit()`. Both accept a
function to be run. The former obtains a permit then runs the passed in
function through the execution stage. The latter allows an already
obtained permit to be passed in.
2021-07-14 16:48:43 +03:00
Botond Dénes
5d3ddba2c7 reader_concurrency_semaphore: make_permit(): add up-front admission variants
Three new methods are added for creating permits:
1) obtain_permit()
2) obtain_permit_nowait()
3) make_tracking_only_permit()

(1) is meant to replace `make_permit()` + `wait_admission()`, by
integrating the waiting for admission into the process of creating the
permit. This is the method meant to be used to create permits from here
on, ensuring that each read passes admission before even being started.
(2) is a bridge between the old and new world. Up-front admission cannot
coexist with the restricted reader in the same read, so those reads that
have a restricted reader in their stack can use this method to create a
non-admitted permit to be admitted by the restricted reader later. Once
we have migrated all reads to (1) or (2), we can get rid of the
restricted reader and just replace (2) with (1) in the codebase. (2)
returns a future to make this a simple rename, the churn of dealing with
a future<reader_permit> return type already having been dealt with by
then.
(3) is for reads that bypass admission, yet their resource usage does
participate in the admission of other reads. This is the equivalent of
reads that don't pass admission at all.

The following patches will gradually transition the codebase away from
the old permit API, and once the transition is complete, we can switch
over to do the admission up-front at once.
2021-07-14 16:48:43 +03:00
Botond Dénes
844a99a91a reader_concurrency_semaphore: prepare for up-front admission
We want to make permits be admitted up-front, before even being created.
As part of this change, we will get rid of the `wait_admission()`
method on the permit; instead, the permit will be created as a result of
waiting for admission (just like back some time ago).
To allow evicted readers to wait for re-admission, a new method
`maybe_wait_readmission()` is created, which waits for readmission if
the permit is in evicted state.

Also refactor the internals of the semaphore to support and favor
up-front admission code. As up-front admission is the future we want the
permit code to be organized in such a way that it is natural to use with
it. This means that the "old-style" admission code might suffer but we
tolerate this as it is on its way out. To this end the following changes
were done:
* Add a _base_resources field to reader_permit which tracks the base
  cost of said permit. This is passed in the constructor and is used in
  the first and subsequent admissions.
* The base cost is now managed internally by the permit, instead of
  relying on an external `resource_units` instance, though the old way
  is still supported temporarily.
* Change the admission pipeline to favor the new permit-internally
  managed base cost variant.
* Compatibility with old-style admission: permits are created with 0
  base resources, base resources are set with the compatibility method
  `set_base_resources()` right before admission, then externalized again
  after admission with `base_resource_as_resource_units()`. These
  methods will be gone when the old style admission is retired (together
  with `wait_admission()`).
2021-07-14 16:48:43 +03:00
Botond Dénes
05e6881c73 reader_permit: allow constructing reader_permit from impl&
By enabling shared from this for impl and adding a reader permit
constructor which takes a shared pointer to an impl.
This allows impl members to invoke functions requiring a `reader_permit`
instance as a parameter.
2021-07-14 16:48:43 +03:00
Botond Dénes
ea2345c944 db/size_estimates_virtual_reader: mark as blocked when obtaining local ranges 2021-07-14 16:48:43 +03:00
Botond Dénes
b5cbd19383 mutation_reader: shard_reader: mark permit as blocked when waiting on remote shard 2021-07-14 16:48:43 +03:00
Botond Dénes
6f6a8f5cf8 mutation_reader: shard_reader: coroutinize fill_buffer() and fast_forward_to()
To facilitate further patching (and make the code look nicer too).
2021-07-14 16:48:43 +03:00
Botond Dénes
26e83bdde8 mutation_reader: foreign_reader: mark permit as blocked when waiting on remote shard 2021-07-14 16:48:43 +03:00
Botond Dénes
434f2efde5 sstables: continuous_data_consumer: mark permit as blocked when doing IO 2021-07-14 16:48:43 +03:00
Botond Dénes
aa480fa3f9 reader_permit: allow marking blocked
Distinguish between permits that are blocked and those that are not.
Conceptually a blocked permit is one that needs to wait on either I/O or
a remote shard to proceed.
This information will be used by admission, which will only admit new
reads when all currently used ones are blocked. More on that in the
commit introducing this new admission type.
This patch only adds the infrastructure, block sites are not marked yet.
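
As a rough illustration of the admission condition mentioned above, here is
a Python model. It is a sketch under assumptions: `Permit` and
`may_admit_new_read()` are made-up names, and resource accounting (memory,
count) is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Permit:
    used: bool = False     # a consumer has a pending call on one of its readers
    blocked: bool = False  # waiting on I/O or on a remote shard


def may_admit_new_read(permits):
    """True when every used permit is also blocked (vacuously true if none is used)."""
    return all(p.blocked for p in permits if p.used)
```

The idea is that a used permit that is not blocked still needs the CPU, so
admitting another read next to it would only add contention.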
2021-07-14 16:48:43 +03:00
Botond Dénes
9cb36cc516 test: continuous_data_consumer_test: mark permit as used 2021-07-14 16:48:43 +03:00
Botond Dénes
47342ae8a8 mutation_reader: shard_reader: mark underlying permit as used 2021-07-14 16:48:43 +03:00
Botond Dénes
a5dc48b4b1 reader_permit: allow marking it as used
Distinguish between permits that are used and those that are not. These
are two subtypes of the current 'active' state (and replace it).
Conceptually a permit is used when any readers associated with it have a
pending call to any of their async methods, i.e. the consumer is
actively consuming from them.
This information will be used for admission, together with a new blocked
state introduced by a future patch.
This patch only adds the infrastructure, use sites are not marked yet.
2021-07-14 16:48:43 +03:00
Botond Dénes
5a20861a1d reader_permit: add reader_permit_opt 2021-07-14 16:48:43 +03:00
Botond Dénes
a251cc2368 reader_permit: introduce evicted state
We want to introduce more fine-grained states for permits than what we
have currently, splitting the current 'active' state into multiple
sub-states. As a preparatory step, introduce an evicted state too, to
keep track of permits that were evicted while being inactive. This will
be important in determining what permits need to re-wait admission, once
we keep permits across pages. Having an evicted state also aids
validating internal state transitions.
2021-07-14 16:48:43 +03:00
Botond Dénes
5416fc6d1b reader_concurrency_semaphore: add current_permits to permit_stats 2021-07-14 16:48:43 +03:00
Botond Dénes
c97fc16105 reader_concurrency_semaphore: extract waiter admission into separate function
Because soon we will have more than one place to trigger waiter
admission from.
2021-07-14 16:48:43 +03:00
Nadav Har'El
2acfee8118 test/alternator: add tests for the Alternator TTL feature
This patch adds a comprehensive test suite for the DynamoDB API's TTL
(item expiration) feature.

The tests check the two new API commands added by this feature
(UpdateTimeToLive and DescribeTimeToLive), and also how items are
expired in practice, and how item expiration interacts with other
features such as GSI, LSI and DynamoDB Streams.

Because DynamoDB has extremely long delays until items are expired, or
until expiration configuration may be changed, several of these tests
take up to 30 minutes to complete. We mark these tests with the
 "verylong" marker, so they are skipped in ordinary test runs - use the
"--runverylong" option to run them.

All these tests currently pass on DynamoDB, but xfail on Alternator
because the two commands UpdateTimeToLive and DescribeTimeToLive are
currently rejected by Alternator.

Refs #5060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-07-14 14:08:55 +03:00
Avi Kivity
6df3139455 install-dependencies.sh: add gdb
gdb is used for testing scylla-gdb.py (since 3c2e852dd), so it needs
to be listed as a dependency. Add it there. It was listed as a
courtesy dependency in the frozen toolchain (which is why it still
worked), so it's removed from there.

Closes #9034
2021-07-14 10:15:54 +03:00
Eliran Sinvani
ccdef39d21 Service Level Controller: Stop configuration polling loop upon leaving
the cluster

This change subscribes service_level_controller for nodes life cycle
notifications and uses the notification of leaving the cluster for
the current node to stop the configuration polling loop. If the loop
continues to run, its queries will fail consistently since the nodes
will not answer queries. It is worth mentioning that the queries
failing in the current state of the code is harmless but noisy: after
90 seconds, if the scylla process is not shut down, the failures will
start to generate failure logs every 90 seconds, which is confusing for
users.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-07-14 09:31:40 +03:00
Eliran Sinvani
55e2aabbae main: Stop using get_local_storage_service in main
This change removes the use of service::get_local_storage_service,
instead, the approach taken is similar to other modules, for example
storage_proxy, where a reference to the `sharded` module is obtained
once and then the local() method in combination with capturing is
used.
2021-07-14 09:31:36 +03:00
Avi Kivity
64ad31c26f build: enable -Wc++1z-extensions
It was disabled for the move to clang, but that is apparently no longer
needed. So re-enable the warning.

Closes #9026
2021-07-14 08:28:26 +03:00
Nadav Har'El
9d07ce3cb6 test/alternator: add marker for "veryslow" tests
Until now, Alternator tests have all been very fast, taking milliseconds
or at worst seconds each - or a bit longer on DynamoDB. However,
sometimes we need to write tests which take a huge amount of time - for
example, tests for the TTL feature may take 10 minutes because the item
expiration is delayed by that much. Because a 10 minute test is
ridiculous (all 500 Alternator tests together take just one minute
today!), we would normally run such a test once, and then mark it "skip"
so it will never run again.

One annoying thing about skipped tests is that there is no way to
temporarily "unskip" them when we want to run such a test anyway.
So in this patch, we introduce a better option for these very slow tests
instead of the simple "skip":

The patch introduces a marker "@pytest.mark.veryslow". By default, a
test with this marker is skipped. However, a command-line option
"--runveryslow" is introduced which causes tests with the veryslow
mark to be run anyway, and not skipped.
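
For illustration, the usual pytest pattern for such a marker looks roughly
like the following conftest.py fragment. It is a sketch of the technique,
not necessarily the exact code in this patch; the help and reason strings
are made up.

```python
import pytest


def pytest_addoption(parser):
    parser.addoption("--runveryslow", action="store_true", default=False,
                     help="run tests marked veryslow instead of skipping them")


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "veryslow: mark test as very slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runveryslow"):
        return  # option given: run everything, including veryslow tests
    skip = pytest.mark.skip(reason="need --runveryslow option to run")
    for item in items:
        if "veryslow" in item.keywords:
            item.add_marker(skip)
```

A test then opts in with `@pytest.mark.veryslow` and is skipped unless the
option is passed on the command line.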

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-07-14 00:26:21 +03:00
Nadav Har'El
2fb379bb94 test/alternator: add new_test_table() utility function
This patch adds a convenient function new_test_table() that Alternator tests
can use to safely create a temporary table, and be sure it is deleted in any
case. This function is used in a "with", as follows:

    with new_test_table(dynamodb, ...) as table:
        do_something(table)
    # at this point table has already been deleted.
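
A function with this behavior can be sketched with contextlib, as below.
This is a minimal illustration that assumes `dynamodb` is a boto3-style
resource whose `create_table()` returns an object with a `delete()` method;
the real helper may do more (for example picking a unique table name).

```python
from contextlib import contextmanager


@contextmanager
def new_test_table(dynamodb, **kwargs):
    table = dynamodb.create_table(**kwargs)
    try:
        yield table
    finally:
        # Runs on normal exit and also when the body of the "with" raises.
        table.delete()
```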

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-07-14 00:26:21 +03:00
Piotr Sarna
11a054c1fc cql-pytest: add a case for too large value in SI
The test case tries to insert a too-large value into an indexed
column and expects to see a write failure.

Refs #9013
2021-07-13 17:26:18 +02:00
Piotr Sarna
73f7702a69 table: stop ignoring view generation errors on write path
When the generate-and-propagate-view-updates routine was rewritten
to allow partial results, one important validation got lost:
previously, an error which occurred during update *generation*
was propagated to the user - as an example, the indexed column
value must be smaller than 64kB, otherwise it cannot act as primary
key part in the underlying view. Errors on view update *propagation*
are however ignored in this layer, because it becomes a background
process.
During the rewrite these two got mixed up and so it was possible
to ignore an error that should have been propagated.
This behavior is now fixed.

Fixes #9013
2021-07-13 17:20:38 +02:00
Benny Halevy
c8e7bd9a26 storage_proxy: abstract_read_resolver: catch semaphore_timed_out before timed_out_error
Prepare for making semaphore_timed_out derived from timed_out_error
in seastar.  When this happens in seastar, we would need to catch
the derived, more-specific exception first to avoid the following
warning:
```
service/storage_proxy.cc:2818:18: error: exception of type 'seastar::semaphore_timed_out &' will be caught by earlier handler [-Werror,-Wexceptions]
        } catch (semaphore_timed_out&) {
                 ^
service/storage_proxy.cc:2815:18: note: for type 'seastar::timed_out_error &'
        } catch (timed_out_error&) {
                 ^
```

Later on, after the seastar change is applied to the scylla repo,
we can eliminate the duplication and catch only timed_out_error.
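
The ordering rule is not specific to C++. A small Python analogue, with
made-up class names mirroring the seastar ones, shows why the derived
handler has to come first:

```python
class TimedOutError(Exception):
    pass


class SemaphoreTimedOut(TimedOutError):
    pass


def classify(exc):
    try:
        raise exc
    except SemaphoreTimedOut:
        # Must precede the base-class handler; otherwise it is unreachable.
        return "semaphore_timed_out"
    except TimedOutError:
        return "timed_out_error"
```

If the two handlers were swapped, the `TimedOutError` clause would also
catch `SemaphoreTimedOut`, which is the situation the compiler warns about.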

Test: unit(dev) (w/ the seastar changes to semaphore_timed_out
and rpc::timeout_error to inherit from timed_out_error).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210713132858.294504-1-bhalevy@scylladb.com>
2021-07-13 16:39:17 +03:00
Nadav Har'El
c174eeae06 alternator: do not allow LSI on base table with no sort key
The purpose of an LSI (local secondary index) in Alternator is to allow
a different sort key for the existing partitions, keeping the same
division into partitions. So it doesn't make sense to create an LSI on
a table that did not originally have a sort key (i.e., single-item partitions).

DynamoDB indeed doesn't allow this case, and Alternator forgot to forbid
it - so this patch adds the missing check to the CreateTable operation.

This patch also adds a test case for this, test_lsi_wrong_no_sort_key,
which failed before the patch and passes after it (and also passes on
DynamoDB).

Also, the existing test_lsi_wrong tests for bad LSI creation attempts
by mistake used a base table without a sort key - so while they
encountered an error as expected, it was not the right error! So we fix
that test (and split it into two tests), adding the missing sort key
and exposing the actual errors that the tests were meant to expose.
That test passed before this patch and also afterwards - but at least
after the patch it is actually testing what it was meant to be testing.

Fixes #9018.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210713123747.1012954-1-nyh@scylladb.com>
2021-07-13 15:12:01 +02:00
Nadav Har'El
1ff1c3735b Merge 'Remove the mutation-based restriction checks' from Piotr Sarna
This series unifies the interface for checking if CQL restrictions are satisfied. Previously, an additional mutation-based approach was added in the materialized views layer, but the decision was reached that it's better to have a single API based on partition slices. With that, the regular selection path gets simplified at the cost of more complicated view generation path, which is a good tradeoff.
Note that in order to unify the interface, the view layer performs ugly transformations in order to adjust the input for `is_satisfied_by`. Reviewers, please take a close look at this code (`matches_view_filter`, `clustering_prefix_matches`, `partition_key_matches`), because it looks error-prone and relies on dirty internals of our serialization layer. If somebody has a better suggestion on how to do the transformation, I'm all ears.

Tests: unit(release), manual(playing with materialized views with custom filters)
Fixes #7215

Closes #8979

* github.com:scylladb/scylla:
  db,view,table: drop unneeded time point parameter
  cql3,expr: unify get_value
  cql3,expr: purge mutation-based is_satisfied_by
  db,view: migrate key checks from the deprecated is_satisfied_by
  db,view: migrate checking view filter to new is_satisfied_by
  db,view: add a helper result builder class
  db,view: move make_partition_slice helper function up
2021-07-13 12:42:13 +03:00
Kamil Braun
b5a7220da4 test: raft: randomized_nemesis_test: reconfigure function
Instead of calling `set_configuration` directly on a `raft::server`, the
caller will use the higher-level `reconfigure`. Similarly to `call`, the
function converts exceptions into return values (inside a `variant`) and
allows passing in a timeout parameter.
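
The "exceptions become return values" idea can be sketched in Python with
asyncio. This is only a loose analogue of the C++ `variant` approach;
`to_result()` is a hypothetical name.

```python
import asyncio


async def to_result(make_coro, timeout):
    """Await make_coro() for at most `timeout` seconds.

    Returns the value on success, or the exception object (including the
    timeout error) instead of raising it.
    """
    try:
        return await asyncio.wait_for(make_coro(), timeout)
    except Exception as e:
        return e
```

The caller then inspects the returned object instead of wrapping every call
in its own try/except.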
2021-07-13 11:15:26 +02:00
Kamil Braun
eb4a8d48aa test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function 2021-07-13 11:15:26 +02:00
Kamil Braun
69c59ec801 test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots
When storing a snapshot `snap`, if `snap.idx > e.idx` where `e`
is the last entry in the log (if any), we need to clear all previous
entries so that we don't create a gap in the log. The log must remain
contiguous.

One case is controversial: what to do if `snap.idx == e.idx + 1`.
Technically no gap would be created between the entry and the snapshot.
However, if we now want to store a new entry with index `e.idx + 2`,
that would create a gap between two entries which is illegal.
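
The rule can be modelled in a few lines of Python. This is a sketch only:
the log is a plain list of `(idx, payload)` pairs, and leaving the log
untouched when `snap_idx` does not exceed the last index is an assumption
of the sketch, not something stated above.

```python
def store_snapshot(log, snap_idx):
    """log: list of (idx, payload) pairs, contiguous and ascending by idx."""
    if log and snap_idx > log[-1][0]:
        # Also covers snap_idx == last + 1: keeping the old entries would make
        # the next append (at snap_idx + 1) non-contiguous with them.
        log.clear()
    return log
```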
2021-07-13 11:15:26 +02:00
Kamil Braun
f381a97f6f test: raft: randomized_nemesis_test: persistence: handle complex state types
The usage of `template <..., State init_state>` in `persistence`
permitted using only a very restricted class of types (so called
"structural types").

Pass the initial state through `persistence`'s constructor instead. Also
modify the member functions so the State type doesn't need to have a
default constructor.
2021-07-13 11:15:25 +02:00
Kamil Braun
59e04b2b2e test: raft: randomized_nemesis_test: call: handle raft::dropped_entry
This exception happens when the leader stops being a leader in the
middle of a call. Expect it to happen and return it in the result
variant.
2021-07-13 11:15:25 +02:00
Kamil Braun
d97cf1a254 test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels
Inside `call`, if `add_entry` failed or the operation timed out,
the output channel promise would be dropped without setting a value, causing
a `broken_promise` exception. Furthermore the output future would be
dropped, so we get a discarded `broken_promise` future.

The fix:
1. When we drop a channel without a result (inside
   `impure_state_machine::with_output_channel`), set an explicit
    exception with a dedicated type.
2. Discard the channel future in a controlled way, explicitly handling
   the `output_channel_dropped` exception.
2021-07-13 11:15:25 +02:00
Kamil Braun
f51ff786bd test: raft: randomized_nemesis_test: environment: expose the network
Let the user of `environment` access the `network` directly
for e. g. introducing network partitions.
2021-07-13 11:15:25 +02:00
Kamil Braun
26d2f99cad test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold
The following are now passed to `environment` as parameters:
- network delay,
- failure detector convict threshold.
Environment passes them further down when constructing the underlying
objects.
2021-07-13 11:15:25 +02:00
Kamil Braun
035ae2eb1b test: raft: randomized_nemesis_test: generalize with_env_and_ticker
Generalize the type of the callback: use a template parameter instead of
`noncopyable_function` and don't assume the return type of the callback.

This allows returning a result from `with_env_and_ticker`, e.g. for
performing analysis or logging the results after a part of the test that
used the environment and ticker has finished.
2021-07-13 11:15:25 +02:00
Kamil Braun
25fb195bc7 test: raft: randomized_nemesis_test: network: add_grudge, remove_grudge functions
Extend the interface of `network` to allow introducing and removing "grudges"
which prevent the delivery of messages from one given server to another
(when the time comes to deliver a message but there's a grudge, the
message is dropped).
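
A toy Python model of the idea, for illustration only (the `deliver()`
method and the `delivered` list are inventions of this sketch, not the test
code's interface):

```python
class Network:
    def __init__(self):
        self._grudges = set()  # (src, dst) pairs whose messages get dropped
        self.delivered = []

    def add_grudge(self, src, dst):
        self._grudges.add((src, dst))

    def remove_grudge(self, src, dst):
        self._grudges.discard((src, dst))

    def deliver(self, src, dst, msg):
        # The grudge is checked at delivery time, not at send time.
        if (src, dst) in self._grudges:
            return False
        self.delivered.append((src, dst, msg))
        return True
```

Grudges are directional here: a grudge from 1 to 2 does not stop messages
going from 2 to 1.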
2021-07-13 11:15:25 +02:00
Kamil Braun
774ef653b1 test: raft: randomized_nemesis_test: move ticker to its own header 2021-07-13 11:15:25 +02:00
Kamil Braun
a45e8e0db0 test: raft: randomized_nemesis_test: ticker: take logger as a constructor parameter
Remove the global dependency on `tlogger`.
2021-07-13 11:15:25 +02:00
Kamil Braun
21b5a6d9f7 test: raft: logical_timer: handle immediate timeout
If the user calls `with_timeout` with a time point that's already been
reached, we return `timed_out_error` immediately.
2021-07-13 11:15:25 +02:00
Kamil Braun
ed8e9a564a test: raft: logical_timer: on timeout, return the original future in the exception
More specifically, return a future which is equivalent to the original
future (when the original future resolves, this future will contain its
result).

Thus we don't discard the future, the user gets it back.
Let them decide what to do with it.
2021-07-13 11:15:25 +02:00
Kamil Braun
c86ff1eb7c test: raft: logical_timer: add schedule member function
It allows scheduling the given function to be called at the given logical
time point.
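
Roughly, in Python terms (a sketch; the `tick()` method and the heap layout
are assumptions of this illustration, not the C++ implementation):

```python
import heapq
import itertools


class LogicalTimer:
    def __init__(self):
        self.now = 0
        self._queue = []               # heap of (time point, seq, callback)
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, time_point, fn):
        heapq.heappush(self._queue, (time_point, next(self._seq), fn))

    def tick(self):
        # Advance logical time by one and run everything that became due.
        self.now += 1
        while self._queue and self._queue[0][0] <= self.now:
            _, _, fn = heapq.heappop(self._queue)
            fn()
```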
2021-07-13 11:15:25 +02:00
Kamil Braun
cf0d503a92 test: raft: randomized_nemesis_test: move logical_timer to its own header 2021-07-13 11:15:25 +02:00
Kamil Braun
9f5eeec56a test: raft: include the leader's ID in the not_a_leader exception's message 2021-07-13 11:15:25 +02:00
Piotr Sarna
a1813c9b34 db,view,table: drop unneeded time point parameter
Now that restriction checking is translated to the partition-slice-style
interface, checking the partition/clustering key restrictions for views
can be performed without the time point parameter.
The parameter is dropped from all relevant call sites.
2021-07-13 10:40:08 +02:00
Piotr Sarna
1e0880e345 cql3,expr: unify get_value
Now that there's only one helper function for getting values,
the call can be inlined instead.
2021-07-13 10:40:08 +02:00
Piotr Sarna
95002bb8d4 cql3,expr: purge mutation-based is_satisfied_by
The interface is now unified, and all callers use the original
CQL3-backed API.
2021-07-13 10:40:08 +02:00
Piotr Sarna
37fc3f4b5b db,view: migrate key checks from the deprecated is_satisfied_by
Last two users of the mutation-based is_satisfied_by function
were in the partition/clustering key checks. These functions are now
translated to use the original API.
2021-07-13 10:40:07 +02:00
Piotr Sarna
d6b0a8338a db,view: migrate checking view filter to new is_satisfied_by
In order to unify the interfaces, the is_satisfied_by flavor
for mutations is getting deprecated. In order to be able to remove it,
one of its biggest users, the matches_view_filter() function,
is translated to the other variant.
2021-07-13 10:04:03 +02:00
Piotr Sarna
786db7e9a8 db,view: add a helper result builder class
In order to migrate from mutation-based restriction checks,
code in view.cc needs to have a way of translating results
to partition-slice-based representation.
A slightly simplified builder from multishard_mutation_query.cc
is injected into the view code.
2021-07-13 10:04:03 +02:00
Piotr Sarna
32d87837b1 db,view: move make_partition_slice helper function up
No functional changes, it will be needed for a future patch.
2021-07-13 10:04:02 +02:00
Botond Dénes
2bbfb76cc5 compaction/leveled_compaction_strategy.cc: remove unused <ranges> include
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210713063506.419658-1-bdenes@scylladb.com>
2021-07-13 10:34:22 +03:00
Benny Halevy
1db0612a06 cql3: query_processor: delete service_level_controller param
The query_processor internal_state doesn't use the
service_level_controller as it only needs
service::client_state::for_internal_calls()

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210713055703.131099-1-bhalevy@scylladb.com>
2021-07-13 10:34:05 +03:00
Benny Halevy
90e5181192 main: defer-stop sl_controller after start and drain before storage_proxy drain_on_shutdown
As per 32bcbe59ad, the sl_controller is stopped after set_distributed_data_accessor is called.
However if scylla shuts down before that happens, the sl_controller still needs to be stopped.

We need to drain the service level controller before storage_proxy::drain_on_shutdown
is called to prevent queries by the update loop from starting after the
storage_proxy has been drained - leading to issues similar to #9009.

Fixes #9014

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210713055606.130901-1-bhalevy@scylladb.com>
2021-07-13 10:33:29 +03:00
Avi Kivity
f0e2f31839 Merge "Implement validation compaction" from Botond
"
Currently, when sstables are suspected to be corrupt, one has a few bad
choices on how to verify that they are indeed correct:
* Obtain suspect sstable files and manually inspect them. This is
  problematic because it requires a scylla engineer to have direct access
  to data, which is not always simple or even possible due to privacy
  protection rules.
* Run sstable scrub in abort mode. This is enough to confirm whether
  there is any corruption or not, but only in a binary manner. It is not
  possible to explore the full scope of the corruption, as the scrub
  will abort on the first corruption.
* Run sstable scrub in non-abort mode. Although this allows for
  exploring the full scope of the corruption and it even gets rid of it,
  it is a very intrusive and potentially destructive process that some
  users might not be willing to even risk.

This patchset offers an alternative: validation compaction. This is a
completely non-intrusive compaction that reads all sstables in turn and
validates their contents, logging any discrepancies it can find. It does
not mutate their content, it doesn't even rewrite them. It is akin to
a dry-run mode for sstable scrub. The reason it was not implemented as
such is that the current compaction infrastructure assumes that input
sstables are replaced by output sstables as part of the compaction
process. Lifting this assumption seemed error-prone and risky, so
instead I snatched the unused "Validation" compaction type for this
purpose. This compaction type completely bypasses the regular compaction
infrastructure but only at the low-level. It still integrates fully
into compaction-manager.

Fixes: #7736
Refs: https://github.com/scylladb/scylla-tools-java/issues/263

Tests: unit(dev)
"

* 'validation-compaction/v5' of https://github.com/denesb/scylla:
  test/boost/sstable_datafile_test: add test for validation compaction
  test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function
  api: storage_service: expose validation compaction
  sstables/compaction_manager: add perform_sstable_validation()
  sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME
  sstables/compaction_manager: add maintenance scheduling group
  sstables/compaction_manager: drop _scheduling_group field
  sstables/compaction_manager: run_custom_job(): replace parameter name with compaction type
  sstables/compaction_manager: run_custom_job(): keep job function alive
  sstables/compaction_descriptor: compaction_options: add validation compaction type
  sstables/compaction: compaction_options::type(): add static assert for size of index_to_type
  sstables/compaction: implement validation compaction type
  sstables/compaction: extract compaction info creation into static method
  sstables/compaction: extract sstable list formatting to a class
  sstables/compaction: scrub_compaction: extract reporting code into static methods
  position_in_partition{_view}: add has_key()
  mutation_fragment_stream_validator: add schema() accessor
2021-07-13 10:29:40 +03:00
Tomasz Grabiec
e947fac74c database: Fix cache metrics not being registered
Introduced in 6a6403d. The default constructor with dummy_app_stats is
also used by production code.

Fixes #9012
Message-Id: <20210712221447.71902-1-tgrabiec@scylladb.com>
2021-07-13 07:50:44 +03:00
Avi Kivity
058afbcee8 build: re-enable -Wmisleading-indentation
This can catch mismatches between visual indication about
control flow and what the compiler actually does. Looks like
boost cleaned up its indentation since it was disabled in
7f38634080 ("dist/debian: Switch to g++-7/boost-1.63 on
Ubuntu 14.04/16.04"). It's unlikely to pop back since modern
compilers enable it by default.

Closes #9015
2021-07-12 22:29:19 +03:00
Avi Kivity
8fb4fe2f24 Merge "reader_concurrency_semaphore: relax on destroy stop checks" from Botond
"
Currently we `assert(_stopped)` in the destructor, but this is too
harsh, especially on freshly created semaphore instances that weren't
even used yet. This basically mandates semaphores to be initialized at
the end of the constructor body, which is very cumbersome.
Beyond that, this series relaxes the checks on destroying an unstopped
semaphore that was used previously but is not in use currently. As
destroying such a semaphore without stop() is risky, an error is still
logged.

Tests: unit(dev)
"

* 'reader-concurrency-semaphore-relax-stop-check/v1' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore
  reader_concurrency_semaphore: allow destroying without stop() when not used yet
  reader_concurrency_semaphore: add permit-stats
2021-07-12 20:07:01 +02:00
Nadav Har'El
f540a69a82 Update tools/java submodule
* tools/java 5013321823...79a441972d (2):
  > Add Zstd compressor
  > Settings Schema: fix typo in settings printing

Adding the Zstd compressor fixes #8887.
2021-07-12 20:07:00 +02:00
Avi Kivity
4d48e1e9e1 build: avoid sanitize/coverage builds in multi-mode targets
The default target (i.e. what gets executed under "ninja") excludes
sanitize and coverage modes (since they're useful in special cases
only), but the other multi-mode targets such as "ninja build" do not.
This means that two extra modes are built.

Make things consistent by always using default_modes (which default to
release,dev,debug). This can be overridden using the --mode switch
to configure.py.

Closes #8775
2021-07-12 20:07:00 +02:00
Botond Dénes
f8004c652b reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore
Further relax the conditions under which we abort on destroying an
unstopped semaphore. We already allow destroying completely unused
semaphores, this patch further relaxes this to allow destroying formerly
used but presently not used semaphores without stopping. We still call
`on_internal_error_noexcept()` even if destroying the semaphore is safe,
because without calling `stop()`, destroying the semaphore depends on
luck, which we shouldn't rely on.
2021-07-12 15:53:00 +03:00
Botond Dénes
750b20fd85 reader_concurrency_semaphore: allow destroying without stop() when not used yet
To make it easier to construct objects with semaphore members. When the
construction of such an object fails, it can now just destroy its
semaphore member as usual, without having to employ tricks to make sure
it is stopped beforehand.
2021-07-12 15:53:00 +03:00
Botond Dénes
03959a332b reader_concurrency_semaphore: add permit-stats
Which stores permit related stats. For now only total number of permits
is maintained which is useful to determine whether the semaphore was
used already or not.
2021-07-12 15:53:00 +03:00
Nadav Har'El
3fda13e20e cql-pytest: fix sporadic failure in over-zealous TTL test
We have been seeing rare failures of the cql-pytest (translated from
Cassandra's unit tests) for testing TTL in secondary indexes:
cassandra_tests/validation/entities/secondary_index_test.py::testIndexOnRegularColumnInsertExpiringColumn

The problem is that the test writes an item with a 1 second TTL, then
sleeps *exactly* 1.0 seconds, and expects the item to have disappeared
by that time, which doesn't always happen.

The problem with that assumption stems from Scylla's TTL clock ("gc_clock")
being based on Seastar's lowres clock. lowres_clock only has a 10ms
"granularity": The time Scylla sees when deciding whether an item expires
may be up to 10ms in the past - the arbitrary point when the lowres timer
happened to last run. In rare overload cases, the inaccuracy may be even
greater than 10ms (if the timer got delayed by other things running).

So when Scylla is asked to expire an item in 1 second - we cannot be
sure it will be expired in exactly 1 second or less - the expiration
can also come around 10ms later.

So in this patch we change the test to sleep with more than enough
margin - 1.1 seconds (i.e., 100ms more than 1 second). By that time
we're sure the item must have expired.
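
The reasoning above can be illustrated with a toy model. This is only a
sketch, not Scylla or Seastar code: `is_expired()` and the millisecond
bookkeeping are hypothetical, and the only figure taken from the text is
the 10ms lag of the low-resolution clock.

```python
# Simplified model, not Scylla code: the clock used for expiry checks may
# lag real time by some staleness (up to ~10ms per the description above).
MAX_STALENESS_MS = 10


def is_expired(write_ms, ttl_ms, read_ms, staleness_ms):
    """Return True if the item looks expired to a clock lagging by staleness_ms."""
    observed_now = read_ms - staleness_ms
    return observed_now >= write_ms + ttl_ms


# Sleeping exactly 1.0s only works if the clock happens not to lag at all.
exact_sleep = [is_expired(0, 1000, 1000, s) for s in range(MAX_STALENESS_MS + 1)]

# Sleeping 1.1s leaves a 100ms margin, far more than the 10ms lag.
padded_sleep = [is_expired(0, 1000, 1100, s) for s in range(MAX_STALENESS_MS + 1)]
```

In this model any non-zero lag makes the 1.0 second sleep observe a live
item, while the 1.1 second sleep sees it expired for every lag up to 10ms.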

Before this patch, I saw the test failing once every few hundred runs;
after this patch I ran it 2,000 times without a single failure.

Fixes #9008

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210712100655.953969-1-nyh@scylladb.com>
2021-07-12 13:48:21 +03:00
Yaron Kaikov
aa7c40ba50 dist: build docker based on ubuntu 20.04 OS
Today our docker image is based on CentOS 7. Since CentOS will be EOL in
2024 and no longer has a stable release stream, let's move our docker
image to be based on Ubuntu 20.04.

Based on the work done in https://github.com/scylladb/scylla/pull/8730,
let's build our docker image based on local packages using buildah

Closes #8849
2021-07-12 13:32:03 +03:00
Piotr Jastrzebski
c010cefc4d cdc: Handle compact storage tables correctly
When a table with compact storage has no regular column (only primary
key columns), an artificial column of type empty is added. Such column
type can't be returned via CQL so CDC Log shouldn't contain a column
that reflects this artificial column.

This patch does two things:
1. Make sure that CDC Log schema does not contain columns that reflect
   the artificial column from a base table.
2. When composing mutation to CDC Log, omit the artificial column.

Fixes #8410

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8988
2021-07-12 12:17:35 +03:00
Nadav Har'El
2cc8c40c07 Merge 'Fix some issues found by gcc 11' from Avi Kivity
This series fixes some issues that gcc 11 complains about. I believe all
are correct errors from the standard's view. Clang accepts the changed code.

Note that this is not enough to build with gcc 11, but it's a start.

Closes #9007

* github.com:scylladb/scylla:
  utils: compact-radix-tree: detemplate array_of<>
  utils: compact-radix-tree: don't redefine type as member
  raft: avoid changing meaning of a symbol inside a class
  cql3: lists: catch polymorphic exceptions by reference
2021-07-12 11:17:57 +03:00
Avi Kivity
4433178ccb utils: exceptions: convert sprint() to format()
sprint() is type-unsafe, and we are converging on format(). Convert
exceptions.hh to format().

Closes #9006
2021-07-12 11:17:57 +03:00
Botond Dénes
5c39a2921e test/boost/sstable_datafile_test: add test for validation compaction
Add two new unit tests, one which covers the whole stack, and one which
stresses the validation part.
2021-07-12 10:25:15 +03:00
Botond Dénes
8296759cce test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function
So the two tests having this almost identical code can share it, and so
that it can be used by new tests.
2021-07-12 10:25:15 +03:00
Botond Dénes
b0ef57c833 api: storage_service: expose validation compaction 2021-07-12 10:25:15 +03:00
Botond Dénes
47283ed151 sstables/compaction_manager: add perform_sstable_validation()
Exposing validation compaction on the compaction manager level.
To keep things simple, validation compaction uses the custom job
infrastructure.
2021-07-12 10:25:15 +03:00
Botond Dénes
4c05e5f966 sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME
Run this compaction in the maintenance group which is now available,
resolving the FIXME asking for this.
2021-07-12 10:25:15 +03:00
Botond Dénes
c8f8e9232c sstables/compaction_manager: add maintenance scheduling group
rewrite_sstables() wants to be run in the maintenance group and soon we
will add another compaction type which also wants to be run in the
said group. To enable this propagate the maintenance scheduling group
(both CPU and IO) to the compaction manager.
2021-07-12 10:25:15 +03:00
Botond Dénes
12b8b650b7 sstables/compaction_manager: drop _scheduling_group field
Use the equivalent _compaction_controller.sg() instead.
2021-07-12 10:25:15 +03:00
Botond Dénes
75bad71f0e sstables/compaction_manager: run_custom_job(): replace parameter name with compaction type
All callers use it to do operations that are closely associated with one
of the standard compaction types, so no reason to pass in a custom
string instead of the compaction type enum.
2021-07-12 10:25:15 +03:00
Botond Dénes
ddf2700b2e sstables/compaction_manager: run_custom_job(): keep job function alive
For the duration of the job, allowing coroutine lambdas to be used as
well.
2021-07-12 10:25:15 +03:00
Botond Dénes
891921377d sstables/compaction_descriptor: compaction_options: add validation compaction type
This enables starting validation compaction via `compact_sstables()`.
2021-07-12 10:25:15 +03:00
Botond Dénes
349a3ed4e8 sstables/compaction: compaction_options::type(): add static assert for size of index_to_type
To remind those adding a member to the variant, to also add the
corresponding entry here.
2021-07-12 10:25:15 +03:00
Botond Dénes
a57caf5229 sstables/compaction: implement validation compaction type
Validation just reads all the passed-in sstables and runs the mutation
stream through a mutation fragment stream validator, logging all errors
found, and finally also logging whether all the sstables are valid or
not. Validation is not really a compaction as it doesn't write any
output. As such it bypasses most of the usual compaction machinery, so
the latter doesn't have to be adapted to this outlier.
This patch only adds the implementation, but it still cannot be started
via `compact_sstables()`, that will be implemented by the next patches.
2021-07-12 10:25:15 +03:00
Botond Dénes
cae8624edb sstables/compaction: extract compaction info creation into static method
To make this snippet reusable by the soon-to-be-added validation
compaction as well.
2021-07-12 07:53:11 +03:00
Botond Dénes
3b5ae0b894 sstables/compaction: extract sstable list formatting to a class
To make it reusable both inside compaction class itself (between
compaction start and end messages) and for outside code as well.
2021-07-12 07:11:29 +03:00
Botond Dénes
35f49a5baa sstables/compaction: scrub_compaction: extract reporting code into static methods
All the error messages reporting about invalid bits found in the stream.
This allows reusing these messages in the soon-to-be-added validation
compaction. In the process, the error messages are made more
comprehensive and more uniform as well.
2021-07-12 07:11:29 +03:00
Botond Dénes
5e77f07263 position_in_paritition{_view}: add has_key() 2021-07-12 07:11:29 +03:00
Botond Dénes
7cf5b43bbc mutation_fragment_stream_validator: add schema() accessor 2021-07-12 07:11:29 +03:00
Avi Kivity
29c9570556 utils: compact-radix-tree: detemplate array_of<>
The radix tree template defines a nested class template array_of;
both a generic template and a fully specialized version. However,
gcc (I believe correctly) rejects the fully specialized template
that happens to be a member of another class template.

As it happens, we don't really need a template here at all. Define
a non-template class for each of the cases we need, and use
std::conditional_t to select the type we need.
2021-07-11 18:16:21 +03:00
Avi Kivity
f576ecb7cc utils: compact-radix-tree: don't redefine type as member
The `direct_layout` and `indirect_layout` template classes accept
a template parameter named `Layout` of type `layout`, and re-export
`Layout` as a static data member named `layout`. This redefinition
of `layout` is disliked by gcc. Fix by renaming the static data member
to `this_layout` and adjust all references.
2021-07-11 18:16:21 +03:00
Avi Kivity
332b5c395f raft: avoid changing meaning of a symbol inside a class
The construct

struct q {
    a a;
};

Changes the meaning of `a` from a type to a data member. gcc dislikes
it and I agree. Fully qualify the type name to avoid an error.
2021-07-11 18:16:21 +03:00
Avi Kivity
cb8ef1489b cql3: lists: catch polymorphic exceptions by reference
gcc 11 notes that catching polymorphic exceptions by value is a bad
idea; the resulting copy can slice the exception object. Fix by
catching by reference.
2021-07-11 17:34:43 +03:00
Avi Kivity
222ef17305 build, treewide: enable -Wredundant-move
Returning a function parameter guarantees copy elision and does not
require a std::move().  Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.

Closes #9004
2021-07-11 12:53:02 +03:00
Dejan Mircevski
7119730f2d cql3: Don't look for indexed column in CK prefix
When creating an index-table query, we form its clustering-key
restrictions by picking the right restrictions from the WHERE clause.
But we skip the indexed column, which isn't in the index-table clustering
key.  This is, however, both incorrect and unnecessary:

It is incorrect because we compare the column IDs from different schemas
(indexed table vs. base table).  We should instead be comparing column
names.

It is unnecessary because this code is only executed when the whole
partition key plus a clustering prefix is specified in the WHERE clause.
In such cases, the index cannot possibly be on a member of the
clustering prefix, as such a query would be satisfied out of the base
table.  Therefore, it is redundant to check for the indexed column among
the CK prefix elements.

A careful reader will note that this check was first introduced to fix
the issue #7888 in commit 0bd201d.  But it now seems to me that that fix
was misguided.  The root problem was the old code miscalculating the
clustering prefix by including too many columns in it; it should have
stopped before reaching the indexed column.  The new code, introduced by
commit 845e36e76, calculates the clustering prefix correctly, never
reaching the indexed column.

(Details, for the curious: the old code invoked
 clustering_key_restrictions::prefix_size(), which is buggy -- it
 doesn't check the restriction operator.  It will, for instance,
 calculate the prefix of `c1=0 AND c2 CONTAINS 0 AND c3=0` as 3, because
 it restricts c1, c2, and c3.  But the correct prefix is clearly 1,
 because c2 is not restricted by equality.)
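
The difference between the two calculations can be sketched as follows.
This is an illustration only, not the cql3 code: both function names and
the dict-of-operators representation of the WHERE clause are hypothetical.

```python
def buggy_prefix_size(ck_columns, restrictions):
    """Count leading clustering columns that have any restriction at all."""
    size = 0
    for column in ck_columns:
        if column not in restrictions:
            break
        size += 1
    return size


def equality_prefix_size(ck_columns, restrictions):
    """Count leading clustering columns that are restricted by equality."""
    size = 0
    for column in ck_columns:
        if restrictions.get(column) != "=":
            break
        size += 1
    return size


# c1=0 AND c2 CONTAINS 0 AND c3=0, keyed by column name
where = {"c1": "=", "c2": "CONTAINS", "c3": "="}
columns = ["c1", "c2", "c3"]
buggy = buggy_prefix_size(columns, where)
correct = equality_prefix_size(columns, where)
```

For the example restriction, the first function yields 3 and the second
yields 1, matching the numbers given above.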

Tests: unit (dev, debug)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8993
2021-07-08 21:39:38 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.

Closes #8992
2021-07-08 17:52:34 +03:00
Avi Kivity
fe4002c6c4 Update seastar submodule
* seastar eaa00e761f...8ed9771ae9 (5):
  > *: drop prometheus protobuf support
  > reactor: Fix calculations of bandwidth in legacy mode
  > gate: add gate::holder
  > Revert "gate: add gate::holder", does not build.
  > gate: add gate::holder
2021-07-08 17:42:39 +03:00
Avi Kivity
f756f34392 Merge "Add scylla-bench datasets to perf_fast_forward" from Tomasz
"
After this series one can use perf_fast_forward to generate the data set.
It takes a lot less time this way than to use scylla-bench.
"

* 'perf-fast-forward-scylla-bench-dataset' of github.com:tgrabiec/scylla:
  tests: perf_fast_forward: Use data_source::make_ck()
  tests: perf_fast_forward: Move declaration of clustered_ds up
  tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default
  tests: perf_fast_forward: Add data sets which conform to scylla-bench schema
2021-07-08 17:33:30 +03:00
Nadav Har'El
d0546a9bb5 cql-pytest: improve README
This patch adds to cql-pytest/README.md a paragraph on where run /
run-cassandra expect to find Scylla or Cassandra, and how to override
that choice.

Also make a couple of trivial formatting changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708142730.813660-1-nyh@scylladb.com>
2021-07-08 17:29:20 +03:00
Avi Kivity
4f1e21ceac Merge "reader_concurrency_semaphore: get rid of global semaphores" from Botond
"
When obtaining a valid permit was made mandatory, code which now had to
create reader permits but didn't have a semaphore handy suddenly found
itself in a difficult situation. Many places and most prominently tests
solved the problem by creating a thread-local semaphore to source
permits from. This was fine at the time but as usual, globals came back
to haunt us when `reader_concurrency_semaphore::stop()` was
introduced, as these global semaphores had no easy way to be stopped
before being destroyed. This patch-set cleans up this wart, by getting
rid of all global semaphores, replacing them with appropriately scoped
local semaphores, that are stopped after being used. With that, the
FIXME in `~reader_concurrency_semaphore()` can be resolved and we can
finally `assert()` that the semaphore was stopped before being
destroyed.

This series is another preparatory one for the series which moves the
semaphore in front of the cache.

tests: unit(dev)
"

* 'reader-concurrency-semaphore-mandatory-stop/v2' of https://github.com/denesb/scylla: (26 commits)
  reader_concurrency_semaphore: assert(_stopped) in the destructor
  test/lib: remove now unused reader_permit.{hh,cc}
  test/boost: migrate off the global test reader semaphore
  test/manual: migrate off the global test reader semaphore
  test/unit: migrate off the global test reader semaphore
  test/perf: migrate off the global test reader semaphore
  test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
  test/lib: migrate off the global test reader semaphore
  test/lib/simple_schema: migrate off the global test reader semaphore
  test/lib/sstable_utils: migrate off the global test reader semaphore
  test/lib/test_services: migrate off the global test reader semaphore
  test/lib/sstable_test_env: add reader_concurrency_semaphore member
  test/lib/cql_test_env: add make_reader_permit()
  test/lib: add reader_concurrency_semaphore.hh
  test/boost/sstable_test: migrate row counting tests to seastar thread
  test/boost/sstable_test: test_using_reusable_sst(): pass env to func
  test/lib/reader_lifecycle_policy: add permit parameter to factory function
  test/boost/mutation_reader_test: share permit between readers in a read
  memtable: migrate off the global reader concurrency semaphore
  mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore
  ...
2021-07-08 17:28:13 +03:00
Botond Dénes
42bd5c980f reader_concurrency_semaphore: assert(_stopped) in the destructor
Now that there are no more global semaphores which are impossible to stop
properly we can resolve the related FIXME and arm the assert in the
semaphore destructor.
We can also remove all the other cleanup code from the destructor as
they are taken care of by stop(), which we now assert to have been run.
2021-07-08 16:53:38 +03:00
Botond Dénes
6b941c4d34 test/lib: remove now unused reader_permit.{hh,cc}
Finally getting rid of the global test reader concurrency semaphore.
2021-07-08 16:53:38 +03:00
Botond Dénes
2d2b9e7b36 test/boost: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
0bf07cde7b test/manual: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
18e0c40c5d test/unit: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
37a1e506b1 test/perf: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
2454811dd6 test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
A convenience, self-closing wrapper for those perf tests that have no
way to stop the semaphore and wait for it too.
2021-07-08 16:53:38 +03:00
Nadav Har'El
e22a52e69c cql-pytest: fix tests on Cassandra 3
After commit 76227fa ("cql-pytest: use NetworkTopologyStrategy, not
SimpleStrategy"), the cql-pytest tests now use NetworkTopologyStrategy instead
of SimpleStrategy in the test keyspaces. The tests continued to use the
"replication_factor" option. Support for this option is relatively
recent, and was only added to Cassandra in the 4.0 release series
(see https://issues.apache.org/jira/browse/CASSANDRA-14303). So users
who happen to have Cassandra 3 installed and want to run a cql-pytest
against it will see the test failing when it can't create a keyspace.

This patch trivially fixes the problem by using the name of the current
DC (automatically determined) instead of the word 'replication_factor'.
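
A rough sketch of the resulting statement, assuming the DC name has
already been obtained (for instance from the `data_center` column of
`system.local`). The helper `create_keyspace_cql()` and the DC name
`datacenter1` are hypothetical and are not taken from the test suite.

```python
def create_keyspace_cql(keyspace, dc_name, rf=1):
    """Build a CREATE KEYSPACE statement keyed by DC name, not 'replication_factor'."""
    return (
        f"CREATE KEYSPACE {keyspace} WITH REPLICATION = "
        f"{{ 'class' : 'NetworkTopologyStrategy', '{dc_name}' : {rf} }}"
    )


# Hypothetical DC name; a real fixture would detect it from the cluster.
statement = create_keyspace_cql("test_ks", "datacenter1")
```

Naming the DC explicitly is the form of NetworkTopologyStrategy options
that Cassandra 3 also accepts, which is why it works across versions.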

Almost all tests are fixed by a single fix to the test_keyspace fixture
which creates one keyspace used by most tests. Additional changes were
needed in test_keyspace.py, for tests which explicitly create keyspaces.

I tested the result on Cassandra 3.11.10, Cassandra 4 (git master) and
Scylla.

Fixes #8990

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708123428.811184-1-nyh@scylladb.com>
2021-07-08 15:35:21 +02:00
Nadav Har'El
eb11ce046c cql-pytest: add reproducer for concurrent DROP KEYSPACE bug
We know that today in Scylla concurrent schema changes done on different
coordinators are not safe - and we plan to address this problem with Raft.
However, the test in this patch - reproducing issue #8968 - demonstrates
that even on a single node concurrent schema changes are not safe:

The test involves one thread which constantly creates a keyspace and
then a table in it - and a second thread which constantly deletes this
keyspace. After doing this for a while, the schema reaches an inconsistent
state: The keyspace is in a state of limbo where it cannot be dropped
(dropping it succeeds, but doesn't actually drop it), and a new keyspace
cannot be created under the same name.

Note that to reproduce this bug, it was important that the test create
both a keyspace and a table. Were the test to just create an empty keyspace,
without a table in it, the bug would not be reproduced.

Refs #8968.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210704121049.662169-1-nyh@scylladb.com>
2021-07-08 15:35:03 +02:00
Botond Dénes
0e78399051 test/lib: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
5fff314739 test/lib/simple_schema: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
d520655730 test/lib/sstable_utils: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
3679418e62 test/lib/test_services: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
0acc4d63da test/lib/sstable_test_env: add reader_concurrency_semaphore member
To enable tests using the test env to conveniently create permits for
themselves, reducing the pain of migrating to local semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
7174d1beee test/lib/cql_test_env: add make_reader_permit()
A convenience method, allowing tests using the cql test env to
conveniently create a permit, reducing the pain of migrating to local
semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
b739525fb6 test/lib: add reader_concurrency_semaphore.hh
Supplying a convenience semaphore wrapper, which stops the contained
semaphore when destroyed. It also provides a more convenient
`make_permit()`.  This class is intended to make the migration to local
semaphores less painful.
2021-07-08 15:28:36 +03:00
Benny Halevy
fa5d70da32 storage_proxy: abstract_read_resolver: handle semaphore_timed_out error
semaphore_timed_out errors should be ignored, similar to
rpc::timeout_error or seastar::timed_out_error, so that they are
eventually converted to `read_timeout_exception` via
the data/digest read resolver on_timeout() method.

Otherwise, the semaphore timeout is mistranslated to
read_failure_exception, via on_error().

Note that originally the intention was to change the exception
thrown by the reader_concurrency_semaphore expiry_handler, but
there are already several places in the code that catch and handle
the semaphore_timed_out exception that would need to be changed,
increasing the risk in this change.

Fixes #8958

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210708083252.1934651-2-bhalevy@scylladb.com>
2021-07-08 15:23:30 +03:00
Benny Halevy
023d103fee utils: exceptions: is_timeout_exception: add timed_out_error
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210708083252.1934651-1-bhalevy@scylladb.com>
2021-07-08 15:23:29 +03:00
Nadav Har'El
814c4ad4ce cql-pytest: fix run-cassandra for older versions of Cassandra
In older versions of Cassandra (such as 3.11.10 which I tried), the
CQL server is not turned on by default, unless the configuration file
explicitly has "start_native_transport: true" - without it only the
Thrift server is started.

So fix the cql-pytest/run-cassandra to pass this option. It also
works correctly in Cassandra 4.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708113423.804980-1-nyh@scylladb.com>
2021-07-08 14:59:09 +03:00
Avi Kivity
7d214800d0 Merge 'Generate view updates in smaller parts' from Piotr Sarna
In order to avoid large allocations and too large mutations
generated from large view updates, granularity of the process
is broken down from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
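
The chunking idea can be sketched like this. It is a conceptual
illustration rather than the view update builder itself: `chunked()` is a
hypothetical helper, and only the limit of 100 rows comes from the text.

```python
from itertools import islice

MAX_ROWS_PER_UPDATE = 100  # limit stated in the description above


def chunked(rows, max_rows=MAX_ROWS_PER_UPDATE):
    """Yield successive lists of at most max_rows items from rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, max_rows))
        if not batch:
            return
        yield batch


# 250 deleted base rows are turned into three partial updates.
batch_sizes = [len(batch) for batch in chunked(range(250))]
```

Because each batch is materialized on its own, peak memory in this sketch
is bounded by the batch size instead of the partition size.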

The series was tested manually with a particular scenario in mind -
deleting a large base partition, which results in creating
a view update per each deleted row - which, with sufficiently
large partitions, can reach millions. Before the series, Scylla
experienced an out-of-memory condition after the view update
generation mechanism tried to load too much data into a contiguous
buffer. Multiple large allocation warnings and reactor stalls were
observed as well. After the series, the operation is still rather slow,
but does not induce reactor stalls nor allocator problems.
A reduced version of the above test is added as a unit test -
it does not check for huge partitions, but instead uses a number
just large enough to cause the update generation process to be split
into multiple chunks.

Fixes #8852

Closes #8906

* github.com:scylladb/scylla:
  cql-pytest: add a test case for base range deletion
  cql-pytest: add a test case for base partition deletion
  table: elaborate on why exceptions are ignored for view updates
  view: generate view updates in smaller parts
  table: coroutinize generating view updates
  db,view: move view_update_builder to the header
2021-07-08 12:57:05 +03:00
Piotr Sarna
bc0038913c cql-pytest: add a test case for base range deletion
The test case checks that deleting a base table clustering range
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough ranges,
the number can reach millions and beyond.
2021-07-08 11:43:08 +02:00
Piotr Sarna
ef47b4565c cql-pytest: add a test case for base partition deletion
The test case checks that deleting a whole base table partition
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough partitions,
the number can reach millions and beyond.
2021-07-08 11:42:54 +02:00
Botond Dénes
b9a5fd57bf test/boost/sstable_test: migrate row counting tests to seastar thread
To facilitate further patching.
2021-07-08 12:38:21 +03:00
Botond Dénes
fb310ec6e7 test/boost/sstable_test: test_using_reusable_sst(): pass env to func
To facilitate further patching.
2021-07-08 12:38:19 +03:00
Botond Dénes
46d21e842d test/lib/reader_lifecycle_policy: add permit parameter to factory function
The factory method doesn't match the signature of
`reader_lifecycle_policy::make_reader()`, notably the permit is missing.
Add it as it is important that the wrapping evictable reader and
underlying reader share the permits.
2021-07-08 12:31:36 +03:00
Botond Dénes
2a45d643b6 test/boost/mutation_reader_test: share permit between readers in a read
Permits were designed such that there is one permit per read, being
shared by all readers in that read. Make sure readers created by tests
adhere to this.
2021-07-08 12:31:36 +03:00
Botond Dénes
0f36e5c498 memtable: migrate off the global reader concurrency semaphore
Require the caller of `create_flush_reader()` to pass a permit instead.
2021-07-08 12:31:36 +03:00
Botond Dénes
7a4381b491 mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore
Use a local one instead, and stop it when the writer is destroyed.
2021-07-08 12:31:36 +03:00
Botond Dénes
17a0e22cb1 sstables: mx/writer: migrate off the global reader concurrency_semaphore
And use a local one instead, stopping it when the writer is destroyed.
2021-07-08 12:31:36 +03:00
Botond Dénes
f1c1e05a05 sstables: stop semaphores 2021-07-08 12:31:36 +03:00
Botond Dénes
c51892f02e sstables: sstable::has_partition_key(): convert to coroutine 2021-07-08 12:31:36 +03:00
Botond Dénes
c0a8068c16 sstables: generate_summary(): fix indentation 2021-07-08 12:31:36 +03:00
Botond Dénes
fec137f3f6 sstables: generate_summary(): make it a coroutine
Indentation is left broken.
2021-07-08 12:31:36 +03:00
Botond Dénes
c4e71fb9b8 reader_concurrency_semaphore: remove default name parameter
Naming the concurrency semaphore is currently optional, unnamed
semaphores defaulting to "Unnamed semaphore". Although the most
important semaphores are named, many still aren't, which makes for a
poor debugging experience when one of these times out.
To prevent this, remove the name parameter defaults from those
constructors that have it and require a unique name to be passed in.
Also update all sites creating a semaphore and make sure they use a
unique name.
2021-07-08 12:31:36 +03:00
Piotr Sarna
6a461d00c6 table: elaborate on why exceptions are ignored for view updates
The generate_and_propagate_view_updates() function explicitly
ignores exceptions reported from the underlying view update
propagation layer. This decision is now explained in the comment.
2021-07-08 11:21:55 +02:00
Piotr Sarna
bf0777e97a view: generate view updates in smaller parts
In order to avoid large allocations and too large mutations
generated from large view updates, granularity of the process
is broken down from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
2021-07-08 11:17:27 +02:00
Piotr Sarna
1000d52cfa table: coroutinize generating view updates
... which will make the incoming changes easier to review.
2021-07-08 11:17:27 +02:00
Piotr Sarna
679dc4d824 db,view: move view_update_builder to the header
The builder is going to be used directly by the callers,
which requires making its definition public.
No semantic changes were intended.
2021-07-08 11:17:27 +02:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00
Tomasz Grabiec
33cba08735 tests: perf_fast_forward: Use data_source::make_ck()
Data sources differ in clustering key type. Make sure to use the right
data_value instance to produce correct keys.
2021-07-07 20:27:44 +02:00
Tomasz Grabiec
fa481e92c1 tests: perf_fast_forward: Move declaration of clustered_ds up 2021-07-07 20:27:44 +02:00
Tomasz Grabiec
407e42f5d8 tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default
This dataset exists for convenience, to be able to run scylla-bench
against the data set generated by perf_fast_forward. It doesn't
increase coverage. So do not include it by default to not waste
resources on it.
2021-07-07 20:27:44 +02:00
Tomasz Grabiec
d7250a12fd tests: perf_fast_forward: Add data sets which conform to scylla-bench schema
Useful for fast generation of test data.
2021-07-07 20:27:44 +02:00
Avi Kivity
5571ef0d6d compression: define 'class' attribute for compression and deprecate 'sstable_compression'
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow by supporting both.

The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.

To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.
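
The key-preservation behavior can be sketched as a toy Python model, assuming a map of string options. The class below is not the real compression_parameters implementation; its names and the handling of a map containing both keys are assumptions made for illustration.

```python
from typing import Dict

NEW_KEY = "class"
OLD_KEY = "sstable_compression"  # deprecated name


class CompressionParameters:
    """Toy model: remembers which key name the options map used."""

    def __init__(self, options: Dict[str, str]):
        # Assumption: a map containing both keys is not modeled here.
        self._key = OLD_KEY if OLD_KEY in options else NEW_KEY
        self.compressor = options.get(self._key, "")

    def to_options(self) -> Dict[str, str]:
        # Emit the same key name that was used at construction time, so a
        # node that only knows the old name never sees the new one.
        return {self._key: self.compressor}


old_style = CompressionParameters({"sstable_compression": "LZ4Compressor"})
new_style = CompressionParameters({"class": "LZ4Compressor"})
```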

Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.

Fixes #8948.

Closes #8949
2021-07-07 19:15:20 +02:00
Avi Kivity
99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve efficiency of reads from large partitions by
reducing amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache
entry to store the vtable pointer.

SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to promoted index would have to be
preceded with the parsing of the partition index page containing the partition key.
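
A minimal Python sketch of a cache whose unit is a whole parsed page, with LRU eviction, is given below. It only illustrates the idea: the real cache lives in LSA and shares its LRU with the row cache, which this sketch does not model, and all names here are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LruPageCache:
    """Caches parsed pages and evicts the least recently used one."""

    def __init__(self, capacity: int, load_page: Callable[[Hashable], Any]):
        self._capacity = capacity
        self._load_page = load_page  # parses a page on a miss
        self._pages: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, page_id: Hashable) -> Any:
        if page_id in self._pages:
            self.hits += 1
            self._pages.move_to_end(page_id)  # mark as most recently used
            return self._pages[page_id]
        self.misses += 1
        page = self._load_page(page_id)
        self._pages[page_id] = page
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return page


def parse_page(page_id):
    # Stand-in for parsing a partition index page from the index file.
    return f"parsed-page-{page_id}"


cache = LruPageCache(capacity=2, load_page=parse_page)
for page_id in (1, 1, 2, 3, 1):
    cache.get(page_id)
```

The second access to page 1 is served from memory; the last one is a miss again because page 1 was evicted to make room for page 3.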

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5x median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms
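
The TL;DR figures can be cross-checked against the raw numbers above with a few lines of Python. The values are copied from the two result blocks; the variable names are just for illustration.

```python
# Operations/s and median latency copied from the "Before" and "After" runs.
before_ops, after_ops = 3011.997048812112, 6491.22243800942
before_median_ms, after_median_ms = 34.537471, 15.040511

throughput_ratio = after_ops / before_ops
median_ratio = after_median_ms / before_median_ms

print(f"throughput: {throughput_ratio:.2f}x, median latency: {median_ratio:.2f}x")
```

Both ratios are consistent with the rounded "2x" and "0.5" quoted in the TL;DR.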

2) scylla-bench small partitions

  I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
  half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
  code performed slightly better.

  Below is a representative run over data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase which populates the partition index cache, so the hot-cache effect
  is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which
  moves it lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here, included to show there are no regressions for the cold cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled, each line shows the median run from many iterations

    TL;DR: We can observe reduction in IO which translates to reduction in execution time,
           especially for slicing in the middle of partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
   which may reduce the hit ratio.

 - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
   page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
   the partition index page can be hot.

 - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's
   bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the
   partition entry and then let binary search read the rest.

In V2:

 - Fixed perf_fast_forward regression in the number of IOs used to read partition index page
   The reader uses 32K reads, which were split by page cache into 4K reads
   Fix by propagating IO size hints to page cache and using single IO to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind

 - Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00
Takuya ASADA
def81807aa scylla-fstrim.timer: drop BindsTo=scylla-server.service
To avoid restarting scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.

Fixes #8921

Closes #8973
2021-07-07 17:36:24 +03:00
Dejan Mircevski
7d6ef0de8d cql3: Drop more dead code
After 845e36e76 "cql3: Use expr for global-index partition slice",
there is actually more dead code than was initially dropped.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8981
2021-07-07 13:59:58 +02:00
Calle Wilund
ce45ffdffb commitlog: Use defensive copies of segment list in iterations
Fixes #8952

In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all",
need to operate on a defensive copy of the list.
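
The hazard can be illustrated with a small Python sketch. It is not the commitlog code; the class and method names are hypothetical, and the only assumption taken from the message is that flushing a segment may prune it from the live list.

```python
class SegmentManager:
    """Toy model where flushing a segment also prunes it from the list."""

    def __init__(self, segments):
        self.segments = list(segments)
        self.flushed = []

    def _flush(self, segment):
        self.flushed.append(segment)
        # Side effect: the flushed segment is pruned from the live list.
        self.segments.remove(segment)

    def flush_all_unsafe(self):
        # Iterates the live list while it shrinks, so elements get skipped.
        for segment in self.segments:
            self._flush(segment)

    def flush_all(self):
        # Defensive copy: the iteration is unaffected by pruning.
        for segment in list(self.segments):
            self._flush(segment)


unsafe = SegmentManager(["s1", "s2", "s3"])
unsafe.flush_all_unsafe()

safe = SegmentManager(["s1", "s2", "s3"])
safe.flush_all()
```

In the unsafe variant one segment is never flushed; in C++ the equivalent mistake would typically be iterator invalidation rather than a silent skip.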

Closes #8980
2021-07-07 13:30:37 +02:00
Pavel Emelyanov
63a2fed585 hasher: More picky noexcept marking of feed_hash()
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite reckless as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
marked noexcept and seems to throw.

The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by 0af7a22c21, so the feed_hash() functions no longer need to be
noexcept.

The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.

fixes: #8983
tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
2021-07-07 12:00:16 +03:00
Pavel Solodovnikov
b959f5d394 test: lib: copy query_options in single_node_cql_env::execute_cql()
`query_processor::execute_direct()` takes a non-const ref
to query options, meaning it's not safe to pass the same
instance to subsequent invocations of `execute_direct()`
in the tests.

Copy default query options at each invocation of `execute_cql()`
so no possible side-effects can occur.
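
The pattern is sketched below in Python under stated assumptions: `QueryOptions`, `execute_direct` and `execute_cql` are hypothetical stand-ins for the C++ test helpers, and the mutation performed by `execute_direct` is invented purely to show why sharing one instance is unsafe.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryOptions:
    page_size: int = 100
    paging_state: Optional[bytes] = None


DEFAULT_OPTIONS = QueryOptions()


def execute_direct(query: str, options: QueryOptions) -> None:
    # Stand-in for a callee that is allowed to modify its options argument.
    options.paging_state = query.encode()


def execute_cql(query: str) -> QueryOptions:
    # Copy the defaults on every call so side effects cannot leak between calls.
    options = copy.copy(DEFAULT_OPTIONS)
    execute_direct(query, options)
    return options


used = execute_cql("SELECT * FROM t")
```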

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>
2021-07-07 11:46:50 +03:00
Nadav Har'El
775a64b003 test/alternator: test for change in CDC preimage
In pull request #8568, the CDC API changed slightly, with preimage data
gaining extra "delete$k" values for columns whose preimage was missing.
In this new test, we verify that this change did not break Alternator.
We didn't expect it to break Alternator, because it just outputs the known
base-table columns and ignores the columns which weren't a real base-table
column - like this "delete$k".

In the test we set up a stream with preimages, ensure that a real column
(note that an LSI key is a real column instead of a map element) has a
null preimage - and see that the preimage is returned as expected,
without fake columns like "delete$k".

The test passes, showing that PR #8568 was ok.
The test also passes, as expected, on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210504120121.915829-1-nyh@scylladb.com>
2021-07-06 14:53:42 +02:00
Nadav Har'El
76227fafad cql-pytest: use NetworkTopologyStrategy, not SimpleStrategy
All tests in cql-pytest use a test keyspace created with the SimpleStrategy
replication strategy. This was never intentional. We are recommending to
users that they should use NetworkTopologyStrategy instead, and even
want to deprecate SimpleStrategy (this is #8586), so tests should stop
using SimpleStrategy and should start using the same strategy users would
use in their applications - NetworkTopologyStrategy.

Almost all tests are fixed by a single change in conftest.py which
changes how "test_keyspace" is created. But additionally, tests in
test_keyspace.py which explicitly create keyspaces (that's the point of
that test file...) also had to be modified to use NetworkTopologyStrategy.
Note that none of the tests relied on any special features or
implementation details of SimpleStrategy.

This patch is part of the bigger effort to remove reliance on
SimpleStrategy from all tests, of all types - which we will need to do if
we ever want to forbid SimpleStrategy by default. The issue of that effort:
Refs #8638

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620102341.195533-1-nyh@scylladb.com>
2021-07-06 14:52:46 +02:00
Benny Halevy
ac7db8a043 repair: row_level: coroutinize clear_gently
Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210706090252.1776563-1-bhalevy@scylladb.com>
2021-07-06 13:42:45 +03:00
Benny Halevy
612793c2d4 locator: token_metadata: reuse utils::stall_free clear_gently helpers
Use the generic clear_gently functions that were added
in eca9f45c59.

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210706090243.1776466-1-bhalevy@scylladb.com>
2021-07-06 12:06:43 +03:00
Avi Kivity
4c01a88c9d logalloc: do not capture backtraces by default in debug mode
logalloc has a nice leak/double-free sanitizer, with the nice
feature of capturing backtraces to make error reports easy to
track down. But capturing backtraces is itself very expensive.

This patch makes backtrace capture optional, reducing database_test
runtime from 30 minutes to 20 minutes on my machine.

Closes #8978
2021-07-06 00:18:22 +02:00
Avi Kivity
2c8d84b864 Merge "Make logging for sstable data corruption useful" from Raphael
"
When a corrupted sstable fails to be read either on regular read or in
regular compaction, our logging is not useful as it can't pinpoint
the SSTable that was being read from, also it may not print useful
details about the corruption.

For example, when a compaction fails on data corruption, a cryptic
message as follows will be dumped:
    compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying

there are two problems with the log above:
    1) it doesn't tell us which sstable is corrupted
    2) it doesn't tell us detailed info about the checksum failure on compressed chunk

with those problems fixed, we'll now get a much more useful message:
    compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition
        from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491
        failed checksum, expected=0, actual=1422312584): retrying
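
The general technique of attaching the sstable name to a lower-level error can be sketched in Python as follows. The exception class and function names are hypothetical, and the path is a placeholder; only the shape of the resulting message follows the example above.

```python
class MalformedSstableError(Exception):
    """Hypothetical counterpart of sstables::malformed_sstable_exception."""


def read_partition(sstable_path: str, read_chunk):
    try:
        return read_chunk()
    except RuntimeError as err:
        # Re-raise with the file name so the log pinpoints the corrupted sstable.
        raise MalformedSstableError(
            f"Failed to read partition from SSTable {sstable_path} due to {err}"
        ) from err


def failing_chunk_reader():
    raise RuntimeError("compressed chunk failed checksum")


try:
    read_partition("/tmp/example-Data.db", failing_chunk_reader)
    message = ""
except MalformedSstableError as err:
    message = str(err)
```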

tests: mode(dev).
"

* 'log_data_corruption_v2.1' of github.com:raphaelsc/scylla:
  sstables: Attach sstable name to exception triggered in sstable mutation reader
  test/broken_sstable_test: Make test more robust
  sstables: Make log more useful when compressed chunk fails checksum
  sstables: Use correct exception when compressed chunk fails checksum
2021-07-05 20:37:19 +03:00
Piotr Jastrzebski
27fe3c3aa0 partition_snapshot_flat_reader: Fix typo in next_range_rombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2fd48c08092d6ed1b452b6fe2e43e2273a78c8c2.1625500334.git.piotr@scylladb.com>
2021-07-05 19:39:08 +03:00
Takuya ASADA
f19ebe5709 dist/redhat: fix systemd unit name of scylla-node-exporter
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.

Fixes #8966

Closes #8967
2021-07-05 18:06:51 +03:00
Avi Kivity
e2f865c739 Merge 'Use expressions to calculate the global-index partition slice' from Dejan Mircevski
Another step towards dropping the `restrictions` class.  When calculating the partition slice of a global-index table, use `expression` objects instead of a `restrictions` subclass.

Refs #7217.

Tests: unit (all dev, some debug)

Closes #8950

* github.com:scylladb/scylla:
  cql3: Use expr for global-index partition slice
  cql3: Fully explain statement_restrictions members
  cql3: Pass schema reference not pointer
  cql3: Replace count_if with find_atom
  cql3: Fix _partition_range_is_simple calculation
  cql3: Add test cases for indexed partition column
2021-07-05 18:04:54 +03:00
Takuya ASADA
f71f9786c7 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory
after uninstalling Scylla, or may want to upgrade to the enterprise version.

Also, we mixed two types of files as ghost files, which should be handled differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

However, just dropping .mount from the %files section causes another
problem: rpm will remove these files during upgrade, instead of
uninstall (#8924).

To fix both problems, specify .mount files as "%ghost %config".
This keeps the files on both package upgrade and package removal.

See scylladb/scylla-enterprise#1780

Closes #8810
Closes #8924

Closes #8959
2021-07-05 18:03:51 +03:00
Nadav Har'El
12b058abdf Merge 'repair: row_level: clear_gently on close' from Benny Halevy
To prevent a reactor stall as seen in #8926

Fixes #8926

Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_diff_shard_count_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8928

* github.com:scylladb/scylla:
  repair: row_level: clear_gently: clear_gently each repair_row
  repair: row_level: repair_meta: clear_gently on stop
  repair: row_level_repair: run: stop master repair_meta
  utils: stall_free: implement clear_gently of foreign_ptr
  utils: stall_free: define generic clear_gently methods
2021-07-04 23:00:37 +03:00
Piotr Grabowski
64e93ca38c main: fix max-io-requests spelling in warning text
An incorrect spelling of max-io-requests was used in creating the
warning message text. Due to conversion to unsigned, the code would
crash due to bad_lexical_cast exception. The spelling of this
configuration name was fixed in the past (44f3ad836b), but only in the
'if' condition.

Fix by using the correct spelling.

Closes #8963
2021-07-04 18:37:43 +03:00
Tomasz Grabiec
2c727f37fb api: Drop sstable index caches on system/drop_sstable_caches 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
f553db69f7 cached_file: Issue single I/O for the whole read range on miss
Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.

As an optimization, issue a single I/O if the first page is missing.

This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
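
A minimal sketch of the read planning described above, not the actual
cached_file code. The names page_range, plan_io and io_size are
hypothetical, and a 4K page size is assumed in the usage note. Only the
"first page is missing" case is modelled; misses on later pages are out
of scope here.

```cpp
#include <cstddef>
#include <optional>
#include <set>

// Hypothetical page-granular view of a read request: pages [first, last].
struct page_range {
    std::size_t first;
    std::size_t last; // inclusive
};

// Decide what to read from disk, given the set of page indexes already
// cached. If the first page is cached, no I/O is planned by this sketch;
// otherwise a single I/O covers the whole requested range.
std::optional<page_range> plan_io(const std::set<std::size_t>& cached, page_range req) {
    if (cached.count(req.first) != 0) {
        return std::nullopt;
    }
    return req;
}

// Number of bytes the single I/O transfers for a given page size.
std::size_t io_size(page_range r, std::size_t page_size) {
    return (r.last - r.first + 1) * page_size;
}
```

With an empty cache and a request for pages 0..7 at 4K per page, this
plans one 32KB read instead of eight separate page reads.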
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
6a6403d19d row_cache: cache_tracker: Do not register metrics when constructed for tests
Some tests will create two cache_tracker instances because of one
being embedded in the sstable test env.

This would lead to double registration of metrics, which raises a
runtime error. Avoid it by not registering metrics in Prometheus in
tests at all.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1f74863bf8 sstables, cached_file: Evict cache gently when sstable is destroyed
We must evict before the _cached_index_file associated with the
sstable goes away. Better to do it gently to avoid stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
f14576f4be sstables: Hide partition_index_cache implementation away from sstables.hh
Reduces scope of the header to index_reader.hh which reduces
recompilation time.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7d34799f3f sstables: Drop shared_index_lists alias 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
af4cc233c3 sstables: Destroy partition index cache gently
There could be a lot of them so we should clear it gently to avoid
reactor stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
9f957f1cf9 sstables: Cache partition index pages in LSA and link to LRU
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b3728f7d9b utils: Introduce lsa::weak_ptr<>
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferencing it always
gives a valid reference.

This can be used as a building block for implementing cursors into
LSA-based caches.

Example simple use:

     // LSA-managed
     struct X : public lsa::weakly_referencable<X> {
         int value;
     };

     lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
           X* x = current_allocator().construct<X>();
           return x->weak_from_this();
     });

     std::cout << x_ptr->value;
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2a852cd0c9 sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
The new names are less confusing.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
934824394a sstables, cached_file: Avoid copying buffers from cache when parsing promoted index 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7b6f18b4ed cached_file: Introduce get_page_units()
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
23bc19643f sstables: read: Document that primitive_consumer::read_32() is alloc-free
Callers will rely on it to assume that it does not invalidate
references to LSA objects.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b98e660a4a sstables: read: Count partition index page evictions 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
8360a64f73 sstables: Drop the _use_binary_search flag from index entries
It doesn't have to be set by the parser now that the cursors are
created lazily, pass it to the cursor when it's created.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
06e373e272 sstables: index_reader: Keep index objects under LSA
In preparation for caching index objects, manage them under LSA.

Implementation notes:

key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it.  Actual linearization should be rare since partition
keys are typically small.

Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:

  class parsed_promoted_index_entry;
  class parsed_partition_index_entry;

This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
20ef54e9ed lsa: chunked_managed_vector: Adapt more to managed_vector
For seamless transition.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
78e5b9fd85 utils: lsa: chunked_managed_vector: Make LSA-aware
The max chunk size is set to be 10% of segment size.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
856e4a539d test: chunked_managed_vector_test: Make exception_safe_class standard layout
Required by managed_vector<> due to its use of offsetof()

In preparation for switching chunked_managed_vector storage to
managed_vector<>.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
c87ea09535 lsa: Copy chunked_vector to chunked_managed_vector
In preparation for adapting it to LSA. Split into two steps to make
review easier.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1523a7d367 utils: managed_vector: Make clear_and_release() public
Will be needed by index reader to ensure that destructor doesn't
invoke the allocator so that all is destroyed in the desired
allocation context before the object is destroyed.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2b673478aa sstables: index_reader: Do not expose index_entry references
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.

This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a955e7971d sstables: index_reader: Don't store schema reference inside index_entry
To save space.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
9e7bf066a9 sstables: index_reader: Don't store file object inside promoted_index
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
86b135056c sstables: index_reader: Don't store front buffer inside promoted_index
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.

Makes the promoted_index structure easier to move to LSA.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
484e06d69b cached_file: Always start at offset 0
All current uses start at offset 0, so simplify the code by assuming it.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
078a6e422b sstables: Cache all index file reads
After this patch, there is a single index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.

As part of this, cached_file needed to be adjusted in the following ways.

The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, cached_file is changed to keep buffers in LSA
using lsa_buffer allocation method.

When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.

In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializing a
temporary_buffer.

While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
b5ca0eb2a2 lsa: Introduce lsa_buffer
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:

  1) LSA segments only hold buffers, they don't hold metadata. They
     also don't mix with standard allocations. So a 128K segment can
     hold 32 4K buffers.

  2) objects' life time is managed by lsa_buffer, an owning smart
     pointer, which is automatically updated when buffers are migrated
     to another segment. This makes LSA allocations easier to use and
     off-loads metadata management to the client (which can keep the
     lsa_buffer wherever it wants).

The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entangled object inside lsa_buffer.

The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a23f27034f lsa: Introduce entangled helper
Will be useful in building higher-level LSA tools.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
056f14063e lsa: Encapsulate segment_descriptor::_free_space access
Prepares for reusing some of its bits for storing segment kind.
2021-07-02 19:02:13 +02:00
Dejan Mircevski
845e36e761 cql3: Use expr for global-index partition slice
Instead of creating a single_column_clustering_key_restrictions
object, create an equivalent vector of expr::expressions and calculate
from it the clustering ranges just like we do for base-table queries.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:31:33 +02:00
Dejan Mircevski
75f4325ee4 cql3: Fully explain statement_restrictions members
Nail down the assumptions before making further use of these variables.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:31:33 +02:00
Dejan Mircevski
28e92dfa4c cql3: Pass schema reference not pointer
... to get_single_column_clustering_bounds().

No need for the pointer; a reference is simpler and cleaner.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:31:33 +02:00
Dejan Mircevski
de17b5449b cql3: Replace count_if with find_atom
count_if finds all matching atoms, which is redundant when we only
want to find one.
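
The difference can be illustrated with standard algorithms. This is a
sketch only: std::find_if stands in for find_atom, and probe_result and
both helper functions are hypothetical names introduced for the
illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Records how many elements the predicate inspected, to make the cost visible.
struct probe_result {
    bool found;
    std::size_t inspected;
};

// Existence check via count_if: always walks the whole range.
probe_result exists_via_count_if(const std::vector<int>& v, int wanted) {
    std::size_t inspected = 0;
    auto n = std::count_if(v.begin(), v.end(), [&](int x) {
        ++inspected;
        return x == wanted;
    });
    return {n > 0, inspected};
}

// Existence check via find_if: stops at the first match.
probe_result exists_via_find_if(const std::vector<int>& v, int wanted) {
    std::size_t inspected = 0;
    auto it = std::find_if(v.begin(), v.end(), [&](int x) {
        ++inspected;
        return x == wanted;
    });
    return {it != v.end(), inspected};
}
```

Both return the same answer; only the number of inspected elements
differs when a match appears early in the range.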

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:31:33 +02:00
Dejan Mircevski
3a149daab5 cql3: Fix _partition_range_is_simple calculation
Was updated for every restriction instead of only for partition ones.
The only impact is on performance. The bug was introduced in 4661aa0
"cql3: Track IN partition-key restrictions".

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:31:33 +02:00
Dejan Mircevski
53f376b83f cql3: Add test cases for indexed partition column
We didn't have a case when a global index exists on a partition column
and the SELECT statement specifies the full partition key.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:28:56 +02:00
Tomasz Grabiec
019956739d cached_file: Switch to bplus::tree
In order to be able to move it to LSA later.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
f537d1a7e5 tests: sstables: Do not call open_data() twice
make_sstable_containing() already calls open_data(), so does
load(). This will trigger assertion failure added in a later patch:

   assert(!_cached_index_file);

There is no need to call load() here.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
627a2ef087 test: cached_file: Add test for eof_error 2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8fbea0b5b7 utils: cached_file: Introduce file wrapper
It's an adaptor between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8e2118069b sstables: cached_file: Account buffers returned by cached_file under read_permit
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
a5c72ed899 sstables, database: Keep cache_tracker reference inside sstables_manager
So that sstable code can pick it up for caching (lru and region).
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
4b51e0bf30 row_cache: Move cache_tracker to a separate header
It will be needed by the sstable layer to get to the LRU and the
LSA region. Split to avoid inclusion of whole row_cache.hh
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element type.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
6b59c8cfb1 utils: Introduce general-purpose LRU
The LRU can link objects of different types, which is achieved by
having a virtual base class called "evictable" from which the linked
objects should inherit. When the object is removed from the LRU,
evictable::on_evicted() is called.

The container is non-owning.
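
A simplified sketch of the idea, assuming a std::list of pointers in
place of the intrusive links a real implementation would likely use.
simple_lru, row_like and page_like are hypothetical names; only
"evictable" and on_evicted() come from the description above.

```cpp
#include <list>

// Virtual base class for anything that can be linked into the LRU.
class evictable {
public:
    virtual ~evictable() = default;
    // Called by the LRU when this object is chosen for eviction.
    virtual void on_evicted() noexcept = 0;
};

// Non-owning LRU: it stores pointers only and never deletes the objects.
class simple_lru {
    std::list<evictable*> _items; // front = least recently used
public:
    void add(evictable& e) { _items.push_back(&e); }
    void remove(evictable& e) { _items.remove(&e); }
    // Mark an object as most recently used.
    void touch(evictable& e) { remove(e); add(e); }
    // Evict the least recently used object; returns false if the LRU is empty.
    bool evict() {
        if (_items.empty()) {
            return false;
        }
        evictable* e = _items.front();
        _items.pop_front();
        e->on_evicted();
        return true;
    }
};

// Two unrelated object kinds sharing one LRU.
struct row_like : evictable {
    bool evicted = false;
    void on_evicted() noexcept override { evicted = true; }
};

struct page_like : evictable {
    bool evicted = false;
    void on_evicted() noexcept override { evicted = true; }
};
```

Because the container is non-owning, each object decides in on_evicted()
how to release itself; the sketch only flips a flag.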
2021-07-02 10:25:58 +02:00
Benny Halevy
68bd748af2 repair: row_level: clear_gently: clear_gently each repair_row
Rows might be large so free them gently by:
- add bytes_ostream.clear_gently that may yield in the chunk
  freeing loop.
- use that in frozen_mutation_fragment, contained in repair_row.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
defe789d46 repair: row_level: repair_meta: clear_gently on stop
To prevent a reactor stall as seen in #8926

Note: this patch doesn't use coroutines, to
facilitate backporting.

Coroutinization will be done in a follow-up patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
0bf46751fa repair: row_level_repair: run: stop master repair_meta
Not only close it.  Next patch will use clear_gently on stop
to prevent reactor stalls.

double-stop prevention code was added to stop()
since, in the error-free case, repair_meta master is
already stopped by `repair_row_level_stop` when auto_stop
master deferred action is called.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
9963d15613 utils: stall_free: implement clear_gently of foreign_ptr
clear_gently of the foreign_ptr needs to run on the owning
shard, so provide a specialization from the SmartPointer
implementation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
eca9f45c59 utils: stall_free: define generic clear_gently methods
Define a bunch of clear_gently methods that asynchronously
clear the contents of containers and allow yielding.

This replaces clear_gently(std::list<T>&) used by row level
repair by a more generic template implementation.

Note that we do not use coroutines in this patch
to facilitate backporting to releases that do not support coroutines
and since a miscompilation bug was hit with clang++ 11 when attempting
to coroutinize this patch (see
https://bugs.llvm.org/show_bug.cgi?id=50345).
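
As a rough, synchronous model of the idea (not the Seastar-based code in
this patch): elements are released one at a time and a hook runs between
removals, standing in for the point where the real implementation may
yield to the reactor. clear_stepwise and demo_yield_count are
hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Removes elements one at a time, invoking maybe_yield after each removal.
// Container must provide empty() and pop_back().
template <typename Container>
std::size_t clear_stepwise(Container& c, const std::function<void()>& maybe_yield) {
    std::size_t removed = 0;
    while (!c.empty()) {
        c.pop_back();
        ++removed;
        maybe_yield(); // assumption: the real code may yield here
    }
    return removed;
}

// Example: clear a 100-element vector and count how often the hook ran.
inline std::size_t demo_yield_count() {
    std::vector<int> v(100);
    std::size_t yields = 0;
    clear_stepwise(v, [&] { ++yields; });
    return yields;
}
```

The template accepts any container with empty() and pop_back(), which
mirrors the move from a std::list-only helper to a generic one.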

Test: stall_free_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:00:49 +03:00
Avi Kivity
4209dfd753 Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
  recreation

As they depend on each other, it is easier to add a test if they are in
a series.

Fixes: #8923
Fixes: #8893

Tests: unit(dev, mutation_reader_test:debug)
"

* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add more test for reader recreation
  evictable_reader: relax partition key check on reader recreation
  evictable_reader: always reset static row drop flag
2021-07-01 14:15:46 +03:00
Asias He
e95f7f4af0 storage_service: Make heartbeat response log debug level
It is too noisy. It is supposed to be debug level. I forgot to move back
to debug level after testing during development.

Refs #8825

Closes #8960
2021-07-01 11:32:43 +03:00
Avi Kivity
0d87744ba0 Revert "dist: stop removing /etc/systemd/system/*.mount on package uninstall"
This reverts commit a677c46672. It causes
upgrade from a version that did not have the commit to a version that
does have the commit to lose the .mount files, since they change
from being owned by the package (via %ghost) to not being owned.

Fixes #8924.
2021-07-01 08:55:54 +03:00
Nadav Har'El
7a5111c580 Merge 'messaging_service: do not listen on port 0' from Benny Halevy
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.

Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.

Fixes #8957

Test: unit(dev)
DTest: listen_ports_conf_test (modified)

Closes #8956

* github.com:scylladb/scylla:
  messaging_service: do_start_listen: improve info log accuracy
  messaging_service: never listen on port 0
2021-06-30 18:41:58 +03:00
Nadav Har'El
7ab48b405f CQL: always validate NetworkTopologyStrategy replication factor
The replication factor passed to NetworkTopologyStrategy (which we call
by the confusing name "auto expand") may or may not be used (see
explanation why in #8881), but regardless, we should validate that it's
a legal number and not some non-numeric junk, and we should report the error.

Before this patch, the two commands

CREATE KEYSPACE name WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
ALTER KEYSPACE name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 'foo' }

succeed despite the invalid replication factor "foo". After this patch,
the second command fails.

The problem fixed here is reproduced by the existing test
test_keyspace.py::test_alter_keyspace_invalid when switching it to use
NetworkTopologyStrategy, as suggested by issue #8638.
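
The kind of check involved can be sketched as follows. This is an
illustration under assumptions, not the validation code from this patch:
is_valid_replication_factor is a hypothetical helper that accepts only
non-empty strings of decimal digits.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true only for a non-empty string made of decimal digits.
bool is_valid_replication_factor(const std::string& s) {
    if (s.empty()) {
        return false;
    }
    return std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return std::isdigit(c) != 0;
    });
}
```

Under this check the value '1' from the CREATE KEYSPACE command passes,
while 'foo' from the ALTER KEYSPACE command is rejected.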

Fixes #8880
Refs #8881

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620100442.194610-1-nyh@scylladb.com>
2021-06-30 16:49:46 +03:00
Benny Halevy
51bc6c8b5a messaging_service: do_start_listen: improve info log accuracy
Make sure to log the info message when we actually
start listening.

Also, print a log message when listening on the
broadcast address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:25:21 +03:00
Benny Halevy
df442d4d24 messaging_service: never listen on port 0
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.

Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.
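
The rule can be sketched as a small filter over the configured ports.
listen_config and ports_to_listen are hypothetical names; only the
storage_port and ssl_storage_port options come from the text above.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical mirror of the two port options named above.
struct listen_config {
    std::uint16_t storage_port;
    std::uint16_t ssl_storage_port;
};

// Returns the ports to listen on, skipping any port configured as 0.
std::vector<std::uint16_t> ports_to_listen(const listen_config& cfg) {
    std::vector<std::uint16_t> ports;
    for (std::uint16_t p : {cfg.storage_port, cfg.ssl_storage_port}) {
        if (p != 0) {
            ports.push_back(p);
        }
    }
    return ports;
}
```

Setting either option to 0 then disables that listener instead of
letting the OS pick a random port.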

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:24:54 +03:00
Botond Dénes
75e8d2d04a test: mutation_reader_test: add more test for reader recreation 2021-06-30 11:21:58 +03:00
Botond Dénes
852bf6befd evictable_reader: relax partition key check on reader recreation
When recreating the underlying reader, the evictable reader validates
that the first partition key it emits is what it expects to be. If the
read stopped at the end of a partition, it expects the first partition
to be a larger one. If the read stopped in the middle of a certain
partition it expects the first partition to be the same it stopped in
the middle of. This latter assumption doesn't hold in all circumstances
however. Namely, the partition it stopped in the middle of might get
compacted away in the time the read was paused, in which case the read
will resume from a greater partition. This perfectly valid case however
currently triggers the evictable reader's self validation, leading to
the abortion of the read and a scary error to be logged. Relax this
check to accept any partition that is >= compared to the one the read
stopped in the middle of.
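
The relaxed check can be sketched like this, assuming partition keys are
modelled as plain integers in ring order. stop_point and
first_partition_is_valid are hypothetical names, not the reader's actual
interface.

```cpp
// Where the previous read stopped before the reader was recreated.
enum class stop_point { at_partition_end, mid_partition };

// Validates the first partition emitted by the recreated reader.
// Assumption: keys are ints whose order matches ring order.
bool first_partition_is_valid(int last_key, stop_point where, int first_key) {
    if (where == stop_point::at_partition_end) {
        // The old partition was fully consumed: expect a strictly larger one.
        return first_key > last_key;
    }
    // Relaxed from "== last_key": the partition may have been compacted
    // away while the read was paused, so a greater one is acceptable too.
    return first_key >= last_key;
}
```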
2021-06-30 11:21:53 +03:00
Botond Dénes
2740dd2ae4 evictable_reader: always reset static row drop flag
When the evictable reader recreates its underlying reader, it does it
such that it continues from the exact mutation fragment the read was
left off from. There are however two special mutation fragments, the
partition start and static row that are unconditionally re-emitted at
the start of a new read. To work around this, when stopping at either of
these fragments the evictable reader sets two flags
_drop_partition_start and _drop_static_row to drop the unneeded
fragments (that were already emitted by it) from the underlying reader.
These flags are then reset when the respective fragment is dropped.
_drop_static_row has a vulnerability though: the partition doesn't
necessarily have a static row and if it doesn't the flag is currently
not cleared and can stay set until the next fill buffer call causing the
static row to be dropped from another partition.
To fix, always reset the _drop_static_row flag, even if no static row
was dropped (because it doesn't exist).
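
A simplified model of the fixed behaviour, assuming fragments arrive one
at a time through a filter. The fragment enum and recreated_reader_filter
are hypothetical; only the two flag names are taken from the description
above.

```cpp
enum class fragment { partition_start, static_row, clustering_row, partition_end };

// Decides which fragments of a recreated reader reach the consumer.
struct recreated_reader_filter {
    bool drop_partition_start = false;
    bool drop_static_row = false;

    // Returns true if the fragment should be emitted.
    bool should_emit(fragment f) {
        if (f == fragment::partition_start) {
            bool drop = drop_partition_start;
            drop_partition_start = false;
            return !drop;
        }
        if (drop_static_row) {
            // Always reset the flag, even when the partition has no static
            // row, so it cannot leak into the next partition.
            drop_static_row = false;
            if (f == fragment::static_row) {
                return false;
            }
        }
        return true;
    }
};
```

With both flags set and a partition that has no static row, the first
clustering row clears drop_static_row, so the static row of the next
partition is emitted as expected.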
2021-06-30 10:05:35 +03:00
Nadav Har'El
029991bfc2 test/cql-pytest: test that SSL CQL port doesn't accept unencrypted connections
Scylla doesn't allow unencrypted connections over encrypted CQL ports
(Cassandra does allow this, by setting "optional: true", but it's not
secure and not recommended). Here we add a test that indeed, we can't
connect to an SSL port using an unencrypted connection.

The test passes on Scylla, and also on Cassandra (run it on Cassandra
with "test/cql-pytest/run-cassandra --ssl" - for which we added support
in a recent patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629121514.541042-1-nyh@scylladb.com>
2021-06-29 16:42:22 +03:00
Nadav Har'El
dc4c05b2e3 test/cql-pytest: switch some fixture scopes from "session" to "module"
Fixtures in conftest.py (e.g., the test_keyspace fixture) can be shared by
all tests in all source files, so they are marked with the "session"
scope: All the tests in the testing session may share the same instance.
This is fine.

Some of the test files have additional fixtures for creating special tables
needed only in those files. Those were also, unnecessarily, marked
"session" scope as well. This means that these temporary tables are
only deleted at the very end of test suite, even though they can be
deleted at the end of the test file which needed them - other test
source files don't have access to it anyway. This is exactly what the
"module" fixture scope is, so this patch changes all the fixtures that
are private to one test file to use the "module" scope.

After this patch, the teardown of the last test in the suite goes down
from 0.26 seconds to just 0.06 seconds.

Another benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.

This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #8932
2021-06-29 16:10:47 +03:00
Calle Wilund
a40b6a2f54 commitlog: Use disk file alignment info (with lower value if possible)
Previously, the disk block alignment of segments was hardcoded (due to
really old code). Now we use the value as declared in the actual file
opened. If we are using a previously written file (i.e. o_dsync), we
can even use the sometimes smaller "read" alignment.
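
A small sketch of the arithmetic involved, under assumptions:
file_alignment, pick_alignment and align_up are hypothetical names, and
the two alignment values stand in for whatever the opened file reports.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical alignment info as reported by an opened file.
struct file_alignment {
    std::size_t write;     // alignment for writes to fresh blocks
    std::size_t overwrite; // alignment for overwriting existing blocks
};

// Use the (possibly smaller) overwrite alignment only when overwriting.
std::size_t pick_alignment(const file_alignment& f, bool overwrite_only) {
    return overwrite_only ? std::min(f.overwrite, f.write) : f.write;
}

// Round n up to the next multiple of alignment (alignment must be > 0).
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) / alignment * alignment;
}
```

For example, a 5000-byte write is padded to 8192 bytes at a 4096-byte
alignment, but only to 5120 bytes at a 512-byte alignment.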

Also allow config to completely override this with a disk alignment
config option (not exposed to global config yet, but can be).

v2:
* Use overwrite alignment if doing only overwrite
* Ensure to adjust actual alignment if/when doing file wrapping

v3:
* Kill alignment config param. Useless and unsafe.

Closes #8935
2021-06-29 16:00:49 +03:00
Nadav Har'El
7e4bef96af test/cql-pytest: support "--ssl" option in run-cassandra
This patch adds support for the "--ssl" option in run-cassandra, which
will now be able, like run (which runs Scylla), to run Cassandra
listening on an *SSL-encrypted* CQL connection. The "--ssl" option is also
passed to the tests, so they know to encrypt their CQL connections.

We already had support for this feature in the test/cql-pytest/run
script - which runs Scylla. Adding this also to the run-cassandra
script can help verify that a behavior we notice in Scylla's SSL support
and we want to add to a test - is also shared by Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629082532.535229-1-nyh@scylladb.com>
2021-06-29 12:05:40 +03:00
Raphael S. Carvalho
ef76cdb2c7 sstables: Attach sstable name to exception triggered in sstable mutation reader
When compaction fails due to a failure that comes from a specific
sstable, like on data corruption, the log isn't telling which
sstable contributed to that. Let's always attach the sstable name to
the exception triggered in sstable mutation reader.

Exceptions in la and mx consumer attached sst name, but now only sst
mutation reader will do it so as to avoid duplicating the sst name.

Now:
ERROR 2021-06-11 16:07:34,489 [shard 0] compaction_manager - compaction failed:
sstables::malformed_sstable_exception (Failed to read partition from SSTable
/home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file
offset 406491 failed checksum, expected=0, actual=1422312584): retrying

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-06-28 12:54:24 -03:00
Raphael S. Carvalho
95cc48508c test/broken_sstable_test: Make test more robust
Test breaks very easily whenever there's a change in the message
formatted for malformed_sstable_exception. Make test more robust
by not checking exact message, but that the message contains
both the expected exception and the sstable filename.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-06-28 12:27:29 -03:00
Raphael S. Carvalho
c7424ca6e6 sstables: Make log more useful when compressed chunk fails checksum
The current log is useless when checksum fails, you don't know
which compressed chunk failed checksum, the expected and the
actual checksum, the size of chunk and so on.

Before:
compressed chunk failed checksum
Now:
compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584
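
A sketch of how such a message could be assembled, using only the
standard library. checksum_failure_message is a hypothetical helper and
the parameter types are assumptions.

```cpp
#include <cstdint>
#include <string>

// Builds the detailed checksum-failure text shown in the "Now" example.
std::string checksum_failure_message(std::uint64_t chunk_size, std::uint64_t file_offset,
                                     std::uint32_t expected, std::uint32_t actual) {
    return "compressed chunk of size " + std::to_string(chunk_size) +
           " at file offset " + std::to_string(file_offset) +
           " failed checksum, expected=" + std::to_string(expected) +
           ", actual=" + std::to_string(actual);
}
```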

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-06-28 12:25:51 -03:00
Raphael S. Carvalho
9daf5d1ab8 sstables: Use correct exception when compressed chunk fails checksum
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-06-28 12:25:44 -03:00
Takuya ASADA
edd54a9463 reloc: add arch to relocatable package filename
Add architecture name for relocatable packages, to support distributing
both x86_64 version and aarch64 version.

Also create symlink from new filename to old filename to keep
compatibility with older scripts.

Fixes #8675

Closes #8709

[update tools/python3 submodule:

* tools/python3 ad04e8e...afe2e7f (1):
  > reloc: add arch to relocatable package filename
]
2021-06-28 15:01:09 +03:00
Avi Kivity
f660726773 Update seastar submodule
* seastar 0e48ba883...eaa00e761 (3):
  > memory: reduce statistics TLS initialization even more
  > Merge "Sanitize io-topology creation on start" from Pavel E
  > doc/prometheus: note that metric family is passed by query name
2021-06-28 11:52:36 +03:00
Botond Dénes
09309f5dbf reader_concurrency_semaphore: on_permit_created(): remove noexcept
The permit creation path enters the semaphore's permit gate in
on_permit_created(). Entering this gate can throw so this method is not
noexcept. Remove the noexcept specifier accordingly.
Also enter the gate before adding the permit to the permit list, to save
some work when this fails.

Fixes: #8933

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210628074941.32878-1-bdenes@scylladb.com>
2021-06-28 11:04:38 +03:00
Avi Kivity
c0c1e26014 Merge 'Remove code writing LA/KA sstables' from Piotr Jastrzębski
Now that all supported versions write mc/md sstables, we can deprecate the MC_SSTABLE feature bit and consider it implicitly true, and with it the ability to write la/ka sstables.

We still need to support reading them, e.g. from restoring old snapshots or migrating data from legacy clusters.

Test: unit(dev, debug)

Fixes #8352

Closes #8884

* github.com:scylladb/scylla:
  compress: Remove unused make_compressed_file_k_l_format_output_stream
  sstables: move sstable_writer to separate header
  sstable_writer: remove get_metadata_collector
  sstables: stop including metadata_collector.hh in sstables.hh
  sstables: Remove duplicated friend declaration
  sstables: remove unused KL writer
  sstables: Always use MC/MD writer
  sstable_datafile_test: switch tests to use latest sstables format
  sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version
  test_offstrategy_sstable_compaction: test all writable sstables
  compaction_with_fully_expired_table: Remove some LA specific code
  sstable_mutation_test: test latest sstable format instead of LA
  sstable_test: Test MX sstables instead of KA/LA
  sstable_datafile_test: Fix schema used by check_compacted_sstables
  sstables: Remove LA/KA sstable writing tests that check exact format
  sstables: define writable_sstable_versions
  features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled
2021-06-27 20:50:51 +03:00
Avi Kivity
121971ec0f Merge "storage_proxy: specialize query_singular() for non-IN queries" from Gleb
"
query_singular() accepts a partition_range_vector, corresponding to an IN
query. But such queries are rare compared to single-partition queries.

Co-routinise the code and special case non-IN queries by avoiding
the call to map_reduce. Also replace executers array with small_vector
to avoid an allocation in the common case.

perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10:

before: median 204545.04 tps ( 81.1 allocs/op,  15.1 tasks/op,   48828 insns/op)

after:  median 219769.97 tps ( 74.1 allocs/op,  12.1 tasks/op,   46495 insns/op)

So, a ~7% improvement in tps and 5% improvement in instructions per op.
Also large reduction in tasks and allocations.

This is an alternative proposal to https://github.com/scylladb/scylla/pull/8909.
The benefit of this one is that it does not duplicate any code (almost).
"

* 'query_singular-coroutine' of github.com:scylladb/scylla-dev:
  storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried
  storage_proxy: use small_vector in storage_proxy::query_singular to store executors
  storage_proxy: co-routinize storage_proxy::query_singular()
2021-06-27 16:30:19 +03:00
Piotr Jastrzebski
10228b35c5 compress: Remove unused make_compressed_file_k_l_format_output_stream
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
430fd5cfa9 sstables: move sstable_writer to separate header
This class is used in only a few places and does not have to be included
everywhere the sstable class is needed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
9e7144f719 sstable_writer: remove get_metadata_collector
This function is only called internally so it does not have to be
exposed and can be inlined instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
2d6608bb88 sstables: stop including metadata_collector.hh in sstables.hh
metadata collector is rarely used so it's better to include it only
in those few places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
39851f76fc sstables: Remove duplicated friend declaration
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
c7096470bf sstables: remove unused KL writer
Previous two patches removed the usage of KL writer so the code is now
dead and can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
8293384189 sstables: Always use MC/MD writer
Previous patch made MC the lowest sstables format in use so
the removed check is always true now and we can remove the else
part.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
314bc0e8a5 sstable_datafile_test: switch tests to use latest sstables format
instead of LA. Ability to write LA and KA sstables will be removed
by the following patches so we need to switch all the tests to write
newer sstables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
f03ed9b9a7 sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
1ed298b08b test_offstrategy_sstable_compaction: test all writable sstables
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:12 +02:00
Gleb Natapov
e5154e77d3 storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried
Not using IN queries is a common case, so avoid map_reduce overhead for
them.
2021-06-27 14:49:44 +03:00
Gleb Natapov
8018eb4612 storage_proxy: use small_vector in storage_proxy::query_singular to store executors
Having only one pk to query is the common case, so avoid allocating
executor vector for that case.
2021-06-27 14:48:15 +03:00
Gleb Natapov
d908912dbd storage_proxy: co-routinize storage_proxy::query_singular()
The code becomes much easier to understand and allocation for
used_replicas can be dropped.
2021-06-27 14:47:19 +03:00
Avi Kivity
b7cb687d36 Merge "Harden storage_service::stop_transport" from Pavel E
"
Stopping transport (cql, thrift, messaging, etc.) can happen from
several places -- drain, decommission, stop, isolation. Some of
them can even run in parallel. This patch makes transport stopping
bulletproof.

tests: unit(dev)
       start-stop (dev)
       start-drain-stop (dev)
fixes: #8911
"

* 'br-stop-transport-races' of https://github.com/xemul/scylla:
  storage_service: Indentation fix
  storage_service: Make stop_transport re-entrable
  storage_service: Stop transport on decommission
2021-06-27 14:46:46 +03:00
Pavel Emelyanov
7014da9404 storage_service: Unregister disk error handlers on stop
The storage service installs disk error handlers in its constructor, and these
connections are never unregistered. It's not a problem in real life,
because the storage service is not stopped, but in some tests this can
lead to use-after-frees.

The sstables_datafile_test runs some of the testcases in cql_test_env
which starts and (!) stops the storage service. Other testcases are
run in a lightweight sstables_test_env which does not mess with the
storage service at all. Now, if a case of the 2nd kind is run after
the one of the 1st and (for whatever reason) generates a disk error
it will trigger use-after-free -- after the 1st testcase the storage
service disk error registration would remain, but the storage service
itself would already be stopped, thus triggering the disk error will
try to access stopped sharded storage service inside the .isolate().

The fix is to keep the scoped connection on the storage service list
of various listeners. On stop it will go away automagically.

tests: unit(dev), sstables_datafile_test with forced disk error

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210625062648.27812-1-xemul@scylladb.com>
2021-06-27 14:41:55 +03:00
Avi Kivity
6676ceabde Merge 'Prevent reactor stall in utils::merge_to_gently' from Benny Halevy
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Also, eliminate extraneous loop on merge

first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Fixes #8897

Test: unit(dev), stall_free_test(debug)
DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev)

Closes #8925

* github.com:scylladb/scylla:
  utils: merge_to_gently: eliminate extraneous loop on merge
  utils: merge_to_gently: prevent stall in std::copy_if
2021-06-27 13:56:32 +03:00
Raphael S. Carvalho
29c93ae592 compaction: Reduce backlog of compacting SSTable properly
It was observed that as compaction progresses the backlog of compacting SSTable
is being reduced very slowly, which causes shares to be higher than needed, and
consequently compaction acts much more aggressively than it has to.

https://user-images.githubusercontent.com/1409139/120237819-93dfc080-c232-11eb-9042-68114e285ea0.png

The graph above shows the amount of backlog that is reduced from a SSTable
being compacted. The red line denotes the total backlog of the SSTable, before
it's selected for compaction. The expectation is that the more a SSTable is
compacted the more backlog will be reduced from it. However, in the current
implementation, it can be seen that the backlog to be reduced, from the SSTable
being compacted, starts being inversely proportional to the amount of data
already compacted.

Turns out that this problem happens because the implementation of backlog
formula becomes incorrect when the SSTable is being compacted.

Backlog for a sstable is currently defined as:
    Bi = Ei * log (T / Ei)

    where Ei = Si - Ci (bytes left to be compacted)
        and Si = size of SStable
        and Ci = total bytes compacted
        and T = total size of table

The formula above can also be rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Ei)

the second term `Ei * log (Ei)` can be rewritten as:
    = (Si - Ci) * log (Ei)
    = Si * log (Ei) - Ci * log (Ei)

However, digging into the backlog implementation, it turns out that we're
incorrectly implementing that second term as:
    = Si * log (Si) - Ci * log (Ei)

Given that Si > Ei, for a SSTable being compacted, the backlog will be higher
than it should be.

The following table shows how the backlog of a SSTable being compacted behaves
now versus how it's supposed to behave:
https://gist.github.com/raphaelsc/42e14be0d7d4ed264e538c2d217c8f95

Turns out that this is not the only problem. It was a mistake to change the
formula from `Ei * log(T / Si)` to `Ei * log(T / Ei)`, when fixing the
shrinking table issue, because that also causes the backlog of a compacting
SSTable to be incorrectly reduced.

With the formula rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Ei)

It becomes clear that the more a SSTable is compacted, the slower it becomes
for backlog to be reduced, as T / Ei can increase considerably over time.

So we're reverting the formula back to `Ei * log(T / Si)`.

The graph below shows a better backlog behavior when table is shrinking:
https://user-images.githubusercontent.com/1409139/123495186-06a54700-d5f9-11eb-9386-3fcf4dd8e4d3.png

While analyzing the problem when the table is shrinking, I realized that it's
because T in the formula is implemented as the effective size (total + partial -
compacted).

With the new formula rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Si)

It becomes clearer that T cannot be lower than Si whatsoever, otherwise the
backlog becomes negative. Also, while table is shrinking, it can happen that
the backlog will be so low that compaction will barely make any progress.
To fix both issues, let's implement T as total size (sum of all Si) rather than
effective size (sum of all Ei).
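
To make the two clean formulas concrete, here is a minimal numeric sketch. The sizes are made up, the function names are illustrative, and the "partial" term of the effective size is ignored; it is not the backlog tracker's actual code.

```python
import math

def backlog_previous(si, ci, t):
    """Bi = Ei * log(T / Ei): the formula being reverted (Ei = Si - Ci)."""
    ei = si - ci
    return ei * math.log(t / ei) if ei > 0 else 0.0

def backlog_reverted(si, ci, t):
    """Bi = Ei * log(T / Si): the formula this patch goes back to."""
    ei = si - ci
    return ei * math.log(t / si)

# One SSTable of size 100 being compacted in a table whose total size is 400.
si, total = 100, 400
for ci in (0, 25, 50, 75):
    print(ci,
          round(backlog_previous(si, ci, total), 1),
          round(backlog_reverted(si, ci, total), 1))

# If T were the effective size and dropped below Si, the log term would be
# negative; with T as the total size (sum of all Si) it cannot drop below Si.
negative_case = backlog_reverted(100, 50, 50)
total_size_case = backlog_reverted(100, 50, 100)
```

With `log(T / Si)` the backlog shrinks in proportion to the bytes already compacted, whereas with `log(T / Ei)` the log factor grows as Ei shrinks, so the backlog drains more slowly.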

The graph below shows that this change prevents the backlog from going negative
while still providing similar and expected behavior as before, see:
https://user-images.githubusercontent.com/1409139/123495185-060cb080-d5f9-11eb-89f7-ed445729702a.png

Fixes #8768.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210626003133.3011007-1-raphaelsc@scylladb.com>
2021-06-27 11:43:48 +03:00
Pavel Emelyanov
a89ae9a8e7 storage_service: Indentation fix
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:21:10 +03:00
Pavel Emelyanov
bd2a58060e storage_service: Make stop_transport re-entrable
It may happen that a disk error occurs and the subsequent isolation runs
in parallel with drain or decommission or shutdown. In this case the
stop_transport method would be running two times in parallel. Also
the drain after decommission is not disabled, so it may happen that
stop_transport will be called two times in a row (however -- not in
parallel).

Using shared_promise solves all the possible reentrances.
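
The idea can be illustrated outside Seastar with an asyncio analogue: the first caller starts the stop sequence, and every later caller, parallel or sequential, waits on the same result. This is only a sketch of the pattern; the class and method bodies below are hypothetical and not Scylla code.

```python
import asyncio

class TransportStopper:
    """Hypothetical stand-in for the service, used only to show the pattern."""

    def __init__(self):
        self._stop_task = None
        self.stop_count = 0

    async def _do_stop(self):
        self.stop_count += 1
        await asyncio.sleep(0)  # stands in for stopping cql, gossip, messaging...

    def stop_transport(self):
        # The first caller kicks off the work; everyone else shares its outcome.
        if self._stop_task is None:
            self._stop_task = asyncio.ensure_future(self._do_stop())
        return self._stop_task

async def main():
    s = TransportStopper()
    await asyncio.gather(s.stop_transport(), s.stop_transport())  # in parallel
    await s.stop_transport()                                      # and again, in a row
    return s.stop_count

count = asyncio.run(main())
print(count)
```

However many times `stop_transport()` is called, the underlying stop sequence runs once.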

(the indentation is deliberately left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:18:43 +03:00
Pavel Emelyanov
b0199b005d storage_service: Stop transport on decommission
The stop_transport sequence is:
- stop client services (cql, thrift, alternator)
- stop gossiping
- stop messaging
- stop stream manager

Decommissioning goes very similarly:
- stop client services
- stop batchlog manager
- stop gossiping
- stop messaging

So this change makes decommission stop all networking _before_
batchlog, like it's already done on drain, and additionally stop
the streaming manager.

This change is prerequisite for fixing race between transport
stop and isolation (on disk error).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:15:38 +03:00
Piotr Jastrzebski
995eb8c274 compaction_with_fully_expired_table: Remove some LA specific code
The following patches will switch all sstable writing tests to use
the latest sstables format. compaction_with_fully_expired_table
contains a test for LA-specific behaviour, so let's remove it
to make the switch possible.

For more context see https://github.com/scylladb/scylla/issues/2620

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
8ff37bec17 sstable_mutation_test: test latest sstable format instead of LA
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
80f8f970e9 sstable_test: Test MX sstables instead of KA/LA
Replace calls to make_compressed_file_k_l_format_input_stream
with calls to make_compressed_file_m_format_input_stream.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
131a0babc0 sstable_datafile_test: Fix schema used by check_compacted_sstables
check_compacted_sstables is used in compact_02 test which uses sstables
created by compact_sstables. The problem is that schema used in
check_compacted_sstables and compact_sstables is not the same.
The type of r1 column is different. This was not a problem when the
test was running on LA sstables but following patches will switch
all the tests to use MC and then sstable schema becomes validated
when reading the sstable and the test will fail such validation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
680e341f54 sstables: Remove LA/KA sstable writing tests that check exact format
Those tests check that created sstables have exactly the expected bytes
inside. This won't work with other sstable formats, and writing LA/KA
sstables will be removed by the following patches, so there's nothing
we can do with those tests but remove them. Otherwise they would
fail after the LA/KA writing capability is removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
2bd6ad1e2f sstables: define writable_sstable_versions
and use it instead of all_sstable_versions in tests that check
writing of sstables. The following patches remove the LA/KA writer, so we
want tests to be ready for that and not break by trying to write LA/KA
sstables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
1bdcef6890 features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled
These features have been around for over 2 years and every reasonable
deployment should have them enabled.

The only case when those features could be not enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Pavel Emelyanov
3552e99ce7 scylla-gdb: Bring scylla netw back to work
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.

The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
2021-06-24 20:59:27 +03:00
Nadav Har'El
4d7f55a29f cql: add configurable restriction of DateTieredCompactionStrategy
DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time
(users should use TimeWindowCompactionStrategy, TWCS, instead). This
patch adds a new configuration option - restrict_dtcs - which can be used
to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE
statements. This is part of a "safe mode" effort to allow an installation
to restrict operations which are un-recommended or dangerous.

The new restrict_dtcs option has three values: "true", "false", and "warn":

For the time being, "false" is still the default, and means DTCS is not
restricted and can still be used freely. We can easily change this
default in a followup patch.

Setting a value of "true" means that DTCS *is* restricted -
trying to create a table or alter a table with it will fail with an error.

Setting a value of "warn" will allow the create or alter operation, but
will warn the user - both with a warning message which will immediately
appear in cqlsh (for example), and with a log message.
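
The three-valued behaviour described above could be sketched as follows. The function name, exception type and message text are hypothetical and only illustrate the true/false/warn semantics; they are not the actual implementation.

```python
import logging

log = logging.getLogger("restrict_dtcs_sketch")

class InvalidRequest(Exception):
    """Stand-in for the error returned to the client."""

def check_dtcs(strategy, restrict_dtcs):
    """Return the list of client warnings, or raise if DTCS is restricted."""
    if strategy != "DateTieredCompactionStrategy":
        return []
    if restrict_dtcs == "true":
        raise InvalidRequest(
            "DateTieredCompactionStrategy is restricted; "
            "consider TimeWindowCompactionStrategy instead")
    if restrict_dtcs == "warn":
        msg = "DateTieredCompactionStrategy is not recommended"
        log.warning(msg)   # log message on the server side
        return [msg]       # warning handed back to the client (e.g. cqlsh)
    return []              # "false": DTCS is not restricted
```

For example, `check_dtcs("DateTieredCompactionStrategy", "warn")` lets the statement proceed but returns one warning, while the same call with `"true"` raises.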

Fixes #8914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210624122411.435361-1-nyh@scylladb.com>
2021-06-24 20:59:27 +03:00
Benny Halevy
b96eeaefe4 utils: merge_to_gently: eliminate extraneous loop on merge
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Note that the standard states that no iterators or references are invalidated
on insert so we can safely keep looking at `first1` after inserting a copy of
`*first2` before it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:58:12 +03:00
Benny Halevy
453e7c8795 utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:47:25 +03:00
Benny Halevy
9bbe7b1482 sstables: mx_sstable_mutation_reader: enforce timeout
Check if the timeout has expired before issuing I/O.

Note that the sstable reader input_stream is not closed
when the timeout is detected. The reader must be closed anyhow after
the error bubbles up the chain of readers and before the
reader is destroyed.  This might already happen if the reader
times out while waiting for reader_concurrency_semaphore admission.

Test: unit(dev), auth_test.test_alter_with_timeouts(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>
2021-06-24 12:26:57 +02:00
Kamil Braun
a3f3563828 storage_service: check for existing normal token owners before bootstrapping
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.

However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assumption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).

Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.

Also add an assert in CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
was false; better to crash on assert than this).

Ref #8889.

Closes #8896
2021-06-24 13:19:08 +03:00
Asias He
2ad8fb756e gossip: Promote gossip quarantine over log to info level
1) Start node n1, n2, n3

2) Bootstrap n4 and kill n4 in the middle of bootstrap

3) Wipe data on n4 and start n4 again

After step 2, n1, n2 and n3 will remove n4 from gossip after
fat_client_timeout and put n4 in quarantine for quarantine_delay().

If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2
and n3 will ignore gossip updates from n4, and n4 will not learn gossip
updates from the cluster.

After PR #8896, the bootstrap will be rejected.

This patch promotes the gossip quarantine over log to info level, so
that dtest can wait for the log to bootstrap the node again.

Refs #8889
Refs #8890

Closes #8905
2021-06-24 12:51:32 +03:00
Michael Livshin
9b9efb2b42 disable caching of the system.large_* tables
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (https://github.com/scylladb/scylla/issues/7750), and those
range tombstones can be evicted only together with their partition
(https://github.com/scylladb/scylla/issues/3288).

Making the system.large_* tables uncached should work around the
problem until #3288 is fixed.

Fixes #8874
Refs #7750
Refs #3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
2021-06-24 12:26:45 +03:00
Piotr Sarna
ae9e52a774 Merge 'Cleanup and improvements for docs/alternator/alternator.md' from Nadav Har'El
Make some improvements to docs/alternator.md as suggested by a user who
had trouble understanding the previous version, and also a few other
random cleanups.

Closes #8910

* github.com:scylladb/scylla:
  docs/alternator/alternator.md: improve "Running Alternator" section
  docs/alternator/alternator.md: correct minor typos
  docs/alternator/alternator.md: fix link format
2021-06-24 12:03:26 +03:00
Avi Kivity
14252c8b71 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
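
Point 1 above can be illustrated with a toy model. The `Segment` fields, function names and numbers below are assumptions made for illustration and do not reflect the commitlog's real accounting.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    size: int     # bytes the segment occupies on disk
    used: int     # bytes actually written into it
    closed: bool  # True once no further mutation will be added

def usage_without_slack(segments):
    return sum(s.used for s in segments)

def usage_with_slack(segments):
    # The unused tail of a closed segment can never be filled, so it counts
    # against the footprint just like written bytes do.
    return sum(s.size if s.closed else s.used for s in segments)

def should_flush(segments, threshold):
    return usage_with_slack(segments) >= threshold

# Four 32-byte segments, each closed after a 17-byte mutation because the next
# 17-byte mutation did not fit into the remaining 15 bytes.
segments = [Segment(size=32, used=17, closed=True) for _ in range(4)]
threshold = 96  # hypothetical flush threshold under a 128-byte disk limit

print(usage_without_slack(segments), usage_with_slack(segments))
```

Counting only written bytes (68) stays below the threshold even though the on-disk footprint (128) has already reached the limit; including the slack makes the check trigger a flush.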

Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
a61afe8421 btree: Improve unlink_leftmost_without_rebalance()
The helper is used to walk the tree key-by-key, destroying it
in the meantime. The current implementation of this method just
uses the "regular" erasing code which actually rebalances the
tree despite the name.

The biggest problem with removing the rebalancing is that at
some point non-balanced tree may have the left-most key on an
inner node, so to make 100% rebalance-less unlink every other
method of the tree would have to be prepared for that. However,
there's an option to make "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key is on the leaf.

Some more tech details. Current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough

The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.

This change removes ~60% of the key movements on a random test.
Keys still move when the sibling replace happens because in
this case the separation key needs to be placed at the right
sibling 0 position which means shifting all its keys right.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
ab9d08d80e sstables: Remove unused filtering reader from sstable_set::make_local_shard_sstable_reader()
It's been a long time since table stopped accepting shared sstables, so this
code, which creates a filtering reader if the sstable is shared, is never used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-2-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
88119a5c81 distributed_loader: Kill table's _sstables_opened_but_not_loaded
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape finished, so
there's no need whatsoever for that mess.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Tomasz Grabiec
ee28eb4100 Merge "test: raft: move some tests to raft folder" from Pavel Solodovnikov
Move `raft_sys_table_storage_test` and `raft_address_map_test` to
`test/raft` folder since they naturally belong here, not in
`test/boost` folder.

Tests: unit(dev)

* manmanson/move_some_raft_tests_to_raft_folder:
  test: raft: move `raft_address_map_test` to `raft` folder
  test: raft: move `raft_sys_table_storage_test` to `raft` folder
  configure: add extended raft testing dependencies
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
e031e7b0a7 scylla-gdb: Do not leave random offset in smp-queues known vptrs
The process of getting a queue pointer is quite tricky here.
First, it checks if the vptr resolves into 'vtable for async_work_item'
and puts a None mark into known_vptrs dict.
Then, if the entry is present there are two options. First, if it's NOT
None, it's translated directly into the queue object. But if it's None,
then a loop over an offset starts that tries to check if the vptr + offset
maps to a queue. And here's the problem -- if no offsets were mapped to
any specific queues the last checked offset is put into the known vptrs
dict, so the next vptrs will miss the 2nd offset checking, but will go
ahead and use the "random" offset that had failed previously.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624085723.7156-1-xemul@scylladb.com>
2021-06-24 12:03:22 +03:00
Nadav Har'El
b965bc76e0 docs/alternator/alternator.md: improve "Running Alternator" section
A user complained that the "Running Alternator" section was confusing.
It didn't say outright which two configurations are necessary and you
had to read a few paragraphs to reach it, and it mixed the YAML names
of options and the command-line names, which are subtly different.

This patch tries to improve this.
2021-06-23 19:41:52 +03:00
Tomasz Grabiec
a60e73fe14 Merge "raft: allow to initiate leader stepdown process explicitly" from Gleb
Sometimes the ability to force a leader change is needed, for instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.

* scylla-dev/raft-stepdown-v4:
  raft: test: test leadership transfer timeout
  raft: allow to initiate leader stepdown process
2021-06-23 00:14:46 +02:00
Pavel Solodovnikov
a96ddbec35 test: raft: move raft_address_map_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:33:22 +03:00
Pavel Solodovnikov
cf5025c44e test: raft: move raft_sys_table_storage_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:31:41 +03:00
Pavel Solodovnikov
6912f76e45 configure: add extended raft testing dependencies
Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies`
and introduce `scylla_raft_dependencies` that contains
`scylla_core` (i.e., all scylla source files).

The new `scylla_raft_dependencies` variable will be used
for `raft_address_map_test` and `raft_sys_table_storage_test`,
which use a lot of machinery from scylla.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:26:18 +03:00
Nadav Har'El
3895d4bb99 docs/alternator/alternator.md: correct minor typos
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-22 20:03:48 +03:00
Benny Halevy
4ab4f63efe sstables: mx/writer: flush_tmp_bufs: maybe_yield in loop
This loop may cause pretty long reactor stalls as seen in
https://github.com/scylladb/scylla/issues/8900

Apparently output_stream<CharType>::slow_write returns
a ready future and no yielding is considered, so
add a check in the top level loop (that must already
be called from a seastar thread).

Fixes #8900

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210622152206.156302-1-bhalevy@scylladb.com>
2021-06-22 18:56:12 +03:00
Avi Kivity
d27e88e785 Merge "compaction: prevent broken_promise or dangling reader errors" from Benny
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.

At the moment compaction just closes the reader unilaterally
and this yanks the reader from under the queue_reader_handle feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.

Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.

Fixes #8755

Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"

* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
  queue_reader_handle: add get_exception method
  queue_reader: close: set value on promises on detach from handle
2021-06-22 18:52:11 +03:00
Nadav Har'El
5bb4966cac docs/alternator/alternator.md: fix link format
Unfortunately the scylla.docs.scylladb.com formatter which generates
https://scylla.docs.scylladb.com/master/alternator/alternator.html
doesn't know how to recognize HTTP URLs and convert them into proper
HTML links (something which github's formatter does).

So convert the two URLs we had in alternator.md into markdown links
which both github and our formatter recognize.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-22 18:43:27 +03:00
Calle Wilund
373fa3fa07 table: ensure memtable is actually in memtable list before erasing
Fixes #8749

If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase.

v2:
* Make interface more set-like (tolerate non-existence in erase).

Closes #8904
2021-06-22 15:58:56 +02:00
Asias He
ffa211a8c7 repair: Avoid copy rows in apply_rows_on_master_in_thread
The rows are not used after the call to to_repair_rows_list. Use
std::move() to avoid copying.

Fixes #8902

Closes #8903
2021-06-22 15:58:56 +02:00
Benny Halevy
02917c79b6 logalloc: get rid of unused _descendant_blocked_requests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>
2021-06-22 15:58:56 +02:00
Piotr Dulikowski
de1679b1b9 hints: make hints concurrency configurable and reduce the default
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
in runtime. Additionally, the default concurrency per shard is made
lower and is now 8.
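
A per-shard concurrency cap of this kind can be sketched with a semaphore. This is only an illustration in Python's asyncio, not the hints manager code; `send_hints` and `send_one` are hypothetical names, and the default of 8 mirrors the new default described above.

```python
import asyncio


async def send_hints(hints, send_one, max_concurrency=8):
    """Send all hints, keeping at most max_concurrency sends in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def guarded(hint):
        # Wait for a free slot before sending; release it when done.
        async with limit:
            await send_one(hint)

    await asyncio.gather(*(guarded(h) for h in hints))
```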

The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)

Refs: #8624

Closes #8646
2021-06-22 15:58:56 +02:00
Gleb Natapov
09528b8671 raft: test: test leadership transfer timeout
Test that if leadership transfer cannot be done in configured time frame
fsm cancels the leadership transfer process. Also check that timeout_now
message is resent on each tick while leadership transfer is in progress.
2021-06-22 14:42:50 +03:00
Gleb Natapov
ed49d29473 raft: allow to initiate leader stepdown process
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without a leadership transfer,
the cluster will be unavailable for at least the leader election
timeout.

We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If a node is still a leader after the timeout the
operation will fail.
2021-06-22 14:36:42 +03:00
Konstantin Osipov
bd410da77a raft: (service) rename raft_services service to raft_group_registry
This is a more informative name. It helps to see that, say, group0
is a separate service, and avoids bundling all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Konstantin Osipov
025f18325e raft: (service) move raft service to namespace service
Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Calle Wilund
fdb5801704 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

v3:
* Fix table::clear to discard rp_sets before memtables

Closes #8894
2021-06-21 14:53:54 +03:00
Takuya ASADA
a677c46672 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.

Also, we mixed two types of files as ghost files, which should be handled differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

See scylladb/scylla-enterprise#1780

Closes #8810
2021-06-21 14:53:54 +03:00
Calle Wilund
0a7823e683 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate case where we mismatch segments usage with actual
disk footprint and fail to flush enough to allow segment recycling
2021-06-21 06:01:19 +00:00
Calle Wilund
954da1f0a9 commitlog_test: Improve test assertion
Changes it so actual data is printed, not just error.
2021-06-21 06:01:19 +00:00
Calle Wilund
d6113912cd commitlog: Add waitable future for background sync/flush
Commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk, even if actually
clean.

This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for this
to complete in shutdown, before pruning and deleting segments
2021-06-21 06:01:19 +00:00
Benny Halevy
499357fb43 row_cache: autoupdating_underlying_reader: fast_forward_to: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-2-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Benny Halevy
3db7db5743 row_cache: autoupdating_underlying_reader: fast_forward_to: capture snapshot by value when updating reader
Currently we capture the snapshot mutation_source by reference
for calling create_underlying_reader after closing the reader.
However, if close_reader yields, the snapshot reference passed
may be gone, so capture it by value instead.

Fixes #8848

Test: unit(dev)
DTest: restore_snapshot_using_old_token_ownership_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Avi Kivity
5b3fb83ebe Merge "Remove unused code here and there" from Pavel E
"
Few randomly spotted dead code locations over past time.
Compile-test only.
"

* 'br-remove-unused-stuff' of https://github.com/xemul/scylla:
  database: Remove unused forward declarations
  feature: Remove unused friendship with gossiper
  schema_tables: Remove unused sharded<proxy> argument
  database: Remove few unused sharded<proxy> captures
  view_update_generator: Remove unused struct sstable_with_table
  storage_service: Remove write-only _force_remove_completion
  distributed_loader: Remove unused load-prio manipulations
2021-06-20 12:01:40 +03:00
Pavel Emelyanov
ab4fc41f25 database: Remove unused forward declarations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
d606321575 feature: Remove unused friendship with gossiper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
96131349e8 schema_tables: Remove unused sharded<proxy> argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
0f36f00682 database: Remove few unused sharded<proxy> captures
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
cbcbf648b6 storage_service: Remove write-only _force_remove_completion
This boolean became effectively unused after 829b4c14 (repair:
Make removenode safe by default)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
7396de72b1 distributed_loader: Remove unused load-prio manipulations
Mostly this was removed by 6dfeb107 (distributed_loader: remove unused
code).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pekka Enberg
055bc33f0f Update tools/java submodule
* tools/java 599b2368d6...5013321823 (4):
  > cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
  > node_probe: toppartitions: Fix wrong class in getMethod
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default
2021-06-18 14:19:34 +03:00
Pekka Enberg
2a9443a753 Update tools/jmx submodule
* tools/jmx a7c4c39...5311e9b (2):
  > storage_service: takeSnapshot: support the skipFlush option
  > build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
2021-06-18 14:19:29 +03:00
Avi Kivity
b099e7c254 Merge "Untie hints managers and storage service" from Pavel E
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.

This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.

tests: unit(dev),
       dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"

* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
  hints: Drop storage service from managers
  hints: Do not subscribe managers on lifecycle events directly
2021-06-17 17:12:31 +03:00
Nadav Har'El
a9b383f423 cql-pytest: improve test for SSL/TLS versions
The existing test_ssl.py which tests for Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.

So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.

With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.
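
A minimal sketch of the min/max API mentioned above, using Python's standard `ssl` module. The helper name `make_client_context` is hypothetical, and disabling certificate checks assumes a test server with a self-signed certificate; this is not the code from test_ssl.py.

```python
import ssl


def make_client_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Build a client context pinned to exactly one protocol version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Assumption: the server under test uses a self-signed certificate,
    # so certificate and hostname checks are disabled.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Pin the negotiated version by setting both bounds to the same value.
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx
```

Whether a very old version such as `ssl.TLSVersion.SSLv3` can actually be selected depends on how the local OpenSSL was built.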

Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.

Refs #8837.
Refs #8827.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
2021-06-17 17:06:31 +03:00
Nadav Har'El
8f107ece9f Update seastar submodule
* seastar 813eee3e...0e48ba88 (5):
  > net/tls: on TLS handshake failure, send error to client
  > net/dns: fix build on gcc 11
  > core: fix docstring for max_concurrent_for_each
  > test: alien_test: replace deprecated call to alien::submit_to() with new variant
  > alien: prepare for multi-instance use

The fix "net/tls: on TLS handshake failure, send error to client"
fixes #8827.

The test

    test/cql-pytest/run --ssl test_ssl.py

now xpasses, so I'll remove the "xfail" mark in a followup patch.
2021-06-17 16:24:57 +03:00
Pavel Emelyanov
92a4278cd1 hints: Drop storage service from managers
The storage service pointer is only used to (un)subscribe
to (from) lifecycle events. Now the subscription is gone,
so can the storage service pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:09:36 +03:00
Pavel Emelyanov
acdc568ecf hints: Do not subscribe managers on lifecycle events directly
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:06:26 +03:00
Tomasz Grabiec
6d8440fe70 Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).

Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:

Lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as
lost `vote_request(force)` from the new candidate
(`test_leader_transfer_lost_force_vote_request`) don't impact the
election process following that and the leader is elected as normal.

* manmanson/leadership_transfer_tests_v3:
  raft: etcd_test: test_transfer_non_member
  raft: etcd_test: test_leader_transfer_ignore_proposal
  raft: fsm_test: test_leader_transfer_lost_force_vote_request
  raft: fsm_test: test_leader_transfer_lost_timeout_now
  raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
2021-06-17 13:58:31 +02:00
Piotr Sarna
8cca68de75 cql3: add USING TIMEOUT support for deletes
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just the matter of accepting it in the grammar.

Fixes #8855

Closes #8876
2021-06-17 14:21:01 +03:00
Nadav Har'El
45c2442f49 Merge 'Avoid large allocs in mv update code' from Piotr Sarna
This series addresses #8852 by:
 * migrating to chunked_vector in view update generation code to avoid large allocations
 * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent

Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.

Tests: unit(release)
Refs  #8852

Closes #8856

* github.com:scylladb/scylla:
  db,view: limit the number of simultaneous view update futures
  db,view: use chunked_vector for view updates
2021-06-17 14:01:38 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.

Closes #8858
2021-06-17 13:41:11 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
7fd7e90213 cql-pytest: translate Cassandra's tests for static columns
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.

This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:

Refs #8869.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
2021-06-17 11:08:28 +02:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node with
   the perfect hit rate limited to 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.
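
The numbers in the two consequences above can be reproduced with a simplified model. It assumes each node's share of requests is inversely proportional to its miss rate after capping the hit rate at `max_hit_rate`; `request_shares` is a hypothetical helper, not the actual load-balancing code.

```python
def request_shares(hit_rates, max_hit_rate):
    """Split requests so the expected number of cache misses is equal per node."""
    # Cap each hit rate, then weight the node by the inverse of its miss rate.
    weights = [1.0 / (1.0 - min(h, max_hit_rate)) for h in hit_rates]
    total = sum(weights)
    return [w / total for w in weights]


# Hit rates 0.99 and 0.98 are both capped to 0.95, so the split is even.
even = request_shares([0.99, 0.98], max_hit_rate=0.95)

# A cold node (0.0) next to a fully hot node: the cold node's share is
# about 5% of the hot node's with the new cap, and about 0.1% with the old one.
new_cap = request_shares([0.0, 1.0], max_hit_rate=0.95)
old_cap = request_shares([0.0, 1.0], max_hit_rate=0.999)
```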

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
Piotr Sarna
1fb831c8c1 db,view: limit the number of simultaneous view update futures
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
2021-06-17 10:20:52 +02:00
Piotr Sarna
a7f7716ecf db,view: use chunked_vector for view updates
The number of view updates can grow large, especially in corner
cases like removing large base partitions. Chunked vector
prevents large allocations.
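
The idea behind a chunked vector can be sketched as follows: elements are stored in fixed-size chunks, so no single contiguous allocation grows with the total element count. This Python class is only an illustration of the layout, with a hypothetical name and chunk size; it is not the chunked_vector implementation.

```python
class ChunkedList:
    """Append-only sequence stored as a list of bounded-size chunks."""

    def __init__(self, chunk_size=1024):
        self._chunk_size = chunk_size
        self._chunks = []
        self._size = 0

    def append(self, item):
        # Start a new chunk when there is none yet or the last one is full.
        if self._size % self._chunk_size == 0:
            self._chunks.append([])
        self._chunks[-1].append(item)
        self._size += 1

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        # Chunk number and offset inside that chunk.
        return self._chunks[index // self._chunk_size][index % self._chunk_size]
```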
2021-06-17 10:15:17 +02:00
Avi Kivity
3c21833aac cql3: expr: make column_value (and similar) a first-class expression
Currently, column names can only appear in a boolean binary expression,
but not on their own. This means that in the statement

   SELECT a FROM tab WHERE a > 3;

We can represent the WHERE clause as an expression, but not the selector.

To pave the way for using expressions in selector contexts, we promote
the elements of binary_operator::lhs (column_value, column_value_tuple,
token) to be expressions in their own right. binary_operator::lhs
becomes an expression (wrapped in unique_ptr, because variants can't
contain themselves).

Note that all three new possibilities make sense in a selector:

  SELECT column FROM tab
  SELECT token(pk) FROM tab
  SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab

There is some fallout from this:

 - because binary_operator contains a unique_ptr, it is no longer
   copyable. We add a copy constructor and assignment operator to
   compensate.
 - often, the new elements don't make sense when evaluating a boolean
   expression, which is the only context we had before. We call
    on_internal_error in these cases. The parser right now prevents such
   cases from being constructed in the first place (this is equivalent to
   if (some_struct_value) in C).
 - in statement_restrictions.cc, we need to evaluate the lhs in the context
   of the full binary operator. I introduced with_current_binary_operator()
   for this; an alternative approach is to create a new sub-visitor.

Closes #8797
2021-06-17 10:08:58 +03:00
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Asias He
7a32cab524 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862
2021-06-16 21:16:26 +02:00
Konstantin Osipov
18e3fcdbf1 raft: (service) servers map is gid -> server, not sid -> server
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.

Raft Group 0 ID works as a cluster identifier, so is generated when a
new cluster is created and is shared by all nodes of the same cluster.

Implement a helper to get raft::server by group id.

Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
2021-06-16 19:05:50 +03:00
Calle Wilund
14559b5a86 commitlog: abort queues on shutdown
In case we only have a single segment active when shutting down,
the replenisher can be blocked even though we manually flush-deleted.

Add a signal type and abort queues using this to wake up waiters and
force them to check shutdown status.
2021-06-16 15:35:56 +00:00
Calle Wilund
227b573cdf commitlog: break out "abort" calls into member functions
2021-06-16 15:35:56 +00:00
Calle Wilund
5cd9691f00 commitlog: Do explicit discard+delete in shutdown
When we are shutting down, before trying to close the gate,
we should issue a discard to ensure waking up the replenish task
2021-06-16 15:35:56 +00:00
Calle Wilund
03b8baaa8d commitlog: Recycle or not should not depend on shutdown state
If we are using recycling, we should always use recycle in
delete_segments, otherwise we can cause deadlock with replenish
task, since it will be waiting for segment, then shutdown is set,
and we are called, and can't fulfil the alloc -> deadlock
2021-06-16 15:35:56 +00:00
Calle Wilund
5ebf5835b0 commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
If a segment, when finishing a flush call, is deletable, we should issue
a manual call to discard function (which moves deletable segments off
segment list) asap, since we otherwise are dependent on more calls
from flush handlers (memtable flush). And since we could have blocked
segment allocation, this can cause deadlocks, at least in tests.
2021-06-16 15:35:56 +00:00
Calle Wilund
cbddcf46aa commitlog: Flush all segments if we only have one.
Handle test cases with borked config so we don't deadlock
in cases where we only have one segment in a commitlog
2021-06-16 15:35:56 +00:00
Calle Wilund
a0f559a44c commitlog: Always force flush if segment allocation is waiting
Refs #8270

If segment allocation is blocked, we should bypass all thresholds
and issue a flush of as much as possible.
2021-06-16 15:35:56 +00:00
Calle Wilund
bcf4d07f0b commitlog: Include segment wasted (slack) size in footprint check
Refs #8270

Since segment allocation looks at actual disk footprint, not active,
the threshold check in timer task should include slack space so we
don't mistake sparse usage for space left.
2021-06-16 15:35:56 +00:00
Calle Wilund
1187f5c181 commitlog: Adjust (lower) usage threshold
Refs #8270

Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "half through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle on heavy usage.
2021-06-16 15:35:56 +00:00
Avi Kivity
f05ddf0967 Merge "Improve LSA descriptor encoding" from Pavel
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.

Nowadays it's one of VLE variants implemented with the help of a
loop. Re-implementing this piece with less instructions and without
a loop allows greatly reducing the allocation latency.

The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than byte-
-by-byte.

Perf measurements:

1. (new) logalloc test shows ~40% smaller times

2. perf_mutation in release mode shows ~2% increase in tps

3. the encoder itself is 2 - 4 times faster on x86_64 and
   1.05 - 3 times faster on aarch64. The speed-up depends on
   the 'encoded length', old encoder has linear time, the
   new one is constant

tests: unit(dev), perf(release), just encoder on Aarch64
"

* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
  lsa: Use express encoder
  uleb64: Add express encoding
  lsa: Extract uleb64 code into header
  test: LSA allocation perf test
2021-06-16 18:07:13 +03:00
Pavel Emelyanov
8d0780fb92 lsa: Use express encoder
To make it possible to use the express encoder, lsa needs to
make sure that the value is below express supreme value and
provide the size of the gap after the encoded value.

Both requirements can be satisfied when encoding the migrator
index on object allocation.

On free the encoded value can be larger, so the extended
express encoder will need more instructions and will not be
that efficient, so the old encoder is used there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:12 +03:00
Pavel Emelyanov
1782b0c6b9 uleb64: Add express encoding
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses uleb64 encoder with 2
features that allow to optimize the encoder:

1. the value is migrator.index() which is small enough
   to fit 2 bytes when encoded
2. After the descriptor there usually comes an object
   which is of 8+ bytes in size

Feature #1 makes it possible to encode the value with just
a few instructions. In O3 level clang makes it like

  mov    %esi,%ecx
  and    $0xfc0,%ecx
  and    $0x3f,%esi
  lea    (%rsi,%rcx,4),%ecx
  add    $0x40,%ecx
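
The arithmetic in this listing can be mirrored in a few lines of Python and checked against a byte-by-byte loop. This is a sketch of the arithmetic only: `pack_loop` and `pack_express` are hypothetical names, the loop is an assumed reading of the "standard" encoder (6 payload bits per byte, lowest group first), and anything the real encoder does beyond the listing, such as marking the end of the sequence, is not modeled.

```python
def pack_loop(value: int) -> int:
    """Pack 6 payload bits into each byte, lowest group first, using a loop."""
    word = 0
    shift = 0
    while True:
        word |= (value & 0x3F) << shift  # next 6-bit group into the next byte
        value >>= 6
        shift += 8
        if value == 0:
            break
    return word + 0x40  # constant added to the first byte, as in the listing


def pack_express(value: int) -> int:
    """Loop-free equivalent for values below 2**12, mirroring the listing."""
    assert 0 <= value < (1 << 12)
    high = value & 0xFC0          # and $0xfc0
    low = value & 0x3F            # and $0x3f
    return low + high * 4 + 0x40  # lea (low + high*4), then add $0x40
```

Both functions agree for every value that fits in 12 bits, which is why the express form can replace the loop when the value is known to be small.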

Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value gets off the needed size and overwrites some
memory after. That's OK, as this overwritten memory is where
the allocated object _will_ be, so the contents there is not
of any interest.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:10 +03:00
Pavel Emelyanov
d8dea48248 lsa: Extract uleb64 code into header
The LSA code encodes an object descriptor before the object
itself. The descriptor is 32-bit value and to put it in an
efficient manner it's encoded into unsigned little-endian
base-64 sequence.

The encoding code is going to be optimized, so put it into a
dedicated header in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:46:44 +03:00
Avi Kivity
0948908502 Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond
"
The close path of the multishard combining reader is riddled with
workarounds for the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed after it doesn't need it, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy, the
semaphores created by it can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.

Tests: unit(dev, release:v2, debug:v2,
    mutation_reader_test:debug -t test_multishard,
    multishard_mutation_query_test:debug,
    multishard_combining_reader_as_mutation_source:debug)
"

* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy: remove convenience methods
  mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
  test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
  test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
  test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
  test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
  reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
  test/lib/reader_lifcecycle_policy: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
  reader_lifecycle_policy implementations: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
  mutation_reader: shard_reader::close(): wait on the remote reader
  multishard_mutation_query: destroy remote parts in the foreground
  mutation_reader: shard_reader::close(): close _reader
  mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
2021-06-16 17:25:50 +03:00
Benny Halevy
693d5d9e6b mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
When the queue_reader_handle is terminated it was
either explicitly aborted or the reader was closed prematurely.

In this case _consume_fut should hold the root-cause error
(e.g. when compaction is stopped).  Return it instead
of trying to push the mutation fragment.
If no error is returned from _consume_fut, make sure to
return either the queue_reader_handle error, if available,
or a generic error since the writer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-16 17:25:16 +03:00
Benny Halevy
b5efc3ceac queue_reader_handle: add get_exception method
To be used by the mutation_writer in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-16 17:25:16 +03:00
Benny Halevy
4830b6647c queue_reader: close: set value on promises on detach from handle
To prevent broken_promise exception.

Since close() is mandatory, the queue_reader destructor,
which just detaches the reader from the handle, is not
needed anymore, so remove it.

Adjust the test_queue_reader unit test accordingly.

Test: test_queue_reader(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-16 17:25:14 +03:00
Konstantin Osipov
9c93d77e74 system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
Fix a bug in the definitions of system.raft and system.raft_snapshots:
group_id is TIMEUUID, not long.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
c67c77ed03 raft: (server) wait for configuration transition to complete
By default, wait for the server to leave the joint configuration
when making a configuration change.

When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().

Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.

So let's be nice and wait for a full transition in
server::set_configuration().
2021-06-16 16:52:43 +03:00
Konstantin Osipov
631c89e1a6 raft: (server) implement raft::server::get_configuration()
raft::server::set_configuration() is useless on
application level if we can't query the previous configuration.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
867440f080 raft: (service) don't throw from schema state machine
It's now started as Scylla starts, and state machine failure
leads to panic at start.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
845ff9f344 raft: (service) permit some scylla.raft cells to be empty
When loading raft state from scylla.raft, permit some cells
to be empty. Indeed, the server is not obliged to persist
all vote, term, snapshot once it starts. And the log can be
empty.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
b8fa6c6e9c raft: (service) properly handle failure to add a server
future.get() is not available outside thread context
and co_await is not available inside catch (...) block.
2021-06-16 16:47:11 +03:00
Konstantin Osipov
73c59865f7 raft: implement is_transient_error()
Add a helper to classify Raft exceptions as transient.
2021-06-16 16:26:31 +03:00
Pavel Emelyanov
1e67361267 test: LSA allocation perf test
The test measures the time it takes to allocate a bunch
of small objects on LSA inside single segment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 13:40:44 +03:00
Botond Dénes
b4e69cf63d test/lib/test_utils: require(): also log failed conditions
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs on where exactly it happened, compared
to other logged events. This patch also adds an error-level log line to
address this.
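
A minimal Python sketch of the idea, not the actual C++ helper in
test/lib/test_utils: the failed condition is logged at error level at the
point of failure, and then the exception is raised as before. The logger
name `testlog` and the `message` parameter are assumptions made for
illustration.

```python
import logging

log = logging.getLogger("testlog")  # hypothetical logger name


def require(condition, message="condition failed"):
    """Raise on failure, but log first so the failure appears in log order."""
    if not condition:
        # The log line lands next to the other events logged by the test,
        # instead of only surfacing when the exception is reported at the end.
        log.error("require() failed: %s", message)
        raise AssertionError(message)
```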

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
2021-06-16 12:05:25 +03:00
Botond Dénes
28c2b54875 mutation_reader: reader_lifecycle_policy: remove convenience methods
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without losing readability.
2021-06-16 11:29:37 +03:00
Botond Dénes
63f0839164 mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
No need for a shared pointer anymore, as we don't have to potentially
keep the shard reader alive after the multishard reader is destroyed, we
now do proper cleanup in close().
We still need a pointer as the shard reader is un-movable but is stored
in a vector which requires movable values.
2021-06-16 11:29:37 +03:00
Botond Dénes
a69db31b5c test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed as its life-cycle is effectively
enclosed in that of the life-cycle policy itself, we can cleanup the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
2021-06-16 11:29:36 +03:00
Botond Dénes
d2ddaced4e test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources alive until background reader cleanup finishes.
2021-06-16 11:29:36 +03:00
Botond Dénes
5a271e42a5 test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear time as to when it can be safe to destroy.
2021-06-16 11:29:36 +03:00
Botond Dénes
c09c62a0fb test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this even works (or if it even does)
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single read
to be admitted at all times.
2021-06-16 11:29:36 +03:00
Botond Dénes
578a092e4a reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
To prevent use-after-free resulting from any permit out-living the
semaphore.
2021-06-16 11:29:36 +03:00
Botond Dénes
a10a6e253e test/lib/reader_lifcecycle_policy: fix indentation
Left broken from the previous patch.
2021-06-16 11:29:36 +03:00
Botond Dénes
8c7447effd mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
2021-06-16 11:29:35 +03:00
Avi Kivity
c3838cbc3b Merge 'Make calculating affected ranges yieldable' from Piotr Sarna
This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism.
This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function which can potentially involve large computations and makes it use non-contiguous storage instead of std::vector to avoid large allocations.

Tests: unit(release)

Closes #8853

* github.com:scylladb/scylla:
  db,view,table: futurize calculating affected ranges
  table: coroutinize do_push_view_replica_updates
  db,view: use chunked vector for view affected ranges
  interval: generalize deoverlap()
2021-06-16 11:26:49 +03:00
Botond Dénes
4ecf061c90 reader_lifecycle_policy implementations: fix indentation
Left broken from the previous patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
a7e59d3e2c mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.

A consequence of this move is that errors that might happen
during the stopping of the reader are now handled in the shard reader,
not in each lifecycle policy implementation.
2021-06-16 11:21:38 +03:00
Botond Dénes
13d7806b62 mutation_reader: shard_reader::close(): wait on the remote reader
We now have a future<> returning close() method so we don't need to
do the cleanup of the remote reader in the background, detaching it
from the shard-reader under destruction. We can now wait for the
cleanup properly before the shard reader is destroyed and just pass the
stopped reader to reader_lifecycle_policy::destroy_reader(). This patch
does the first part -- moving the cleanup to the foreground, the API
change of said method will come in the next patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
ab8d2a04a5 multishard_mutation_query: destroy remote parts in the foreground
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
2021-06-16 11:21:38 +03:00
Botond Dénes
7552cc73cf mutation_reader: shard_reader::close(): close _reader
The reason we got away without closing _reader so far is that it is an
`std::unique_ptr<evictable_reader>` which is a
`flat_mutation_reader::impl` instance, without the
`flat_mutation_reader` wrapper, which contains the validations for
close.
2021-06-16 11:21:33 +03:00
Avi Kivity
fce124bd90 Merge "Introduce flat_mutation_reader_v2" from Tomasz
"
This series introduces a new version of the mutation fragment stream (called v2)
which aims at improving range tombstone handling in the system.

When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair),
the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has unbounded memory footprint because the
accumulator needs to keep track of all the tombstoned ranges which are still active.

Another, although more benign, problem is computational complexity needed to maintain that data structure.

The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the
stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At any point in the stream there
is a single active clustered tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment type signifies changes of the
active clustered tombstone. All fragments emitted while a given clustered tombstone is active are
affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone
is independent from the partition tombstone carried in partition_start.

The memory needed to compact a stream is now constant, because the compactor needs to only track the
current tombstone. Also, there is no need to expire ranges on each fragment because the stream emits
a fragment when the range ends.
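
The constant-memory claim can be illustrated with a small Python model. It is
only a sketch under simplifying assumptions, not Scylla code: fragments are
hypothetical `(kind, position, timestamp)` tuples, a tombstone is reduced to
its timestamp, and `None` stands for the neutral tombstone.

```python
def live_rows(stream):
    """Yield positions of rows not covered by the active clustered tombstone."""
    active = None  # the single active tombstone; None means neutral
    for kind, pos, ts in stream:
        if kind == "rtc":
            # A range_tombstone_change simply replaces the active tombstone.
            active = ts
        elif kind == "row":
            # Assumed rule: a row survives only if it is newer than the tombstone.
            if active is None or ts > active:
                yield pos


stream = [
    ("rtc", 1, 10),    # tombstone with timestamp 10 becomes active
    ("row", 2, 5),     # older than the tombstone: deleted
    ("row", 3, 15),    # newer than the tombstone: live
    ("rtc", 4, None),  # back to the neutral tombstone
    ("row", 5, 1),     # no tombstone active: live
]
```

The consumer keeps exactly one variable of state (`active`), regardless of
how many deletions the stream contains.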

This series doesn't convert all readers to v2. It introduces adaptors which can convert
between v1 and v2 streams. Each mutation source can be constructed with either v1 or v2 stream factory,
but it can be asked any version, performing conversion under the hood if necessary.

In order to guarantee that v1 to v2 conversion produces a well-formed stream, this series needs to
impose a constraint on v1 streams to trim range tombstones to clustering restrictions. Otherwise,
the v1->v2 converter could produce range tombstone changes which lie outside query restrictions, making
the stream non-canonical.

The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect
range tombstones trimmed to query restrictions, and fast-forwarding ranges. This makes the stream
more canonical, meaning that for a given set of writes, querying the database should produce the
same stream of fragments for given restrictions. There is less ambiguity in how the writes
are represented in the fragment stream. It wasn't the case with v1. For example, a given set
of deletions could be produced either as one range_tombstone, or as many, split and/or deoverlapped
with other fragments. A canonical stream makes diff-calculating easier.

The mc sstable reader was converted to v2 because it seemed like a comparable effort to do that
versus implementing range tombstone trimming in v1.

The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, related concepts.

Refs #8625. To fully fix #8625 we need to finish the transition and get rid of the converters.
Converters accumulate range tombstones.

Tests:

 - unit [dev]
"

* tag 'flat_mutation_reader_range_tombstone_split-v3.2' of github.com:tgrabiec/scylla: (26 commits)
  tests: mutation_source_test: Run tests with conversions inserted in the middle
  tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
  tests: Add tests for flat_mutation_reader_v2
  flat_mutation_reader: Update the doc to reflect range tombstone trimming
  sstables: Switch the mx reader to flat_mutation_reader_v2
  row_cache: Emit range tombstone adjacent to upper bound of population range
  tests: sstables: Fix test assertions to not expect more than they should
  flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments()
  clustering_ranges_walker: Emit range tombstone changes while walking
  tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream
  Clone flat_reader_assertions into flat_reader_assertions_v2
  test: lib: simple_schema: Reuse new_tombstone()
  test: lib: simple_schema: Accept tombstone in delete_range()
  mutation_source: Introduce make_reader_v2()
  partition_snapshot_flat_reader: Trim range tombstones to query ranges
  mutation_partition: Trim range tombstones to query ranges
  sstables: reader: Inline specialization of sstable_mutation_reader
  sstables: k_l: reader: Trim range tombstones to query ranges
  clustering_ranges_walker: Introduce split_tombstone()
  position_range: Introduce contains() check for ranges
  ...
2021-06-16 11:10:54 +03:00
Piotr Sarna
f832a30388 db,view,table: futurize calculating affected ranges
In order to avoid stalls on large inputs, calculating
affected ranges is now able to yield.
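
As a rough Python analogy (the real change is in C++ on top of Seastar
futures), a long range computation can hand control back to the scheduler at
regular intervals. The function name `merge_ranges` and the `yield_every`
parameter are made up for this sketch, and the initial `sorted()` call is
itself not preemptible here.

```python
import asyncio


async def merge_ranges(ranges, yield_every=1024):
    """Merge overlapping (start, end) ranges, yielding to the event loop periodically."""
    merged = []
    for i, (start, end) in enumerate(sorted(ranges)):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
        if (i + 1) % yield_every == 0:
            # Let other tasks run so a large input does not stall the loop.
            await asyncio.sleep(0)
    return merged
```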
2021-06-16 09:51:31 +02:00
Piotr Sarna
e3fa0246a1 table: coroutinize do_push_view_replica_updates
Makes the code cleaner, but more importantly it will make it easier
to futurize calculate_affected_clustering_ranges in the near future.
2021-06-16 09:51:30 +02:00
Avi Kivity
44f3ad836b main: use correct max-io-requests option spelling
We check for the existence of the option using one spelling,
then read it using another, so we crash with bad_lexical_cast
if it's present when casting the empty string to unsigned.

Fix by using the correct spelling.

Closes #8866
2021-06-16 09:35:05 +02:00
Tomasz Grabiec
605a6e0166 Merge "Remove int_or_strong_ordering concept" from Pavel
This concept was added to smoothly switch tri-comparing stuff from int
to strong-ordering. As of today only tests still need it, and the
conversion is pretty simple; it also requires an operator<<(ostream&)
for the std::strong_ordering type.

* xemul/br-remove-int-or-strong-ordering-2:
  util: Drop int_or_strong_ordering concept
  tests: Switch total-order-check onto strong_ordering
  to_string: Add formatter for strong_ordering
  tests: Return strong-ordering from tri-comparators
2021-06-16 09:34:49 +02:00
Botond Dénes
114459684b mutation_reader: foreign_reader::close() use on_internal_error_noexcept()
Instead of the throwing on_internal_error(). `close()` is noexcept so we
can't throw exceptions here.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210615133130.786048-1-bdenes@scylladb.com>
2021-06-16 09:34:49 +02:00
Asias He
11959173a4 storage_service: Add node_ops_cmd_heartbeat_updater helper
Multiple node operations use similar heartbeat update logic. Add a
helper to reduce the code duplication.

Fixes #8825

Closes #8826
2021-06-16 09:34:49 +02:00
Gleb Natapov
580edcef27 raft: register metrics only after fsm is created
Metrics access _fsm pointer, so we should register them only after the
pointer is populated.

Fixes: #8824

Message-Id: <YMilsCslLAeEnbaw@scylladb.com>
2021-06-16 09:34:49 +02:00
Asias He
c2cfdcd345 gossiper: Set minimum value for quarantine_delay
When a new node bootstraps to join the cluster, it is set to the
bootstrap gossip status. If the node goes away in the middle, it
is removed by gossip once it has failed to update gossip for
fat_client_timeout, which reverts the new node as a pending node.

However, if the new node is merely slow to update gossip, it may finish
bootstrapping after existing nodes have already removed it following
fat_client_timeout. In the handle_state_normal handler, the existing nodes
will then fail to find the host id for the new node and throw, which in
turn terminates the scylla process.

To mitigate the problem, we set fat_client_timeout, which is half of
quarantine_delay, to a minimum value if users set a small ring_delay
value.

Refs #8702
Refs #8859

Closes #8860
2021-06-16 09:34:49 +02:00
Tomasz Grabiec
3fcd1f43ba tests: mutation_source_test: Run tests with conversions inserted in the middle 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
cddcba27de tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
All readers are now flat so there is no need for this grouping.

Will be needed for the next patch, which needs a single function with
all test cases.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
ffb616fef6 tests: Add tests for flat_mutation_reader_v2 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
3deaa15751 flat_mutation_reader: Update the doc to reflect range tombstone trimming 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
a4275cf8bc sstables: Switch the mx reader to flat_mutation_reader_v2
The main difficulty was in making sure that emitted range tombstone
changes reflect range tombstones trimmed to clustering restrictions.
This is handled by mutation_fragment_filter and
clustering_ranges_walker. They return a list of range_tombstone_change
fragments to emit for each hop as the reader walks over the clustering
domain.

Tests which were using a normalizing reader expected range tombstones
to be split around rows. Drop this and adjust the tests accordingly. No
reader splits range tombstones around rows now.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
cf958b0ad0 row_cache: Emit range tombstone adjacent to upper bound of population range
Cache populating reader was emitting the row entry which stands for
the upper bound of the population range, but did not emit range
tombstones for the clustering range corresponding to:

  [ before(key), after(key) ).

This surfaces after sstable readers are changed to trim emitted range
tombstones to the fast-forwarding range. Before, it didn't cause
problems, because that range tombstone part would be emitted as part
of the sstable read.

The fix is to drop the optimization which pushes the row after
population is done, and let the regular handling for
copy_from_cache_to_buffer() take care of emitting the row and
tombstones for the remaining range.

A unit test is added which covers population from all sstable
versions.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
5b182ff29a tests: sstables: Fix test assertions to not expect more than they should
Before this patch, the tests expected readers to emit range tombstones
which are outside clustering restrictions. Readers do not have to emit
range tombstones outside clustering restrictions, so fix tests to only
expect the part which overlaps with query ranges.

This is a preparatory patch before changing readers to trim range
tombstones to clustering ranges.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
558d88ea17 flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments()
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
77c618f46e clustering_ranges_walker: Emit range tombstone changes while walking
The walker will now emit range tombstone change fragments while
walking.  This is in order to support the guarantee of
flat_mutation_reader_v2 saying that clustering range tombstone
information must be trimmed to clustering key restrictions.

For example, for ranges:

   [1, 3) [5, 9) [10, 11)

advancing generates the following changes:

  using rtc = range_tombstone_change;

  advance_to(0, {})  ->  []
  advance_to(2, t1)  ->  [ rtc(2, t1) ]
  advance_to(4, t2)  ->  [ rtc(3, {}) ]
  advance_to(15, t3)  ->  [ rtc(5, t2), rtc(9, {}), rtc(10, t2), rtc(11, {}) ]
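
The example above can be reproduced with a toy Python model. This is a sketch
of the described behaviour, not the real clustering_ranges_walker API: it
assumes integer positions, half-open ranges, `None` for the neutral tombstone
`{}`, and that the tombstone passed to `advance_to()` applies from the given
position onwards. The class name `RangesWalker` is hypothetical.

```python
class RangesWalker:
    """Toy walker emitting (position, tombstone) changes trimmed to ranges."""

    def __init__(self, ranges):
        self.ranges = ranges      # sorted, non-overlapping [start, end) pairs
        self.pos = float("-inf")  # last position advanced to
        self.tomb = None          # tombstone in effect from self.pos onwards
        self.emitted = None       # tombstone last announced to the stream

    def _emit(self, pos, tomb, out):
        # Only an actual change of the active tombstone produces a fragment.
        if tomb != self.emitted:
            out.append((pos, tomb))
            self.emitted = tomb

    def advance_to(self, pos, tomb):
        out = []
        for start, end in self.ranges:
            if end <= self.pos or start >= pos:
                continue  # range does not overlap [self.pos, pos)
            # The previous tombstone covers the overlapping part of the range.
            self._emit(max(start, self.pos), self.tomb, out)
            if end <= pos:
                self._emit(end, None, out)  # leaving the range closes it
        if any(start <= pos < end for start, end in self.ranges):
            self._emit(pos, tomb, out)  # new tombstone starts inside a range
        self.pos, self.tomb = pos, tomb
        return out
```

With `w = RangesWalker([(1, 3), (5, 9), (10, 11)])`, the calls
`w.advance_to(0, None)`, `w.advance_to(2, "t1")`, `w.advance_to(4, "t2")` and
`w.advance_to(15, "t3")` return the four lists shown in the example, with
each `rtc(p, t)` written as the tuple `(p, t)`.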
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
ed055db63e tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
276c68c867 Clone flat_reader_assertions into flat_reader_assertions_v2 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
a13e7b30b7 test: lib: simple_schema: Reuse new_tombstone() 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
7e01679c99 test: lib: simple_schema: Accept tombstone in delete_range() 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
79795a1a61 mutation_source: Introduce make_reader_v2()
Mutation sources can now produce natively either v1 or v2 streams. We
still have both v1 and v2 make_reader() variants, which wrap in
appropriate converters under the hood.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
4046dda844 partition_snapshot_flat_reader: Trim range tombstones to query ranges
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
655bc9fba5 mutation_partition: Trim range tombstones to query ranges
Current code was only selecting overlapping range tombstones.

We will need range tombstones to be trimmed. This is needed to change
the semantics of flat_mutation_reader v1 to produce only range
tombstones trimmed to clustering restrictions. This constructor is
used in unit tests which verify what reader produces.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
8784ffe07f sstables: reader: Inline specialization of sstable_mutation_reader
Needed before converting the mx reader to flat_mutation_reader_v2
because now it and the k_l reader cannot share the reader
implementation. They derive from different reader impl bases and push
different fragment types.
2021-06-16 00:23:49 +02:00
Pavel Solodovnikov
e9258f43cd raft: etcd_test: test_transfer_non_member
Test that a node outside configuration, that receives `timeout_now`
message, doesn't disrupt operation of the rest of the cluster.

That is, `timeout_now` has no effect and the outsider stays in
the follower state without promoting to the candidate.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
2b6d73de98 raft: etcd_test: test_leader_transfer_ignore_proposal
Test that a leader which has entered leader stepdown mode rejects
new append requests.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
ab6b0e3d62 raft: fsm_test: test_leader_transfer_lost_force_vote_request
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).

Wait up until the former leader commits the new configuration and starts
leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. But at that point it hasn't received it yet.

Deliver the `timeout_now` message to the target but lose all the
`vote_request(force)` messages it attempts to send.
This should halt the election process.
Then wait for election timeout so that candidate node starts another
normal election (without `force` flag for vote requests).

Check that this candidate further makes progress and is elected a
leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
97fe6f9d49 raft: fsm_test: test_leader_transfer_lost_timeout_now
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).

Wait up until the former leader commits the new configuration and starts
leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. But at that point it hasn't received it yet.

Lose this message and verify that the rest of the cluster (B, C)
can make progress and elect a new leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
c32497b798 raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
4-node cluster (A, B, C, D). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C, D).
Communicate the cluster up to the point where A starts to resign
its leadership (calls `transfer_leadership()`).
At this point, A should send a `timeout_now` message to one
of the remaining nodes (B, C or D) and the new configuration should be
committed. But no nodes actually have received the `timeout_now` message
yet.

Determine on which node the message should arrive, accept the
`timeout_now` message and disconnect the target from the rest of the
group.

Check that after that the cluster, which has only two live members,
could progress and elect a new leader through a normal election process.

tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:19 +03:00
Botond Dénes
98e5f0429b mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
About the multishard reader not being able to wait on returned future.
It can now via the `close()` method.
2021-06-15 15:23:32 +03:00
Tomasz Grabiec
53568f6939 sstables: k_l: reader: Trim range tombstones to query ranges
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
2021-06-15 13:14:45 +02:00
Tomasz Grabiec
f339eb3e9c clustering_ranges_walker: Introduce split_tombstone() 2021-06-15 13:14:45 +02:00
Tomasz Grabiec
08f471043e position_range: Introduce contains() check for ranges 2021-06-15 13:14:45 +02:00
Tomasz Grabiec
c9f2daaa8e range_tombstone: Introduce trim() 2021-06-15 13:14:45 +02:00
Tomasz Grabiec
1ef73abd82 flat_mutation_reader_v2: Implement read_mutation_from_flat_mutation_reader() 2021-06-15 13:14:45 +02:00
Tomasz Grabiec
9996b7ca18 flat_mutation_reader: Introduce adaptors between v1 and v2 of mutation fragment stream
The transition to v2 will be incremental. To support that, we need
adaptors between v1 and v2 which will be inserted at places which are
boundaries of conversion.

The v1 -> v2 converter needs to accumulate range tombstones, so has unbounded worst case memory footprint.

The v2 -> v1 converter trims range tombstones around clustering rows,
so generates more fragments than necessary.

Because of that, adaptors are a temporary solution and we should not
release with them on the production code paths.
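
To see why the v1 -> v2 direction has to accumulate range tombstones, here is
a batch-style Python sketch. It is not the actual converter: it assumes
tombstones are hypothetical `(start, end, timestamp)` tuples over half-open
integer ranges, and that the newest covering tombstone wins.

```python
def to_changes(tombstones):
    """Turn possibly overlapping range tombstones into (position, timestamp) changes."""
    # Every start or end bound is a point where the active tombstone may change.
    bounds = sorted({b for start, end, _ in tombstones for b in (start, end)})
    changes = []
    active = None  # None stands for the neutral tombstone
    for b in bounds:
        # All tombstones still covering this point must be remembered,
        # which is what makes the worst-case memory footprint unbounded.
        covering = [ts for start, end, ts in tombstones if start <= b < end]
        new = max(covering, default=None)
        if new != active:
            changes.append((b, new))
            active = new
    return changes
```

For the input `[(1, 5, 10), (3, 8, 20)]` this produces
`[(1, 10), (3, 20), (8, None)]`: the overlap is gone, but computing it
required looking at both tombstones at once.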
2021-06-15 13:10:47 +02:00
Tomasz Grabiec
08b5773c12 Adapt flat_mutation_reader_v2 to the new version of the API
When compacting a mutation fragment stream (e.g. for sstable
compaction, data query, repair), the compactor needs to accumulate
range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has unbounded
memory footprint because the accumulator needs to keep track of all
the tombstoned ranges which are still active.

Another, although more benign, problem is computational complexity
needed to maintain that data structure.

The fix is to get rid of the overlap of range tombstones in the
mutation fragment stream. In v2 of the stream, there is no longer a
range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At
any point in the stream there is a single active clustered
tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment
type signifies changes of the active clustered tombstone. All fragments
emitted while a given clustered tombstone is active are affected by
that tombstone. Like with the old range_tombstone fragments, the
clustered tombstone is independent from the partition tombstone
carried in partition_start.
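
A minimal Python sketch of the "single active clustered tombstone" idea
described above. It is not the Scylla classes: the fragment tuples and the
`effective_tombstones` helper are hypothetical, and a tombstone is modelled
as a plain timestamp, with 0 standing in for the neutral tombstone.

```python
NEUTRAL = 0  # hypothetical stand-in for the neutral tombstone


def effective_tombstones(fragments):
    """Yield (row_key, active_tombstone) for each clustering row.

    `fragments` is one partition's stream in position order; each item is
    ("rtc", position, tombstone) or ("row", key).
    """
    active = NEUTRAL  # the active tombstone is neutral when a partition starts
    for frag in fragments:
        if frag[0] == "rtc":
            # A range_tombstone_change replaces the active tombstone.
            active = frag[2]
        else:
            yield frag[1], active


stream = [
    ("row", 1),
    ("rtc", 2, 100),      # a deletion with timestamp 100 starts before key 2
    ("row", 2),
    ("row", 3),
    ("rtc", 4, NEUTRAL),  # the deletion ends before key 4
    ("row", 4),
]
result = list(effective_tombstones(stream))
```

The consumer keeps one value of state per partition, which is the point of
the change: there is no set of overlapping tombstones to accumulate.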

The v2 stream is strict about range tombstone trimming. It emits range
tombstone changes which reflect range tombstones trimmed to query
restrictions, and fast-forwarding ranges. This makes the stream more
canonical, meaning that for a given set of writes, querying the
database should produce the same stream of fragments for given
restrictions. There is less ambiguity in how the writes are
represented in the fragment stream. It wasn't the case with v1. For
example, a given set of deletions could be produced either as one
range_tombstone, or split and/or deoverlapped with other
fragments. A canonical stream makes diff calculation easier.

The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, and related concepts.

Refs #8625.
2021-06-15 13:10:47 +02:00
Tomasz Grabiec
e3309322c3 Clone flat_mutation_reader related classes into v2 variants
To make review easier, first clone the classes without changing the
logic. Logic and API will change in subsequent commits.
2021-06-15 13:10:09 +02:00
Tomasz Grabiec
eb0078d670 flat_mutation_reader: Document current guarantees about mutation fragment stream 2021-06-15 12:37:09 +02:00
Alejo Sanchez
9a22a30554 raft: replication test: split elect_new_leader for prevote
Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-02-v3-01

Tests: unit ({dev}), unit ({debug}), unit ({release})

This fixes current election hangs in next.

Message-Id: <20210610143558.131685-1-alejo.sanchez@scylladb.com>
2021-06-15 11:53:24 +02:00
Avi Kivity
10e75bc363 storage_proxy: remove excess continuations around abstract_read_executor::make_requests()
abstract_read_executor::make_requests() calls make_{data,digest}_request(),
which loop over endpoints in a parallel_for_each(), then collects the
result of the parallel_for_each()es with when_all_succeed(), then
a handle_exception() (or discard_result() in related callers). The
caller of make_requests then attaches a finally() block to keep `this`
alive, and discards the remaining future.

So, a lot of continuations are generated to merge the results, all in
order to keep a reference count alive.

Remove those excess continuations by having individual make_*_request()
variants elevate the reference count themselves. They all already have
a continuation to incorporate the result into the executor, all they need
is an extra shared_from_this() call. The parallel_for_each() loops
are converted to regular for loops.

Note even a local request that hits cache ends up with a non-ready future
due to an execution_stage for replica access, so these continuations
generate reactor tasks.

perf_simple_query reports:
  before: median 203905.19 tps ( 87.1 allocs/op,  20.1 tasks/op,   50860 insns/op)
  after:  median 214646.89 tps ( 81.1 allocs/op,  15.1 tasks/op,   48604 insns/op)
2021-06-15 10:49:57 +02:00
Piotr Sarna
3592d9b36e db,view: use chunked vector for view affected ranges
There were large allocation reports from vectors used for
calculating affected ranges. In order to reduce the pressure
on the allocator, chunked vector is used for storing intermediate
results.
2021-06-15 10:30:27 +02:00
Piotr Sarna
fbc83d5ac6 interval: generalize deoverlap()
Instead of working only for std::vector, deoverlap is now capable
of using other structures - including chunked_vector, which will
help split large allocations into smaller ones.
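
A rough Python illustration of a container-agnostic deoverlap, assuming it
merges overlapping intervals into disjoint ones. The closed `(start, end)`
tuples and the `make_container` parameter are hypothetical, and
`collections.deque` merely stands in for a chunked container.

```python
from collections import deque


def deoverlap(intervals, make_container=list):
    """Merge overlapping closed intervals (start, end) into disjoint ones.

    The result is built in whatever container `make_container` returns, as
    long as it supports append() and indexing of the last element.
    """
    out = make_container()
    for start, end in sorted(intervals):
        if out and start <= out[-1][1]:
            # Overlaps the previous interval: extend it instead of appending.
            last_start, last_end = out[-1]
            out[-1] = (last_start, max(last_end, end))
        else:
            out.append((start, end))
    return out


as_list = deoverlap([(5, 7), (1, 3), (2, 4)])
as_deque = deoverlap([(5, 7), (1, 3), (2, 4)], make_container=deque)
```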
2021-06-15 10:30:27 +02:00
Tomasz Grabiec
9d49a26e79 Merge "raft: randomized_nemesis_test: tick servers less often than the network in basic_test" from Kamil
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.

We use this new functionality to tick Raft servers less often than the
network in basic_test.

This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

* kbr/tick-network-often-v4:
  raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
  raft: randomized_nemesis_test: split `environment::tick` into two functions
  raft: randomized_nemesis_test: fix potential use-after-free in basic_test
2021-06-15 01:54:57 +02:00
Kamil Braun
8f1caa6a90 raft: randomized_nemesis_test: generalize ticker to take a set of functions
... with associated calling periods and use the new API in `basic_test`.

Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.

We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).
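
The period semantics can be sketched in a few lines of Python. This is only
an illustration, not the test's C++ `ticker`: `run_ticker` and the two
callbacks are hypothetical names, and the reactor yield between ticks is
reduced to a comment.

```python
def run_ticker(callbacks, ticks):
    """For each (n, f) in callbacks, call f on every n-th tick."""
    for tick in range(1, ticks + 1):
        for period, fn in callbacks:
            if tick % period == 0:
                fn()
        # The real ticker would yield to the reactor here.


calls = {"network": 0, "servers": 0}


def tick_network():
    calls["network"] += 1


def tick_servers():
    calls["servers"] += 1


# Tick the servers once per 10 network ticks.
run_ticker([(1, tick_network), (10, tick_servers)], ticks=100)
```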

This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
2021-06-14 16:54:38 +02:00
Kamil Braun
c0b80f1f8a raft: randomized_nemesis_test: split environment::tick into two functions
One for ticking the network and one for ticking the servers.
2021-06-14 16:54:38 +02:00
Kamil Braun
f42776aded raft: randomized_nemesis_test: fix potential use-after-free in basic_test
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.

If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.

Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.

We fix the problems by:
- making `environment` `weakly_referencable` and checking if it's alive
  before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
  function arguments (so these things get allocated inside the coroutine
  frame).
2021-06-14 16:54:38 +02:00
Nadav Har'El
3645c7104b Merge: Wrap alternator start-stop into controller
Merged patch series by Pavel Emelyanov:

Alternator start and stop code is sitting inside the main()
and it's a big piece of code out there. Having it all in main
complicates rework of start-stop sequences, it's much more
handy to have it in alternator/.

This set puts the mentioned code into transport- and thrift-
like controller model. While doing it one more call for global
storage service goes away.

* 'br-alternator-clientize' of https://github.com/xemul/scylla:
  alternator: Move start-stop code into controller
  alternator: Move the whole starting code into a sched group
  alternator: Dont capture db, use cfg
  alternator: Controller skeleton
  alternator: Controller basement
  alternator: Drop storage service from executor
2021-06-14 15:44:10 +03:00
Michael Livshin
15b0e5c4d2 sstables: count read range tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210602152210.17948-2-michael.livshin@scylladb.com>
2021-06-14 14:37:33 +02:00
Michael Livshin
9ef2317248 row_cache: count range tombstones processed during read
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210602152210.17948-1-michael.livshin@scylladb.com>
2021-06-14 14:29:05 +02:00
Nadav Har'El
6726fe79b6 Merge 'view: fix use-after-move when handling view update failures' from Piotr Sarna
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Fixes #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

Closes #8834

* github.com:scylladb/scylla:
  view: fix use-after-move when handling view update failures
  db,view: explicitly move the mutation to its helper function
  db,view: pass base token by value to mutate_MV
2021-06-14 13:15:35 +03:00
Alejo Sanchez
5c8092cf42 raft: fix election with disruptive candidate
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.

Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling

Tests: unit ({dev}), unit ({debug}), unit ({release})

Changes in v2:
    - Fixed commit message                               @kostja

Without prevote, a node disconnected for long enough becomes candidate.
While disconnected (A) it keeps increasing its term.
When it rejoins it disrupts the current leader (C) which steps down due
to the higher term in (A)'s append_entries_reply and (C) also increases
its term.

Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).

Then (C) rejects voting for (A) because it has shorter log.
And (C) becomes candidate but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and election timeout has not passed.

(A) and (C) alone can't reach quorum 2/4. So elections never succeed.

This patch addresses this problem by making followers not ignore vote
requests from who they think is the current leader even though
election timeout was not reached.

As @kostja noted, if failure detector would consider a leader alive only
as long as it sends heartbeats (append requests) this patch is no longer
needed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
2021-06-14 11:07:38 +02:00
Piotr Jastrzebski
1ed92e37f8 database: Fix warning about deprecated update_shares_for_class usage
This patch fixes the following compilation warning:

database.cc:430:33: warning: 'update_shares_for_class' is deprecated:
Use io_priority_class.update_shares [-Wdeprecated-declarations]
    _inflight_update = engine().update_shares_for_class(_io_priority,
    uint32_t(shares));

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8751
2021-06-14 10:42:22 +03:00
Piotr Sarna
8a049c9116 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-14 09:36:10 +02:00
Piotr Sarna
7cdbb7951a db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
2021-06-14 09:34:40 +02:00
Piotr Sarna
88d4a66e90 db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-14 09:30:38 +02:00
Nadav Har'El
6a8441ef03 Update seastar submodule
* seastar 4506b878...813eee3e (12):
  > reactor: fix race with boost::barrier destructor during smp initialization
  > Merge "Merge io-group and io-queue configs" from Pavel E
  > tests: add test for skipping data from a socket
  > tests: transform socket_test into a test suite
  > .gitignore: Add tags
  > tls: retain handshake error and return original problem on repeated failures
  > iostream: fix skipping from closed sockets
  > gitignore .cooking_memory
  > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy
  > io_priority_class: Make update_shares const
  > Remove <seastar/core/apply.hh>
  > smp: allow having multiple instances of the smp class

The fix to make io_priority::update_shares() const will allow getting
rid of one of the compilation warnings.
2021-06-14 10:27:14 +03:00
Nadav Har'El
061e43e9d4 Merge 'Fix some compilation warnings' from Piotr Jastrzębski
Closes #8850

* github.com:scylladb/scylla:
  priority_manager: Fix warnings about deprecated register_one_priority_class usage
  main: Fix warning about deprecated usage of io_queue::capacity
2021-06-14 10:05:27 +03:00
Piotr Jastrzebski
831a60a6cd priority_manager: Fix warnings about deprecated register_one_priority_class usage
This patch fixes the following warnings:
service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    : _commitlog_priority(engine().register_one_priority_class("commitlog", 1000))

service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000))

service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _streaming_priority(engine().register_one_priority_class("streaming", 200))

service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _sstable_query_read(engine().register_one_priority_class("query", 1000))

service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _compaction_priority(engine().register_one_priority_class("compaction", 1000))

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-14 08:49:46 +02:00
Piotr Jastrzebski
3ec04433f7 main: Fix warning about deprecated usage of io_queue::capacity
This patch fixes the following warning:

main.cc:307:53: warning: 'capacity' is deprecated: modern I/O queues
should use a property file [-Wdeprecated-declarations]
            auto capacity = engine().get_io_queue().capacity();

It's fine to just check --max-io-requests directly because seastar
sets io_queue::capacity to the value of this parameter anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-14 08:49:42 +02:00
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in partitioned set which came into existence after
compound set was introduced.

The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if the next position is freed while it is being used
as the current position, subsequent selectors find the current
position freed and produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Kamil Braun
9e85921006 storage_proxy: remove a feedback loop from the speculative retry latency metric
To handle a read request from a client, the coordinator node must send
data and digest requests to replicas, reconcile the obtained results
(by merging the obtained mutations and comparing digests), and possibly
send more requests to replicas if the digests turned out to be different
in order to perform read repair and preserve consistency of observed reads.

In contrast to writes, where coordinators send their mutation write requests
to all replicas in the replica set, for reads the coordinators send
their requests only to as many replicas as is required to achieve
the desired CL.

For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends
its request to a subset of 2 nodes out of the 3 possible replicas. The
choice of the 2-node subset is random; the distribution used for the
random roll is affected by certain things such as the "cache hitrate"
metric. The details are not that relevant for this discussion.

If not all of the initially chosen replicas
answer within a certain time period, the coordinator may send an
additional request to one more replica, hoping that this replica helps
achieving the desired CL so the entire client request succeeds. This
mechanism is called "speculative retry" and is enabled by default.

This time period - call it `T` - is chosen based on keyspace
configuration. The default value is "99.0PERCENTILE", which means that
`T` is roughly equal to the 99th percentile of the latency distribution
of previous requests (or at least the most recent requests; the
algorithm uses an exponential decay strategy to make old requests less
relevant for the metric). The latencies used are the durations of whole
coordinator read requests: each such duration measurement starts before
the first replica request is sent and ends after the last replica
request is answered, among the replica requests whose results were used
for the reconciled result returned to the client (there may be more
requests sent later "in the background" - they don't affect the client
result and are not taken into account for the latency measurement).

This strategy, however, gives an undesired effect which appears
when a significant part of all requests require a speculative retry to
succeed. To explain this effect it's best to consider a scenario which
takes this to the extreme - where *all* requests require a speculative retry.

Consider RF=3 and CL=QUORUM so each read request initially uses 2
replicas. Let {A, B, C} be the set of replicas. We run a uniformly
distributed read workload.

Initially the cluster operates normally. Roughly 1/3 of all requests go
to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th
percentile of read request latencies is 50ms. Suppose that the average
round-trip latency between a coordinator and any replica is 10ms.

Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power
outage. This means that other nodes are initially not aware that C is down,
they must wait for the failure detector to convict C as unavailable
which happens after a configurable amount of time. The current default
is 20s, meaning that by default coordinators will still attempt to send
requests to C for 20s after it is hard-killed.

During this period the following happens:
- About 2/3 of all requests - the ones which were routed to {A, C} and
  {B, C} - do not finish within 50ms because C does not answer. For
  these requests to finish, the coordinator performs a speculative retry
  to the third replica which finishes after ~10ms (the average round-trip
  latency). Thus the entire request, from the coordinator's POV, takes ~60ms.
- Eventually (very quickly in fact - assuming there are many concurrent
  requests) the P99 latency rises to 60ms.
- Furthermore, the requests which initially use {A, C} and {B, C} start
  taking more than 2/3 of all requests because they are stuck in the foreground
  longer than the {A, B} requests (since their latencies are higher).
- These requests do not finish within 60ms. Thus coordinators perform
  speculative retries. Thus they finish after ~70ms.
- Eventually the P99 latency rises to 70ms.
- These bad requests take an even larger portion of all requests.
- These requests do not finish within 70ms. They finish after ~80ms.
- Eventually the P99 latency rises to 80ms.
- And so on.

In metrics, we observe the following:
- Latencies rise roughly linearly. They rise until they hit a certain limit;
  this limit comes from the fact that `T` is upper-bounded by the
  read request timeout parameter divided by 2. Thus if the read request
  timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`.
  Thus eventually all requests will take about `2.5s + 10ms` to finish
  (`2.5s` until speculative retry happens, `10ms` for the last round-trip),
  unless the node is marked as DOWN before we reach that limit.
- Throughput decreases roughly proportionally to the y = 1/x function, as
  expected from Little's law.
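
The progression above can be reproduced with a toy model in Python. It is a
sketch under simplifying assumptions, not a simulation of the coordinator:
`simulate_threshold` is a hypothetical helper, and it assumes that in every
round the slow requests dominate, so their latency becomes the next P99.

```python
def simulate_threshold(p99_ms, rtt_ms, read_timeout_ms, rounds):
    """Return the successive speculative-retry thresholds T, in ms."""
    cap = read_timeout_ms / 2  # T is upper-bounded by half the read timeout
    history = []
    for _ in range(rounds):
        t = min(p99_ms, cap)
        history.append(t)
        # A request stuck on the dead replica waits T and then one more
        # round-trip; its latency is assumed to become the new P99.
        p99_ms = t + rtt_ms
    return history


thresholds = simulate_threshold(p99_ms=50, rtt_ms=10, read_timeout_ms=5000, rounds=4)
capped = simulate_threshold(p99_ms=2490, rtt_ms=10, read_timeout_ms=5000, rounds=3)
```

With the numbers from the scenario, `thresholds` climbs by one round-trip
per round, and `capped` stops growing once it reaches half the read timeout.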

Everything goes back to normal when nodes mark C as DOWN, which happens
after ~20s by default as explained above. Then coordinators start
routing all requests to {A, B} only.

This does not happen for graceful shutdowns, where C announces to the
cluster that it's shutting down before shutting down, causing other
nodes to mark it as DOWN almost immediately.

The root cause of the issue is a feedback loop in the metric used to
calculate `T`: we perform a speculative retry after `T` -> P99 request
latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc.

We fix the problem by changing the measurements used for calculating
`T`. Instead of measuring the entire coordinator read latency, we
measure each replica request separately and take the maximum over these
measurements. We only take into account the measurements for requests
that actually contributed to the request's result.

The previous statistic would also measure latencies of failed requests. Now we
measure only latencies of successful replica requests. Indeed this makes
sense for the speculative retry use case; the idea behind speculative retry
is that we assume that requests usually succeed within a certain time
period, and we should perform the retry if they take longer than that.
To measure this time period, taking failed requests into account doesn't
make much sense.

In the scenario above, for a request that initially goes to {A, C}, the
following would happen after applying the fix:
- We send the requests to A and C.
- After ~10ms A responds. We record the ~10ms measurement.
- After ~50ms we perform speculative retry, sending a request to B.
- After ~10ms B responds. We record the ~10ms measurement.

The maximum over recorded measurements is ~10ms, not ~60ms.
The feedback loop is removed.
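
The difference between the two measurements can be written down directly.
This Python sketch uses hypothetical helpers `old_sample` and `new_sample`
and assumes the first replica request is sent at time 0; each event is a
(send time, round-trip) pair in milliseconds for a replica request that
contributed to the result.

```python
def old_sample(replica_events):
    # Whole coordinator request: from the first send (time 0) until the
    # last contributing response arrives.
    return max(sent + rtt for sent, rtt in replica_events)


def new_sample(replica_events):
    # Each replica request is measured separately; keep the slowest one.
    return max(rtt for _sent, rtt in replica_events)


# A answers after ~10ms; B is sent the speculative retry at ~50ms and
# answers ~10ms later. C never answers, so it contributes nothing.
events = [(0, 10), (50, 10)]
before_fix = old_sample(events)
after_fix = new_sample(events)
```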

Experiments show that the solution is effective: in scenarios like
above, after C is killed, latencies only rise slightly by a constant
amount and then maintain their level, as expected. Throughput also drops
by a constant amount and maintains its level instead of continuously
dropping with an asymptote at 0.

Fixes #3746.
Fixes #7342.

Closes #8783
2021-06-13 16:19:11 +03:00
Avi Kivity
d6f3a62c13 Merge 'Add option to forbid SimpleStrategy in CREATE/ALTER KEYSPACE' from Nadav Har'El
This series adds a new configuration option -
restrict_replication_simplestrategy - which can be used to restrict the
ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE
statement. This is part of a new effort (dubbed "safe mode") to allow an
installation to restrict operations which are un-recommended or dangerous
(see issue #8586 for why SimpleStrategy is bad).

The new restrict_replication_simplestrategy option has three values:
"true", "false", and "warn":

For the time being, the default is still "false", which means SimpleStrategy is not
restricted, and can still be used freely.

Setting a value of "true" means that SimpleStrategy *is* restricted -
trying to create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkToplogyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

Trying to ALTER an existing keyspace to use SimpleStrategy will
similarly fail.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Fixes #8586

Closes #8765

* github.com:scylladb/scylla:
  cql: create_keyspace_statement: move logger out of header file
  cql: allow restricting SimpleStrategy in ALTER KEYSPACE
  cql: allow restricting SimpleStrategy in CREATE KEYSPACE
  config: add configuration option restrict_replication_simplestrategy
  config: add "tri_mode_restriction" type of configurable value
  utils/enum_option.hh: add implicit converter to the underlying enum
2021-06-13 15:39:18 +03:00
Nadav Har'El
6f813bd3a1 cql: create_keyspace_statement: move logger out of header file
Move the logger declaration from the header file into the only source
file that uses it.

This is just a small cleanup similar to what the previous patch did in
alter_keyspace_statement.{cc,hh}.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
dea075c038 cql: allow restricting SimpleStrategy in ALTER KEYSPACE
In the previous patch we made CREATE KEYSPACE honor the
"restrict_replication_simplestrategy" option. In this patch we do the
same for ALTER KEYSPACE.

We use the same function check_restricted_replication_strategy()
used in CREATE KEYSPACE for the logic of what to allow depending on the
configuration, and what errors or warnings to generate.

One of the non-self-explanatory changes in this patch is to execute():
Previously, alter_keyspace_statement inherited its execute() from
schema_altering_statement. Now we need to override it to check if the
operation is forbidden before running schema_altering_statement's execute()
or to warn after it is run. In the previous patch we didn't need to add
a new execute() for create_keyspace_statement because we already had one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
b9539d7135 cql: allow restricting SimpleStrategy in CREATE KEYSPACE
This patch uses the configuration option which we added in the previous
patch, "restrict_replication_simplestrategy", to control whether a user
can use the SimpleStrategy replication strategy in a CREATE KEYSPACE
operation. The next patch will do the same for ALTER KEYSPACE.

As a tri_mode_restriction, the restrict_replication_simplestrategy option
has three values - "true", "false", and "warn":

The value "false", which today is still the default, means that
SimpleStrategy is not restricted, and can still be used freely.

The value "true" means that SimpleStrategy *is* restricted - trying to
create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkToplogyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Because we plan to use the same checks and the same error messages
also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in
a function check_restricted_replication_strategy() which we will use for
ALTER KEYSPACE as well.
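
The three behaviours described above could be sketched as follows. This is
a Python illustration only: the function name is borrowed from this patch,
but its signature, the string-valued `restrict` argument and the
`ConfigurationException` class defined here are assumptions, and the
messages are abbreviated.

```python
class ConfigurationException(Exception):
    """Hypothetical stand-in for the error returned to the client."""


def check_restricted_replication_strategy(strategy, keyspace, restrict):
    """Return a warning string or None; raise if the strategy is forbidden.

    `restrict` is one of "true", "false" or "warn".
    """
    if strategy != "SimpleStrategy" or restrict == "false":
        return None
    if restrict == "true":
        raise ConfigurationException(
            "SimpleStrategy replication class is not recommended, and "
            "forbidden by the current configuration."
        )
    # "warn": allow the statement, but report a warning to the caller.
    return (
        "SimpleStrategy replication class is not recommended, but was "
        f"used for keyspace {keyspace}."
    )


warning = check_restricted_replication_strategy("SimpleStrategy", "try1", "warn")
allowed = check_restricted_replication_strategy("SimpleStrategy", "try1", "false")
```

Returning the warning instead of logging it lets the caller both attach it
to the statement's response and write it to the log.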

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:25 +03:00
Nadav Har'El
8a4ac6914a config: add configuration option restrict_replication_simplestrategy
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing to restrict this strategy (true), to allow
it (false), or to just warn when it is used (warn).

After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:16 +03:00
Nadav Har'El
a3d6f502ad config: add "tri_mode_restriction" type of configurable value
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".

We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.
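
For illustration, a three-valued option of this kind could be modelled
as below. This is a Python sketch, not the C++ type added by the patch;
the names TriModeRestriction and parse_tri_mode are hypothetical, and
accepting mixed-case input is an assumption:

```python
from enum import Enum


class TriModeRestriction(Enum):
    FALSE = "false"  # do not restrict
    TRUE = "true"    # restrict: the operation fails
    WARN = "warn"    # allow, but warn


def parse_tri_mode(text):
    """Parse a configuration string into a TriModeRestriction."""
    try:
        return TriModeRestriction(text.strip().lower())
    except ValueError:
        raise ValueError(
            f"expected 'true', 'false' or 'warn', got {text!r}") from None
```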

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:44:20 +03:00
Nadav Har'El
afacffc556 utils/enum_option.hh: add implicit converter to the underlying enum
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 13:18:49 +03:00
Avi Kivity
ec60f44b64 main: improve process file limit handling
We check that the number of open files is sufficient for normal
work (with lots of connections and sstables), but we can improve
it a little. Systemd sets up a low file soft limit by default (so that
select() doesn't break on file descriptors larger than 1023) and
recommends[1] raising the soft limit to the more generous hard limit
if the application doesn't use select(), as ours does not.

Follow the recommendation and bump the limit. Note that this applies
only to scylla started from the command line, as systemd integration
already raises the soft limit.
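
The same idea, sketched in Python with the standard resource module
(the patch itself is C++; the function name here is hypothetical):

```python
import resource


def raise_nofile_soft_limit():
    """Raise the soft open-files limit to the hard limit; return both."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the
        # hard limit, but not beyond it.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```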

[1] http://0pointer.net/blog/file-descriptor-limits.html

Closes #8756
2021-06-13 09:19:35 +03:00
Tomasz Grabiec
7521301b72 Merge "raft: add tests for non-voters and fix related bugs" from Kostja
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.

* scylla-dev/raft-learner-test-v4:
  raft: (testing) test non-voter can vote
  raft: (testing) test receiving a confchange in a snapshot
  raft: (testing) test voter-non-voter config change loop
  raft: (testing) test non-voter doesn't start election on election timeout
  raft: (testing) test what happens when a learner gets TimeoutNow
  raft: (testing) implement a test for a leader becoming non-voter
  raft: style fix
  raft: step down as a leader if converted to a non-voter
  raft: improve configuration consistency checks
  raft: (testing) test that non-voter stays in PIPELINE mode
  raft: (testing) always return fsm_debug in create_follower()
2021-06-12 21:36:47 +03:00
Botond Dénes
cb208a56f2 docs/guides/debugging.md: expand section on libthread-db
Fix a typo in enabling libthread-db debugging.

Add command line snippet which can enable libthread-db debugging on
startup.

Split the long wall of text about likely problems into separate
per-problem subsections.

Add sub-section about recently found Fedora bug(?)
https://bugzilla.redhat.com/show_bug.cgi?id=1960867.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210603150607.378277-1-bdenes@scylladb.com>
2021-06-12 21:36:47 +03:00
Nadav Har'El
9774c146cc cql-pytest: add test for connecting with different SSL/TLS versions
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.

Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.

The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).

Refs #8837
Refs #8827

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
2021-06-12 21:36:47 +03:00
Pavel Emelyanov
7b1f2d91a5 scylla-gdb: Remove maximum-request-size report
The recent seastar update moved the variable again, so to have a
proper support for it we'd need to have 2 try-catch attempts and
a default. Or 1 try-catch, but make sure the maintainer commits
this patch AND the seastar update in one go, so that the intermediate
variable doesn't creep into an intermediate commit. Or accept that the
scylla-gdb test is not bisect-safe for a little while.

Instead of making this complex choice I suggest just dropping the
volatile variable from the script altogether. This thing is actually
a constant derived from the latency goal and io-properties.yaml
file, so it can be calculated without gdb help (unlike run-time
bits like group rovers or numbers of queued/executing resources).
To free developers from doing all this math by hand, there's an
"ioinfo" tool that (when run with correct options) prints the
results of this math on the screen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210610120151.1135-1-xemul@scylladb.com>
2021-06-11 19:06:43 +02:00
Michael Livshin
2bbc293e22 tests: improve error reporting of test_env::reusable_sst()
Distinguish the "no such sstable" case from any reading errors.

While at it, coroutinize the function.

Refs #8785.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>
2021-06-11 19:06:43 +02:00
Pavel Emelyanov
fbd98e6292 alternator: Move start-stop code into controller
This move is not "just move", but also includes:

- putting the whole thing into seastar::async()
- switch from locally captured dependencies into controller's
  class members
- making smp_service_groups optional because it doesn't have a
  default constructor and should somehow survive on the constructed
  controller until its start()

Also copy few bits from main that can be generalized later:

- get_or_default() helper from main
- sharded_parameter lambda for cdc
- net family and preferred thing from main

( this also fixed the indentation broken by previous patch )

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:17:27 +03:00
Pavel Emelyanov
9e2ad77436 alternator: Move the whole starting code into a sched group
The controller won't have the database_config at hands to get
the sched group from. All other client services run the whole
controller start in the needed sched group, so prepare the
alternator controller for that.

To make it compile (and while-at-it) also move up the sharded
server and executor instances and the smp_service_group. All
of these will migrate onto the controller in the next patch.

( the indentation is deliberately left broken )

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:11:02 +03:00
Pavel Emelyanov
f918a75572 alternator: Dont capture db, use cfg
When .init()ing the server one needs to provide the
max_concurrent_requests_per_shard value from config.

Instead of carrying the database around for it -- use the
db::config itself which is at hand. All the shards share
its instance anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:09:16 +03:00
Pavel Emelyanov
4aad618409 alternator: Controller skeleton
Add the controller class with all the needed dependencies. For
now completely unused (thus a bunch of (void)-s here and there).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:08:37 +03:00
Pavel Emelyanov
316e9af234 alternator: Controller basement
Add header and source file for transport- (and thrift-) like controller
that'll do all the bookkeeping needed to start and stop this client
service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:06:10 +03:00
Pavel Emelyanov
773d2fe2a4 alternator: Drop storage service from executor
It's completely unused in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:05:11 +03:00
Konstantin Osipov
2be8a73c34 raft: (testing) test non-voter can vote
When a non-voter is asked for its vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting with their current configuration,
and the non-voter may not have the latest configuration
when it is requested to vote.
2021-06-11 17:16:57 +03:00
Konstantin Osipov
eaf32f2c3c raft: (testing) test receiving a confchange in a snapshot 2021-06-11 17:16:56 +03:00
Konstantin Osipov
d08ad76c24 raft: (testing) test voter-non-voter config change loop 2021-06-11 17:16:55 +03:00
Konstantin Osipov
6e4619fe87 raft: (testing) test non-voter doesn't start election on election timeout 2021-06-11 17:16:55 +03:00
Konstantin Osipov
c8ae13a392 raft: (testing) test what happens when a learner gets TimeoutNow
Once a learner receives TimeoutNow it becomes a candidate, discovers it
can't vote, doesn't increase its term and converts back to a
follower. Once entries arrive from a new leader it updates its
term.
2021-06-11 17:16:55 +03:00
Konstantin Osipov
a972269630 raft: (testing) implement a test for a leader becoming non-voter 2021-06-11 17:16:55 +03:00
Konstantin Osipov
ba046ed1ab raft: style fix 2021-06-11 17:16:54 +03:00
Konstantin Osipov
b0a1ebc635 raft: step down as a leader if converted to a non-voter
If the leader becomes a non-voter after a configuration change,
step down and become a follower.

Non-voting members are an extension to Raft, so the protocol spec does
not define whether they can be leaders. I can not think of a reason
why they can't, yet I also can not think of a reason why it's useful,
so let's forbid this.

We already do not allow non-voters to become candidates, and
they ignore timeout_now RPC (leadership transfer), so they
already can not be elected.
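
The rule can be sketched as a small state transition. This is a Python
illustration under simplifying assumptions, not the FSM code from the
patch: roles are plain strings and the configuration is reduced to a
set of voter ids.

```python
def role_after_config_change(role, my_id, voters):
    """Return the role a server should take once a new config applies."""
    if role == "leader" and my_id not in voters:
        return "follower"  # a leader that became a non-voter steps down
    return role
```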
2021-06-11 17:16:50 +03:00
Konstantin Osipov
684e0d2a8c raft: improve configuration consistency checks
Isolate the checks for configuration transitions in a static function,
to be able to unit test outside class server.

Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.

*Allow* to transfer leadership in a configuration when
the only voter is the leader itself. This would be equivalent
to syncing the leader log with the learner and converting
the leader to the follower itself. This is safe, since
the leader will re-elect itself quickly after an election timeout,
and may be used to do a rolling restart of a cluster with
only one voter.

A test case follows.
2021-06-11 17:16:47 +03:00
Konstantin Osipov
3e6fd5705b raft: (testing) test that non-voter stays in PIPELINE mode
Test that configuration changes preserve PIPELINE mode.
2021-06-11 17:07:39 +03:00
Konstantin Osipov
1dfe946c91 raft: (testing) always return fsm_debug in create_follower()
create_follower() is a test helper, so it's OK to return
a test-enabled FSM from it.
This will be used in a subsequent patch/test case.
2021-06-11 12:24:43 +03:00
Alejo Sanchez
ff34a6515d raft: replication test: fix elect_new_leader
Recently, the logic of elect_new_leader was changed to allow the old
leader to vote for the new candidate. But the implementation is wrong as
it re-connects the old leader in all cases, regardless of whether the
nodes were already disconnected.

Check if both old leader and the requested new leader are connected
first and only if it is the case then the old leader can participate in
the election.

There were occasional hangs in the loop of elect_new_leader because
other nodes besides the candidate were ticked.  This patch fixes the
loop by removing ticks inside of it.

The loop is needed to handle prevote corner cases (e.g. 2 nodes).

While there, also wait for the log on all followers, to prevent a
previously dropped leader from becoming a dueling candidate.

And update _leader only if it was changed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-3-alejo.sanchez@scylladb.com>
2021-06-10 12:36:25 +02:00
Alejo Sanchez
add12d801d raft: log ignored prevote
Add a log line for ignored prevote.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-2-alejo.sanchez@scylladb.com>
2021-06-10 12:33:34 +02:00
Benny Halevy
e0622ef461 compaction_manager: stop_ongoing_compactions: print reason for stopping
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210610084704.388215-1-bhalevy@scylladb.com>
2021-06-10 11:52:57 +03:00
Piotr Sarna
7506f44c77 cql3: use existing constant for max result in indexed statements
Original code which introduced enforcing page limits for indexed
statements created a new constant for max result size in bytes.
Botond reported that we already have such a constant, so it's now
used instead of reinventing it from scratch.

Closes #8839
2021-06-10 11:08:54 +03:00
Nadav Har'El
b26fcf5567 test/alternator: increase timeouts in test_tracing.py
The query tracing tests in test/alternator's test_tracing.py had one
timeout of 30 seconds to find the trace, and one unclearly-coded timeout
for finding the right content for the trace. We recently saw both
timeouts exceeded in tests, but only rarely and only in debug mode,
in a run 100 times slower than normal.

This patch increases both timeouts to 100 seconds. Whatever happens then,
we win: If the test stops failing, we know the new timeout was enough.
If the test continues to fail, we will be able to conclude that we have a
real bug - e.g., perhaps one of the LWT operations has a bug causing it
to hang indefinitely.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210608205026.1600037-1-nyh@scylladb.com>
2021-06-10 09:19:01 +03:00
Benny Halevy
8ecc626c15 queue_reader_handle: mark copy constructor noexcept
It is trivially so, as std::exception_ptr is nothrow default
constructible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210609135925.270883-2-bhalevy@scylladb.com>
2021-06-09 20:09:01 +03:00
Benny Halevy
3100cdcc65 queue_reader_handle: move-construct also _ex
We're only moving the other reader without the
other's exception (as it may already be abandoned
or aborted).

While at it, mark the constructor noexcept.

Fixes #8833

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210609135925.270883-1-bhalevy@scylladb.com>
2021-06-09 20:09:01 +03:00
Pavel Emelyanov
990db016e9 transport: Untie transport and database
Both controller and server only need database to get config from.
Since controller creation only happens in main() code which has the
config itself, we may remove database mentioning from transport.

Previous attempt was not to carry the config down to the server
level, but it stepped on an updateable_value landmine -- the u._v.
isn't copyable cross-shard (despite the docs) and to properly
initialize server's max_concurrent_requests we need the config's
named_value member itself.

The db::config that flies through the stack is const reference, but
its named_values do not get copied along the way -- the updateable
value accepts both references and const references to subscribe on.

tests: start-stop in debug mode

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210607135656.18522-1-xemul@scylladb.com>
2021-06-09 20:04:12 +03:00
Eliran Sinvani
9bfb2754eb dist: rpm: Add specific versioning and python3 dependency
The Red Hat packages were missing two things: first, the metapackage
didn't depend on the python3 package at all, and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency, which can cause some problems during upgrade.
Doing both of the things listed here is a bit of an overkill, as either
one of them separately would solve the problem described in #XXXX,
but both should be applied in order to express the correct concept.

Fixes #8829

Closes #8832
2021-06-09 20:02:43 +03:00
Asias He
0665d9c346 gossip: Handle nodes removed from live endpoints directly
When a node is removed from the _live_endpoints list directly, e.g., a
node being decommissioned, it is possible the node might not be marked
as down in gossiper::failure_detector_loop_for_node loop before the loop
exits. When the gossiper::failure_detector_loop loop starts again, the
node will not be considered because it is not present in _live_endpoints
list any more. As a result, the node will not be marked as down through
the gossiper::failure_detector_loop_for_node loop.

To fix, we mark the nodes that are removed from _live_endpoints
lists as down in the gossiper::failure_detector_loop loop.
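
In outline, the fix amounts to a set difference between two snapshots
of the live list. The Python below is only a sketch of that idea; the
function name is hypothetical and the real gossiper code is C++:

```python
def endpoints_to_mark_down(previous_live, current_live):
    """Return nodes that left the live set since the previous iteration."""
    return set(previous_live) - set(current_live)
```

For example, if the previous snapshot was {n1, n2, n3} and n2 has been
decommissioned in the meantime, only n2 is returned and marked as down.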

Fixes #8712

Closes #8770
2021-06-09 15:02:25 +02:00
Tomasz Grabiec
419ee84d86 Merge "sstable: validate first and last keys ordering" from Benny
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.

It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
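
A sketch of such a check, in Python for brevity. The exception class is
a stand-in for malformed_sstable_exception, and plain comparable values
stand in for the real keys:

```python
class MalformedSSTableError(Exception):
    """Hypothetical stand-in for malformed_sstable_exception."""


def validate_first_and_last_keys(first, last):
    """Reject a (first, last) pair that is not well ordered."""
    if first > last:
        raise MalformedSSTableError(
            f"first key {first!r} is greater than last key {last!r}")
    return first, last
```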

set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().

This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.

Refs #8772

Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)

* tag 'validate-first-and-last-keys-ordering-v1':
  sstable: validate first and last keys ordering
  test: lib: reusable_sst: save unexpected errors
  test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
  test: sstable_test: define primary key in schema for compressed sstable
2021-06-09 14:43:02 +02:00
Avi Kivity
a57d8eef49 Merge 'streaming: make_streaming_consumer: close reader on errors' from Benny Halevy
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.

Use a coroutine to simplify error handling and
close the reader if an exception is caught.

Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
there and return an exceptional future early, since
the reader will not be moved to sst->write_components, which assumes
ownership over it and closes it in all cases.

Fixes #8776

Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_while_table_is_dropped_test (dev, debug) w/ https://github.com/scylladb/scylla/pull/8635#issuecomment-856661138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8782

* github.com:scylladb/scylla:
  streaming: make_streaming_consumer: close reader on errors
  streaming: make_streaming_consumer: coroutinize returned function
2021-06-09 15:02:36 +03:00
Tomasz Grabiec
ce7a404f17 Merge "Cleanups/refactoring for Raft Group 0" from Kostja
* scylla-dev/raft-group-0-part-1-rebase:
  raft: (service) pass Raft service into storage_service
  raft: (service) add comments for boot steps
  raft: add ordering for raft::server_address based on id
  raft: (internal) simplify construction of tagged_id
  raft: (internal) tagged_id minor improvements
2021-06-09 10:48:05 +02:00
Avi Kivity
d2157dfea7 Merge 'locator: token_metadata: simplify tokens_iterator' from Michał Chojnowski
`ring_range()`/`tokens_iterator` are more complicated than they need to be. The `include_min` parameter is not used anywhere, and `tokens_iterator` is pimplified without a good reason. Simplify that.

Closes #8805

* github.com:scylladb/scylla:
  locator: token_metadata: depimplify tokens_iterator
  locator: token_metadata: remove _ring_pos from tokens_iterator_impl
  locator: token_metadata: remove tokens_end()
  locator: token_metadata: remove `include_min` from tokens_iterator_impl
  locator: token_metadata: remove the `include_min` parameter from `ring_range()`
2021-06-08 15:42:41 +03:00
Konstantin Osipov
267a8e99ad raft: (service) pass Raft service into storage_service
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.

Since Raft service depends on query processor, and query
processor depends on storage service, to break a dependency
loop split Raft initialization into two steps: starting
an under-constructed instance of "sharded" Raft service,
accepting an under-constructed instance of "sharded"
query_processor, and then passed into storage service start
function, and then the local state of Raft groups from system
tables once query processor starts.

Consistently abbreviate raft_services instance raft_svcs, as
is the convention at Scylla.

Update the tests.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
959bd21cdb raft: (service) add comments for boot steps 2021-06-08 14:52:32 +03:00
Konstantin Osipov
b81580f3c6 raft: add ordering for raft::server_address based on id 2021-06-08 14:52:32 +03:00
Konstantin Osipov
d42d5aee8c raft: (internal) simplify construction of tagged_id
Make it easy to construct tagged_id from UUID.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
c9a23e9b8a raft: (internal) tagged_id minor improvements
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.

Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
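
With a total order on ids, picking the smallest one is a one-liner. The
sketch below uses Python's uuid.UUID, whose ordering may differ from
the one defined for tagged_id; pick_leader is a hypothetical name:

```python
import uuid


def pick_leader(server_ids):
    """Select the smallest id among the discovered servers."""
    return min(server_ids)


# Generating a fresh random id, analogous in spirit to create_random_id().
new_id = uuid.uuid4()
```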
2021-06-08 14:52:32 +03:00
Benny Halevy
5a8531c4c8 repair: get_sharder_for_tables: throw no_such_column_family
Instead of std::runtime_error with a message that
resembles no_such_column_family, throw a
no_such_column_family given the keyspace and table uuid.

The latter can be explicitly caught and handled if needed.

Refs #8612

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210608113605.91292-1-bhalevy@scylladb.com>
2021-06-08 14:45:44 +03:00
Nadav Har'El
355dbf2140 test/cql-pytest: option for running the tests over SSL
This patch adds a "--ssl" option to test/cql-pytest's pytest, as well as
to the run script test/cql-pytest/run. When "test/cql-pytest/run --ssl"
is used, Scylla is started listening for encrypted connections on its
standard port (9042) - using a temporary unsigned certificate. Then, the
individual tests connect to this encrypted port using TLSv1.2 (Scylla
doesn't support earlier versions of SSL) instead of TCP.

This "--ssl" feature allows writing tests that stress various aspects of
the connection (e.g., oversized requests - see PR #8800), and then be
able to run those tests in both TCP and SSL modes.
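
A client-side TLS context matching this description could be built as
follows. This is a sketch using only Python's standard ssl module, with
a hypothetical function name; it assumes that certificate verification
has to be disabled because the temporary certificate is unsigned, and
the resulting context would then be handed to the driver (for example
through an ssl_context argument, if the driver version supports one):

```python
import ssl


def make_test_ssl_context():
    """Build a TLSv1.2-only client context that skips verification."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # check_hostname must be turned off before verify_mode can be
    # set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```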

Fixes #8811

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210607200329.1536234-1-nyh@scylladb.com>
2021-06-08 11:43:20 +02:00
Kamil Braun
3778a816c1 storage_proxy: abstract_read_executor: make certain methods private
The methods `make_mutation_data_request`, `make_data_request`
and `make_digest_request` were marked as protected, but weren't used by
deriving classes. The "API" for deriving classes is encapsulated through
plural versions of these functions, such as `make_mutation_data_requests`
(note the "s" at the end), which send a request to a set of replicas
(rather than a single replica) but also do other important things - like
gathering statistics - hence we don't want the deriving classes to use
them directly.

Marking these singular methods as private communicates the intent more
clearly.
2021-06-08 12:32:47 +03:00
Asias He
5c9816615f streaming: Enable off-strategy compaction for bootstrap and replace
The off-strategy compaction is now enabled for repair based node
operations. It is not bound to repair based node operations though. It
makes sense to enable it for streaming based node operations too.

Fixes #8820

Closes #8821
2021-06-08 12:13:20 +03:00
Pavel Emelyanov
4ad4208426 util: Drop int_or_strong_ordering concept
Nobody uses it now. All tri-comparing stuff is strong_ordering now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-08 11:40:55 +03:00
Pavel Emelyanov
3f34878708 tests: Switch total-order-check onto strong_ordering
This helper uses int_or_strong_ordering to facilitate vector
ordering checks. The rework is straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-08 11:40:55 +03:00
Pavel Emelyanov
133692477d to_string: Add formatter for strong_ordering
There's no default one (yet), but total_order_check.hh wants
to print and format these values.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-08 11:33:04 +03:00
Raphael S. Carvalho
f8b2a6c923 sstables: Optimize incremental selection when only primary set contains sstables
Compound set's incremental selector isn't needed when only one set
contains sstables, which is the common case because secondary set
will only contain data during maintenance operations.
From now on, if only primary set contains data, its selector will
be built directly without compound set's selector acting as an
interposer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210607193651.126937-1-raphaelsc@scylladb.com>
2021-06-08 10:25:19 +03:00
Benny Halevy
2e93996473 streaming: make_streaming_consumer: close reader on errors
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.

Use a coroutine to simplify error handling and
close the reader if an exception is caught.

Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
there and return an exceptional future early, since
the reader will not be moved to sst->write_components, which assumes
ownership over it and closes it in all cases.

Fixes #8776

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-08 08:50:46 +03:00
Benny Halevy
42028c324c streaming: make_streaming_consumer: coroutinize returned function
To simplify error handling in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-08 08:48:33 +03:00
Pavel Emelyanov
cd166fa942 tests: Return strong-ordering from tri-comparators
Some collection tests still use int-s for it. The conversion
is straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-07 21:41:08 +03:00
Avi Kivity
3e3003fcc1 Merge 'cql3: limit the concurrency of indexed statements' from Piotr Sarna
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.

The fixed 4096 value can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.
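
The arithmetic behind these figures can be checked directly. This is a
back-of-the-envelope sketch only: bytes_in_flight is a hypothetical
helper, and it assumes each concurrent task fetches exactly one row of
the given size.

```python
MAX_CONCURRENCY = 4096


def bytes_in_flight(concurrency, row_size):
    """Data fetched by one round of single-row reads, in bytes."""
    return concurrency * row_size


large = bytes_in_flight(MAX_CONCURRENCY, 2 * 1024)  # 2KiB rows
small = bytes_in_flight(MAX_CONCURRENCY, 200)       # 200B rows
```

Here large is 8388608 bytes (8 MiB) and small is 819200 bytes (about
0.8 MB), the same order of magnitude as the rounded 10MB and 1MB
figures quoted above.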

Fixes #8799

Tests: unit(release),
       manual: observing metrics of modified index_paging_test

Closes #8814

* github.com:scylladb/scylla:
  cql3: limit the transitional result size for indexed queries
  cql3: return indexed pages after 1MB worth of data
  cql3: limit the concurrency of indexed statements
2021-06-07 18:00:51 +03:00
Gleb Natapov
5d15ecb7e5 raft: do not block io_fiber just because of a slow follower
Currently if append_message cannot be sent to one of the followers the
entire io_fiber will block, which eventually stops the replication. The
patch changes message sending part of io_fiber to be non blocking. The
code adds a hash table that is used to keep track of append_request
sending status per destination. All the remaining futures are waited for
during abort.
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
2021-06-07 16:55:14 +02:00
Gleb Natapov
01b6a2eb38 raft: randomized_nemesis_test: tick virtual clock less aggressively
Currently each tick of the virtual clock immediately schedules the next one
at the end of the task queue, but this is too aggressive. If a tick
generates work that needs two tasks to be scheduled one after another,
such an implementation will make the task queue grow to infinity.
Considering that in debug mode even a ready future causes preemption,
and that task queue shuffling may cause two or more ticks to be executed
without any other work done in between, it is very easy to get into
such a situation.

The patch changes the virtual clock to tick only when a shard is idle.
Message-Id: <20210606140305.2930189-1-gleb@scylladb.com>
2021-06-07 16:54:56 +02:00
Piotr Sarna
df0d44486a cql3: limit the transitional result size for indexed queries
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by previous number of bytes
fetched. Once this number reaches 1MB, the concurrency will not be
increased in consecutive queries to avoid overload.
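
The resulting growth rule can be sketched as below. This is Python for
illustration, not the C++ from the patch; the constant and function
names are hypothetical, and 1MB is taken to mean 1024 * 1024 bytes:

```python
MAX_CONCURRENCY = 4096
MAX_RESULT_BYTES = 1024 * 1024


def next_concurrency(current, bytes_fetched_last_round):
    """Concurrency for the next round of base-table reads."""
    if bytes_fetched_last_round >= MAX_RESULT_BYTES:
        return current  # enough data already: stop growing
    return min(current * 2, MAX_CONCURRENCY)  # double, up to the cap
```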
2021-06-07 16:29:18 +02:00
Piotr Sarna
60e55b6c7f cql3: return indexed pages after 1MB worth of data
Currently there's no practical limit of the resulting page size
for an indexed query, because it simply translates a page worth
of base primary keys into base rows. In order to avoid sending
too large pages, the result is returned after hitting a 1MB limit.
2021-06-07 16:05:50 +02:00
Piotr Sarna
8eeac10ded cql3: limit the concurrency of indexed statements
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096.
The number can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.

Fixes #8799

Tests: unit(release),
       manual: observing metrics of modified index_paging_test
2021-06-07 15:56:15 +02:00
Benny Halevy
5f31beaf97 flat_mutation_reader: unify reader_consumer declarations
Put the reader_consumer declaration in flat_mutation_reader.hh
and include it instead of declaring the same `using reader_consumer`
declaration in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210607075020.31671-1-bhalevy@scylladb.com>
2021-06-07 16:11:18 +03:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
0048c404d2 Merge 'dht: token: make some cosmetic changes' from Michał Chojnowski
This is a set of a few cosmetic changes in dht/token. Mostly some comments and a simplification of `midpoint()`.

Closes #8803

* github.com:scylladb/scylla:
  dht: token: add a comment excusing the `const bytes&` constructor
  dht: token: simplify midpoint()
  dht: token: add a comment to normalize()
  dht: token: use {read,write}_unaligned instead of std::copy_n
  dht: token-sharding: fix a typo in a comment
2021-06-07 15:41:15 +03:00
Piotr Sarna
fa29b79c20 transport: close connections when too large requests arrive
Too large requests are currently handled by the CQL server by
skipping them and sending back an error response.
That's however wasteful and dangerous: bogus request sizes
will force Scylla to potentially skip gigabytes of data
- and skipping is done by simply reading from the socket,
so it may result in gigabytes of bandwidth wasted.
Even if the request size is not bogus, closing the connection
forces users to adjust their request sizes, which should be done
anyway.

Originally, there was a bug in handling too large requests which
only read their headers and then left the connection in a broken,
undefined state, trying to interpret the rest of the large request
as a next CQL header. It was later fixed to skip the request, but
closing the connection is a safer thing to do.

Fixes #8798

Closes #8800
2021-06-07 12:23:55 +03:00
Avi Kivity
e6c5a63581 Merge "Fix several issues on transport stop" from Pavel E
"
There's a bunch of issues with starting and stopping of cql_server with
the help of cql_controller.

fixes: #8796
tests: manual(start + stop,
              start + exception on cql_set_state()
	     )
       unit not run, they don't mess with transport controller
"

* 'br-transport-stop-fixes' of https://github.com/xemul/scylla:
  transport/controller: Stop server on state change failure too
  transport/controller: Rollback server start on state change failure too
  transport/controller: Do not leave _server uninitialized
  transport/controller: Rework try-catch into defers
2021-06-07 11:41:36 +03:00
Michał Chojnowski
3ea97e7a11 locator: token_metadata: depimplify tokens_iterator
This class has no meaningful dependencies, so pimpl is unreasonable here.
2021-06-07 10:41:23 +02:00
Michał Chojnowski
baaac5bb7c locator: token_metadata: remove _ring_pos from tokens_iterator_impl
_ring_pos is slightly confusing. I thought at first that it doesn't do anything
since operator== doesn't use it.
This cosmetic patch tries to improve the readability, and also removes
operator!= which is generated automatically in C++20.
2021-06-07 10:41:22 +02:00
Michał Chojnowski
30e5290cea locator: token_metadata: remove tokens_end()
It's an internal method of token_metadata_impl and doesn't have to exist.
2021-06-07 10:41:11 +02:00
Alejo Sanchez
bd168d57ff raft: fix vote reply handling in prevote
Do not register a reply to prevote as a real vote

Found and authored by @kostja.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210604122530.1975388-1-alejo.sanchez@scylladb.com>
2021-06-06 19:18:49 +03:00
Tomasz Grabiec
50d64646cd Merge "raft: replication test fixes and OOP refactor" from Alejo
Feature requests, fixes, and OOP refactor of replication_test.

Note: all known bugs and hangs are now fixed.

A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided

To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.

* alejo/raft-tests-replication-02-v3-30: (66 commits)
  raft: replication test: wait for log for both index and term
  raft: replication test: reset network at construction
  raft: replication test: use lambda visitor for updates
  raft: replication test: move structs into class
  raft: replication test: move data structures to cluster class
  raft: replication test: remove shared pointers
  raft: replication test: move get_states() to raft_cluster
  raft: replication test: test_server inside raft_cluster
  raft: replication test: rpc declarative tests
  raft: replication test: add wait_log
  raft: replication test: add stop and reset server
  raft: replication test: disconnect 2 support
  raft: replication test: explicit node_id naming
  raft: replication test: move definitions up
  raft: replication test: no append entries support
  raft: replication test: fix helper parameter
  raft: replication test: stop servers out of config
  raft: replication test: wait log when removing leader from configuration
  raft: replication test: only manipulate servers in configuration
  raft: replication test: only cancel rearm ticker for removed server
  ...
2021-06-06 19:18:49 +03:00
Piotr Sarna
cb17aa1e53 Merge 'test/alternator: rewrite run script to share code with cql-pytest's run script' from Nadav Har'El
In this small series, I rewrite test/alternator/run to Python using the utility
functions developed for test/cql-pytest. In the future, we should do the same to
test/redis/run and test/scylla-gdb/run.

The benefit of this rewrite is less code duplication (all run scripts start with
the same duplicate code to deal with temporary directories, to run Scylla IP
addresses, etc.), but most importantly - in the future fixes we do to cql-pytest
(e.g., parameters needed to start Scylla efficiently, how to shut down Scylla,
etc.) will appear automatically in alternator test without needing to remember
to change both.

Another benefit is that test/alternator/run will now be Python, not a shell
script. This should make it easier to integrate it into test.py (refs #6212) in
the future - if we want to.

Closes #8792

* github.com:scylladb/scylla:
  test/alternator: rewrite test/alternator/run script in Python
  test/cql-pytest: make test run code more general
2021-06-06 19:18:49 +03:00
Nadav Har'El
fe1fa9d72b docs: update Alternator's compatibility.md
In the last year, four new features were added to DynamoDB which we
don't yet support - Kinesis Streams, PartiQL, Contributor Insights and
Export to S3. Let's document them as missing Alternator features, and
point to the four newly-created issues about these features.

Refs #8786
Refs #8787
Refs #8788
Refs #8789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210603125825.1179171-1-nyh@scylladb.com>
2021-06-06 19:18:49 +03:00
Avi Kivity
872cd8f692 test: adjust copyright statement to use ScyllaDB rather than old name 2021-06-06 19:18:49 +03:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Avi Kivity
b3ce1c8b40 gdb: prepare for Seastar's "smp: allow having multiple instances of the smp class"
scylladb/seastar@e6463df8a0 ("smp: allow
having multiple instances of the smp class") changes the type of
seastar::smp::_qs from a unique_ptr to a regular pointer. Adjust for
that change, with a fallback to support older versions.

Closes #8784
2021-06-06 19:18:49 +03:00
Nadav Har'El
48ff641f67 Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.

This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.

In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.

Fixes #8577

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8578

* github.com:scylladb/scylla:
  commitlog: segment_manager::shutdown: abort on errors
  commitlog: allocate_segment_ex: make_checked_file
2021-06-06 19:18:49 +03:00
Avi Kivity
8a4abe9895 cql3: expression: don't copy expression in has_supporting_index()
std::bind() copies the bound parameters for safekeeping. Here this
includes expr, which can be quite heavyweight. Use std::ref() to
prevent copying. This is safe since the bound expression is executed
and discarded before has_supporting_index() returns.

Closes #8791
2021-06-06 19:18:49 +03:00
Nadav Har'El
cee0340c89 scripts/pull_github_pr.sh: do not hard-code project name
The current pull_github_pr.sh hard-codes the project name "scylladb/scylla".
Let's determine it automatically, from the git origin url.

This will allow using exactly the same script in other Scylla subprojects,
e.g., Seastar.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318142624.1794419-1-nyh@scylladb.com>
2021-06-06 19:18:49 +03:00
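The idea of deriving the project name from the origin URL can be sketched as follows. The real script is a shell script; this Python version is only an illustration, and the function name `project_from_origin` and the regular expression are assumptions, not taken from the patch.

```python
import re


def project_from_origin(url: str) -> str:
    """Extract "owner/repo" from a GitHub remote URL (SSH or HTTPS form)."""
    match = re.search(r"github\.com[:/]+([^/]+)/([^/]+?)(?:\.git)?/?$", url.strip())
    if match is None:
        raise ValueError(f"not a recognizable GitHub remote: {url!r}")
    return f"{match.group(1)}/{match.group(2)}"


print(project_from_origin("git@github.com:scylladb/scylla.git"))
print(project_from_origin("https://github.com/scylladb/seastar"))
```

In practice the URL would come from something like `git config --get remote.origin.url`, so the same helper works unchanged in any subproject checkout.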
Pavel Solodovnikov
2187a59089 treewide: move service::cas_request out from storage_proxy.hh
And remove all remaining inclusions of `storage_proxy.hh` in the
headers.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
142d3b5ad9 cdc: self-sufficient headers fixup
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Gleb Natapov
bb822c92ab raft: change raft::rpc api to return void for most sending functions
Most RAFT packets are sent very rarely during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send function does not serve any purpose. Change the raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes since
replication protocol is more efficient if there is no packet loss, so
it is better to pause a sender than dropping packets inside the rpc. Rpc
is still allowed to drop append_requests if overloaded.
2021-06-06 19:18:49 +03:00
Gleb Natapov
f5a54d6c05 raft: move ELECTION_TIMEOUT definition to a public header
Move ELECTION_TIMEOUT definition to be visible to outside modules.
2021-06-06 19:18:49 +03:00
Gleb Natapov
87844c0ce1 raft: remove unused clock type definition
RAFT uses logical clock now and this define is from older times.
2021-06-06 19:18:49 +03:00
Gleb Natapov
90ea71da54 raft: wait for io and applier fiber to stop before aborting snapshots and waiters
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers.
2021-06-06 19:18:49 +03:00
Yaron Kaikov
6a447db8a8 scylla_util.py: Fix Azure support for machine-image
In https://github.com/scylladb/scylla/pull/7807 we added support for
Azure instance in Scylla.

The following changes are required in order for machine-image to work:
1) fix the wrong metadata URL and update the metadata path values (this was
   introduced in
f627fcbb0c)
2) fix the naming of functions which are used by machine-image
3) add missing functions which are required by machine-image
4) cleanup unused functions

Closes #8596
2021-06-06 09:21:23 +03:00
Asias He
2a7b855255 repair: Init repair metrics during startup
The _node_ops_metrics is thread local, it is constructed when it
is first accessed.

If there are no node operations, the metrics will not be shown. To make the
metrics more consistent, init during startup.

Refs #8311

Closes #8780
2021-06-06 09:21:23 +03:00
Benny Halevy
3f9bad0f0a test: compound_test: use tests::random
For reproducibility.

Test: compound_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-2-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
40e032ff8b test: compound_test: use the seastar test framework
Prepare for using tests::random instead of std::rand
for reproducibility.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Asias He
9b902fad79 gossiper: Update timestamp for nodes in ack and ack2 msg handler
In commit 425e3b1182 (gossip: Introduce
direct failure detector), the call to notify_failure_detector inside ack
and ack2 msg handler was removed since there is no need to update the
old failure detector anymore. However, the timestamp for endpoint_state
is also updated inside notify_failure_detector. With the new failure
detector we still need the timestamp for endpoint_state. Otherwise, nodes
might be removed from gossip wrongly.

For example, as we saw in issue #8702:

INFO  2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2
has been silent for 5000ms, removing from gossip

To fix, update the timestamp as we did before in ack and ack2 msg
handler.

Fixes #8702

Closes #8777
2021-06-06 09:21:23 +03:00
Benny Halevy
f081e651b3 memtable_list: rename request_flush to just flush
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
4f20cd3bea memtable_list: rename seal_active_memtable_immediate to seal_active_memtable
Now that there's no more seal_active_memtable_delayed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
ba65b90b34 memtable_list: get rid of seal_active_memtable_delayed
This path is unused since e5be3352cf.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Calle Wilund
3b55ef36d1 cf_prop_defs: Fix extensions merge to handle removal
Fixes #8773

When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).

This causes certain enterprise extensions to not be able
to disable themselves.

Fixed by filtering existing extensions by property keywords.
Unit test added.

Closes #8774
2021-06-06 09:21:23 +03:00
Nadav Har'El
f22ed3ff5c test/alternator: reduce very high timeout in one tracing test
In test_tracing.py::test_slow_query_log, there was what looked like
an innocent 30-second timeout, but this was in fact an 8-minute
timeout - because it started with sleeping 1 second, then 2 seconds,
then 3, ... until 30 seconds. Such a high timeout is frustrating when
trying to debug failures in the test - which is only expected to take
2 seconds (and all of it because of an artificial timeout).

So fix the loop to stop iterating after 60 seconds (a compromise
between 30 seconds and 8 minutes...), sleeping a constant amount
between iterations.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>
2021-06-06 09:21:23 +03:00
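The "8 minute" figure follows from summing the growing sleeps, and the fix amounts to polling with a constant sleep against a fixed deadline. A minimal sketch, assuming a helper named `wait_for` and a 0.3-second interval (both hypothetical; the actual test code may differ):

```python
import time
from typing import Callable

# Old behaviour described in the message: sleep 1s, then 2s, ... up to 30s.
old_worst_case = sum(range(1, 31))  # total seconds slept before giving up


def wait_for(condition: Callable[[], bool], timeout: float = 60.0,
             interval: float = 0.3,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` with a constant sleep until it holds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


print(old_worst_case, old_worst_case / 60)  # seconds and minutes in the old scheme
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock instead of real waiting.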
Avi Kivity
100d6f4094 build: enable -Wunused-function
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.

Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.

Closes #8767
2021-06-06 09:21:23 +03:00
Benny Halevy
6ce826206a sstables: use vector empty method rather than size
Testing std::vector::empty() is slightly more efficient
than testing for `size() > 0`.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601115552.155148-2-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
0565ba31a1 compaction_info: is_stop_requested: use sstring::empty rather than size
`!empty()` is slightly more efficient than `size() > 0`.

While at it, mark the function noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601115552.155148-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
948a9da832 table: do_apply: verify that _async_gate is open
Applying changes to the memtable after table::stop
is prohibited. Verify that by making sure that
the _async_gate is still open in `do_apply`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601055042.41380-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
82a263f672 database: apply_in_memory: run_when_memory_available under table::run_async
Make sure to apply the mutation under the table's _async_gate.

Fixes #8790

Test: unit(dev), view_build_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8794
2021-06-06 09:21:23 +03:00
Calle Wilund
131da30856 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

Closes #8766
2021-06-06 09:21:23 +03:00
Pavel Emelyanov
0944d69475 repair, streaming: Generalize consumer lambdas
Both streaming and repair call the distributed sstables writing with
equal lambdas each being ~30 lines of code. The only difference between
them is repair might request offstrategy compaction for new sstable.

Generalization of these two pieces saves lines of code and speeds the
release/repair/row_level.o compilation by half a minute (out of twelve).

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210531133113.23003-1-xemul@scylladb.com>
2021-06-06 09:21:23 +03:00
Lubos Kosco
777771df34 scylla_util.py: Relax GCE setup NVMe device checks
We don't want to fail I/O setup if more than one NVMe device is
mounted as root, nor if there are no NVMe devices.

Fixes #8032

Closes #8444
2021-06-06 09:21:23 +03:00
Botond Dénes
b0056f88dc test.py: revamp coverage support
Instead of attempting to universally set the proper environment
necessary for tests to generate profiling data such that coverage.py can
process it, allow each Test subclass to set up the environment as needed
by the specific Test variant.
With this we now have support for all current test types, including cql,
cql-pytest and alternator tests.
2021-06-06 09:21:23 +03:00
Botond Dénes
438391b4cc scripts/coverage.py: check that --path is a directory
To detect a bad --path that would fail coverage generation early.
2021-06-06 09:21:23 +03:00
Botond Dénes
ca91fd0e34 scripts/coverage.py: update main()'s docstring with new --run modifiers
And fix a typo while there.
2021-06-06 09:21:23 +03:00
Botond Dénes
2ba3fc2e11 scripts/coverage.py: add --distinct-id parameter
Yet another modifier for `--run`, allowing running the same executable
multiple times and then generating a coverage report across all runs.
This will also be used by test.py for those test suites (cql test) which
run the same executable multiple times, with different inputs.
2021-06-06 09:21:23 +03:00
Botond Dénes
b1f46b3693 scripts/coverage.py: add --executable parameter
Another modifier for `--run`, allowing to override the test executable
path. This is useful when the real test is run through a run-script,
like in the case of cql-pytest.
2021-06-06 09:21:23 +03:00
Michał Chojnowski
81c1a7f7e9 locator: token_metadata: remove include_min from tokens_iterator_impl
`include_min` is always set to the default value. Get rid of it.
2021-06-05 17:40:35 +02:00
Michał Chojnowski
2a3bd2babe locator: token_metadata: remove the include_min parameter from ring_range()
`include_min` is always set to the default value. Remove it.
2021-06-05 17:40:35 +02:00
Michał Chojnowski
23b7178f0d dht: token: add a comment excusing the const bytes& constructor 2021-06-05 15:22:42 +02:00
Michał Chojnowski
31aad81dc9 dht: token: simplify midpoint()
midpoint doesn't have to be so complicated.
2021-06-05 15:22:35 +02:00
Pavel Emelyanov
76947c829e transport/controller: Stop server on state change failure too
If on stop the set_cql_state() throws the local sharded<cql_server>
will be left not stopped and will fail the respective assertion on
its destruction.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-04 16:53:21 +03:00
Pavel Emelyanov
f6ef148c76 transport/controller: Rollback server start on state change failure too
If set_cql_state() throws the cserver remains started. If this
happens on start before the controller stop defer action is
scheduled, the destruction of controller will fail on the assertion
that checks the _server must be stopped.

Effectively this is the fix of #8796

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-04 16:50:51 +03:00
Pavel Emelyanov
6995e41e64 transport/controller: Do not leave _server uninitialized
If an exception happens after sharded<cql_server>.start() the
controller's _server pointer is left pointing to stopped sharded
server. This makes it impossible to start the server again (via
API) since the check for if (_server) will always be true.

This is the continuation of the ae4d5a60 fix.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-04 16:48:26 +03:00
Pavel Emelyanov
12220b74e8 transport/controller: Rework try-catch into defers
This is to make further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-04 16:48:12 +03:00
Alejo Sanchez
3e91a8ca0d raft: replication test: wait for log for both index and term
Waiting on index alone does not guarantee correct leader log
propagation. This patch also checks the term of the leader's last
log entry.

This was exposed with occasional problems with packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-04 08:38:19 -04:00
Alejo Sanchez
545893145e raft: replication test: reset network at construction
Reset network in constructor, not in unrelated function.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-04 08:18:32 -04:00
Alejo Sanchez
294dcfb204 raft: replication test: use lambda visitor for updates
Process updates with a lambda visitor.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-04 08:18:31 -04:00
Nadav Har'El
0bb2e010f5 test/alternator: rewrite test/alternator/run script in Python
We already wrote the test/cql-pytest/run script in Python in a way
it can be reusable for the other test/*/run scripts.

So this patch replaces the test/alternator/run shell script with Python
code which does the same thing (safely runs Scylla with Alternator and
pytest on it in a temporary directory and IP address), but sharing most
of the code that cql-pytest uses.

The benefit of reusing the test/cql-pytest/run.py library goes beyond
shorter code - the main benefit will be that we can't forget to fix one
of the test/*/run scripts (e.g., add more command line options or fix a
bug) when fixing another one.

To make the test/cql-pytest/run.py library reusable for running
Alternator, I needed to generalize a few things in this patch (e.g.,
the way we check and wait for Scylla to boot with the different APIs we
intend to check). There is also one bug-fix on how interrupts are
handled (they are now better guaranteed to kill pytest) - and now fixing
this bug benefits all runners using run.py (cql-pytest/run,
cql-pytest/run-cassandra and alternator/run).

In the future, we can port the runners which are still duplicate shell
scripts - test/redis/run and test/scylla-gdb/run - to Python in a
similar manner to what we did here for test/alternator/run.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-03 11:23:00 +03:00
Nadav Har'El
ef45fccdae test/cql-pytest: make test run code more general
Change the cql-pytest-specific run_cql_pytest() function to a more
general function to run pytest in any directory. Will be useful for
reusing the same code for other test runners (e.g., Alternator), and
is also clearer.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-03 11:22:36 +03:00
Benny Halevy
8f054edec7 test: database_test: add snapshot_skip_flush_works
Test that taking a snapshot with the skip_flush option
does not flush the memtable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 20:39:29 +03:00
Benny Halevy
0c80d9d7a7 api: storage_service/snapshots: support skip-flush option
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
     [(-pp | --print-port)] [(-pw <password> | --password <password>)]
     [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
     [(-u <username> | --username <username>)] snapshot
     [(-cf <table> | --column-family <table> | --table <table>)]
     [(-kc <kclist> | --kc.list <kclist>)]
     [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch wires the skip_flush option support to the
REST API.

Fixes #8725

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:21 +03:00
Benny Halevy
9cf858b5fc snapshot: support skip_flush option
skip_flush is disabled by default.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:21 +03:00
Benny Halevy
52fd2b71b7 table: snapshot: add skip_flush option
skip_flush is false by default.

Also, log a debug message when starting the snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:21 +03:00
Benny Halevy
4169f56407 api: storage_service/snapshots: add sf (skip_flush) option
Note: I tried adding the option and calling it "skip_flush"
but I couldn't make it work with scylla-jmx, hence it's
called by the abbreviated name - "sf".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:19 +03:00
Benny Halevy
4274cf6351 sstable: validate first and last keys ordering
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.

It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.

set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().

While at it, change the exception type thrown when
the key in the summary is empty from runtime_error
to malformed_sstable_exception, since the function
is called from the read path and the corruption
may already be present in the sstable.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:28:48 +03:00
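The validation described above can be sketched as follows. This is not the sstable code: `MalformedSSTableError` stands in for malformed_sstable_exception, `checked_first_and_last` and `token_of` are hypothetical names, and ordering keys by (token, key) is an assumption about how decorated keys compare.

```python
from typing import Callable, Tuple

DecoratedKey = Tuple[int, bytes]  # (token, raw key); a simplification


class MalformedSSTableError(Exception):
    """Stand-in for malformed_sstable_exception (hypothetical Python name)."""


def checked_first_and_last(first_key: bytes, last_key: bytes,
                           token_of: Callable[[bytes], int]
                           ) -> Tuple[DecoratedKey, DecoratedKey]:
    """Decorate both keys and refuse to accept them if they are empty or mis-ordered."""
    if not first_key or not last_key:
        raise MalformedSSTableError("empty first or last key in summary")
    first = (token_of(first_key), first_key)
    last = (token_of(last_key), last_key)
    if first > last:
        raise MalformedSSTableError(f"first key {first!r} is ordered after last key {last!r}")
    return first, last


def toy_token(key: bytes) -> int:
    """Toy partitioner used only for this example; not murmur3."""
    return len(key)


print(checked_first_and_last(b"a", b"bb", toy_token))
```

Raising a dedicated exception type on both the write path and the load path lets callers treat a mis-ordered summary as corruption rather than as a generic runtime failure.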
Benny Halevy
7a4591119b test: lib: reusable_sst: save unexpected errors
reusable_sst tries opening an sstable using
all sstable format versions in descending order.

It is expected to see "file not found" if the
actual sstable version is not the latest one.

That said, we may hit other errors if the sstable
is malformed in any way, so do not override
this kind of error if "file not found" errors
are hit after it, and return the unexpected error
instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
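The error-handling rule above (prefer an unexpected error over later "file not found" errors) might look like the sketch below. The names `open_with_any_version` and `open_fn` are hypothetical, and the version strings in the usage example are placeholders; the real helper is C++ test code.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def open_with_any_version(versions: Sequence[str], open_fn: Callable[[str], T]) -> T:
    """Try each format version in the given (descending) order.

    "File not found" is expected for versions the sstable was not written in,
    so it must not mask a more informative error seen earlier.
    """
    unexpected: Optional[Exception] = None
    not_found: Optional[FileNotFoundError] = None
    for version in versions:
        try:
            return open_fn(version)
        except FileNotFoundError as exc:
            not_found = exc          # expected for non-matching versions
        except Exception as exc:
            if unexpected is None:
                unexpected = exc     # remember the first unexpected error
    if unexpected is not None:
        raise unexpected
    if not_found is not None:
        raise not_found
    raise ValueError("no sstable versions to try")


def fake_open(version: str) -> str:
    """Toy opener: one version is malformed, the others do not exist."""
    if version == "mc":
        raise RuntimeError("malformed sstable")
    raise FileNotFoundError(version)


try:
    open_with_any_version(["md", "mc", "la"], fake_open)
except RuntimeError as exc:
    print("reported:", exc)
```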
Benny Halevy
9452b99b40 test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.

Use token_generation_for_current_shard to
generate random, properly ordered keys.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
Benny Halevy
d5405dade7 test: sstable_test: define primary key in schema for compressed sstable
Otherwise, the primary_key will be considered as composite,
as its length does not equal 1.  That hampers token calculation
when decorating the first and last keys in the summary file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
Alejo Sanchez
a3fc974de9 raft: replication test: move structs into class
Move auxiliary classes connection and hash_connection out of
raft_cluster and into connected class.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
5b688d42d7 raft: replication test: move data structures to cluster class
Move state_machine, persistence, connection, hash_connection, connected,
failure_detector, and rpc inside raft_cluster.

This commit moves declaration of class raft_cluster up.
(Minimize changed lines)
Moves apply_fn definition from state_machine to raft_cluster.
Fixes namespace in declarations
Keeps static rpc::net outside for now to keep this commit simple.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
1250d910ee raft: replication test: remove shared pointers
Following gleb, tomek, and kamil's suggestion, remove unnecessary use of
lw_shared_ptr.

This also solves the problem of constructing a lw_shared_ptr from a
forward declaration (connected) in a subsequent patch.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
aa1200ee50 raft: replication test: move get_states() to raft_cluster
Move get_states() helper inside raft cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
740545cdc5 raft: replication test: test_server inside raft_cluster
Since there are no more external users of test_server, move it to
raft_cluster and remove member access operator.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
1ee4408869 raft: replication test: rpc declarative tests
Convert rpc replication tests to declarative form.

This will enable moving remaining parts inside raft_cluster.

For test stability, add support for checking rpc config of a node
eventually changes to the expected configuration.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
f11ae18158 raft: replication test: add wait_log
Allow test cases to specify waiting for log for one or more servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
fa84b15909 raft: replication test: add stop and reset server
Add stop and reset server support.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
19d28e7e0f raft: replication test: disconnect 2 support
Support custom disconnection of 2 servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
e2612e5327 raft: replication test: explicit node_id naming
Use node_id{x} for more expressive naming in tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
bdfdd2da0b raft: replication test: move definitions up
Move definitions up for next patch.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
14bd29f974 raft: replication test: no append entries support
Handle test cases not appending entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
a73db881cb raft: replication test: fix helper parameter
Use vector instead of initializer_list for function helper parameter.
This is not a constructor and it complicates usage.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
8468059d0e raft: replication test: stop servers out of config
As requested by @gleb-cloudious, stop servers taken out of
configuration.

Adjust other parts of code relying on all servers being active.

Remove temporary stop on rpc server.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
51343d4de7 raft: replication test: wait log when removing leader from configuration
If leader is removed from configuration wait log first.

Remove wait_log_all for every case as it was too broad a fix.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
e032d8446f raft: replication test: only manipulate servers in configuration
Only start/stop, init/start/rearm tickers, wait log, elapse_election,
run free election, check for leader, and verify servers in current
configuration.

This is necessary for having servers out of configuration not
present/stopped.

Temporarily stop a server in rpc test until we truly stop servers out of
configuration in next commit.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
ec078ca55f raft: replication test: only cancel rearm ticker for removed server
When changing configuration, don't pause and restart all tickers.
Only do it for the specific server(s) being removed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
802f68317e raft: replication test: only pause restart tickers in config
Only pause and restart tickers for servers in configuration.

Currently when a server is taken out it's reset and a new one is set up,
but out of configuration. @gleb-cloudious requested to have fully
stopped servers when out of configuration, until they are re-added.
This change is needed to allow that or else restart would arm tickers on
servers no longer present.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
85f299e39b raft: replication test: simplify calls to helpers
Pass test update directly to helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
27f50b3589 raft: replication test: persisted snapshots in raft_cluster
Move persisted snapshots inside raft_cluster, de-cluttering code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
5b601c133b raft: replication test: verify in raft_cluster
Do verifications in raft_cluster::verify().

This will enable having persisted snapshots inside the class and
de-clutter caller code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:03 -04:00
Alejo Sanchez
ce6746b888 raft: replication test: connected inside raft_cluster
Keep connected inside raft_cluster.

Helpers are already provided to handle connectivity.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
b41ce7084b raft: replication test: snapshots inside raft_cluster
Keep snapshots inside raft_cluster, removing this need outside.

If this is needed later, a const getter can be implemented.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
4c2f8d84c5 raft: replication test: remove obsolete param
Since create_server() is in raft_cluster, there's no need for
change_configuration() to pass total values anymore. Remove it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
e9df914692 raft: replication test: elect_new_leader wait log and pause
Do wait_log() for the next leader always in elect_new_leader.

Only wait log for new leader if it's connected to the old leader.

Pause and restart tickers when creating a candidate to prevent another
node from becoming a dueling candidate.

Remove pause and restart tickers around calls to elect_new_leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
52188016af raft: replication test: create_server in raft_cluster
Remove the global create_raft_server() and replace with a
create_server() helper in replication_test().

This will allow not requiring the user of raft_cluster to create special
objects.

Note this does not move(apply) anymore as it's kept in raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:47:02 -04:00
Alejo Sanchez
1edcb6e647 raft: replication test: reset snapshots
When stopping a server also delete snapshots and persisted snapshots.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 23:46:11 -04:00
Alejo Sanchez
453f19cf0e raft: replication test: reset server helper
Add a helper to reset a server in raft_cluster.

Besides simplifying code and preventing errors, this will help move
create_raft_server logic to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
d3b7f21b88 raft: replication test: pause tickers before stopping
Pause tickers before stopping servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
30c9daafd2 raft: replication test: tick helper
Move test tick handling to raft_cluster as helper method.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
2e61c507d2 raft: replication test: tickers on raft_cluster
Move tickers to raft_cluster helper class. Ticker initialization and
pause is done automatically at start_all() and stop_all().

Add temporary helpers to manage specific tickers. These might be removed
later once proper node abort and reset are implemented.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
aea77871c4 raft: replication test: cluster tracking leader
Track current leader inside helper class.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
ca8e55613e raft: replication test: elect first leader in raft_cluster
Run first leader election inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
322802308c raft: replication test: use id 0 for rpc tests
raft_cluster at the moment only allows sequential 0 based ids.

The code was generating ids over this and causing problems for code
changes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
c1a6e81002 raft: replication test: fix partition wait log
When partitioning, don't wait_log on servers outside configuration.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:20 -04:00
Alejo Sanchez
6db730c500 raft: replication test: partition helper
Add a partition handling helper to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
848c244932 raft: replication test: track in_configuration in raft_cluster
Keep track of servers in configuration inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
16728b8966 raft: replication test: use cluster saved apply function
Use apply function saved in cluster at creation time.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
3daed889b8 raft: replication test: change_configuration in raft_cluster
Move change_configuration to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
102b8e71bb raft: replication test: free_election in raft_cluster
Move free_election to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
60d4d06861 raft: replication test: wait_log_all in raft_cluster
Move wait_log_all to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
d1ba0fe719 raft: replication test: wait_log in raft_cluster
Move wait_log to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
3e4871b884 raft: replication test: elect_new_leader in raft_cluster
Move elect_new_leader to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
59b9642be5 raft: replication test: elapse_election in raft_cluster
Move elapse_election to raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
b3e2b54913 raft: replication test: move add_entry up
Style.

Move definition of add_entry and add_remaining_entries with the rest of
raft_cluster definitions.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
8cd2abe72b raft: replication test: remove spurious check
Going forward the leader is always in configuration and up to date.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2d51d1bbc5 raft: replication test: raft_cluster add_entries
Move add_entries() to raft_cluster and provide a helper to add remaining
entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2a1e7a15a6 raft: replication test: calculate first value helper
Helper to calculate what's the value number to be added after snapshot
and leader initial log.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
e2f425e210 raft: replication test: initial state helper
Move initial_state preparation to its own helper function.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
d2c0308a85 raft: replication test: move declarations up
Move declarations near the top of the file for following refactors to
raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
a3700a6d0a raft: replication test: move up set_config
Move set_config above raft_cluster for a subsequent commit.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
57da05c986 raft: replication test: use disconnect() helper
For rpc tests, use raft_cluster::disconnect() instead of the local
connected reference.

This removes connected object use outside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
54c919b726 raft: replication test: add connectivity helpers
Add connectivity helpers disconnect(server, except) and connect_all() so
users of raft_cluster don't need to keep a connectivity object
pointer.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
5e324f3438 raft: replication test: rpc with raft_cluster
Use raft_cluster for rpc tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
752d53a909 raft: replication test: use parallel start/stop
Start and stop servers in parallel.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
bcf5181697 raft: replication test: cluster class
Use raft_cluster class to handle servers.

First part of this change.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
5fc0a1251d raft: replication test: helper uuid to local id
Add a helper to convert from UUID to size_t id.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
7e93501d4c raft: replication test: use optional
Instead of tracking with a boolean use an optional for partition leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
ccb85bce02 raft: replication test: wait log on next leader only
When there's a defined next leader, only wait for log propagation for
this follower.

Splits wait_log() into waiting for one follower with wait_log() and
waiting for all followers with wait_log_all().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
2aa1646e35 raft: replication test: remove wait after adding entries
Remove log wait after adding entries. It was added to handle some debug
hangs but it is not good for testing.

There are already wait logs at proper code locations.
(e.g. elect_new_leader, partition)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
0216d0a7b0 raft: replication test: remove unused param
elect_new_leader doesn't need to know configuration anymore.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
effcb7c5f6 raft: tests: move conversion helpers to header
Move replication test helpers to header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Alejo Sanchez
7327cbd871 raft: replication test: use structs to avoid alias
Use structs for test commands.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-01 21:50:19 -04:00
Avi Kivity
e9e5663731 build, utils/bptree.hh: drop -Wno-gnu-designator warning
Drop the warning about old-style GNU designated initializers and
convert two violations in bptree.hh to the standard C++20 syntax.

Closes #8743
2021-05-31 18:51:49 +03:00
Nadav Har'El
ff81072f64 cql-pytest: port Cassandra's unit test validation/entities/secondary_index_test
In this patch, we port validation/entities/secondary_index_test.java,
resulting in 41 tests for various aspects of secondary indexes.
Some of the original Java tests required direct access to the Cassandra
internals not available through CQL, so those tests were omitted.

In porting these tests, I uncovered 9 previously-unknown bugs in Scylla:

Refs #8600: IndexInfo system table lists MV name instead of index name
Refs #8627: Cleanly reject updates with indexed values where value > 64k
Refs #8708: Secondary index is missing partitions with only a static row
Refs #8711: Finding or filtering with an empty string with a secondary
            index seems to be broken
Refs #8714: Improve error message on unsupported restriction on partition
            key
Refs #8717: Recent fix accidentally broke CREATE INDEX IF NOT EXISTS
Refs #8724: Wrong error message when attempting index of UDT column with
            a duration
Refs #8744: Index-creation error message wrongly refers to "map" - it can
            be any collection
Refs #8745: Secondary index CREATE INDEX syntax is missing the "values"
            option

These tests also provide additional reproducers for already known issues:

Refs #2203: Add support for SASI
Refs #2962: Collection column indexing
Refs #2963: Static column indexing
Refs #4244: Add support for mixing token, multi- and single-column
            restrictions

Due to these bugs, 15 out of the 41 tests here currently xfail. We actually
had more failing tests, but we fixed a few of the above issues before this
patch went in, so their tests are passing at the time of this submission.

All 41 tests pass when running against Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210531112354.970028-1-nyh@scylladb.com>
2021-05-31 18:31:13 +03:00
Piotr Sarna
389a0a52c9 treewide: revamp workload type for service levels
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
 - instead of using 'unspecified' for declaring that there's no workload
   type for a particular service level, NULL is used for that purpose;
   NULL is the standard way of representing lack of data
 - introducing a delete marker, which accompanies NULL and makes it
   possible to distinguish between wanting to forcibly reset a workload
   type to unspecified and not wanting to change the previous value
 - updating the tests accordingly
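
The NULL-plus-delete-marker rule above can be sketched as follows. This is a
minimal Python illustration, not the actual Scylla implementation; the
function name `apply_workload_type`, its parameters, and the example values
`"interactive"` and `"batch"` are assumptions made for this sketch only.

```python
from typing import Optional


def apply_workload_type(
    previous: Optional[str],
    requested: Optional[str],
    delete_marker: bool = False,
) -> Optional[str]:
    """Return the workload type after applying a requested change.

    Hypothetical helper. None plays the role of NULL (no workload type).
    """
    if requested is not None:
        # An explicit value always replaces the previous one.
        return requested
    if delete_marker:
        # NULL accompanied by the delete marker: forcibly reset to unspecified.
        return None
    # NULL without the marker: the caller did not want to change anything.
    return previous
```

With this shape, `apply_workload_type("interactive", None)` keeps the previous
value, while `apply_workload_type("interactive", None, delete_marker=True)`
resets it to NULL.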

These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.

Tests: unit(release)

Closes #8763
2021-05-31 18:18:33 +03:00
Piotr Dulikowski
b0c22f2e39 repair: trigger repair abort_source only from shard 0
When user requests repair to be forcefully aborted, the `_abort_all_as`
abort source could be modified from multiple shards in parallel by the
`tracker::abort_all_repairs()` function, which can lead to undefined
behavior and to a crash. This commit makes sure that `_abort_all_as` is
used only from shard 0 when repair is aborted.

Fixes #8693

Closes #8734
2021-05-31 15:57:31 +03:00
Michał Chojnowski
a2352ea332 dht: token: add a comment to normalize()
The purpose and name of normalize are not obvious and deserve an explanatory
comment.
2021-05-31 11:54:58 +02:00
Michał Chojnowski
3d9b8c9eff dht: token: use {read,write}_unaligned instead of std::copy_n
A cosmetic change.
2021-05-31 11:54:58 +02:00
Michał Chojnowski
3c88a9ccb6 dht: token-sharding: fix a typo in a comment 2021-05-31 11:54:45 +02:00
Avi Kivity
e96ff3d82d dist: add new docker building process
The new process has the following differences from the Dockerfile
based image:

 - Using buildah commands instead of a Dockerfile. This is more flexible
   since we don't need to pack everything into a "build context" and
   transfer it to the container; instead we interact with the container
   as we build it.
 - Using packages instead of a remote yum repository. This makes it
   easy to create an image in one step (no need to create a repository,
   promote, then download the packages back via yum). It means that
   the image cannot be upgraded via yum, but container images are
   usually just replaced with a new version.
 - Build output is an OCI archive (e.g. a tarball), not a docker image
   in a local repository. This means the build process can later be
   integrated into ninja, since the artifact is just a file. The file
   can be uploaded into a repository or made available locally with
   skopeo.
 - Any build mode is supported, not just release. This can be used
   for quick(er) testing with dev mode.

I plan to integrate it further into the build system, but currently
this is blocked on a buildah bug [1].

[1] https://github.com/containers/buildah/issues/3262

Closes #8730
2021-05-31 10:05:22 +03:00
Nadav Har'El
2440569984 secondary index: fix error message which erroneously referred to "map"
The value of a frozen collection may only be indexed (using a secondary
index) in full - it is not allowed to index only the keys for example -
"CREATE INDEX idx ON table (keys(v))" is not allowed.

The error message referred to a frozen<map>, but the problem can happen
on any frozen collection (e.g., a frozen set), not just a frozen map,
so it can be confusing to a user who used a frozen set and got an
error about a frozen map.

So this patch fixes the error message to refer to a "frozen collection".

Note that the Cassandra error message in this case is different - it
reads: "Frozen collections are immutable and must be fully indexed".

Fixes #8744.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529094056.825117-1-nyh@scylladb.com>
2021-05-30 23:23:20 +03:00
Botond Dénes
cd6bbd37a4 utils/utf8.c: move includes outside of namespaces
Including in the middle of a namespace is not a good practice.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210528142502.962947-1-bdenes@scylladb.com>
2021-05-30 23:23:20 +03:00
Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with an expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can easily be worse depending on the number of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Benny Halevy
1c0769d789 table: clear: make exception safe
It is currently possible that _memtables->add_memtable()
will throw after _memtables->clear(), leaving the memtables
list completely empty.  However, we do rely on always
having at least one allocated in the memtables list
as active_memtable() references a lw_shared_ptr<memtable>
at the back of the memtables vector, and it is expected
to always be allocated via add_memtable() upon construction
and after clear().

This change moves the implementation of this convention
to memtable_list::clear() and makes the latter exception safe
by first allocating the to-be-added empty memtable and
only then clearing the vector.
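
The allocate-first ordering can be sketched as follows. This is a simplified
Python model, not the C++ code touched by this commit; the `Memtable` and
`MemtableList` classes and the injectable `make_memtable` factory are
assumptions made for illustration.

```python
class Memtable:
    """Stand-in for a memtable; real allocation may fail (e.g. out of memory)."""


class MemtableList:
    """Hypothetical model of a memtable list that must never be empty."""

    def __init__(self, make_memtable=Memtable):
        self._make_memtable = make_memtable
        self._memtables = [make_memtable()]

    def active_memtable(self):
        # Relies on the invariant that the list always has at least one entry.
        return self._memtables[-1]

    def clear(self):
        # Allocate the replacement first: if this raises, the list is untouched.
        fresh = self._make_memtable()
        # Only after the allocation succeeded are the old memtables dropped.
        self._memtables = [fresh]
```

If the factory raises inside `clear()`, the old memtables remain in place and
`active_memtable()` stays valid, which is the property the commit describes.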

Refs #8749

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>
2021-05-30 13:22:52 +03:00
Avi Kivity
791412b046 test: user_defined_function_test: raise Lua timeout
user_defined_function_test fails sporadically in debug mode
due to Lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects a timeout becomes too slow.

Fixes #8746.

Closes #8747
2021-05-30 13:10:57 +03:00
Piotr Jastrzebski
76d7c761d1 schema: Stop using deprecated constructor
This is another boring patch.

One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.

This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.

Fixes #8507

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
2021-05-30 11:58:27 +03:00
Nadav Har'El
1507bbb35a cql-pytest: increase default server-side timeouts
Sometimes the cql-pytest tests run extremely slowly. This can be
a combination of running the debug build (which is naturally slow)
and a test machine which is overcommitted, or experiencing some
transient swap storm or some similar event. We don't want tests, which
we run on 100% reliable setups, to fail just because they run into
timeouts in Scylla when they run very slowly.

We already noticed this problem in the past, and increased the CQL client
timeout in conftest.py from the default of 10 seconds to 120 seconds -
the old default of 10 seconds was not enough for some long operations
(such as creating a table with multiple views) when the test ran very
slowly.

However, this only fixed the client-side timeout. We also have a bunch
of server-side timeouts, configured to all sorts of arbitrary (and
fairly small) numbers. For example, the server has a "write request
timeout" option, which defaults to just 2 seconds. We recently saw
this timeout exceeded in a slow run which tried to do a very large
write.

So this patch configures all the configurable server-side timeouts we
have to default to 300 seconds. This should be more than enough for even
the slowest runs (famous last words...). This default is not a good idea
on real multi-node clusters which are expected to deal with node loss,
but this is not the case in cql-pytest.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529213648.856503-1-nyh@scylladb.com>
2021-05-30 01:20:14 +03:00
Avi Kivity
d23bebf5c2 Merge "Unexport storage service dependencies" from Pavel E
"
Right now storage service is used as a "provider" of other
services -- database, feature service and tokens. This set
unexports the first pair. This drops a bunch of calls for
global storage service instances from the places that don't
really need it.

tests: unit(dev), start-stop
"

* 'br-pupate-storage-service' of https://github.com/xemul/scylla:
  storage-service: Don't export features
  api: Get features from proxy
  storage-service: Don't export database
  storage-service: Turn some global helpers into methods
  storage-service: Open-code simple config getters
  view: Get database from stprage_proxy
  main: Use local database instance
  api: Use database from http_ctx
2021-05-29 20:52:47 +03:00
Pavel Emelyanov
598bbfab15 storage-service: Don't export features
Now storage service uses the feature service instance internally
and doesn't need to provide a getter for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:16:12 +03:00
Pavel Emelyanov
651568318d api: Get features from proxy
The reset_local_schema call needs proxy and feature service to do its
job. Right now the features are retrieved from global storage service,
but they are present on the proxy as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:15:15 +03:00
Pavel Emelyanov
b990b764ca storage-service: Don't export database
Now storage service uses the database instance internally and
doesn't need to provide a getter for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:13:27 +03:00
Pavel Emelyanov
0651038f29 storage-service: Turn some global helpers into methods
There are two static helpers used by storage service that grab
global storage service. To simplify these two turn both into
storage service methods and use 'this' inside.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:12:25 +03:00
Pavel Emelyanov
5ae8accfed storage-service: Open-code simple config getters
There are two db::config getters in storage_service.cc that
are used only once. Both call for global storage service, but
since they are called from storage service it's simpler to break
this loop and make storage service get needed config options
directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:11:24 +03:00
Pavel Emelyanov
1ce0682821 view: Get database from stprage_proxy
The db::view code already uses proxy rather actively, so instead of
depending on the storage service being at hand it's better to make
db::view require the proxy. For now -- via global instance.

There's one dependency on storage service left after this patch --
to get the tokens. This piece is to be fixed later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:09:32 +03:00
Pavel Emelyanov
6d53ddaa5f main: Use local database instance
All start-stop code in main has the sharded<database> at hand, so there's
no need to get it from global storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:08:57 +03:00
Pavel Emelyanov
e476247763 api: Use database from http_ctx
Instead of getting database from global storage service it's simpler
and better to grab it from the http context at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:08:25 +03:00
Asias He
e86d39faf0 storage_service: Update peer table only if the peer is part of the ring
Consider the following procedure:

- n1, n2, n3

- n3 is network partitioned from the cluster

- n4 replaces n3

- n3 has the network partition fixed

- n1 learns n3 as NORMAL status and calls
  storage_service::handle_state_normal which in turn calls
  update_peer_info, all columns except tokens column in system.peers are
  written

- n1 restarts before it can figure out that n4 is the new owner and delete
  the entry for n3 in system.peers

- n3 is removed from gossip by all the nodes in the cluster
  automatically because they detect the collision and remove n3

- n1 restarts, leaving the entry in system.peers for n3 forever

To fix, we can update peer tables only if the node is part of the ring.
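
The guard can be sketched as follows. This is a minimal Python illustration
under stated assumptions, not the actual storage_service code:
`maybe_update_peer_info`, the `ring_members` set and the `peers` dict are
hypothetical stand-ins for the token ring and the system.peers table.

```python
def maybe_update_peer_info(ring_members: set, peers: dict, endpoint: str, info: dict) -> bool:
    """Record `info` for `endpoint` only if the endpoint is part of the ring.

    Returns True if the peers table was updated, False if the update was skipped.
    """
    if endpoint not in ring_members:
        # A replaced node (n3 in the scenario above) owns no tokens,
        # so no row is written for it and nothing can be left behind.
        return False
    peers.setdefault(endpoint, {}).update(info)
    return True
```

In the scenario above, once n4 has replaced n3 the ring is {n1, n2, n4}, so a
NORMAL status update from n3 no longer creates a system.peers row.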

Fixes #8729

Closes #8742
2021-05-28 15:03:26 +02:00
Avi Kivity
b6c49fd320 Update seastar submodule
  > Merge "memory: optimize thread-local initialization" from Avi
  > Merge "Move priority classes manipulations from io-queue" from Pavel E
  > gate: add default move assignment operator
2021-05-28 11:47:54 +03:00
Pavel Emelyanov
526d31734c scylla-gdb: scylla_io_queues: Support new registered classes layout
Starting from seastar commit 5dae0cf3c48159990f51e5d38495af5ae224c2f8
all the registered classes info was moved into io_priority_class::_infos
array.

tests: scylla-gdb(release, old and new seastars)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210528083941.27990-1-xemul@scylladb.com>
2021-05-28 11:47:38 +03:00
Avi Kivity
0acf5bfca6 build: enable -Wreturn-std-move
Clang warns when "return std::move(x)" is needed to elide a copy,
but the call to std::move() is missing. We disabled the warning during
the migration to clang. This patch re-enables the warning and fixes
the places it points out, usually by adding std::move() and in one
place by converting the returned variable from a reference to a local,
so normal copy elision can take place.

Closes #8739
2021-05-27 21:16:26 +03:00
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Wojciech Mitros
725c6aac81 test/perf: close test_env to pass an assert in sstables_manager destructor
When destroying a perf_sstable_test_env, an assert in sstables_manager
destructor fails, because it hasn't been closed.
Fix by removing all references to sstables from perf_sstable_test_env,
and then closing the test_env (as well as the sstables_manager).

Fixes #8736

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8737
2021-05-27 17:41:17 +03:00
Michał Chojnowski
5e9f741bb4 repair: remove range_split.hh
Dead code since 80ebedd242.

Closes #8698
2021-05-27 17:21:37 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, a restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
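
The ordering described in the last point can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ implementation: `broadcast_generation_v2`, `insert_rows` and `gossip` are hypothetical names, and reusing the `2 * ring_delay` formula for the new method is an assumption carried over from the description of the old method.

```python
import uuid


def broadcast_generation_v2(data_rows, now, ring_delay, insert_rows, gossip):
    """Hypothetical sketch of the new ordering: insert, then timestamp, then gossip."""
    gen_uuid = uuid.uuid4()            # randomly generated partition key
    insert_rows(gen_uuid, data_rows)   # may be slow; no timestamp is chosen yet
    # Assumption: the timestamp formula is the same as in the old method.
    timestamp = now() + 2 * ring_delay
    gossip((timestamp, gen_uuid))      # the pair is the "generation identifier"
    return timestamp, gen_uuid
```

Because `now()` is read only after the insert returns, a slow insert no longer eats into the time nodes have to learn about the generation.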

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00
Avi Kivity
e8e4456ec7 Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.

The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish.

It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.

Closes #8680

* github.com:scylladb/scylla:
  test: add a case for conflicting workload types
  cql-pytest: add basic tests for service level workload types
  docs: describe workload types for service levels
  sys_dist_ks: fix redundant parsing in get_service_level
  sys_dist_ks: make get_service_level exception-safe
  transport: start shedding requests during potential overload
  client_state: hook workload type from service levels
  cql3: add listing service level workload type
  cql3: add persisting service level workload type
  qos: add workload_type service level parameter
2021-05-27 17:01:56 +03:00
Avi Kivity
f3e8e625c0 Update tools/java submodule (toppartitions single jmx call)
* tools/java fd92603b99...599b2368d6 (1):
  > toppartitions: Fix toppartitions to only jmx once

Ref #8459.
2021-05-27 16:57:57 +03:00
Konstantin Osipov
52f7ff4ee4 raft: (testing) update copyright
Incorrect copyright information was copy-pasted
from another test file.

Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>
2021-05-27 15:47:49 +03:00
Nadav Har'El
92b7a84e90 secondary index: in error message, call UDT as UDT
It is forbidden to create a secondary index of a column which includes in
any way the "duration" type. This includes a UDT which includes a duration.
Our code attempted to print in this case the message "Secondary indexes
are not supported on UDTs containing durations" - but because we tested
for tuples first, and UDTs are also tuples - we got the message about
tuples.

By changing the order of the tests, we get the most specific (and
useful) error message.
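
The effect of the test order can be illustrated with a minimal sketch. The classes and the function below are hypothetical stand-ins, not Scylla's C++ types; only the UDT message is quoted from this commit, and the other two messages are assumed wording.

```python
class TupleType:
    """Hypothetical stand-in for a tuple type; holds its field types."""

    def __init__(self, fields):
        self.fields = fields

    def contains_duration(self):
        return any(
            f == "duration" or (isinstance(f, TupleType) and f.contains_duration())
            for f in self.fields
        )


class UserType(TupleType):
    """Hypothetical stand-in for a UDT; every UDT is also a tuple."""


def index_error(col_type):
    """Return the error message for an unindexable column type, or None."""
    if col_type == "duration":
        return "Secondary indexes are not supported on duration columns"  # assumed wording
    if not isinstance(col_type, TupleType) or not col_type.contains_duration():
        return None
    # Test the more specific type first: isinstance(udt, TupleType) is also true.
    if isinstance(col_type, UserType):
        return "Secondary indexes are not supported on UDTs containing durations"
    return "Secondary indexes are not supported on tuples containing durations"  # assumed wording
```

Swapping the last two checks would reproduce the reported behaviour: a UDT would match the tuple test and get the less specific message.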

Fixes #8724.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526201042.642550-1-nyh@scylladb.com>
2021-05-27 15:46:30 +03:00
Piotr Sarna
99f356d764 test: add a case for conflicting workload types
The test case verifies that if several workload types are effective
for a single role, the conflict resolution is well defined.
2021-05-27 14:31:36 +02:00
Piotr Sarna
01b7e445f9 cql-pytest: add basic tests for service level workload types
The test cases check whether it's possible to declare workload
type for a service level and if its input is validated.
2021-05-27 14:31:36 +02:00
Piotr Sarna
54a5d4516c docs: describe workload types for service levels
A paragraph about workload types is added to docs/service_levels.md
2021-05-27 14:31:36 +02:00
Piotr Sarna
d45574ed28 sys_dist_ks: fix redundant parsing in get_service_level
The routine used for getting service level information already
operates on the service level name, but the same information is
also parsed once more from a row from an internal table.
This parsing is redundant, so it's hereby removed.
2021-05-27 14:31:26 +02:00
Piotr Sarna
7faba19605 sys_dist_ks: make get_service_level exception-safe
In order to avoid killing the node if a parsing error occurs,
the routine which fetches service level information is made
exception-safe.
2021-05-27 14:31:25 +02:00
Pavel Emelyanov
d2442a1bb3 tests: Ditch storage_service_for_tests
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.

Ref: #2795
tests: unit(dev), perf_sstable(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
2021-05-27 14:39:13 +03:00
Piotr Sarna
cb27ebe61d transport: start shedding requests during potential overload
This commit implements the following overload prevention heuristic:
if the admission queue becomes full, a timer is armed for 50ms.
If any of the ongoing requests finishes, the timer is disarmed,
but if that doesn't happen, the server goes into shedding mode,
which means that it reads new requests from the socket and immediately
drops them until one of the ongoing requests finishes.
This heuristic is not recommended for OLAP workloads,
so it is applied only if the session declared itself as
interactive (via service level's workload_type parameter).
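
A rough model of this state machine, using an explicit clock value instead of a real timer, might look like the sketch below. The class and method names are hypothetical and this is not the transport code itself; only the 50ms timeout and the interactive-only rule come from the description above.

```python
class SheddingState:
    """Tracks whether new requests should be dropped (hypothetical model)."""

    TIMER_SECONDS = 0.050  # the 50ms from the commit message

    def __init__(self, interactive):
        self.interactive = interactive
        self.deadline = None   # armed timer expiry, or None if disarmed
        self.shedding = False

    def on_queue_full(self, now):
        # Arm the timer only for interactive sessions, and only once.
        if self.interactive and self.deadline is None and not self.shedding:
            self.deadline = now + self.TIMER_SECONDS

    def on_request_finished(self):
        # Any finished request disarms the timer and ends shedding mode.
        self.deadline = None
        self.shedding = False

    def should_shed(self, now):
        # The timer fires if no request finished before the deadline.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            self.shedding = True
        return self.shedding
```

A caller would consult `should_shed(now)` for each newly read request and drop it when the result is true.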
2021-05-27 13:02:22 +02:00
Piotr Sarna
409c67b1b4 client_state: hook workload type from service levels
The client state is now aware of its workload type derived
from its attached service level.
2021-05-27 13:02:22 +02:00
Piotr Sarna
762e2f48f2 cql3: add listing service level workload type
The workload type information is now presented in the output
of LIST SERVICE LEVEL and LIST ALL SERVICE LEVELS statements.
2021-05-27 13:02:22 +02:00
Piotr Sarna
4816678eb6 cql3: add persisting service level workload type
The workload type information can now be set via CQL
and it's persisted in the distributed system table.
2021-05-27 13:02:22 +02:00
Piotr Sarna
578543603d qos: add workload_type service level parameter
The workload type is currently one of three values:
 - unspecified
 - interactive
 - batch

By defining the workload type, the service level makes it easier
for other components to decide what to do in overload scenarios.
E.g. if the workload is interactive, requests can be shed earlier,
while if it's batched (or unspecified), shedding does not take place.
Conversely, batch workloads could accept long full scan operations.
2021-05-27 13:02:22 +02:00
Dejan Mircevski
b54872fd95 auth: Remove const from role_manager methods
Some subclasses want to maintain state, which constness needlessly precludes.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8721
2021-05-27 11:27:38 +03:00
Nadav Har'El
97e827e3e1 secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we need to first
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
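
The intended order of checks can be sketched as follows. This is an illustrative model with hypothetical names, not the actual statement code; the `_index` suffix for the view name and the error texts are assumptions made for the example.

```python
def create_index(indexes, tables, name, if_not_exists):
    """Return True if the index was created, False if creation was skipped."""
    # 1. Handle an existing index first, so IF NOT EXISTS can succeed quietly.
    if name in indexes:
        if if_not_exists:
            return False
        raise ValueError(f"Index {name} already exists")
    # 2. Only now check that the backing view's name is free.
    view_name = f"{name}_index"  # assumed naming scheme, for illustration
    if view_name in tables:
        raise ValueError(f"Cannot create index {name}: {view_name} is already taken")
    indexes.add(name)
    tables.add(view_name)
    return True
```

With the checks in the opposite order, an existing index would always trip the name check, because its own view already occupies that name.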

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
2021-05-27 09:10:41 +02:00
Asias He
72cc596842 repair: Wire off-strategy compaction for regular repair
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.

This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.
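
A minimal sketch of such an idle-timeout trigger is shown below. The class name, the polling interface and the timeout value are hypothetical; the commit message does not state the actual timeout or how the timer is implemented.

```python
class OffStrategyTrigger:
    """Hypothetical model: trigger compaction after a period without repair."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.last_activity = {}  # table name -> time of last repair activity

    def on_repair_activity(self, table, now):
        # Every repair request for a table pushes its deadline further out.
        self.last_activity[table] = now

    def poll(self, now):
        """Return tables idle long enough to compact, and stop tracking them."""
        due = [t for t, seen in self.last_activity.items()
               if now - seen >= self.idle_timeout]
        for t in due:
            del self.last_activity[t]
        return due
```

Back-to-back repair requests keep resetting the deadline, so compaction starts only once the sequence of requests has stopped.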

Fixes #8677

Closes #8678
2021-05-26 11:41:27 +03:00
Konstantin Osipov
ac43941f17 rpc: don't include an unused header (raft_services.hh)
Message-Id: <20210525183919.1395607-7-kostja@scylladb.com>
2021-05-26 11:07:44 +03:00
Konstantin Osipov
7ca4ffc309 system_keyspace: coroutinize db::system_keyspace::setup()
Message-Id: <20210525183919.1395607-19-kostja@scylladb.com>
2021-05-26 11:06:21 +03:00
Avi Kivity
e2e723cc4c build: enable -Wrange-loop-construct warning
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.

Closes #8699
2021-05-26 10:32:56 +03:00
Avi Kivity
3896e35897 Merge 'storage_service: Respect --enable-repair-based-node-ops flag during removenode' from Asias He
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations are not enabled by default, we should
respect the flag and use streaming to sync data if the flag is false.

Fixes #8700

Closes #8701

* github.com:scylladb/scylla:
  storage_service: Add removenode_add_ranges helper
  storage_service: Respect --enable-repair-based-node-ops flag during removenode
2021-05-26 10:32:56 +03:00
Avi Kivity
e9c940dbbc Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider starting to flush once we've used up enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit.

Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything).

Closes #8695

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-05-25 18:34:29 +03:00
Kamil Braun
739c24b020 docs: update cdc.md with info about the new internal table 2021-05-25 16:07:23 +02:00
Kamil Braun
c948573398 sys_dist_ks: don't create old CDC generations table on service initialization
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.

Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
   the current generation uses the old format, the command will decide
   that repair is needed, even if the generation is completely fine
   otherwise (also in the new version).

During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
2021-05-25 16:07:23 +02:00
Kamil Braun
2835697ac1 sys_dist_ks: rename all_tables() to ensured_tables()
The static function `all_tables` in system_distributed_keyspace.cc was
used by the `system_distributed_keyspace` service initialization
function (`start()`) to ensure that a certain set of tables - which the
service provides accessors to - exist in the cluster. For each table in
the vector returned by `all_tables()` the function would try to create
the table, ignoring the "table already exists" error if it is thrown.

The commit renames `all_tables` to `ensured_tables` to better convey the
intention of this function and documents its purpose in a comment.

We do this because in the future the service may provide accessors to
tables which it *does not* actually create. The example - coming in a
later commit - is a table which was created in a previous version of
Scylla, and for which we still have to provide accessors for backward
compatibility / correct handling of the upgrade procedure, but which we
do not want to create in clusters that were freshly created using the
new version of Scylla, since in that case these tables would be just
unnecessary garbage. We mention this use case in the comment.
2021-05-25 16:07:23 +02:00
Kamil Braun
337a4ef8ad cdc: when creating new generations, use format v2 if possible
A node with this commit, when creating a new CDC generation (during
bootstrap, upgrade, or when running checkAndRepairCdcStreams command)
will check for the CDC_GENERATIONS_V2 feature and:
- If the feature is enabled create the generation in the v2 format
  and insert it into the new internal table. This is safe because
  a node joins the feature only if it understands the new format.
- Otherwise create it in the v1 format, limiting its size as before,
  and insert it into the old table.

The second case should only happen if we perform bootstrap or run
checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully
upgraded clusters the feature shall be enabled, causing all new
generations to use the new format.
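
The decision can be sketched as below. The function names are hypothetical; the rule that a cluster feature becomes enabled only after all nodes have joined it is taken from the commit that introduces CDC_GENERATIONS_V2.

```python
def feature_enabled(nodes_supporting, all_nodes):
    """A cluster feature is enabled only once every node has joined it."""
    return set(all_nodes) <= set(nodes_supporting)


def generation_format(nodes_supporting, all_nodes):
    """Pick the format for a newly created CDC generation (hypothetical helper)."""
    if feature_enabled(nodes_supporting, all_nodes):
        return "v2"  # new internal table, no size limit on the generation
    return "v1"      # old table, size limited as before
```

During a rolling upgrade `generation_format` keeps returning `"v1"` until the last node has been upgraded.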
2021-05-25 16:07:23 +02:00
Kamil Braun
4d3870b24b main: pass feature_service to cdc::generation_service 2021-05-25 16:07:23 +02:00
Kamil Braun
2ac9239f6a gms: introduce CDC_GENERATIONS_V2 feature
When a node joins this feature (which it does immediately when upgrading
to a version that has this commit), it says: "I understand the new
generation storage format and the new identifier format". Thus, when the
feature becomes enabled - after all nodes have joined it - it means that
it's safe to create new generations using these new storage/ID formats.
2021-05-25 16:07:23 +02:00
Kamil Braun
9c1a3180bb cdc: introduce retrieve_generation_data
Given a generation ID, this function retrieves its data from the internal
table in which the data resides. This depends on the version of the ID:
for _v1 we're using system_distributed.cdc_generation_descriptions, for
_v2 we're using the better
system_distributed_v2.cdc_generation_descriptions_v2 (see the previous
commit for detailed explanation of the superiority of the new table).
2021-05-25 16:07:23 +02:00
Kamil Braun
f25e77c202 test: cdc: include new generations table in permissions test 2021-05-25 16:07:23 +02:00
Kamil Braun
1c25b9df56 sys_dist_ks: increase timeout for create_cdc_desc
If we want to allow larger generations, we may want to give this
operation a bit more time.
2021-05-25 16:07:23 +02:00
Kamil Braun
3155cde9c8 sys_dist_ks: new table for exchanging CDC generations
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in the `system_distributed_everywhere` keyspace so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, a restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID. The timestamp and the UUID form the
   "generation identifier" of this new generation; this should explain
   why we introduced the generation_id_v2 type in previous commits.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

Note that the node is still using the old method - the actual switch
will be done in a later commit.
2021-05-25 16:07:23 +02:00
Calle Wilund
a96433c684 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate the case where segment usage does not match the actual
disk footprint and we fail to flush enough to allow segment recycling.
2021-05-25 12:43:12 +00:00
Calle Wilund
bf0a91b566 commitlog: Flush all segments if we only have one.
Handle test cases with borked config so we don't deadlock
in cases where we only have one segment in a commitlog
2021-05-25 12:43:12 +00:00
Calle Wilund
8ce836209b commitlog: Always force flush if segment allocation is waiting
Refs #8270

If segment allocation is blocked, we should bypass all thresholds
and issue a flush of as much as possible.
2021-05-25 12:43:12 +00:00
Calle Wilund
e34ed30178 commitlog: Include segment wasted (slack) size in footprint check
Refs #8270

Since segment allocation looks at actual disk footprint, not active,
the threshold check in timer task should include slack space so we
don't mistake sparse usage for space left.
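
In other words, the check compares the threshold against the footprint rather than against the active bytes alone. A minimal sketch, with hypothetical names and arbitrary numbers:

```python
def needs_flush(active_bytes, slack_bytes, threshold_bytes):
    """Compare the disk footprint (active + wasted slack) against the threshold."""
    footprint = active_bytes + slack_bytes
    return footprint >= threshold_bytes
```

For example, with a threshold of 64, 40 active bytes alone do not trigger a flush, but 40 active bytes plus 30 bytes of slack do.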
2021-05-25 12:43:12 +00:00
Calle Wilund
ec40207e7f commitlog: Adjust (lower) usage threshold
Refs #8270

Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "half through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle on heavy usage.
2021-05-25 12:43:12 +00:00
Benny Halevy
6144656b25 table: seal_active_memtable: update stats also on the error path
Currently the pending (memtables) flushes stats are adjusted back
only on success, therefore they will "leak" on error, so
use a .then_wrapped clause to always update the stats.

Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
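
The shape of the fix can be illustrated with a synchronous Python analogy, where `try`/`finally` plays the role of `.then_wrapped`. The function and parameter names are hypothetical, and the real code is asynchronous C++.

```python
def seal_active_memtable(stats, flush, discard_completed_segments):
    """Hypothetical model of the flush path and its stats accounting."""
    stats["pending_flushes"] += 1
    try:
        flush()
    finally:
        # Runs on success and on error, so the counter cannot "leak".
        stats["pending_flushes"] -= 1
    # Reached only if flush() did not raise.
    discard_completed_segments()
```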

Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
2021-05-25 12:51:54 +02:00
Benny Halevy
d46958d3ce phased_barrier: advance_and_await: abort on allocation failure
Currently, advance_and_await() allocates a new gate
which might fail.  Rather than returning this failure
as an exceptional future - which will require its callers
to handle that failure, keep the function as noexcept and
let an exception from make_lw_shared<gate>() terminate the program.

This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.

The alternative of making the allocation of a new gate optional
and recovering from it in start() is possible but was deemed not
worth it as it will add complexity and cost to start() that's
called on the common, hot, path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
2021-05-25 12:50:59 +02:00
Avi Kivity
e391e4a398 test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occasionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.

Ref #8613

Closes #8692
2021-05-25 11:20:18 +02:00
Benny Halevy
3ad0f156b9 memtable_list: request_flush: wait on pending flushes also when empty()
In https://github.com/scylladb/scylla/issues/8609,
table::stop() that is called from database::drop_column_family
is expected to wait on outstanding flushes by calling
_memtable->request_flush(), but the memtable_list is considered
empty() at this point as it has a single empty memtable,
so request_flush() returns a ready future, without waiting
on outstanding flushes. This change replaces the call to
request_flush with flush().

Fix that by either returning _flush_coalescing future
that resolves when the memtable is sealed, if available,
or go through the get_flush_permit and
_dirty_memory_manager->flush_one song and dance, even though
the memtable is empty(), as the latter waits on pending flushes.

Fixes #8609

Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>
2021-05-25 11:19:51 +02:00
Kamil Braun
d71513d814 abstract_replication_strategy: avoid reactor stalls in get_address_ranges and friends
The algorithm used in `get_address_ranges` and `get_range_addresses`
calls `calculate_natural_endpoints` in a loop; the loop iterates over
all tokens in the token ring. If the complexity of a particular
implementation of `calculate_natural_endpoints` is large - say `θ(n)`,
where `n` is the number of tokens - this results in a `θ(n^2)`
algorithm (or worse). This case happens for `Everywhere` replication strategy.

For small clusters this doesn't matter that much, but if `n` is, say, `20*255`,
this may result in huge reactor stalls, as observed in practice.

We avoid these stalls by inserting tactical yields. We hope that
some day someone actually implements a subquadratic algorithm here.
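
As an analogy, the same pattern in Python's asyncio looks like the sketch below. The function signature and the `yield_every` parameter are hypothetical; the actual change is in C++ on top of Seastar, where the yield mechanism is different.

```python
import asyncio


async def get_range_addresses(tokens, calculate_natural_endpoints, yield_every=1):
    """Map each token to its endpoints, yielding to the event loop as we go."""
    result = {}
    for i, token in enumerate(tokens, start=1):
        result[token] = calculate_natural_endpoints(token)  # may itself be O(n)
        if i % yield_every == 0:
            await asyncio.sleep(0)  # tactical yield: let other tasks run
    return result
```

The total work is unchanged and still quadratic in the worst case; the yields only bound how long other tasks have to wait.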

The commit also adds a comment on
`abstract_replication_strategy::calculate_natural_endpoints` explaining
that the interface does not give a complexity guarantee (at this point);
the different implementations have different complexities.

For example, `Everywhere` implementation always iterates over all tokens
in the token ring, so it has `θ(n)` worst and best case complexity.
On the other hand, `NetworkTopologyStrategy` implementation usually
finishes after visiting a small part of the token ring (specifically,
as soon as it finds a token for each node in the ring) and performs
a constant number of operations for each visited token on average,
but theoretically its worst case complexity is actually `O(n + k^2)`,
where `n` is the number of all tokens and `k` is the number of endpoints
(the `k^2` appears since for each endpoint we must perform finds and
inserts on `unordered_set` of size `O(k)`; `unordered_set` operations
have `O(1)` average complexity but `O(size of the set)` worst case
complexity).

Therefore it's not easy to put any complexity guarantee in the interface
at this point. Instead, we say that:
- some implementations may yield - if their complexities force us to do so
- but in general, there is no guarantee that the implementation will
  yield - e.g. the `Everywhere` implementation does not yield.

Fixes #8555.

Closes #8647
2021-05-25 11:53:28 +03:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00
Asias He
70147dcb5a storage_service: Add removenode_add_ranges helper
Share the code between restore_replica_count and
removenode_with_stream to reduce duplication.

Refs #8700
2021-05-25 10:44:31 +08:00
Asias He
a285bd28e2 storage_service: Respect --enable-repair-based-node-ops flag during removenode
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations are not enabled by default, we should
respect the flag and use streaming to sync data if the flag is false.

Fixes #8700
2021-05-25 10:42:58 +08:00
Kamil Braun
4658adbe18 tree-wide: introduce cdc::generation_id_v2
This is a new type of CDC generation identifier. Compared to old IDs,
it contains a UUID in addition to the timestamp.

These new identifiers will allow a safer and more efficient algorithm for
introducing new generations into a cluster (introduced in a later commit).

For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
2021-05-24 17:50:21 +02:00
Avi Kivity
948e2c0b36 utils: config_file: delete unneeded template instantation of operator<<()
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.

Also remove the matching "extern template" declaration, which has no
definition any more.

Closes #8696
2021-05-24 18:34:45 +03:00
Avi Kivity
60fb224171 Update seastar submodule
* seastar 28dddd2683...f0f28d07e1 (4):
  > httpd: allow handler to not read an empty content
Fixes #8691.
  > compat: source_location: implement if no std or experimental are available
  > compat: source_location: declare using in seastar::compat namespace
  > perftune.py: fix a bug in mlx4 IRQs names matching pattern
2021-05-24 17:44:08 +03:00
Piotr Sarna
95c6ec1528 Merge 'test/cql-pytest: clean up tests to run on Cassandra' from Nadav Har'El
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" so that they are skipped when running on Cassandra.

The following few small patches fix a few cases where our tests were failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).

Closes #8694

* github.com:scylladb/scylla:
  test/cql-pytest: fix python mistake in an xfailing test
  test/cql-pytest: mark some tests with scylla-only
  test/cql-pytest: clean up test_create_large_static_cells_and_rows
2021-05-24 16:42:01 +02:00
Avi Kivity
789757a692 Merge 'cql3: represent lists as chunked_vector instead of std::vector' from Michał Chojnowski
The cql3 layer manipulates lists as `std::vector`s (of `managed_bytes_opt`). Since lists can be arbitrarily large, let's use chunked vectors there to prevent potentially large contiguous allocations.

Closes #8668

* github.com:scylladb/scylla:
  cql3: change the internal type of tuples::in_value from std::vector to chunked_vector
  cql3: change the internal type of lists::value from std::vector to chunked_vector
  cql3: in multi_item_terminal, return the vector of items by value
2021-05-24 17:19:45 +03:00
Nadav Har'El
edc2c65552 Merge 'Fix service level negative timeouts' from Piotr Sarna
This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed.
More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts.
The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels.

Closes #8645

* github.com:scylladb/scylla:
  cql-pytest: introduce service level test suite
  cql-pytest: add enabling authentication by default
  qos: fix validating service level timeouts for negative values
2021-05-24 16:30:13 +03:00
Tomasz Grabiec
b1821c773f Merge "raft: basic RPC module testing" from Pavel Solodovnikov
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currently consists of the following
test cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.

Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.

This is mostly because RPC tests don't need all
the complexity that `replication_test` has, so
some helpers are copied in a reduced form.

It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.

* manmanson/raft-rpc-tests-v9-alt3:
  raft: add tests for RPC module
  test: add CHECK_EVENTUALLY_EQUAL utility macro
  raft: replication_test: reset test rpc network between test runs
  raft: replication_test: extract tickers initialization into a separate func
  raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
  raft: replication_test: introduce `test_server` aggregate struct
  raft: replication_test: support voter<->learner configuration changes
  raft: remove duplicate `create_command` function from `replication_test`
  raft: avoid 'using' statements in raft testing helpers header
2021-05-24 14:44:37 +02:00
Benny Halevy
56d3cb514a sstables: parse statistics: improve error handling
Properly return malformed_sstable_exception if the
statistics file fails to parse.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524113808.973951-1-bhalevy@scylladb.com>
2021-05-24 15:12:48 +03:00
Nadav Har'El
5da0ad2ebc Merge branch 'coverage-py-missing-features/v1' of https://github.com/denesb/scylla into next
This patchset adds the features that the patchset introducing
`coverage.py` noted as missing, namely:
* The ability to run a test through `coverage.py`, automating the entire
  process of setting up the environment, running the test and generating
  the report. This is possible with the new `--run` command line
  argument. It supports either generating a report immediately after
  running the provided test or just doing the running part, allowing the
  user to generate the report after having run all the tests they wanted
  to.
* A tweakable verbosity level.

It is also possible to specify a subset of the profiling data as input
for the report.
The documentation was also completed, with examples for all the
intended use cases.
With these changes, `coverage.py` is considered mature, the remaining
rough edges being located in other scripts (`tests.py` and
`configure.py`).
It is now possible to generate a coverage report for any test desired.

Also on: https://github.com/denesb/scylla.git
coverage-py-missing-features/v1

Botond Dénes (5):
  scripts/coverage.py: allow specifying the input files to generate the
    report from
  scripts/coverage.py: add capability of running a test directly
  scripts/coverage.py: add --verbose parameter
  scripts/coverage.py: document intended uses-cases
  HACKING.md: redirect to ./coverage.py for more details

 scripts/coverage.py | 143 +++++++++++++++++++++++++++++++++++++++-----
 HACKING.md          |  19 +-----
 2 files changed, 129 insertions(+), 33 deletions(-)
2021-05-24 14:54:28 +03:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.
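
The figure can be checked against the two elapsed times reported above. This is a small worked calculation; the helper name is made up for illustration.

```python
def wall_clock_seconds(stamp):
    """Parse an 'm:ss.ff' or 'h:mm:ss' elapsed time as printed by time -v."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


before = wall_clock_seconds("12:10.97")
after = wall_clock_seconds("11:32.36")
improvement_percent = (before - after) / before * 100
print(f"{before:.2f}s -> {after:.2f}s, {improvement_percent:.1f}% less wall clock time")
```

The two runs differ by roughly 38.6 seconds out of about 731, which is consistent with the "about 5%" stated above.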

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Yaron Kaikov
dd453ffe6a install.sh: Setup aio-max-nr upon installation
This is a follow up change to #8512.

Let's add the aio conf file during the Scylla installation process and make
sure we also remove this file when uninstalling Scylla.

As per Avi Kivity's suggestion, let's set the aio value as a static
configuration, and make it large enough to work with 500 cpus.

Closes #8650
2021-05-24 14:24:20 +03:00
Takuya ASADA
3d307919c3 scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
the device to become available, but the systemd manual says it automatically
configures the dependency for a mount unit when we specify the filesystem by
the "absolute path of a device node".

So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.

Fixes #8279

Closes #8681
2021-05-24 14:24:08 +03:00
Nadav Har'El
5206665b15 test/cql-pytest: fix python mistake in an xfailing test
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).
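
The mistake is easy to make because `{}` looks like an empty set literal. A minimal illustration of the difference, independent of the actual test code:

```python
empty_braces = {}    # this is an empty dict, not an empty set
empty_set = set()    # an actual empty set

print(type(empty_braces).__name__)
print(type(empty_set).__name__)

# An empty dict never compares equal to an empty set, so an assertion that
# expects an empty set fails if the expected value is written as {}.
print(empty_braces == empty_set)
```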

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:14:54 +03:00
Nadav Har'El
f26b31e950 test/cql-pytest: mark some tests with scylla-only
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 13:03:48 +03:00
Nadav Har'El
c8117584e3 test/cql-pytest: clean up test_create_large_static_cells_and_rows
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.

Another benefit is that now this test can be run (and pass) against
Cassandra.

To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-24 12:31:51 +03:00
Eliran Sinvani
f2091bb227 workload prioritization: Reduce the logging sensitivity to "glitches" in availability
Before this patch every failure to pull the configuration was
reported as a warning. However this is confusing for users for two
reasons:
1. It pollutes the logs if the configuration is polled, which is Scylla's
   mode of operation. Such a line is logged on every failed iteration.
2. It confuses users because even though the level is warning, it logs
   an exception and the log message contains the word "failed".

We see this a lot in QA runs and in customer questions from the field.

Point 2 is only solvable by reducing the verbosity of the logged
information, which will make debugging harder.
Point 1 is addressed here in two ways. First, the one-shot
configuration pull function no longer handles the exception itself.
This is OK because, as with any other query, it is harmless for a
configuration pull to fail once or twice in a row; the caller is now
responsible for handling the exception and logging the information.
Second, the polling loop captures the exceptions thrown by the
configuration pulling function and only reports an error, with the
latest exception, if polling has failed in consecutive iterations over
the last 90 seconds. This value was chosen because it is about the
empirical worst case time that it takes a node to notice that one of
the other nodes in the cluster is down (and hence stop querying it).

It is not important for the user or for us to be notified of temporary
glitches in availability (through this error at least), and since we are
eventually consistent it is OK that some nodes will catch up with the
configuration later than others.

We also set a threshold after which, if the configuration still couldn't be
retrieved, the logging level is bumped to ERROR.
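
The reporting policy can be sketched roughly as follows. This is an illustrative model only, not the patch itself: the class and method names are hypothetical, and the 300 second escalation threshold is an assumed value, since only the 90 second figure is given above.

```python
class PollFailureTracker:
    """Suppress log noise from short runs of failed configuration polls."""

    def __init__(self, report_after=90.0, error_after=300.0):
        self.report_after = report_after  # seconds; 90 is the value from the text
        self.error_after = error_after    # seconds; assumed value for illustration
        self.first_failure = None         # start time of the current failure streak

    def on_success(self):
        self.first_failure = None         # any success ends the failure streak

    def on_failure(self, now):
        """Return the level to log at, or None to stay silent."""
        if self.first_failure is None:
            self.first_failure = now
        failing_for = now - self.first_failure
        if failing_for >= self.error_after:
            return "ERROR"
        if failing_for >= self.report_after:
            return "WARNING"
        return None


tracker = PollFailureTracker()
levels = [tracker.on_failure(t) for t in (0, 30, 60, 90, 300)]
print(levels)
```

Keeping only the start time of the current streak is enough here, because a single successful poll resets the state and short glitches never reach the first threshold.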

Closes #8574
2021-05-24 10:51:47 +02:00
Piotr Sarna
17f4a55664 qos: remove unused with_user_service_level helper
This helper function is an artifact of forward-porting service levels,
and it wouldn't even compile when used because of mismatched
function declarations. It's not used anywhere in the open-source code,
so it's removed to avoid future merge conflicts.

Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>
2021-05-24 11:42:51 +03:00
Michał Chojnowski
4b60e69e7c keys, compound: take the argument to from_single_value() by reference
Since serialize_value needs to copy the values to a bigger buffer anyway,
there is no point in copying the argument higher in the call chain.
This patch eliminates some pointless copies, for example in
alternator/executor.cc

Closes #8688
2021-05-24 11:20:24 +03:00
Asias He
425e3b1182 gossip: Introduce direct failure detector
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
dies. Each message is delayed 15 seconds by the network for some
reason. Another node receives those delayed messages one after another.
Those delayed messages prevent the dead node from being marked as down,
because each heartbeat update is received just before the threshold to
mark a node down, which is around 20 seconds by default, is reached.

As a result, the dead node will not be marked as down for 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.
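
The arithmetic above can be replayed with a small simulation. It is a simplified model under stated assumptions: the sender dies at time 0, the delayed heartbeat updates arrive exactly 15 seconds apart, and the detector convicts a node after 20 seconds without an update. The function name is made up for illustration.

```python
def conviction_time(arrival_times, threshold):
    """Return when a heartbeat-based detector would mark the node down.

    arrival_times: sorted times at which heartbeat updates are received.
    threshold: seconds without an update before the node is marked down.
    """
    previous = 0.0
    for t in arrival_times:
        if t - previous > threshold:
            return previous + threshold  # the gap was long enough to convict
        previous = t
    return previous + threshold          # convicted only after the final update


# 20 delayed messages, each arriving 15 seconds after the previous one.
arrivals = [15.0 * (i + 1) for i in range(20)]
down_at = conviction_time(arrivals, threshold=20.0)
print(down_at)
```

Because every gap (15 s) is shorter than the threshold (20 s), the conviction clock keeps being reset until the last delayed message has arrived at 300 seconds.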

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node C as down because Node A can get the heartbeat
of Node C from Node B indirectly.

This indirect detection is not very useful because when Node A decides
to send requests to Node C, the requests from Node A to C will
fail even though Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been met.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
In a 16-node cluster where each node has 16 shards, each shard will handle
only one peer node.
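
One way to picture the split is a round-robin assignment of peers to shards. The sketch below is an assumption made for illustration (the partitioning scheme is not described here), and it follows the rounding used in the example above by treating the cluster size as the number of peers.

```python
def assign_peers_to_shards(peers, shard_count):
    """Spread peers over shards round-robin; returns {shard: [peers]}."""
    assignment = {shard: [] for shard in range(shard_count)}
    for index, peer in enumerate(sorted(peers)):
        assignment[index % shard_count].append(peer)
    return assignment


peers = [f"node{i:02d}" for i in range(32)]   # hypothetical peer names
per_shard = assign_peers_to_shards(peers, shard_count=16)
print(max(len(handled) for handled in per_shard.values()))
```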

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo messages before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.
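
The timeout rule can be sketched as a replay of periodic echo attempts. This is a simplified model, not the gossiper code: the function name is hypothetical, the 2 second interval comes from the text above, and the 20000 ms default is an assumed value for `failure_detector_timeout_in_ms`.

```python
def time_marked_down(echo_results, echo_interval_s=2.0, timeout_ms=20000):
    """Replay periodic echo attempts and return when the peer is marked down.

    echo_results: iterable of booleans, one per attempt (True = echo succeeded).
    Returns the time in seconds of the attempt at which the peer is marked
    down, or None if that never happens.
    """
    last_success = 0.0
    now = 0.0
    for ok in echo_results:
        now += echo_interval_s
        if ok:
            last_success = now
        elif (now - last_success) * 1000 >= timeout_ms:
            return now
    return None


# Five successful echoes (last one at t=10s), then the peer stops answering.
print(time_marked_down([True] * 5 + [False] * 15))
```

The outcome depends only on the time since the last successful echo, which is what makes the threshold easy to reason about.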

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes #8488

Closes #8036
2021-05-24 10:47:06 +03:00
Piotr Sarna
890ed201fd Merge 'Enable -Wunused-private-field warning' from Avi Kivity
The -Wunused-private-field warning was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.

It found a serious bug (#8682) and a few minor instances of waste.

Closes #8683

* github.com:scylladb/scylla:
  build: enable -Wunused-private-field warning
  test: drop unused fields
  table: drop unused field database_sstable_write_monitor::_compaction_manager
  streaming: drop unused fields
  sstables: mx reader: drop unused _column_value_length field
  sstables: index_consumer: drop unused max_quantity field
  compaction: resharding_compaction: drop unused _shard field
  compaction: compaction_read_monitor: drop unused _compaction_manager field
  raft: raft_services: drop unused _gossiper field
  repair: drop unused _nr_peer_nodes field
  redis: drop unused fields _storage_proxy and _requests_blocked_memory
  mutation_rebuilder: drop unused field _remaining_limit
  db: data_listeners: remove unused field _db
  cql3: insert_json_statement: note bug with unused _if_not_exists
  cql3: authorized_prepared_statement_cache: drop unused field _logger
  auth: service_level_resource_view: drop unused field _resource
2021-05-24 09:21:10 +02:00
Michał Chojnowski
03faf139c8 collection_mutation: don't linearize collection values
Yet another patch preventing potentially large allocations.
Currently, collection_mutation{_view,}_description linearize each collection
value during deserialization. It's not unthinkable that a user adds a
large element to a list or a map, so let's avoid that.

This patch removes the dependency on linearizing_input_stream, which does not
provide a way to read fragmented subbuffers, and replaces it with a new
helper, which does. (Extending linearizing_input_stream is not viable without
rewriting it completely).

Only linearization of collection values is corrected in this patch.
Collection keys are still linearized. Storing them in managed_bytes is likely
to be more harmful than helpful, because large map keys are extremely unlikely,
and UUIDs, which are used as keys in lists, do not fit into managed_bytes's
small value optimization, so this would incur an extra allocation for every
list element.

Note: this patch leaves utils/linearizing_input_stream.hh unused.

Refs: #8120

Closes #8690
2021-05-23 12:16:56 +03:00
Michał Chojnowski
65be64d0fe types: don't linearize values in abstract_type::hash
Yet another patch aiming to prevent potentially large allocations.
abstract_type::hash somehow evaded the anti-linearization patches until now.
Fix that.

Note that decimals and varints are still linearized, but we leave it be,
under the assumption that nobody inserts 128KiB-large varints into a database.

Refs: #8120

Closes #8689
2021-05-23 12:11:53 +03:00
Michał Chojnowski
ffdb706984 keys, compound: eliminate some careless copies of shared pointers
Using `auto` copies the shared pointers. We don't want that, so let's use
`const auto&`.

Closes #8686
2021-05-23 12:11:46 +03:00
Michał Chojnowski
ebe485953a types: fix a case of type punning via union
Type punning via unions is legal in C, but illegal (undefined behaviour) in C++.
Use the legal bit_cast instead.

Closes #8685
2021-05-23 10:12:56 +03:00
Michał Chojnowski
e4405692ae types: remove some dead code
Closes #8684
2021-05-23 09:57:30 +03:00
Michał Chojnowski
23909e91a4 alternator: executor: eliminate some pointless reserializations
There are places where abstract_type::deserialize is called just to pass the
result to compound_wrapper::from_singular, which immediately serializes it
again.

Get rid of this ritual by adding a version of from_singular which takes
a serialized argument.

As a bonus, along the way we eliminate some pointless copies of lw_shared_ptr
and std::shared_ptr caused by two careless uses of `auto`.

Closes #8687
2021-05-23 09:42:09 +03:00
Gleb Natapov
b4d6bdb16e raft: test: check that a leader does not send probes to a follower in the snapshot mode
Message-Id: <YKTNN7vNGkQwTDX7@scylladb.com>
2021-05-23 01:06:12 +02:00
Michał Chojnowski
d72b91053b logalloc: fix quadratic behaviour of reclaim_from_evictable
As an optimization for optimistic cases, reclaim_from_evictable first evicts
the requested amount of memory before attempting to reclaim segments through
compactions. However, due to an oversight, it does this before every compaction
instead of once before all compactions.

Usually reclaim_from_evictable is called with small targets, or is preemptible,
and in those cases this issue is not visible. However, when the target is bigger
than one segment and the reclaim is not preemptible, which is the case when it's
called from allocating_section, this results in a quadratic explosion of
evictions, which can evict several hundred MiB to reclaim a few MiB.

Fix that by calculating the target of memory eviction only once, instead of
recalculating it after every compaction.
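
The difference between the two behaviours can be modelled with a toy calculation. This is not the logalloc code: it assumes, purely for illustration, that memory is counted in segments and that each compaction reclaims exactly one segment.

```python
def evicted_before_fix(target_segments):
    """Evict the outstanding target before every compaction."""
    evicted = 0
    remaining = target_segments
    while remaining > 0:
        evicted += remaining   # the whole outstanding target is evicted again
        remaining -= 1         # one compaction reclaims one segment
    return evicted


def evicted_after_fix(target_segments):
    """Evict the target once, then run all the compactions."""
    evicted = target_segments
    remaining = target_segments
    while remaining > 0:
        remaining -= 1         # compactions no longer trigger more eviction
    return evicted


for target in (1, 10, 100):
    print(target, evicted_before_fix(target), evicted_after_fix(target))
```

In this model the old behaviour evicts `n * (n + 1) / 2` segments for a target of `n`, while the fixed one evicts `n`, which matches the "quadratic" description above.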

Fixes #8542.

Closes #8611
2021-05-22 20:49:00 +02:00
Avi Kivity
78e392c01d build: enable -Wunused-private-field warning
The -Wunused-private-field warning was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.

It found a serious bug (#8682) and a few minor instances of waste.
2021-05-21 21:05:16 +03:00
Avi Kivity
7e5a0b6fd0 test: drop unused fields
Drop unused fields in various tests and test libraries.
2021-05-21 21:04:49 +03:00
Avi Kivity
1d508106be table: drop unused field database_sstable_write_monitor::_compaction_manager 2021-05-21 21:04:20 +03:00
Avi Kivity
84be89eb3b streaming: drop unused fields 2021-05-21 21:03:23 +03:00
Avi Kivity
047b3f85d3 sstables: mx reader: drop unused _column_value_length field 2021-05-21 21:02:55 +03:00
Avi Kivity
32d9ba2fbb sstables: index_consumer: drop unused max_quantity field 2021-05-21 21:02:16 +03:00
Avi Kivity
cb587aaa5c compaction: resharding_compaction: drop unused _shard field 2021-05-21 21:01:54 +03:00
Avi Kivity
f62469b7c5 compaction: compaction_read_monitor: drop unused _compaction_manager field
A constructor that now takes one argument is made explicit.
2021-05-21 21:00:47 +03:00
Avi Kivity
b8137986e6 raft: raft_services: drop unused _gossiper field 2021-05-21 21:00:04 +03:00
Avi Kivity
0b8b9f0cbf repair: drop unused _nr_peer_nodes field 2021-05-21 20:59:23 +03:00
Avi Kivity
195c969304 redis: drop unused fields _storage_proxy and _requests_blocked_memory
Probably carried over by copy-paste. Also drop storage_proxy include.
2021-05-21 20:58:32 +03:00
Avi Kivity
a0257d95c2 mutation_rebuilder: drop unused field _remaining_limit
And its initializer.
2021-05-21 20:57:33 +03:00
Avi Kivity
924f93028a db: data_listeners: remove unused field _db
Remove the unused field and the constructor that populated it.
2021-05-21 20:56:42 +03:00
Avi Kivity
539948760f cql3: insert_json_statement: note bug with unused _if_not_exists
The parser accepts INSERT JSON ... IF NOT EXISTS but we later
ignore it. This is a bug (#8682). Note it down and suppress
the compiler error that will result when we enable
-Wunused-private-field.
2021-05-21 20:54:44 +03:00
Avi Kivity
adbe7ad919 cql3: authorized_prepared_statement_cache: drop unused field _logger 2021-05-21 20:54:01 +03:00
Avi Kivity
ed160df0f9 auth: service_level_resource_view: drop unused field _resource
It should probably be used in operator<<(std::ostream&), but whoever
implements that gets to re-add the field.
2021-05-21 20:53:01 +03:00
Botond Dénes
61c43b8983 HACKING.md: redirect to ./coverage.py for more details
scripts/coverage.py now has a detailed help, don't repeat its content in
HACKING.md, instead redirect users who want to learn more to the
script's help.
2021-05-21 11:50:39 +03:00
Botond Dénes
cd17932b96 scripts/coverage.py: document intended uses-cases 2021-05-21 11:50:39 +03:00
Botond Dénes
4647472fee scripts/coverage.py: add --verbose parameter 2021-05-21 11:50:39 +03:00
Botond Dénes
2d7fe702e1 scripts/coverage.py: add capability of running a test directly
Through `coverage.py`. This saves the user from all the env
setup required for `coverage.py` to successfully generate the report
afterwards. Instead all of this is taken care of automatically, by just
running:

    ./scripts/coverage.py --run ./build/coverage/.../mytest arg1 ...  argN

`coverage.py` takes care of running the test and generating a coverage
report from it.

As a side effect, also fix `main()` ignoring its `argv` parameter.
2021-05-21 11:50:39 +03:00
Botond Dénes
64c4557ba8 scripts/coverage.py: allow specifying the input files to generate the report from
Currently `coverage.py` includes all raw profiling data found at PATH
automatically. This patch gives an option to override this, instead
including only the given input files in the report.
2021-05-21 11:50:39 +03:00
Nadav Har'El
a2379b96b1 alternator test: test for large BatchGetItem
This patch adds an Alternator test, test_batch_get_item_large,
which checks a BatchGetItem with a moderately large (1.5 MB) response.
The test passes - we do not have a bug in BatchGetItem - but it
does reproduce issue #8522 - the long response is stored in memory as
one long contiguous string and causes a warning about an over-sized
allocation:

  WARN ... seastar_memory - oversized allocation: 2281472 bytes.

Incidentally, this test also reproduces a second contiguous
allocation problem - issue #8183 (in BatchWriteItem which we use
in this test to set up the item to read).

Refs #8522
Refs #8183

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210520161619.110941-1-nyh@scylladb.com>
2021-05-21 08:38:53 +02:00
Avi Kivity
4383674760 cql3: result_set: switch rows to chunked_vector
_rows uses a deque, but doesn't need any special functionality.

Switch to chunked_vector, which uses one less allocation in the
common case (std::deque has an extra allocation for managing its
chunks).

Closes #8679
2021-05-20 20:14:15 +03:00
Avi Kivity
eac6fb8d79 gdb: bypass unit test on non-x86
The gdb self-tests fail on aarch64 due to a failure to use thread-local
variables. I filed [1] so it can get fixed.

Meanwhile, disable the test so the build passes. It is sad, but the aarch64
build is not impacted by these failures.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=27886

Closes #8672
2021-05-20 20:14:15 +03:00
Asias He
2ec1f719de repair: Always use run_replace_ops
Currently, the new NODE_OPS_CMD for the replace operation is used only when
repair based node operations are enabled. However, we can use the
NODE_OPS_CMD to run the replace operation and use streaming instead of
repair to sync data as well.

After this patch, we will use streaming inside run_replace_ops if repair
based node ops is not enabled, so that we can take advantage of the benefits
that NODE_OPS_CMD brings in commit 323f72e48a
(repair: Switch to use NODE_OPS_CMD for replace operation).

Fixes #8013
2021-05-20 20:14:15 +03:00
Avi Kivity
bb51f7d928 Update seastar submodule
* seastar 847fccaf5e...28dddd2683 (13):
  > reactor: disable xfs extent size hints if using the kernel page cache
  > smp: replace _reactors global with a local
  > Merge "Add test for IO-scheduler (fails now)" from Pavel E
  > weak_ptr: lift restriction on copying
  > core: expose hidden method from parent class
  > perftune.py: __get_feature_file(): verify that parameters are not None
  > gate: assert no outstanding requests when destroyed
  > httpd: add status_types
  > cmake: use -O2 for CMAKE_CXX_FLAGS_DEV with clang
  > compat: source_location: use std::source_location only if available
  > iotune: disambiguate "this" lambda capture in C++20 mode
  > Merge "Consider disk saturation request lengths" from Pavel E
  > Merge 'seastar-addr2line: support oneline backtrace in resolve call' from Benny Halevy
2021-05-20 20:14:15 +03:00
Benny Halevy
5724233609 scylla-gdb: scylla_io_queues: support io_group._max_bytes_count
_maximum_request_size is renamed to _max_bytes_count
in 40a29d5590

This patch adds support for ioq io_group._max_bytes_count
if io_group._maximum_request_size isn't found.
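
The fallback amounts to trying the known member names in turn. The sketch below shows the idea with plain dictionaries standing in for the debugged objects; the helper name and the sample values are hypothetical, and the real script works on gdb values rather than dicts.

```python
def first_present(fields, names):
    """Return the value of the first name in names that exists in fields.

    fields: a mapping standing in for the members of a debugged object.
    names: candidate member names, tried in order.
    """
    for name in names:
        if name in fields:
            return fields[name]
    raise KeyError(f"none of {names} found")


# Stand-ins for io_group objects before and after the rename.
old_group = {"_maximum_request_size": 131072}
new_group = {"_max_bytes_count": 131072}
candidates = ["_maximum_request_size", "_max_bytes_count"]

print(first_present(old_group, candidates))
print(first_present(new_group, candidates))
```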

Test: scylla-gdb(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210520151537.710123-1-bhalevy@scylladb.com>
2021-05-20 20:14:15 +03:00
Avi Kivity
30034371e7 Merge "Remove most of global pointers from repair" from Pavel
"
There is a lot of global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, since all those services have become main-local
recently. Along the way a call to global storage proxy is dropped.

To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.

Overall the change is

- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
  constructions to give them the dependencies
- use needed services in _info and _meta methods

tests: unit(dev), dtest.repair(dev)
"

* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
  repair: Drop most of globals from repair
  repair: Use local references in messaging handler checks
  repair: Use local references in create_writer()
  repair: Construct repair_meta with local references
  repair: Keep more stuff on repair_info
  repair: Kill bunch of global usages from insert_repair_meta
  repair: Pass repair service down to meta insertion
  repair: Keep local migration manager on repair_info
  repair: Move unused db captures
  repair: Remove unused ms captures
  repair: Construct repair_info with service
  repair: Loop over repair sharded container
  repair: Make sync_data_using_repair a method
  repair: Use repair from storage service
  repair: Keep repair on storage service
  repair: Make do_repair_start a method
  repair: Pass repair_service through the API until do_repair_start
  repair: Fix indentation after previous patch
  repair: Split sync_data_using_repair
  repair: Turn repair_range a repair_info method
  ...
2021-05-20 10:57:48 +03:00
Pavel Solodovnikov
b51b11f226 transport: remove extraneous qos/service_level_controller includes from headers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:32:15 +03:00
Pavel Solodovnikov
238273d237 treewide: remove evidently unneeded storage_proxy includes from some places
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:19:32 +03:00
Pavel Solodovnikov
0663aa6ca1 service_level_controller: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:18:41 +03:00
Pavel Solodovnikov
d7a77a993f sstables/writer: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:03:24 +03:00
Pavel Solodovnikov
c3a7b55507 treewide: remove extraneous database.hh includes from headers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:59:14 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Pavel Solodovnikov
9352a08468 cql3: remove extraneous includes from some headers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:32:57 +03:00
Piotr Sarna
223a59c09c test: make rjson allocator test working in sanitize mode
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.

Closes #8539
2021-05-20 00:42:47 +03:00
Pavel Solodovnikov
ae213e1e25 cql3: various forward declaration cleanups
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 00:18:00 +03:00
Pavel Solodovnikov
94b5c6333f utils: add missing <limits> header in extremum_tracking.hh
This makes all headers in scylla self-sufficient as of
this moment.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 00:04:51 +03:00
Pavel Solodovnikov
a66de8658b raft: add tests for RPC module
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currently consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:14:04 +03:00
Pavel Solodovnikov
e030e291a8 test: add CHECK_EVENTUALLY_EQUAL utility macro
It would be good to have a `CHECK` variant in addition
to an existing `REQUIRE_EVENTUALLY_EQUAL` macro. Will be used
in raft RPC tests.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:12:55 +03:00
Pavel Solodovnikov
2067cc75c6 raft: replication_test: reset test rpc network between test runs
Currently, emulated rpc network is shared between all test cases
in `replication_test.cc` (see static `rpc::net` map).
Though, its value is not reset when executing a subsequent test
case, which opens a possibility for heap-use-after-free bugs.

Also, make all `send_*` functions in the test rpc class throw an
error if the node being contacted is not in the network, instead of
performing a past-the-end access. This makes it safe to contact a
non-existent node, which will be used in RPC tests later.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-19 23:06:29 +03:00
Avi Kivity
c71d007797 consistency_level: deinline assure_sufficient_live_nodes()
assure_sufficient_live_nodes() is a huge template calling other
huge templates, and requires "network_topology_strategy.hh". It is
inlined in consistency_level.hh. This increases compile time and recompiles.

Move the template out-of-line and use "extern template" to instantiate it.
This is not ideal as new callers would require updates to the
instantiated signatures, but I think our goal should be to de-template
it completely instead. Meanwhile, this reduces some pain.

Ref #1.

Closes #8637
2021-05-19 15:03:51 +03:00
Avi Kivity
d8121961fa Merge 'cql-pytest: add nodetool flush feature and use it in a test' from Nadav Har'El
The first patch adds a nodetool-like capability to the cql-pytest framework.
It is *not* meant to be used to test nodetool itself, but rather to give CQL
tests the ability to use nodetool operations - currently only one operation -
"nodetool flush".

We try to use Scylla's REST API, if possible, and only fall back to using an
external "nodetool" command when the REST API is not available - i.e., when
testing Cassandra. The benefit of using the REST API is that we don't need
to run the jmx server to test Scylla.

The second patch is an example of using the new nodetool flush feature
in a test that needs to flush data to reproduce a bug (which has already
been fixed).

Closes #8622

* github.com:scylladb/scylla:
  cql-pytest: reproducer for issue #8138
  cql-pytest: add nodetool flush feature
2021-05-19 14:40:18 +03:00
Nadav Har'El
fd8d15a1a6 cql-pytest: reproducer for issue #8138
We add a reproducing test for issue #8138, where if we write to a
TWCS table, scanning it would yield no rows - and worse - crash the
debug build.

This test requires "nodetool flush" to force the read to happen from
sstables, hence the nodetool feature was implemented in the previous
patch (on Scylla, it uses the REST API - not actually running nodetool
or requiring JMX).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:58:14 +03:00
Nadav Har'El
49580a4701 cql-pytest: add nodetool flush feature
This patch adds a nodetool-compatible capability to the cql-pytest
framework. It is *not* meant to be used to test nodetool itself, but
rather to give CQL tests the ability to use nodetool operations -
currently one operation - "nodetool flush".

Use it in a test as:

     import nodetool
     nodetool.flush(cql, table)

I chose a functional API with parameters ("cql") instead of a fixture
with an implied connection so that in the future we may allow
multiple nodes and this API will allow sending nodetool requests to
different nodes. However, multi-node support is not implemented yet,
nor used in any of the existing tests.

The implementation uses Scylla's REST API if available, or if not, falls
back to using an external "nodetool" command (which can be overridden
using the NODETOOL environment variable). This way, both cql-pytest/run
(Scylla) and cql-pytest/run-cassandra (Cassandra) now correctly support
these nodetool operations, and we still don't need to run JMX to test
Scylla.

The reason we want to support nodetool.flush() is to reproduce bugs that
depend on data reaching disk. We already had such a reproducer in
test_large_cells_rows.py - it too did something similar - but it was
Scylla-only (using only the REST API). Instead of copying such code to
multiple places, we better have a common nodetool.flush() function, as
done in this patch. The test in test_large_cells_rows.py can later be
changed to use the new function.
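
The REST-or-nodetool fallback described above can be sketched roughly as
follows. This is only an illustration, not the code in this patch: the
helper name flush_request(), its separate keyspace/table arguments, the
port 10000 and the keyspace_flush endpoint path are all assumptions made
for the example. The helper only decides what would be done and performs
no I/O.

```python
import os


def flush_request(keyspace, table, rest_api_available,
                  rest_base="http://localhost:10000"):
    """Describe how a flush would be issued, without performing it.

    Hypothetical helper: the endpoint path and the default port are
    assumptions, not taken from the patch.
    """
    if rest_api_available:
        # Scylla: use the REST API, so no JMX server is needed.
        url = f"{rest_base}/storage_service/keyspace_flush/{keyspace}?cf={table}"
        return ("POST", url)
    # Cassandra (or no REST API): run an external nodetool command,
    # which the NODETOOL environment variable can override.
    nodetool = os.environ.get("NODETOOL", "nodetool")
    return ("EXEC", [nodetool, "flush", keyspace, table])
```

Returning a description instead of executing it keeps the decision logic
easy to check in isolation.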

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-19 13:55:25 +03:00
Avi Kivity
794d272e35 Merge "Refine allocation strategy" from Pavel E
"
This set does two things:
- hides migrate-fn machinery in allocation_strategy header
- conceptualizes dynamic objects

The former is possible after IMR rework -- nowadays users of LSA don't
need to do anything special with "migrators" so they can be turned to
be internal allocation-strategy helpers.

The latter is to make sure dynamic objects do not forget to overload
the size_for_allocation_strategy(). If this happens the whole thing
compiles fine and sometimes works, but generates memory corruptions, so
it's worth adding more confidence here.

tests: unit(dev)
"

* 'br-lsa-hide-migrators' of https://github.com/xemul/scylla:
  bptree: Require dynamic object for nodes reconstruct
  allocation_strategy, code: Conceptualize dynamic objects
  allocation_strategy: Hide migrators
  allocation_strategy, code: Simplify alloc()
  allocation_strategy: Mark size_for_allocation_strategy noexcept
2021-05-19 10:14:51 +03:00
Pavel Emelyanov
0c4ba56594 bptree: Require dynamic object for nodes reconstruct
The B+ tree is not intrusive and supports both kinds of objects --
dynamic (in sense of previous patch) and fixed-size. Respectively,
the nodes provide .storage_size() method and get the embedded object
storage size themselves. Thus, if a dynamic object is used with the
tree but lacks the .storage_size() method itself, this would go
unnoticed. Fortunately, dynamic objects use the .reconstruct()
method, so that method should be constrained with the DynamicObject
concept.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Pavel Emelyanov
9216a5bc08 allocation_strategy, code: Conceptualize dynamic objects
Usually lsa allocation is performed with the construct() helper that
allocates a sizeof(T) slot and constructs it in place. Some rare
objects have dynamic size, so they are created by alloc()ating a
slot of some specific size and (!) must provide the correct overload
of size_for_allocation_strategy that reports back the relevant
storage size.

This "must provide" is not enforced: if it is missed, a default sizer
is instantiated, but it won't work properly. This patch makes all users
of alloc() conform to the DynamicObject concept, which requires the
presence of a .storage_size() method to tell that size.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Pavel Emelyanov
b8a4f32b48 allocation_strategy: Hide migrators
After IMR rework the only lsa-migrating functionality is standard one
that calls move constructors on lsa slots. Hide the whole thing inside
allocation-strategy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Pavel Emelyanov
28f01aadc9 allocation_strategy, code: Simplify alloc()
Today's alloc() accepts migrate-fn, size and alignment. None of the
callers really needs to provide anything special for the migrate-fn, and
all are happy with the default alignof() for alignment. The simplification
is in providing an alloc() that only accepts the size arg and does the rest
itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Pavel Emelyanov
fdfcda97d7 allocation_strategy: Mark size_for_allocation_strategy noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Botond Dénes
dbb6851d4d test/manual/sstable_scan_footprint: don't double close the semaphore
The semaphore `stats_collector` references is the one obtained from the
database object, which is already stopped by `database::stop()`, making
the stop in `~stats_collector()` redundant, and even worse, as it
triggers an assert failure. Remove it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>
2021-05-18 17:55:52 +03:00
Avi Kivity
16ff92745f Merge 'perf: add alternator frontend to perf_simple_query' from Piotr Sarna
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory
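
As a rough illustration (not Alternator's actual code), the four options
can be read as a decision on whether a given write goes through LWT. The
names WriteIsolation and uses_lwt() are hypothetical, and the sketch
assumes that "forbid_rmw" performs non-RMW writes without LWT:

```python
from enum import Enum


class WriteIsolation(Enum):
    FORBID_RMW = "forbid_rmw"
    UNSAFE = "unsafe"
    ALWAYS_USE_LWT = "always_use_lwt"
    ONLY_RMW_USES_LWT = "only_rmw_uses_lwt"


def uses_lwt(mode, is_rmw):
    """Return True if the write would go through LWT under `mode`.

    Raises ValueError when the mode rejects the request altogether.
    """
    if mode is WriteIsolation.FORBID_RMW:
        if is_rmw:
            raise ValueError("read-modify-write requests are forbidden")
        return False  # assumption: plain writes skip LWT in this mode
    if mode is WriteIsolation.UNSAFE:
        return False  # never LWT, even for RMW
    if mode is WriteIsolation.ALWAYS_USE_LWT:
        return True   # LWT even for non-RMW requests
    # ONLY_RMW_USES_LWT: LWT exactly when the write needs a read first.
    return is_rmw
```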

Alternator cooperates with existing `--write` and `--delete` parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

Example output showing the difference in isolation levels:

```bash
$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
```
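
The summary lines can be cross-checked by hand. The sketch below assumes
the tool uses the standard definition of median absolute deviation (the
median of the absolute distances from the median); the function name
median_and_mad() is made up for this example. Fed with the five tps
samples of the first run, it reproduces the reported figures:

```python
from statistics import median


def median_and_mad(samples):
    """Return (median, median absolute deviation) of the samples."""
    m = median(samples)
    mad = median(abs(x - m) for x in samples)
    return m, mad


# tps samples of the only_rmw_uses_lwt run shown above
tps = [10873.76, 11096.09, 11100.09, 11068.98, 11081.24]
m, mad = median_and_mad(tps)
print(f"median {m:.2f} tps, median absolute deviation: {mad:.2f}")
# prints: median 11081.24 tps, median absolute deviation: 14.85
```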

Closes #8656

* github.com:scylladb/scylla:
  perf: add alternator frontend to perf_simple_query
  cdc: make metadata.hh self-sufficient
  test: add minimal alternator_test_env
2021-05-18 16:17:54 +03:00
Piotr Sarna
6c6ccda8a0 perf: add alternator frontend to perf_simple_query
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing --write and --delete parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
2021-05-18 15:10:31 +02:00
Piotr Sarna
6e28c01c53 cdc: make metadata.hh self-sufficient
The header relies on topology_description class definition,
which is part of cdc/generation.hh.
2021-05-18 15:10:31 +02:00
Piotr Sarna
b6d6247a74 test: add minimal alternator_test_env
A minimal implementation of alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
2021-05-18 15:10:31 +02:00
Takuya ASADA
a3b25e3d29 unified/uninstall.sh: simplify uninstall.sh, delete all files correctly
The current uninstall.sh tries to mirror the logic of install.sh,
but this makes the script needlessly large, and it also fails to remove
a few files under /opt/scylladb.

Let's just do rm -rf /opt/scylladb, and drop the few other files located
outside of /opt/scylladb.

Closes #8662
2021-05-18 14:55:18 +02:00
Asias He
0858619cba storage_service: Abort restore_replica_count when node is removed from the cluster
Consider the following procedure:

- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to remove n3 from the
  cluster
- n1 is down in the middle of removenode operation

Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
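
To get a feel for how long such retries can drag on, the sleep schedule
can be worked out as below. This is only arithmetic on the numbers quoted
above, assuming one sleep follows each of the 5 failed attempts; the
function name retry_sleep_intervals() is hypothetical and the time spent
streaming between sleeps is not counted.

```python
def retry_sleep_intervals(max_attempts=5, first_sleep=60.0, factor=1.5):
    """Sleep (in seconds) after each failed attempt, growing by `factor`."""
    return [first_sleep * factor ** i for i in range(max_attempts)]


intervals = retry_sleep_intervals()
print(intervals)       # [60.0, 90.0, 135.0, 202.5, 303.75]
print(sum(intervals))  # 791.25 seconds, a bit over 13 minutes of sleeping
```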

This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues.  We need a way to abort
such streaming attempts.

To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing node to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.

In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.

This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.

Fixes #8651

Closes #8655
2021-05-18 14:55:18 +02:00
Botond Dénes
82bff1bcc6 test: cql_test_env: use proper scheduling groups
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
for as long as the test executable is running.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
2021-05-18 13:44:54 +03:00
Botond Dénes
300ee974f7 test: use with_cql_test_env_thread where needed
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
2021-05-18 13:44:52 +03:00
Avi Kivity
6db826475d Merge "Introduce segregate scrub mode" from Botond
"
The current scrub compaction has a serious drawback: while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach of cleansing the data,
that tries as hard as it can to not lose any data.

This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
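
The partition-level part of this idea can be sketched in a few lines.
This is a toy model only: partitions are reduced to comparable keys,
fragment-level handling is left out, and the policy of picking the first
output that can accept a key is an assumption, not necessarily what
segregate_by_partition does.

```python
def segregate(keys):
    """Split a possibly out-of-order key stream into ordered outputs.

    Each returned list is strictly increasing, i.e. a valid stream.
    """
    outputs = []
    for key in keys:
        for out in outputs:
            if out[-1] < key:
                # The key keeps this output ordered, append it here.
                out.append(key)
                break
        else:
            # Out-of-order (or duplicate) key: open a new output for it.
            outputs.append([key])
    return outputs


print(segregate([1, 3, 2, 4, 2]))  # [[1, 3, 4], [2], [2]]
```

No key is dropped: every input key ends up in exactly one output.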

The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
  that is suspected to be corrupt (and even those, to a certain degree) has
  to consider the possibility that it is rehabilitating corruptions,
  leaving them in the system without a warning, in the sense that the
  user won't see any more problems due to low-level corruptions and
  hence might think everything is alright, while data is still corrupt
  from the business logic point of view. It is very hard to draw a line
  between what scrub should and shouldn't do, yet there is a demand from
  users for a scrub that can restore data without losing any of it. Note
  that anybody executing such a scrub is already in a bad shape, even if
  they can read their data (they often can't) it is already corrupt,
  scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
  enum, which now selects the scrub mode. This means that
  `skip_corrupted` cannot be combined with segregate to throw out what
  the former can't fix. This was chosen for simplicity: a bunch of
  flags, all interacting with each other, is very hard to reason about in
  my opinion, whereas a linear mode selector is much easier to follow.
* The new segregate mode goes all-in, by trying to fix even
  fragment-level disorder. Maybe it should only do it on the partition
  level, or maybe this should be made configurable, allowing the user to
  select what happens to data that cannot be fixed.

Tests: unit(dev), unit(sstable_datafile_test:debug)
"

* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
  test: boost/sstable_datafile_test: add tests for segregate mode scrub
  api: storage_service/keyspace_scrub: expose new segregate mode
  sstables: compaction/scrub: add segregate mode
  mutation_fragment_stream_validator: add reset methods
  mutation_writer: add segregate_by_partition
  api: /storage_service/keyspace_scrub: add scrub mode param
  sstables: compaction/scrub: replace skip_corrupted with mode enum
  sstables: compaction/scrub: prevent infinite loop when last partition end is missing
  tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
2021-05-18 13:43:01 +03:00
Botond Dénes
5eb4517f56 read_context: move_to_next_partition(): make reader creation atomic
Otherwise an interleaving cache update can clear the `_prev_snapshot`
before the reader is created, leading to the reader being created via a
null mutation source.

Tests: unit(dev, release, debug:row_cache_test)

Fixes #8671.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518092317.227433-1-bdenes@scylladb.com>
2021-05-18 13:41:48 +03:00
Piotr Sarna
c8653d1321 cql3: enhance the fix for index paging type check
The original fix stripped the reversed type only from
the base table column, but it's better to be safe than sorry,
so the reverse is also stripped from the view column.

Refs #8667
Message-Id: <cb5dedb0b8b6b5eea3a69863ae50a0e906482665.1621330463.git.sarna@scylladb.com>
2021-05-18 12:47:35 +03:00
Takuya ASADA
60c0b37a4c install.sh: apply correct file security context when copying files
Currently, the unified installer does not apply the correct file security
context while copying files, which causes a permission error on
scylla-server.service. We should apply the default file security context
while copying files, using the '-Z' option of /usr/bin/install.

Also, because install -Z requires a normalized path to apply the correct
security context, use 'realpath -m <PATH>' on the path variables in the script.

Fixes #8589

Closes #8602
2021-05-18 12:09:51 +03:00
Takuya ASADA
6faa8b97ec install.sh: fix "no such file or directory" on nonroot
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.

Fixes #8663

Closes #8664
2021-05-18 12:03:45 +03:00
Avi Kivity
593ad4de1e Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
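
The relaxed check can be pictured with the toy model below. The classes
Type and ReversedType and both helper functions are hypothetical
stand-ins for the real type objects; the only point is that the base
type is unwrapped before the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Type:
    name: str


@dataclass(frozen=True)
class ReversedType:
    """Wrapper marking a clustering column with DESC order."""
    underlying: Type


def strip_reversed(t):
    """Return the underlying type if `t` is reversed, else `t` itself."""
    return t.underlying if isinstance(t, ReversedType) else t


def types_match_for_paging(base_type, view_type):
    # Too-eager check: base_type == view_type
    # Relaxed check: the base type is allowed to be reversed.
    return strip_reversed(base_type) == view_type
```

With int_t = Type("int"), ReversedType(int_t) == int_t is False, while
types_match_for_paging(ReversedType(int_t), int_t) is True.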

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging
2021-05-18 11:34:59 +03:00
Kamil Braun
03ad111beb tree-wide: comments on deprecated functions to access global variables
Closes #8665
2021-05-18 11:31:10 +03:00
Botond Dénes
ae366868fb multishard_mutation_query: save_reader(): avoid round-trip for destroying rparts
Force its destruction when saving the reader.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514140844.119362-1-bdenes@scylladb.com>
2021-05-18 10:07:13 +03:00
Botond Dénes
c98b0d0de8 test: cql_test_env: add trace logs to execute_cql()
In tests executing tons of these, it is useful to be able to enable a
trace logging of each one, to see which is the last successful one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514140531.118390-1-bdenes@scylladb.com>
2021-05-18 10:06:22 +03:00
Michał Chojnowski
49793a4919 cql3: change the internal type of tuples::in_value from std::vector to chunked_vector
While having a large list in the IN clause is unlikely, it's still an
arbitrarily large piece of user-provided data.
On principle, let's use a chunked container here to prevent large contiguous
allocations.
2021-05-17 17:12:07 +02:00
Michał Chojnowski
dcbc053ecd cql3: change the internal type of lists::value from std::vector to chunked_vector
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
2021-05-17 17:09:55 +02:00
Piotr Sarna
c36f432423 test: add a test case for paging with desc clustering order
Issue #8666 revealed an issue with validating types for paged
indexed queries - namely, the type checking mechanism is too strict
in comparing types and fails on mismatched clustering order -
e.g. an `int` column type is different from `int` with DESC
clustering order. As a result, users see a *very* confusing
message (because reversed types are printed as their underlying type):
 > Mismatched types for base and view columns c: int and int
This test case fails before the fix for #8666 and thus acts
as a regression test.
2021-05-17 17:06:50 +02:00
Piotr Sarna
544ef2caf3 cql3: relax a type check for index paging
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager
- namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra
Fixes #8666
2021-05-17 17:06:50 +02:00
Michał Chojnowski
4baeea0199 cql3: in multi_item_terminal, return the vector of items by value
Returning by reference requires that the elements are internally stored in
the multi_item_terminal as a std::vector, but in the next patch we will change
the internal type of lists::value from std::vector to utils::chunked_vector.

The copy is not a problem because all users of multi_item_terminal were copying
the returned vector.
2021-05-17 16:46:28 +02:00
Botond Dénes
dca808dd51 perf/perf_simple_query: add --enable-cache option
Allowing for testing performance with/out cache.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210517045402.16153-1-bdenes@scylladb.com>
2021-05-17 14:06:18 +02:00
Raphael S. Carvalho
10ae77966c compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
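
The difference between swallowing and propagating can be sketched in Python. This is a simplified model, not the compaction manager's code: `run_custom_job_swallowing`, `run_custom_job`, `reshard` and `failing_job` are hypothetical names chosen for the illustration.

```python
def run_custom_job_swallowing(job):
    # Modelled old behaviour: every exception is caught and dropped,
    # so the caller cannot tell failure from success.
    try:
        job()
    except Exception:
        pass


def run_custom_job(job):
    # Modelled new behaviour: exceptions propagate to the caller.
    job()


def reshard(run, job):
    """Upper layer: report success only if the job ran to completion."""
    try:
        run(job)
    except Exception:
        return False
    return True


def failing_job():
    raise RuntimeError("resharding failed")
```

Here `reshard(run_custom_job_swallowing, failing_job)` reports success even though the job failed, while `reshard(run_custom_job, failing_job)` reports the failure.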

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
2021-05-17 13:57:05 +02:00
Pavel Solodovnikov
f38c5b5359 raft: replication_test: extract tickers initialization into a separate func
Extract raft tickers initialization into `init_raft_tickers()`
function. This will be used later in rpc tests.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Pavel Solodovnikov
e0f8ded9bf raft: replication_test: support passing custom apply_fn to change_configuration()
This will be used later in rpc tests to support passing a dummy
apply function, which does not need to update state machine at all.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Pavel Solodovnikov
3d669df2cb raft: replication_test: introduce test_server aggregate struct
Use a somewhat neater structure to represent `create_raft_server()`
return value instead of cumbersome
`std::pair<std::unique_ptr<raft::server>, state_machine*>`.

Not only is `test_server` much shorter, it is also much more
descriptive.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Pavel Solodovnikov
a29db1deda raft: replication_test: support voter<->learner configuration changes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Pavel Solodovnikov
def97cd730 raft: remove duplicate create_command function from replication_test
Include the version from `helpers.hh`. This also makes it possible
to use additional utilities from this header file, like `id()`
and `address_set()`, which come in handy in simple tests and
will be used in rpc testing.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Pavel Solodovnikov
0389001496 raft: avoid 'using' statements in raft testing helpers header
It is generally considered a bad practice to use the `using`
directives at global scope in header files.

Also, many parts of `test/raft/helpers.hh` were already
using `raft::` prefixes explicitly, so definitely not much
to lose there.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Avi Kivity
8d6e575f59 perf_fast_forward: report instructions per fragment
Use a hardware counter to report instructions per fragment. Results
vary from ~4k insns/f when reading sequentially to more than 1M insns/f.

Instructions per fragment can be a more stable metric than frags/sec.
It would probably be even more stable with a fake file implementation
that works in-memory to eliminate seastar polling instruction variation.

Closes #8660
2021-05-17 11:33:24 +02:00
Tomasz Grabiec
8dddfab5db Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak
This is the 1st PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so current solution is for small tables only. As an example system.status was added. It was checked that DISTINCT and reverse ORDER BY do work.

This PR was created by @jul-stas and @StarostaGit
Fixes #8343

This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)

Closes #8634

* github.com:scylladb/scylla:
  boost/tests: Add virtual_table_test for basic infrastructure
  boost/tests: Test memtable_filling_virtual_table as mutation_source
  db/system_keyspace: Add system.status virtual table
  db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
  db/virtual_table: Introduce memtable_filling_virtual_table
  db: Add virtual tables interface
  db: Introduce chained_delegating_reader
2021-05-17 11:29:37 +02:00
Piotr Sarna
1a625806d8 cql-pytest: introduce service level test suite
The test suite leverages the fact that authentication is now enabled
in cql-pytest to perform validations on service level statements.
2021-05-17 10:49:45 +02:00
Piotr Sarna
588a0dfd38 cql-pytest: add enabling authentication by default
Following alternator unit tests, cql-pytest now also boots
Scylla/Cassandra with authentication enabled.
Unconditionally enabling authentication does not ruin any existing
test case, while it enables testing more scenarios. For instance,
Scylla-specific service levels can only be created and attached
to roles, which depends on authentication being enabled.
A sad side-effect is that Scylla boots slower with PasswordAuthenticator
than without it - it takes 15 seconds to set up the default
superuser account due to a hardcoded sleep duration [1] :( That should be
solved by a separate fix though.

[1]:
auth/common.hh:
inline future<> delay_until_system_ready(seastar::abort_source& as) {
    return sleep_abortable(15s, as);
}
2021-05-17 10:49:45 +02:00
Piotr Sarna
7d10213567 qos: fix validating service level timeouts for negative values
Commit message of 6e8305449 claimed to validate against negative
timeout values, while it turned out not to be the case.
The check is now added.
2021-05-17 10:49:45 +02:00
Botond Dénes
5e39cedbe3 evictable_reader: remove _reader_created flag
This flag is not really needed, because we can just attempt a resume on
first use which will fail with the default constructed inactive read
handle and the reader will be created via the recreate-after-evicted
path.
This allows the same path to be used for all reader creation cases,
simplifying the logic and more importantly making further patching
easier without the special case.
To make the recreate path (almost) as cheap for the first reader
creation as it was with the special path, `_trim_range_tombstones` and
`_validate_partition_key` are only set when really needed.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>
2021-05-16 14:45:46 +03:00
Botond Dénes
3b57106627 evictable_reader: remove destructor
We now have close() which is expected to clean up, no need for cleanup
in the destructor and consequently a destructor at all.

Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>
2021-05-16 12:19:41 +03:00
Benny Halevy
f4cfa530cc perf: enable instructions_retired_counter only once per executor::run
Enabling it for each run_worker call will invoke ioctl
PERF_EVENT_IOC_ENABLE in parallel to other workers running
and this may skew the results.

Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
2021-05-16 12:13:27 +03:00
Pavel Emelyanov
0068988e81 repair: Drop most of globals from repair
No code left that uses these globals, so rip them altogether. Also
drop the former messaging init/uninit methods that are now only
setting up those globals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
315698c683 repair: Use local references in messaging handler checks
Some time ago checks for sys-dist-ks and view-update-generator to
be locally initialized were moved inside the repair service message
handlers. Now everything is ready to use service's reference instead
of global pointers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
e748e16352 repair: Use local references in create_writer()
The repair_writer::create_writer() method needs sys-dist-ks
and view-update-generator. It's only called from repair_meta
which already has both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
394acdc139 repair: Construct repair_meta with local references
The repair_meta needs sys-dist-ks and view-update-generator. Now
when it's created both are available. Once from the repair-service
and another time from the repair_info.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
548c694e8c repair: Keep more stuff on repair_info
The repair-meta is once created from the repair_info. It will need
the sys-dist-ks and view-update-generator. Put them into the info
to have them at meta creation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
2012ea26cd repair: Kill bunch of global usages from insert_repair_meta
The insert_repair_meta needs to peek global proxy to get db from,
migration_manager to call get_schema_for_write(), global messaging
to pass it as argument to the mentioned call and to construct
the repair_meta.

All three can be obtained from the repair-service, which is now passed
there as an argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
abc40cccfd repair: Pass repair service down to meta insertion
The repair_meta will need to get local references to sys_dist_ks
and view_update_generator. One of the places where it's created
is insert_repair_meta that's called (almost) directly from the
repair messaging handler which already has the repair service.

One thing to take care of is that the handler reshards on entry,
so do the container().invoke_on() and get the local repair from
the lambda's argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
c47e1f9776 repair: Keep local migration manager on repair_info
The repair_range routine needs to mess with migration manager.
Fortunately the routine had been patched to be repair_info's
method and the repair_info itself can get the migration manager
from repair_service.

ATTN -- the obtained reference is local, not sharded<>, but the
repair_info doesn't reshard and can carry local reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
60cbb700ef repair: Move unused db captures
Similarly to previous patch -- the db captures can also be relaxed
in some places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
96b546797e repair: Remove unused ms captures
Now when the repair service is passed around repair-info
creation a lot of local messaging captures become unused and
can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
c63990e4a7 repair: Construct repair_info with service
The repair_info object will need to carry more services
references on board. This now can be easily achieved by passing
the repair service into the repair_info constructor. The info
can then get all it needs from the service. This patch is the
step #1 here -- replace db and messaging args with repair
service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
9bc122c99f repair: Loop over repair sharded container
Previous patches made sync_data_using_repair and do_repair_start
methods of repair service. This was done to have local repair
reference near the creation of repair_info. The last step left
is to make the local repair service available inside the .invoke_on
lambda. This patch makes this invoke_on on the repair service
itself thus automagically getting the local repair service in lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
a49df10b42 repair: Make sync_data_using_repair a method
The do_..._with_repair()-s all call sync_data_using_repair, the latter
was previously prepared to receive local repair service reference via
"this", so it's finally time to make it happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
5c020880f9 repair: Use repair from storage service
This is the continuation of the previous patch -- the do_..._with_repair
functions become repair_service methods and will get local repair
service reference as "this".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
23e8e60ec0 repair: Keep repair on storage service
Storage service calls a bunch of do_something_with_repair() methods. All
of them need the local repair_service and the only way to get it is by
keeping it on storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
4cbcc81167 repair: Make do_repair_start a method
The do_repair_start() creates repair_info which will need the
repair_service reference, so turn this function into a method
to have repair_service as "this".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
4f9623fd87 repair: Pass repair_service through the API until do_repair_start
The do_repair_start() will need the repair_service reference in the
next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
a2baabedad repair: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
d47b7e5387 repair: Split sync_data_using_repair
The routine in question creates repair_info inside. The repair_info
will need to receive local repair_service reference somehow, this
split prepares ground for this change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
d92d404629 repair: Turn repair_range a repair_info method
This routine uses global migration_manager pointer. Next patches
will keep the reference on a manager on repair_info and it will
be possible to use this->migration_manager reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
5c24b0750e repair: Move local_is_initialized checks up the stack
The sys-dist-ks and view-update-generator being checked are global pointers,
and the checks are performed inside a static repair_meta method. At the same
time the caller of this method is repair_service class which already
has both on-board, so move the checks up. Later they will use the
service-local references, not global pointers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
6a0d0bb093 repair: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
735a076a63 repair: Do init/uninit of messaging in start/stop
Right now repair messaging handlers are set-up on all shards by
doing messaging.invoke_on_all() calls. Since now repair service
is sharded and its .start() and .stop() are invoke-on-all-ed, it's
better to move messaging init/deinit into them.

The indentation is deliberately left broken until next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
ea3b0877a4 repair: Add dependencies to repair service
The repair service needs database, migration manager, messaging and
sys-dist-ks + view-update-generator pair. Put all these guys on it
in advance and equip the service with getters for future use.

Some dependencies are sharded because they are used in cross-shard
context and need more care to be cleaned out later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
ebc6a81700 repair: Add repair_service::start method
To be stuffed later. There's no deferred ::stop call because
sharded<>::stop calls it by itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
bbb92882de repair: Make repair_service sharded<>
It will pop up on all shards, but the existing initialization
will only happen on shard 0.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
715c4d5a47 repair: Remove unused service arg from messaging init
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
ad16d0a2b5 repair: Make repair_service::_tracker be unique-ptr
Now the repair service exists in a single instance, but it's becoming
a sharded<> service. Tracker expects to be constructed once, so make
it a pointer; the next patch will instantiate it on shard 0 only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Pavel Emelyanov
aa2b4f7821 repair: Turn repair_service into class
Now it's a struct with everything being public

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Tomasz Grabiec
28ac8d0f2b Merge "raft: randomized_nemesis_test framework" from Kamil
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with.  Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.

The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.
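
As a rough illustration, the idea can be modelled in Python. This is a sketch, not the C++ concept from the patch: the `PureStateMachine` class, the `run` helper and the `adder` machine below are hypothetical, and only the `init`/`delta` vocabulary comes from the description above.

```python
from typing import Callable, Generic, Iterable, List, Tuple, TypeVar

S = TypeVar("S")  # states
I = TypeVar("I")  # inputs
O = TypeVar("O")  # outputs


class PureStateMachine(Generic[S, I, O]):
    """An initial state plus a pure transition function `delta`."""

    def __init__(self, init: S, delta: Callable[[S, I], Tuple[S, O]]):
        self.init = init
        self.delta = delta


def run(machine: PureStateMachine, inputs: Iterable) -> Tuple[object, List]:
    """Feed the inputs through `delta`; return the final state and all outputs."""
    state = machine.init
    outputs = []
    for inp in inputs:
        state, out = machine.delta(state, inp)
        outputs.append(out)
    return state, outputs


# Example machine (an assumption for illustration): a running sum whose
# output after each input is the new total.
adder = PureStateMachine(init=0, delta=lambda s, x: (s + x, s + x))
```

Because `delta` has no side effects, generic test code such as `run` works unchanged for any machine that supplies `init` and `delta`.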

Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks.  For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.
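
A tick-driven timer of this kind can be sketched as follows. This is a simplified Python model, not the actual `logical_timer`: it takes a callback instead of returning a future, and the class name `LogicalTimer` and its fields are assumptions.

```python
class LogicalTimer:
    """Runs callbacks after a given number of logical clock ticks."""

    def __init__(self):
        self.now = 0
        self._pending = []  # list of (deadline, callback)

    def sleep(self, ticks, callback):
        # Schedule `callback` to run after `ticks` more calls to tick().
        self._pending.append((self.now + ticks, callback))

    def tick(self):
        # Advance the logical clock and fire everything that is now due.
        self.now += 1
        due = [cb for deadline, cb in self._pending if deadline <= self.now]
        self._pending = [(d, cb) for d, cb in self._pending if d > self.now]
        for cb in due:
            cb()
```

In this model, a callback registered with `sleep(20, ...)` runs during the 20th subsequent `tick()` and not before.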

To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
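
The steps above can be modelled in a few lines of Python. This is a sketch under assumptions, not the test framework's code: `concurrent.futures.Future` stands in for the promise-future pair, `uuid.uuid4()` for the unique ID, and the names `ImpureStateMachine`, `allocate_channel` and `exchange` are hypothetical.

```python
import uuid
from concurrent.futures import Future


class ImpureStateMachine:
    """Wraps a pure `delta` function and routes outputs through channels."""

    def __init__(self, init, delta):
        self.state = init
        self._delta = delta
        self._channels = {}  # channel ID -> Future; not part of replicated state

    def allocate_channel(self):
        # Step 2: create a channel and a unique ID, keep the input side.
        channel_id = uuid.uuid4()
        output = Future()
        self._channels[channel_id] = output
        return channel_id, output

    def apply(self, commands):
        # Step 4: each command is an (ID, machine input) pair.
        for channel_id, machine_input in commands:
            self.state, out = self._delta(self.state, machine_input)
            promise = self._channels.pop(channel_id, None)
            if promise is not None:
                promise.set_result(out)


def exchange(state, value):
    # Example transition: store the new value, output the previous one.
    return value, state
```

Only the replica that allocated the channel finds the ID in `_channels`; every other replica applies the same command and silently drops the output.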

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.

We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receival of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receival function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.

`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.
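
The contiguity invariant can be illustrated with a small Python model. The `Persistence` class and its methods below are hypothetical, and rejecting a gap with `ValueError` is a choice made for this sketch; the text above only states that the caller is assumed to provide gap-free entries.

```python
class Persistence:
    """In-memory log that stays contiguous by index."""

    def __init__(self):
        self._stored_entries = []  # list of (index, command)

    def store_log_entries(self, entries):
        # Validate the whole batch first so a bad batch stores nothing.
        new_log = list(self._stored_entries)
        for index, command in entries:
            if new_log and index != new_log[-1][0] + 1:
                raise ValueError(f"gap before index {index}")
            new_log.append((index, command))
        self._stored_entries = new_log

    def is_contiguous(self):
        indexes = [index for index, _ in self._stored_entries]
        return all(b == a + 1 for a, b in zip(indexes, indexes[1:]))
```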

In addition to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`.  We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.

In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
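
A heartbeat-based detector along these lines might look as follows in Python. This is a simplified model, not the framework's `failure_detector`: the tick-based clock, the `convict_threshold` parameter and the `receive_heartbeat`/`is_alive` method names are assumptions.

```python
class FailureDetector:
    """Convicts a server when no heartbeat arrives for too many ticks."""

    def __init__(self, send_heartbeat, convict_threshold):
        self._send_heartbeat = send_heartbeat  # called as send_heartbeat(dst)
        self._threshold = convict_threshold
        self._now = 0
        self._last_heard = {}  # server ID -> tick of the last heartbeat

    def add_server(self, server_id):
        self._last_heard.setdefault(server_id, self._now)

    def remove_server(self, server_id):
        self._last_heard.pop(server_id, None)

    def receive_heartbeat(self, src):
        if src in self._last_heard:
            self._last_heard[src] = self._now

    def tick(self):
        # Advance the clock and send a heartbeat to every known server.
        self._now += 1
        for server_id in self._last_heard:
            self._send_heartbeat(server_id)

    def is_alive(self, server_id):
        last = self._last_heard.get(server_id)
        return last is not None and self._now - last <= self._threshold
```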

`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
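
Such an event queue can be sketched with Python's `heapq`. This is an illustrative model only: the `Network` class, its `send` signature and the `delay` parameter are assumptions; the strict "smaller than the current time" comparison follows the description above.

```python
import heapq
import itertools


class Network:
    """Priority queue of messages ordered by logical delivery time."""

    def __init__(self, deliver):
        self._deliver = deliver        # called as deliver(src, dst, payload)
        self._now = 0
        self._events = []              # heap of (time, seq, src, dst, payload)
        self._seq = itertools.count()  # tie-breaker, so payloads are never compared

    def send(self, src, dst, payload, delay):
        event = (self._now + delay, next(self._seq), src, dst, payload)
        heapq.heappush(self._events, event)

    def tick(self):
        # Advance the clock, then deliver every message whose time is
        # strictly smaller than the current time.
        self._now += 1
        while self._events and self._events[0][0] < self._now:
            _, _, src, dst, payload = heapq.heappop(self._events)
            self._deliver(src, dst, payload)
```

The payload is opaque to the queue, which matches the generic payload type mentioned above.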

The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.

`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.

`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

`environment` needs to be periodically `tick()`ed, which ticks the network
and underlying servers.

`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
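
The register's transition function can be modelled in Python as below. This is a sketch of the behaviour described, not the C++ `ExRegister`: the `Exchange` and `Read` classes and the `ex_register_delta` function are hypothetical names, and serialization is omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Exchange:
    """Input: replace the stored value with `value`."""
    value: int


@dataclass(frozen=True)
class Read:
    """Input: return the stored value without changing it."""


def ex_register_delta(state: int, inp) -> Tuple[int, int]:
    """Return (new state, output) for the given state and input."""
    if isinstance(inp, Exchange):
        return inp.value, state  # store the new value, output the previous one
    if isinstance(inp, Read):
        return state, state      # state unchanged, output the current value
    raise TypeError(f"unsupported input: {inp!r}")
```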

* kbr/randomized-nemesis-test-v5:
  raft: randomized_nemesis_test: basic test
  raft: randomized_nemesis_test: ticker
  raft: randomized_nemesis_test: environment
  raft: randomized_nemesis_test: server
  raft: randomized_nemesis_test: delivery queue
  raft: randomized_nemesis_test: network
  raft: randomized_nemesis_test: heartbeat-based failure detector
  raft: randomized_nemesis_test: memory backed persistence
  raft: randomized_nemesis_test: rpc
  raft: randomized_nemesis_test: impure_state_machine
  raft: randomized_nemesis_test: introduce logical_timer
  raft: randomized_nemesis_test: `PureStateMachine` concept
2021-05-14 17:33:40 +02:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Asias He
e4872a78b5 storage_service: Delay update pending ranges for replacing node
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
responding to gossip echo messages to avoid other nodes sending read
requests to the replacing node. It works as follows:

1) replacing node does not respond to echo messages, so that other nodes
do not mark the replacing node as alive

2) replacing node advertises hibernate state so other nodes know
the replacing node is replacing

3) replacing node responds to echo messages so other nodes can mark
replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimum fix is
implemented in this patch. Once existing nodes learn that the replacing node
is in HIBERNATE state, they record the node as replacing, but add
it to the pending list only after the replacing node is
marked as alive.

With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.

Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013

Closes #8614
2021-05-14 17:24:28 +02:00
Tomasz Grabiec
102dcfc1fd Merge "scylla-gdb.py: introduce scylla read-stats" from Botond
Too many or too resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard with their states and
resource consumptions. This is exactly what this new command does. For
this it uses the reader concurrency semaphores and their permits
respectively, which are now arranged in an intrusive list and therefore
are enumerable.
Example output:
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
   permits count       memory table/description/state
         1     1     14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
        16     0        53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active
         1     0         1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
         1     0            0 *.*/view_builder/active
         1     0            0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
        20     1     14334414 Total

* botond/scylla-gdb.py-scylla-reads/v5:
  scylla-gdb.py: introduce scylla read-stats
  scylla-gdb.py: add pretty printer for std::string_view
  scylla-gdb.py: std_map() add __len__()
  scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()
2021-05-14 16:07:14 +02:00
Takuya ASADA
838acb44d0 scylla-fstrim.timer: fix wrong description from 'daily' to 'weekly'
It is scheduled weekly, not daily.

Fixes #8633

Closes #8644
2021-05-14 16:02:12 +02:00
Asias He
b8749f51cb repair: Consider memory bloat when calculate repair parallelism
The repair parallelism is calculated from the amount of memory allocated to
repair and memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.

Be more conservative when calculating the parallelism to avoid repair
using too much memory.

Fixes #8641

Closes #8652
2021-05-14 16:02:08 +02:00
Piotr Sarna
c1cb7d87e1 auth: remove the fixed 15s delay during auth setup
The auth initialization path contains a fixed 15s delay,
which used to work around a couple of issues (#3320, #3850),
but is now useless, because a retry mechanism
is already in place anyway.
This patch speeds up the boot process if authentication is enabled.
In particular, for single-node clusters, common in test setups,
auth initialization now takes a couple of milliseconds instead
of the whole 15 seconds.

Fixes #8648

Closes #8649
2021-05-14 16:01:59 +02:00
Kamil Braun
c21311ecca raft: randomized_nemesis_test: basic test
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
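
The register semantics described above can be sketched in Python as a pure transition function. This is only an illustration and not the C++ code from the commit: the names `Exchange`, `Read`, `INIT` and `delta` are hypothetical, and the initial value 0 is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Exchange:
    """Input that replaces the stored value."""
    value: int


@dataclass(frozen=True)
class Read:
    """Input that only observes the stored value."""


Input = Union[Exchange, Read]

INIT = 0  # assumed initial state; the commit message does not state it


def delta(state: int, inp: Input) -> Tuple[int, int]:
    """Return (next_state, output) for one input."""
    if isinstance(inp, Exchange):
        # An exchange stores the new value and returns the previous state.
        return inp.value, state
    # A read leaves the state untouched and returns the current state.
    return state, state
```

For example, applying `Exchange(5)` to the initial state and then `Read()` yields the outputs 0 and 5 under these assumptions.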
2021-05-14 15:11:01 +02:00
Kamil Braun
66b9bc6fe1 raft: randomized_nemesis_test: ticker
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
2021-05-14 15:11:01 +02:00
Kamil Braun
c7cef58797 raft: randomized_nemesis_test: environment
`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

Needs to be periodically `tick()`ed which ticks the network
and underlying servers.

New servers can be created in the environment by calling `new_server`.
2021-05-14 15:11:01 +02:00
Kamil Braun
5095a4158e raft: randomized_nemesis_test: server
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
2021-05-14 15:11:01 +02:00
Kamil Braun
f139fd4c28 raft: randomized_nemesis_test: delivery queue
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
2021-05-14 15:11:01 +02:00
Kamil Braun
2956f5f76c raft: randomized_nemesis_test: network
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
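
A rough Python sketch of such a queue follows. It is an illustration under assumptions, not the test's C++ code: the class and method names are hypothetical, time is a plain integer, and a sequence counter is added as a tie-breaker so that payloads never need to be comparable.

```python
import heapq
import itertools
from typing import Any, Callable, List, Tuple


class Network:
    """Priority queue of (delivery_time, message) events driven by a logical clock."""

    def __init__(self, deliver: Callable[[int, int, Any], None]) -> None:
        self._deliver = deliver  # delivery method is supplied by the caller
        self._now = 0
        self._seq = itertools.count()  # tie-breaker for equal delivery times
        self._events: List[Tuple[int, int, int, int, Any]] = []

    def send(self, src: int, dst: int, payload: Any, delay: int) -> None:
        heapq.heappush(
            self._events, (self._now + delay, next(self._seq), src, dst, payload)
        )

    def tick(self) -> None:
        self._now += 1
        # Deliver every message whose time is smaller than the current time.
        while self._events and self._events[0][0] < self._now:
            _, _, src, dst, payload = heapq.heappop(self._events)
            self._deliver(src, dst, payload)
```

The strict comparison mirrors the wording of the commit message; whether the real implementation uses a strict or non-strict comparison is not stated here.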
2021-05-14 15:11:01 +02:00
Kamil Braun
3068a0aa70 raft: randomized_nemesis_test: heartbeat-based failure detector
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
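
The idea can be sketched in Python as below. This is a simplified illustration, not the commit's implementation: the `tick`, `receive_heartbeat` and `is_alive` names, the integer clock and the threshold parameter are all assumptions.

```python
from typing import Callable, Dict


class FailureDetector:
    """Convicts a server when no heartbeat arrived for `threshold` ticks."""

    def __init__(self, send_heartbeat: Callable[[int], None], threshold: int) -> None:
        self._send_heartbeat = send_heartbeat  # message passing is supplied by the caller
        self._threshold = threshold
        self._now = 0
        self._last_heard: Dict[int, int] = {}  # server id -> time of last heartbeat

    def add_server(self, server_id: int) -> None:
        self._last_heard.setdefault(server_id, self._now)

    def remove_server(self, server_id: int) -> None:
        self._last_heard.pop(server_id, None)

    def receive_heartbeat(self, server_id: int) -> None:
        if server_id in self._last_heard:
            self._last_heard[server_id] = self._now

    def tick(self) -> None:
        self._now += 1
        # Send a heartbeat to every server we currently know about.
        for server_id in self._last_heard:
            self._send_heartbeat(server_id)

    def is_alive(self, server_id: int) -> bool:
        last = self._last_heard.get(server_id)
        return last is not None and self._now - last < self._threshold
```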
2021-05-14 15:11:01 +02:00
Kamil Braun
51df600478 raft: randomized_nemesis_test: memory backed persistence
`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit.  We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
2021-05-14 15:11:01 +02:00
Kamil Braun
7a1f6e6d7b raft: randomized_nemesis_test: rpc
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receival of messages in the
`receive` function. It defines the message type (`message_t`) that will
 be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receival function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
2021-05-14 15:11:01 +02:00
Kamil Braun
905126acc3 raft: randomized_nemesis_test: impure_state_machine
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
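
The output-channel technique can be sketched in Python with `asyncio` futures standing in for Seastar promise-future pairs. This is an illustration under assumptions: Raft replication is replaced by a plain loop over replicas, IDs are generated with `uuid4`, and all names (`allocate_channel`, `release_channel`, `call`) are hypothetical.

```python
import asyncio
import uuid


class ImpureStateMachine:
    def __init__(self, init, delta):
        self._state = init
        self._delta = delta
        self._channels = {}  # ID -> future; local only, not replicated state

    def allocate_channel(self):
        # Step 2: create a channel and a unique ID, keep the input side.
        cid = uuid.uuid4()
        fut = asyncio.get_running_loop().create_future()
        self._channels[cid] = fut
        return cid, fut

    def release_channel(self, cid):
        self._channels.pop(cid, None)

    def apply(self, commands):
        # Step 4: commands are (ID, machine input) pairs, applied in order.
        for cid, inp in commands:
            self._state, output = self._delta(self._state, inp)
            fut = self._channels.get(cid)
            if fut is not None and not fut.done():
                fut.set_result(output)  # only the replica owning the ID sends output


async def call(leader, replicas, inp, timeout=1.0):
    cid, fut = leader.allocate_channel()  # step 1
    try:
        for replica in replicas:  # stand-in for step 3 (add_entry + replication)
            replica.apply([(cid, inp)])
        return await asyncio.wait_for(fut, timeout)  # step 5, bounded wait
    finally:
        leader.release_channel(cid)
```

Replicas other than the leader apply the same command but find no channel for the ID, so they produce no output, which matches the note above about IDs not being part of the replicated state.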
2021-05-14 15:11:01 +02:00
Kamil Braun
3e02befccd raft: randomized_nemesis_test: introduce logical_timer
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
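
A minimal Python analogue, using `asyncio` futures instead of Seastar futures, might look like this. The class is a sketch under assumptions and does not reproduce the C++ interface; only the `sleep`/`tick` pairing is taken from the description above.

```python
import asyncio
import heapq
import itertools


class LogicalTimer:
    """Resolves futures after a given number of tick() calls."""

    def __init__(self):
        self._now = 0
        self._seq = itertools.count()  # tie-breaker for equal deadlines
        self._waiters = []  # heap of (deadline, seq, future)

    def sleep(self, ticks):
        # Must be called from a running event loop.
        fut = asyncio.get_running_loop().create_future()
        heapq.heappush(self._waiters, (self._now + ticks, next(self._seq), fut))
        return fut

    def tick(self):
        self._now += 1
        # Wake every waiter whose deadline has been reached.
        while self._waiters and self._waiters[0][0] <= self._now:
            _, _, fut = heapq.heappop(self._waiters)
            if not fut.done():
                fut.set_result(None)
```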
2021-05-13 11:34:00 +02:00
Kamil Braun
15e3bd2620 raft: randomized_nemesis_test: PureStateMachine concept
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.

The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.

One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such modification,
if we decide to do it, is trivial.
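
To illustrate how generic code can be written against such a definition, here is a small Python sketch that folds a sequence of inputs through any `(init, delta)` pair. It is not the C++ concept itself; `run` and `counter_delta` are hypothetical names used only for this example.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")  # states
I = TypeVar("I")  # inputs
O = TypeVar("O")  # outputs


def run(init: S, delta: Callable[[S, I], Tuple[S, O]], inputs: Iterable[I]) -> Tuple[S, List[O]]:
    """Feed inputs through the transition function, collecting outputs."""
    state = init
    outputs: List[O] = []
    for inp in inputs:
        state, out = delta(state, inp)
        outputs.append(out)
    return state, outputs


def counter_delta(state: int, inp: int) -> Tuple[int, int]:
    """Example machine: add the input to the state and output the new total."""
    total = state + inp
    return total, total
```

Because `run` only relies on `init` and `delta`, swapping in a more complex machine requires no change to the driver, which is the property the commit message aims for.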
2021-05-13 11:34:00 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value() and test optional directly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Piotr Wojtczak
e6254acfd3 boost/tests: Add virtual_table_test for basic infrastructure
2021-05-12 17:05:35 +02:00
Piotr Wojtczak
8825ae128d boost/tests: Test memtable_filling_virtual_table as mutation_source
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
2021-05-12 17:05:35 +02:00
Juliusz Stasiewicz
874f4de60c db/system_keyspace: Add system.status virtual table
This change uses the previously introduced
memtable_filling_virtual_table
to expose nodetool status as a virtual table.
2021-05-12 17:05:35 +02:00
Tomasz Grabiec
57ed93bf44 db/virtual_table: Add a way to specify a range of partitions for virtual
table queries.

This change introduces a query_restrictions object into the virtual
table infrastructure, for now only holding a restriction on partition
ranges.
That partition range is then implemented into
memtable_filling_virtual_table.
2021-05-12 17:05:35 +02:00
Piotr Wojtczak
38720847f2 db/virtual_table: Introduce memtable_filling_virtual_table
This change adds a more specific implementation of the virtual table
called memtable_filling_virtual_table. It produces results by filling
a memtable on each read.
2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz
61a0314952 db: Add virtual tables interface
This change introduces the basic interface we expect each virtual
table to implement. More specific implementations will then expand
upon it if needed.
2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz
8333d66d4e db: Introduce chained_delegating_reader
This change adds a new type of mutation reader whose purpose
is to allow inserting operations before an invocation of the proper
reader. It takes a future to wait on and only after it resolves will
it forward the execution to the underlying flat_mutation_reader
implementation.
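
The behaviour can be approximated in Python as follows. This is a loose sketch under assumptions: an `asyncio` future stands in for the Seastar future, a plain iterator stands in for the underlying reader, and `read_next` is a hypothetical method name.

```python
import asyncio


class ChainedDelegatingReader:
    """Waits for `ready` once, then forwards every read to `underlying`."""

    def __init__(self, ready, underlying):
        self._ready = ready            # awaitable that must complete first
        self._underlying = underlying  # iterator standing in for the real reader
        self._waited = False

    async def read_next(self):
        if not self._waited:
            await self._ready          # the inserted operation runs before any read
            self._waited = True
        return next(self._underlying, None)  # None signals end of stream
```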
2021-05-12 17:05:34 +02:00
Eliran Sinvani
5eb84f110e gossiper: remove excess error logging from gossiper
We remove an error-level log message for a failure that is later thrown as an
exception, caught a few lines below, and then printed out as
a warning.

Fixes #8616

Closes #8617
2021-05-12 15:02:35 +02:00
Tomasz Grabiec
f8d7374400 Merge 'Add additional sstable stats' from Michael Livshin
Refs #251.

Closes #8630

* github.com:scylladb/scylla:
  statistics: add global bloom filter memory gauge
  statistics: add some sstable management metrics
  sstables: make the `_open` field more useful
  sstables: stats: noexcept all accessors
2021-05-12 14:35:13 +02:00
Avi Kivity
c3f17ea0a3 Merge "Fix query performance for range tombstone covering many rows" from Tomasz
"
Row cache reader can produce overlapping range tombstones in the mutation
fragment stream even if there is only a single range tombstone in sstables,
due to #2581. For every range between two rows, the row cache reader queries
for tombstones relevant for that range. The result of the query is trimmed to
the current position of the reader (=position of the previous row) to satisfy
key monotonicity. The end position of range tombstones is left unchanged. So
cache reader will split a single range tombstone around rows. Those range
tombstones are transient, they will be only materialized in the reader's
stream, they are not persisted anywhere.

That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used to
compact the mutation fragment stream needs to accumulate all tombstones which
are relevant for the current clustering position in the stream. Adding a new
range tombstone is O(N) in the number of currently active tombstones. This
means that producing N rows will be O(N^2).

In a unit test introduced in this series, I saw reading 137'248 rows which
overlap with a range tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().

The solution is to make the cache reader trim range tombstone end to the
currently emitted sub-range, so that it emits non-overlapping range tombstones.

Fixes #8626.

Tests:

  - row_cache_test (release)
  - perf_row_cache_reads (release)
"

* tag 'fix-perf-many-rows-covered-by-range-tombstone-v2' of github.com:tgrabiec/scylla:
  tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
  row_cache: Avoid generating overlapping range tombstones
  range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed
2021-05-12 14:07:48 +03:00
Tomasz Grabiec
a9dd7a295d tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
Reproduces #8626.

Output:

    test_scan_with_range_delete_over_rows
    Populating with rows
    Rows: 702710
    Scanning...
    read: 540.007324 [ms], preemption: {count: 2356, 99%: 1.131752 [ms], max: 1.148589 [ms]}, cache: 251/252 [MB]
    read: 651.942688 [ms], preemption: {count: 1176, 99%: 1.131752 [ms], max: 1.009652 [ms]}, cache: 251/252 [MB]
2021-05-12 11:58:36 +02:00
Michael Livshin
357ab759ee statistics: add global bloom filter memory gauge
Refs #251.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
5abeadde4d statistics: add some sstable management metrics
Add the following metrics, as part of #251:

- open for writing (a.k.a. "created", unless I'm missing something?)

- open for reading

- deleted

- currently open for reading/writing (gauges)

Refs #251.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
9a2b54fcf6 sstables: make the _open field more useful
The field is hitherto only used in scylla-gdb.py.  Let it store the
open mode (if any).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
1f83251b2b sstables: stats: noexcept all accessors
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Benny Halevy
c0dafa75d9 utils: phased_barrier: advance_and_await: make noexcept
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.

In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().

Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.

Fixes #8636

A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
2021-05-12 01:36:11 +02:00
Benny Halevy
b4cbd46adb row_cache: create_underlying_reader: call read_context on_underlying_created only on success
ctx.on_underlying_created() mustn't be called if src.make_reader failed
and a reader isn't created.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511054525.35090-1-bhalevy@scylladb.com>
2021-05-12 01:34:48 +02:00
Tomasz Grabiec
6863a5e43b row_cache: Avoid generating overlapping range tombstones
Row cache reader can produce overlapping range tombstones in the
mutation fragment stream even if there is only a single range
tombstone in sstables, due to #2581. For every range between two rows,
the row cache reader queries for tombstones relevant for that
range. The result of the query is trimmed to the current position of
the reader (=position of the previous row) to satisfy key
monotonicity. The end position of range tombstones is left
unchanged. So cache reader will split a single range tombstone around
rows. Those range tombstones are transient, they will be only
materialized in the reader's stream, they are not persisted anywhere.

That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used
to compact the mutation fragment stream needs to accumulate all
tombstones which are relevant for the current clustering position in
the stream. Adding a new range tombstone is O(N) in the number of
currently active tombstones. This means that producing N rows will be
O(N^2).

In a unit test, I saw reading 137'248 rows which overlap with a range
tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().

The solution is to make the cache reader trim range tombstone end to
the currently emitted sub-range, so that it emits non-overlapping range
tombstones.
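
The difference between the two emission strategies can be illustrated with a toy Python model. It assumes integer clustering positions and half-open `(start, end)` ranges, and the function names are hypothetical; the real reader works on mutation fragments, not tuples.

```python
def emit_tombstones(tombstone, rows, trim_end):
    """Split one range tombstone around rows, optionally trimming the end."""
    start, end = tombstone
    bounds = [start] + [r for r in sorted(rows) if start < r < end] + [end]
    out = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Old behaviour keeps the original end; the fix trims it to the sub-range.
        out.append((lo, hi if trim_end else end))
    return out


def max_active(tombstones):
    """Largest number of tombstones covering any tombstone start position."""
    return max(sum(1 for s, e in tombstones if s <= p < e) for p, _ in tombstones)
```

With the untrimmed ends, every emitted tombstone stays active until the original end, so the number of active tombstones grows with the number of rows; with trimmed ends it stays at one in this model.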

Fixes #8626.
2021-05-12 00:10:24 +02:00
Tomasz Grabiec
80cd829139 range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed
Recalculation of the current tombstone is O(N) in the number of active
range tombstones. This can be a significant overhead, so better avoid it.

Solves the problem of quadratic complexity when producing lots of
overlapping range tombstones with a common end bound.

Refs #8625
Refs #8626
2021-05-12 00:10:24 +02:00
Nadav Har'El
cee4c075d2 Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes
2021-05-11 18:40:15 +03:00
Nadav Har'El
c7a814fd5c utils/enum_option.hh: make it easier to compare the value
The operator== of enum_option<> (which we use to hold multi-valued
Scylla options) makes it easy to compare to another enum_option
wrapper, but ugly to compare the actual value held. So this patch
adds a nicer way to compare the value held.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210511120222.1167686-1-nyh@scylladb.com>
2021-05-11 18:39:10 +03:00
Benny Halevy
9ba960a388 utils: phased_barrier::operation do not leak gate entry when reassigned
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.

Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.

This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.

A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.

Fixes #8613

Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
1d8234f52d Merge "reader_concurrency_semaphore: improve diagnostics printout" from Botond
"
The current printout has multiple problems:
* It is segregated by state, each having its own sorting criteria;
* Number of permits and count resources is collapsed in to a single
  column, not clear which is the one printed.
* Number of available/initial units of the semaphore are not printed;

This series solves all these problems:
* It merges all states into a single table, sorted by memory
  consumption, in descending order.
* It separates number of permits and count resources into separate
  columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
  up the logs.

The goal of all this is to make it easy to find the culprit of a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently introduced `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO  2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count   memory  table/description/state
7       2       77M     ks.tbl1/op1/active
6       3       59M     ks.tbl1/op0/active
4       0       36M     ks.tbl1/op2/active
3       1       36M     ks.tbl0/op2/active
11      2       43M     permits omitted for brevity

31      8       251M    total
"

* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
  test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
  reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
  reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
  reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
  reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
  reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
2021-05-11 18:39:10 +03:00
Avi Kivity
eed89a9b56 Update tools/jmx submodule (toppartitions multi-sampler query)
* tools/jmx 440313e...a7c4c39 (1):
  > storage_service: Fix getToppartitions to always return both reads and writes
2021-05-11 18:39:10 +03:00
Nadav Har'El
af485f5226 secondary index: fix index name in IndexInfo system table
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.

However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.

This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.

This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.

Fixes #8600

Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" suffix.
To give an example, the index name "a" should be listed before "a1", but
the view name "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.
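
The ordering problem is easy to reproduce in a few lines of Python. The helper names below are hypothetical and the snippet only models the string ordering, not the virtual table itself.

```python
SUFFIX = "_index"


def order_via_views(index_names):
    """Order obtained by sorting the backing view names, then stripping the suffix."""
    view_names = sorted(name + SUFFIX for name in index_names)
    return [view[: -len(SUFFIX)] for view in view_names]


def expected_order(index_names):
    """Order obtained by sorting the index names themselves."""
    return sorted(index_names)
```

For the names "a" and "a1", `order_via_views` lists "a1" first while `expected_order` lists "a" first, because the character "1" sorts before "_" in ASCII.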

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
61c7f874cc Merge 'Add per-service-level timeouts' from Piotr Sarna
Ref: #7617

This series adds timeout parameters to service levels.

Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).
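
The precedence described above can be summarized with a small Python sketch. The function and parameter names are hypothetical and do not correspond to Scylla's internal API; `None` models an unset timeout (for example after `write_timeout = null`).

```python
from typing import Optional


def effective_timeout_ms(
    default_ms: int,
    service_level_ms: Optional[int] = None,
    per_query_ms: Optional[int] = None,
) -> int:
    """Pick the timeout that applies to a single query."""
    if per_query_ms is not None:
        return per_query_ms        # USING TIMEOUT wins over everything else
    if service_level_ms is not None:
        return service_level_ms    # service level overrides the scylla.yaml default
    return default_ms              # fall back to the configured default
```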

Closes #7913

* github.com:scylladb/scylla:
  docs: add a paragraph describing service level timeouts
  test: add per-service-level timeout tests
  test: add refreshing client state
  transport: add updating per-service-level params
  client_state: allow updating per service level params
  qos: allow returning combined service level options
  qos: add a way of merging service level options
  cql3: add preserving default values for per-sl timeouts
  qos: make getting service level public
  qos: make finding service level public
  treewide: remove service level controller from query state
  treewide: propagate service level to client state
  sstables: disambiguate boost::find
  cql3: add a timeout column to LIST SERVICE LEVEL statement
  db: add extracting service level info via CQL
  types: add a missing translation for cql_duration
  cql3: allow unsetting service level timeouts
  cql3: add validating service level timeout values
  db: add setting service level params via system_distributed
  cql3: add fetching service level attrs in ALTER and CREATE
  cql3: add timeout to service level params
  qos: add timeout to service level info
  db,sys_dist_ks: add timeout to the service level table
  migration_manager: allow table updates with timestamp
  cql3: allow a null keyword for CQL properties
2021-05-11 18:39:10 +03:00
Nadav Har'El
3c2e852dd9 Merge 'scylla-gdb unit test' from Michael Livshin
This patchset adds a basic scylla-gdb.py test to the test suite.

First two patches add the test itself (disabled), subsequent ones are
fixes for scylla-gdb.py to make the test pass, and the last one
enables the test.

Closes #8618

* github.com:scylladb/scylla:
  test: enable scylla-gdb/run
  scylla-gdb.py: "this" -> "self"
  scylla-gdb.py: wrap std::unordered_{set,map} and flat_hash_map
  scylla-gdb.py: robustify execution_strategy traversal
  scylla-gdb.py: recognize new sstable reader types
  scylla-gdb.py: make list_unordered_map more resilient
  scylla-gdb.py: robustify netw & gms
  scylla-gdb.py: redo find_db() in terms of sharded()
  scylla-gdb.py: debug::logalloc_alignment may not exist
  scylla-gdb.py: handle changed container type of keyspaces
  scylla-gdb.py: walk intrusive containers using provided link fields
  test: add a basic test for scylla-gdb.py
  test.py: refine test mode control
2021-05-11 18:39:10 +03:00
Avi Kivity
b1f9df279a Merge "Untie cdc, storage service and migration notifier knot" from Pavel E
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes

- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)

tests: unit(dev)
"

* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
  storage_service: Remove migration notifier dependency
  cdc: Remove db_context::builder
  cdc: Provide migration notifier right at once
  cdc: Remove db_context::builder::with_migration_notifier
2021-05-11 18:39:10 +03:00
Michael Livshin
ff7d781988 test: enable scylla-gdb/run
It should pass now.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
6548436db3 Merge "Improve coverage support" from Botond
"
This patch-set builds on the existing very basic coverage generation
support and greatly improves it, adding an almost fully automated way of
generating reports, as well as a more manual way.
At the heart of this is a new build mode, coverage, that is dedicated
to coverage report generation, containing all the required build flags,
without interfering with that of the "host" build mode, like currently
(with the --coverage flag).
Additionally a new script, scripts/coverage.py, is added which automates
the magic behind the scenes needed to get from raw profile files to a
nice html report, as long as the raw files are at the expected place.
There are still some rough edges:
* There is no direct ninja support for coverage generation, one has to
  build the tests, then run them via test.py.
* Building and running just a few tests is a miserable experience
  (#8608).
* Only boost unit tests are supported at the moment when using test.py.
* A --verbose flag for coverage.py would be nice.
* coverage.py could have a way to run a test itself, automatically
  adding the required ENV variable(s).

I plan on addressing all these in the future, in the meanwhile, with
this series, the coverage report generation is made available for
non-masochists as well.
"

* 'coverage-improvements/v1' of https://github.com/denesb/scylla:
  HACKING.md: update the coverage guide
  test.py: add basic coverage generation support
  scripts: introduce coverage.py
  configure.py: replace --coverage with a coverage build mode
  configure.py: make the --help output more readable
  configure.py: add build mode descriptions
  configure.py: fix fallback mode selection for checkheaders target
  configure.py: centralize the declaration of build modes
2021-05-11 18:39:10 +03:00
Michael Livshin
ee80c81593 scylla-gdb.py: "this" -> "self"
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Asias He
4f0a1cbca3 repair: Wire off-strategy compaction for decommission
When decommission is done, all nodes that receive data from the
decommission node will run node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs #5226

Closes #8607
2021-05-11 18:39:10 +03:00
Michael Livshin
b711fc5762 scylla-gdb.py: wrap std::unordered_{set,map} and flat_hash_map
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Nadav Har'El
df9faba652 Merge 'storage_proxy: place unique_response_handler:s in small_vector instead of std::vector' from Avi Kivity
This cuts an allocation in the write path. Instruction count reduction isn't
large, but performance does improve (results are consistent):

before: 196369.48 tps ( 55.2 allocs/op,  13.2 tasks/op,   51658 insns/op)
after:  199290.32 tps ( 54.2 allocs/op,  13.2 tasks/op,   51600 insns/op)

(this is perf_simple_query --write --smp 1 --operations-per-shard 1000000)

Since small_vector requires a noexcept move constructor and move assignment,
the corresponding unique_response_handler members are adjusted and added,
respectively.

Closes #8606

* github.com:scylladb/scylla:
  storage_proxy: place unique_response_handler:s in small_vector instead of std::vector
  storage_proxy: make unique_response_handler friendly to small_vector
  storage_proxy: give a name to a vector of unique_response_handlers
2021-05-11 18:39:10 +03:00
Michael Livshin
b0fbd0062e scylla-gdb.py: robustify execution_strategy traversal
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Yaron Kaikov
588a065304 scylla_io_setup: configure "aio-max-nr" before iotune
On several instance types in AWS and Azure, we get the following failure
during the scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```

We have scylla_prepare:configure_io_slots() running before
scylla-server.service starts, but scylla_io_setup takes place
before that.

1) Let's move configure_io_slots() to scylla_util.py since both
   scylla_io_setup and scylla_prepare import functions from it
2) Clean up scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such a
   failure

Fixes: #8587

Closes #8512
2021-05-11 18:39:10 +03:00
Michael Livshin
4ea6c7cd49 scylla-gdb.py: recognize new sstable reader types
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Nadav Har'El
fb0c4e469a Merge 'token_metadata: Fix get_all_endpoints to return nodes in the ring' from Asias He
The get_all_endpoints() should return the nodes that are part of the ring.

A node inside _endpoint_to_host_id_map does not guarantee that the node
is part of the ring.

To fix, return from _token_to_endpoint_map.

Fixes #8534

Closes #8536

* github.com:scylladb/scylla:
  token_metadata: Get rid of get_all_endpoints_count
  range_streamer: Handle everywhere_topology
  range_streamer: Adjust use_strict_sources_for_ranges
  token_metadata: Fix get_all_endpoints to return nodes in the ring
2021-05-11 18:39:10 +03:00
Michael Livshin
513695c5ba scylla-gdb.py: make list_unordered_map more resilient
Some unordered_map instantiations have cache=true, some cache=false,
but we don't need to care.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
2a386c06d9 scylla-gdb.py: robustify netw & gms
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
76c2d792c9 scylla-gdb.py: redo find_db() in terms of sharded()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
ed2d471e79 scylla-gdb.py: debug::logalloc_alignment may not exist
I haven't found a way to make it stay -- __attribute__((used)) is not
enough and apparently lld is going to ignore __attribute__((retain))
until at least LLVM 13.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
77d8272cca scylla-gdb.py: handle changed container type of keyspaces
Used to be std::unordered_map, but is a flat_hash_map now.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
69a5aef620 scylla-gdb.py: walk intrusive containers using provided link fields
clang & gdb apparently conspire to not reveal template argument types
beyond the first one -- at least for some templates, and definitely
for Boost's intrusive container ones.  This severely restricts our
ability to find the right intrusive list link by examining the
container type.

Allow the caller to simply provide the relevant field name, so we
don't have to guess.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
73f9f08df6 test: add a basic test for scylla-gdb.py
(And disable it initially, because it won't pass without subsequent
commits)

Runs only in release mode, to keep things more realistic.

Doesn't exercise Scylla much at present -- just stops it after several
compactions and tries (almost) all "scylla *" commands in order.

Refs #6952.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
3bff94cd29 test.py: refine test mode control
* Add ability to skip tests in individual modes using "skip_in_<mode>".

* Add ability to allow tests in specific modes using "run_in_<mode>".

* Rename "skip_in_debug_mode" to "skip_in_debug_modes", because there
  is an actual mode named "debug" and this is confusing.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Piotr Sarna
1cb804f024 cql-pytest: add regression tests for index creation
This commit adds unit tests for an issue with index creation
after a table with a malicious (conflicting) name has already been created.
The cases cover both indexes with a default name and ones with
an explicit name set.
2021-05-11 17:34:37 +02:00
Piotr Sarna
0ef0a4c78d cql3: fail to create an index if there is a name conflict
When an index with an explicit name is created, its underlying
materialized view's name is set to <index-name>_index.
If there already exists a regular table with such a name,
the creation should fail with a proper error message.
2021-05-11 15:21:00 +02:00
Piotr Sarna
fa53bf5c1e database: check for conflicting table names for indexes
When an index is created without an explicit name, a default name
is chosen. However, there was no check whether a table with a conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
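
The name selection described above can be sketched roughly as follows. This is
an illustration only, not the code from this commit: the function name
pick_index_name, the numeric suffix scheme, and the assumption that the backing
view of a default-named index is also called <index-name>_index are all
hypothetical.

```python
def pick_index_name(base_name, existing_tables):
    """Return an index name whose backing view name clashes with no table.

    Assumption: the backing materialized view is named "<index-name>_index".
    The "_<n>" suffix scheme is hypothetical.
    """
    candidate = base_name
    suffix = 0
    while f"{candidate}_index" in existing_tables:
        # A table with the would-be view name exists, so try the next name.
        suffix += 1
        candidate = f"{base_name}_{suffix}"
    return candidate


tables = {"t", "t_v_idx_index"}
chosen = pick_index_name("t_v_idx", tables)
```

With the tables above, the first candidate conflicts and the next one is used.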
2021-05-11 15:20:59 +02:00
Botond Dénes
69d04d161e test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
Not really testing anything, at least not automatically. It just
provides coverage for the diagnostics dump code, and allows
developers to inspect the printout visually when making changes.
2021-05-10 18:06:30 +03:00
Botond Dénes
542be8d208 scylla-gdb.py: introduce scylla read-stats
Too many or too resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard with their states and
resource consumptions. This is exactly what this new command does. For
this it uses the reader concurrency semaphores and their permits
respectively, which are now arranged in an intrusive list and therefore
are enumerable.
Example output:
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
   permits count       memory table/description/state
         1     1     14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
        16     0        53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active
         1     0         1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
         1     0            0 *.*/view_builder/active
         1     0            0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
        20     1     14334414 Total

The command accepts a list of semaphores to dump reads from as its
arguments, or if none are provided, it will dump reads from the
semaphores of the local database instance.
2021-05-10 16:38:26 +03:00
Piotr Sarna
7f086d8f73 docs: add a paragraph describing service level timeouts
Along with examples.
2021-05-10 12:39:41 +02:00
Piotr Sarna
570c63d39b test: add per-service-level timeout tests
The test suite checks if per-service-level timeouts
work and validate their input.
2021-05-10 12:39:41 +02:00
Piotr Sarna
43f1f9e445 test: add refreshing client state
With a helper client state refresher, some attributes
which are usually only refreshed after a client disconnects
and then reconnects, can be verified in the test suite.
2021-05-10 12:39:41 +02:00
Piotr Sarna
6da59b8a38 transport: add updating per-service-level params
Per-service-level parameters (currently timeouts)
are now updated when a new connection is established.
The other connections which have the changed role are currently
not immediately reloaded.
2021-05-10 12:39:41 +02:00
Piotr Sarna
7ee5686d6c client_state: allow updating per service level params
Per-service-level params can now be updated with a helper function.
2021-05-10 12:39:41 +02:00
Piotr Sarna
368a6976ff qos: allow returning combined service level options
Originally, the API for finding a service level controller returned
its name, which also implied that only a single service level
may be active for a user and provide its options.
After adding timeout parameters it makes more sense to return a result
which combines multiple service level parameters - e.g. a user
can be attached to one level for read timeouts and a separate one
for write timeouts.
2021-05-10 12:39:41 +02:00
Piotr Sarna
cbedefb0f9 qos: add a way of merging service level options
In order to combine multiple service level options coming from
multiple roles, a helper function is provided to merge two
of them. The semantics depend on each parameter, but for timeouts,
which are the only parameters at the time of writing this message,
the minimum value of the two is taken. That in particular means
that when service level A has timeout = 50ms and service level B
has timeout = 1s, the resulting service level options
would set the timeout to 50ms.
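
A minimal sketch of the merge semantics described above, written in Python for
brevity. The ServiceLevelOptions type and the merge function are hypothetical
names, and treating an unset timeout as "take the other value" is an
assumption; only the "minimum of the two timeouts" rule comes from this
message.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ServiceLevelOptions:
    # None stands for "no timeout set" (an assumption of this sketch).
    timeout: Optional[timedelta] = None


def merge(a, b):
    """Combine two sets of options; for timeouts the smaller value wins."""
    if a.timeout is None:
        return b
    if b.timeout is None:
        return a
    return ServiceLevelOptions(timeout=min(a.timeout, b.timeout))


level_a = ServiceLevelOptions(timeout=timedelta(milliseconds=50))
level_b = ServiceLevelOptions(timeout=timedelta(seconds=1))
merged = merge(level_a, level_b)  # timeout is 50ms, as in the example above
```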
2021-05-10 12:39:41 +02:00
Piotr Sarna
4ba1ac57a1 cql3: add preserving default values for per-sl timeouts
In order for per-service-level timeouts to work as expected,
a special value is reserved for internally marking the timeouts
as deleted.
2021-05-10 11:48:14 +02:00
Piotr Sarna
fb4e8951f5 qos: make getting service level public 2021-05-10 11:48:14 +02:00
Piotr Sarna
06d0e1853d qos: make finding service level public 2021-05-10 11:48:14 +02:00
Piotr Sarna
e257ec11c0 treewide: remove service level controller from query state
... since it's accessible through its member, client state.
2021-05-10 11:48:14 +02:00
Piotr Sarna
d1f2e8b469 treewide: propagate service level to client state
... since it's going to be used to set up per-service-level
timeouts.
2021-05-10 11:48:14 +02:00
Piotr Sarna
00e59a9823 sstables: disambiguate boost::find
There are multiple functions named `find` in boost,
so to avoid future clashes, this one is explicitly marked
as belonging to boost::range.
2021-05-10 11:48:14 +02:00
Piotr Sarna
04880f4e44 cql3: add a timeout column to LIST SERVICE LEVEL statement
Listing service levels now includes the timeout parameter.
2021-05-10 11:48:14 +02:00
Piotr Sarna
e8d271fea9 db: add extracting service level info via CQL 2021-05-10 11:45:09 +02:00
Piotr Sarna
b7a8aecb39 types: add a missing translation for cql_duration
Its data type is duration_type.
2021-05-10 11:04:39 +02:00
Piotr Sarna
e225e01449 cql3: allow unsetting service level timeouts
via using 'null' as a value.
2021-05-10 11:04:36 +02:00
Piotr Sarna
6e83054497 cql3: add validating service level timeout values
The checks cover proper granularity (1ms) and not using negative
values.
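
The two checks can be illustrated with a short Python sketch. The function
name validate_timeout and the choice of nanoseconds as the input unit are
assumptions made for the example, not details taken from the patch.

```python
NANOS_PER_MS = 1_000_000


def validate_timeout(nanoseconds):
    """Reject negative timeouts and timeouts finer than 1ms (hypothetical helper)."""
    if nanoseconds < 0:
        raise ValueError("timeout must not be negative")
    if nanoseconds % NANOS_PER_MS != 0:
        raise ValueError("timeout must be a whole number of milliseconds")
    return nanoseconds
```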
2021-05-10 11:00:51 +02:00
Piotr Sarna
7bb34fdede db: add setting service level params via system_distributed
Service level params (various timeout values) are now properly
stored in system_distributed.service_levels table.
2021-05-10 10:43:23 +02:00
Piotr Sarna
4ce83b9a93 cql3: add fetching service level attrs in ALTER and CREATE
ALTER SERVICE LEVEL and CREATE SERVICE LEVEL statements now
extract service level attrs and pass them to the service level
controller.
2021-05-10 10:43:23 +02:00
Piotr Sarna
aa37974192 cql3: add timeout to service level params
Timeout value can now be properly parsed from CQL.
2021-05-10 10:43:21 +02:00
Piotr Sarna
3339ea1d0d qos: add timeout to service level info
Service level information now consists of the timeout config,
which stores the timeout value for all operations.
2021-05-10 10:22:11 +02:00
Piotr Sarna
ef8da7930f db,sys_dist_ks: add timeout to the service level table
In order to be able to store timeouts in the service level table,
an appropriate column is added.
2021-05-10 10:10:38 +02:00
Piotr Sarna
7e6beabf27 migration_manager: allow table updates with timestamp
In order to avoid needless schema disagreements, a way of announcing
a schema change with fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
2021-05-10 10:10:38 +02:00
Piotr Sarna
774d7546d9 cql3: allow a null keyword for CQL properties
This keyword is going to be useful for resetting service level
parameters.
2021-05-10 10:10:38 +02:00
Botond Dénes
a6166671ef reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
Provide a quick summary in the first line of the printout, about the
available/initial resources, number of queued reads and number of
inactive reads.
2021-05-10 10:15:47 +03:00
Botond Dénes
0a908a47d6 reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
This report is logged, so we don't want huge printouts, cap the table at
20 lines, and print only a summary for the rest.
For manual dumps, allow the limit to be set to a custom value, including
no limit at all.
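
A rough Python sketch of the capping logic, assuming rows are
(description, memory) pairs sorted by memory in descending order. The function
name format_rows, the row layout, and the wording of the summary line are
hypothetical; only the 20-line default and the "summary for the rest"
behaviour come from this message.

```python
def format_rows(rows, max_lines=20):
    """Format permit rows, printing at most max_lines of them.

    rows: list of (description, memory) pairs.
    max_lines=0 means "no limit" in this sketch.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    if max_lines and len(ordered) > max_lines:
        shown, rest = ordered[:max_lines], ordered[max_lines:]
    else:
        shown, rest = ordered, []
    lines = [f"{memory:>12} {description}" for description, memory in shown]
    if rest:
        # Collapse everything past the cap into one summary line.
        omitted_memory = sum(memory for _, memory in rest)
        lines.append(f"{omitted_memory:>12} permits omitted: {len(rest)}")
    return lines
```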
2021-05-10 10:15:47 +03:00
Botond Dénes
f0fc3eaefc reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
So the largest memory consumers are at the top.
2021-05-10 10:15:47 +03:00
Botond Dénes
06e17c48e5 reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
The goal of the printout is to allow finding the culprit for semaphore
related problems and this usually involves finding the table/op/state
eating the most memory. This is much easier when all the permit
summaries are in a single table.
2021-05-10 10:15:47 +03:00
Botond Dénes
595a44bee2 reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
Currently we have a single "count" column and it is not at all clear
what it refers to: the number of permits or count resources used by
them. Whichever it is, it only represents one of them, so in this commit
we add a "permits" column, which in addition to clearing things up,
supplies further information to the printout.
2021-05-10 10:15:47 +03:00
Avi Kivity
dd904f7776 storage_proxy: place unique_response_handler:s in small_vector instead of std::vector
This cuts an allocation in the write path. Instruction count reduction isn't
large, but performance does improve (results are consistent):

before: 196369.48 tps ( 55.2 allocs/op,  13.2 tasks/op,   51658 insns/op)
after:  199290.32 tps ( 54.2 allocs/op,  13.2 tasks/op,   51600 insns/op)

(this is perf_simple_query --write --smp 1 --operations-per-shard 1000000)
2021-05-09 23:53:10 +03:00
Avi Kivity
b015dedd96 storage_proxy: make unique_response_handler friendly to small_vector
small_vector wants the move constructor to be noexcept, and move
assignment to exist (and be noexcept). These are easy to achieve.
2021-05-09 19:22:20 +03:00
Avi Kivity
8398510382 storage_proxy: give a name to a vector of unique_response_handlers
We'd like to change the vector to a more involved type, so to avoid
repeating it everywhere, give it a name. The actual type isn't changed
in this patch.
2021-05-09 19:17:50 +03:00
Benny Halevy
e041411947 storage_service: do_isolate_on_error: promote warning to error
This is a serious condition that warrants a log error.

Fixes #8610

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210509110221.146275-1-bhalevy@scylladb.com>
2021-05-09 14:15:50 +03:00
Gleb Natapov
78c5a72b32 raft: drop _leader_progress tracking from the tracker
The tracker maintains a separate pointer to the current leader's progress,
but all this complexity is not needed because the tracker already has a
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies the code and makes going out of sync
(which is always a possibility if a state is maintained in two different
places) impossible.
2021-05-09 13:55:55 +03:00
Gleb Natapov
1245736776 raft: move current_leader into the follower state
Only when fsm is in the follower state current_leader has any meaning.
In the leader state a node is always its own follower and in a candidate
state there is no leader. To make sure that the current_leader value
cannot be out of sync with fsm state move it into the follower state.
2021-05-09 13:55:55 +03:00
Raphael S. Carvalho
8480839932 LCS/reshape: Don't reshape single sstable in level 0 with strict mode
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.

This is fixed by respecting the offstrategy threshold. Unit test is
added.

Fixes #8573.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
2021-05-09 11:09:54 +03:00
Benny Halevy
2a168c3224 atomic_cell: get rid of is_value_fragments
It isn't used.  Along with it, get rid also of:
managed_bytes::is_fragmented and
managed_bytes_basic_view::is_fragmented

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210506174115.171048-1-bhalevy@scylladb.com>
2021-05-09 11:08:53 +03:00
Botond Dénes
04c5e42f80 HACKING.md: Core dump debugging: link to debugging.md
Instead of some slides from an internal summit. debugging.md has many
more details than said slides (which lack the associated voice
recording).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210507125956.381763-1-bdenes@scylladb.com>
2021-05-07 15:32:12 +02:00
Botond Dénes
93cff4925d HACKING.md: update the coverage guide
To include the automated way via test.py as well as the manual way, via
coverage.py.
2021-05-07 15:54:49 +03:00
Botond Dénes
435d699393 test.py: add basic coverage generation support
Add support for the newly added coverage mode. When --mode=coverage,
also invoke the coverage generation report script to produce a coverage
report after having run the tests.
There are still some rough edges, alternator and cql tests don't work.
2021-05-07 15:54:49 +03:00
Botond Dénes
bc3a424b0e scripts: introduce coverage.py
This script finds all the profiling data generated by tests at the
specified path, merges it, and generates an HTML report combining all
the results.
It can be used both as a standalone python script, or imported and
invoked from python directly.
2021-05-07 15:54:49 +03:00
Botond Dénes
c2808fcd0d configure.py: replace --coverage with a coverage build mode
A separate build mode is a much better fit for coverage generation.
Generating coverage requires certain flags and optimization modes,
which is much better expressed with a separate build mode than by
bolting it on top of an existing one, possibly conflicting with its own
requirements.
This patch therefore converts the current `--coverage` flag to a build
mode of its own. The build mode is based on debug mode, in fact seastar
is built in plain debug mode, with some extra cflags.
The new build mode is called "coverage" and it is a non-default mode (by
default configure.py doesn't generate build files for it).
2021-05-07 15:23:31 +03:00
Botond Dénes
62cc0fcb78 configure.py: make the --help output more readable
The huge amount of choices for the --with argument obscures the help
output, making it hard to read. This patch removes the choices list and
instead manually checks the passed in artifacts. Unknown artifacts are
removed from the list and if it remains empty the script is aborted.
Available artifacts can be listed by a new --list-artifacts flag.
2021-05-07 15:23:29 +03:00
Botond Dénes
7f3228a197 configure.py: add build mode descriptions
A short description of each build mode in the help text of the option
which chooses them.
2021-05-07 14:49:22 +03:00
Botond Dénes
693c2cc20a configure.py: fix fallback mode selection for checkheaders target
Currently modes[0] is used as the fallback when 'dev' is not available.
But modes is a dict with mode names as keys, so this won't work. Replace
it with modes.keys()[0] to select the first key instead.
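
As a side note, if the script runs under Python 3, dict.keys() returns a view
that cannot be indexed, so a fallback of this kind would typically be written
with next(iter(...)). The sketch below assumes a modes dict keyed by mode
name; the helper name fallback_mode and the sample contents are hypothetical.

```python
def fallback_mode(modes):
    """Prefer 'dev'; otherwise fall back to the first configured mode."""
    if "dev" in modes:
        return "dev"
    # Dicts preserve insertion order (Python 3.7+), so this is the first key.
    return next(iter(modes))


modes = {"debug": {}, "release": {}}  # hypothetical contents
selected = fallback_mode(modes)
```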
2021-05-07 11:35:01 +03:00
Botond Dénes
240ee1070c configure.py: centralize the declaration of build modes
Currently the declaration of build modes is scattered throughout the
script. We have several places where build modes are mentioned
hardcoded, and related configuration is also scattered in several data
structures. This commit centralizes all this into a single data
structure, all other code uses this to iterate over modes and to mutate
their configuration.

This patch was motivated by the wish to make it easier to add a new
build mode, which is what the next patch does. This is not something we
do often, but I believe these changes also serve to make the code easier
to understand and modify.
2021-05-07 11:31:48 +03:00
Botond Dénes
365951f7f7 scylla-gdb.py: add pretty printer for std::string_view
This should really be provided by the C++ standard library and indeed I
do recall pretty-printing for std:: types working sometimes. It doesn't
work most of the time, however, which is not a disaster if you just
want to use them in the gdb shell, but is not fine if we want to rely
on them in internal scripts, which is what the next patch does. So
provide it.
2021-05-07 09:09:21 +03:00
Gleb Natapov
0634674aef raft: add some precondition checks
Check that fsm does not process messages from itself and that it does
not try to become its own follower.
2021-05-07 08:04:16 +03:00
Tomasz Grabiec
abe3d7d7d3 Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.

This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.

Test results (perf_simple_query --smp 1 --task-quota-ms 10):

```
before: median 184810.98 tps ( 91.1 allocs/op,  20.1 tasks/op,   54564 insns/op)
after:  median 192125.99 tps ( 87.1 allocs/op,  20.1 tasks/op,   53673 insns/op)
```

4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).

The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.

I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.

Closes #8592

* github.com:scylladb/scylla:
  storage_proxy, treewide: use utils::small_vector inet_address_vector:s
  storage_proxy, treewide: introduce names for vectors of inet_address
  utils: small_vector: add print operator for std::ostream
  hints: messages.hh: add missing #include
2021-05-06 18:00:54 +02:00
Tomasz Grabiec
6aec8cc447 Merge "raft: fixes and improvements for snapshot transfer" from Gleb
* scylla-dev/raft-snapshot-fixes-v4:
  raft: document that add entry may throw commit_status_unknown
  raft: test: add test of a leadership change during ongoing snapshot transfer
  raft: test: retry submitting an entry if it was dropped
  raft: test: wait for the log to be fully replicated on new leader only
  raft: drop waiters with outdated terms
  raft: make snapshot transfer abortable
  raft: accept snapshots transfer from multiple nodes simultaneously
  raft: do not send probes while transferring snapshot
  raft: handle messages sending errors
  raft: test: return error from rpc module if nodes are disconnected
  raft: fix a typo in a variable name
2021-05-06 17:44:22 +02:00
Avi Kivity
d6d6758857 Merge 'Switch to use NODE_OPS_CMD for decommission and bootstrap operation' from Asias He
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.

In this patch set, we continue the work to switch decommission and bootstrap
operation to use NODE_OPS_CMD.

Fixes #8472
Fixes #8471

Closes #8481

* github.com:scylladb/scylla:
  repair: Switch to use NODE_OPS_CMD for bootstrap operation
  repair: Switch to use NODE_OPS_CMD for decommission operation
2021-05-06 17:28:19 +03:00
Avi Kivity
f2132150c4 Merge "Extract reader concurrency semaphore tests into separate file" from Botond
"
The current home of these tests -- mutation_reader_test -- is already
one of the largest test files we have. To reduce the size of the former
and to make finding these tests easier they are moved to a separate
file.
"

* 'reader-concurrency-semaphore-test/v2' of https://github.com/denesb/scylla:
  test: move reader_concurrency_semaphore related tests into separate file
  test: mutation_reader_test: convert restricted reader tests to semaphore tests
2021-05-06 17:13:45 +03:00
Gleb Natapov
aa7ea333da raft: document that add entry may throw commit_status_unknown 2021-05-06 11:59:36 +03:00
Gleb Natapov
3a1bff26dd raft: test: add test of a leadership change during ongoing snapshot transfer 2021-05-06 11:34:31 +03:00
Gleb Natapov
612e0f08c4 raft: test: retry submitting an entry if it was dropped 2021-05-06 11:34:31 +03:00
Gleb Natapov
0b2c9c549a raft: test: wait for the log to be fully replicated on new leader only
When forcing a new leader, it should be enough to wait for the log to be
fully replicated to that particular leader.
2021-05-06 11:34:31 +03:00
Gleb Natapov
d2f58d8656 raft: drop waiters with outdated terms
Currently an entry is declared to be dropped only when an entry with
different term is committed with the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.

Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry which will be committed eventually and will release
one of the waiters on A, but if nothing else is submitted to B, the 9 other
waiters will wait forever.

The way to solve that is to drop all waiters that wait for a term
smaller than the one being committed. There is no chance they will be
committed any longer since terms in the log may only grow.
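
The rule can be sketched in a few lines of Python. This is a simplified model,
not the raft code itself: waiters are represented as a plain dict from log
index to the term the entry was submitted in, and the function name
drop_outdated_waiters is hypothetical.

```python
def drop_outdated_waiters(waiters, committed_term):
    """Split waiters into those still pending and those that can never commit.

    waiters: dict mapping log index -> term of the submitted entry.
    Returns (remaining, dropped_indexes).
    """
    dropped = sorted(idx for idx, term in waiters.items() if term < committed_term)
    remaining = {idx: term for idx, term in waiters.items() if term >= committed_term}
    return remaining, dropped


# Ten entries submitted on A in term 1; B then commits a dummy entry in term 2.
waiters_on_a = {index: 1 for index in range(1, 11)}
remaining, dropped = drop_outdated_waiters(waiters_on_a, committed_term=2)
```

In this model all ten waiters on A are released at once instead of nine of
them waiting for further commits.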
2021-05-06 11:34:31 +03:00
Gleb Natapov
6abe2772dc raft: make snapshot transfer abortable
A snapshot transfer may take a lot of time and meanwhile the leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfers abortable and
abort them when the leader changes.
2021-05-06 11:34:31 +03:00
Gleb Natapov
50d545a138 raft: accept snapshots transfer from multiple nodes simultaneously
A leader may change while one of its followers is in snapshot transfer
mode, and that node may get an additional snapshot transfer request from
a new leader while the previous transfer is still not aborted. Currently
such a situation triggers an assert. This patch allows active
snapshot transfers from multiple nodes, but only one of them will succeed
in the end; all others will be replied to with 'fail'.
2021-05-06 11:34:31 +03:00
Gleb Natapov
073a9be4c7 raft: do not send probes while transferring snapshot
If a follower is in snapshot transfer mode there is no need to send
probe append messages to it.
2021-05-06 11:34:31 +03:00
Gleb Natapov
08077a21b7 raft: handle messages sending errors
Failing to send a message should not abort the raft server.
2021-05-06 11:34:31 +03:00
Gleb Natapov
d0ebd79deb raft: test: return error from rpc module if nodes are disconnected
Returning an error when nodes are disconnected more closely resembles what
will happen in real networking.
2021-05-06 11:34:31 +03:00
Gleb Natapov
c4d87d7a23 raft: fix a typo in a variable name 2021-05-06 11:33:47 +03:00
Asias He
5a410cb6e3 token_metadata: Get rid of get_all_endpoints_count
It is now only a wrapper for count_normal_token_owners.

Refs #8534
2021-05-06 15:36:20 +08:00
Botond Dénes
c872a963b6 test: move reader_concurrency_semaphore related tests into separate file
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find and making the mutation reader test a little bit
smaller too.
2021-05-06 08:59:47 +03:00
Botond Dénes
5f217b6dee test: mutation_reader_test: convert restricted reader tests to semaphore tests
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) are testing the semaphore in
reality, but through the restricted reader, which is distracting as it
needlessly brings in an additional layer into the picture. Rewrite them
to test the semaphore directly, getting much lighter in the process.
2021-05-06 08:57:12 +03:00
Botond Dénes
77da141604 scylla-gdb.py: std_map() add __len__() 2021-05-06 08:41:49 +03:00
Botond Dénes
0b6705c253 scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()
Apparently, if a container is passed to `list()` it tries to obtain its
size first, which in this case leads to infinite recursion. To prevent
this, coerce `self` to an iterator.
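
A minimal, standalone sketch of the recursion and of the fix. The classes
below are hypothetical stand-ins, not the actual scylla-gdb.py helpers:
`list(obj)` asks `obj` for a length hint, which ends up in `__len__()`
again, while `iter(obj)` yields a plain generator that has no `__len__()`.

```python
class IntrusiveListSketch:
    """Hypothetical stand-in for a gdb helper that wraps a container."""

    def __init__(self, items):
        self._items = items

    def __iter__(self):
        for item in self._items:
            yield item


class RecursiveLen(IntrusiveListSketch):
    def __len__(self):
        # list(self) queries self for a length hint, which calls __len__ again.
        return len(list(self))


class SafeLen(IntrusiveListSketch):
    def __len__(self):
        # iter(self) is a generator without __len__, so list() just consumes it.
        return len(list(iter(self)))


def length_or_error(container):
    """Return len(container), or the string 'RecursionError' if it recursed."""
    try:
        return len(container)
    except RecursionError:
        return "RecursionError"
```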
2021-05-06 08:41:49 +03:00
Asias He
4793894fac range_streamer: Handle everywhere_topology
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes it impossible to stream only from the node losing the
range, since no node is losing the range after bootstrap.

Shortcut this by not using strict sources when the keyspace uses
everywhere_topology.

Refs #8503
2021-05-06 10:02:11 +08:00
Asias He
1b7414860b range_streamer: Adjust use_strict_sources_for_ranges
Now that get_all_endpoints() returns the number of nodes in the ring, we
need to adjust the check for whether to use strict sources.

Use strict sources when the number of nodes in the ring is equal to or
greater than RF.

Refs #8534
2021-05-06 10:02:11 +08:00
Asias He
ddeabba6aa token_metadata: Fix get_all_endpoints to return nodes in the ring
The get_all_endpoints() should return the nodes that are part of the ring.

A node inside _endpoint_to_host_id_map does not guarantee that the node
is part of the ring.

To fix, return from _token_to_endpoint_map.

Fixes #8534
2021-05-06 10:02:11 +08:00
Avi Kivity
e9802348b5 storage_proxy, treewide: use utils::small_vector inet_address_vector:s
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.

In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
2021-05-05 18:36:54 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Gleb Natapov
745f63991f raft: test: fix c&p error in a test
Message-Id: <YJKBOwBX8hqHLxsB@scylladb.com>
2021-05-05 17:18:49 +02:00
Avi Kivity
ddb1f0e6ca Merge "Choose the user max-result-size for service levels" from Botond
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.

This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
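
A rough sketch of the idea, not the actual database code: one
classification function feeds both getters, and unknown scheduling groups
(for example a service-level group) fall back to the user class. The group
names, semaphore names and limit values below are made up for
illustration.

```python
from enum import Enum


class QueryClass(Enum):
    USER = "user"
    SYSTEM = "system"
    MAINTENANCE = "maintenance"


UNLIMITED = 2**64 - 1

# Hypothetical scheduling group names; anything not listed is "unknown".
KNOWN_GROUPS = {
    "main": QueryClass.SYSTEM,
    "streaming": QueryClass.MAINTENANCE,
    "statement": QueryClass.USER,
}

# Hypothetical per-class configuration.
SEMAPHORES = {
    QueryClass.USER: "user_semaphore",
    QueryClass.SYSTEM: "system_semaphore",
    QueryClass.MAINTENANCE: "maintenance_semaphore",
}
MAX_RESULT_SIZES = {
    QueryClass.USER: 1024 * 1024,
    QueryClass.SYSTEM: UNLIMITED,
    QueryClass.MAINTENANCE: UNLIMITED,
}


def classify_query(scheduling_group):
    """Shared classification: unknown groups are treated as user queries."""
    return KNOWN_GROUPS.get(scheduling_group, QueryClass.USER)


def get_reader_concurrency_semaphore(scheduling_group):
    return SEMAPHORES[classify_query(scheduling_group)]


def get_unlimited_query_max_result_size(scheduling_group):
    return MAX_RESULT_SIZES[classify_query(scheduling_group)]
```

Because both getters go through `classify_query()`, a future change to the
classification applies to the semaphore and to the max result size at once.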

Fixes: #8591

Tests: unit(dev)
"

* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
  service/storage_proxy: get_max_result_size() defer to db for unlimited queries
  database: add get_unlimited_query_max_result_size()
  query_class_config: add operator== for max_result_size
  database: get_reader_concurrency_semaphore(): extract query classification logic
2021-05-05 18:11:10 +03:00
Lauro Ramos Venancio
15f72f7c9e TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.

This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590
2021-05-05 17:31:05 +03:00
Avi Kivity
1ed3f54f4a Merge "size_tiered_compaction_strategy: get_buckets improvements" from Benny
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:

1. Consider only current bucket for each sstable.
   No need to scan all buckets using a map
   since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
   Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
   As we insert increasingly larger sstables into a bucket,
   its average size drifts up and eventually this may break
   the bucket invariant that all sstables in the bucket should
   be within the (bucket_low, bucket_high) range relative
   to the bucket average.

Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
    compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy

Fixes #8584
"

* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
  compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
  compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
  compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
  compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
  compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
2021-05-05 16:25:12 +03:00
Avi Kivity
6977064693 dist: scylla_raid_setup: reduce xfs block size to 1k
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.

That turns out to be 1024 for crc-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
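
The selection rule described above can be sketched as follows. This is an
illustration under stated assumptions, not the script's code: the function
names are hypothetical, and only the `-b size=` option of mkfs.xfs is
shown (any other options the script passes are left out).

```python
# Minimum block size for crc-protected XFS filesystems, per the text above.
MIN_CRC_BLOCK_SIZE = 1024


def xfs_block_size(sector_size):
    """Smallest usable block size: 1024, but never below the sector size."""
    return max(MIN_CRC_BLOCK_SIZE, sector_size)


def mkfs_block_size_args(sector_size):
    """Block-size arguments for mkfs.xfs (other options omitted)."""
    return ["-b", f"size={xfs_block_size(sector_size)}"]
```

For a 512-byte sector this gives `-b size=1024`; for a 4096-byte sector it
gives `-b size=4096`.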

Tested on AWS i3.large.

Fixes #8156.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b

Closes #8585
2021-05-05 16:07:50 +03:00
Nadav Har'El
64a4e5e059 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
2021-05-05 15:24:25 +03:00
Nadav Har'El
5fbd78ed96 CONTRIBUTING.md: add the requirement for self-contained headers
As far as I can tell, we never documented the requirement for self-contained
headers in our coding style. So let's do it now, and explain the
"ninja dev-headers" command and how to use it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
2021-05-05 15:10:46 +03:00
Botond Dénes
9a32889ac0 test: boost/sstable_datafile_test: add tests for segregate mode scrub
Add two new unit tests dedicated to the new segregate scrub mode.
2021-05-05 14:35:04 +03:00
Botond Dénes
550a1cd036 api: storage_service/keyspace_scrub: expose new segregate mode
Allow invoking scrub with the newly added segregate mode as well.
2021-05-05 14:35:04 +03:00
Botond Dénes
674a77ead0 sstables: compaction/scrub: add segregate mode
In segregate mode scrub will segregate the content of input sstables
into potentially multiple output sstables such that they respect
partition level and fragment level monotonicity requirements. This can
be used to fix data where partitions or even fragments are out-of-order
or duplicated. In this case no data is lost and after the scrub each
sstable contains valid data.
Out-of-order partitions are fixed by simply being written into a
separate output, compared to the last one compaction was writing into.
Out-of-order fragments are fixed by injecting a
partition-end/partition-start pair right before them, effectively
moving them into a separate (duplicate) partition which is then treated
in the above mentioned way.
This mode can fix corruptions where partitions are out-of-order or
duplicated.
This mode cannot fix corruptions where partitions were merged: although
data will be made valid at the database level, it won't be at the
business-logic level.
2021-05-05 14:33:49 +03:00
Benny Halevy
ead96e21c3 compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:37 +03:00
Benny Halevy
c1681cb9ea compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
SSTables are added in increasing size order so the bucket's
average might drift upwards.
Don't let it drift too high, to a point where the smallest
SSTable might fall out of range.

For example, here's a simulation run of the algorithm for these sstable sizes:
    [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774]

the simulated compaction strategy options are:
min_sstable_size = 4
bucket_low = 0.66667
bucket_high = 1.5

For each bucket, the following is printed: (avg * bucket_low) avg (avg * bucket_high)

UNCHANGED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 276.4)  414.6 ( 621.9): [252, 363, 379, 394, 407, 428, 463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

IMPROVED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 247.0)  370.5 ( 555.8): [252, 363, 379, 394, 407, 428]
    ( 320.5)  480.8 ( 721.1): [463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

Fixes #8584

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:28 +03:00
Benny Halevy
d3aa5265ab compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
    auto total_size = bucket.size() * old_average_size;
    auto new_average_size = (total_size + size) / (bucket.size() + 1);
```

We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.

Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.

Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
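
A small illustration of the accumulated rounding error, written in Python
rather than C++; `running_average` is a hypothetical helper that applies
the update formula above either with integer division or with floating
point.

```python
def running_average(sizes, integer_division):
    """Fold sizes into a running average, one element at a time."""
    average = sizes[0] if integer_division else float(sizes[0])
    for count, size in enumerate(sizes[1:], start=1):
        total = count * average  # reconstructed from the (possibly rounded) average
        if integer_division:
            average = (total + size) // (count + 1)
        else:
            average = (total + size) / (count + 1)
    return average


int_average = running_average([1, 2, 2, 2], integer_division=True)
float_average = running_average([1, 2, 2, 2], integer_division=False)
```

For the sizes 1, 2, 2, 2 the integer variant stays at 1 after every step,
while the true average is 1.75.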

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:25 +03:00
Benny Halevy
44b094f9a5 compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:20 +03:00
Benny Halevy
336a4dc0fd compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.

Instead, just consider the most recently inserted bucket.

Once we see a sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.

Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:05 +03:00
Botond Dénes
9d5e958331 service/storage_proxy: get_max_result_size() defer to db for unlimited queries
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
2021-05-05 13:30:50 +03:00
Botond Dénes
992819b188 database: add get_unlimited_query_max_result_size()
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
2021-05-05 13:30:42 +03:00
Nadav Har'El
58e275e362 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
2021-05-05 13:23:00 +03:00
Avi Kivity
83a826a4de Merge 'Azure Ls v2 local disk setup' from Lubos Kosco
fixes #8325

The iotune tests happened on Centos 8.2 both with stock and elrepo kernel, using Scylla 4.3 rc3

results are in https://docs.google.com/spreadsheets/d/1_uYq8UxY47XF5jreetrpleykLPqNGjfPXIirvTPh6rk/edit#gid=1101336711

Closes #7807

* github.com:scylladb/scylla:
  scylla_io_setup: add disk properties for L Azure instances
  scylla_util.py: add new class for Azure cloud support
2021-05-05 12:39:00 +03:00
Avi Kivity
3114f09d76 utils: small_vector: add print operator for std::ostream
In order to replace std::vector with utils::small_vector, it needs to
support this feature too.
2021-05-05 12:10:59 +03:00
Avi Kivity
84ea06f15b hints: messages.hh: add missing #include
Make the header self-contained.
2021-05-05 12:10:17 +03:00
Botond Dénes
104a47699c mutation_fragment_stream_validator: add reset methods
Allow resetting the validator to a given partition or mutation fragment.
This allows a user which is able to fix corrupt streams to reset the
validator to a partition or row which the validator normally wouldn't
accept and hence it wouldn't advance its internal state to it.
2021-05-05 12:03:42 +03:00
Botond Dénes
a53e6bc6e8 mutation_writer: add segregate_by_partition
Add a new segregator which segregates a stream, potentially containing
duplicate or even out-of-order partitions, into multiple output streams,
such that each output stream has strictly monotonic partitions.
This segregator will be used by a new scrub compaction mode which is
meant to fix sstables containing duplicate or out-of-order data.
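
A minimal sketch of the idea on plain integer keys. The greedy policy used
here (append to the first output whose last key is smaller, otherwise open
a new output) is an assumption made for illustration; the actual
segregator may pick outputs differently.

```python
def segregate_by_partition(keys):
    """Split keys into output lists that are each strictly increasing."""
    outputs = []
    for key in keys:
        for output in outputs:
            if output[-1] < key:
                output.append(key)
                break
        else:
            # Duplicate or out-of-order key for every existing output.
            outputs.append([key])
    return outputs


streams = segregate_by_partition([1, 3, 2, 3, 5, 4])
```

No key is dropped: duplicates and out-of-order keys end up in another
output stream instead.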
2021-05-05 12:03:42 +03:00
Botond Dénes
34643ac997 api: /storage_service/keyspace_scrub: add scrub mode param
Add direct support to the newly added scrub mode enum. Instead of the
legacy `skip_corrupted` flag, one can now select the desired mode from
the mode enum. `skip_corrupted` is still supported for backwards
compatibility but it is ignored when the mode enum is set.
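
The precedence between the two parameters can be sketched like this. The
parameter name `scrub_mode` and the mode names are assumptions made for
illustration; only the precedence rule comes from the text above.

```python
def resolve_scrub_mode(params):
    """Pick the scrub mode from request parameters (hypothetical names)."""
    mode = params.get("scrub_mode")
    if mode is not None:
        # The mode enum wins; skip_corrupted is ignored when it is set.
        return mode.upper()
    # Legacy flag, kept for backwards compatibility.
    return "SKIP" if params.get("skip_corrupted") == "true" else "ABORT"
```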
2021-05-05 12:03:42 +03:00
Botond Dénes
03728f5c26 sstables: compaction/scrub: replace skip_corrupted with mode enum
We want to add more modes than the current two, so replace the current
boolean mode selector with an enum which allows for easy extensions.
2021-05-05 12:03:42 +03:00
Botond Dénes
ba75115e20 sstables: compaction/scrub: prevent infinite loop when last partition end is missing
Scrub compaction will add the missing last partition-end in a stream
when allowed to modify the stream. This however can cause an infinite
loop:
1) user calls fill_buffer()
2) process fragments until underlying is at EOS
3) add missing partition end
4) set EOS
5) user sees that last buffer wasn't empty
6) calls fill_buffer() again
7) goto (3)

To prevent this cycle, break out of `fill_buffer()` early when both the
scrub reader and the underlying is at EOS.
2021-05-05 12:03:42 +03:00
Botond Dénes
41181a5c2f tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
No point in creating a permit for every mutation fragment.
2021-05-05 12:03:42 +03:00
Botond Dénes
e84c31fab8 query_class_config: add operator== for max_result_size 2021-05-05 11:20:22 +03:00
Botond Dénes
9313acb304 database: get_reader_concurrency_semaphore(): extract query classification logic
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
2021-05-05 10:41:04 +03:00
Tomasz Grabiec
121eb32679 Merge 'test: perf: report instructions retired per operations' from Avi Kivity
Instructions retired per op is much more stable than time per op
(inverse throughput), since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected, since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.

This allows incremental changes to the code base to be compared with
more confidence.

Current results are around 55k instructions per read, and 52k for writes.

Closes #8563

* github.com:scylladb/scylla:
  test: perf: tidy up executor_stats snapshot computation
  test: perf: report instructions retired per operations
  test: perf: add RAII wrapper around Linux perf_event_open()
  test: perf: make executor_stats_snapshot() a member function of executor
2021-05-05 00:54:08 +02:00
Tomasz Grabiec
b8665c459d Merge "raft: replication test updates" from Alejo
Cleanups, fixes, and configuration change support for replication tests.

* alejo/raft-tests-replication-01-fixes-v13:
  raft: replication test: remove obsolete helper
  raft: replication test: add_entry with retries
  raft: replication test: support config change
  raft: replication test: add dummy command support
  raft: replication test: test both with and without prevote
  raft: replication test: make initial leader just default
  raft: replication test: create command helper
  raft: replication test: free elections as helper
  raft: replication test: fix election connectivity
  raft: replication test: fix custom election
  raft: replication test: add helpers for threshold and election
  raft: replication test: connectivity improvement
  raft: replication test: helper for server_address
  raft: replication test: use wait_log()
  raft: replication test: cycle leader more
  raft: replication test: fix a test description
  raft: replication test: remove multiple state machines
  raft: replication test: remove checksum
  raft: replication test: remove unused class param
2021-05-04 18:52:47 +02:00
Alejo Sanchez
27ad2a0f28 raft: replication test: remove obsolete helper
As we are now serially adding commands with consecutive integers there
is no need to build vectors of commands. Remove helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:01:07 -04:00
Alejo Sanchez
0a54fd848b raft: replication test: add_entry with retries
The current leader might have stepped down. Try again and learn if
there's a new leader.
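
A toy sketch of the retry pattern, assuming a rejected request carries a
hint about the current leader. All names here (`NotALeader`,
`ServerSketch`, `add_entry_with_retries`) are hypothetical and do not
mirror the test's helpers.

```python
class NotALeader(Exception):
    """Raised by a server that is not the leader; carries a leader hint."""

    def __init__(self, leader):
        super().__init__(f"not a leader, try {leader}")
        self.leader = leader


class ServerSketch:
    def __init__(self, server_id, cluster):
        self.server_id = server_id
        self.cluster = cluster  # shared dict: {"leader": id, "log": [...]}

    def add_entry(self, command):
        if self.cluster["leader"] != self.server_id:
            raise NotALeader(self.cluster["leader"])
        self.cluster["log"].append(command)


def add_entry_with_retries(servers, leader, command, max_attempts=3):
    """Try the assumed leader; on rejection, follow the hint and retry."""
    for _ in range(max_attempts):
        try:
            servers[leader].add_entry(command)
            return leader
        except NotALeader as error:
            leader = error.leader
    raise RuntimeError("no leader accepted the entry")


cluster = {"leader": 2, "log": []}
servers = {i: ServerSketch(i, cluster) for i in range(3)}
new_leader = add_entry_with_retries(servers, leader=0, command=42)
```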

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:00:46 -04:00
Nadav Har'El
df65d09e08 Merge ' cdc: log: fill cdc$deleted_ columns in pre-images ' from Piotr Grabowski
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:

```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```

For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):

```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.

Therefore, to properly decode pre-images you would need to know which pre-image mode was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current pre-image mode).

A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:

If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```

A client library now can properly decode a pre-image row. If it sees a `NULL` value, it can now check the `cdc$deleted_` column to determine if this `NULL` value was a part of pre-image or it was omitted due to not being an affected column in the delta operation.
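
A client-side decoding step could look roughly like the sketch below. The
row is modelled as a plain dict and `decode_preimage_column` is a
hypothetical helper, not part of any driver or CDC library.

```python
def decode_preimage_column(row, column):
    """Classify a pre-image column as ('value', v), ('null', None) or ('omitted', None)."""
    value = row.get(column)
    if value is not None:
        return ("value", value)
    if row.get(f"cdc$deleted_{column}") is True:
        # The column was NULL in the pre-image itself.
        return ("null", None)
    # The column was left out because the delta did not affect it.
    return ("omitted", None)


true_mode_row = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": None, "v2": 1}
full_mode_row = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": True, "v2": 1}
```

The two rows are the pre-images shown above for the 'true' and 'full'
modes; they now decode differently even though `v` is `NULL` in both.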

No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.

Additional example:
Additional example of trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:

```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```

```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```

generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.

Closes #8568

* github.com:scylladb/scylla:
  cdc: log: assert post_image is always in full mode
  cdc: tests: check cdc$deleted_ columns in images
  cdc: log: fill cdc$deleted_ columns in pre-images
2021-05-04 14:45:27 +03:00
Lubos Kosco
c26bcf29f9 scylla_io_setup: add disk properties for L Azure instances 2021-05-04 13:13:05 +02:00
Lubos Kosco
f627fcbb0c scylla_util.py: add new class for Azure cloud support 2021-05-04 13:12:42 +02:00
Piotr Grabowski
cd6154e8bf cdc: log: assert post_image is always in full mode
Add an assertion that checks that post_image can never be in non-full
mode.
2021-05-04 12:33:15 +02:00
Piotr Grabowski
778fbb144f cdc: tests: check cdc$deleted_ columns in images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.

This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
2021-05-04 12:33:15 +02:00
Calle Wilund
7e345e37e8 cql/cdc_batch_delete_postimage_test - rename test files + fix result
The tests, when added, were not named kosher (*_test), which the
runner apparently, quaintly, requires to pick them up (instead of the
more sensible *.cql).

Thusly, the test was never run beyond initial creation, and also
bit-rotted slightly during behaviour changes.

Renamed and re-resulted.

Closes #8581
2021-05-04 12:47:33 +03:00
Avi Kivity
ef2313325b Merge "Teach sstables streams new streams API" from Pavel E
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.

The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).

The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.

tests: unit(dev)
"

* 'br-streams-napi' of https://github.com/xemul/scylla:
  sstables: Make checksum sink report buffer size from lower sink
  sstables: Report buffer size from compressed file sink
2021-05-04 12:22:38 +03:00
Pavel Emelyanov
13b07a3c58 sstables: Make checksum sink report buffer size from lower sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
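
The delegation can be pictured with a small Python analogue. This is not
the seastar data sink API; the class names are hypothetical and CRC32 from
zlib is used only as an example checksum.

```python
import zlib


class FileSinkSketch:
    """Lower sink: knows its preferred buffer size and records writes."""

    def __init__(self, buffer_size):
        self._buffer_size = buffer_size
        self.written = []

    def buffer_size(self):
        return self._buffer_size

    def put(self, data):
        self.written.append(data)


class ChecksummingSinkSketch:
    """Wrapper sink: updates a checksum and forwards buffers unchanged."""

    def __init__(self, lower):
        self._lower = lower
        self.checksum = 0

    def buffer_size(self):
        # Report the lower sink's size so both layers agree on buffer sizes.
        return self._lower.buffer_size()

    def put(self, data):
        self.checksum = zlib.crc32(data, self.checksum)
        self._lower.put(data)


lower = FileSinkSketch(buffer_size=128 * 1024)
sink = ChecksummingSinkSketch(lower)
sink.put(b"hello")
sink.put(b"world")
```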

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:30 +03:00
Pavel Emelyanov
01b979beca sstables: Report buffer size from compressed file sink
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:27 +03:00
Benny Halevy
946f9d9c83 commitlog: segment_manager::shutdown: abort on errors
Currently, if sync_all_segments fails during shutdown,
_shutdown is never set, causing replenish_reserve
to hang, as possibly seen in #8577.

It is better if scylla aborts on critical errors
during shutdown rather than just hang.

Refs #8577

Test: unit(dev)
DTest: commitlog_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-04 10:00:03 +03:00
Pekka Enberg
6583a04e5d Update seastar submodule
* seastar f1b6b95b...847fccaf (1):
  > perftune.py: fix parsing of 'write_back_cache' YAML option
2021-05-04 09:12:49 +03:00
Benny Halevy
f01307d816 commitlog: allocate_segment_ex: make_checked_file
To make sure no errors writing to commitlog are tolerated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-04 09:00:58 +03:00
Avi Kivity
6ffd813b7b Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.

When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.

This PR introduces the option to wait for hints to be replayed before continuing with a user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in the repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout is reached.

This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.

Fixes #8102

Tests:
- unit(dev)
- some manual tests:
    - shutting down repair coordinator during hints replay,
    - shutting down node participating in repair during hints replay,

Closes #8452

* github.com:scylladb/scylla:
  repair: introduce abort_source for repair abort
  repair: introduce abort_source for shutdown
  storage_proxy: add abort_source to wait_for_hints_to_be_replayed
  storage_proxy: stop waiting for hints replay when node goes down
  hints: dismiss segment waiters when hint queue can't send
  repair: plug in waiting for hints to be sent before repair
  repair: add get_hosts_participating_in_repair
  storage_proxy: coordinate waiting for hints to be sent
  config: add wait_for_hint_replay_before_repair option
  storage_proxy: implement verbs for hint sync points
  messaging_service: add verbs for hint sync points
  storage_proxy: add functions for syncing with hints queue
  db/hints: make it possible to wait until current hints are sent
  db/hints: add a metric for counting processed files
  db/hints: allow to forcefully update segment list on flush
2021-05-03 18:47:27 +03:00
Alejo Sanchez
56e977ae69 raft: replication test: support config change
Add support for configuration change on leader.

Keep track of servers in config in test.

Add a dummy entry to confirm configuration changed. If the add fails,
because the old leader was not in the new config and stepped down, the
config is considered changed, too.

Add a test with some configuration changes.
Add a test cycling every scenario for 1 of 4 nodes removed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
8d8af92cbb raft: replication test: add dummy command support
Use a special value as dummy entry to be ignored when seen in state
machine input.

Ignore dummy entries for count.
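
A minimal sketch of the dummy-entry idea, assuming a sentinel value that the state machine skips. `DUMMY` and `CountingStateMachine` are hypothetical names used only for illustration; the actual test is written in C++ and may represent the dummy differently.

```python
DUMMY = object()  # hypothetical sentinel standing in for the "special value"

class CountingStateMachine:
    """Toy state machine that records applied commands and ignores dummy entries."""
    def __init__(self):
        self.values = []

    def apply(self, commands):
        for cmd in commands:
            if cmd is DUMMY:
                continue  # dummy entries carry no data and must not be counted
            self.values.append(cmd)

    def count(self):
        return len(self.values)
```

Because the dummy is skipped in `apply`, the count reflects only real entries.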

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
4aa52be7e5 raft: replication test: test both with and without prevote
Before this change the default was prevote enabled.
With this change each test is run with and without prevote.
This doubles the number of test cases.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
e759e492c7 raft: replication test: make initial leader just default
The test suite requires an initial leader and at the moment it's always
just 0. Make it default and simplify code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
eb5bbcdec7 raft: replication test: create command helper
Factor out repeated code and make it available for other uses.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
eb94dd26dc raft: replication test: free elections as helper
Add a helper to run free elections and use it in partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
cb297a57df raft: replication test: fix election connectivity
If a leader was already disconnected the election of a new leader could
re-connect. Save original connectivity and restore it when done electing
new leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
0a5c605713 raft: replication test: fix custom election
Use the new specific connectivity to manage old leader disconnection
more specifically.

This fixes having elections where the vote of the old leader is required
for quorum. For example, take {A,B} where we want to switch leader. For B
to become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for B.

So, to handle the general case, the old leader first needs to be
disconnected from all nodes, then the desired node is made candidate, and
then the old leader is connected only to the desired candidate (else,
other nodes would see the new candidate as disrupting a live leader).

Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, this patch retries
the process until the desired node becomes leader.

The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the latter waits until a candidate either
becomes a leader or reverts to follower.

The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
9909983e38 raft: replication test: add helpers for threshold and election
Add 2 helper functions: one to make nodes reach the timeout threshold and
one to elect a specific node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
38526d7a2f raft: replication test: connectivity improvement
Replace simple full disconnect of a node with specific from -> to
disconnection tracking.

This will help electing new leaders.

Say there are {A,B,C} with A leader and we want to elect B.
Before this patch, we would disconnect A, run an election with just
{B,C}, and then re-connect A.

If we have {A,B} and want to elect B, this won't work as B needs 2/2+1 = 2
votes and A is disconnected, even if we made A step down. This patch
corrects this shortcoming. (@gleb-cloudius)
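
The vote arithmetic above can be checked with a short sketch. `quorum` and `can_win` are hypothetical helper names for illustration only; they assume a simple majority rule and that every reachable voter grants its vote.

```python
def quorum(cluster_size):
    """Votes needed for a majority: floor(n / 2) + 1."""
    return cluster_size // 2 + 1

def can_win(cluster, reachable_voters):
    """True if the voters the candidate can reach (itself included) form a majority."""
    votes = len(set(cluster) & set(reachable_voters))
    return votes >= quorum(len(cluster))
```

With {A,B} and A fully disconnected, B reaches only itself and cannot collect the 2 votes it needs; with {A,B,C}, B and C alone are enough.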

With this patch, we can specify other followers (not the previous or
next leader) to not see the old leader, but the new and old leaders see
each other just fine. In the example {A,B,C} above we can cut A<->B
specifically.

Also, this is closer to etcd testing and should help porting cases.

NOTE: in the current test implementation failure_detector reports
node.is_alive(other_node) if there is a connection both ways.
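
A sketch of directional connectivity tracking and the two-way liveness rule from the NOTE above. The `Connectivity` class and its method names are hypothetical and chosen for illustration; the real test code is C++ and may be structured differently.

```python
class Connectivity:
    """Toy from -> to disconnection tracker (hypothetical)."""
    def __init__(self):
        self._cut = set()  # set of (src, dst) pairs that are currently cut

    def disconnect(self, src, dst):
        self._cut.add((src, dst))

    def connect(self, src, dst):
        self._cut.discard((src, dst))

    def is_connected(self, src, dst):
        return (src, dst) not in self._cut

    def is_alive(self, node, other):
        # A peer counts as alive only if messages can flow in both directions.
        return self.is_connected(node, other) and self.is_connected(other, node)
```

Cutting a single direction is therefore enough for both endpoints to consider each other down, while all other pairs stay unaffected.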

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
f53dea432c raft: replication test: helper for server_address
A helper function to convert from local 0-based id to raft 1-based
server_address.
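
The conversion amounts to an offset of one. A minimal sketch, assuming the helper only shifts the index; `to_raft_id` is a hypothetical name, and the real helper returns a raft server_address rather than a bare integer.

```python
def to_raft_id(local_id):
    """Map a local 0-based test index to a 1-based raft server id."""
    if local_id < 0:
        raise ValueError("local id must be non-negative")
    return local_id + 1
```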

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
294e16cf8b raft: replication test: use wait_log()
Use wait_log() helper in leftover election code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
355c8a052f raft: replication test: cycle leader more
For the ported etcd leader-cycling test, cycle some more.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
5b2c9a6c94 raft: replication test: fix a test description
Fix replace_log_leaders_log_empty description comment.

Reported by @kbraun

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
bbb56e2265 raft: replication test: remove multiple state machines
Checksum was removed so undo support for multiple versions added in:

    test: add support for different state machines
    43dc5e7dc2

NOTE: as there is a test with custom total_values, expected value cannot
      be static const anymore. (line 630)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
e77af8573b raft: replication test: remove checksum
Previously, entries were added in parallel and we needed to check if
order was broken. Using a simple checksum was better than a hash as you
could easily find the position it broke (we add consecutive numbers).

Now order of entries is forced so it's not useful. This patch removes
it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
9335941b49 raft: replication test: remove unused class param
persisted_snapshots is not used in state_machine class. Remove it.

Reported by @kbraun

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Benny Halevy
9cc45fe5c9 flat_mutation_reader: consume_mutation_fragments_until: maybe yield after each popped mutation_fragment
Address the following reactor stall seen with
4.6.dev-0.20210421.2ad09d0bf:
```
2021-04-29T17:19:11+00:00  perf-latency-nemesis-fix-late-db-node-afb25d9a-5 !INFO    | scylla[9515]: Reactor stalled for 19 ms on shard 2. Backtrace: 0x4044de4 0x4044121 0x4044caf 0x7f537c6601df 0x13792ff 0x137fc18 0x11d89ec 0x1444424 0x13edd69 0x12bdc57 0x12bc1fa 0x12bb6f3 0x12ba304 0x12b94ce 0x1282525 0x12812ec 0x1524fda 0x12aa3ec 0x12aa228 0x4057d3f 0x4058fc7 0x407797b 0x40234ba 0x93f8 0x101902

void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1223
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1104
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1118
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1206
?? ??:0
logalloc::region_impl::object_descriptor::encode(char*&, unsigned long) const at ./utils/logalloc.cc:1184
 (inlined by) logalloc::region_impl::alloc_small(logalloc::region_impl::object_descriptor const&, unsigned int, unsigned long) at ./utils/logalloc.cc:1293
logalloc::region_impl::alloc(migrate_fn_type const*, unsigned long, unsigned long) at ./utils/logalloc.cc:1515
managed_bytes at ././utils/managed_bytes.hh:149
 (inlined by) managed_bytes at ././utils/managed_bytes.hh:198
 (inlined by) atomic_cell_or_collection::copy(abstract_type const&) const at ./atomic_cell.cc:86
operator() at ./mutation_partition.cc:1462
 (inlined by) std::__exception_ptr::exception_ptr compact_radix_tree::tree<cell_and_hash, unsigned int>::copy_slots<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const::{lambda(unsigned int)#1}, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > >(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, cell_and_hash const*, unsigned int, unsigned int, compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const::{lambda(unsigned int)#1}&&, row::row(schema const&, column_kind, row const&)::$_14&) at ././utils/compact-radix-tree.hh:1406
 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head*, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:1293
 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head*, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >(compact_radix_tree::variadic_union<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, 
unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:829
 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head*, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >(compact_radix_tree::variadic_union<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, 
unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:832
 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head*, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:837
 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head*, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head::clone<row::row(schema const&, column_kind, row const&)::$_14&>(row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:499
void compact_radix_tree::tree<cell_and_hash, unsigned int>::clone_from<row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int> const&, row::row(schema const&, column_kind, row const&)::$_14&) at ././utils/compact-radix-tree.hh:1866
 (inlined by) row at ./mutation_partition.cc:1465
deletable_row at ././mutation_partition.hh:831
 (inlined by) rows_entry at ././mutation_partition.hh:940
 (inlined by) rows_entry* allocation_strategy::construct<rows_entry, schema const&, clustering_key_prefix const&, deletable_row const&>(schema const&, clustering_key_prefix const&, deletable_row const&) at ././utils/allocation_strategy.hh:153
 (inlined by) operator() at ././cache_flat_mutation_reader.hh:467
operator() at ././row_cache.hh:601
 (inlined by) decltype(auto) with_allocator<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}::operator()() const::{lambda()#1}>(allocation_strategy&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}::operator()() const::{lambda()#1}) at ././utils/allocation_strategy.hh:311
 (inlined by) operator() at ././row_cache.hh:600
 (inlined by) decltype(auto) logalloc::allocating_section::with_reclaiming_disabled<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}&>(logalloc::region&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}&) at ././utils/logalloc.hh:757
 (inlined by) operator() at ././utils/logalloc.hh:779
 (inlined by) _ZN8logalloc18allocating_section12with_reserveIZNS0_clIZN5cache11lsa_manager36run_in_update_section_with_allocatorIZNS3_26cache_flat_mutation_reader18maybe_add_to_cacheERK14clustering_rowEUlvE_EEvOT_EUlvE_EEDcRNS_6regionESC_EUlvE_EEDcSC_ at ././utils/logalloc.hh:728
 (inlined by) decltype(auto) logalloc::allocating_section::operator()<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}>(logalloc::region&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}) at ././utils/logalloc.hh:778
 (inlined by) void cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&) at ././row_cache.hh:599
 (inlined by) cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&) at ././cache_flat_mutation_reader.hh:459
 (inlined by) cache::cache_flat_mutation_reader::maybe_add_to_cache(mutation_fragment const&) at ././cache_flat_mutation_reader.hh:446
operator() at ././cache_flat_mutation_reader.hh:321
 (inlined by) operator() at ././flat_mutation_reader.hh:549
seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&>(flat_mutation_reader) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) auto seastar::futurize_invoke<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&>(flat_mutation_reader) at ././seastar/include/seastar/core/future.hh:2166
 (inlined by) {lambda()#2} seastar::do_until<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, consume_mutation_fragments_until<{lambda()#1}, mutation_fragment, {lambda(mutation_fragment)#1}>(seastar::future, flat_mutation_reader, {lambda()#1}, mutation_fragment, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(flat_mutation_reader&, seastar::future<void>) at ././seastar/include/seastar/core/loop.hh:341
 (inlined by) seastar::future<void> consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:547
 (inlined by) cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:317
 (inlined by) operator() at ././cache_flat_mutation_reader.hh:277
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, seastar::internal::monostate) at ././seastar/include/seastar/core/future.hh:1979
 (inlined by) seastar::future<void> seastar::future<void>::then_impl<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:1601
 (inlined by) seastar::internal::future_result<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, void>::future_type seastar::internal::call_then_impl<seastar::future<void> >::run<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(seastar::future<void>&, seastar::internal::future_result&&) at ././seastar/include/seastar/core/future.hh:1234
 (inlined by) seastar::future<void> seastar::future<void>::then<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:1520
 (inlined by) cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:276
operator() at ././cache_flat_mutation_reader.hh:266
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, seastar::internal::monostate) at ././seastar/include/seastar/core/future.hh:1979
 (inlined by) seastar::future<void> seastar::future<void>::then_impl<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:1601
 (inlined by) seastar::internal::future_result<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, void>::future_type seastar::internal::call_then_impl<seastar::future<void> >::run<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(seastar::future<void>&, seastar::internal::future_result&&) at ././seastar/include/seastar/core/future.hh:1234
 (inlined by) seastar::future<void> seastar::future<void>::then<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:1520
 (inlined by) cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:265
operator() at ././cache_flat_mutation_reader.hh:240
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) auto seastar::futurize_invoke<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2166
 (inlined by) seastar::future<void> seastar::do_until<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}, cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}) at ././seastar/include/seastar/core/loop.hh:341
 (inlined by) cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:239
 (inlined by) operator() at ././cache_flat_mutation_reader.hh:230
cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:235
flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:405
 (inlined by) seastar::future<bool> flat_mutation_reader::impl::fill_buffer_from<flat_mutation_reader>(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ./flat_mutation_reader.cc:203
operator() at ./row_cache.cc:406
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) auto seastar::futurize_invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2166
 (inlined by) seastar::future<void> seastar::do_until<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}) at ././seastar/include/seastar/core/loop.hh:341
 (inlined by) single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ./row_cache.cc:405
 (inlined by) operator() at ./row_cache.cc:402
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::__invoke_other, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::__invoke_result<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>::type std::__invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::__invoke_result&&, (single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&)...) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:95
 (inlined by) std::invoke_result<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>::type std::invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::invoke_result&&, (single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&)...) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/functional:88
 (inlined by) auto seastar::internal::future_invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&, seastar::internal::monostate>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&, seastar::internal::monostate&&) at ././seastar/include/seastar/core/future.hh:1209
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1582
 (inlined by) _ZN7seastar8futurizeINS_6futureIvEEE22satisfy_with_result_ofIZZNS2_14then_impl_nrvoIZN34single_partition_populating_reader11fill_bufferENSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEEUlvE_S2_EET0_OT_ENKUlONS_8internal22promise_base_with_typeIvEERSF_ONS_12future_stateINSJ_9monostateEEEE_clESM_SN_SR_EUlvE_EEvSM_SI_ at ././seastar/include/seastar/core/future.hh:2120
operator() at ././seastar/include/seastar/core/future.hh:1575
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, seastar::future>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2228
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2637
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2796
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3987
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(std::__invoke_other, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>, void>::type std::__invoke_r<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:110
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:291
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:622
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
?? ??:0
?? ??:0
```

Fixes #8579

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8580
2021-05-03 14:06:26 +03:00
Avi Kivity
9d018b5f40 Update seastar submodule
* seastar 0b2c25d133...f1b6b95b69 (10):
  > Merge "Cap streams buffer sizes with IO limits" from Pavel
  > io_queue: fix mismatched class/struct tag for priority_class_data
  > perftune.py: instrument bonding tuning flow with 'nic' parameter
  > perftune.py: strip a newline off the 'scheduler' file content
  > perftune.py: add support for virtual non-raid disk devices
  > doc: fix typos in doc/tutorial.md
  > Merge "Add IO in-disk stats" from Pavel E
  > iotune: Perform fs-check on all directories
  > file: Keep reference on io-queue
  > Merge "Assorted set of improvements over io-queue" from Pavel E
2021-05-03 13:21:25 +03:00
Eliran Sinvani
fc93133cbe Service level controller: fix wrong default service level removal log
A log print placed outside its block resulted in repeated messages about
removal of the default service level. The message was printed every time
the configuration was scanned for changes. It happens when the default
service level is one of the last in the map (in the map's sort order).

Fixes #8567

Closes #8576
2021-05-03 09:08:41 +03:00
Pavel Solodovnikov
4c351ff260 raft: switch group_id type from uint64_t to utils::UUID
Introduce a tagged id struct for `group_id`.

Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).

Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.

The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
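
The tagged-id idea can be illustrated with a minimal sketch. This is
Python, not the C++ from the patch, and the names `TaggedUUID`,
`GroupId`, `ServerId` and `create_random` are hypothetical, chosen
only for illustration. It assumes the goal is that ids of different
kinds never compare equal even when they wrap the same UUID:

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedUUID:
    """Hypothetical base: an id that wraps a UUID; each subclass is a tag."""
    id: uuid.UUID

    @classmethod
    def create_random(cls):
        # uuid4() produces a random id, so no central counter is needed.
        return cls(uuid.uuid4())


@dataclass(frozen=True)
class GroupId(TaggedUUID):
    """Identifies a raft group."""


@dataclass(frozen=True)
class ServerId(TaggedUUID):
    """Identifies a raft server."""


if __name__ == "__main__":
    raw = uuid.uuid4()
    # Same underlying UUID, different tag: the ids are not equal.
    print(GroupId(raw) == GroupId(raw))
    print(GroupId(raw) == ServerId(raw))
```

In C++ a tagged struct rejects such mix-ups at compile time; the
Python sketch only shows the runtime effect.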

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
2021-05-02 16:39:54 +03:00
Pavel Solodovnikov
a7bd7dd122 utils: make basic UUID constructors constexpr
Mark default and `UUID(most_sig_bits, least_sig_bits)` ctors
as constexpr.

This allows constructing constexpr constants of the UUID type.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-2-pa.solodovnikov@scylladb.com>
2021-05-02 16:39:52 +03:00
Avi Kivity
ae660eeec4 logalloc: reduce minimum lsa reserve in allocating_section to 1
Many workloads have fairly constant and small request sizes, so we
don't need large reserves for them. These workloads suffer needlessly
from the current large reserve of 10 segments (1.2MB) when they really
need a few hundred bytes. Reduce the reserve to a minimum of 1 segment.

Note that due to #8542 this can make a large difference. Consider a
workload that has a 1000-byte footprint in cache. If we've just
consumed some free memory and reduced the reserve to zero, then
we'll evict about 50,000 objects before proceeding to compact. With
the reserve reduced to 1, we'll evict 128 objects.  All this
for 1000 bytes of memory.
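
The reserve figures can be checked with a little arithmetic. The
sketch below assumes a segment size of 128 KiB, inferred from
"10 segments (1.2MB)" above; `SEGMENT_SIZE`, `reserve_bytes` and
`objects_to_refill` are names made up for illustration. It ignores
fragmentation and per-object overhead, and it does not model the
"about 50,000" figure, which depends on #8542.

```python
SEGMENT_SIZE = 128 * 1024  # assumed bytes per LSA segment


def reserve_bytes(segments):
    """Bytes held back by a reserve of the given number of segments."""
    return segments * SEGMENT_SIZE


def objects_to_refill(segments, object_size):
    """Objects of object_size bytes to evict to free a whole reserve.

    Uses ceiling division: a partially freed object does not count.
    """
    return -(-reserve_bytes(segments) // object_size)


if __name__ == "__main__":
    print(reserve_bytes(10))           # old minimum reserve, in bytes
    print(objects_to_refill(1, 1024))  # new minimum reserve, 1 KiB objects
```

Under these assumptions a one-segment reserve corresponds to 128
objects of 1 KiB each, which matches the figure quoted above.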

Of course, #8542 should be fixed, but reducing the reserve provides
some quick relief and makes sense even with the larger fix. The
reserve will quickly grow for workloads that handle bigger requests,
so they won't see an impact from the reduction.

Closes #8572
2021-05-02 15:22:04 +02:00
Pavel Emelyanov
6de8bb663b storage_service: Remove migration notifier dependency
The only reason why storage service keeps a reference on the migration
notifier is that the latter was needed by cdc before previous patch.
Now cdc gets the notifier directly from main, so storage service is
a bit more off the hook.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-29 22:47:13 +03:00
Pavel Emelyanov
cc813ef0dd cdc: Remove db_context::builder
Right now the builder is just an opaque transfer between cdc_service
constructor args and cdc_service's db_context constructor args.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-29 22:46:57 +03:00
Pavel Emelyanov
3a7ca647af cdc: Provide migration notifier right at once
The only way db_context's migration notifier reference is set up
is via cdc_service->db_context::builder->.build chain of calls.
Since the builder's notifier optional reference is always
disengaged (the .with_migration_notifier is removed by previous
patch) the only possible notifier reference there is from the
storage service which, in turn, is the same as in main.cc.

Given that, push the notifier reference onto db_context directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-29 22:40:24 +03:00
Pavel Emelyanov
421a514c30 cdc: Remove db_context::builder::with_migration_notifier
It's unused and removing it makes next patch's life simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-29 22:39:12 +03:00
Piotr Grabowski
b1650114eb cdc: log: fill cdc$deleted_ columns in pre-images
Before this change, cdc$deleted_ columns were all NULL in pre-images.
Lack of such information made it hard to correctly interpret the
pre-image rows, for example:

INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);

For this example, pre-image generated for the second operation
would look like this (in both 'true' and 'full' pre-image mode):

pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1

v=NULL has two meanings:
1. If pre-image was in 'true' mode, v=NULL describes that v was not
affected (affected columns: pk, ck, v2).
2. If pre-image was in 'full' mode, v=NULL describes that v was equal
to NULL in the pre-image.

Therefore, to properly decode pre-images you would need to know in
which mode pre-image was configured on the CDC-enabled table at the
moment this CDC log row was inserted. There is no way to determine
such information (you can only check the current pre-image mode).

A solution to this problem is to fill in the cdc$deleted_ columns
for pre-images. After this change, for the INSERT described above, CDC
now generates the following log row:

If in pre-image 'true' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1

If in pre-image 'full' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1

A client library now can properly decode a pre-image row. If it sees
a NULL value, it can now check the cdc$deleted_ column to determine
if this NULL value was a part of pre-image or it was omitted due to
not being an affected column in the delta operation.
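
A client-side decoder for the rule above might look like the
following sketch. It assumes a log row is available as a plain dict
keyed by column name; the function name `decode_preimage_column` and
the returned labels are hypothetical and not part of any driver API.

```python
def decode_preimage_column(row, column):
    """Classify one column of a pre-image row.

    Returns ("value", v) if the column had a value, ("null", None) if
    it was NULL in the pre-image, and ("omitted", None) if it was left
    out because the delta did not affect it.
    """
    value = row.get(column)
    if value is not None:
        return ("value", value)
    # The value is NULL: the cdc$deleted_ column tells the two cases apart.
    if row.get(f"cdc$deleted_{column}") is True:
        return ("null", None)
    return ("omitted", None)


if __name__ == "__main__":
    true_mode = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": None, "v2": 1}
    full_mode = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": True, "v2": 1}
    print(decode_preimage_column(true_mode, "v"))
    print(decode_preimage_column(full_mode, "v"))
```

The two sample rows are the 'true' and 'full' mode pre-images shown
above, so the decoder no longer needs to know the table's pre-image
mode.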

No such change is necessary for the post-image rows, as those images
are always generated in the 'full' mode.

Additional example of trouble decoding pre-images before this change.
tbl2 - 'true' pre-image mode, tbl3 - 'full' pre-image mode:

INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);

INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);

generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1

INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);

generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1

Both pre-images look the same, but:
1. v=NULL in tbl2 describes v being omitted from the pre-image.
2. v=NULL in tbl3 describes v being NULL in the pre-image.
2021-04-29 18:04:07 +02:00
Botond Dénes
0e8818f6ac scylla-gdb.py: scylla apply: don't change current shard
Scylla apply iterates over all shards in undetermined order and leaves
the last shard as the current one. This is counter-intuitive and can
lead to surprises as the user might not expect the current shard to be
changed by a command that just executes a command on each shard.
This patch ensures that both in case of the happy and error paths the
current shard is unchanged.
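
The save-and-restore pattern can be sketched as follows. This is not
the code from scylla-gdb.py: `FakeDebugger` is a stand-in for a
debugger session (it is not the gdb Python API), and
`apply_on_all_shards` is a hypothetical name.

```python
class FakeDebugger:
    """Stand-in for a debugger session that tracks a current shard."""

    def __init__(self, shards, current):
        self.shards = list(shards)
        self.current = current

    def switch_to(self, shard):
        self.current = shard


def apply_on_all_shards(dbg, command):
    """Run command on every shard, then restore the original shard."""
    original = dbg.current
    results = {}
    try:
        for shard in dbg.shards:
            dbg.switch_to(shard)
            results[shard] = command(dbg)
    finally:
        # Runs on both the happy path and the error path.
        dbg.switch_to(original)
    return results


if __name__ == "__main__":
    dbg = FakeDebugger(range(4), current=2)
    apply_on_all_shards(dbg, lambda d: d.current)
    print(dbg.current)
```

Putting the restore in a `finally` block is what covers the error
path: a command that raises on some shard still leaves the user on
the shard they started from.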

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429104937.61315-1-bdenes@scylladb.com>
2021-04-29 14:48:29 +03:00
Botond Dénes
9fc3cba055 sstables: improve error message for invalid sstable paths
The error message currently complains about "invalid version" and later
says the reason is that the path is not recognized. This is confusing, so
change the error message to start with "invalid path" instead. It is the
path that is invalid, not the version, after all.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092749.52659-1-bdenes@scylladb.com>
2021-04-29 12:50:48 +03:00
Botond Dénes
824b49aeb4 tools/scylla-sstable-index: use defer() to close sstables manager
So it is closed when loading the sstable throws an exception too.
Failing to close the manager will mask the real error as the user will
only see the assert failure due to failing to close the manager.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092248.50968-1-bdenes@scylladb.com>
2021-04-29 12:50:25 +03:00
Eliran Sinvani
0320110b04 messaging service: be more verbose when shutting down servers and clients
We encountered a phenomenon where shutting down the messaging service
doesn't complete, leaving the shutdown process stuck. Since we couldn't
pinpoint where exactly the shutdown went wrong, here we add some
verbosity to the shutdown stages so we can more accurately pinpoint the
culprit.

Closes #8560
2021-04-29 12:28:17 +03:00
Botond Dénes
26ae9555d1 test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test. We now have a dedicated
unit test for checking a heavily contested semaphore that does it
properly, so there is no need to try to fix this clumsy attempt that is
just making trouble at this point.

Refs: #8493

Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
2021-04-29 11:45:53 +03:00
Benny Halevy
ad8d93dd1a repair: repair_meta::stop: demote log message to debug level
This log message was added in 77cc694a08.
info log level was erroneously left over from development
and it's too noisy.  Demote it to debug level.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210429071539.1264244-1-bhalevy@scylladb.com>
2021-04-29 11:07:59 +03:00
Avi Kivity
706428a8a3 Merge 'cql3: Check if partition-key restrictions are all EQ at preparation time' from Dejan Mircevski
Previously, we checked if all partition-key restrictions were EQ at runtime.  This is, however, known already at prep time; no need to redo it on every query execution.  Move the check to prep time.

Tests: unit (dev, debug), perf_simple_query

Closes #8565

* github.com:scylladb/scylla:
  cql3: Replace runtime check with a prepared flag
  cql3: Track IN partition-key restrictions
  cql3: Inline add_single_column_restriction
  cql3: Inline statement_restrictions::add_restriction
2021-04-29 08:41:16 +03:00
Dejan Mircevski
57fa66a0a7 cql3: Replace runtime check with a prepared flag
Checking that every PK restriction is an EQ was happening at runtime.
This is wasteful, as the result is always the same.  Replace that
check with a flag computed once at preparation time.

Separate the simple-case processing into its own function rather than
pass the flag as an extra parameter.
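
The shape of the change can be sketched in a few lines. This is an
illustrative Python model, not the cql3 code: `PreparedSelect`, its
fields and its methods are hypothetical, and restrictions are
modelled as simple (column, operator) pairs.

```python
from itertools import product


class PreparedSelect:
    """Toy model of a prepared statement with partition-key restrictions."""

    def __init__(self, pk_restrictions):
        # pk_restrictions: list of (column, operator), e.g. ("p", "=").
        self.pk_restrictions = list(pk_restrictions)
        # Computed once at preparation time, not on every execution.
        self.all_pk_eq = all(op == "=" for _, op in self.pk_restrictions)

    def execute(self, bound):
        # The flag only selects the path; the simple case has its own function.
        if self.all_pk_eq:
            return self._single_key(bound)
        return self._multiple_keys(bound)

    def _single_key(self, bound):
        return [tuple(bound[col] for col, _ in self.pk_restrictions)]

    def _multiple_keys(self, bound):
        # IN restrictions contribute a list of values, EQ a single one.
        choices = [bound[col] if op == "IN" else [bound[col]]
                   for col, op in self.pk_restrictions]
        return list(product(*choices))


if __name__ == "__main__":
    stmt = PreparedSelect([("a", "="), ("b", "IN")])
    print(stmt.execute({"a": 1, "b": [2, 3]}))
```

Each call to `execute` reads the precomputed flag and does not
rescan the restrictions.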

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-28 16:44:48 -04:00
Dejan Mircevski
4661aa0269 cql3: Track IN partition-key restrictions
Add a bool member to statement_restrictions indicating whether any of
the partition columns are restricted by IN, which requires more
complex processing.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-28 15:47:32 -04:00
Tomasz Grabiec
6c168ee0eb row_cache: Always touch the partition on entry
This fixes a potential cause for reactor stalls during memory
reclamation. Applies only to schemas without clustering columns.

Every partition in cache has a dummy row at the end of the clustering
range (last dummy). That dummy must be evicted last, because MVCC
logic needs it to be there at all times. If LRU picks it for eviction
and it's not the last row, eviction does nothing and moves
on. Eventually, all other rows in this partition will be evicted too
and then the partition will go away with it.

Mutation reader updates the position of rows in the LRU (aka touching)
as it walks over them. However, it was not always touching the last
dummy row. If the partition was fully cached, and schema had no
clustering key, it would exit early before reaching the last dummy
row, here:

    inline
    void cache_flat_mutation_reader::move_to_next_entry() {
	clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position());
	if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
	    move_to_next_range();

That's because no_clustering_row_between() is always true for any key
in tables with no clustering columns, and the reader advances to
end-of-stream without advancing _next_row to the last dummy. This is
expected and desired, it means that the query range ends at the
current row and there is no need to move further. We would not take
this exit for tables with a non-singular clustering key domain and
open-ended query range, since there would be a key gap before the last
dummy row.

Refs #2972.

The effect of leaving the last dummy row not touched will be that such
scans will segregate rows in the LRU, bring all regular rows to the
front, and dummy rows at the tail. When eviction reaches the band of
dummy rows, it will have to walk over it, because evicting them
releases no memory. This can cause a reactor stall.

An easy fix for the scenario would be to always touch the dummy entry
when entering the partition. It's unlikely that the read will not
proceed to the regular rows. It would be best to avoid linking such
dummies in the LRU, but that's a much more complex change.
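
A toy model of the effect, not the row_cache code: the sketch below keeps
one regular row and one last dummy per partition in an LRU and counts how
many cold-end entries eviction has to step over before it can release
anything. The helper names `build_lru` and
`entries_skipped_before_first_release` are made up for illustration, and
the model assumes a dummy cannot be freed while its partition still has a
row.

```python
from collections import OrderedDict


def build_lru(n_partitions, touch_dummy_on_entry):
    """Return an LRU (coldest entry first) after populating and scanning."""
    lru = OrderedDict()
    # Populate: each partition gets one regular row and its last dummy.
    for p in range(n_partitions):
        lru[(p, "row")] = True
        lru[(p, "dummy")] = True
    # Scan every partition, touching entries as the reader walks over them.
    for p in range(n_partitions):
        if touch_dummy_on_entry:
            lru.move_to_end((p, "dummy"))
        lru.move_to_end((p, "row"))
    return lru


def entries_skipped_before_first_release(lru):
    """Count cold-end entries eviction walks over before it frees memory."""
    live_rows = {p for (p, kind) in lru if kind == "row"}
    skipped = 0
    for (p, kind) in lru:
        if kind == "dummy" and p in live_rows:
            skipped += 1  # dummy of a partition that still has rows
            continue
        return skipped
    return skipped
```

In this model, a scan that never touches the dummies leaves all of them in
one band at the cold end, while touching the dummy on entry keeps each
dummy next to its own row.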

Discovered in perf_row_cache_update, test_small_partitions(). I saw
200ms stalls with -m8G.

Refs #8541.

Tests:

   - row_cache_test (release)
   - perf_simple_query [no change]

Message-Id: <20210427111619.296609-1-tgrabiec@scylladb.com>
2021-04-28 21:59:28 +03:00
Dejan Mircevski
35e733ee88 cql3: Inline add_single_column_restriction
Invoking statement_restrictions::add_single_column_restriction()
outside the constructor would leave some data members out-of-date.
Prevent it by deleting the method and inlining its body into the only
call site.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-28 13:34:53 -04:00
Avi Kivity
2b252ef9b7 test: perf: tidy up executor_stats snapshot computation
Now that executor_stats_snapshot() is a member function, we can move
the capture of _count into invocations into it, capturing all the
stats in one place.
2021-04-28 19:02:35 +03:00
Dejan Mircevski
fc1c9b4289 cql3: Inline statement_restrictions::add_restriction
Invoking this method outside the constructor would leave some data
members out-of-date.  Prevent it by deleting the method and inlining
its body into the only call site.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-28 11:57:02 -04:00
Avi Kivity
863b49af03 test: perf: report instructions retired per operations
Instructions retired per op is much more stable than time per op
(inverse throughput) since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.

This allows incremental changes to the code base to be compared with
more confidence.
2021-04-28 18:46:55 +03:00
Avi Kivity
0bc98caf3e test: perf: add RAII wrapper around Linux perf_event_open()
Make it easy to embed in other classes.

A helper function is provided for the instructions retired counter.
2021-04-28 18:41:02 +03:00
Avi Kivity
498e6b9a64 test: perf: make executor_stats_snapshot() a member function of executor
I'd like to add an instructions counter which isn't accessible via
a global, so make the snapshot function a member. Out of respect to #1,
define functions for getting the number of allocations and tasks processed,
as they need heavy header files.
2021-04-28 18:38:35 +03:00
Benny Halevy
96ef204676 dht/token: shard_of: reuse shard_of_minimum_token
Returning shard 0 for the minimum token better
be hardcoded in one place.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210428113339.1092555-1-bhalevy@scylladb.com>
2021-04-28 15:08:36 +03:00
Benny Halevy
662355519d dht/i_partitioner: split_range_to_single_shard: drop unused lambda capture of start_shard
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210428113440.1099877-1-bhalevy@scylladb.com>
2021-04-28 15:07:57 +03:00
Benny Halevy
31b80b5752 scylla-gdb: scylla shard: print current shard with no arg
Currently `scylla shard` with no args results in
the following error:
```
(gdb) scylla shard
Traceback (most recent call last):
  File "master-scylla-gdb.py", line 2384, in invoke
    id = int(arg)
ValueError: invalid literal for int() with base 10: ''
Error occurred in Python: invalid literal for int() with base 10: ''
```

Instead, let it just print the current shard, similar to
`(gdb) thread`.
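
A minimal sketch of the argument handling described above, kept free of
the gdb module so it can run anywhere. `parse_shard_arg` is a
hypothetical helper, not the function in scylla-gdb.py.

```python
def parse_shard_arg(arg):
    """Return the requested shard id, or None when no argument was given."""
    arg = arg.strip()
    if not arg:
        # Caller is expected to print the current shard instead of switching.
        return None
    return int(arg)
```

Returning None for the empty string avoids the `int('')` ValueError shown
in the traceback and leaves the "print current shard" decision to the
caller.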

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210428093913.1070051-1-bhalevy@scylladb.com>
2021-04-28 13:04:17 +02:00
Avi Kivity
a43e896396 test: perf: don't truncate allocation/req and tasks/req report
I used {:.0} to truncate to integer, but apparently that resulted
in only one significant digit in the report, so 93.1 was reported as
90. Use the {:5.1f} to avoid truncation, and even get an extra
digit (we can have fractional tasks/op due to batching).

Current result is 93.1 allocs/op, 20.1 tasks/op (which suggests
batch size of around 10).
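
The same pitfall can be reproduced with Python's str.format, used here
only as an analogue of the C++ format library in the test (the exact text
it prints may differ): a bare precision counts significant digits, while
a fixed-point spec counts decimals.

```python
allocs_per_op = 93.1

# A bare precision with no type character counts significant digits,
# so a precision of 0 keeps a single digit.
one_digit = float("{:.0}".format(allocs_per_op))

# Fixed-point with one decimal and a minimum width of 5 keeps the value.
fixed = "{:5.1f}".format(allocs_per_op)
```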

Closes #8550
2021-04-28 12:50:13 +02:00
Avi Kivity
3e6232bb92 Merge "Wire offstrategy compaction to repair-based removenode" from Raphael
"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.

Refs #5226.
"

* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
  repair: Wire offstrategy compaction to repair-based removenode
  table: introduce trigger_offstrategy_compaction()
  repair/row_level: make operations_supported static const
2021-04-28 12:02:07 +03:00
Nadav Har'El
7d2df8a9bc test/alternator,cql-pytest: fix resource leak on failure
In the alternator and cql-pytest test frameworks, we have some convenient
contextmanager-based functions that allow us to create a temporary
resource (e.g., a table) that will be automatically deleted, for
example:

    with create_stream_test_table(...) as table:
        test_something(table)

However, our implementation of these functions wasn't safe. We had
code looking like:

    table = ...
    yield table
    table.delete()

The thinking was that the cleanup part (the table.delete()) will be
called after the user's code. However, if the user's code threw
(e.g., a failed assertion), the cleanup wasn't called... When the user's
code throws, it looks as if the "yield" throws. So the correct code
should look like:

    table = ...
    try:
        yield table
    finally:
        table.delete()

Python's contextmanager documentation indeed gives this idiom in its
example.

This patch fixes all contextmanager implementations in our tests to do
the cleanup even if the user's "with" block throws.
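
A self-contained sketch of the difference, using a stand-in `FakeTable`
class (hypothetical, not part of the test frameworks) so that it runs
without a database:

```python
from contextlib import contextmanager


class FakeTable:
    """Stand-in for a real table object; records whether it was deleted."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


@contextmanager
def leaky_table(table):
    yield table
    table.delete()  # skipped when the "with" body raises


@contextmanager
def safe_table(table):
    try:
        yield table
    finally:
        table.delete()  # runs whether or not the body raised


def run_failing_test(make_cm):
    """Run a "with" body that fails and report whether cleanup happened."""
    table = FakeTable()
    try:
        with make_cm(table):
            raise AssertionError("simulated test failure")
    except AssertionError:
        pass
    return table.deleted
```

With the first form the table survives the failed assertion; with the
try/finally form it is deleted.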

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210428083748.552203-1-nyh@scylladb.com>
2021-04-28 10:51:02 +02:00
Takuya ASADA
c9324634ca scylla_raid_setup: enabling mdmonitor.service on Debian variants
On Debian variants, mdmonitor.service cannot be enabled because it is
missing an [Install] section, so 'systemctl enable mdmonitor.service'
fails and mdmonitor does not run after the system restarts.

To force running the service, add Wants=mdmonitor.service on
var-lib-scylla.mount.

Fixes #8494

Closes #8530
2021-04-28 11:32:27 +03:00
Asias He
60ba8eb9b8 sstables: Add debug info when create_sharding_metadata generates zero ranges
The range passed to create_sharding_metadata is supposed to be owned or
at least partially owned by the shard. Log keys, range and split
ranges for debugging if the range does not belong to the shard.

This is helpful for debugging "Failed to generate sharding
metadata for foo.db" issues reported.

Refs #7056

Closes #8557
2021-04-28 11:22:06 +03:00
Avi Kivity
abb111297a Merge 'Calculate partition ranges from expr::expression' from Dejan Mircevski
In an ongoing effort to drop the `restrictions` class hierarchy, rewrite the partition-range calculation code to use the new `expression` objects.

Refs: #7217 #3815

Tests: unit (dev, debug)

Closes #8525

* github.com:scylladb/scylla:
  cql3: Specialize partition-range computation for EQ
  cql3: Replace some bounds_ranges calls
  cql3: Get partition range from expr::expression
  cql3: Track partition-range expressions
2021-04-28 10:26:00 +03:00
Asias He
84a78f4558 repair: Switch to use NODE_OPS_CMD for bootstrap operation
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.

In this patch, we continue the work to switch bootstrap operation to use
NODE_OPS_CMD.

The benefits:

- It is more reliable to detect pending node operations, to avoid
  multiple topology changes at the same time, than using gossip status.

- The cluster reverts to a state before the bootstrap operation
  automatically in case of error much faster than gossip.

- Allows users to pass a list of dead nodes to ignore during bootstrap
  explicitly.

- The BOOTSTRAP gossip status is not needed any more. This is one step
  closer to achieving gossip-less topology change.

Fixes #8472
2021-04-28 09:53:04 +08:00
Dejan Mircevski
84fa370415 cql3: Specialize partition-range computation for EQ
Save a couple of allocations per request by treating all-EQ cases
specially during the computation of partition ranges.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-27 20:06:57 -04:00
Raphael S. Carvalho
1d5cf2cc5d repair: Wire offstrategy compaction to repair-based removenode
removenode_with_repair() runs on all the nodes that need to sync data
from other nodes, so offstrategy compaction can be easily wired by
notifying tables when removenode completes.

From now on, when a user runs removenode, new sstables produced on
receiving nodes will be added to the table's maintenance set, and when
the operation completes, offstrategy compaction will be started to
reshape those new sstables before integrating them into the main set.

Refs #5226.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-27 12:26:53 -03:00
Piotr Dulikowski
0db45d1df5 repair: introduce abort_source for repair abort
Adds an abort_source for repair_tracker which is triggered when all
repairs are asked to be stopped. It is currently used by the "wait for
all hints to be sent" operation - now, it aborts when repairs are
requested to be aborted.
2021-04-27 16:16:57 +02:00
Piotr Dulikowski
3a2d09b644 repair: introduce abort_source for shutdown
Adds an abort_source to repair_tracker. The abort source is triggered
when the repair subsystem is shut down. Its purpose is to allow
operations such as waiting for hints to be sent to be able to abort
themselves.
2021-04-27 16:16:57 +02:00
Piotr Dulikowski
958a13577c storage_proxy: add abort_source to wait_for_hints_to_be_replayed
Now, the function wait_for_hints_to_be_replayed will take an abort
source and will stop when abort is requested.
2021-04-27 16:16:54 +02:00
Piotr Dulikowski
22e06ace2c storage_proxy: stop waiting for hints replay when node goes down
Now, when repair coordinator is waiting for hints to be replayed on some
remote node and that node goes down, it stops waiting for it.
2021-04-27 16:11:47 +02:00
Piotr Dulikowski
9d68824327 hints: dismiss segment waiters when hint queue can't send
When a hint queue becomes stuck due to not being able to send to its
destination (e.g. destination node is no longer UP, or we failed to send
some hints from a file), then it's better to immediately dismiss anybody
who waits for hint replay instead of letting them wait until timeout.
2021-04-27 15:58:15 +02:00
Piotr Dulikowski
49f4a2f968 repair: plug in waiting for hints to be sent before repair
Now, the repair coordinator will wait until hints are sent between
participating nodes before continuing with repair.
2021-04-27 15:58:11 +02:00
Piotr Dulikowski
d9ba743ba1 repair: add get_hosts_participating_in_repair
Adds the `get_hosts_participating_in_repair` function which returns a
list of hosts participating in repair.

This list will be used by repair coordinator to tell other nodes to wait
until they replay their hints towards the nodes from the list.
2021-04-27 15:32:03 +02:00
Piotr Dulikowski
46075af7c4 storage_proxy: coordinate waiting for hints to be sent
Adds a `wait_for_hints_to_be_replayed` function which waits until all
hints between specified endpoints are replayed.

For each node, a hint sync point is created. Then, the repair coordinator
waits until the hint sync point is reached on every node, or a timeout
occurs. This is done by querying each host participating in repair every
second to check whether the sync point is still there.
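
A rough sketch of such a polling loop, not the storage_proxy
implementation. The function name, the `sync_point_exists` callback and
the injectable `clock`/`sleep` parameters are assumptions made so the
example is testable.

```python
import time


def wait_for_sync_points(hosts, sync_point_exists, timeout_s,
                         poll_interval_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until no host reports its sync point, or raise TimeoutError."""
    deadline = clock() + timeout_s
    pending = set(hosts)
    while pending:
        # Keep only the hosts whose sync point is still present.
        pending = {h for h in pending if sync_point_exists(h)}
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"hints not replayed on: {sorted(pending)}")
        sleep(poll_interval_s)
    return True
```

A host drops out of `pending` as soon as its sync point disappears, so
later polls only contact hosts that still have hints to replay.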
2021-04-27 15:31:42 +02:00
Piotr Dulikowski
86d831b319 config: add wait_for_hint_replay_before_repair option
Adds the `wait_for_hint_replay_before_repair` configuration option. If
set to true, the repair coordinator will first wait until the cluster
replays its hints towards the nodes participating in repair. It is set
to true by default, and is live-updateable. It will be used in
subsequent commits from the same PR.
2021-04-27 15:16:26 +02:00
Piotr Dulikowski
485036ac33 storage_proxy: implement verbs for hint sync points
Implements HINT_QUEUE_MARK and HINT_QUEUE_SYNC verb handlers in
`storage_proxy`.
2021-04-27 15:06:39 +02:00
Piotr Dulikowski
82c419870a messaging_service: add verbs for hint sync points
Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK.
Those will make it possible to create a sync point and regularly poll
to check its existence.
2021-04-27 15:06:39 +02:00
Piotr Dulikowski
244738b0d5 storage_proxy: add functions for syncing with hints queue
Adds two methods to `storage_proxy`:

- `create_hint_queue_sync_point` - creates a "hint sync point" which
  is kept present in storage_proxy until all hint queues on the local
  node reach their current end. It will also disappear if the given
  deadline is reached first.
- `check_hint_queue_sync_point` - checks if given hint sync point still
  exists.

The created sync point waits for hint queues in all hint managers, on
all shards.
2021-04-27 15:06:39 +02:00
Peter Veentjer
c255903fb0 dist: Added r5b to ena instance_class.
The r5b instances also have ena support. For a confirmation
that all r5b instances have ena, go to the following page:

https://instances.vantage.sh/

Select the r5b and add the 'enhanced networking' column. Then
it will show that for every r5b type there is ena support

Closes #8546
2021-04-27 15:39:24 +03:00
Nadav Har'El
6de04bbed5 Merge 'Forward-port service level fixes' from Piotr Sarna
The original series which forward-ported the service levels into open-source omitted important fixes to their infrastructure. The fixes are hereby ported.

Tests: unit(release)

Closes #8540

* github.com:scylladb/scylla:
  workload prioritization: Fix configuration change detection
  workload prioritization: add exception protection in configuration polling
2021-04-27 13:40:21 +03:00
Eliran Sinvani
02d37cb133 workload prioritization: Fix configuration change detection
The configuration change detection is based on a loop that
advances two iterators and compares the two collections to
deduce the configuration change. To deduce the changes
correctly, both collections have to be iterated in key
(service level name) order. If they are not, the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map, which does not
guarantee the ordering that the change detection assumes.
The fix was simply to change the field type to a map
instead of unordered_map.

Another problem is that when a static service level (i.e.
default) is at the end of the keys list, it is repeatedly
deleted. This doesn't really do anything, since deleting
a static service level just retains its default values,
but it is still wrong.
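
To illustrate why the ordering matters, here is a sketch of a
two-iterator diff in Python. It is not the service level controller
code; `diff_sorted` and the sample service level names are hypothetical.

```python
def diff_sorted(old_items, new_items):
    """Two-pointer diff of (name, value) lists that are sorted by name."""
    added, removed, changed = [], [], []
    i = j = 0
    while i < len(old_items) and j < len(new_items):
        old_name, old_val = old_items[i]
        new_name, new_val = new_items[j]
        if old_name == new_name:
            if old_val != new_val:
                changed.append(old_name)
            i += 1
            j += 1
        elif old_name < new_name:
            removed.append(old_name)  # present only in the old collection
            i += 1
        else:
            added.append(new_name)  # present only in the new collection
            j += 1
    removed.extend(name for name, _ in old_items[i:])
    added.extend(name for name, _ in new_items[j:])
    return added, removed, changed


old = {"gold": 100, "silver": 50}
new = {"bronze": 10, "gold": 100}

# Sorted inputs give the expected diff.
good = diff_sorted(sorted(old.items()), sorted(new.items()))

# Feeding the old collection in a non-sorted order breaks the assumption:
# "gold" is reported as both added and removed.
bad = diff_sorted([("silver", 50), ("gold", 100)], sorted(new.items()))
```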
2021-04-27 12:29:31 +02:00
Eliran Sinvani
946fc6af08 workload prioritization: add exception protection in configuration polling
Exceptions around the loop polling were not handled properly.
This is an issue because if an unhandled exception
slips out to the configuration polling loop itself, it will break
it. When the configuration polling loop is broken, any further
change to the configuration will not be acted upon on the nodes
where the loop is broken until the node is restarted. The chances
for exceptions are now greater than before since in one of the
previous commits we started querying the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging, mainly around the configuration polling loop.

Most logs are added in the info level since they are not expected to
happen frequently but when they do we would like to have some
information by default regarding what broke the loop.
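
The general shape of such protection, as a Python sketch rather than the
actual C++ change: each iteration is wrapped so that a failure is logged
and the loop keeps going. `poll_configuration`, `fetch` and `apply` are
hypothetical names, and the loop is bounded only to keep the example
finite.

```python
import logging

logger = logging.getLogger("workload_prioritization_sketch")


def poll_configuration(fetch, apply, iterations):
    """Run a bounded polling loop; a failed iteration is logged, not fatal."""
    failures = 0
    for _ in range(iterations):
        try:
            apply(fetch())
        except Exception:
            failures += 1
            # Logged at info level, matching the reasoning above.
            logger.info("configuration poll failed; retrying on next tick",
                        exc_info=True)
    return failures
```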
2021-04-27 12:29:31 +02:00
Avi Kivity
7a6b678044 Update tools/java submodule for EveryWhere compaction strategy
* tools/java 57eb143119...fd92603b99 (1):
  > Add EverywhereStrategy class
2021-04-27 12:23:23 +03:00
Nadav Har'El
f50db50d10 test/cql-pytest: test for "WHERE v=NULL" in restrictions
Issues #4476 and #8489, and also Cassandra's CASSANDRA-10715, all request
that filtering with "WHERE v=NULL" should return the rows where the column
v is unset. However, we made a deliberate decision to do something else:
That "WHERE v=NULL" should match no row. Exactly like it does in SQL.
This is what this test verifies - that "WHERE v=NULL" never matches any
row - not even rows where "v" is unset.

This test is expected to fail on Cassandra (so marked cassandra_bug),
because in Cassandra the "WHERE v=NULL" restriction is forbidden,
instead of succeeding and returning nothing.

Although we differ here from Cassandra, after a lot of deliberation we
decided that Scylla's behavior is the correct one, so this test verifies
it.
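
The intended semantics can be modelled in a few lines of Python, with
None standing in for both NULL and an unset column. This is only an
illustration of the rule, not the cql-pytest test itself, and the names
below are hypothetical.

```python
def eq_restriction(cell, literal):
    """SQL-style equality: comparing against NULL (None) is never true."""
    if cell is None or literal is None:
        return False
    return cell == literal


rows = [{"p": 1, "v": 7}, {"p": 2, "v": None}]


def select_where_v_equals(literal):
    """Return the keys of rows matching "WHERE v = literal"."""
    return [r["p"] for r in rows if eq_restriction(r["v"], literal)]
```

Here `select_where_v_equals(None)` matches nothing, including the row
whose `v` is unset.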

Refs #4776.
Refs #8489.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210426183145.323301-1-nyh@scylladb.com>
2021-04-27 09:26:33 +03:00
Dejan Mircevski
e2c8ff6bf2 gdb: Fix heapprof() dereferencing of backtrace
Seastar seems to have added another layer of indirection to
alloc_site_list_head/backtrace, so scylla_heapprof() can't find the
members it's looking for, resulting in errors.  Fix it by
dereferencing the added layer.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8551
2021-04-27 01:49:26 +02:00
Kamil Braun
4c95277619 raft: fsm: fix assertion failure on stray rejects
When probes are sent over a slow network, the leader would send
multiple probes to a lagging follower before it would get a
reject response to the first probe back. After getting a reject, the
leader will be able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out of order reject
to a now irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.

We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.
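
As a sketch, the two checks named above could be written as follows. The
real `is_stray_reject` has further state-dependent logic that is not
reproduced here, and combining the two conditions with "or" is an
assumption of this example.

```python
def is_stray_reject(non_matching_idx, last_idx, match_idx):
    """Return True if a reject can be ignored given the follower's match_idx.

    Covers only the two unconditional checks from the commit message.
    """
    return non_matching_idx <= match_idx or last_idx < match_idx
```

For example, with `match_idx` at 5, a reject carrying
`non_matching_idx=3` is treated as stray, so the leader never tries to
rewind `match_idx`.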

The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives.  We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
2021-04-27 01:07:22 +02:00
Pavel Solodovnikov
fba1910770 raft: fix incorrect rpc setup in server_impl::start()
RPC configuration was updated only when an instance was
started with an initial snapshot.

In case we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.

Fix that by moving RPC setup code outside the check for
snapshot id and look at `_log.get_configuration()` instead.

Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.

For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
2021-04-26 20:46:50 +02:00
Nadav Har'El
f17de6ca45 test/cql-pytest: test that "!=" not supported in WHERE
Our documentation of SELECT https://docs.scylladb.com//getting-started/dml
suggests that, just as a "=" operator exists, there is also a "!=" operator.

However, this is not true: The != operator (which is recognized by the
parser) is not allowed in WHERE clauses. This test verifies that this
is indeed the case - neither Cassandra nor Scylla allow this operator
in WHERE clauses.

Refs https://github.com/scylladb/scylla-doc-issues/issues/732

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210426165511.318066-1-nyh@scylladb.com>
2021-04-26 20:23:21 +03:00
Piotr Sarna
8aaa3a7bb8 Merge 'reader_permit: always forward resources to the semaphore' from Botond
This series is a conceptual revert of 4c8ab10, which turned out to be a
misguided defense mechanism that proved to be a hotbed for bugs. This
protection was superseded by 0fe75571d9 which guarantees forward
progress at all times without all the gotchas and bad interactions
introduced by 4c8ab10.
The latest instance of bad interaction that triggered this series is a
case of resource units being leaked when a previously evicted reader is
re-admitted, leaking already owned resources on each re-admission.
To prove that neither the resource leak, nor the deadlock 4c8ab10 was
supposed to guard against exists after this series, it includes two unit
tests stressing the respective areas: readmission and admission on a
highly contested semaphore.

Fixes: #8493

Also on: https://github.com/denesb/scylla.git
reader-permit-resource-leak-v2

Changelog

v2:
* Rebase over the recently merged reader close series. Fix merge
  conflicts and an exposed bug.

* 'reader-permit-resource-leak-v2' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
  test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
  reader_concurrency_semaphore: add dump_diagnostics()
  reader_permit: always forward resources
  reader_concurrency_semaphore: inactive_read_handle: abandon(): close reader
2021-04-26 16:30:18 +02:00
Nadav Har'El
732fc9ba00 Merge 'Add username to alternator tracing' from Piotr Sarna
This series fills in the `username` column in alternator tracing info when the username is available. When alternator is enforcing authorization, each request contains a username in its headers.

The difference is as follows. A tracing entry excerpt before the series:
```
{
	(...)
	'source_ip': '::',
	'table_names': 'alternator_Pets.Pets',
	'username': '<unauthenticated request>'
}
```
and after the series:
```
{
	(...)
	'source_ip': '::',
	'table_names': 'alternator_Pets.Pets',
	'username': 'alternator'
}
```
This series also modifies one of the tests to check the username column.

Fixes #8547

Closes #8548

* github.com:scylladb/scylla:
  test: add username verification to alternator tracing tests
  alternator: add user context to tracing
  alternator: return username when verifying signature
2021-04-26 16:30:15 +02:00
Botond Dénes
45d580f056 test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kinds of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.

The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
2021-04-26 15:57:17 +03:00
Botond Dénes
cadc26de38 test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
readmitting a previously admitted reader doesn't leak any units.
2021-04-26 15:57:17 +03:00
Botond Dénes
d246e2df0a reader_concurrency_semaphore: add dump_diagnostics()
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.
2021-04-26 15:56:56 +03:00
Botond Dénes
caaa8ef59a reader_permit: always forward resources
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resources by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
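
A simplified model of the admission rule described above, assuming
resources are just a (count, memory) pair. `Resources` and `may_admit`
are illustrative names, and the real semaphore has more conditions than
this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    count: int
    memory: int


def may_admit(initial, available, needed):
    """Decide whether a read asking for `needed` can be admitted."""
    enough = (available.count >= needed.count
              and available.memory >= needed.memory)
    # Forward-progress rule: if every count unit is free, no admitted
    # reader is alive, so admit regardless of memory.
    no_admitted_reader = available.count == initial.count
    return enough or no_admitted_reader
```

Because memory-only permits now forward their consumption to the
semaphore, available memory can be exhausted while all count units are
still free. The second condition is what keeps admission from stalling
in that case.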
2021-04-26 15:56:56 +03:00
Botond Dénes
2b66f7222e reader_concurrency_semaphore: inactive_read_handle: abandon(): close reader
fa43d7680 recently introduced mandatory closing of readers before they
are destroyed. One reader destroy path that was left not closing the
reader before destruction is `inactive_read_handle::abandon()`. This
path is executed when the handle is destroyed while still referring to a
non-evicted inactive read. This patch fixes it up to close the reader
and adds a small unit test which checks that this happens.
2021-04-26 15:56:54 +03:00
Piotr Dulikowski
427bbf6d86 db/hints: make it possible to wait until current hints are sent
Implements `wait_until_hints_are_replayed` method returning a future
which blocks until either all current hint segments are replayed
(returns success in this case), or when provided timeout is reached
(returns a timeout exception in this case).
2021-04-26 13:57:03 +02:00
Piotr Sarna
0779fa8428 test: add username verification to alternator tracing tests
The test case now additionally checks if the username entry from
found tracing events matches the username used by the test suite.
2021-04-26 11:54:02 +02:00
Piotr Sarna
1b400b07b9 alternator: add user context to tracing
Before this patch, each entry in alternator tracing included
an "<unauthenticated request>" field. That's not really true,
because most alternator requests are actually performed
by authenticated users (unless auth is disabled).
2021-04-26 11:54:01 +02:00
Piotr Sarna
ddd9c2f2d7 alternator: return username when verifying signature
The username will be used later for tracing purposes.
It will also very likely be useful later when we decide to add
ACL support.
2021-04-26 11:53:19 +02:00
Avi Kivity
5801c93715 utils: rjson: convert enable_if to concept
Simpler and easier to understand. Vague comment about enable_if
removed.

Closes #8405
2021-04-25 21:53:46 +03:00
Botond Dénes
f7f5fca5a8 Add very basic coverage report generation support
This patch introduces the most basic bare infrastructure to generate
coverage report as well as a guide on how to manually generate them.
Although this barely qualifies as "support", it already allows one to
generate a coverage report with the help of this guide.

One immediate limitation of this patch is that it only supports clang,
which is not a terrible problem, given that it's our main compiler
currently.

Future patches will build on this to incrementally improve and
automate this:
* Provide script to automatically merge profraw files and generate html
  report from it.
* Integrate into test.py, adding a flag which causes it to generate
  a coverage report after a run.
* Support GCC too, but at least auto-detect whether clang is used or
  not.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210423140100.659452-1-bdenes@scylladb.com>
2021-04-25 15:59:20 +03:00
Avi Kivity
fa43d7680c Merge "Close flat mutation readers" from Benny
"
This patchset adds future-returning close methods to all
flat_mutation_reader-s and makes sure that all readers
are explicitly closed and waited for.

The main motivation for doing so is for providing a path
for cancelling outstanding i/o requests via the input_stream
close (See https://github.com/scylladb/seastar/issues/859)
and wait until they complete.

This series also introduces a stop
method to reader_concurrency_semaphore to be used when
shutting down the database, instead of calling
clear_inactive_readers in the database destructor.

The series does not change microbenchmarks performance in a significant way.
It looks like the results are within the tests' jitter.

- perf_simple_query: (in transactions per second, more is better)
before: median 184701.83 tps (90 allocs/op, 20 tasks/op)
after:  median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%)

- perf_mutation_readers: (in time per iteration, less is better)
combined.one_row                               65.042ns  -> 57.961ns  (-10.9%)
combined.single_active                         46.634us  -> 46.216us  ( -0.9%)
combined.many_overlapping                      364.752us -> 371.507us ( +1.9%)
combined.disjoint_interleaved                  43.634us  -> 43.448us  ( -0.4%)
combined.disjoint_ranges                       43.011us  -> 42.991us  ( -0.0%)
combined.overlapping_partitions_disjoint_rows  57.609us  -> 58.820us  ( +2.1%)
clustering_combined.ranges_generic             93.464ns  -> 96.236ns  ( +3.0%)
clustering_combined.ranges_specialized         86.537ns  -> 87.645ns  ( +1.3%)
memtable.one_partition_one_row                 903.546ns -> 957.639ns ( +6.0%)
memtable.one_partition_many_rows               6.474us   -> 6.444us   ( -0.5%)
memtable.one_large_partition                   905.593us -> 878.271us ( -3.0%)
memtable.many_partitions_one_row               13.815us  -> 14.718us  ( +6.5%)
memtable.many_partitions_many_rows             161.250us -> 158.590us ( -1.6%)
memtable.many_large_partitions                 24.237ms  -> 23.348ms  ( -3.7%)
average                                        -0.02%

Fixes #1076
Refs #2927

Test: unit(release, debug)
Perf: perf_mutation_readers, perf_simple_query (release)
Dtest: next-gating(release),
       materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug)
"

* tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits)
  mutation_reader: shard_reader: get rid of stop
  mutation_reader: multishard_combining_reader: get rid of destructor
  flat_mutation_reader: abort if not closed before destroyed
  flat_mutation_reader: require close
  repair: row_level_repair: run: close repair_meta when done
  repair: repair_reader: close underlying reader on_end_of_stream
  perf: everywhere: close flat_mutation_reader when done
  test: everywhere: close flat_mutation_reader when done
  mutation_partition: counter_write_query: close reader when done
  index: built_indexes_reader: implement close
  mutation_writer: multishard_writer: close readers when done
  mutation_writer: feed_writer: close reader when done
  table: for_all_partitions_slow: close iteration_step reader when done
  view_builder: stop: close all build_step readers
  stream_transfer_task: execute: close send_info reader when done
  view_update_generator: start: close staging_sstable_reader when done
  view: build_progress_virtual_reader: implement close method
  view: generate_view_updates: close builder readers when done
  view_builder: initialize_reader_at_current_token: close reader before reassigning it
  view_builder: do_build_step: close build_step reader when done
  ...
2021-04-25 13:53:11 +03:00
Avi Kivity
54b76e82bc Merge "Make migration manager main-local" from Pavel
"
There are a few places left that call for migration manager
by global reference. This set patches all those places
and makes the migration manager a service that locally
lives in main(). Surprisingly, the largest changes are to
get rid of global migration manager calls from ... the
migration manager itself.

Two tricks here. First, repair code gets its private global
migration manager pointer. That's not nice, but it is aligned
with the current repair design -- all its references are now
"global". Some day they all will be moved into sharded
repair service, for now these globals just describe the real
dependencies of the repair code.

Second is storage proxy that needs to call migration manager
to get schema. Proper layering makes migration manager sit
on top of storage proxy, so the direct back-reference is
not nice. To overcome this the proxy gets migration manager's
shared_from_this() pointer and drops all of them on stop.
This makes sure that by the time migration manager stops
no references from proxy exist.

tests: unit(dev), start-stop, start-drain-stop
"

* 'br-turn-migration-manager-local' of https://github.com/xemul/scylla: (21 commits)
  migration_manager: Make it main-local
  tests: Have own migration manager instances
  tests: Use migration_manager from cql_test_env
  migration_manager: Call maybe_sync from this
  migration_manager: Make get_schema_for_... methods
  migration_manager: Hide get_schema_definition
  streaming: Keep migration_manager ptr in rpc lambdas
  storage_proxy: Keep migration_manager ptr in rpc lambdas
  streaming: Get migration_manager shared_ptr in messaging
  storage_proxy: Get migration_manager shared_ptr in messaging
  migration_manager: Make maybe_sync a method
  migration_manager: Open-code merge lambda
  migration_manager: Turn do_announce_new_type non-static
  migration_manager: Make announce() non-static method
  storage_servive: Use local migration manager
  storage_service: Keep migration manager on board
  migration_manager: Use 'this' where appropriate
  repair: Use private migration manager pointer
  repair: Keep private sharded migration manager pointer
  redis: Carry sharded migration manager over init
  ...
2021-04-25 13:29:16 +03:00
Nadav Har'El
de938eba8c Reduce dependency on header utils/rjson.hh
If utils/rjson.hh is modified, 300 (!) source files get recompiled.
This is frustrating for anyone working on this header file (like me).
Moreover - utils/rjson.hh includes the large rapidjson header
files (rapidjson is a header-only library!), slowing the compilation
of all these 300 files.

It turns out most includers of utils/rjson.hh get it because
column_computation.hh includes it. But the fact that column
computations are serialized as JSON is an internal implementation
detail that the users of this header don't need to know - and they
care even less that this JSON implementation uses utils/rjson.hh.

So in this patch column_computation.hh no longer includes rjson.hh,
and no longer exposes a method taking a rjson::value that was never
used outside the implementation.

After this patch, touching utils/rjson.hh only recompiles 21 files.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210422183526.114366-1-nyh@scylladb.com>
2021-04-25 13:20:51 +03:00
Benny Halevy
5ca8f28297 storage_service: load_new_sstables: log success message as info, not warning
Success is important, but nothing to be warned about.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210425070909.476226-1-bhalevy@scylladb.com>
2021-04-25 12:39:47 +03:00
Benny Halevy
6e62ec8c24 mutation_reader: shard_reader: get rid of stop
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
fc5e4688db mutation_reader: multishard_combining_reader: get rid of destructor
Now that the multishard_combining_reader is guaranteed to be closed
there is no longer a need for stopping the shard readers
in the destructor.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
b134640829 flat_mutation_reader: abort if not closed before destroyed
The motivation to abort if the reader is not closed before it's destroyed
is mainly driven by:
1. Aborting will force us to find and fix missing closes.
Otherwise, log warnings can easily be lost in the noise.
(ERRORs however are caught by dtests but won't necessarily be
caught in SCT / production environments)

2. Following patches remove existing cleanup code
in destructors that is not needed once close() is mandated.
If we don't abort on missing close we'll have to keep maintaining
both cleanup paths forever.

3. Not enforcing close exposes us to leaks and potential
use-after-free from background tasks that are left behind.
We want to stop guaranteeing the safety of the background tasks
post close().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
5b22731f9a flat_mutation_reader: require close
Make flat_mutation_reader::impl::close pure virtual
so that all implementations are required to implement it.

With that, provide a trivial implementation to
all implementations that currently use the default,
trivial close implementation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
77cc694a08 repair: row_level_repair: run: close repair_meta when done
Always close repair_meta (that closes its reader).

Proper closing is done via the repair_meta::stop path.

Ignore any errors when auto-closing in a deferred action
since there is nothing else we can do at this point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
0c19f788e5 repair: repair_reader: close underlying reader on_end_of_stream
Need to close the reader before reassigning it with
an empty f_m_r.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
391f942b2a perf: everywhere: close flat_mutation_reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
aa5289f255 test: everywhere: close flat_mutation_reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7427b60caf mutation_partition: counter_write_query: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2fa8b3b84e index: built_indexes_reader: implement close
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
e453f890f2 mutation_writer: multishard_writer: close readers when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
64c5b7fda6 mutation_writer: feed_writer: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
825acd4031 table: for_all_partitions_slow: close iteration_step reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
dad6c94476 view_builder: stop: close all build_step readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
6082d854f9 stream_transfer_task: execute: close send_info reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
02d74e1530 view_update_generator: start: close staging_sstable_reader when done
The staging_sstable_reader has to be closed before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
1e1c8ea824 view: build_progress_virtual_reader: implement close method
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2d8b00f2d8 view: generate_view_updates: close builder readers when done
Make sure to close the builder's _updates and optional _existings
readers before they are destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
652ba714fe view_builder: initialize_reader_at_current_token: close reader before reassigning it
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7093610931 view_builder: do_build_step: close build_step reader when done
Make sure to close the build_step reader before destroying it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
51c96d405d mutation_reader: evictable_reader: fill_buffer: make sure to close the reader
If reader.fill_buffer() fails, we will not call `maybe_pause` and the
reader will be destroyed, so make sure to close it.

Otherwise, the reader is std::move'ed to `maybe_pause` that is either
paused using register_inactive_read or further std::move'ed to _reader,
in both cases it doesn't need to be closed.  `with_closeable`
can safely try to close the moved-from reader and do nothing in this
case, as the f_m_r::impl was already moved away.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7c7569f0ad querier_cache: implement stop
Close the _closing_gate to wait on background
close of dropped queries, and close all remaining queriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
87c62b5f59 test: querier_cache_test: close looked up querier
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
3e7075a739 compaction: setup: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
90a7a8ff0e compaction: close reader when done consuming
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
07f34b4a32 querier_cache: lookup_querier: close the querier before dropping it
Make sure to close the dropped querier before it's destroyed.
The operation is moved to the background so as not to penalize
the common path.

A following patch will add a querier_cache::close() method
that will close _closing_gate to wait on the querier close
(among other things it needs to wait on :)).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
4a0abc7b9c querier_cache: lookup_querier: define as a private method
In preparation to closing the querier in the background
before dropping it.

With that, stats need not be passed as a parameter,
but rather the _stats member can be used directly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
fa6d6c17f2 mutation_partition: mark query_result_builder constructor noexcept
It is trivially so.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
320cb67b08 table: query, mutation_query: close querier when done
Make sure to close the querier and subsequently its reader before
destroying it (unless it was moved).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
8b8c721431 querier: add close method
Depending on the contents of the _reader variant, either close
the reader or unregister the inactive reader and close it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
3f00c21481 querier_cache: evict: close evicted reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
0d8d56c36f querier: coroutinize evict methods
Instead of calling a lambda function for each index
simply iterate over all indices and use co_await / co_return
in the inner loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2f9cf01aa7 querier_cache: futurize evict api
Prepare for futurizing the lower-level inactive reads api.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
57f921de4f database: streaming_reader_lifecycle_policy: destroy_reader: close inactive reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
29b2b1f8dd reader_lifecycle_policy: close inactive_read
Make sure to close the unregistered inactive_read
before it's destroyed, if the unregistered reader_opt
is engaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
a144819683 reader_concurrency_semaphore: unregister_inactive_read: close reader also on internal error
"forward" the unregister to the other semaphore
in case on_internal_error throws rather than aborting.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
c8e30db5db reader_concurrency_semaphore: close evicted reader
Close readers in the background:
- evicted based on ttl, or
- those that weren't admitted by register_inactive_read
- those that are destroyed in clear_inactive_reads.

Use a gate for waiting on these background closes in stop().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
be1cafc1a5 reader_concurrency_semaphore: do_wait_admission: close evicted readers
enqueue_waiter before evicting readers and start a loop in the
background to dequeue and close inactive_readers until
either the _wait_list is empty or there are no more inactive_readers
to evict.

We admit the read synchronously only if the wait_list is empty
and the semaphore has_available_units to satisfy admission.

We need to enqueue the reader before starting to evict readers
to make sure any evicted resources are assigned to the
waiter at the head of the queue and not "stolen"
in case we yield and some other caller grabs them.
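
The ordering argument above can be illustrated with a small synchronous
Python sketch. It is only a toy model under stated assumptions, not the
actual reader_concurrency_semaphore code: `ToyAdmission` and all of its
members are hypothetical names, resources are a single integer count, and
yielding is not modeled at all. It only shows that a new read is enqueued
before eviction starts and that freed units always go to the head of the
wait list.

```python
from collections import deque


class ToyAdmission:
    """Hypothetical toy model of admission with eviction of inactive readers."""

    def __init__(self, units):
        self.available = units
        self.wait_list = deque()  # (name, needed) in arrival order
        self.inactive = deque()   # units held by evictable inactive readers
        self.admitted = []        # names of admitted reads, in admission order

    def _admit_from_head(self):
        # Hand out units strictly in queue order, starting at the head.
        while self.wait_list and self.wait_list[0][1] <= self.available:
            name, needed = self.wait_list.popleft()
            self.available -= needed
            self.admitted.append(name)

    def wait_admission(self, name, needed):
        # Admit synchronously only if nobody is waiting and units suffice.
        if not self.wait_list and needed <= self.available:
            self.available -= needed
            self.admitted.append(name)
            return
        # Enqueue first, so that evicted units go to the head of the queue.
        self.wait_list.append((name, needed))
        # Evict until the wait list is empty or nothing is left to evict.
        while self.wait_list and self.inactive:
            self.available += self.inactive.popleft()
            self._admit_from_head()
```

In this model a later read that would fit into the available units is
still queued behind earlier waiters, so it cannot take units that were
freed on behalf of the waiter at the head of the queue.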

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
43bf0f9356 reader_concurrency_semaphore: add stop method
In addition to clear_inactive_reads, that's currently called when
the database object is destroyed, introduce a stop() method that will:
1. wait on all background closes of inactive_reads.
2. close all present inactive_reads and waits on their close.
3. signal waiters on the wait_list via broken() with a proper
   exception indicating that the semaphore was closed.

In addition, assert in the semaphore's destructor
that it has no remaining inactive reads.

Stop must be called from whoever owns the r_c_s.
Mainly, from database::stop.
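
The three steps of stop() can be sketched roughly as follows, using
Python asyncio instead of Seastar. This is an illustration under
assumptions, not the real implementation: `ToySemaphore`, `ToyReader`
and `SemaphoreStopped` are hypothetical names, and a plain set of tasks
stands in for the gate that tracks background closes.

```python
import asyncio


class SemaphoreStopped(Exception):
    """Hypothetical exception delivered to waiters when the semaphore stops."""


class ToyReader:
    """Hypothetical reader whose close() is asynchronous."""

    def __init__(self):
        self.closed = False

    async def close(self):
        await asyncio.sleep(0)  # pretend to wait for background work
        self.closed = True


class ToySemaphore:
    def __init__(self):
        self._background = set()  # in-flight background close tasks
        self._inactive = []       # inactive reads that are still registered
        self._waiters = []        # futures of reads waiting for admission

    def register_inactive(self, reader):
        self._inactive.append(reader)

    def wait_admission(self):
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return fut

    def close_in_background(self, reader):
        # Track the task so that stop() can wait for it later.
        task = asyncio.create_task(reader.close())
        self._background.add(task)
        task.add_done_callback(self._background.discard)

    async def stop(self):
        # 1. Wait for background closes that are already in flight.
        pending = list(self._background)
        if pending:
            await asyncio.gather(*pending)
        # 2. Close the inactive reads that are still present.
        readers, self._inactive = self._inactive, []
        for reader in readers:
            await reader.close()
        # 3. Fail the remaining waiters with a descriptive exception.
        waiters, self._waiters = self._waiters, []
        for fut in waiters:
            if not fut.done():
                fut.set_exception(SemaphoreStopped("semaphore stopped"))
```

After stop() returns in this sketch, no reader is left unclosed and
every waiter observes `SemaphoreStopped` instead of hanging.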

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2f4134e1cc reader_concurrency_semaphore: broken: make broken_semaphore the default exception
Rather than explicitly generating it by all callers
and then not using the argument at all.

Prepare for providing a different exception_ptr
from a stop() path to be introduced in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
cd0991f28d multishard_mutation_query: read_context::stop: properly close unregistered inactive_reads
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
93d6dcdbcf multishard_mutation_query: read_context: stop: wait on unregistering inactive reads
Currently unregister_inactive_read for other shards is moved
to the background with nothing to keep the respective
reader_concurrency_semaphore around.

This change runs the loop in parallel_for_each
so that we don't have to serially wait on all of them
but rather they can run in parallel on all shards, but
all are waited on via the returned future<>.
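
As a rough analogy only (Python asyncio rather than Seastar's
parallel_for_each), the sketch below starts the per-shard work
concurrently and returns only after all of it has finished.
`unregister_on_shard` and `stop` are hypothetical names and the sleep
merely simulates cross-shard latency.

```python
import asyncio


async def unregister_on_shard(shard, log):
    """Hypothetical per-shard cleanup; records the shard when it is done."""
    await asyncio.sleep(0.01)  # simulate cross-shard latency
    log.append(shard)


async def stop(shards, log):
    # Start the cleanup on all shards concurrently and wait for every one
    # of them before returning, instead of leaving them in the background.
    await asyncio.gather(*(unregister_on_shard(s, log) for s in shards))
```

The caller gets a single awaitable that completes once every shard is
done, which is the property the commit message describes.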

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
8421e1f61e mutlishard_mutation_query: read_context: close: unregister all inactive reads
Currently only if the reader_meta is in the saved state
we unregister its inactive_read, yet it is possible
that it will hold an inactive_read also in the lookup state.

To cover all cases, rather than testing the reader_state,
unregister if the inactive_read_handle is engaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
53889ef9b0 multishard_mutation_query: read_page: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
afa2fe0b76 multishard_mutation_query: read_page: make compaction_state first
To simplify error handling for always closing the reader
in this function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
e2a767bef7 multishard_mutation_query: page_consume_result: mark constructor noexcept
As it can't throw. This is needed to simplify the following
patch that will always close the reader in read_page.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
a3f9dc6e0b mutation_reader: multishard_combining_reader: implement close
Close all underlying shard readers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
58b1da8cf5 mutation_reader: shard_reader: implement close
return reader lifecycle policy's destroy_reader future
so it can be waited on by caller (multishard_combining_reader::close).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2c1edb1a94 mutation_reader: reader_lifecycle_policy: return future from destroy_reader
So we can wait on it from to-be-introduced shard_reader::close().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
bfe56fd99c mutation_reader: shard_reader: get rid of _stopped
It's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
e1ec401bb6 mutation_reader: evictable_reader: implement close
If there's an active reader then close it, else,
try to resume the paused reader, and close it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
84206501ae mutation_reader: foreign_reader: wait for readahead and close underlying reader
Move the logic in ~foreign_reader to close()
to wait on the read_ahead future and close the underlying
reader on the remote shard.  Still call close in the background
in ~foreign_reader if destroyed without closing to keep the current
behavior, but warn about it, until it's proved to be unneeded.

Also, added on_internal_error in close if _read_ahead_future
is engaged but _reader is not, since this must never happen
and we wait on the _read_ahead_future without the _reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
ea3f2a6536 mutation_reader: restricting_mutation_reader: close underlying reader
If a reader was admitted, close it in close().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
f9daceda87 test: mutation_reader_test: multi_partition_reader: close underlying readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
db66a39b3e test: row_cache_test: close readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
f9ae50483f mutation_reader: merging_reader: close underlying merger
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
dccdbdff95 mutation_reader: mutation_fragment_merger: close underlying producer
This will be needed by the merging_reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
761a38ce21 mutation_reader: mutation_reader_merger: make sure to close underlying readers
These will be called by merging_reader::close via
mutation_fragment_merger::close in the following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7d42a71310 mutation_reader: position_reader_queue: add close method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
b140ea6df2 mutation_reader: compacting_reader: implement close
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
32ab957f82 mutation_reader: filtering_reader: implement close method
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
38e48bb462 size_estimates_reader: close partition_reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
13dfc41d8c row_cache: cache_flat_mutation_reader: close underlying readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
0a2670c9ec row_cache: hold read_context as unique_ptr
Such that the holder, that is responsible for closing the
read_context before destroying it, holds it uniquely.

cache_flat_mutation_reader may be constructed either
with a read_context&, where it knows that the read_context
is owned externally, by the caller, or it could
be constructed with a std::unique_ptr<read_context> in
which case it assumes ownership of the read_context
and it is now responsible for closing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
8531eaaacf row_cache: make_reader: make read_context only when needed
So we can have better control on who's responsible to close it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
9944586480 row_cache: make_reader: use range directly
Not via ctx, so we can delay the making of the read_context,
as needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
4c969756ac row_cache: scanning_and_populating_reader: make sure to close underlying readers
Note that scanning_and_populating_reader::read_next_partition
now closes the current reader unconditionally
and before assigning a new reader.  This should be an improvement
since we want to release the reader resources as early
as possible, certainly before allocating new resources.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
e34ed3d3e4 row_cache: range_populating_reader: add close method
To close the underlying _reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
c707ff27a4 row_cache: single_partition_populating_reader: add close method
To close the optional underlying _reader and _read_context.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
63522361f2 row_cache: read_context: add close method
To close the underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
4b0fcc7d99 row_cache: autoupdating_underlying_reader: add close method
To close the underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
3853d7a376 row_cache: autoupdating_underlying_reader: close reader before updating it
use the newly introduced reassign method to first
close the flat_mutation_reader_opt before assigning it with
a new reader.
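
A minimal Python asyncio sketch of the close-before-reassign idea
follows. It is not the actual reassign method: `ReaderSlot` and
`ToyReader` are hypothetical names, and the slot simply models an
optional reader that may or may not be engaged.

```python
import asyncio


class ToyReader:
    """Hypothetical reader that records whether it was closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    async def close(self):
        await asyncio.sleep(0)  # pretend to wait for background work
        self.closed = True


class ReaderSlot:
    """Hypothetical holder of an optional reader."""

    def __init__(self):
        self.reader = None

    async def reassign(self, new_reader):
        # Close the current reader, if any, before installing the new one,
        # so that the old reader is never dropped while still open.
        old, self.reader = self.reader, None
        if old is not None:
            await old.close()
        self.reader = new_reader
```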

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
593bc9806d memtable: memtable_snapshot_source: make sure to close readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
574759bf95 memtable: flush_reader: make sure to close partition reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
b13f6e817c test: row_cache_stress_test: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
93b5d7d4c2 memtable: scanning_reader: make sure to close underlying reader
Close _delegate if it's engaged both in the close() method
and whenever it is currently reset by _delegate = {}.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
efe938cf1f flat_mutation_reader: make sure to close reader passed to read_mutation_from_flat_mutation_reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
4b8dc7ac7e flat_mutation_reader: make sure to close flat_mutation_reader_from_mutations
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:25:47 +03:00
Benny Halevy
0da2eea211 flat_mutation_reader: flat_multi_range_mutation_reader: close underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
18268ab474 flat_mutation_reader: forwardable_empty_mutation_reader: close optional underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
e2e642b1b1 flat_mutation_reader: make_forwardable, make_nonforwardable: close underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
978501c336 flat_mutation_reader: partition_reversing_mutation_reader: implement no-op close
We don't own _source therefore do not close it.
That said, we still need to make sure that the reversing reader
itself is closed to calm down the check when it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
f4dfaaa6c9 flat_mutation_reader: delegating_reader: close reader when moved to it
The underlying reader is owned by the caller if it is moved to it,
but not if it was constructed with a reference to the underlying reader.
Close the underlying reader on close() only in the former case.
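
The ownership rule can be sketched in Python asyncio as below. This is
an illustration under assumptions rather than the real delegating_reader:
`DelegatingReader`, `ToyReader` and the `owned` flag are hypothetical
names, with the flag standing in for "constructed by move" versus
"constructed with a reference".

```python
import asyncio


class ToyReader:
    """Hypothetical underlying reader."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class DelegatingReader:
    """Hypothetical wrapper that may or may not own its underlying reader."""

    def __init__(self, underlying, owned):
        self._underlying = underlying
        self._owned = owned  # True when the underlying reader was moved in

    async def close(self):
        # Only close what we own; a borrowed reader is closed by its owner.
        if self._owned:
            await self._underlying.close()
```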

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
0e0edef8d8 flat_mutation_reader: transforming_reader: close underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
3c05529329 sstables: scrub_compaction: reader: close underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
75eed563bc sstables: write_components: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
8c585ccb5c sstables: sstable_mutation_reader: implement close
Close both the _index_reader and _context, if they are engaged.
Warn and ignore any errors from close as it may be called either
from the destructor or from f_m_r close.

Call close() for closing in the background if needed when destroyed
and warn about it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
6a82e9f4be sstables: index_reader: mark close noexcept
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.

close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.

Note that since reader close can not fail
both lower and upper bounds are closed (since
closing lower_bound cannot fail).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
5dce9997ff test/lib: mutation_source_test: close readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
266d060aef test/lib: flat_reader_assertions: close reader in destructor
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
844bc40060 everywhere: use with_closeable to close flat_mutation_reader
`with_closeable` simplifies scoped use of
flat_mutation_reader, making sure to always close
the reader after use.
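
The pattern can be sketched in Python asyncio as follows. This is not
Seastar's C++ `with_closeable`, only an analogy: the helper runs a
function against a closeable object and always awaits close() afterwards,
whether or not the function raised. `ToyReader` and its methods are
hypothetical names.

```python
import asyncio


class ToyReader:
    """Hypothetical stand-in for a reader that must be closed after use."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.closed = False

    async def read_all(self):
        await asyncio.sleep(0)  # pretend to do I/O
        return list(self._rows)

    async def close(self):
        self.closed = True


async def with_closeable(obj, func):
    """Run func(obj) and always await obj.close() afterwards."""
    try:
        return await func(obj)
    finally:
        await obj.close()
```

A call such as `await with_closeable(reader, lambda r: r.read_all())`
returns the rows and leaves the reader closed on both the success and
the failure path.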

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
ca06d3c92a flat_mutation_reader: log a warning if destroyed without closing
We cannot close in the background since there are use cases
that require the impl to be destroyed synchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
81391b845f reader_permit: expose description method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
a471579bd7 flat_mutation_reader: introduce close
Allow closing readers before destroying them.
This way, outstanding background operations
such as read-aheads can be gently canceled
and be waited upon.

Note that similar to destructors, close must not fail.
There is nothing to do about errors after the f_m_r is done.
Enforce that in flat_mutation_reader::close() so if the f_m_r
implementation did return a failure, report it and abort as internal
error.
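
The "close must not fail" rule can be illustrated with the Python
asyncio sketch below. It is an assumption-laden analogy, not the actual
flat_mutation_reader code: `ClosableReader` and `InternalError` are
hypothetical names, and raising `InternalError` stands in for reporting
the failure and aborting as an internal error.

```python
import asyncio


class InternalError(RuntimeError):
    """Hypothetical stand-in for reporting and aborting on an internal error."""


class ClosableReader:
    """Hypothetical reader front end wrapping an implementation's close."""

    def __init__(self, impl_close):
        self._impl_close = impl_close  # coroutine function from the implementation
        self.closed = False

    async def close(self):
        try:
            await self._impl_close()
        except Exception as e:
            # close() must not fail: escalate instead of returning the error.
            raise InternalError(f"reader close failed: {e!r}") from e
        finally:
            # The reader counts as closed on both paths.
            self.closed = True
```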

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Dejan Mircevski
962373a0a7 cql3: Replace some bounds_ranges calls
We will remove bounds_ranges when we kill the restrictions class
hierarchy.  Of the several call sites, two can be easily modified to
avoid it.  Others are more complicated and will be modified in a
subsequent commit.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-23 15:01:39 -04:00
Dejan Mircevski
b432bdb24e cql3: Get partition range from expr::expression
... instead of a restrictions subclass, which will soon be eliminated.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-23 15:01:39 -04:00
Pavel Emelyanov
13d264d6bd migration_manager: Make it main-local
Now everybody is patched to use component-local instance of migration
manager and its global instance can be moved into main() scope.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
b7a4fb0cf0 tests: Have own migration manager instances
No more global migration manager usage left, so all the tests
can be patched to use local migration manager instance. In fact,
it's only the cql_test_env that's such.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
37c91c4c5c tests: Use migration_manager from cql_test_env
All the tests that need migration manager are run inside
cql_test_env context and can use the migration manager
from the env. For now this is still the global one, but
next patch will change this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
3c3535b4d8 migration_manager: Call maybe_sync from this
The only caller of maybe_sync() method is now the method itself
and can stop using global migration manager instance and switch
to using 'this'.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
6b31c47a75 migration_manager: Make get_schema_for_... methods
These two helpers are now namespace-scoped methods, but both
need the migration manager instance inside. All their callers
are now patched to have the migration manager at hand, so
the helpers can be turned into methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
1021a180e7 migration_manager: Hide get_schema_definition
This method is exclusively used inside migration manager code,
so (for now) no use in keeping it exposed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
e0ca3ccc1c streaming: Keep migration_manager ptr in rpc lambdas
Same as previous patch, but for streaming.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
d76ff4b32f storage_proxy: Keep migration_manager ptr in rpc lambdas
This patch is the bridge between the previous one and the
next one and is quite messy to be merged with either.

No heavy changes -- just copy the migration manager's ptr
onto rpc lambdas. Will be used in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
423d0baa65 streaming: Get migration_manager shared_ptr in messaging
Same as in previous patch, but for streaming code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
a4569a30f3 storage_proxy: Get migration_manager shared_ptr in messaging
The proxy's messaging code uses migration manager to obtain schema.
Since proxy is more low-level service than migration manager, it's
incorrect to make proxy reference the manager directly. Instead,
push the shared_ptr into proxy's messaging code. This kills two
birds with one stone:

1: lets proxy use migration manager
2: makes sure that by the time migration manager is stopped
   the proxy's use of this pointer is gone (unregistered from
   rpc)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
b73a93dab7 migration_manager: Make maybe_sync a method
Right now the maybe_sync is namespace-scope function. Turn it into
a migration_manager method so that it can use 'this' instead of
get_local_migration_manager().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
46bf6872d5 migration_manager: Open-code merge lambda
This lambda uses global migration manager instance. Open-coding
this short lambda makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
ef20d4ee59 migration_manager: Turn do_announce_new_type non-static
It's the only place that calls recently patched .announce()
method, so instead of grabbing global migration manager,
use 'this'.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
7aa1b5d395 migration_manager: Make announce() non-static method
This method needs to get migration manager instance to call
methods on it, so turn it non-static to have the instance in
'this'. Caller (yes, only one) gets local migration manager
itself, but will be patched soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
877ad36424 storage_servive: Use local migration manager
Now that the migration manager is on board, storage service
can use it instead of the global instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
b7fe191e3d storage_service: Keep migration manager on board
The storage service needs migration manager to sync schema
on lifecycle notifiers and to stop the guy on drain. So this
patch just pushes the migration manager reference all the
way through the storage service constructor.

Few words about tests. Since now storage service needs the
migration manager in constructor, some tests should take it
from somewhere. The cql_test_env already has (and uses) it,
all the others can just provide a not-started sharded one,
it won't be in use in _those_ tests anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
6d1eede472 migration_manager: Use 'this' where appropriate
Some of its non-static methods call get_local_migration_manager
instead of using 'this'. None of these places use this to
get cross-shard instance, so it's safe to use 'this' there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
0223644ac5 repair: Use private migration manager pointer
Nothing special here, just replace the code-wide global
with repair-wide global.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
4c30556b8e repair: Keep private sharded migration manager pointer
It's nowadays standard for repair to keep global pointers on
the needed services. Keep the migration manager there too to
avoid explicit call to get_local_migration_manager. Later this
pile of global pointers will be encapsulated on redis service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
2e74dc5fd7 redis: Carry sharded migration manager over init
The only place in redis that needs migration manager is the
::init method that's called on start. It's possible to pass
the migration manager as an argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Pavel Emelyanov
e7dc059917 migration_manager: Merge migration_task in
The migration_task is the class with the single static method
that's called from a single place in migration manager and
this method calls migration manager back right away. There's
not much sense in keeping this abstraction, merge it into the
migration manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-23 17:13:24 +03:00
Benny Halevy
f7e00e781c repair: row_level: run: row_level_stop_finished incorrectly set too early
Should set_repair_state to row_level_stop_started before calling
repair_row_level_stop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210422111723.401719-1-bhalevy@scylladb.com>
2021-04-23 11:25:02 +02:00
Piotr Dulikowski
5a49fe74bb db/hints: add a metric for counting processed files
Adds a field to `end_point_hints_manager::sender`:
`_total_replayed_segments_count` which keeps track of how many segments
were replayed so far. This metric will be used to calculate the
sequence number of the last current hint segments in the queue - so
that we can implement waiting for current segments to be replayed.
2021-04-22 18:45:34 +02:00
Avi Kivity
0af7a22c21 repair: remove partition_checksum and related code
80ebedd242 made row-level repair mandatory, so there remain no
callers to partition_checksum. Remove it.

Closes #8537
2021-04-22 18:56:53 +03:00
Dejan Mircevski
da844a4b59 cql3: Track partition-range expressions
Add a statement_restrictions member that tracks expressions that
together define the partition range.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-04-22 11:35:37 -04:00
Piotr Dulikowski
e48739a6da db/hints: allow to forcefully update segment list on flush
Endpoint hints manager keeps a list of segments to replay. New segments
are appended to it lazily - only when a hint flush occurs (hints
commitlog instance is re-created) and the list is empty. Because of
that, this list cannot be currently used to tell how many segments are
on disk.

This commit makes it possible to trigger a hints flush and forcefully update
the list of segments to replay. In later commits, a mechanism will be implemented
which will allow waiting until a given number of hint segments is
replayed. Triggering a hints flush with segment list update will allow
us to properly synchronize and determine up to which segment we need to
wait.
2021-04-22 17:34:04 +02:00
Avi Kivity
c36549b22e Merge 'rjson: Add throwing allocator' from Piotr Sarna
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory.
This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in an invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.

Fixes #8521
Refs #8515

Tests:
 * unit(release)
 * YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
 relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```

Closes #8529

* github.com:scylladb/scylla:
  test: add a test for rjson allocation
  test: rename alternator_base64_test to alternator_unit_test
  rjson: add a throwing allocator
2021-04-22 17:12:02 +03:00
Piotr Sarna
83a45adbb7 test: add a test for rjson allocation
The test cases check if the new rjson allocator throws
when it fails to allocate/reallocate memory.
2021-04-22 15:59:13 +02:00
Avi Kivity
34b57688b9 tools: toolchain: dbuild: define die() earlier
die() is called before it is defined, so it doesn't work. Move it earlier.

Ref #8520.

Closes #8523
2021-04-22 15:38:10 +02:00
Eliran Sinvani
480a12d7b3 Materialized views: fix possibly old views comming from other nodes
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node queries an old, not yet fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.

Here we plug the hole by fixing such views before they are registered.

Closes #8509
2021-04-22 15:38:10 +02:00
Kamil Braun
8e9a9f8bd3 raft: fsm: include config entries in output.committed
Otherwise waiters on committed configuration changes (e.g.
`server::set_configuration`) would never get notified.

Also if we tried to send another entry concurrently we would get
replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed.
(not sure if this commit also fixes whatever caused that).
Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>
2021-04-22 15:38:10 +02:00
Avi Kivity
350f79c8ce Merge 'sstables: remove large allocations when parsing cells' from Wojciech Mitros
sstable cells are parsed into temporary_buffers, which causes large contiguous allocations for some cells.
This is fixed by storing fragments of the cell value in a fragmented_temporary_buffer instead.
To achieve this, this patch also adds new methods to the fragmented_temporary_buffer(size(), ostream& operator<<()) and adds methods to the underlying parser(primitive_consumer) for parsing byte strings into fragmented buffers.

Fixes #7457
Fixes #6376

Closes #8182

* github.com:scylladb/scylla:
  primitive_consumer: keep fragments of parsed buffer in a small_vector
  sstables: add parsing of cell values into fragmented buffers
  sstables: add non-contiguous parsing of byte strings to the primitive_consumer
  utils: add ostream operator<<() for fragmented_temporary_buffer::view
  compound_type: extend serialize_value for all FragmentedView types
2021-04-22 15:38:10 +02:00
Nadav Har'El
fc2da8058c Merge 'qos: make sure to wait for service level updates on shutdown' from Piotr Sarna
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.

Tests: manual

Fixes #8468

Closes #8470

* github.com:scylladb/scylla:
  qos: make sure to wait for sl updates on shutdown
  db: stop using infinite timeout for service level updates
2021-04-22 15:38:09 +02:00
Pekka Enberg
0ddbed2513 dist: Add support for disabling writeback cache
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: #7341
Tests: dtest (next-gating)

Closes #8526
2021-04-22 11:24:49 +03:00
Asias He
b6104e5f44 doc: Update bootstrap with everywhere_topology
Document how we choose node to sync with if everywhere_topology is used.

Refs #8503

Closes #8518
2021-04-22 11:24:49 +03:00
Avi Kivity
a063173ace Merge "Fix unbounded memory usage and high write amplification in TWCS reshape" from Raphael
"
Memory usage is considerably reduced by making reshape switch to partitioned set,
given that input sstables are disjoint. This will benefit reshape for all
strategies, not only TWCS.

Write amplification is reduced a lot by compacting all input sstables at once,
which is possible given that unbounded memory usage is fixed too.

With both these issues fixed, TWCS reshape will be much more efficient.

tests: mode(dev).
"

* 'twcs_reshape_fixes' of github.com:raphaelsc/scylla:
  tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently
  TWCS: Reshape all sstables in a time window at once if they're disjoint
  sstables: Extract code to count amount of overlapping into a function
  LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
  compaction: Make reshape compaction always use partitioned_sstable_set
  compaction: Allow a compaction type to override the sstable_set for input sstables
2021-04-22 11:24:49 +03:00
Piotr Sarna
55ae110774 qos: make sure to wait for sl updates on shutdown
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.

Tests: manual

Fixes #8468
2021-04-22 09:58:27 +02:00
Piotr Sarna
ad661561c8 db: stop using infinite timeout for service level updates
Due to a porting bug, the routines for updating service levels
used the default infinite timeout for internal CQL queries,
which causes Scylla to hang on shutdown. The behavior is now
fixed and the routines use the same timeout as the other
similar functions - 10s at the time of writing this message.
2021-04-22 09:03:21 +02:00
Raphael S. Carvalho
394b9ddb31 tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:16 -03:00
Raphael S. Carvalho
d5fc2f3839 TWCS: Reshape all sstables in a time window at once if they're disjoint
With repair-based operations, each window will have 256 disjoint
sstables due to data segregation which produces N sstables for each
vnode range, where N = # of existing windows. So each window ends up
with one sstable per vnode range = 256.
Given that reshape now unconditionally uses partitioned set's incremental
selector, all the 256 sstables can be compacted at once as compaction
essentially becomes a copy operation, where only one sstable will be
opened at a time, making its memory usage very efficient.
By compacting all sstables at once, write amplification is greatly
reduced because each byte is now only rewritten once.
Previously, with the initial set of 256 sstables, write amp could be
up to 8, which makes reshape for TWCS very slow.

Refs #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:16 -03:00
Raphael S. Carvalho
0f7774a6f8 sstables: Extract code to count amount of overlapping into a function
This function will be reused by TWCS reshape when checking if all
sstables in a window are disjoint and can be all compacted together.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:16 -03:00
Raphael S. Carvalho
39ecddbd34 LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.
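
A minimal Python sketch of the check, assuming each sstable is reduced to an inclusive (first_key, last_key) pair of integers. The names are hypothetical and this is not the Scylla implementation; it only shows why the boundary case needs ">=" rather than ">".

```python
from typing import List, Tuple

KeyRange = Tuple[int, int]  # (first_key, last_key), both inclusive

def count_adjacent_overlaps(sstables: List[KeyRange]) -> int:
    """Count neighbouring pairs that overlap once sorted by first key."""
    ordered = sorted(sstables, key=lambda r: r[0])
    overlaps = 0
    for (_, prev_last), (next_first, _) in zip(ordered, ordered[1:]):
        # '>=' and not '>': if the last key of one sstable equals the
        # first key of the next, both may hold that key, so they overlap.
        if prev_last >= next_first:
            overlaps += 1
    return overlaps

def is_disjoint(sstables: List[KeyRange]) -> bool:
    return count_adjacent_overlaps(sstables) == 0
```

With a strict ">" comparison, the ranges (0, 5) and (5, 9) would wrongly be reported as disjoint.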

Fixes #8531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:07 -03:00
Asias He
1513de633b repair: Switch to use NODE_OPS_CMD for decommission operation
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.

In this patch, we continue the work to switch decommission operation to use
NODE_OPS_CMD.

The benefits:

- A UUID is used to identify each node operation across the cluster.

- It is more reliable to detect pending node operations, to avoid
  multiple topology changes at the same time.

- The cluster reverts to a state before the decommission operation
  automatically in case of error. Without this patch, the node to be
  decommissioned will be stuck in decommission status forever until it
  is restarted and goes back to normal status.

- Allows users to pass a list of dead nodes to ignore for decommission
  explicitly.

- The LEAVING gossip status is not needed any more. This is one step
  closer to achieving gossip-less topology change.

- Allows us to trigger off-strategy easily on the node receiving the
  ranges

Fixes #8471
2021-04-21 20:35:54 +08:00
Piotr Sarna
dfd1ea6b92 test: rename alternator_base64_test to alternator_unit_test
With the more generic name, I would no longer feel bad adding
non-base64 test cases to it.
2021-04-21 14:26:40 +02:00
Piotr Sarna
45d7144529 rjson: add a throwing allocator
The default rapidjson allocator returns nullptr from
a failed allocation or reallocation. It's not a bug by itself,
but rapidjson internals usually don't check for these return values
and happily use nullptr as a valid pointer, which leads to segmentation
faults and memory corruptions.
In order to prevent these bugs, the default allocator is wrapped
with a class which simply throws once it fails to allocate or reallocate
memory, thus preventing the use of nullptr in the code.
One exception is Malloc/Realloc with size 0, which is expected
to return nullptr by rapidjson code.
2021-04-21 14:26:38 +02:00
Takuya ASADA
00dcaf2896 dist/debian: rename .default file correctly
On 'product != scylla' environment, we have a bug with .default file
(sysconfig file) handling.
Since the .default file should be installed under its original name, the
package name may not match the .default filename.
(ex: default file is /etc/default/scylla-node-exporter, but
     package name is scylla-enterprise-node-exporter)
When the filename doesn't match the package name, it should be renamed
as follows:
  <package name>.<filename>.default
We already do this on .service file, but mistakenly haven't handled
.default file, so let's add it too.
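
The naming rule above can be sketched as a small helper. This is only an illustration in Python, not the packaging script itself: the function name is hypothetical, and the matching case (`<package name>.default`) is an assumption based on the usual debhelper convention rather than something stated in this message.

```python
def packaging_default_name(package: str, filename: str) -> str:
    """Return the name a sysconfig (.default) file should carry in the package.

    Hypothetical helper: `package` is the Debian package name and `filename`
    is the name the file must have under /etc/default.
    """
    if filename == package:
        # Assumed convention when the names already agree.
        return f"{package}.default"
    # Rule quoted in the commit message: <package name>.<filename>.default
    return f"{package}.{filename}.default"
```

With the example from the message, `packaging_default_name("scylla-enterprise-node-exporter", "scylla-node-exporter")` gives `scylla-enterprise-node-exporter.scylla-node-exporter.default`.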

Related scylladb/scylla-enterprise#1718
Fixes #8527

Closes #8528
2021-04-21 14:24:21 +03:00
Piotr Sarna
2ad09d0bf8 Merge 'treewide: remove inclusions of storage_proxy.hh from headers' from Avi Kivity
Reduce rebuilds and build time by removing unnecessary includes. Along the way,
improve header sanity.

Ref #1.

Test: dev-headers, unit(dev).

Closes #8524

* github.com:scylladb/scylla:
  treewide: remove inclusions of storage_proxy.hh from headers
  storage_proxy: unnest coordinator_query_result
  treewide: make headers self-sufficient
  utils: intrusive_btree: add missing #pragma once
2021-04-21 08:22:52 +02:00
Avi Kivity
09819a4c62 Update seastar submodule
* seastar 0b2c25d133...980a29fb70 (1):
  > Merge "Assorted set of improvements over io-queue" from Pavel E

Fixes #8378
2021-04-21 08:22:52 +02:00
Benny Halevy
7130e2e7ff sstables: harden unlink
Make sure that sstable::unlink will never fail.

It will terminate in the unlikely case toc_filename
throws (e.g., on bad_alloc), otherwise it ignores any other error
and just warns about it.

Make unlink a coroutine to simplify the implementation
without introducing additional allocations.

Note that remove_by_toc_name and maybe_delete_large_data_entries
are executed asynchronously and concurrently.
Waiting for them to finish is serialized by co_await,
making sure that both are being waited on so not to leave
abandoned futures behind.
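
The waiting pattern described above can be sketched with Python's asyncio as a rough analogue of the C++ coroutine. The function and parameter names are hypothetical, and the sketch covers only the "ignore and warn" part, not the terminate-on-toc_filename-failure part.

```python
import asyncio
import logging

log = logging.getLogger("unlink_sketch")

async def unlink(remove_by_toc_name, delete_large_data_entries) -> None:
    """Run both cleanup steps concurrently and never propagate their errors.

    Both arguments are zero-argument coroutine functions (stand-ins for the
    real operations).
    """
    # Start both operations before waiting on either, so they overlap.
    tasks = [
        asyncio.create_task(remove_by_toc_name()),
        asyncio.create_task(delete_large_data_entries()),
    ]
    # Await them one after another; each task is waited on even if an
    # earlier one failed, so no task is left abandoned.
    for task in tasks:
        try:
            await task
        except Exception as exc:
            log.warning("ignoring error during unlink: %s", exc)
```

Awaiting the tasks in a fixed order does not make them run sequentially; it only serializes the waiting, which is the point made in the message.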

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>
2021-04-21 08:22:52 +02:00
Raphael S. Carvalho
678e4c0bb9 compaction: Make reshape compaction always use partitioned_sstable_set
Reshape compaction potentially works with disjoint sstables, so it will
benefit a lot from using partitioned_sstable_set, which is able to
incrementally open the disjoint sstables. Without it, all sstables are
opened at once, which means unbounded memory usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-20 15:39:51 -03:00
Avi Kivity
daeddda7cc treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).

Ref #1.
2021-04-20 21:23:00 +03:00
Avi Kivity
cdf30524f3 storage_proxy: unnest coordinator_query_result
Nested classes cannot be forward declared, and
storage_proxy::coordinator_query_result is used in pagers, where
we'd like to forward-declare it. Unnest it and introduce an alias
for compatibility.
2021-04-20 21:23:00 +03:00
Avi Kivity
14a4173f50 treewide: make headers self-sufficient
In preparation for some large header changes, fix up any headers
that aren't self-sufficient by adding needed includes or forward
declarations.
2021-04-20 21:23:00 +03:00
Avi Kivity
6db1a71775 utils: intrusive_btree: add missing #pragma once
Interferes with making headers self-sufficient, so add it now.
2021-04-20 21:23:00 +03:00
Raphael S. Carvalho
ad9bc808b9 compaction: Allow a compaction type to override the sstable_set for input sstables
By default, compaction will pick an implementation of sstable_set as
defined by the underlying compaction strategy.
However, reshape compaction potentially works with disjoint sstables
and will benefit a lot from always using partitioned set.
For example, when reshaping a TWCS table, it's better to use the
partitioned set rather than the time window set, as the former will
be much more memory efficient by incrementally selecting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-20 12:03:44 -03:00
Nadav Har'El
50f3201ee2 alternator: fix inequality check of two sets
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.

Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!

So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.

The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).

Refs #5021
Fixes #8513

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
2021-04-20 13:14:19 +02:00
Nadav Har'El
dae7528fe5 alternator: fix equality check of nested document containing a set
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
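
A minimal sketch of such a traversal, written in Python over DynamoDB's JSON encoding (single-key dicts tagged "S", "N", "SS", "L", "M", and so on). This is not Alternator's code: the function name is hypothetical, and number normalisation (e.g. "1" versus "1.0") is deliberately left out.

```python
SET_TAGS = {"SS", "NS", "BS"}

def values_equal(a: dict, b: dict) -> bool:
    """Compare two DynamoDB-JSON values, ignoring element order inside sets."""
    if len(a) != 1 or len(b) != 1:
        return False  # not a well-formed tagged value
    (tag_a, val_a), = a.items()
    (tag_b, val_b), = b.items()
    if tag_a != tag_b:
        return False
    if tag_a in SET_TAGS:
        # Sets: same members, order irrelevant.
        return len(val_a) == len(val_b) and set(val_a) == set(val_b)
    if tag_a == "L":
        # Lists: order matters, but elements may themselves contain sets.
        return len(val_a) == len(val_b) and all(
            values_equal(x, y) for x, y in zip(val_a, val_b))
    if tag_a == "M":
        # Maps: same keys, and each value compared recursively.
        return val_a.keys() == val_b.keys() and all(
            values_equal(val_a[k], val_b[k]) for k in val_a)
    # Scalars: compared as stored.
    return val_a == val_b
```

The recursion into "L" and "M" is what lets a set buried inside a nested document be compared without regard to order.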

This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.

Refs #5021
Fixes #8514

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
2021-04-20 13:14:10 +02:00
Nadav Har'El
46448b0983 alternator: fix equality check of two unset attributes
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:

When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).

The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
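
The rule for "=" and "<>" described above can be sketched in Python as follows. The helper names and the dict-based item are illustrative assumptions, not Alternator's implementation.

```python
_MISSING = object()  # sentinel distinguishing "unset" from any stored value

def attr_equal(item: dict, x: str, y: str) -> bool:
    """Evaluate "x = y" on two attributes of `item`."""
    a = item.get(x, _MISSING)
    b = item.get(y, _MISSING)
    # An unset attribute is never equal to anything, not even to
    # another unset attribute.
    if a is _MISSING or b is _MISSING:
        return False
    return a == b

def attr_not_equal(item: dict, x: str, y: str) -> bool:
    """Evaluate "x <> y": simply the negation of "x = y"."""
    return not attr_equal(item, x, y)
```

So for an item with neither attribute set, "x = y" is false and "x <> y" is true, matching the behaviour the message attributes to DynamoDB.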

This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.

Fixes #8511

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
2021-04-20 13:14:00 +02:00
Botond Dénes
4c3454dd07 database: get_reader_concurrency_semaphore(): make the user semaphore the catch-all
Currently said method uses the system semaphore as a catch-all for all
scheduling groups it doesn't know about. This is incompatible with the
recent forward-porting of the service-level infrastructure as it means
that all service level related scheduling groups will fall back to the
system scheduling group, which causes two problems:
* They will experience much more limited concurrency, as the system semaphore
  is assigned far fewer count units, to match the much more limited
  internal traffic.
* They compete with internal reads, severely impacting the respective
  internal processes, potentially causing extreme slowdown, or even
  deadlock in the case of an internal query executed on behalf of a
  user query being blocked on the latter.

Even if we don't have any custom service level scheduling groups at the
moment, it is better to change this such that unknown scheduling groups
fall-back to using the user semaphore. We don't expect any new internal
scheduling group to pop up any time soon (and if they do we can adjust
get_reader_concurrency_semaphore() accordingly), but we do expect user
scheduling groups to be created in the future, even dynamically.

To minimize the chance of the wrong workload being associated with the
user semaphore, all statically created scheduling groups are now
explicitly listed in `get_reader_concurrency_semaphore()`, to make their
association with the respective semaphore explicit and documented.
Added a unit test which also checks the correct association for all
these scheduling groups.
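
The lookup policy can be pictured with a short Python sketch. The group names and their assignments below are placeholders chosen for illustration, not the actual list in `get_reader_concurrency_semaphore()`; only the catch-all behaviour is taken from the message.

```python
from enum import Enum

class Semaphore(Enum):
    SYSTEM = "system"
    STREAMING = "streaming"
    USER = "user"

# Hypothetical static scheduling groups, each listed explicitly so the
# association with a semaphore is visible in one place.
STATIC_GROUPS = {
    "statement": Semaphore.USER,
    "streaming": Semaphore.STREAMING,
    "compaction": Semaphore.SYSTEM,
    "memtable": Semaphore.SYSTEM,
}

def semaphore_for(group: str) -> Semaphore:
    """Return the semaphore for a scheduling group.

    Unknown groups (for example, dynamically created service-level groups)
    fall back to the user semaphore rather than the system one.
    """
    return STATIC_GROUPS.get(group, Semaphore.USER)
```

Listing the known groups explicitly means a new internal group that is forgotten ends up on the user semaphore, which is the trade-off the message accepts.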

Fixes: #8508

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210420105156.94002-1-bdenes@scylladb.com>
2021-04-20 14:06:25 +03:00
Piotr Sarna
ec750e5f49 rjson: make the max nested level configurable
Back when rjson was only part of alternator, there was a hardcoded
limit of nested levels - 78. The number was calculated as:
 - taking the DynamoDB limit (32)
 - adding 7 to it to make alternator support more cases
 - doubling it because rjson internals bump the level twice
   for each alternator object (because the alternator object
   is represented as a 2-level JSON object).

Since rjson is no longer specific to alternator, this limit
is now configurable, and the original default value is explained
in a comment.
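
The arithmetic behind the original default, spelled out as a small Python sketch (the constant and function names are illustrative only):

```python
DYNAMODB_MAX_NESTING = 32   # DynamoDB limit quoted in the message
ALTERNATOR_HEADROOM = 7     # extra levels added to support more cases
JSON_LEVELS_PER_OBJECT = 2  # each alternator object is a 2-level JSON object

def default_max_nested_level() -> int:
    """Recompute the previously hardcoded rjson nesting limit."""
    return (DYNAMODB_MAX_NESTING + ALTERNATOR_HEADROOM) * JSON_LEVELS_PER_OBJECT
```

That is (32 + 7) * 2 = 78, the value that used to be hardcoded.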

Message-Id: <51952951a7cd17f2f06ab36211f74086e1b60d2d.1618916299.git.sarna@scylladb.com>
2021-04-20 14:05:03 +03:00
Nadav Har'El
c29f55e801 Merge 'Unify CQL and Redis server code' from Pekka Enberg
The Redis server started as a copy of the CQL server, but did not
receive all the fixes of the CQL server over time. For example, commit
1a8630e ("transport: silence "broken pipe" and "connection reset by
peer" errors") was only done on the CQL server.

To remedy the situation, this pull request unifies code between the CQL
and Redis servers by introducing a "generic_server" component, and
switching CQL and Redis to use it.

Test: dtest(dev)

Closes #8388

* github.com:scylladb/scylla:
  generic_server: Rename "maybe_idle" to "maybe_stop"
  generic_server: API documentation for connection and server classes
  transport, redis: Use generic server::listen()
  transport/server: Remove "redis_server" prefix from logging
  transport/server: Remove "cql_server" prefix from logging
  generic_server: Remove unneeded static_pointer_cast<>
  transport, redis: Use generic server::do_accepts()
  transport, redis: Use generic server::process()
  redis: Move Redis specific code to handle_error()
  transport: Move CQL specific error handling to handle_error()
  transport, redis: Move connection tracking to generic_server::server class
  transport, redis: Move _stopped and _connections_list to generic_server::server class
  transport, redis: Move total_connections to generic_server::server class
  transport, redis: Use generic server::maybe_idle()
  transport, redis: Move list_base_hook<> inheritance to generic_server::connection
  transport, redis: Use generic connection::shutdown()
2021-04-20 12:20:25 +03:00
Tomasz Grabiec
dc7beec382 Merge "Tweak cache_flat_mutation_reader" from Pavel Emelyanov
The set recycles 16 bytes from the reader class, makes use of
rows collection sugar, generalizes range tombstones emission and
adds an invariant-check.

tests: unit(dev)

* xemul/br-cache-reader-cleanups-1.2:
  cache_flat_mutation_reader: Generalize range tombstones emission
  cache_flat_mutation_reader: Tune forward progress check
  cache_flat_mutation_reader: Use rows insertion sugar
  cache_flat_mutation_reader: Move state field
  cache_flat_mutation_reader: Remove raiish comparator
  cache_flat_mutation_reader: Remove unused captured variable
  cache_flat_mutation_reader: Fix trace message text
2021-04-19 21:21:49 +02:00
Benny Halevy
a57459e983 compaction: cleanup_compaction: no need to filter tokens belonging to other shards
As sstables are always resharded if needed when loaded.

Refs #6807

Test: unit(release,debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210419142743.265729-1-bhalevy@scylladb.com>
2021-04-19 17:22:53 +02:00
Benny Halevy
9c89702fb2 perf_simple_query: use tests::random::get_int for reproducible results
Support for random-seed was added in 4ad06c7eeb
but the program still uses std::rand() to draw random keys.

Use tests::random::get_int instead so we can get a reproducible
sequence of keys given a particular random-seed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210418104455.82086-1-bhalevy@scylladb.com>
2021-04-19 17:22:53 +02:00
Piotr Sarna
2591cbb62e main: add a debug symbol for service level controller
It's notoriously hard to find the service level controller symbol
(possible by guessing the offset based on system_distributed_keyspace
address, but it's very cumbersome). To make the debugging process
easier, the symbol is exported via the `debug` namespace.

Closes #8506
2021-04-19 11:29:01 +03:00
Kamil Braun
617813ba66 sys_dist_ks: new keyspace for system tables with Everywhere strategy
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).

Closes #8457
2021-04-19 11:22:57 +03:00
Nadav Har'El
13104bd7e2 Merge 'repair: Handle everywhere_topology in bootstrap_with_repair ' from Asias He
repair: Handle everywhere_topology in bootstrap_with_repair

 The everywhere_topology returns the number of nodes in the cluster as
 RF. This makes only streaming from the node losing the range impossible
 since no node is losing the range after bootstrap.

 Shortcut to stream from all nodes in local dc in case the keyspace is
 everywhere_topology.

 Fixes #8503

Closes #8505

* github.com:scylladb/scylla:
  repair: Make the log more accurate in bootstrap_with_repair
  repair: Handle everywhere_topology in bootstrap_with_repair
2021-04-19 11:19:01 +03:00
Asias He
4c4334e912 repair: Make the log more accurate in bootstrap_with_repair
We have logs

   expected 1 node losing range but found *more* nodes

However, we can find zero nodes as well. Drop the word *more* from the log.

In addition, print the number of nodes found.

Refs #8503
2021-04-19 15:15:05 +08:00
Takuya ASADA
0b01e1a167 dist: add DefaultDependencies=no to .mount units
To avoid ordering cycle error on Ubuntu, add DefaultDependencies=no
on .mount units.

Fixes #8482

Closes #8495
2021-04-19 09:06:42 +03:00
Botond Dénes
8287cdb2ff scripts/build-help.sh: extend help text with more targets
Mention executables (scylla, tools and tests) as well as how to build
individual object files and how to verify individual headers. Also
mention the not-at-all obvious trick of how to build tests with debug
symbols.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210416131950.175413-1-bdenes@scylladb.com>
2021-04-19 06:33:01 +02:00
Asias He
3c36517598 repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes only streaming from the node losing the range impossible
since no node is losing the range after bootstrap.

Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.

Fixes #8503
2021-04-19 10:47:36 +08:00
Tomasz Grabiec
dbd0b9a3ef gdb: Fix miscalculation of small pool memory usage "scylla memory"
It should not count free pages which used to belong to a given pool.
Message-Id: <20210415175923.683555-1-tgrabiec@scylladb.com>
2021-04-18 14:03:17 +03:00
Tomasz Grabiec
68cde23912 gdb: Fix --size option of "scylla task_histogram"
By default, argparse will provide the value of the option as str.
Later, we compare it with int, which will be always False. Fix by
telling argparse to provide as int.
Message-Id: <20210415182149.686355-1-tgrabiec@scylladb.com>
2021-04-18 14:03:17 +03:00
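
A minimal sketch of the argparse behaviour described above. Only `--size` comes from the commit; `--size-as-str` is a hypothetical option added here for contrast.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--size-as-str")        # no type given: value arrives as str
parser.add_argument("--size", type=int)     # argparse converts the value to int

args = parser.parse_args(["--size-as-str", "64", "--size", "64"])

# Comparing a str with an int for equality is always False.
str_matches = (args.size_as_str == 64)
# With type=int the comparison behaves as intended.
int_matches = (args.size == 64)
```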
Botond Dénes
8a43a11f7b scylla-gdb.py: get_base_class_offset(): make sure offset is returned as int
Looks like in python 3, division automatically yields a double/float,
even if both operands are integers. This results in
get_base_class_offset() returning a double/float, which breaks pointer
arithmetics (which is what the returned value is used for), because now
instead of decrementing/incrementing the pointer, the pointer will be
converted to a double itself silently, then back to some corrupt pointer
value. One user visible effect is `intrusive_list` being broken, as it
uses the above method to calculate the member type pointer from the node
pointers.
Fix by coercing the returned value to int.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210415080034.167762-1-bdenes@scylladb.com>
2021-04-18 14:03:17 +03:00
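
The Python 3 division behaviour mentioned above can be seen with plain integers. The value `offset_bits` is a made-up number for illustration and is not taken from scylla-gdb.py.

```python
offset_bits = 128                     # hypothetical bit offset of a base class

float_offset = offset_bits / 8        # true division: always a float in Python 3
int_offset = int(offset_bits / 8)     # coerced back to int, as the fix does
floor_offset = offset_bits // 8       # floor division keeps int operands as int
```

Either `int(...)` or `//` keeps the result usable for integer arithmetic; the commit describes the first approach.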
Pavel Emelyanov
5ecbc33be5 database.*: Remove unused headers
The database.hh is the central recursive-headers knot -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
2021-04-18 14:03:17 +03:00
Pavel Emelyanov
2a7171110d cache_flat_mutation_reader: Generalize range tombstones emission
The range tombstone can be added-to-buffer from two places:
when it was found in cache and when it was read from the
underlying reader. Both adders can now be generalized.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:46 +03:00
Pavel Emelyanov
2e98cfbf1d cache_flat_mutation_reader: Tune forward progress check
When adding a range tombstone to the buffer, the check for
whether to stop stuffing the already full buffer is only done if
this particular range tombstone changes the lower_bound.
This check can be tuned -- if the lower bound changed
_at_ _all_ after a range tombstone was added, we may
still abort the loop.

This change will allow generalizing range tombstone
emission in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:46 +03:00
Pavel Emelyanov
a35de6ea3e cache_flat_mutation_reader: Use rows insertion sugar
When inserting a rows_entry via unique_ptr the ptr in question
can be pushed as is, the intrusive btree code releases the
pointer (to be exception safe) itself. This makes the code
a bit shorter and simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:46 +03:00
Pavel Emelyanov
df488dd8ac cache_flat_mutation_reader: Move state field
There are two alignment gaps in the middle of the
c_f_m_r -- one after the state and another one after
the set of bools. Keeping them together allows the
compiler to pack the c_f_m_r better.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:46 +03:00
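
The effect of alignment gaps can be sketched with Python's ctypes, which lays out structures like the platform's C compiler. The two structures below are hypothetical stand-ins and do not reproduce the real field list of the reader class.

```python
import ctypes

class Scattered(ctypes.Structure):
    # Small fields separated by 8-byte fields: padding follows the state
    # and follows the pair of bools.
    _fields_ = [
        ("state", ctypes.c_uint8),
        ("ptr_a", ctypes.c_uint64),
        ("flag_a", ctypes.c_bool),
        ("flag_b", ctypes.c_bool),
        ("ptr_b", ctypes.c_uint64),
    ]

class Grouped(ctypes.Structure):
    # Same fields, with the state kept next to the bools.
    _fields_ = [
        ("ptr_a", ctypes.c_uint64),
        ("ptr_b", ctypes.c_uint64),
        ("state", ctypes.c_uint8),
        ("flag_a", ctypes.c_bool),
        ("flag_b", ctypes.c_bool),
    ]

scattered_size = ctypes.sizeof(Scattered)
grouped_size = ctypes.sizeof(Grouped)
```

On a typical 64-bit ABI this gives 32 bytes for `Scattered` and 24 bytes for `Grouped`; the exact numbers depend on the platform's alignment rules.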
Pavel Emelyanov
bc3f910fc1 cache_flat_mutation_reader: Remove raiish comparator
The instance of position_in_partition::tri_compare sits
on the reader itself and just occupies memory. It can be
created on demand all the more so it's only one place that
needs it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:46 +03:00
Pavel Emelyanov
41352334ba cache_flat_mutation_reader: Remove unused captured variable
The captured timeout is not used in lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:41 +03:00
Pavel Emelyanov
eb65f8ed6b cache_flat_mutation_reader: Fix trace message text
The entry inserted in this branch is not dummy, but an empty row.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-16 17:55:22 +03:00
Nadav Har'El
8751728314 Merge 'Improve validation of "enable", "postimage" and "ttl" CDC options' from Piotr Grabowski
First commit:
In the first commit, add validation of `enable` and `postimage` CDC options. Both options are boolean options, but previously they were not validated, meaning you could issue a query:

```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
```

and it would be executed without any errors, silently interpreting `dsfdsd` as false.

The first commit narrows possible values of those boolean CDC options to `false`, `true`, `0`, `1`. After applying this change, issuing the query above would result in this error message:

```
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
```

I actually encountered this lacking validation myself, as I mistakenly issued a query:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': 'full'};
```
incorrectly assigning `full` to `postimage`, instead of `preimage`. However, before this commit, this query ran correctly and it interpreted `full` as `false` and disabled postimages altogether.

Second commit:
The second commit improves the error message of invalid `ttl` CDC option:

Before:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
```

After:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
```

```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
```

Closes #8486

* github.com:scylladb/scylla:
  cdc: improve exception message of invalid "ttl"
  cdc: add validation of "enable" and "postimage"
2021-04-15 11:59:41 +02:00
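
The validation rule described in the first commit can be sketched as follows. This is an illustration in Python, not the actual C++ implementation: `parse_cdc_bool` and the local `ConfigurationException` class are hypothetical, and case-insensitive matching is an assumption. Only the accepted values and the error message text come from the commit message.

```python
class ConfigurationException(Exception):
    """Local stand-in for the exception type named in the messages above."""

_BOOL_VALUES = {"true": True, "1": True, "false": False, "0": False}

def parse_cdc_bool(name, value):
    # Hypothetical helper: accept only the four documented spellings and
    # reject everything else instead of silently treating it as false.
    try:
        return _BOOL_VALUES[str(value).lower()]
    except KeyError:
        raise ConfigurationException(
            f'Invalid value for CDC option "{name}": {value}') from None

enabled = parse_cdc_bool("enabled", "true")

try:
    parse_cdc_bool("enabled", "dsfdsd")
    error_text = None
except ConfigurationException as e:
    error_text = str(e)
```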
Takuya ASADA
cbbd5b2b6f unified: abort install when non-bash shell detected
On Debian variants, sh -x ./install.sh will fail since our script is
written in bash, and /bin/sh in Debian variants is dash, not bash.

So detect a non-bash shell and print an error message asking users to
run the script in bash.

Fixes #8479

Closes #8484
2021-04-15 11:59:41 +02:00
Avi Kivity
935378fa53 main: start background reclaim before bootstrap
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.

Fix by moving background reclaim initialization early, before
storage_service::join_cluster().

(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).

Fixes #8473.

Closes #8474
2021-04-15 11:59:41 +02:00
Raphael S. Carvalho
84f7ae2c82 table: remove unneeded code as sstables are not shared anymore
given that resharding is now a synchronous mandatory step, before
table is populated, snapshot() can now get rid of code which takes
into account whether or not a sstable is shared.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>
2021-04-15 11:59:41 +02:00
Avi Kivity
b19d318701 Update seastar submodule
* seastar d2dcda96bb...0b2c25d133 (4):
  > reactor: reactor_backend_epoll: stop using signals for high resolution timers
  > reactor: move task_quota_timer_thread_fn from reactor to reactor_backend_epoll
  > Merge "Report maximum IO lenghts via file API" from Pavel E
  > Merge "Improve efficiency of io-tester" from Pavel E
2021-04-15 11:59:41 +02:00
Piotr Grabowski
61c8e196be cdc: improve exception message of invalid "ttl"
Improve the exception message of providing invalid "ttl" value to the
table.

Previously, if you executed a CREATE TABLE query with invalid "ttl"
value, you would get a non-descriptive error message:

CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi

This commit adds more descriptive exception messages:

CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd

CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
2021-04-14 17:40:23 +02:00
Piotr Grabowski
10390afc10 cdc: add validation of "enable" and "postimage"
Add validation of "enable" and "postimage" CDC options. Both options
are boolean options, but previously they were not validated, meaning
you could issue a query:

CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};

and it would be executed without any errors, silently interpreting
"dsfdsd" as false.

This commit narrows possible values of those boolean CDC options to
false, true, 0, 1. After applying this change, issuing the query above
would result in this error message:

ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
2021-04-14 17:36:38 +02:00
Nadav Har'El
4cf21f3a0f cql-pytest: update run-cassandra script for Java 11
This patch fixes cql-pytest/run-cassandra to work on systems which
default to Java 11, including Fedora 33.

Recent versions of Cassandra can run on Java 11 fine, but require a
bunch of weird JVM options to work around its JPMS (Java Platform Module
System) feature. Cassandra's start scripts require these options to
be listed in conf/jvm11-server.options, which is read by the startup
script cassandra.in.sh.

Because our "run-cassandra" builds its own "conf" directory, we need
to create a jvm11-server.options file in that directory. This is ugly,
but unfortunately necessary if cql-pytest/run-cassandra is to run
on systems defaulting to Java 11.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210406220039.195796-1-nyh@scylladb.com>
2021-04-14 13:16:00 +02:00
Asias He
9ea57dff21 gossip: Relax failure detector update
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, we do not update the failure detector
upon receiving gossip syn messages, even though receiving a message
from a peer node implies that the peer node is alive.

This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.

Refs #8296

Closes #8476
2021-04-14 13:16:00 +02:00
Tomasz Grabiec
320f6bf220 Merge 'test: perf: perf_simple_query: collect allocation and task statistics' from Avi Kivity
Calculate and display the number of memory allocations and tasks
executed per operation. Sample results (--smp 1):

180022.46 tps (90 allocs/op, 20 tasks/op)
178963.44 tps (90 allocs/op, 20 tasks/op)
178702.41 tps (90 allocs/op, 20 tasks/op)
177679.74 tps (90 allocs/op, 20 tasks/op)
179539.36 tps (90 allocs/op, 20 tasks/op)

median 178963.44 tps (90 allocs/op, 20 tasks/op)
median absolute deviation: 575.92
maximum: 180022.46
minimum: 177679.74

This allows less noisy tracking of how some changes impact performance.

Closes #8425

* github.com:scylladb/scylla:
  test: perf: perf_simple_query: collect allocation and task statistics
  perf: deinline some functions in perf.hh
2021-04-14 13:16:00 +02:00
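
The summary statistics in the sample output can be recomputed from the five listed runs. A short Python check, using only the tps figures quoted above:

```python
import statistics

tps = [180022.46, 178963.44, 178702.41, 177679.74, 179539.36]

median = statistics.median(tps)
# Median absolute deviation: the median of each run's distance from the median.
mad = statistics.median([abs(x - median) for x in tps])

summary = (median, round(mad, 2), max(tps), min(tps))
```

The rounded deviation agrees with the 575.92 reported in the sample.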
Kamil Braun
5c7ed7a83f time_series_sstable_set: return partition start if some sstables were ck-filtered out
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.

In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.

We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.

Fixes #8447.

Closes #8448
2021-04-14 13:16:00 +02:00
Calle Wilund
03590c8254 commitlog_test: Add test for deadlock in shutdown w. segment wait
Refs #8438

Ensures shutting down (well behaved) works even if an allocating
path is stuck waiting for a new segment - i.e. other aspect of

Closes #8475
2021-04-14 13:16:00 +02:00
Michael Livshin
4ccb1b3a2f build: add nix-shell support
Support native building & unit testing in the Nix ecosystem under
nix-shell.

Actual dist packaging for Nixpkgs/NixOS is not there (yet?), because:

* Does not exactly seem like a huge priority.

* I don't even have a firm idea of how much work it would entail (it
  certainly does not need the ld.so trickery, so there's that.  But at
  least some work would be needed, seeing how ScyllaDB needs to
  integrate with its environment and NixOS is a little unorthodox).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210413110508.5901-4-michael.livshin@scylladb.com>
2021-04-14 13:15:59 +02:00
Michael Livshin
d87e751182 build: add a structural way to distro-extend configure.py
For now just for additional cflags, ldflags & cmake arguments.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210413110508.5901-3-michael.livshin@scylladb.com>
2021-04-14 13:15:59 +02:00
Michael Livshin
5cb4005e84 build: extend configure.py's subprocess environment properly
The `env` parameter to `subprocess.Popen()` and friends, when it is
not `None`, is not an addition to the subprocess environment but the
_whole_ subprocess environment.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210413110508.5901-2-michael.livshin@scylladb.com>
2021-04-14 13:15:59 +02:00
Avi Kivity
b756693e64 Merge "mutation_query: move query methods into table" from Botond
"
These methods are generic ways to query a mutation source. At least they
used to be, but nowadays they are pretty specific to how tables are
queried -- they use a querier cache to lookup queriers from and save
them into. With the coming changes to how permits are obtained, they are
about to get even more specific to tables. Instead of forcing the
genericity and keep adding new parameters, this patchset bites the
bullet and moves them to table. `data_query()` is inlined into
`table::query()`, while `mutation_query()` is replaced with
`table::mutation_query()`.
The only other users besides table are tests and they are adjusted to
use similarly named local methods that just combine the right querier
with the right result builder. This combination is what the tests really
want to test, as this is also what is used by the table methods behind
the scenes.

Tests: unit(release, debug)
"

* 'mutation-query-move-query-methods-into-table/v1' of https://github.com/denesb/scylla:
  mutation_query: remove now unused mutation_query()
  test: mutation_query_test: use local mutation_query() implementation
  database: mutation_query(): use table::mutation_query()
  table: add mutation_query()
  query: remove the now unused data_query()
  test: mutation_query_test: use local data_query() implementation
  table: query(): inline data_query() code into query()
  table: make query() a coroutine
2021-04-14 13:15:59 +02:00
Pekka Enberg
2b6438c044 generic_server: Rename "maybe_idle" to "maybe_stop" 2021-04-13 14:13:24 +03:00
Pekka Enberg
66276d6636 generic_server: API documentation for connection and server classes 2021-04-13 14:13:24 +03:00
Pekka Enberg
16f262b852 transport, redis: Use generic server::listen()
Let's pull up cql_server listen() to generic_server::server base class and
convert redis_server to use it.
2021-04-13 14:13:24 +03:00
Pekka Enberg
6c619e4462 transport/server: Remove "redis_server" prefix from logging
The logger itself has the name "redis_server" that appears in the logs.
2021-04-13 13:57:22 +03:00
Pekka Enberg
7ef3c60864 transport/server: Remove "cql_server" prefix from logging
The logger itself has the name "cql_server" that appears in the logs.
2021-04-13 13:57:22 +03:00
Pekka Enberg
f560b3daa3 generic_server: Remove unneeded static_pointer_cast<>
Now that do_accepts() is in generic_server, we can get rid of the
static_pointer_cast<>.
2021-04-13 13:57:22 +03:00
Pekka Enberg
ac90a8ea50 transport, redis: Use generic server::do_accepts()
The cql_server and redis_server share the same ancestor of do_accepts().
Let's pull up the cql_server version of do_accept() (that has more
functionality) to generic_server::server and use it in the redis_server
too.
2021-04-13 13:57:21 +03:00
Pekka Enberg
3689db26fc transport, redis: Use generic server::process()
Pull up the cql_server process() to base class and convert redis_server
to use it.

Please note that this fixes EPIPE and connection reset issue in the
Redis server, which was fixed in the CQL server in commit 1a8630e6a
("transport: silence "broken pipe" and "connection reset by peer"
errors").
2021-04-13 13:56:45 +03:00
Pekka Enberg
ef39216667 redis: Move Redis specific code to handle_error()
This moves the Redis specific error handling to handle_error() to make
process() more generic in preparation for move to generic_server.
2021-04-13 13:56:45 +03:00
Pekka Enberg
66d6899727 transport: Move CQL specific error handling to handle_error()
This moves the CQL specific error handling to handle_error() to make
process() more generic in preparation for move to generic_server.
2021-04-13 13:56:45 +03:00
Pekka Enberg
ab339cfaf7 transport, redis: Move connection tracking to generic_server::server class
The cql_server and redis_server classes have identical connection
tracking code. Pull it up to the generic_server::server base class.
2021-04-13 13:56:45 +03:00
Pekka Enberg
deac5b1810 transport, redis: Move _stopped and _connections_list to generic_server::server class
The cql_server and redis_server both have the same "_stopped" and
"_connections_list" member variables. Pull them up to the
generic_server::server base class.
2021-04-13 13:56:45 +03:00
Pekka Enberg
1af73bec7b transport, redis: Move total_connections to generic_server::server class
Both cql_server and redis_server have the same "total_connections"
member variable so pull that up to the generic_server::server base
class.
2021-04-13 13:56:45 +03:00
Pekka Enberg
7b46c2da53 transport, redis: Use generic server::maybe_idle()
The cql_server and redis_server classes have a maybe_idle() method,
which sets the _all_connections_stopped promise if server wants to stop
and can be stopped. Pull up the duplicated code to
generic_server::server class.
2021-04-13 13:56:45 +03:00
Pekka Enberg
4664a55e05 transport, redis: Move list_base_hook<> inheritance to generic_server::connection
Both cql_server::connection and redis_server::connection inherit
boost::intrusive::list_base_hook<>, so let's pull up that to the
generic_server::connection class that both inherit.
2021-04-13 13:56:45 +03:00
Pekka Enberg
19507bb7ea transport, redis: Use generic connection::shutdown()
This patch moves the duplicated connection::shutdown() method to a
new generic_server::connection base class that is now inherited by
cql_server and redis_server.
2021-04-13 13:56:44 +03:00
Tomasz Grabiec
163f2be277 Merge 'Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader' from Piotr Jastrzębski
It is possible that a partition is in cache but is not present in sstables that are underneath.
In such case:
1. cache_flat_mutation_reader will fast forward underlying reader to that partition
2. The underlying reader will enter the state when it's empty and its is_end_of_stream() returns true
3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader
4. This PR fixes that

Test: unit(dev)

Fixes #8435
Fixes #8411

Closes #8437

* github.com:scylladb/scylla:
  row_cache: remove redundant check in make_reader
  cache_flat_mutation_reader: fix do_fill_buffer
  read_context: add _partition_exists
  read_context: remove skip_first_fragment arg from create_underlying
  read_context: skip first fragment in ensure_underlying
2021-04-13 00:45:10 +02:00
Piotr Jastrzebski
cb3dbb1a4b row_cache: remove redundant check in make_reader
This check is always true because a dummy entry is added at the end of
each cache entry. If that wasn't true, the check in else-if would be
UB.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-04-12 21:12:33 +02:00
Piotr Jastrzebski
1f644df09d cache_flat_mutation_reader: fix do_fill_buffer
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward within this nonexistent
partition.

Test: unit(dev)

Fixes #8435
Fixes #8411

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-04-12 21:08:40 +02:00
Piotr Jastrzebski
ceab5f026d read_context: add _partition_exists
This new state stores the information whether current partition
represented by _key is present in underlying.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-04-12 20:57:20 +02:00
Piotr Jastrzebski
b3b68dc662 read_context: remove skip_first_fragment arg from create_underlying
All callers pass false for its value so no need to keep it around.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-04-12 19:51:06 +02:00
Piotr Jastrzebski
088a02aafd read_context: skip first fragment in ensure_underlying
This was previously done in create_underlying but ensure_underlying is
a better place because we will add more related logic to this
consumption in the following patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-04-12 19:46:04 +02:00
Avi Kivity
fcc17d43a6 treewide: correct mislicensed source files
alternator/expressions.g had both AGPL and proprietary licensing. The
proprietary one is removed.

gms/inet_address_serializer.hh had only a proprietary license; it is
replaced by the AGPL.

Fixes #8465.

Closes #8466
2021-04-12 17:42:59 +03:00
Avi Kivity
e3db889057 Merge 'Introduce service levels' from Piotr Sarna
This series introduces service level syntax borrowed from https://docs.scylladb.com/using-scylla/workload-prioritization/ , but without workload prioritization itself - just for the sake of using identical syntax to provide different parameters later. The new parameters may include:
 * per-service-level timeouts
 * oltp/olap declaration, which may change the way Scylla treats long requests - e.g. time them out (the oltp way) or keep them sustained with empty pages (the olap way)

Refs #7617

Closes #7867

* github.com:scylladb/scylla:
  transport: initialize query state with service level controller
  main: add initializing service level data accessor
  service: make enable_shared_from_this inheritance public
  cql3: add SERVICE LEVEL syntax (without an underscore)
  unit test: Add unit test for per user sla syntax
  cql: Add support for service level cql queries
  auth: Add service_level resource for supporting in authorization of cql service_level
  cql: Support accessing service_level_controller from query state
  instantiate and initialize the service_level_controller
  qos: Add a standard implementation for service level data accessor
  qos: add waiting for the updater future
  service/qos: adding service level controller
  service_levels: Add documentation for distributed tables
  service/qos: adding service level table to the distributed keyspace
  service/qos: add common definitions
  auth: add support for role attributes
2021-04-12 17:34:43 +03:00
Piotr Sarna
26ee6aa1e9 transport: initialize query state with service level controller
Query state should be aware of the service level controller in order
to properly serve service-level-related CQL queries.
2021-04-12 16:31:27 +02:00
Piotr Sarna
32bcbe59ad main: add initializing service level data accessor
The accessor must be set up in order to be able to use
statements related to service level management.
2021-04-12 16:31:27 +02:00
Piotr Sarna
3626bc253d service: make enable_shared_from_this inheritance public
Unless the inheritance is public, making a shared pointer from the service
level accessor is not possible outside of the class.
2021-04-12 16:31:27 +02:00
Piotr Sarna
c7f66d6fdd cql3: add SERVICE LEVEL syntax (without an underscore)
In order for the syntax to be more natural, it's now possible
to use SERVICE LEVEL instead of SERVICE_LEVEL in all appropriate
places. The old syntax is supported as well.
2021-04-12 16:31:27 +02:00
Eliran Sinvani
144fe02c23 unit test: Add unit test for per user sla syntax
This commit adds the infrastructure needed to test per user sla,
more specifically, a service level accessor that triggers the
update_service_levels_from_distributed_data function upon any
change to the distributed sla data.
A test was added that indirectly consumes this infrastructure by
changing the distributed service level data with cql queries.
Message-Id: <23b2211e409446c4f4e3e57b00f78d9ff75fc978.1609249294.git.sarna@scylladb.com>
2021-04-12 16:31:26 +02:00
Eliran Sinvani
2701481cbc cql: Add support for service level cql queries
This patch adds support for new service level cql queries.
The queries implemented are:
CREATE SERVICE_LEVEL [IF NOT EXISTS] <service_level_name>
ALTER SERVICE_LEVEL <service_level_name> WITH param = <something>
DROP SERVICE_LEVEL [IF EXISTS] <service_level_name>
ATTACH SERVICE_LEVEL <service_level_name> TO <role_name>
DETACH SERVICE_LEVEL FROM <role_name>
LIST SERVICE_LEVEL <service_level_name>
LIST ALL SERVICE_LEVELS
LIST ATTACHED SERVICE_LEVEL OF <role_name>
LIST ALL ATTACHED SERVICE_LEVELS
2021-04-12 16:30:01 +02:00
Eliran Sinvani
a88929da15 auth: Add service_level resource for supporting in authorization of cql service_level
queries

In order to be able to manage service_level configuration one must be authorized
to do so, or be a superuser. This commit adds support for the service_levels
resource. Since service_levels are relative, reconfiguring one service level is not localized
only to that service level and will affect the QOS for all of the service levels,
so there is not much sense in granting permissions to manage individual service_levels.
This is why only a root resource named service_levels, which represents all service levels, is used.
This commit also implements the unit test additions for the newly introduced resource.
Message-Id: <81ab16fa813b61be117155feea405da6266921e3.1609237687.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Eliran Sinvani
f78707d3fb cql: Support accessing service_level_controller from query state
In order to implement service level cql queries, the query objects
need access to the service_level_controller object when processing.
This patch adds this access by embedding it into the query state object.
To accomplish this, the query processor object needs
access to service_level_controller in order to instantiate the query state.
Message-Id: <68f5a7796068a49d9cd004f1cbf34bdf93b418bc.1609234193.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Eliran Sinvani
e173eaa032 instantiate and initialize the service_level_controller
This patch adds the initialization of service_level_controller. It
constructs the distributed service and starts the watch loop for
distributed data changes.
Message-Id: <e97661194833d576aa39b3e7886366590f272612.1609175402.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Eliran Sinvani
8493e19840 qos: Add a standard implementation for service level data accessor
service_level_controller defines an interface for accessing the service
level distributed data, this patch implements a standard implementation
of the interface that delegates to the system distributed keyspace.
Message-Id: <25e68302f6f4d4fe5fcb66ea19159ad68506ba64.1609175314.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Piotr Sarna
41951d34ad qos: add waiting for the updater future
The distributed data updater used to spawn a future without waiting
for it. It was quite safe, since the future had its own abort source,
but it's better to remember it and wait for it during stop() anyway.
2021-04-12 16:01:04 +02:00
Eliran Sinvani
a54ea4667b service/qos: adding service level controller
adding the service level controller implementation. The implementation
follows the design in:
https://docs.google.com/document/d/1RrSTZ3ZX86-YDt2POwAVwFeKN9uX8frEvATJda5n1FU/edit?usp=sharing
Some interfaces were added for registration with system components.
Registration was chosen over a constructor parameter because
the components are initialized before the service level controller is created.
Message-Id: <e9c4e7d5b411062b6a553f5c6861e7875cd71d2c.1609171761.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Eliran Sinvani
3ecdab30a1 service_levels: Add documentation for distributed tables
This patch adds documentation for the distributed tables
used for service_level feature and their meaning and usage.
Message-Id: <5b7d2be166c2381ed33094b4545fafe0f142583f.1609170862.git.sarna@scylladb.com>
2021-04-12 16:01:03 +02:00
Eliran Sinvani
dd74556ad9 service/qos: adding service level table to the distributed keyspace
This patch adds the service level table and functions to manipulate it
to the distributed keyspace.

Message-Id: <b6cb7f311ac1ee6802d8f3d78eac9cf40fe21f68.1609161341.git.sarna@scylladb.com>
2021-04-12 15:58:09 +02:00
Eliran Sinvani
4fea0762c2 service/qos: add common definitions
Adding common definitions that will be used by the
performance isolation classes. Mainly defines the
common ground for configuring a service level
through the service level options structure.

Message-Id: <12476f4a8e21af3a4c7a892683940698f3beacce.1609160860.git.sarna@scylladb.com>
2021-04-12 15:58:09 +02:00
Eliran Sinvani
23e889d710 auth: add support for role attributes
In the general case, roles might come with attributes attached to them.
These attributes can originate in mechanisms such as LDAP, where in
the underlying directory each entity can have a key:value data structure.
This patch adds support for such attributes in the role manager interface.
It also implements the attribute support in the standard role
manager in the form of a table with an attribute map in the distributed system keyspace.
Message-Id: <f53c74a7ac315c4460ff370ea6dbb1597821edc2.1609158013.git.sarna@scylladb.com>
2021-04-12 15:58:09 +02:00
Ivan Prisyazhnyy
0836efd830 tracing: test/boost/tracing: fix use after free
fixes AddressSanitizer: stack-buffer-underflow on address 0x7ffd9a375820 at pc 0x555ac9721b4e bp 0x7ffd9a374e70 sp 0x7ffd9a374620

Backend registry holds a unique pointer to the backend implementation
that must outlive the whole tracing lifetime until the shutdown call.

So it must be captured/moved before the program exits its scope by
passing it through the lambda chain.

Regarding deletion of the default destructor: moving an object requires
a move constructor (for do_with), which is not implicitly provided if
there is a user-declared destructor, even though its implementation
is defaulted.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>

Closes #8461
2021-04-12 16:44:07 +03:00
Avi Kivity
bad4924868 Merge 'Add a ninja help build target' from Pekka Enberg
This pull request adds a "ninja help" build target in hopes of making
the different build targets more discoverable to developers.

Closes #8454

* github.com:scylladb/scylla:
  building.md: Document "ninja help" target
  configure.py: "ninja help" target
  building.md: Document "ninja <mode>-dist" target
  configure.py: Add <mode>-dist target as alias for dist-<mode>
2021-04-12 16:30:37 +03:00
Avi Kivity
80529f7097 Revert "nonroot: generate scylla_sysconfdir.py correctly"
This reverts commit e991e01f2e. It
breaks installation on CentOS 7.

Fixes #8456.
2021-04-12 16:19:39 +03:00
Gleb Natapov
9fdb3d3d98 raft: stop using seastar::pipe to pass log entries to apply_fiber
Stop using seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above queue anyway and it
adds functionality that we do not need (EOS) and hides functionality that
we do (being able to abort()). This fixes a crash during abort where the
pipe was used after being destroyed.

Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
2021-04-12 13:18:03 +02:00
Avi Kivity
a24771125e Update seastar submodule
* seastar 1c1f610ceb...d2dcda96bb (3):
  > closeable: add with_closeable and with_stoppable helpers
  > circleci: relax concurrency of the build process
  > logger: failed_to_log: print source location and format string
2021-04-12 12:52:01 +03:00
Raphael S. Carvalho
224120f7df sstables: rewrite compound_sstable_set::all()
Procedure is rewritten using std::partition, making it easier to
maintain and it also fixes a theoretical quadratic behavior because
list is entirely copied when extending it, which isn't harmful
because maintenance set will be rarely populated and there are only
2 sets at most.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210409171412.57729-1-raphaelsc@scylladb.com>
2021-04-12 12:45:43 +03:00
Piotr Sarna
d77eb39076 Merge 'cdc: log: avoid linearizations' from Michał Chojnowski
CDC log uses `bytes` to deal with cells and their values, and linearizes all
values indiscriminately.  This series makes a switch from `bytes` to
`managed_bytes` to avoid that linearization.

Fixes #7506.

Closes #8429

* github.com:scylladb/scylla:
  cdc: log: change yet another occurrence of `bytes` to `managed_bytes`
  cdc: log: switch the remaining usages of `bytes` to `managed_bytes` in collection_visitor
  cdc: log: change `deleted_elements` in log_mutation_builder from bytes to managed_bytes
  cdc: log: rewrite collection merge to use managed_bytes instead of bytes
  cdc: log: don't linearize collections in get_preimage_col_value
  cdc: log: change return type of get_preimage_col_value to managed_bytes
  cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell
  cdc: log: switch cell_map from bytes to managed_bytes
  cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view
  cdc: log: don't linearize the primary key in log_mutation_builder
  atomic_cell: add yet another variant of make_live for managed_bytes_view
  compound: add explode_fragmented
2021-04-12 10:56:12 +02:00
Avi Kivity
bd16e98019 expr: give a name to a tuple of columns
Right now, binary_operator::lhs is a variant<column_value,
std::vector<column_value>, token>. The role of the second branch
(a vector of column values) is to represent a tuple of columns
(e.g. "WHERE (a, b, c) = ?"), but this is not clear from the type
name.

Introduce a wrapper type around the vector, column_value_tuple, to
make it clear we're dealing with tuples of CQL references (a
column_value is really a column_ref, since it doesn't actually
contain any value).

Closes #8208
2021-04-12 09:40:16 +02:00
Pekka Enberg
d34571dfd9 building.md: Document "ninja help" target 2021-04-12 10:35:02 +03:00
Pekka Enberg
698710598a configure.py: "ninja help" target
This adds a "help" build target, which prints out important build
targets. The printing is done in a separate shell script, because "ninja"
insists on printing out the "command" before executing it, which makes the
help text unreadable.
2021-04-12 10:35:02 +03:00
Kamil Braun
7ffb0d826b clustering_order_reader_merger: handle empty readers
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).

The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).

It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.

Fixes #8445.

Closes #8446
2021-04-12 10:34:52 +03:00
Pekka Enberg
e77c7f4543 building.md: Document "ninja <mode>-dist" target
Let's document the new "<mode>-dist" target to encourage people to use it.
2021-04-12 10:31:46 +03:00
Pekka Enberg
e959c90af8 configure.py: Add <mode>-dist target as alias for dist-<mode>
The build and test build targets put "mode" as prefix, so let's unify
the dist target too in preparation for "ninja help".
2021-04-12 10:29:54 +03:00
Michael Livshin
09f221203f build: tolerate ./build being a symbolic link
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210411122951.14196-1-michael.livshin@scylladb.com>
2021-04-12 10:08:56 +03:00
Avi Kivity
9bc45d9243 build: drop lld from install-dependencies.sh on s390x
lld is not available any more on s390x. Since it's optional, we can
just drop it on that platform.

Closes #8430
2021-04-12 09:46:33 +03:00
Nadav Har'El
2932f20b40 cql-pytest: translate Cassandra's reproducers for issue #2963
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnStaticColumnTest.java into our
cql-pytest framework.

This test file checks various features of indexing (with secondary index)
static rows. All these tests pass on Cassandra, but fail on Scylla because
of issue #2963 - we do not yet support indexing of a static row.
The failing tests currently fail as soon as they try to create the index,
with the message:
"Indexing static columns is not implemented yet."

Refs #2963.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411153014.311090-1-nyh@scylladb.com>
2021-04-12 08:11:35 +02:00
Nadav Har'El
989589b570 test/cql-pytest,alternator,redis: avoid an annoying warning
This patch avoids an annoying warning

    Warning: Unknown config ini key: flake8-ignore

when running one of the pytest-based test projects (cql-pytest,
alternator and redis) on recent versions of pytest.

In commit 2022da2405, we added to the
toplevel Scylla directory a "tox.ini" file with some intention to
configure Python syntax checking. One of the configurations in this
tox.ini is:

    [pytest]
    flake8-ignore =
        E501

It turns out that pytest, if a certain test directory does not have its
own pytest.ini file, looks in ancestor directories for various
configuration files (the configuration file precedence is described in
https://docs.pytest.org/en/stable/customize.html), and this includes
this tox.ini configuration section. Recent versions of pytest complain
about the "flake8-ignore" configuration parameter, which they don't
recognize. This parameter may be ok (?) if you install a flake8 pytest
plugin, but we do not require users to do this for running these tests.

Moreover, whatever noble intentions this commit and its tox.ini had,
nobody ever followed up on it. The three pytest-based test directories
never adhered to flake8's recommended syntax, and never intended to do
so. None of the developers of these tests use flake8, or seem to wish
to do so. If this ever changes, we can change the pytest.ini or undo this
commit and go back to a top-level tox.ini, but I don't see this happening
anytime soon.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411085708.300851-1-nyh@scylladb.com>
2021-04-12 08:04:06 +02:00
Avi Kivity
35a3d65ee7 install.sh: document pathname components
install.sh supports two different ways of redirecting paths:
--root for creating a chroot-style tree, and --prefix for changing
the installed file location. Document them.

Closes #8389
2021-04-11 21:03:57 +03:00
Avi Kivity
ec3db140cb utils: data_input: replace enable_if with tightened concept
std::is_fundamental isn't a good constraint since it includes nullptr_t
and void. Replace with std::integral which is sufficient. Use a concept
instead of enable_if to simplify the code.

Closes #8450
2021-04-11 18:56:21 +03:00
Nadav Har'El
d5121d1476 scripts/refresh-submodules.sh: allow choosing which submodule to refresh
Currently, scripts/refresh-submodules.sh always refreshes all
submodules, i.e., takes the latest version of all of them and
commits it. But sometimes, a committer only wants to refresh a specific
submodule, and doesn't want to deal with the implications of updating
a different one.

As a recent example, for issue #8230, I wanted to update the tools/java
submodule, which included a fix for sstableloader, without updating the
Seastar submodule - which contained completely irrelevant changes.

So in this patch we add the ability to override the default list of
submodules that refresh-submodules.sh uses, with one or more command
line parameters. For example:

    scripts/refresh-submodules.sh tools/java

will update only tools/java.
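
The override described above can be sketched in Python (the real script is a shell script; the default list and the function name below are hypothetical, chosen only to illustrate the "arguments replace the default list" behaviour):

```python
# Hypothetical default list; the real script's defaults may differ.
DEFAULT_SUBMODULES = ["seastar", "tools/java"]

def submodules_to_refresh(argv):
    """Return the submodules to refresh.

    If any command line parameters are given, they replace the default
    list entirely; otherwise every default submodule is refreshed.
    """
    return list(argv) if argv else list(DEFAULT_SUBMODULES)

# No parameters: refresh everything in the default list.
all_of_them = submodules_to_refresh([])
# One parameter: refresh only that submodule.
only_java = submodules_to_refresh(["tools/java"])
```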

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411151421.309483-1-nyh@scylladb.com>
2021-04-11 18:35:04 +03:00
Benny Halevy
705f9c4f79 commitlog: segment_manager: max_size must be aligned
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.

This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).

The fix here is to select a slightly smaller max_size that is aligned
down to a multiple of 1MB.
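
As a rough illustration of the arithmetic only (a Python sketch, not the actual C++ change; `align_down` and the constant names are hypothetical), rounding the 32-bit limit down to a multiple of 1MB looks like this:

```python
MB = 1024 * 1024
# std::numeric_limits<position_type>::max(), per the message above.
POSITION_TYPE_MAX = 0xFFFFFFFF

def align_down(value, alignment):
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)

max_size = align_down(POSITION_TYPE_MAX, MB)
print(hex(max_size))  # 0xfff00000
```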

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
2021-04-11 13:17:50 +03:00
Avi Kivity
3814f74a74 Update seastar submodule
* seastar caba9fda34...1c1f610ceb (3):
  > scripts/perftune.py: allow configuring disks write cache mode
  > test: file_utils: tmp_dir_do_with_fail_remove_test: rename inner tmp_dir to trigger error
  > circleci: switch to dedicated machine
2021-04-11 13:13:53 +03:00
Raphael S. Carvalho
5c630f405a table: introduce trigger_offstrategy_compaction()
this function will be used on repair-based operation completion,
to notify table about the need to start offstrategy compaction
process on the maintenance sstables produced by the operation.
Function which notifies about bootstrap and replace completion
is changed to use this new function.
Removenode and decommission will reuse this function.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-09 14:53:14 -03:00
Raphael S. Carvalho
f60f32f7fa repair/row_level: make operations_supported static const
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-09 14:42:10 -03:00
Tomasz Grabiec
305372820d Merge "Make position_in_partition::tri_compare use strong_ordering" from Pavel Emelyanov
There are some users of that tri_comparator which are also
converted to strong_ordering. Most of the code using those
is, in turn, already handling return values interchangeably.

The bound_view::tri_compare, which's used by the guy, is
still returning int.

tests: unit(dev)

* xemul/br-position-tri-compare:
  code: Relax position_in_partition::tri_compare users
  position_in_partition: Convert tri_compare to strong_ordering
  test: Convert clustering_fragment_summary::tri_cmp to strong_ordering
  repair: Convert repair_sync_boundary::tri_compare to strong_ordering
  view: Don't expect int from position_in_partition::tri_compare
2021-04-09 17:54:38 +02:00
Pavel Emelyanov
64074f45ce code: Relax position_in_partition::tri_compare users
There are some pieces left doing res <=> 0 with the
res now being a strong_ordering itself. All these can
be just dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
92e72c62dc position_in_partition: Convert tri_compare to strong_ordering
All its users are now ready to accept both - int and
the strong_ordering value, so the change is pretty
straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
a15f158661 test: Convert clustering_fragment_summary::tri_cmp to strong_ordering
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
ba4699ffca repair: Convert repair_sync_boundary::tri_compare to strong_ordering
The change partially reverts 37855641

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
70c851e69b view: Don't expect int from position_in_partition::tri_compare
Now it's int, but soon will be std::strong_ordering.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Tomasz Grabiec
20af98c61b Merge "Tweak partition_snapshot_row_cursor" from Pavel Emelyanov
The set puts the partition_snapshot_row_cursor on a
diet: 320 -> 224 bytes, makes use of btree API sugar
to save some CPU cycles and compacts the code.

tests: unit(dev)

* xemul/br-row-cursor-cleanups-2:
  partition_snapshot_row_cursor: Rewrite row() with consume_row()
  partition_snapshot_row-cursor: Add const consume_row() version
  partition_snapshot_row_cursor: Add concept to .consume_row()
  partition_snapshot_row_cursor: Don't carry end iterators
  partition_snapshot_row_cursor: Move cells hash creation to reader
  partition_snapshot_row_cursor: Move read_partition into test
  partition_snapshot_row_cursor: Move is_in_latest_version inline
  partition_snapshot_row_cursor: Use is_in_latest_version where appropriate
  partition_snapshot_row_cursor: Less dereferences in key() method
  partition_snapshot_row_cursor: Update change mark in prepare_heap
  partition_snapshot_row_cursor: Clear current row when recreating
  partition_snapshot_row_cursor: Use btree::lower_bound sugar
  partition_snapshot_row_cursor: Factor out next() and erase_and_advance()
  partition_snapshot_row_cursor: Relax vector of iterators
  btree: Add operator bool()
  clustering_row: Add new .apply() overload
2021-04-09 14:51:24 +02:00
Botond Dénes
54edd613c8 mutation_query: remove now unused mutation_query()
If somebody wants to query a generic mutation source in the future, they
can still do it via `mutation_querier::consume_page()` and the right
result builder.
2021-04-09 13:40:27 +03:00
Botond Dénes
3dbb456fba test: mutation_query_test: use local mutation_query() implementation
Add a local `mutation_query()` variant, which only contains the pieces
of logic the test really wants to test: invoking
`mutation_querier::consume_page()` with a `reconcilable_result_builder`.
This allows us to get rid of the now otherwise unused
`mutation_query()`.
2021-04-09 13:40:27 +03:00
Botond Dénes
80a03826e3 database: mutation_query(): use table::mutation_query()
Instead of `mutation_query()` from `mutation_query.hh`. The latter is
about to be retired as we want to migrate all users to
`table::mutation_query()`.
As part of this change, move away from `mutation_query_stage` too. This
brings the code paths of the two query variants closer together, as they
both have an execution stage declared in `database`.
2021-04-09 13:40:27 +03:00
Botond Dénes
5c8f142fe5 table: add mutation_query()
We want to migrate `database::mutation_query()` off `mutation_query()`
to use `table::mutation_query()` instead. The reason is the same as for
making `table::query()` standalone: the `mutation_query()`
implementation increasingly became specific to how tables are queried
and is about to become even more specific due to impending changes to
how permits are obtained. As no-one in the codebase is doing generic
mutation queries on generic mutation sources we can just make this a
member of table.
This patch just adds `table::mutation_query()`, no user exists yet.
`table::mutation_query()` is identical to `mutation_query()`, except
that it is a coroutine.
2021-04-09 13:40:27 +03:00
Botond Dénes
a4facf316d query: remove the now unused data_query()
If somebody wants to query a generic mutation source in the future, they
can still do it via `data_querier::consume_page()` and the right result
builder.
2021-04-09 13:40:27 +03:00
Botond Dénes
59ea36731b test: mutation_query_test: use local data_query() implementation
The test only wants to test result size calculation so it doesn't need
the whole `data_query()` logic. Replace the call to `data_query()` with
one to a local alternative which contains just the necessary bits --
invoking `data_querier::consume_page()` with the right result builder.
This allows us to get rid of the now otherwise unused `data_query()`.
2021-04-09 13:40:27 +03:00
Botond Dénes
c3f0681011 table: query(): inline data_query() code into query()
`data_query()` is now just a thin wrapper over
`data_querier::consume_page()`. Furthermore, contrary to the old data
query method, it is not a generic way of querying a mutation source, it
is now closely tied to how we query tables. It does a querier lookup and
save. In the future we plan on tying it even closer to the table in how
permits are obtained. For this reason it is better to just inline it
into the `query()` method which invokes it.
2021-04-09 13:40:27 +03:00
Pavel Emelyanov
89eece3aca partition_snapshot_row_cursor: Rewrite row() with consume_row()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:18:29 +03:00
Pavel Emelyanov
ae6b677f9a partition_snapshot_row-cursor: Add const consume_row() version
It's the same as the existing one, but doesn't modify
anything (cursor and pointing rows_entry's) and calls
consumer with const row reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:18:29 +03:00
Pavel Emelyanov
5e28075ec0 partition_snapshot_row_cursor: Add concept to .consume_row()
Nothing special here

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:18:29 +03:00
Pavel Emelyanov
d891cfe6cd partition_snapshot_row_cursor: Don't carry end iterators
The btree's iterator can be checked to reach the tree's end
without holding the ending iterator itself. This makes the
whole p_s_r_c 20% smaller (288 bytes -> 224 bytes) since it
no longer keeps 4 extra iterators on board -- inside small vectors
for heap and current_row.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:18:29 +03:00
Pavel Emelyanov
4558eb3afc partition_snapshot_row_cursor: Move cells hash creation to reader
Right now call to .row() method may create hash on row's cells.
It's counterintuitive to see a const method that transparently
changes something it points to. Since the only caller of a row()
who knows whether the hash creation is required is the cache
reader, it's better to move the call to prepare_hash() into it.

Other than making the .row() less surprising this also helps to
get rid of the whole method by the next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:18:29 +03:00
Pavel Emelyanov
00caf5f219 partition_snapshot_row_cursor: Move read_partition into test
The method in question is a test-only helper; there's no
need to keep it as a part of the API.

Another reason to move is that the method is O(number of
rows) and doesn't preempt while looping, but cursor code
users try hard not to stall the reactor. So even though
this method has meaningful semantics within the class,
it would be better to reinvent it if needed in core code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 12:16:13 +03:00
Avi Kivity
ad10a6a220 Update seastar submodule
* seastar fcd46c138...caba9fda3 (5):
  > file: mark overlayfs as not supporting RWF_NOWAIT
  > dns: fix tcp sendv return value to c-ares
Fixes #8442.
  > test: closeable: allocate variables accessed by continuations using do_with
  > test: Fix leak in io_queue_test
  > test: rpc_test: reduce memory usage in compression tests
2021-04-09 11:48:50 +03:00
Pavel Emelyanov
9f323355a6 partition_snapshot_row_cursor: Move is_in_latest_version inline
The method is currently defined outside of the class, which
gives the compiler fewer chances to really inline it when needed.

Also, keeping this simple piece of code inline is less code
to read (and compile).

Mark the guy noexcept while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
cc57e35c6a partition_snapshot_row_cursor: Use is_in_latest_version where
appropriate

Checking for _current_row[0].version being 0 (or not being 0)
is better understood if done with a well named existing helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
353a8f66a2 partition_snapshot_row_cursor: Less dereferences in key() method
The valid cursor's key is kept on the _position as well,
but getting it from there is 1 dereference less:

_current_row -(*)-> row -> key
_position -(**)-> std::optional -> key

* iterator's -> is pointer dereference
** std::optional is designed not to be a pointer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
353a1306ce partition_snapshot_row_cursor: Update change mark in prepare_heap
The heap's iterators validity is checked with the change mark,
which is updated every time heap is recreated. Factor these
updates out and keep the mark together with the heap it protects.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
1a1f05f50b partition_snapshot_row_cursor: Clear current row when recreating
The cursor keeps current row in a separate vector of iterators
and reconstructs it in a dedicated method, which _expects_ that
the vector is empty on entry.

It's better to keep the logic of current row construction in one
place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
2edd072d27 partition_snapshot_row_cursor: Use btree::lower_bound sugar
When checking if the lower-bound entry matched the search
key it's possible to avoid extra comparison with the help
of the collection used to store the rows (btree).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
9aee0ad8b3 partition_snapshot_row_cursor: Factor out next() and erase_and_advance()
Both helpers do the same -- advance the cursor to the next row.
The latter may additionally remove the row from the uniquely
owned version.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Pavel Emelyanov
2fb0f7315c partition_snapshot_row_cursor: Relax vector of iterators
The cursor maintains a vector of iterators that correspond to
each of the versions scanned. However, only the iterator in
the latest one is really needed, so the whole vector can be
reduced down to an optional.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 11:45:45 +03:00
Botond Dénes
b03f360bb0 table: make query() a coroutine
This method is very hard to read or modify in its current form due to
all the continuation-chain boilerplate. Make it a coroutine to
facilitate future changes, in the next patches and beyond.
2021-04-09 11:04:35 +03:00
Pavel Emelyanov
26e27e27e8 btree: Add operator bool()
The btree's iterators allow for simple checking for '== tree.end()'
condition. For this check neither the tree itself, nor the ending
iterator is required. One just needs to check if the _idx value is
the npos.

One additional change to make it work is required -- when removing
an entry from the inline node the _idx should be set to npos.

This change is, well, a bugfix. An iterator left with 0 in _idx is
treated as a valid one. However, the bug is non-triggerable. If such
an "invalid" iterator is compared against tree.end() the check would
return true, because the tree pointers would coincide.

So this patch adds an operator bool() to btree iterator to facilitate
simpler checking if it reached the end of the collection or not.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 10:05:47 +03:00
Pavel Emelyanov
772fe2b089 clustering_row: Add new .apply() overload
The clustering_row is a wrapper over the deletable_row and
facilitates the apply-creation of the latter from some other
objects. Soon it will accept the deletable_row itself for
apply()-ing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 10:05:47 +03:00
Benny Halevy
830128cd95 streaming: stream_session: do not log err.c_str verbatim
It is dangerous to print a formatted string as is, like
sslog.warn(err.c_str()) since it might hold curly braces ('{}')
and those require respective runtime args.

Instead, it should be logged as e.g. sslog.warn("{}", err.c_str()).
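
The hazard can be shown by analogy in Python, whose `str.format` also treats `{}` as a replacement field. This is only a sketch: `log_warn` is a hypothetical stand-in for the logger, not the real C++ API.

```python
def log_warn(fmt, *args):
    """Hypothetical stand-in for a logger call: formats fmt with args."""
    return fmt.format(*args)

# A pre-formatted message that happens to contain curly braces.
err = "stream failed for range {}"

# Unsafe: the message itself is used as the format string, so the '{}'
# inside it demands an argument that was never supplied.
try:
    log_warn(err)
    unsafe_ok = True
except IndexError:
    unsafe_ok = False

# Safe: the message is passed as an argument to a constant format string.
safe = log_warn("{}", err)
```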

This will prevent issues like #8436.

Refs #8436

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-2-bhalevy@scylladb.com>
2021-04-09 08:36:49 +03:00
Benny Halevy
76cd315c42 streaming: stream_session: do not escape curly braces in format strings
Those turn into '{}' in the formatted strings and trigger
a logger error in the following sslog.warn(err.c_str())
call.

Fixes #8436

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-1-bhalevy@scylladb.com>
2021-04-09 08:36:49 +03:00
Gleb Natapov
b9175edea4 raft: test: check that a server with id zero can be neither created nor added to a config
Message-Id: <20210407134853.1964226-2-gleb@scylladb.com>
2021-04-08 17:07:18 +02:00
Gleb Natapov
fb938a36d4 raft: disallow adding and creating servers with id zero
Id zero has special meaning in the code and cannot be valid server id.
Message-Id: <20210407134853.1964226-1-gleb@scylladb.com>
2021-04-08 17:07:18 +02:00
Kamil Braun
3687757115 sstables: fix TWCS single key reader sstable filter
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.

Fixes #8432.

Closes #8433
2021-04-08 18:03:49 +03:00
Avi Kivity
3a58985674 Merge 'scylla_ntp_setup: detect already installed ntp client' from Takuya ASADA
In the current implementation, we may re-run NTP configuration even if it is
already configured.
Also, the system may be configured with a non-default NTP client; we just
ignore that and configure the default NTP client.

This patch minimizes unnecessary re-configuration of the NTP client.
It runs in the following order:
 1. Check whether an NTP client is already running. If one is running, skip setup
 2. Check whether an NTP client is already installed. If one is installed, use it
 3. If no NTP client package is installed,
    - if it's CentOS, install chrony
    - if it's on other distributions, install systemd-timesyncd

Closes #8431

* github.com:scylladb/scylla:
  scylla_ntp_setup: detect already installed ntp client
  scylla_util.py: return bool value on systemd_unit.is_active()
2021-04-08 17:27:15 +03:00
Takuya ASADA
735c83b27f scylla_ntp_setup: detect already installed ntp client
In the current implementation, we may re-run NTP configuration even if it is
already configured.
Also, the system may be configured with a non-default NTP client; we just
ignore that and configure the default NTP client.

This patch minimizes unnecessary re-configuration of the NTP client.
It runs in the following order:
 1. Check whether an NTP client is already running. If one is running, skip setup
 2. Check whether an NTP client is already installed. If one is installed, use it
 3. If no NTP client package is installed,
    - if it's CentOS, install chrony
    - if it's on other distributions, install systemd-timesyncd
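
The order above can be sketched as a small decision function. This is an illustrative Python sketch only: the function name, its parameters and the returned tuples are hypothetical and do not reflect the actual scylla_ntp_setup code.

```python
def choose_ntp_action(running, installed, distro):
    """Decide what the setup should do, following the order listed above.

    running:   name of an NTP client that is already running, or None
    installed: name of an NTP client that is installed, or None
    distro:    distribution identifier, e.g. 'centos' (assumed spelling)
    """
    if running:
        # 1. A client is already running: nothing to set up.
        return ("skip", running)
    if installed:
        # 2. A client is installed: configure that one.
        return ("configure", installed)
    # 3. Nothing installed: pick the default for the distribution.
    if distro == "centos":
        return ("install", "chrony")
    return ("install", "systemd-timesyncd")
```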

Related with #8344, #8339
2021-04-08 22:52:02 +09:00
Botond Dénes
32ae51dc2c table: query(): fix typo (short_read_allwoed)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210408133018.65692-1-bdenes@scylladb.com>
2021-04-08 16:34:08 +03:00
Tomasz Grabiec
6d6f39a7b3 Merge "fixes for stepdown and quorum check" from Gleb
The series contains code cleanups and fixes for stepdown process
and quorum check code.

Note this is a re-send of already posted patches lumped together for
convenience.

* scylla-dev/raft-fixes-v1:
  raft: add test for check quorum on a leader
  raft: fix quorum check code for joint config and non-voting members
  raft: do not hang on waiting for entries on a leader that was removed from a cluster
  raft: add more tracing to stepdown code
  raft: use existing election_elapsed() function instead of redo the calculation
  raft: test: add test case for stepdown process
  raft: check that a node is still the leader after initiating stepdown process
2021-04-08 15:18:52 +02:00
Takuya ASADA
2545d7fd43 scylla_util.py: return bool value on systemd_unit.is_active()
Currently, 'if unit.is_active():' is always True since is_active()
returns the result as a string (active, inactive, unknown).
To avoid such scripting bugs, change the return value to bool.
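
A minimal illustration of why the string return value was a problem. The two helper functions are stand-ins written for this example, not the real systemd_unit methods.

```python
# Stand-ins for the old and new behaviour; names are hypothetical.
def is_active_str(state):
    # Old behaviour: return the state name itself.
    return state

def is_active_bool(state):
    # New behaviour: True only when the unit is in the "active" state.
    return state == "active"

# Any non-empty string is truthy in Python, so the old check also
# passes for "inactive" and "unknown".
old_check = bool(is_active_str("inactive"))
new_check = is_active_bool("inactive")
```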
2021-04-08 21:54:05 +09:00
Michał Chojnowski
6b31f73987 cdc: log: change yet another occurence of bytes to managed_bytes 2021-04-08 10:16:21 +02:00
Michał Chojnowski
061f72166c cdc: log: switch the remaining usages of bytes to managed_bytes in collection_visitor 2021-04-08 10:16:21 +02:00
Michał Chojnowski
2760382a68 cdc: log: change deleted_elements in log_mutation_builder from bytes to managed_bytes 2021-04-08 10:16:21 +02:00
Michał Chojnowski
ba53c85829 cdc: log: rewrite collection merge to use managed_bytes instead of bytes 2021-04-08 10:16:21 +02:00
Michał Chojnowski
42acdc4d09 cdc: log: don't linearize collections in get_preimage_col_value 2021-04-08 10:16:21 +02:00
Michał Chojnowski
70a2bed70b cdc: log: change return type of get_preimage_col_value to managed_bytes 2021-04-08 10:16:21 +02:00
Michał Chojnowski
4214e74678 cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell 2021-04-08 10:16:11 +02:00
Michał Chojnowski
c2b43c8daf cdc: log: switch cell_map from bytes to managed_bytes 2021-04-08 10:05:30 +02:00
Michał Chojnowski
4e8eb07de4 cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view 2021-04-08 10:05:00 +02:00
Michał Chojnowski
f18b74eee5 cdc: log: don't linearize the primary key in log_mutation_builder 2021-04-08 10:04:31 +02:00
Michał Chojnowski
890e6377ab atomic_cell: add yet another variant of make_live for managed_bytes_view
We will use it in the next patches of this series.
2021-04-08 10:04:23 +02:00
Michał Chojnowski
5a2b492f09 compound: add explode_fragmented
We will use it in the next patches in this series.
2021-04-08 10:02:54 +02:00
Asias He
a8c90a5848 storage_service: Reject replacing a node that has left the ring
1) start n1, n2, n3

2) decommission n3

3) remove /var/lib/scylla for n3

4) start n4 with the same ip address as n3 to replace n3

5) replace will be successful

If a node has left the ring, we should reject the replace operation.

This patch makes the check during replace operation more strict and
rejects the replace if the node has left the ring.

After the patch, we will see

ERROR 2021-04-07 08:02:14,099 [shard 0] init - Startup failed:
std::runtime_error (Cannot replace_adddress 127.0.0.3 because it has left
the ring, status=LEFT)

Fixes #8419

Closes #8420
2021-04-07 19:42:28 +03:00
Avi Kivity
202c631dee test: perf: perf_simple_query: collect allocation and task statistics
Calculate and display the number of memory allocations and tasks
executed per operation. Sample results (--smp 1):

180022.46 tps (90 allocs/op, 20 tasks/op)
178963.44 tps (90 allocs/op, 20 tasks/op)
178702.41 tps (90 allocs/op, 20 tasks/op)
177679.74 tps (90 allocs/op, 20 tasks/op)
179539.36 tps (90 allocs/op, 20 tasks/op)

median 178963.44 tps (90 allocs/op, 20 tasks/op)
median absolute deviation: 575.92
maximum: 180022.46
minimum: 177679.74

This allows less noisy tracking of how some changes impact performance.
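
For reference, the summary lines above can be reproduced from the five sample throughput values with a short script. This is a sketch using Python's standard library, not the code perf_simple_query uses.

```python
import statistics

# Throughput samples (tps) copied from the output above.
samples = [180022.46, 178963.44, 178702.41, 177679.74, 179539.36]

median = statistics.median(samples)
# Median absolute deviation: the median of the distances from the median.
mad = statistics.median(abs(x - median) for x in samples)

print(f"median {median:.2f} tps")
print(f"median absolute deviation: {mad:.2f}")
print(f"maximum: {max(samples):.2f}")
print(f"minimum: {min(samples):.2f}")
```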
2021-04-07 17:54:48 +03:00
Avi Kivity
3a90df39c5 perf: deinline some functions in perf.hh
Those functions were defined in a header, but not marked inline.
This made including the header from two source files impossible,
as the linker would complain about duplicate symbols.

Rather than making them inline, put them in a new source file
perf.cc as they don't need to be inline.
2021-04-07 17:51:58 +03:00
Avi Kivity
29a674cd94 test: perf: perf_fast_forward: report allocation rate and tasks
These are more stable than cpu consumed across runs, and impact
performance directly.

Closes #8422
2021-04-07 15:41:43 +02:00
Piotr Sarna
8e808a56d2 Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund
Fixes #8363
Fixes #8376

Delete segments has two issues when running with size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

If we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segments.
Abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.

Closes #8372

* github.com:scylladb/scylla:
  commitlog: Add signalling to recycle queue iff we fail to recycle
  commitlog: Fix race and edge condition in delete_segments
  commitlog: coroutinize delete_segments
  commitlog_test: Add test for deadlock in recycle waiter
2021-04-07 15:13:25 +02:00
Nadav Har'El
0dd6f2db8f Merge 'CDC generations: refactors and improvements' from Kamil Braun
The "most important" major changes are:

1. storage_service: simplify CDC generation management during node replace

Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.

But that's not necessary:
  - if this is the timestamp of the last (current) generation, we would
     obtain it from other nodes anyway (every node gossips the last known
     timestamp),
  - if this is the timestamp of an earlier generation, we would forget
     it immediately and start gossiping the last timestamp (obtained from
     other nodes).

This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.

2. tree-wide: introduce cdc::generation_id type

Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.

However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.

Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.

In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.

This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
- if a piece of code doesn't care about the timestamp, it just passes
   the ID around
- if it does care, it can access it using the `get_ts` function.
   The fact that `get_ts` simply accesses the ID's only field is an
   implementation detail.

3. cdc: handle missing generation case in check_and_repair_cdc_streams

check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.

I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.

---

Additionally the PR does some simplifications, stylistic improvements,
removes some dead code, coroutinizes some functions, uncoroutinizes others
(due to miscompiles), adds additional logging, updates some stale comments.
Read commit messages for more details.

Closes #8283

* github.com:scylladb/scylla:
  cdc: log a message when creating a new CDC generation
  cdc: handle missing generation case in check_and_repair_cdc_streams
  tree-wide: introduce cdc::generation_id type
  tree-wide: rename "cdc streams timestamp" to "cdc generation id"
  cdc: remove some functions from generation.hh
  storage_service: make set_gossip_tokens a static free-function
  db: system_keyspace: group cdc functions in single place
  cdc: get rid of "get_local_streams_timestamp"
  sys_dist_ks: update comment at quorum_if_many
  storage_service: simplify CDC generation management during node replace
2021-04-07 14:49:02 +03:00
Kamil Braun
6525111d21 cdc: log a message when creating a new CDC generation 2021-04-07 13:47:16 +02:00
Kamil Braun
0978155bec cdc: handle missing generation case in check_and_repair_cdc_streams
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.

I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
2021-04-07 13:47:16 +02:00
Kamil Braun
99fd2244a3 tree-wide: introduce cdc::generation_id type
This is a follow-up to the previous commit.

Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.

However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.

Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.

In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.

This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
   the ID around
2. if it does care, it can simply access it using the `get_ts` function.
   The fact that `get_ts` simply accesses the ID's only field is an
   implementation detail.
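
The idea of wrapping the timestamp in a dedicated ID type can be sketched as follows. This is a Python analogue for illustration only; the real type is the C++ `cdc::generation_id`, and everything here except the `get_ts` name is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GenerationId:
    """Opaque identifier of a CDC generation (illustrative analogue)."""
    ts: datetime  # today the ID happens to be just the timestamp

    def get_ts(self):
        # Callers that need the time point ask for it explicitly.
        return self.ts

def describe(gen_id):
    # Code that only cares about identity passes the ID around untouched.
    return f"generation {gen_id!r}"

gid = GenerationId(datetime(2021, 4, 7, tzinfo=timezone.utc))
```

If the ID later gains more fields, callers like `describe` would not need to change; only code that calls `get_ts` depends on the timestamp.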

Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
2021-04-07 13:47:13 +02:00
Raphael S. Carvalho
8e0a1ca866 sstable_set: Implement compound_sstable_set's create_single_key_sstable_reader()
The compound set doesn't override create_single_key_sstable_reader(), so
the default implementation is always called. Although the default
implementation provides correct behavior, specialized ones that provide
better performance, currently available only for TWCS, were being ignored.

compound set impl of single key reader will basically combine single key
readers of all sets managed by it.

Fixes #8415.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
2021-04-07 12:36:30 +03:00
Nadav Har'El
da11cd99f7 Merge 'Add a (failing) test for picking secondary indexes in order' from Piotr Sarna
Currently the heuristics for picking an index for a query
are not very well defined. It would be best if we used
statistics to pick the index which is likely to perform
the fastest, but for starters we should at least let the user
decide which index to pick by picking the first one by the
order of restrictions passed to the query.
The (failing) test case from this patch shows the expected
results.

Ref: #7969

Closes #8414

* github.com:scylladb/scylla:
  cql-pytest: add a failing test for index picking order
  cql3: add tracing used secondary index
2021-04-07 11:40:37 +03:00
Piotr Sarna
1f7b972db7 cql-pytest: add a failing test for index picking order
Currently the heuristics for picking an index for a query
are not very well defined. It would be best if we used
statistics to pick the index which is likely to perform
the fastest, but for starters we should at least let the user
decide which index to pick by picking the first one by the
order of restrictions passed to the query.
The (failing) test case from this patch shows the expected
results.

Ref: #7969
2021-04-07 10:05:00 +02:00
Gleb Natapov
68d73bd4c8 raft: add test for check quorum on a leader 2021-04-07 10:15:33 +03:00
Gleb Natapov
b3cb4f3966 raft: fix quorum check code for joint config and non-voting members
The current leader code checks that most nodes are alive, but this is
incorrect since some nodes may be non-voting and hence should not
cause a leader to step down if dead. It is also incorrect with joint config
since quorum is calculated differently there. Fix it by introducing an
activity_tracker class that knows how to handle all the above details.
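
A rough sketch of the rule described here, following the usual Raft joint-consensus definition: only voters count, and in a joint configuration a majority is needed in both the current and the previous voter sets. The function names are hypothetical and this is not the activity_tracker implementation.

```python
def majority_active(voters, active):
    # A strict majority of the voters must be in the active set.
    return len(voters & active) > len(voters) // 2

def has_active_quorum(current_voters, previous_voters, active):
    """Check whether the leader still sees a quorum of active voters.

    current_voters  -- set of voting members in the current config
    previous_voters -- set of voting members in the previous config,
                       or an empty set when the config is not joint
    active          -- set of nodes seen as alive (including the leader)
    Non-voting members are simply absent from both voter sets.
    """
    if not majority_active(current_voters, active):
        return False
    if previous_voters:
        # Joint config: both halves must have a majority.
        return majority_active(previous_voters, active)
    return True
```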
2021-04-07 10:15:33 +03:00
Gleb Natapov
a48a2c454b raft: do not hang on waiting for entries on a leader that was removed from a cluster
If a leader is removed from a cluster it will never know when entries
that it has not committed yet will be committed, so abort the wait in
this case with an uncertainty error.
2021-04-07 10:15:33 +03:00
Gleb Natapov
db03c94692 raft: add more tracing to stepdown code 2021-04-07 10:15:33 +03:00
Gleb Natapov
7dec56721c raft: use existing election_elapsed() function instead of redo the calculation 2021-04-07 10:15:33 +03:00
Gleb Natapov
bdb59307d3 raft: test: add test case for stepdown process
Add a test for the case where the C_new entry is not the last one in a
leader that has been removed from a cluster. In this case a leader will
continue replication even after committing C_new and will start the stepdown
process later, when at least one follower is fully synchronized.
2021-04-07 10:15:33 +03:00
Gleb Natapov
3bcd3212e2 raft: check that a node is still the leader after initiating stepdown process
Usually initiation of stepdown process does not immediately depose the
current leader, but if the current leader is no longer part of the
cluster it will happen. We were missing the check after initiating
stepdown process in append reply handling.
2021-04-07 10:15:33 +03:00
Avi Kivity
5109bf8b99 config: relax batch size warning and failure thresholds
We inherited very low thresholds for warning about and failing multi-partition
batches, but these warnings aren't useful. The size of a batch in bytes
has no impact on node stability. In fact the warnings can cause more
problems if they flood the log.

Fix by raising the warning threshold to 128 kiB (our magic size)
and the fail threshold to 1 MiB.

Fixes #8416.

Closes #8417
2021-04-06 20:56:06 +03:00
Calle Wilund
d734f85280 commitlog: Add signalling to recycle queue iff we fail to recycle
Fixes #8376

If a recycle should fail, we will sort of handle it by deleting
the segment, so no leaks. But if we have waiter(s) on the recycle
queue, we could end up deadlocked/starved because nothing is
incoming there.

This adds an abort of the queue iff we failed and no objects are
available. This will wake up any waiter, and he should retry,
and hopefully at least be able to create a new segment.
We then reset the queue to a new one. So we can go on.

v2:
* Forgot to reset queue
v3:
* Nicer exception handling in allocate_segment_ex
2021-04-06 16:38:14 +00:00
Calle Wilund
15dd76f0c2 commitlog: Fix race and edge condition in delete_segments
Fixes #8363

Delete segments has two issues when running with size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

Fixed by
a.) Doing delete serialized. It is not like being parallel here will
    win us speed awards. And now we can know exact footprint, and
    how many segments we have left to delete
b.) Check if we are a block across the footprint boundary, and people
    might be waiting for a segment. If so, don't delete segment, but
    recycle.

As a follow-up, we should probably instead adjust the commitlog size
limit (per shard) to be a multiple of segment sizes, but there are
risks in that too.
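
A loose model of fixes a.) and b.) is sketched below. It is not the commitlog code: the function name, the exact boundary test, and the rule that a segment is recycled whenever usage is within the limit are all assumptions made for illustration.

```python
def plan_segments(segment_sizes, usage, limit, waiters):
    """Decide, one segment at a time, whether to delete or recycle it.

    segment_sizes -- sizes of the segments queued for deletion
    usage         -- current disk footprint
    limit         -- configured size limit
    waiters       -- True if an allocation is waiting for a recycled segment
    """
    plan = []
    for size in segment_sizes:
        within_limit = usage <= limit
        # The segment straddles the limit: removing it would take us
        # from above the limit to below it.
        straddles = usage > limit and usage - size < limit
        if within_limit or (straddles and waiters):
            plan.append("recycle")
        else:
            plan.append("delete")
            # Deletes are serialized, so the footprint is known exactly.
            usage -= size
    return plan
```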
2021-04-06 16:38:14 +00:00
Calle Wilund
d9a9897892 commitlog: coroutinize delete_segments
Because we like cow routines.
2021-04-06 16:38:14 +00:00
Calle Wilund
813694b617 commitlog_test: Add test for deadlock in recycle waiter
Not a very good test, mind you. Nothing to verify, just see if
the test times out. But try to make it at least complete for
failure report.
2021-04-06 16:38:14 +00:00
Piotr Sarna
1c99ed6ced cql3: add tracing used secondary index
The indexed queries will now record which index was chosen
for fetching the base table keys.
Example output:
 activity
------------------------------------------------------------------------------------------------------------------------
                                                                                                    Parsing a statement
                                                                                                 Processing a statement
                                                                  Consulting index my_v2_idx for a single slice of keys
 Creating read executor for token -3248873570005575792 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE
                                                                                            read_data: querying locally
                                               Start querying singular range {{-3248873570005575792, pk{000400000002}}}
                           Querying cache for range {{-3248873570005575792, pk{000400000002}}} and slice {(-inf, +inf)}
                                                                                                       Querying is done
                                                                                   Done processing - preparing a result
2021-04-06 17:16:29 +02:00
Tomasz Grabiec
4b10247a4f Merge "raft: do not assert when receiving unexpected messages in a leader state" from Gleb
* scylla-dev/raft-cleanup-v2:
  raft: test: add test that leader behaves as expected when it gets unexpended messages
  raft: do not assert when receiving unexpected messages in a leader state
  raft: use existing function to check if election timeout elapsed
2021-04-06 16:52:23 +02:00
Konstantin Osipov
c83cf1f965 uuid: switch the API to use std::chrono
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.

The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.

For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows us to
increase code reuse with template metaprogramming and remove a few
UUID_gen functions that became redundant as a result.

* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(),
  min_time_UUID(), max_time_UUID(), create_time_safe() to
  std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
  redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse two similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
2021-04-06 17:12:54 +03:00
Nadav Har'El
91249e9683 Update tools/java submodule
* tools/java 5756445ec7...57eb143119 (1):
  > sstableloader: Handle non-prepared batches with ":" in identifier names

Fixes #8230.
2021-04-06 16:37:03 +03:00
Nadav Har'El
0d0db05cf3 test/alternator: speed up two slow xfailing tests
By far the two slowest Alternator tests when running a development build on
my laptop are
	test_gsi.py::test_gsi_projection_include
and
	test_gsi.py::test_gsi_projection_keys_only
Each of those takes around 3.2 seconds, and the sum of just these two tests is as
much as 10% (!) of the time of all other 600 tests.

The reason why these tests are slow is that they check scanning a GSI
with *projection*. Scylla currently ignores the projection, so the scan
returns the wrong value. Because this is a GSI, which supports only
eventually-consistent reads, we need to retry the read - and did so for
up to 3 seconds!

But this retry only makes sense if the GSI read did not *yet* return
the expected data. But in these xfailing tests, we read a *wrong* item
(with too many attributes) almost immediately, and this should indicate
an immediate failure - no amount of retry would help. So in this patch
we detect this case and fail the test immediately instead of wasting
3 seconds in retries.

On my laptop with dev build, this patch reduces the time to run the
entire Alternator test suite from 70 seconds to 63 seconds.

Also, now that we never just waste time until the timeout, we can
increase it to any number, and in this patch we increase it from 3
seconds to 5.
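
The retry logic described above can be sketched roughly like this. The helper name, its parameters, and the `is_wrong` predicate are hypothetical; the actual test utilities may look different.

```python
import time

def wait_for_expected(read, expected, is_wrong, timeout=5.0, interval=0.1):
    """Retry an eventually-consistent read until it matches `expected`.

    read     -- callable performing the read
    is_wrong -- callable returning True when the result can never become
                correct by waiting (e.g. an item with too many attributes)
    Returns True on success, False on a definite failure or a timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = read()
        if result == expected:
            return True
        if is_wrong(result):
            # Wrong data, not merely stale data: retrying cannot help.
            return False
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a predicate like this, a read that returns extra attributes fails on the first attempt, so raising the timeout does not slow down the xfailing tests.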

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317183918.1775383-1-nyh@scylladb.com>
2021-04-06 14:49:15 +02:00
Nadav Har'El
15cab90f7b test/alternator: switch some fixture scopes from "session" to "module"
In conftest.py we have several fixtures creating shared tables which many
test files can share, so they are marked with the "session" scope - all
the tests in the testing session may share the same instance. This is fine.

Some of the test files have additional fixtures for creating special tables
needed only in those files. Those were also, unnecessarily, marked
with the "session" scope. This means that these temporary tables are
only deleted at the very end of the test suite, even though they can be
deleted at the end of the test file which needed them. This is exactly
what the "module" fixture scope is, so this patch changes all the
fixtures private to one test file to be "module".

After this patch, the teardown of the last test in the suite goes down
from 4 seconds to just 1.5 seconds (it's still long because there are
still plenty of session-scoped fixtures in conftest.py).

Another small benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.

This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317175036.1773774-1-nyh@scylladb.com>
2021-04-06 14:43:36 +02:00
Takuya ASADA
0b2c1edddc scylla_ntp_setup: support systemd-timesyncd
On Ubuntu/Debian, systemd-timesyncd is the default NTP client and is installed
by default.
So use it instead of installing chrony.

Fixes #8339

Closes #8344
2021-04-06 15:28:34 +03:00
Kamil Braun
e486e0f759 tree-wide: rename "cdc streams timestamp" to "cdc generation id"
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).

It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.

The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.

Some stale comments have been updated.
2021-04-06 13:15:31 +02:00
Kamil Braun
0cb2f58514 cdc: remove some functions from generation.hh
They are not used outside of the generation module.
2021-04-06 13:15:31 +02:00
Kamil Braun
deae0aa8ba storage_service: make set_gossip_tokens a static free-function
It's always good to make the storage_service class smaller.
2021-04-06 13:15:31 +02:00
Kamil Braun
1019ff07cb db: system_keyspace: group cdc functions in single place 2021-04-06 13:15:31 +02:00
Kamil Braun
2e2d51cf2b cdc: get rid of "get_local_streams_timestamp"
This function retrieves the persisted timestamp of the last known CDC
generation (which this node is currently gossiping to other nodes).
It checks that the timestamp is present; if not, it throws an error.

The check is unnecessary. It's used only in a quite esoteric place
(start_gossiping, which implements an almost-never-used API call),
and it's fine if the timestamp is gone - in start_gossiping,
we can start gossiping the tokens without the CDC generation timestamp
(well, if the timestamp is not present in system tables, something
weird must have happened, but that doesn't mean we can't resume
gossiping - fixing CDC generation management in such a case is
a separate problem).
2021-04-06 13:15:31 +02:00
Kamil Braun
3cebe99613 sys_dist_ks: update comment at quorum_if_many
The comment mentioned tables that no longer exist: their names have
changed some time ago. Update the comment to be name-agnostic.

Furthermore, the second part of the comment related to a case of "joining
a node without bootstrapping". Fortunately this operation is no longer
possible (after #6848 which became part of Scylla 4.3) so we can shorten
the comment.
2021-04-06 13:15:31 +02:00
Kamil Braun
bb477b9bb4 storage_service: simplify CDC generation management during node replace
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.

But that's not necessary:
1. if this is the timestamp of the last (current) generation, we would
   obtain it from other nodes anyway (every node gossips the last known
   timestamp),
2. if this is the timestamp of an earlier generation, we would forget
   it immediately and start gossiping the last timestamp (obtained from
   other nodes).

This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
2021-04-06 13:15:31 +02:00
Takuya ASADA
e991e01f2e nonroot: generate scylla_sysconfdir.py correctly
We have a scripting bug: when /var/log/journal exists, install.sh does not generate scylla_sysconfdir.py.
Stop generating scylla_sysconfdir.py in the if/else condition and do it
unconditionally in install.sh; also drop the pre-generated
scylla_sysconfdir.py from dist/common/scripts.

Also, $rsysconfdir, not $sysconfdir, is the correct path for the nonroot mode
sysconfdir.

Fixes #8385

Closes #8386
2021-04-05 15:31:12 +03:00
Avi Kivity
56cd058b34 config: correct description of listen_address
 - it does not support using interface names
 - listen_interface is not supported
 - 0.0.0.0 will work (and is reasonable) if you set broadcast_address
 - empty setting is not supported

Fixes #8381.

Closes #8409
2021-04-05 14:06:48 +03:00
Gleb Natapov
cd24dfc7e5 storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC, the current code will crash since the
'targets' array will be empty and the read executor does not handle it.
Fix it by replying with an empty result.

Fixes #8354

Message-Id: <YGro+l2En3fF80CO@scylladb.com>
2021-04-05 14:04:58 +03:00
Avi Kivity
b2f0a9d05c caching_options.hh: move code to .cc
caching_options is by no means performance sensitive, but it is
included in many places (via schema.hh), and in turn it pulls in
other includes. Reduce include load by deinlining it.

Ref #1.

Closes #8408
2021-04-05 13:05:43 +03:00
Avi Kivity
a9835ec128 caching_options: detemplate from_map()
I wanted to move caching_option.hh's content to .cc (so that it doesn't
pull in rjson.hh everywhere), but for that, we must first make from_map()
a non-template function.

Luckily, it is only called with one parameter type, so just substitute
that type for the template parameter.

Closes #8406
2021-04-04 21:29:25 +03:00
Avi Kivity
832117c6d9 types: convert has_empty predicate to a concept
has_empty is a textbook example of a concept: it checks whether a type
has an empty() method that returns bool. It is currently implemented with
enable_if; simplify it to a concept.

I verified that the debug build doesn't contain any incorrect
emtpyable<T> (e.g. for strings).

Closes #8404
2021-04-04 21:24:05 +03:00
Michał Chojnowski
f23a47e365 utils: fragment_range: fix FragmentedView utils for views with empty fragments
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
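
The failure mode can be illustrated with a small Python analogue of a fragment-by-fragment copy. This is not the C++ FragmentedView code; the function is hypothetical and only shows why empty fragments must be skipped explicitly.

```python
def copy_fragmented(src_fragments, dst):
    """Copy bytes from a list of fragments into the bytearray `dst`.

    A loop that only advances after copying at least one byte would spin
    forever on an empty fragment; here empty fragments are skipped first.
    """
    frags = iter(src_fragments)
    current = b""
    pos = 0
    while pos < len(dst):
        # Skip empty fragments instead of copying zero bytes repeatedly.
        while not current:
            current = next(frags)  # StopIteration if the source is too short
        n = min(len(current), len(dst) - pos)
        dst[pos:pos + n] = current[:n]
        pos += n
        current = current[n:]
    return dst
```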

Fixes #8398.

Closes #8397
2021-04-04 15:31:51 +03:00
Avi Kivity
82c76832df treewide: don't include "db/system_distributed_keyspace.hh" from headers
This just causes unneeded and slower recompilations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.

Ref #1.

Closes #8403
2021-04-04 14:00:26 +03:00
Avi Kivity
9853e07821 composite: replace enable_if with constraints
Easier to read.

Closes #8399
2021-04-04 13:56:51 +03:00
Kamil Braun
641040d465 sys_dist_ks: remove dead code (expire_cdc_* functions)
These functions were not used anywhere but had to be maintained anyway.
When (if) the expiration algorithm actually gets implemented (see issue #7300),
the functions can be added back (perhaps they will need to look differently
at that time, and it's likely that the `expire` column won't be used in the
expiration algorithm in the end anyway).
2021-04-04 13:12:12 +03:00
Kamil Braun
4f3f245188 sys_dist_ks: coroutinize system_distributed_keyspace::start 2021-04-04 13:10:44 +03:00
Avi Kivity
40b60e8f09 Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node was changed to postpone
responding to gossip echo messages to avoid other nodes sending read
requests to the replacing node. It works as follows:

1) replacing node does not respond to echo messages, to prevent other nodes
   from marking the replacing node as alive

2) replacing node advertises hibernate state so other nodes know the
   replacing node is replacing

3) replacing node responds to echo messages so other nodes can mark the
   replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
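
The numbers in the log above can be read as a simple availability check. This is a sketch under an assumption inferred from that log line, namely that each pending endpoint raises the number of live nodes a write needs by one; `write_is_available` is a hypothetical name:

```cpp
#include <cstddef>

// Assumption (inferred from the log above, not taken from the source code):
// a write needs the consistency level's base count plus one extra live node
// per pending endpoint.
bool write_is_available(std::size_t required, std::size_t pending, std::size_t live) {
    return live >= required + pending;
}
```

With 2 required, 1 pending and only 2 nodes seen as live, the check fails; once the replacing node is also marked alive (3 live), it passes.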

To solve this problem, we can do the replacing operation in multiple stages.

One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416

1) replacing node does not respond echo message

2) replacing node advertises prepare_replace state (Remove replacing
   node from natural endpoint, but do not put in pending list yet)

3) replacing node responds echo message

4) replacing node advertises hibernate state (Put replacing node in
   pending list)

Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can implement the multiple stages without introducing a new
gossip status state.

This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.

Improvements:

1) It solves the race between marking replacing node alive and sending
   writes to replacing node

2) The cluster reverts to the state before the replace operation
   automatically in case of error. As a result, it solves the issue
   where, if the replacing node fails in the middle of the operation,
   the replacing node stays in HIBERNATE status forever.

3) The gossip status of the node to be replaced is not changed until the
   replace operation is successful. HIBERNATE gossip status is not used
   anymore.

4) Users can now pass a list of dead nodes to ignore explicitly.

Fixes #8013

Closes #8330

* github.com:scylladb/scylla:
  repair: Switch to use NODE_OPS_CMD for replace operation
  gossip: Add advertise_to_nodes
  gossip: Add helper to wait for a node to be up
  gossip: Add is_normal_ring_member helper
2021-04-04 12:54:09 +03:00
Gleb Natapov
10781037f5 raft: test: add test that leader behaves as expected when it gets unexpected messages 2021-04-04 11:33:35 +03:00
Gleb Natapov
28add88a1f raft: do not assert when receiving unexpected messages in a leader state
The current code asserts when it gets InstallSnapshot/AppendRequest in the
leader state and the term in the message is equal to the current term. It is true that
such messages cannot be received if the protocol works correctly, but we
should not crash on a network input nonetheless.
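
A hedged sketch of the idea, not the actual raft code: the leader classifies the incoming message by term and treats the "impossible" same-term case as something to ignore rather than assert on. The enum and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical outcome of a leader receiving AppendRequest/InstallSnapshot.
enum class leader_action { step_down, reject_stale, ignore_unexpected };

// Illustrative decision function; the names are not taken from the source.
leader_action on_leader_receives(std::uint64_t current_term, std::uint64_t msg_term) {
    if (msg_term > current_term) {
        // A newer term exists: the leader must become a follower.
        return leader_action::step_down;
    }
    if (msg_term < current_term) {
        // Stale sender: reject so it can learn the newer term.
        return leader_action::reject_stale;
    }
    // Same term: a correct peer never sends this, but network input
    // must not be able to crash the process, so no assert here.
    return leader_action::ignore_unexpected;
}
```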
2021-04-04 11:33:35 +03:00
Gleb Natapov
995cd1c8a7 raft: use existing function to check if election timeout elapsed
is_past_election_timeout() repeats the calculation that
election_elapsed() is doing. Use existing function instead.
2021-04-04 11:33:35 +03:00
Piotr Dulikowski
f186de909d storage_service/removenode: update gossiper state before excise
In `storage_service::removenode`, in "Step 5", services which implement
`endpoint_lifecycle_subscriber` are first notified about the node
leaving the cluster, and only after that the gossiper state is updated
(comments added by me):

    // This function indirectly notifies subscribers
    ss.excise(std::move(tmp), endpoint);
    // This function updates the gossiper state
    ss._gossiper.advertise_token_removed(endpoint, host_id).get();

This order is confusing for those subscribers which expect the fact
that the node is leaving to be reflected in the gossiper state - more
specifically, for hints manager.

The hints manager has a function `can_send()` which determines if it is
OK for it to try to send hints. More specifically, it looks at the
gossiper state to see if the destination node is ALIVE or if it has
left the ring. The first case is obvious as the destination node will
be able to receive the hints as writes, while the other means that the
hints will be sent with CL=ALL to its new replicas.

When a node leaves the cluster, all hint queues either to or from that
node enter the "drain" mode - the queue will attempt to send out all
hints and will drop those hints which failed to be sent. This mode is
triggered by a notification from the storage_service (hints manager is
a lifecycle subscriber).

The core drain logic for a queue looks as follows:

    manager_logger.trace("Draining for {}: start", end_point_key());
    set_draining();
    send_hints_maybe();
    _ep_manager.flush_current_hints().handle_exception([] (auto e) {
        manager_logger.error("Failed to flush pending hints: {}. Ignoring...", e);
    }).get();
    send_hints_maybe();
    manager_logger.trace("Draining for {}: end", end_point_key());

And `send_hints_maybe` contains the following loop:

    while (replay_allowed() && have_segments() && can_send()) {
        if (!send_one_file(*_segments_to_replay.begin())) {
            break;
        }
        _segments_to_replay.pop_front();
        ++replayed_segments_count;
    }

Coming back to the `storage_service::removenode` - because of the order
of `excise` and `advertise_token_removed`, draining starts before the
node which is being removed is removed from gossiper state. In turn,
it might happen that the drain logic calls `send_hints_maybe` twice and
does not send any hints - the loop in that function will immediately
stop because `can_send()` is false because the gossiper state still
reports that the target node is not alive. The logic expects `can_send`
to be true here because the node has left the ring.

This patch changes the order of `excise` and `advertise_token_removed`
in `storage_service::removenode` - now, the first one is called after
the other. This ensures that the gossiper state is updated before
listeners are called, and the race described in the commit message does
not happen anymore - `can_send` is true when the node is being drained.
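
The effect of the ordering can be reproduced with a toy model. Everything here is an assumption for illustration (`toy_gossiper`, `toy_hint_queue` and the simplified `can_send` are not the real classes); it only mirrors the drain loop quoted above:

```cpp
// Toy gossiper state for the node being removed.
struct toy_gossiper {
    bool alive = false;      // the removed node is down
    bool left_ring = false;  // set by advertise_token_removed() in this model
};

// Toy hint queue with the same shape as the drain logic quoted above.
struct toy_hint_queue {
    const toy_gossiper& gossiper;
    int pending_hints;
    int sent = 0;
    int dropped = 0;

    bool can_send() const { return gossiper.alive || gossiper.left_ring; }

    // Drain: try to send everything, then drop whatever could not be sent.
    void drain() {
        while (pending_hints > 0 && can_send()) {
            --pending_hints;
            ++sent;
        }
        dropped += pending_hints;
        pending_hints = 0;
    }
};

// Returns how many hints were sent during drain for the given ordering.
int removenode(bool update_gossiper_first, int hints) {
    toy_gossiper g;
    toy_hint_queue q{g, hints};
    if (update_gossiper_first) {
        g.left_ring = true;  // advertise_token_removed
        q.drain();           // excise -> subscribers notified
    } else {
        q.drain();           // old order: drain sees stale gossiper state
        g.left_ring = true;
    }
    return q.sent;
}
```

In this model the old order sends nothing and drops every hint, while the new order sends all of them.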

The race described here was exposed by the following commit:
77a0f1a153

Fixes: #5087

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)

Closes #8284
2021-04-02 11:05:16 +02:00
Avi Kivity
fb890889cc version: prepare for the 4.6 cycle 2021-04-01 20:40:52 +03:00
Avi Kivity
eeaceb4bff Update seastar submodule
* seastar 398f1c3274...fcd46c1387 (1):
  > cmake: tighten check for -fstack-clash-protection
2021-04-01 18:49:16 +03:00
Wojciech Mitros
201b86b042 primitive_consumer: keep fragments of parsed buffer in a small_vector
When we want to parse a linearized buffer of bytes, we're copying them
into the first and only element of the _read_bytes vector. Thus
_read_bytes often contains only one element, which makes a small_vector
a better alternative.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-04-01 16:05:52 +02:00
Tomasz Grabiec
307bd354d2 Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.

Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)

Closes #8387

* github.com:scylladb/scylla:
  hints: clarify docstring comment for can_send
  hints: use token_metadata to tell if node is in the ring
  hints: slightly reorganize "if" statement in can_send
  storage_service: release token_metadata lock before notify_left
  storage_service: notify_left after token_metadata is replicated
2021-04-01 15:51:46 +02:00
Avi Kivity
e45466ed07 Update seastar submodule
* seastar 72e3baed9c...398f1c3274 (4):
  > coroutine: Remove return_value for future<void>
  > tls: preserve exact error state so repeated calls generate same message
Fixes #8391.
  > Add deferred_close and deferred_stop
  > httpd: add status_types 406, 415, 422
2021-04-01 16:41:59 +03:00
Wojciech Mitros
599cfe586f sstables: add parsing of cell values into fragmented buffers
The entire sstable cell value is currently stored in a single
temporary_buffer. Cells may be very large, so to avoid large
contiguous allocations, the buffer is changed to
a fragmented_temporary_buffer.
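
To illustrate why fragmentation helps, here is a small sketch that stores a value as chunks of bounded size, so no single allocation grows with the cell size. `fragment_value` and `largest_fragment` are hypothetical helpers and do not reflect the fragmented_temporary_buffer interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits `value` into fragments of at most
// `max_fragment` bytes each.
std::vector<std::string> fragment_value(const std::string& value, std::size_t max_fragment) {
    if (max_fragment == 0) {
        throw std::invalid_argument("max_fragment must be positive");
    }
    std::vector<std::string> out;
    for (std::size_t pos = 0; pos < value.size(); pos += max_fragment) {
        out.push_back(value.substr(pos, max_fragment));
    }
    return out;
}

// Size of the largest single fragment, i.e. the largest contiguous
// allocation this representation needs for the payload.
std::size_t largest_fragment(const std::vector<std::string>& frags) {
    std::size_t largest = 0;
    for (const auto& f : frags) {
        largest = std::max(largest, f.size());
    }
    return largest;
}
```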

Fixes #7457
Fixes #6376

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-04-01 15:36:58 +02:00
Avi Kivity
bfed9c15d5 Update tools/java submodule
* tools/java fb21784b91...5756445ec7 (1):
  > sstableloader: fix handling of rewritten partition

Ref scylladb/scylla-tools-java#238.
2021-04-01 16:07:46 +03:00
Avi Kivity
4739df2cb1 Merge 'cql3: remove linearizations in the write path' from Michał Chojnowski
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately a straightforward
rewrite to a fragmented buffer type is not possible, because we want
cql3::raw_value to be compatible with cql3::raw_value_view, and we want that
view to be based on fragmented_temporary_buffer::view, so that it can be
used to view data coming directly from seastar without copying.

This patch makes cql3::raw_value fragmented by making cql3::raw_value_view
a `variant` of managed_bytes_view and fragmented_temporary_buffer::view.

Code users which depended on `cql3::raw_value` being `bytes`,
and cql3::raw_value_view being `fragmented_temporary_buffer::view` underneath
were adjusted to the new, dual representation, mainly through the
`cql3::raw_value_view::with_value` visitor and deserialization/validation
helpers added to `cql3::raw_value_view`.
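
The dual representation can be sketched with `std::variant` and a visitor. The types below (`wire_view`, `stored_view`, `value_view`) are hypothetical stand-ins for the two backing views and for `cql3::raw_value_view`; only the `with_value` name comes from the description above, and its real signature may differ.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the two backing view types.
struct wire_view { std::vector<std::string_view> fragments; };    // data from the wire
struct stored_view { std::vector<std::string_view> fragments; };  // data kept in memory

class value_view {
    std::variant<wire_view, stored_view> _data;
public:
    value_view(wire_view v) : _data(std::move(v)) {}
    value_view(stored_view v) : _data(std::move(v)) {}

    // Calls `f` with whichever alternative is stored, so callers never
    // need to know which representation is underneath.
    template <typename Func>
    decltype(auto) with_value(Func&& f) const {
        return std::visit(std::forward<Func>(f), _data);
    }
};

// Example user that works for both representations.
std::size_t size_bytes(const value_view& v) {
    return v.with_value([] (const auto& view) {
        std::size_t n = 0;
        for (std::string_view frag : view.fragments) {
            n += frag.size();
        }
        return n;
    });
}
```

Callers written against the visitor keep working if the set of alternatives changes, which is the point of the internals-independent API.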

The second part of this series gets rid of linearizations occurring when processing
compound types in the CQL layer. This is achieved by storing their elements in
`managed_bytes` instead of `bytes` in the partially deserialized form (`lists::value`,
`tuples::value`, etc.), outputting `managed_bytes` instead of `bytes` in functions
which go from the partially deserialized form to the atomic cell format (for frozen
types), and avoiding calling deserialize/serialize on individual elements when
it's not necessary. (It's only necessary for CQLv2, because since CQLv3 the format
on the wire is the same as our internal one).

The above also forces some changes to `expression.cc`, and `restrictions`, mainly because
`IN` clauses store their arguments as `lists` and `tuples`, and the code which handled
this clause expected `bytes`.

After this series, the path from prepared CQL statements to `atomic_cell_or_collection`
is almost completely linearization-free. The last remaining place is `collection_mutation_description`,
where map keys are linearized to `bytes`.

Closes #8160

* github.com:scylladb/scylla:
  cql3: update_parameters: remove unused version of make_cell for bytes_view
  types: collection: remove an unused version of pack_fragmented
  cql3: optimize the deserialization of collections
  cql3: maps, sets: switch the element type from bytes to managed_bytes
  cql3: expression: use managed_bytes instead of bytes where possible
  cql3: expr: expression: make the argument of to_range a forwarding reference
  cql3: don't linearize elements of lists, tuples, and user types
  cql3: values: add const managed_bytes& constructor to raw_value_view
  cql3: output managed_bytes instead of bytes in get_with_protocol_version
  types: collection: add versions of pack for fragmented buffers
  types: add write_collection_{value,size} for managed_bytes_mutable_view
  cql3: tuples, user_types: avoid linearization in from_serialized() and get()
  types: tuple: add build_value_fragmented
  cql3: update_parameters: add make_cell version for managed_bytes_view
  cql3: remove operation::make_*cell
  cql3: values: make raw_value fragmented
  cql3: values: remove raw_value_view::operator==
  cql3: switch users of cql3::raw_value_view to internals-independent API
  cql3: values: add an internals-independent API to raw_value_view
  utils: managed_bytes: add a managed_bytes constructor from FragmentedView
  utils: managed_bytes: add operator<< and to_hex for managed_bytes
  utils: fragment_range: add to_hex
  configure: remove unused link dependencies from UUID_test
2021-04-01 15:21:32 +03:00
Takuya ASADA
3af31eebeb scylla_setup: stop hardcoding product name in scylla_setup
Stop hardcoding the product name in scylla_setup; dynamically generate
scylla_product.py in install.sh.

Fixes #8367

Closes #8384
2021-04-01 15:07:58 +03:00
Avi Kivity
ecc5b57183 Merge "reader_concurrency_semaphore: refactor do_wait_admission() to facilitate changes to admission conditions" from Botond
"
This small patchset restructures the do_wait_admission() to facilitate
future changes to the wait/admission conditions. The changes we want to
facilitate are Benny's flat mutation reader close series and my stalled
readers series. As an added benefit the code is more readable and a
small theoretical corner-case bug is fixed. No logical changes (besides
the small bug-fix).

Tests: unit(dev)
"

* 'reader-concurrency-semaphore-refactor/v1' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: remove now unused may_proceed()
  reader_concurrency_semaphore: restructure do_wait_admission()
  reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter()
  reader_concurrency_semaphore: make admission conditions consistent
2021-04-01 13:50:32 +03:00
Raphael S. Carvalho
bb9a109c1a distributed_loader: inform which table is being resharded
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210330163956.60585-1-raphaelsc@scylladb.com>
2021-04-01 13:08:59 +03:00
Pavel Emelyanov
8bbe2eae5e btree: Convert comparator to <=>
It turned out that all the users of btree can already be converted
to use safer std::strong_ordering. The only meaningful change here
is the btree code itself -- no more ints there.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
2021-04-01 12:56:08 +03:00
Pavel Emelyanov
ccc1f24097 row_cache: Remove mentionings of cache_streamed_mutation
This class was replaced by cache_flat_mutation_reader long ago
and doesn't exist.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153942.27222-1-xemul@scylladb.com>
2021-04-01 12:54:45 +03:00
Michał Chojnowski
c555e84a77 cql3: update_parameters: remove unused version of make_cell for bytes_view
It became unused after previous patches in this series changed the
representation of collections in cql3 from bytes_view to managed_bytes_view.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
472f0eb932 types: collection: remove an unused version of pack_fragmented
It was made unused by previous patches in this series.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
458878a414 cql3: optimize the deserialization of collections
Before this patch, deserializing a collection from a (prepared) CQL request
involved deserializing every element and serializing it again. Originally this
was a hacky method of validation, and it was also needed to reserialize nested
frozen collections from the CQLv2 format to the CQLv3 format.

But since then we started doing validation separately (before calls to
from_serialized) and CQLv2 became irrelevant, making reserialization of
elements (which, among other things, involves a memory allocation for every
element) pure waste.

This patch adds a faster path for collections in the v3 format, which does not
involve linearizing or reserializing the elements (since v3 is the same as
our internal format).

After this patch, the path from prepared CQL statements to
atomic_cell_or_collection is almost completely linearization-free. The last
remaining place is collection_mutation_description, where map keys are
linearized.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
a0f12b8d63 cql3: maps, sets: switch the element type from bytes to managed_bytes 2021-04-01 10:44:21 +02:00
Michał Chojnowski
979666075f cql3: expression: use managed_bytes instead of bytes where possible 2021-04-01 10:44:21 +02:00
Michał Chojnowski
6e7e795dfd cql3: expr: expression: make the argument of to_range a forwarding reference
Make to_range able to handle rvalues. We will pass managed_bytes&& to it
in the next patch to avoid pointless copying.
The public declaration of to_range is changed to a concrete function to avoid
having to explicitly instantiate to_range for all possible reference types of
clustering_key_prefix.
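
A small sketch of what a forwarding reference buys, assuming a copy-counting type: lvalues are copied, rvalues are moved. `counted` and `to_singleton` are hypothetical names and are unrelated to the real `to_range` signature.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical type that counts how many times it is copied.
struct counted {
    static inline int copies = 0;
    std::string payload;
    explicit counted(std::string p) : payload(std::move(p)) {}
    counted(const counted& o) : payload(o.payload) { ++copies; }
    counted(counted&& o) noexcept : payload(std::move(o.payload)) {}
};

// `T&&` on a deduced template parameter is a forwarding reference:
// it binds to both lvalues and rvalues, and std::forward preserves
// the value category, so rvalues are moved instead of copied.
template <typename T>
std::vector<counted> to_singleton(T&& value) {
    std::vector<counted> out;
    out.emplace_back(std::forward<T>(value));
    return out;
}
```

Passing an lvalue costs one copy; passing `std::move(x)` or a temporary costs none.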
2021-04-01 10:44:21 +02:00
Michał Chojnowski
0bb959e890 cql3: don't linearize elements of lists, tuples, and user types
This patch switches the type used to store collection elements inside the
intermediate form used in lists::value, tuples::value etc. from bytes
to managed_bytes. After this patch, tuple and list elements are only linearized
in from_serialized, which will be corrected soon.
This commit introduces some additional copies in expression.cc, which
will be dealt with in a future commit.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
fa2749c2a0 cql3: values: add const managed_bytes& constructor to raw_value_view
Will be used in the next patch. Separated for clarity.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
8927aaf225 cql3: output managed_bytes instead of bytes in get_with_protocol_version 2021-04-01 10:44:21 +02:00
Michał Chojnowski
aab9509775 types: collection: add versions of pack for fragmented buffers
We will need them to port the representation of collection types
in cql3/ from bytes to managed_bytes.
The version which takes an iterator of `bytes` as an argument will
be removed after that transition is complete.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
e9c05582a4 types: add write_collection_{value,size} for managed_bytes_mutable_view
We will use them to avoid linearization when going from the intermediate
std::vector<bytes> form in cql3/ to the atomic_cell format, by outputting
managed_bytes instead of bytes in get_with_protocol_version.
2021-04-01 10:44:21 +02:00
Michał Chojnowski
3387d43a34 cql3: tuples, user_types: avoid linearization in from_serialized() and get()
Deserialize from raw_value_view without linearizing and output managed_bytes
instead of bytes.
2021-04-01 10:44:20 +02:00
Michał Chojnowski
a10a82da30 types: tuple: add build_value_fragmented
A version of build_value which produces fragmented output.
We will use it to avoid linearization in tuples::value and user_types::value.
2021-04-01 10:42:07 +02:00
Michał Chojnowski
9777026e71 cql3: update_parameters: add make_cell version for managed_bytes_view
We will use it to port the representation of collections in cql3/
from bytes to managed_bytes.
The duplicate version for bytes_view will be removed after that transition
is complete.
2021-04-01 10:42:07 +02:00
Michał Chojnowski
c2c6b2abfa cql3: remove operation::make_*cell
The operation::make_*cell functions are useless aliases to methods of
update_parameters, and are used interchangeably with them throughout the code.
Remove them.

Also, remove the now-unused update_parameters::make_cell version for
fragmented_temporary_buffer::view.
2021-04-01 10:42:07 +02:00
Michał Chojnowski
463ec1b082 cql3: values: make raw_value fragmented
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately the change involves
some nontrivial work, because raw_value must be viewable with raw_value_view,
and raw_value_view must accommodate both raw_value (that's where we store
values in prepared queries) and fragmented_temporary_buffer::view
(because that's the type of values coming from the wire).

This patch makes raw_value fragmented, by changing the backing type from
bytes to managed_bytes. raw_value_view is modified accordingly by changing
the backing type from fragmented_temporary_buffer::view to a variant of
fragmented_temporary_buffer::view and managed_bytes_view.

We have prepared the users of raw_value{_view} for this change in preceding
commits.
2021-04-01 10:42:07 +02:00
Michał Chojnowski
5984d6b2ce cql3: values: remove raw_value_view::operator==
It's only used in a single test, and there is no reason why it should ever
be used anywhere else. So let's remove it from the public header and move
it to that test.
2021-04-01 10:42:07 +02:00
Michał Chojnowski
b9322a6b71 cql3: switch users of cql3::raw_value_view to internals-independent API
We want to change the internals of cql3::raw_value{_view}.
However, users of cql3::raw_value and cql3::raw_value_view often
use them by extracting the internal representation, which will be different
after the planned change.

This commit prepares us for the change by making all accesses to the value
inside cql3::raw_value{_view} be done through helper methods which don't expose
the internal representation publicly.

After this commit we are free to change the internal representation of
raw_value{_view} without messing up their users.
2021-04-01 10:42:04 +02:00
Michał Chojnowski
b3167ac0a6 cql3: values: add an internals-independent API to raw_value_view
Currently, raw_value_view is backed by a fragmented_temporary_buffer::view,
and many users of this type use it by extracting that internal representation.
However, we want to change raw_value_view so that it can be created both
from fragmented_temporary_buffer and from managed_bytes, so that we can switch
the internals of raw_value from bytes to managed_bytes. To do that we need
to prepare all users for that more general representation.

This commit adds an API which allows using raw_value_view without accessing its
internal representation. In the next commits of this series we will switch all
callers who currently depend on that representation to the new API,
and then we will remove the old accessors and change the internals.
2021-04-01 10:39:42 +02:00
Michał Chojnowski
45e0ef26d3 utils: managed_bytes: add a managed_bytes constructor from FragmentedView
Just for convenience. We will use it in an upcoming patch where we switch
the inner representation of cql3::raw_value from bytes to managed_bytes, and we
will want to construct managed_bytes from fragmented_temporary_buffer::view.
2021-04-01 10:39:42 +02:00
Michał Chojnowski
4715268e30 utils: managed_bytes: add operator<< and to_hex for managed_bytes
We will need them to replace bytes with managed_bytes in some places in an
upcoming patch.

The change to configure.py is necessary because operator<< links to to_hex
in bytes.cc.
2021-04-01 10:39:42 +02:00
Michał Chojnowski
14c4639994 utils: fragment_range: add to_hex 2021-04-01 10:39:42 +02:00
Michał Chojnowski
b6740a01ac configure: remove unused link dependencies from UUID_test 2021-04-01 10:39:42 +02:00
Piotr Dulikowski
6a1152ea9b hints: clarify docstring comment for can_send
Now, the docstring comment next to can_send better represents the
condition that is checked inside that function. The statement about
returning true when destination left the NORMAL state is replaced with a
statement about returning true when the destination has left the ring.
2021-04-01 03:58:29 +02:00
Piotr Dulikowski
4f90514247 hints: use token_metadata to tell if node is in the ring
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.
2021-04-01 03:58:29 +02:00
Piotr Dulikowski
e7d9057d0c hints: slightly reorganize "if" statement in can_send
This commit reverses the order of if-else blocks in can_send, which
makes it - in my opinion, at least - slightly easier to read.
2021-04-01 03:58:29 +02:00
Piotr Dulikowski
b7f4f47608 storage_service: release token_metadata lock before notify_left
There is no need to keep holding the token_metadata lock after metadata
was successfully updated on all shards.
2021-04-01 03:58:21 +02:00
Asias He
323f72e48a repair: Switch to use NODE_OPS_CMD for replace operation
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as follows:

1) replacing node does not respond to echo messages, to prevent other
   nodes from marking the replacing node as alive

2) replacing node advertises hibernate state so other nodes know the
   replacing node is replacing

3) replacing node responds to echo messages so other nodes can mark the
   replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem, we can do the replacing operation in multiple stages.

One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416

1) replacing node does not respond echo message

2) replacing node advertises prepare_replace state (Remove replacing
   node from natural endpoint, but do not put in pending list yet)

3) replacing node responds echo message

4) replacing node advertises hibernate state (Put replacing node in
   pending list)

Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can implement the multiple stages without introducing a new
gossip status state.

This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.

Improvements:

1) It solves the race between marking replacing node alive and sending
   writes to replacing node

2) The cluster reverts to the state before the replace operation
   automatically in case of error. As a result, it solves the issue
   where, if the replacing node fails in the middle of the operation,
   the replacing node stays in HIBERNATE status forever.

3) The gossip status of the node to be replaced is not changed until the
   replace operation is successful. HIBERNATE gossip status is not used
   anymore.

4) Users can now pass a list of dead nodes to ignore explicitly.
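
The automatic revert in improvement 2 can be pictured as a staged operation with rollback. This is a generic sketch under stated assumptions, not the NODE_OPS_CMD implementation: the `stage` struct, `run_staged` and any stage names a caller passes in are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one stage of a multi-stage operation.
struct stage {
    std::string name;
    std::function<bool()> apply;   // returns false on failure
    std::function<void()> revert;  // undoes a successful apply
};

// Runs the stages in order. On the first failure, reverts the stages
// that already completed, in reverse order, and reports failure.
bool run_staged(const std::vector<stage>& stages) {
    std::size_t done = 0;
    for (; done < stages.size(); ++done) {
        if (!stages[done].apply()) {
            break;
        }
    }
    if (done == stages.size()) {
        return true;
    }
    while (done > 0) {
        stages[--done].revert();
    }
    return false;
}
```

Because only completed stages are reverted, a failure midway leaves no stage half-applied in this model, which is the property that avoids a node stuck in an intermediate status.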

Refs #8013
2021-04-01 09:38:54 +08:00
Asias He
bdb95233e8 gossip: Add advertise_to_nodes
gossiper::advertise_to_nodes() is added to allow responding to gossip echo
messages with specified nodes and the current gossip generation number
for the nodes.

This helps avoid a restarted node being marked as alive during
a pending replace operation.

After this patch, when a node sends an echo message, the gossip
generation number is sent in the echo message. Since the generation
number changes after a restart, the receiver of the echo message can
compare the generation number to tell if the node has restarted.
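
One way to read the generation check, as a sketch only: the receiver keeps the generation it expects for each node it is willing to answer, and responds to an echo only when both the sender and its generation match. `echo_filter` and `should_respond` are hypothetical names, and the real gossiper logic may differ.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: nodes we advertise ourselves to, keyed by address,
// with the gossip generation each had when the operation started.
struct echo_filter {
    std::map<std::string, std::int64_t> advertised;

    // Respond only to known senders whose generation is unchanged.
    // A different generation means the sender restarted in the meantime.
    bool should_respond(const std::string& sender, std::int64_t sender_generation) const {
        auto it = advertised.find(sender);
        return it != advertised.end() && it->second == sender_generation;
    }
};
```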

Refs #8013
2021-04-01 09:38:54 +08:00
Asias He
f690f3ee8e gossip: Add helper to wait for a node to be up
This patch adds gossiper::wait_alive helper to wait for nodes to be up
on all shards.

Refs #8013
2021-04-01 09:38:54 +08:00
Asias He
4f5676630e gossip: Add is_normal_ring_member helper
Check if a node is in NORMAL or SHUTDOWN status which means the node is part of
the token ring from the gossip point of view and operates in normal status or
was in normal status but is shutdown.

Refs #8013
2021-04-01 09:38:54 +08:00
Piotr Dulikowski
ca65f012b0 storage_service: notify_left after token_metadata is replicated
Previously, at the end of the removenode operation, endpoint lifecycle
subscribers are informed about the node being removed (the
"on_leave_cluster" method) before the token_metadata is updated to
reflect the fact that a node was removed.

Although no subscriber currently depends on token_metadata being
up-to-date when "on_leave_cluster" is called, the hints manager will
become sensitive to this in a later commit in this series.

This commit gets rid of the future problem by notifying subscribers
later, only after token_metadata is fully updated and replicated to
all shards.
2021-04-01 02:13:27 +02:00
Avi Kivity
bbec43f9a1 Update tools/java submodule
* tools/java ccc4201ded...fb21784b91 (2):
  > fix: Add dummy implementation of getToppartitions
  > nodetool: Make toppartitions call the generic endpoint

Fixes #4520.
2021-03-31 17:38:03 +03:00
Wojciech Mitros
b1b5bda848 sstables: add non-contiguous parsing of byte strings to the primitive_consumer
Currently, the primitive_consumer parses all values in contiguous buffers.
A string of bytes may be very long, so parsing it in a single buffer
can cause a big allocation. This patch allows parsing into
fragmented_temporary_buffers instead of temporary_buffers.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 12:09:52 +02:00
Wojciech Mitros
3f529b2860 utils: add ostream operator<<() for fragmented_temporary_buffer::view
We are going to store sstable cells' values in fragmented_temporary_buffers.
This patch will allow checking these values with loggers.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 12:09:52 +02:00
Wojciech Mitros
74a0c158c5 compound_type: extend serialize_value for all FragmentedView types
compound_type::serialize_value is currently implemented for arguments
of type 'bytes_view', 'managed_bytes', or 'managed_bytes_view'.
We will want to use it for a fragmented_temporary_buffer::view, so we
extend it for all FragmentedView types.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 12:09:52 +02:00
Pavel Emelyanov
887a1b0d3d tracing: Stop tracing in main's deferred action
Tracing is created in two steps and is destroyed in two too.
The 2nd step doesn't have the corresponding stop part, so here
it is -- defer tracing stop after it was started.

Keep in mind that tracing is also shut down on drain, so the stop
path must handle being called after that.
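
A minimal sketch of the pattern, assuming a toy service: the stop is registered as a deferred action right after start, and `stop()` is idempotent so an earlier shutdown on drain is harmless. `toy_tracing`, `deferred` and `run` are hypothetical names, not the actual main.cc code.

```cpp
#include <utility>

// Toy service with an idempotent stop().
struct toy_tracing {
    bool started = false;
    int effective_stops = 0;

    void start() { started = true; }

    void stop() {
        if (!started) {
            return;  // already stopped (e.g. by drain): nothing to do
        }
        started = false;
        ++effective_stops;
    }
};

// Runs a callable when the scope ends, in the spirit of a deferred action.
template <typename Func>
class deferred {
    Func _f;
public:
    explicit deferred(Func f) : _f(std::move(f)) {}
    ~deferred() { _f(); }
    deferred(const deferred&) = delete;
    deferred& operator=(const deferred&) = delete;
};

// Returns how many times stop() actually did work.
int run(bool drain_before_exit) {
    toy_tracing t;
    {
        t.start();
        deferred stop_tracing([&t] { t.stop(); });
        if (drain_before_exit) {
            t.stop();  // drain shuts tracing down early
        }
    }  // deferred action fires here
    return t.effective_stops;
}
```

Either way the service is stopped exactly once.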

Fixes #8382
tests: unit(dev), manual(start-stop, aborted-start)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210331092221.1602-1-xemul@scylladb.com>
2021-03-31 12:28:37 +03:00
Piotr Jastrzebski
57c7964d6c config: ignore enable_sstables_mc_format flag
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that have been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.

Test: unit(dev)

Refs #8352

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8360
2021-03-31 12:23:59 +03:00
Avi Kivity
f9244734f9 Update seastar submodule
* seastar 48376c76a...72e3baed9 (3):
  > file: Add RFW_NOWAIT detection case for AuFS
  > sharded: provide type info on no sharded instance exception
  > iotune: Estimate accuarcy of measurement

Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
2021-03-31 10:40:04 +03:00
Avi Kivity
de10a74a84 Merge 'types: remove linearization from abstract_type::compare' from Wojciech Mitros
This patch is another series on removing big allocations from scylla.
The buffers in `compare_visitor` were replaced with `managed_bytes_view`; a similar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well.

Tests:unit(dev)

Closes #8357

* github.com:scylladb/scylla:
  types: remove linearization from abstract_type::compare
  types: replace buffers in tuple_deserializing_iterator with fragmented ones
  types: make tuple_type_impl::split work with any FragmentedViews
  types: move read_collection_size/value specialization to header file
2021-03-31 08:50:52 +03:00
Wojciech Mitros
f57fa935a2 types: remove linearization from abstract_type::compare
To avoid high latencies caused by large contiguous allocations
needed by linearizing, work on fragmented buffers instead.
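
The idea of comparing without linearizing can be illustrated with a small
Python sketch. It is not the Scylla implementation: the function name
`compare_fragmented` is hypothetical, and plain lexicographic byte order is
assumed, whereas abstract_type::compare is type-specific. The point is only
that two values split into different fragment layouts can be compared chunk
by chunk, without first joining either one into a single contiguous buffer.

```python
from typing import Iterable


def compare_fragmented(a: Iterable[bytes], b: Iterable[bytes]) -> int:
    """Lexicographically compare two fragmented byte strings.

    Returns -1, 0 or 1. Hypothetical helper for illustration only.
    """
    a_iter, b_iter = iter(a), iter(b)
    a_cur = memoryview(b"")
    b_cur = memoryview(b"")
    while True:
        # Refill each side from its next non-empty fragment, if any.
        while len(a_cur) == 0:
            nxt = next(a_iter, None)
            if nxt is None:
                break
            a_cur = memoryview(nxt)
        while len(b_cur) == 0:
            nxt = next(b_iter, None)
            if nxt is None:
                break
            b_cur = memoryview(nxt)
        if len(a_cur) == 0 or len(b_cur) == 0:
            # At least one side is exhausted: the shorter value sorts first.
            return (len(a_cur) > 0) - (len(b_cur) > 0)
        # Compare only the overlapping part of the two current fragments.
        # The temporary copies are bounded by the fragment size.
        n = min(len(a_cur), len(b_cur))
        left, right = bytes(a_cur[:n]), bytes(b_cur[:n])
        if left != right:
            return -1 if left < right else 1
        a_cur, b_cur = a_cur[n:], b_cur[n:]
```

For example, `compare_fragmented([b"ab", b"c"], [b"a", b"bc"])` treats both
arguments as the same value even though their fragment boundaries differ.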

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:10 +02:00
Wojciech Mitros
daa31be37f types: replace buffers in tuple_deserializing_iterator with fragmented ones
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:09 +02:00
Wojciech Mitros
823d4c7529 types: make tuple_type_impl::split work with any FragmentedViews
We may want to store a tuple in a fragmented buffer. To split it
into a vector of optional bytes, tuple_type_impl::split can be used.
To split a contiguous buffer(bytes_view), simply pass
single_fragmented_view(bytes_view).

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:34:37 +02:00
Piotr Sarna
6a2377a233 Merge 'Fast slow query trace doc' from Ivan
Addressed https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234
(write issue: "Tracing: slow query fast mode documentation request")

Adds documentation for the fast slow-query tracing mode to docs/guide/tracing.md.

The patch will be duplicated to scylla-doc after this one is merged.

cc @nyh
cc @vladzcloudius

Closes #8373

* github.com:scylladb/scylla:
  tracing: api: fast mode doc improvement
  tracing: fast slow query tracing mode docs
2021-03-30 17:57:04 +02:00
Botond Dénes
4762b84b44 reader_concurrency_semaphore: remove now unused may_proceed() 2021-03-30 17:54:34 +03:00
Botond Dénes
94c7e619af reader_concurrency_semaphore: restructure do_wait_admission()
Currently the code is structured such that first the conditions required
for admission are checked. The success paths have early returns and if
all of them fail, we fall back to enqueueing the permit.
This patch restructures the code such that the wait conditions are
checked first, and if all of them fail, we fall back to admitting the
permit. This structure allows for easier introduction of additional
wait/admit conditions in the future.
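
The "check wait conditions first, admit as the fallback" structure can be
sketched with a toy model. This is not the actual semaphore code: the class
`ToySemaphore`, the `max_queue_length` field and the two concrete wait
conditions are assumptions made for illustration; only the method names
`do_wait_admission`, `enqueue_waiter` and `has_available_units` come from
this series.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToySemaphore:
    available: int              # free resource units (toy model)
    max_queue_length: int = 2   # hypothetical queue limit
    waiters: deque = field(default_factory=deque)

    def has_available_units(self, units: int) -> bool:
        return self.available >= units

    def enqueue_waiter(self, units: int) -> str:
        # The queue length is only checked when a permit really has to wait.
        if len(self.waiters) >= self.max_queue_length:
            raise RuntimeError("wait queue overloaded")
        self.waiters.append(units)
        return "queued"

    def do_wait_admission(self, units: int) -> str:
        # Wait conditions come first. A new condition is one more `if` here.
        if self.waiters:                          # assumed: keep FIFO order
            return self.enqueue_waiter(units)
        if not self.has_available_units(units):   # assumed: not enough units
            return self.enqueue_waiter(units)
        # No wait condition applied: fall back to admitting the permit.
        self.available -= units
        return "admitted"
```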
2021-03-30 17:51:17 +03:00
Botond Dénes
d1dd55d98f reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter()
Besides making the code more readable, this also enables restructuring
`do_wait_admission()`, without moving too much code around.
As a bonus, queue length is now only checked when the permit actually
has to be enqueued.
2021-03-30 17:49:30 +03:00
Botond Dénes
d90cd6402c reader_concurrency_semaphore: make admission conditions consistent
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which leads to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.
2021-03-30 17:39:57 +03:00
Ivan Prisyazhnyy
778d9217f3 tracing: api: fast mode doc improvement
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Ivan Prisyazhnyy
b3b66fb629 tracing: fast slow query tracing mode docs
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Avi Kivity
d2921b5112 Merge 'Clean up > 2-year-old features' from Piotr Sarna
Following the work started in 253a7640e, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years.

Features picked this time via `git blame`, along with the date of their introduction:

* fe4afb1aa3 (Asias He                  2018-09-05 14:52:10 +0800  109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR";
* ff5e541335 (Calle Wilund              2019-02-05 13:06:07 +0000  110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE";
* fefef7b9eb (Tomasz Grabiec            2019-03-05 19:08:07 +0100  111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC";

Tests: unit(dev)

Closes #8235

* github.com:scylladb/scylla:
  sstables,test: remove variables depending on old features
  gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
  sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
  gms: make TRUNCATION_TABLE feature unconditionally true
  gms: make ROW_LEVEL_REPAIR feature unconditionally true
  repair: stop relying on ROW_LEVEL_REPAIR feature
2021-03-30 16:13:35 +03:00
Calle Wilund
c0666ea89b commitlog: Fix inner loop condition in allocation pre-fill
Fixes #8369

This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.

We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file.  Not intended.

Closes #8370
2021-03-30 12:14:55 +02:00
Avi Kivity
c2866f46b5 test: relax quota for tests on machines with small page size
8a8589038c ("test: increase quota for tests to 6GB") increased
the quota for tests from 2GB to 6GB. I later found that the increased
requirement is related to the page size: Address Sanitizer allocates
at least a page per object, and so if the page size is larger the
memory requirement is also larger.

Make use of this by only increasing the quota if the page size
is greater than 4096 (I've only seen 4096 and 65536 in the wild).
This allows greater parallelism when the page size is small.
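
A minimal sketch of the rule described above, assuming the quota falls back
to the earlier 2GB value on machines with 4096-byte pages. The function name
`quota_for_page_size` is hypothetical and this is not the actual test runner
code.

```python
import os

GIB = 1 << 30


def quota_for_page_size(page_size: int) -> int:
    """Return the per-test memory quota in bytes (illustrative only)."""
    # Larger pages mean more memory per object under Address Sanitizer,
    # so only those machines get the increased quota.
    return 6 * GIB if page_size > 4096 else 2 * GIB


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")  # available on Unix-like systems
    print(f"page size {page_size}: quota {quota_for_page_size(page_size) // GIB} GiB")
```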

Closes #8371
2021-03-30 12:13:42 +02:00
Avi Kivity
8785dd62cb tests: use kernel page cache
Tests are short-lived and use a small amount of data. They
are also often run repeatedly, and the data is deleted immediately
after the test. This is a good scenario for using the kernel page
cache, as it can cache read-only data from test to test, and avoid
spilling write data to disk if it is deleted quickly.

Acknowledge this by using the new --kernel-page-cache option for
tests.

This is expected to help on large machines, where the disk can be
overloaded. Smaller machines with NVMe disks probably will not see
a difference.

Closes #8347
2021-03-30 12:04:55 +02:00
Piotr Sarna
6de2691bbd sstables,test: remove variables depending on old features
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
 - correctly_serialize_non_compound_range_tombstones
 - correctly_serialize_static_compact_in_mc

Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
2021-03-30 09:37:41 +02:00
Piotr Sarna
e42dee6afb gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:37:13 +02:00
Piotr Sarna
28c9af6fa5 sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
The feature bit is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:37:04 +02:00
Piotr Sarna
08c4350968 gms: make TRUNCATION_TABLE feature unconditionally true
It turns out the feature is not currently used.
Historically, the commit which removed the support is
30a700c5b0.
2021-03-30 09:36:45 +02:00
Piotr Sarna
c070178c7e gms: make ROW_LEVEL_REPAIR feature unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:36:11 +02:00
Piotr Sarna
80ebedd242 repair: stop relying on ROW_LEVEL_REPAIR feature
The feature is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:35:40 +02:00
Avi Kivity
c1badc6317 noexcept_traits: convert enable_if to concepts
A little easier to read.

Closes #8329
2021-03-30 09:30:23 +02:00
Avi Kivity
405c4e7af1 serializer: replace enable_if in deserialized_bytes_proxy with constraint
Simpler to read and understand.

Closes #8303
2021-03-30 09:30:06 +02:00
Avi Kivity
7c953f33d5 utils: disk-error-handler: replace enable_if with concepts
Simpler, cleaner. We also replace the deprecated std::result_of_t
with std::invoke_result_t.

Closes #8305
2021-03-30 09:29:46 +02:00
Nadav Har'El
115324f71a Merge 'Add partial admission control to Thrift frontend' from Piotr Sarna
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.

Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.

Refs #4826

Closes #8313

* github.com:scylladb/scylla:
  thrift: add support for max_concurrent_requests_per_shard
  thrift: add metrics for admission control
  thrift: add a counter for in-flight requests
  thrift: add a counter for blocked requests
  thrift: partially add admission control
  service_permit: add a getter for the number of units held
  thrift: coroutinize processing a request
  memory_limiter: add a missing seastarx include
2021-03-29 21:36:50 +03:00
Raphael S. Carvalho
a390f4eb61 sstables: optimize LCS reshape for repair-based operations
LCS reshape is currently inefficient for repair-based operations, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will then be integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files into higher
levels, until the last level is reached, producing bad write amplification.

A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
	log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
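
The level formula above can be sketched in Python. The function name
`ideal_level`, the default fan-out of 10 and the choice to round up are
assumptions for illustration; the message itself only gives the logarithm.
Integer arithmetic is used to avoid floating-point rounding at exact powers
of the fan-out.

```python
def ideal_level(total_size: int, max_sstable_size: int, fan_out: int = 10) -> int:
    """Smallest level L with fan_out**L >= ceil(total_size / max_sstable_size)."""
    if total_size <= 0 or max_sstable_size <= 0 or fan_out < 2:
        raise ValueError("sizes must be positive and fan_out at least 2")
    # Number of max-size sstables needed to hold the whole run (rounded up).
    sstables_needed = -(-total_size // max_sstable_size)
    level = 0
    capacity = 1  # fan_out ** level
    while capacity < sstables_needed:
        capacity *= fan_out
        level += 1
    return level
```

Under these assumptions, a run worth 100 max-size sstables maps to level 2,
and a run no larger than a single max-size sstable stays in level 0.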

For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.

This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2

For repair-based decommission/removenode though, for which reshape isn't
wired in yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
2021-03-29 20:22:04 +03:00
Botond Dénes
3c54c990ab test: view_build_test: test_view_update_generator_buffering: fail gracefully
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
2021-03-29 17:58:28 +03:00
Avi Kivity
a8463cfb37 Merge "reader_permit: signal leaked resources" from Botond
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on to are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.

Tests: unit(release, debug)
"

* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
  reader_permit: signal leaked resources
  test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
  sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
2021-03-29 17:57:31 +03:00
Botond Dénes
9e01c4c667 test: view_build_test: test_view_update_generator_buffering: use separate permit for readers
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
2021-03-29 17:35:51 +03:00
Takuya ASADA
6f678ab7ff aws: initialize self._disks['ebs'] when no EBS disks
It seems aws_instance.ebs_disks() raises a traceback when no EBS disks are
available; self._disks['ebs'] needs to be initialized with an empty list.

Fixes #8365

Closes #8366
2021-03-29 17:21:14 +03:00
Gleb Natapov
13a3cf62bb raft: move incoming message processing into per state functions
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.

Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
2021-03-29 15:48:43 +02:00
Tomasz Grabiec
43fd322856 Merge 'scylla-gdb.py: Add io-queues command' from Piotr Sarna
The command can be used to inspect IO queues of a local reactor.
Example output:
```
    (gdb) scylla io-queues
        Dev 0:
            Class:                  |shares:         |ptr:
            --------------------------------------------------------------------------------
            "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
            "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
            "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
            "streaming"             |200             |(seastar::priority_class_data *)0x0
            "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
            "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

            Max request size:    2147483647
            Max capacity:        Ticket(weight: 4194303, size: 4194303)
            Capacity tail:       Ticket(weight: 73168384, size: 100561888)
            Capacity head:       Ticket(weight: 77360511, size: 104242143)

            Resources executing: Ticket(weight: 2176, size: 514048)
            Resources queued:    Ticket(weight: 384, size: 98304)
            Handles: (1)
                Class 0x6000005d7278:
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
            Pending in sink: (0)
```

Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.

Closes #8362

* github.com:scylladb/scylla:
  scylla-gdb: add io-queues command
  scylla-gdb.py: add parsing std::priority_queue
  scylla-gdb.py: add parsing std::atomic
  scylla-gdb.py: add parsing std::shared_ptr
  scylla-db.py: add parsing intrusive_slist
2021-03-29 15:31:48 +02:00
Piotr Sarna
adf07eb8fb scylla-gdb: add io-queues command
The command can be used to inspect reactor's IO queues. Example output:
(gdb) scylla io-queues
    Dev 0:
        Class:                  |shares:         |ptr:
        --------------------------------------------------------------------------------
        "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
        "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
        "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
        "streaming"             |200             |(seastar::priority_class_data *)0x0
        "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
        "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

        Max request size:    2147483647
        Max capacity:        Ticket(weight: 4194303, size: 4194303)
        Capacity tail:       Ticket(weight: 73168384, size: 100561888)
        Capacity head:       Ticket(weight: 77360511, size: 104242143)

        Resources executing: Ticket(weight: 2176, size: 514048)
        Resources queued:    Ticket(weight: 384, size: 98304)
        Handles: (1)
            Class 0x6000005d7278:
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
        Pending in sink: (0)
2021-03-29 15:01:25 +02:00
Piotr Sarna
f162423b8a scylla-gdb.py: add parsing std::priority_queue
The parsing assumes that the underlying storage is a vector,
which is often enough the case.
2021-03-29 13:10:36 +02:00
Piotr Sarna
e36c1f1d25 scylla-gdb.py: add parsing std::atomic 2021-03-29 13:10:36 +02:00
Piotr Sarna
0d4d04d3e6 scylla-gdb.py: add parsing std::shared_ptr 2021-03-29 13:10:36 +02:00
Piotr Sarna
c61822bc86 scylla-db.py: add parsing intrusive_slist 2021-03-29 13:10:36 +02:00
Piotr Sarna
bc1c92fd05 Merge 'Improve flat_mutation_reader::consume_pausable' from Piotr Jastrzębski
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.

This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns a plain stop_iteration rather than a
   future<stop_iteration>, it reduces the amount of future usage

Test: unit(dev, release, debug)

The combined reader microbenchmark shows a 2% to 22% improvement in median
execution time, while the memtable microbenchmark shows a 3.6% to 7.8%
improvement in median execution time.

Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1316234   140.120ns     0.020ns   140.074ns   140.141ns
combined.single_active                          7332    91.484us    31.890ns    91.453us    91.778us
combined.many_overlapping                        945   870.973us   429.720ns   868.625us   871.403us
combined.disjoint_interleaved                   7102    85.989us     7.847ns    85.973us    85.997us
combined.disjoint_ranges                        7129    85.570us     7.840ns    85.562us    85.596us
combined.overlapping_partitions_disjoint_rows        5458   124.787us    56.738ns   124.731us   125.370us
clustering_combined.ranges_generic           1920688   217.940ns     0.184ns   217.742ns   218.275ns
clustering_combined.ranges_specialized       1935318   194.610ns     0.199ns   194.210ns   195.228ns
memtable.one_partition_one_row                624001     1.600us     1.405ns     1.599us     1.605us
memtable.one_partition_many_rows               79551    12.555us     1.829ns    12.549us    12.558us
memtable.many_partitions_one_row               40557    24.748us    77.083ns    24.644us    25.135us
memtable.many_partitions_many_rows              3220   310.429us    57.628ns   310.295us   311.189us
```

After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1358839   109.222ns     0.122ns   109.089ns   109.348ns
combined.single_active                          7525    87.305us    25.540ns    87.273us    87.362us
combined.many_overlapping                        962   853.195us     1.904us   851.244us   855.142us
combined.disjoint_interleaved                   7310    81.988us    28.877ns    81.949us    82.032us
combined.disjoint_ranges                        7315    81.699us    37.144ns    81.662us    81.874us
combined.overlapping_partitions_disjoint_rows        5591   120.964us    15.294ns   120.949us   121.120us
clustering_combined.ranges_generic           1954722   211.993ns     0.052ns   211.883ns   212.084ns
clustering_combined.ranges_specialized       2042194   187.807ns     0.066ns   187.732ns   188.289ns
memtable.one_partition_one_row                648701     1.542us     0.339ns     1.542us     1.543us
memtable.one_partition_many_rows               85007    11.759us     1.168ns    11.752us    11.782us
memtable.many_partitions_one_row               43893    22.805us    17.147ns    22.782us    22.843us
memtable.many_partitions_many_rows              3441   290.220us    41.720ns   290.172us   290.306us
```

Closes #8359

* github.com:scylladb/scylla:
  flat_mutation_reader: optimize consume_pausable for some consumers
  flat_mutation_reader: special case consumers in consume_pausable
  flat_mutation_reader: Change order of checks in consume_pausable
  flat_mutation_reader: fix indentation in consume_pausable
  flat_mutation_reader: Remove allocation in consume_pausable
  perf: Add benchmarks for large partitions
2021-03-29 13:06:56 +02:00
Piotr Sarna
4c79f132b6 thrift: add support for max_concurrent_requests_per_shard
The Thrift frontend is now capable of limiting the max number
of concurrent in-flight requests. Surplus requests are shed.

Tests: manual
2021-03-29 13:05:16 +02:00
Piotr Sarna
9f53327c9d thrift: add metrics for admission control
The new metrics include information about how many requests
were blocked on memory, how much is still available, etc.
2021-03-29 13:05:16 +02:00
Piotr Sarna
6b021779d2 thrift: add a counter for in-flight requests 2021-03-29 13:05:16 +02:00
Piotr Sarna
9391515461 thrift: add a counter for blocked requests
The counter tracks how many requests were blocked by the
memory estimation based admission control semaphore.
2021-03-29 13:05:16 +02:00
Piotr Sarna
ef1de114f0 thrift: partially add admission control
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
2021-03-29 13:05:16 +02:00
Nadav Har'El
ccc75bfe2a Merge 'Disable thrift by default' from Piotr Sarna
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.

Fixes #8336

Closes #8338

* github.com:scylladb/scylla:
  docs: mention disabling Thrift by default
  db,config: disable Thrift by default
2021-03-29 12:48:20 +03:00
Piotr Sarna
3388694e69 service_permit: add a getter for the number of units held
The helper function makes debugging considerably easier.
2021-03-29 11:34:18 +02:00
Piotr Sarna
364b921e25 thrift: coroutinize processing a request
While not particularly useful now, it will facilitate
later changes which introduce service permits.
2021-03-29 11:34:18 +02:00
Piotr Sarna
09621e5fc5 memory_limiter: add a missing seastarx include
It's that or declaring everything that belongs to seastar namespace
explicitly, and including "seastarx.hh" is more standard.
2021-03-29 11:34:18 +02:00
Michał Chojnowski
8c45225f21 docs: remove the obsolete IMR design note
IMR, as described in this design note, was removed in 001652815c.
This doc should have been removed back then, but was overlooked.

Closes #8340
2021-03-29 10:58:05 +02:00
Pekka Enberg
aec33c599b Update tools/python3 submodule
* tools/python3 6f3bcbe...ad04e8e (2):
  > dist/debian: fix renaming debian/scylla-* files rule
  > fix license of package build script to AGPL
2021-03-29 11:50:24 +03:00
Pekka Enberg
203b7394d7 Update tools/java submodule
* tools/java 7b66b7a0fc...ccc4201ded (1):
  > dist/debian: fix renaming debian/scylla-* files rule
2021-03-29 11:50:19 +03:00
Tomasz Grabiec
c0ce122f77 Merge "raft: wire up rpc add_server/remove_server for configuration changes" from Pavel Solodovnikov
A Raft instance needs to update the RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in the configuration, as well as dispose of the old nodes, i.e.
the nodes which are no longer part of the most recent
configuration.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be disposed of immediately.

There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.
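
The "expirable" mapping idea can be sketched as a toy address map. This is
not the raft_address_map implementation: the class `ToyAddressMap`, the
methods `note_incoming` and `find`, the injectable clock and the decision to
refresh the TTL on every incoming message are assumptions for illustration.
Only the `add_server` and `remove_server` names mirror the RPC calls named
in this series.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ToyAddressMap:
    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        # server id -> (address, expiry time, or None for configured members)
        self._entries: Dict[str, Tuple[str, Optional[float]]] = {}

    def add_server(self, server_id: str, address: str) -> None:
        # Members of the current configuration never expire.
        self._entries[server_id] = (address, None)

    def remove_server(self, server_id: str) -> None:
        self._entries.pop(server_id, None)

    def note_incoming(self, server_id: str, address: str) -> None:
        # A message arrived from a peer. If it is not a configured member,
        # remember its return address with a TTL so that we can reply.
        entry = self._entries.get(server_id)
        if entry is not None and entry[1] is None:
            return
        self._entries[server_id] = (address, self._clock() + self._ttl)

    def find(self, server_id: str) -> Optional[str]:
        entry = self._entries.get(server_id)
        if entry is None:
            return None
        address, expiry = entry
        if expiry is not None and self._clock() >= expiry:
            del self._entries[server_id]  # expired: drop the mapping
            return None
        return address
```

With a fake clock, a mapping learned from an incoming message resolves until
the TTL passes and then disappears, while configured members stay resolvable.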

* manmanson/raft_mappings_wiring_v12:
  raft: update README.md with info on RPC server address mappings
  raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
  raft/fsm: add optional `rpc_configuration` field to fsm_output
  raft: maintain current rpc context in `server_impl`
  raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
  raft: unit-tests for `raft_address_map`
  raft: support expiring server address mappings for rpc module
2021-03-29 10:28:45 +02:00
Piotr Jastrzebski
86cf566692 flat_mutation_reader: optimize consume_pausable for some consumers
Consumers that return stop_iteration rather than future<stop_iteration>
don't have to consume a single fragment per iteration of repeat. They can
consume the whole buffer in each iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
26cc4f112d flat_mutation_reader: special case consumers in consume_pausable
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and doesn't call futurize_invoke for them, as this
is unnecessary work.

More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
164e23d2b1 flat_mutation_reader: Change order of checks in consume_pausable
This way we can avoid checking is_buffer_empty twice.
Compiler might be able to optimize this out but why depend on it
when the alternative is not less readable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
776ba29cec flat_mutation_reader: fix indentation in consume_pausable
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
9fb0014d72 flat_mutation_reader: Remove allocation in consume_pausable
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.

Indentation will be fixed in the following commit to make the review
easier.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
3aa7bee5e3 perf: Add benchmarks for large partitions
in perf_mutation_readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:48:11 +02:00
Avi Kivity
ec4d91f9eb tools: toolchain: dbuild: improve cgroupv2 detection code
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.

Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.
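
The effect of the trailing space can be sketched as follows. This is only an illustration in Python, not the dbuild shell code: the mount lines and the search strings are assumed examples in /proc/mounts style, and `uses_cgroupv2` is a hypothetical helper name.

```python
# Hypothetical mount tables, in /proc/mounts style (assumed for illustration).
HYBRID_MOUNTS = (
    "tmpfs /sys/fs/cgroup tmpfs ro,nosuid 0 0\n"
    "cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid 0 0\n"
)
UNIFIED_MOUNTS = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid 0 0\n"


def uses_cgroupv2(mounts_text, pattern):
    """Return True if any mount line contains the search expression."""
    return any(pattern in line for line in mounts_text.splitlines())


# Without the trailing space, the /sys/fs/cgroup/unified mount matches too.
old_pattern = "cgroup2 /sys/fs/cgroup"
# With the trailing space, only a mount exactly on /sys/fs/cgroup matches.
new_pattern = "cgroup2 /sys/fs/cgroup "
```

With these inputs the old pattern reports cgroupv2 for the hybrid layout, while the new pattern reports it only for the unified one.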

Closes #8355
2021-03-29 09:31:29 +03:00
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes,
i.e. the nodes which are no longer part of the most recent
configuration.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed of.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Tomasz Grabiec
6035fd05b3 Merge "Unify drain() and drain_on_shutdown()" from Pavel Emelyanov
The start-stop code is drifting towards a straightforward
scheme of a bunch of

  service foo;
  foo.start();
  auto stop_foo = defer([&foo] { foo.stop(); });

blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.

This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.

This unification also closes a bunch of not-yet-caught bugs where
parts of the system remained running if shutdown happened
after nodetool drain. In this case the whole drain_on_shutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.

tests: unit(dev), dtest(simple_boot_shutdown : dev),
       manual(start+stop, start+drain+stop : dev)
refs: #2737

* xemul/br-drain-on-shutdown:
  drain_on_shutdown: Simplify
  drain: Fix indentation
  storage_service: Unify drain and drain_on_shutdown
  storage_proxy: Drain and unsubscribe in main.cc
  migration_manager: Stop it in two phases
  stream_manager: Stop instances on drain
  batchlog_manager: Stop its instances on shutdown
  tracing: Shutdown tracing in drain
  tracing: Stop it in main.cc
  system_distributed_keyspace: Stop it in main.cc
  storage_service: Move (un)subscription to migration events
2021-03-26 18:37:27 +01:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
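
A minimal sketch of that construct, assuming configurations are plain sets of server addresses. The function names are hypothetical and do not mirror the C++ code in `server_impl`.

```python
def rpc_address_set(previous_config, current_config):
    """Largest set of addresses the RPC module may need to reach:
    everything from both the previous and the current configuration."""
    return set(previous_config) | set(current_config)


def config_diff(previous_config, current_config):
    """Split the change into servers being added and servers being removed."""
    prev, cur = set(previous_config), set(current_config)
    return {"joining": cur - prev, "leaving": prev - cur}
```

For a change from {A, B, C} to {C, D, E} the RPC set is {A, B, C, D, E}, so both the joining servers (D, E) and the leaving ones (A, B) stay reachable during the transition.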

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
8799ccbab0 raft: use .contains instead of .count for std::set in raft::configuration::diff
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
7c229998e8 raft: unit-tests for raft_address_map
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
3c4d46728d raft: support expiring server address mappings for rpc module
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.

In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.
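
The idea can be sketched roughly as below. This is an illustrative Python model, not the interface of `raft_address_map`: the class name, method names, the explicit `now` argument and the rule that an expiring update never overrides a permanent entry are all assumptions.

```python
class ExpiringAddressMap:
    """Hypothetical model of an address map with expirable entries."""

    def __init__(self, ttl):
        self._ttl = ttl
        # server id -> (address, expiry time, or None if the entry never expires)
        self._entries = {}

    def set_permanent(self, server_id, address):
        # Entry for a configured peer: stays until explicitly replaced.
        self._entries[server_id] = (address, None)

    def set_expiring(self, server_id, address, now):
        # Entry learned from an incoming message from an unknown peer.
        # Assumption: it must not downgrade an existing permanent entry.
        current = self._entries.get(server_id)
        if current is not None and current[1] is None:
            return
        self._entries[server_id] = (address, now + self._ttl)

    def find(self, server_id, now):
        entry = self._entries.get(server_id)
        if entry is None:
            return None
        address, expiry = entry
        if expiry is not None and now >= expiry:
            # The TTL has passed: forget the mapping.
            del self._entries[server_id]
            return None
        return address
```

A reply to an unconfigured peer works only while its expiring entry is alive; once the TTL passes, `find` returns `None` and the peer is unknown again.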

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Emelyanov
d1796ab3dc drain_on_shutdown: Simplify
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, nor does it need logging before and after,
as that's done by the deferred action from main that
calls it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
58b47efe16 drain: Fix indentation
Previous patch left it broken for readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
8d7ad6de03 storage_service: Unify drain and drain_on_shutdown
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.

Also the indentation is deliberately left broken for the
sake of patch readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
b60099c2f8 storage_proxy: Drain and unsubscribe in main.cc
Currently shutdown after drain leaves storage proxy
subscribed on storage_service events and without the
storage_proxy::drain_on_shutdown being called. So it
seems safe if the whole thing is relocated closer to
its starting peers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
9a8f125890 migration_manager: Stop it in two phases
Before the patch the migration manager was stopped in
two ways and one was buggy.

Plain shutdown -- it's just sharded::stop-ed by defer
in main(), but this happens long after the shutdown
of commitlog, which is not correct.
Shutdown after drain -- it's stopped twice, first time
right before the commitlog shutdown, second -- the
same defer in main(). And since the sharded::stop is
re-entrant, the 2nd stop is a noop.

This patch splits the stop into two phases: first it
stops the instances and does this in _both_ -- plain
shutdown and shutdown after drain. This phase is done
before commitlog shutdown in both cases. Second, the
existing deferred sharded::stop in main.cc.

This change needs migration_manager::stop() to
become re-entrant, but that's easily checked with the
help of the abort_source the migration_manager has.
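
A re-entrant stop of this kind can be sketched as below. This is a hedged Python illustration only: the class and its fields are hypothetical, and a boolean flag stands in for the abort_source check mentioned above.

```python
class Service:
    """Hypothetical service whose stop() may safely be called twice."""

    def __init__(self):
        self._abort_requested = False  # stands in for the abort_source state
        self.teardown_runs = 0

    def stop(self):
        # A second call sees the abort already requested and does nothing.
        if self._abort_requested:
            return
        self._abort_requested = True
        self.teardown_runs += 1  # the real teardown work would happen here
```

Calling `stop()` once before commitlog shutdown and once more from the deferred action then runs the teardown exactly once.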

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
de8a7fe798 stream_manager: Stop instances on drain
It's not seen directly from this patch itself, but the only
difference between first several calls that drain() makes
and the stop_transport() is the do_stop_stream_manager()
in the latter.

Again, it's partially a bugfix (shutdown after drain leaves
streaming running), partially a must-have thing (streaming
is not expected in the background after drain), partially a
unification of two drains out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
9a7e2a218b batchlog_manager: Stop its instances on shutdown
It's now stopped (not sharded::stop(), but batchlog_manager::stop)
on plain drain, but plain shutdown leaves it running, so fill this
gap.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
bcc3935ce7 tracing: Shutdown tracing in drain
First of all, shutdown that happens after nodetool drain leaves
tracing up-n-running, so it's effectively a bugfix. But also
a step towards unified drain and drain_on_shutdown.

Keeping this bit in drain seems to be required because drain
stops transport, flushes column families and shuts commitlog
down. Any tracing activity happening after it looks uncalled
for.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
f1d7804102 tracing: Stop it in main.cc
The tracing::stop() just checks that it was shutdown()-ed and
is otherwise a noop, so it's OK to stop tracing later. This brings
drain() and drain_on_shutdown() closer to each other and makes
main.cc look more like it should.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
d7cccec97f system_distributed_keyspace: Stop it in main.cc
It's now stopped in drain_on_shutdown, but since its stop()
method is a noop, it doesn't matter where it is. Keeping it
in main.cc next to related start brings drain_on_shutdown()
closer to drain() and the whole thing closer to the Ideal
start-stop sequence.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
5456174d69 storage_service: Move (un)subscription to migration events
After the patch the subscription effectively happens at
the same time as before, but is now located in main.cc,
so no real change here.

The unsubscription was in the drain_on_shutdown before
the patch, but after it it happens to be a defer next
to its peer, i.e. later, but it shouldn't be disastrous
for two reasons. First -- client services and migration
manager are already stopped. Second -- before the patch
this subscription was _not_ cancelled if shutdown ran
after nodetool drain and it didn't cause troubles.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Botond Dénes
d64b1fdd6a reader_permit: signal leaked resources
When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
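
The accounting can be sketched like this. It is a simplified Python model, not the reader_permit code: the classes, the explicit `close()` call in place of a destructor, and the `log` callback are assumptions made for illustration.

```python
class Semaphore:
    """Hypothetical counting semaphore tracking available units."""

    def __init__(self, count):
        self.available = count

    def consume(self, n):
        self.available -= n

    def signal(self, n):
        self.available += n


class Permit:
    """Hypothetical permit that remembers how many units it still holds."""

    def __init__(self, semaphore):
        self._semaphore = semaphore
        self._held = 0

    def consume(self, n):
        self._semaphore.consume(n)
        self._held += n

    def signal(self, n):
        self._semaphore.signal(n)
        self._held -= n

    def close(self, log):
        # Plays the role of the destructor: report the leak, then hand the
        # never-signalled units back so the semaphore does not lose them.
        if self._held:
            log(f"permit destroyed with {self._held} leaked units")
            self._semaphore.signal(self._held)
            self._held = 0
```

Because the permit tracks its own balance, the amount to return is known exactly at destruction time.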
2021-03-26 14:23:32 +02:00
Botond Dénes
0f1a72ba59 test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
To ensure the semaphores outlive all permits created as part of the
tests.
2021-03-26 14:22:43 +02:00
Botond Dénes
f843e3de08 sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Used to produce the needed permits for the index reads, such that it
outlives all the permits in use.
2021-03-26 11:06:02 +02:00
Tomasz Grabiec
f86896d387 Merge "Iterate range tombstones in partition_snapshot_reader" from Pavel Emelyanov
Currently the reader copies and merges all range tombstones from all partition
versions (that match the given range, but still) when being initialized or
decides to refresh iterators. This is a lot of potentially useless work and
memory, as the reader may be dropped before it emits all the mutations from
the given range(s).

It's better to walk the tombstones step-by-step, like it's done for rows.

fixes: #1671
tests: unit(dev)

* xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2:
  range_tombstone_stream: Remove unused methods
  partition_snapshot_reader: Emit range tombstones on demand
  partition_snapshot_reader: Introduce maybe_refresh_state
  partition_snapshot_reader: Move range tombstone stream member
  partition_snapshot_reader: Add reset_state method to helper class
  partition_snapshot_reader: Downgrade heap comparator
  partition_snapshot_reader: Use on-demand comparators
  range_tombstone_list: Add new slice() helper
  range_tombstone_list: Introduce iterator_range alias
2021-03-26 01:27:18 +01:00
Pavel Emelyanov
c6a0e0439e files: Construct file_impls properly
Constructors of classes inherited from file_impl copy alignment
values by hand, but miss the overwrite one, thus on a new file
it remains default-initialized.

To fix this and not to forget to properly initialize future fields
from file_impl, use the impl's copy constructor.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
2021-03-26 00:22:11 +01:00
Tomasz Grabiec
ef06a939c4 Merge "raft: seven etcd unit tests ported" from Alejo
Seven etcd unit tests as boost tests.

* alejo/raft-tests-etcd-08-v4-communicate-v5:
  raft: etcd unit tests: test proposal handling scenarios
  raft: etcd unit tests: test old messages ignored
  raft: etcd unit tests: test single node precandidate
  raft: etcd unit tests: test dueling precandidates
  raft: etcd unit tests: test dueling candidates
  raft: etcd unit tests: test cannot commit without new term
  raft: etcd unit tests: test single node commit
  raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
  raft: etcd unit tests: fix test_progress_leader
  raft: testing: log comparison helper functions
  raft: testing: helper to make fsm candidate
  raft: testing: expose log for test verification
  raft: testing: use server_address_set
  raft: testing: add prevote configuration
  raft: testing: make become_follower() available for tests
2021-03-25 20:27:07 +01:00
Alejo Sanchez
ace0ee514f raft: etcd unit tests: test proposal handling scenarios
TestProposal
For multiple scenarios, check proposal handling.

Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
77163ea76a raft: etcd unit tests: test old messages ignored
TestOldMessages
Checks an append request from a leader from a previous term is ignored.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
bf65b19803 raft: etcd unit tests: test single node precandidate
TestSingleNodePreCandidate
Checks that a single-node configuration with precandidate enabled
automatically elects the node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
de7051467b raft: etcd unit tests: test dueling precandidates
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
aa7d23f86b raft: etcd unit tests: test dueling candidates
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.

But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
1eac94e7d6 raft: etcd unit tests: test cannot commit without new term
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.

NOTE: this doesn't check committed but it's implicit for next round;
      this could also use communicate() providing committed output map

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
b421fe3605 raft: etcd unit tests: test single node commit
Port etcd TestSingleNodeCommit

In a single node configuration elect the node, add 2 entries and check
number of committed entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
9b4538476b raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
368eec1190 raft: etcd unit tests: fix test_progress_leader
Make the implementation follow the original test more closely.
Use newer boost test helpers.

NOTE: in etcd it seems a leader's self progress is in PIPELINE state.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
ba29970e29 raft: testing: log comparison helper functions
Two helper functions to compare logs. For now only index, term, and data
type are used. Data content comparison does not seem to be necessary for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
aeab4cf4a9 raft: testing: helper to make fsm candidate
Current election_timeout() helper might bump the term twice.
It's convenient and less error prone to have a more fine grained helper
that stops right when candidate state is reached.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:19 -04:00
Alejo Sanchez
7a6616f1cb raft: testing: expose log for test verification
Let derived classes access the log to verify its contents.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:03:46 -04:00
Alejo Sanchez
05b1f57e67 raft: testing: use server_address_set
Use server_address_set in local namespace for brevity.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:01:12 -04:00
Alejo Sanchez
9d0a7d8ccf raft: testing: add prevote configuration
Provide a generic prevote configuration for tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:00:28 -04:00
Dejan Mircevski
b2a04985f7 cql-pytest: Drop needless INSERT in test_null
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.

Tests: cql-pytest/test_null on both Scylla and Cassandra

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8304
2021-03-25 16:37:00 +01:00
Tomasz Grabiec
7b30d31d77 Merge "raft: test configuration changes" from Kostja
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.

* scylla-dev/raft-empty-confchange-v5: (21 commits)
  raft: (testing) stray replies from removed followers
  raft: always return a non-zero configuration index from the log
  raft: (testing) leader change during configuration change
  raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
  raft: (testing) test confchange {ABCDEF} -> {ABCGH}
  raft: (testing) test confchange {ABC} -> {CDE}
  raft: (testing) test confchange {AB} -> {CD}
  raft: (testing) test confchange {A} -> {B}
  raft: (testing) test a server with empty configuration
  raft: (testing) introduce testing utilities
  raft: (testing) simplify id allocation in test
  raft: (testing) add select_leader() helper
  raft: (testing) introduce communicate() helper
  raft: (testing) style cleanup in raft_fsm_test
  raft: (testing) fix bug in election_threshold
  raft: minor style changes & comments
  raft: do not assert when transitioning to empty config
  raft: assert we never apply a snapshot over uncommitted entries (leader)
  raft: improve tracing
  raft: add fsm_output::empty() helper to aid testing
  ...
2021-03-25 14:01:09 +01:00
Wojciech Mitros
b152dc8c86 types: move read_collection_size/value specialization to header file
The template method needs to be specialized in each file that is
using it. To avoid rewriting the specialization into multiple files,
move it to the header file.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-25 12:18:38 +01:00
Avi Kivity
46185d7d82 Update tools/jmx submodule
* tools/jmx 9c687b5...440313e (1):
  > storage_service: Add a generic toppartitions endpoint
2021-03-25 12:36:10 +02:00
Alejo Sanchez
7e6807e8fc raft: testing: make become_follower() available for tests
Some etcd tests need to force a follower with a specific leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-24 19:11:09 -04:00
Piotr Wojtczak
c1daf2bb24 column_family: Make toppartitions queries more generic
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing a list of families to be specified.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any number of entries of forms 1. and 2.
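
The three filter forms could be interpreted roughly as follows. This is a Python sketch of the selection rules only: `select_tables` and the `(keyspace, column_family)` tuples are hypothetical and say nothing about how the endpoint is actually implemented.

```python
def select_tables(name_list, all_tables):
    """Pick the (keyspace, column_family) pairs matched by name_list.

    name_list entries are "ks:cf" (one column family) or "ks:" (a whole
    keyspace); an empty list selects every column family.
    """
    all_tables = list(all_tables)
    if not name_list:
        return set(all_tables)
    selected = set()
    for item in name_list:
        keyspace, _, column_family = item.partition(":")
        for table in all_tables:
            if table[0] == keyspace and column_family in ("", table[1]):
                selected.add(table)
    return selected
```

For example, with tables `ks1.a`, `ks1.b` and `ks2.c`, the list `["ks1:", "ks2:c"]` selects all three, while `["ks1:a"]` selects only `ks1.a`.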

Fixes #4520

Closes #7864
2021-03-24 17:54:05 +02:00
Raphael S. Carvalho
bcbb39999b LCS: Fix terrible write amplification when reshaping level 0
LCS reshape is basically 'major compacting' level 0 until it contains fewer than
N sstables.

That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
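
The estimate above, written out as a small calculation (the function name is made up for illustration; the formula is the one stated in this message):

```python
def reshape_write_amplification(initial_sstables, max_threshold=32):
    """Approximate number of times each byte is rewritten when level 0 is
    repeatedly 'major compacted' max_threshold sstables at a time."""
    return initial_sstables / max_threshold


wa = reshape_write_amplification(256)  # 256 / 32 = 8.0
```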

This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.

Fixes #8345.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
2021-03-24 17:48:50 +02:00
Piotr Sarna
24a43681b4 thrift: handle gate closed exception on retry
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.

Closes #8337
2021-03-24 17:41:58 +02:00
Konstantin Osipov
1a1d7ab662 raft: (testing) stray replies from removed followers 2021-03-24 14:05:55 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return snapshot index for last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
cec59e53ef raft: (testing) leader change during configuration change 2021-03-24 14:05:36 +03:00
Pavel Emelyanov
37bec6fb76 commitlog: Open files with append_is_unlikely
This open option tells seastar that the file in question
will be truncated to the needed size right away and all
the subsequent writes will happen within this size. This
hint turns off the append optimization in seastar, which is not
that cheap, and helps to save a few CPU cycles.

The option was introduced in seastar by 8bec57bc.

tests: unit(dev), dtest(commitlog:
                        test_batch_commitlog,
                        test_periodic_commitlog,
                        test_commitlog_replay_on_startup)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
2021-03-24 13:05:33 +02:00
Konstantin Osipov
a203c8833f raft: (testing) test confchange {ABCDE} -> {ABCDEFG} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
40e117d36e raft: (testing) test confchange {ABCDEF} -> {ABCGH} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
14b2d5d308 raft: (testing) test confchange {ABC} -> {CDE}
Test leader change during configuration change.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
3c718a175e raft: (testing) test confchange {AB} -> {CD} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
2e30c8540e raft: (testing) test confchange {A} -> {B}
Test non-restart and leader restart scenario.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
e23da06fef raft: (testing) test a server with empty configuration
Try becoming a candidate for such server, or adding it
to an existing configuration.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
8d26d24370 raft: (testing) simplify id allocation in test 2021-03-24 14:04:18 +03:00
Konstantin Osipov
322a15ec33 raft: (testing) add select_leader() helper
With leader stepdown extension, leadership transfer can happen
to any follower with long enough log. Add a helper to select that
follower from a list.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
4a00da276d raft: (testing) introduce communicate() helper
Allow communication between an arbitrary number of FSMs. Drop
messages to FSMs which are not in the argument list.
Stop communication upon predicate.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
7182323ac0 raft: (testing) style cleanup in raft_fsm_test
1) Avoid memory violations on test failure
2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs
   BOOST_CHECK)
2021-03-24 14:04:18 +03:00
Konstantin Osipov
f0f25bf7fb raft: (testing) fix bug in election_threshold
election_threshold was ticking one extra tick,
causing the follower to become candidate in some cases.
This was rendering tests unstable.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
00d7379bc9 raft: minor style changes & comments
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
2021-03-24 14:04:18 +03:00
Piotr Sarna
06131e21a3 configure.py: add customizing clang inline threshold
Until clang figures things out with the now infamous
`-mllvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).

Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
2021-03-24 12:09:26 +02:00
Tomasz Grabiec
9272e74e8c sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would have to be so large that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
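
A toy model of this failure, assuming each promoted index entry is just a numeric start position plus a label, listed in the order the blocks appear in the file. The real index stores clustering positions and file offsets; the numbers and names here are invented.

```python
import bisect

# Entries in file order: (start position, what the block holds).
# The marker sorts after the row tombstone's start bound (2 > 1),
# but the old writer emitted the marker first.
BROKEN_INDEX = [(0, "previous row"), (2, "row marker"),
                (1, "row tombstone"), (5, "next row")]
FIXED_INDEX = [(0, "previous row"), (1, "row tombstone"),
               (2, "row marker"), (5, "next row")]


def first_block_read(index, target):
    """Binary-search for the last entry starting at or before target and
    return the label of the block where reading would begin."""
    starts = [start for start, _ in index]
    i = bisect.bisect_right(starts, target) - 1
    return index[i][1]
```

Seeking position 2 in the broken index lands on the row tombstone block, which lies after the marker in the file, so the marker is never read. With the fixed ordering the same search lands on the marker.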

The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
2021-03-23 16:13:47 +01:00
Tomasz Grabiec
235154cca5 Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.

For the former case -- the btree wrapper is introduced.

For the latter -- compiler optimizes-away too many important
bits and walking the tree turns into a bunch of hard-coded
hacks and reinterpret-casts. Until a better solution is found,
just print the address of the tree root.

* xemul/br-gdb-btree-rows:
  gdb: Show address of the row::_cells tree (or "empty" mark)
  gdb: Add support for intrusive B tree
  gdb: Use helper to get rows from mutation_partition
2021-03-23 12:50:17 +01:00
Pavel Emelyanov
1cd9ec952f gdb: Show address of the row::_cells tree (or "empty" mark)
Currently clang optimizes-out lots of critical stuff from
compact radix tree. Until we find a way to walk the
tree in gdb, it's better to at least show where it is in
memory.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 13:29:40 +03:00
Pavel Emelyanov
5c85fcb3c9 gdb: Add support for intrusive B tree
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:44 +03:00
Pavel Emelyanov
ed38b18a84 gdb: Use helper to get rows from mutation_partition
Preparation for the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:14 +03:00
Avi Kivity
3c292e31af utils: utf8: fix validate_partial() on non-SIMD-optimized architectures
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).

Fix by defining it in the internal namespace.

Closes #8268
2021-03-23 09:21:14 +02:00
Avi Kivity
957259fab7 tools: toolchain: prepare: adjust manifest manipulations
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.

Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.

Closes #8231
2021-03-23 09:18:19 +02:00
Avi Kivity
4dae434f69 utils: crc: fix build with big-endian architectures and 1-byte objects
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustment).
This causes clang to complain that the switch statement doesn't handle that
case.

Fix by adding a no-op case.

Closes #8269
2021-03-23 09:16:20 +02:00
Konstantin Osipov
ce29fb44c3 raft: do not assert when transitioning to empty config
Throw instead, to make this case testable.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
2ee15ad6c7 raft: assert we never apply a snapshot over uncommitted entries (leader) 2021-03-22 18:55:40 +03:00
Konstantin Osipov
c7f7ad2c4e raft: improve tracing
Add tracing to apply_snapshot, request_vote.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
4dd66edae5 raft: add fsm_output::empty() helper to aid testing
Used in testing to implement trivial transport.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
89349f550c raft: aid testing by providing fsm::id() 2021-03-22 18:55:40 +03:00
Botond Dénes
742a33730a scylla-gdb.py: dereference_smart_ptr(): add support for seastar::smart_ptr
Although a seastar::smart_ptr is trivial to dereference manually, so is
adding support for it to dereference_smart_ptr(), avoiding the annoying
(but brief) detour which is currently needed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
2021-03-22 17:30:35 +02:00
Piotr Sarna
b774d69ad2 docs: mention disabling Thrift by default
Thrift is no longer enabled by default, so the documentation
should mention that, as well as the suggested way of enabling it
if necessary.
2021-03-22 14:32:51 +01:00
Raphael S. Carvalho
c86dd125a1 sstables: clean up partitioned_sstable_set::insert()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-2-raphaelsc@scylladb.com>
2021-03-22 15:30:32 +02:00
Raphael S. Carvalho
48d8cc261e sstables: don't swallow exception in partitioned_sstable_set::insert()
regression introduced by 02b2df1ea9 (Fri Mar 12 01:22:41 2021 -0300).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-1-raphaelsc@scylladb.com>
2021-03-22 15:30:31 +02:00
Avi Kivity
50dda795e9 Update seastar submodule
* seastar 83339edb04...48376c76a1 (2):
  > iotune: Warn user about write-back cache mode
  > reactor: add --kernel-page-cache option to disable O_DIRECT
2021-03-22 13:33:08 +02:00
Avi Kivity
74df67776b bytes_ostream: convert write_placeholder from enable_if to concepts
Concepts are easier to read and result in better error messages.

This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.

Closes #8326
2021-03-22 12:00:07 +01:00
Piotr Sarna
e2443337d9 db,config: disable Thrift by default
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.

Fixes #8336
2021-03-22 10:54:26 +01:00
Piotr Sarna
23057dd186 Merge 'Implement RAFT's leader stepdown extension' from Gleb
This series implements leader stepdown extension. See patch 4 for
justification for its existence. First three patches either implement
cleanups to existing code that future patch will touch or fix bugs
that need to be fixed in order for stepdown test to work.

* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
  raft: add test for leader stepdown
  raft: introduce leader stepdown procedure
  raft: fix replication when leader is not part of current config
  raft: do not update last election time if current leader is not a part of current configuration
  raft: move log limiting semaphore into the leader state
2021-03-22 09:45:19 +01:00
Avi Kivity
3c44445c07 Merge "Introduce off-strategy compaction for repair-based bootstrap and replace" from Raphael
"
Scylla suffers from aggressive compaction after a repair-based operation has been initiated. That translates into bad latency and slowness for the operation itself.

This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, reducing the bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.

To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.

The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.

NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.

Refs #5226.
"

* 'offstrategy_v7' of github.com:raphaelsc/scylla:
  tests: Add unit test for off-strategy sstable compaction
  table: Wire up off-strategy compaction on repair-based bootstrap and replace
  table: extend add_sstable_and_update_cache() for off-strategy
  sstables/compaction_manager: Add function to submit off-strategy work
  table: Introduce off-strategy compaction on maintenance sstable set
  table: change build_new_sstable_list() to accept other sstable sets
  table: change non_staging_sstables() to filter out off-strategy sstables
  table: Introduce maintenance sstable set
  table: Wire compound sstable set
  table: prepare make_reader_excluding_sstables() to work with compound sstable set
  table: prepare discard_sstables() to work with compound sstable set
  table: extract add_sstable() common code into a function
  sstable_set: Introduce compound sstable set
  reshape: STCS: preserve token contiguity when reshaping disjoint sstables
2021-03-22 10:43:13 +02:00
Gleb Natapov
272cb1c1e6 raft: add test for leader stepdown 2021-03-22 10:31:16 +02:00
Gleb Natapov
9d6bf7f351 raft: introduce leader stepdown procedure
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:

1. Sometimes the leader must step down. For example, it may need to reboot
 for maintenance, or it may be removed from the cluster. When it steps
 down, the cluster will be idle for an election timeout until another
 server times out and wins an election. This brief unavailability can be
 avoided by having the leader transfer its leadership to another server
 before it steps down.

2. In some cases, one or more servers may be more suitable to lead the
 cluster than others. For example, a server with high load would not make
 a good leader, or in a WAN deployment, servers in a primary datacenter
 may be preferred in order to minimize the latency between clients and
 the leader. Other consensus algorithms may be able to accommodate these
 preferences during leader election, but Raft needs a server with a
 sufficiently up-to-date log to become leader, which might not be the
 most preferred one. Instead, a leader in Raft can periodically check
 to see whether one of its available followers would be more suitable,
 and if so, transfer its leadership to that server. (If only human leaders
 were so graceful.)

The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
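
As a rough illustration of the leadership transfer idea quoted above, the sketch below picks a transfer target whose log already matches the leader's. The function name, the `match_idx` mapping and the server names are hypothetical and are not taken from the patch.

```python
def pick_transfer_target(match_idx, last_log_idx, exclude=()):
    """Return a follower whose log already matches the leader's, or None.

    match_idx maps a follower id to the highest log index known to be
    replicated on that follower (hypothetical bookkeeping for this sketch).
    """
    for server, idx in sorted(match_idx.items()):
        if server not in exclude and idx == last_log_idx:
            return server
    return None


# Follower C is fully caught up, B is lagging behind.
print(pick_transfer_target({"B": 7, "C": 10}, last_log_idx=10))
```

If no follower is caught up yet, the sketch returns `None`; a leader would keep replicating and try again later.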
2021-03-22 10:28:43 +02:00
Gleb Natapov
888b52dea1 raft: fix replication when leader is not part of current config
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
2021-03-22 09:52:17 +02:00
Gleb Natapov
1acc8996bc raft: do not update last election time if current leader is not a part of current configuration
Since we use external failure detector instead of relying on empty
AppendRequests from a leader there can be a situation where a node
is no longer part of a certain raft group but is still alive (and also
may be part of other raft groups). In such case last election time
should not be updated even if the node is alive. It is the same as if
it would have stopped to send empty AppendRequests in original raft.
2021-03-22 09:52:17 +02:00
Gleb Natapov
ccf4435759 raft: move log limiting semaphore into the leader state
Log limiting semaphore is used on a leader only, so it should be stored
inside the leader state.
2021-03-22 09:52:17 +02:00
Takuya ASADA
35a14ab22b configure.py: drop compat-python3 targets
Since we switched scylla-python3 build directory to tools/python3/build
on Jenkins, we no longer need compat-python3 targets, drop them.

Related scylladb/scylla-pkg#1554

Closes #8328
2021-03-21 18:04:27 +02:00
Benny Halevy
f562c9c2f3 test: sstable_datafile_test: tombstone_purge_test: use a longer ttl
As seen in next-3319 unit testing on jenkins
The cell ttl may expire during the test (presuming
that the test machine was overloaded), leading to:
```
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacting [/jenkins/workspace/scylla-master/next/scylla/testlog/release/scylla-af8644ec-7f07-4ffe-80bf-6703a942e435/la-17-big-Data.db:level=0:origin=, ]
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacted 1 sstables to []. 4kB to 0 bytes (~0% of original) in 0ms = 0 bytes/s. ~128 total partitions merged to 0.
./test/lib/mutation_assertions.hh(108): fatal error: in "tombstone_purge_test": Mutations differ, expected {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{1,ts=1616313953,expiry=1616313958,ttl=5} },
    },
  ]
}
}
 ...but got: {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{DEAD,ts=1616313953,deletion_time=1616313953} },
    },
  ]
}
}
```

This corresponds to:
```
2395            auto mut2 = make_expiring(alpha, ttl);
2396            auto mut3 = make_insert(beta);
...
2399            auto sst2 = make_sstable_containing(sst_gen, {mut2, mut3});
```

Extend (logical) ttl to 10 seconds to reduce flakiness
due to real-time timing.

Test: sstable_datafile_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210321142931.1226850-1-bhalevy@scylladb.com>
2021-03-21 16:42:00 +02:00
Avi Kivity
1e820687eb Merge "reader_concurrency_semaphore: limit non-admitted inactive reads" from Botond
"
Due to bad interaction of recent changes (913d970 and 4c8ab10) inactive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there are too many of
them.
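
A simplified model of the accounting described above, under loudly stated assumptions: the state names, the tuple layout and both helper functions are invented for this sketch and do not mirror the real reader_permit or semaphore code. It shows why counting the new inactive state makes un-admitted inactive reads visible to resource-based eviction.

```python
# States whose cost is forwarded to the semaphore in this toy model.
FORWARDING_STATES = {"admitted", "inactive"}


def accounted_memory(permits):
    """Sum the cost of permits whose state forwards its cost."""
    return sum(cost for state, cost in permits if state in FORWARDING_STATES)


def evict_inactive(permits, limit):
    """Drop inactive permits, oldest first, until the accounted cost fits."""
    kept = list(permits)
    for permit in permits:
        if accounted_memory(kept) <= limit:
            break
        if permit[0] == "inactive":
            kept.remove(permit)
    return kept


# (state, memory cost) pairs; "waiting" stands for a not-yet-admitted read.
permits = [("inactive", 40), ("admitted", 30), ("inactive", 50), ("waiting", 100)]
print(accounted_memory(permits))
print(evict_inactive(permits, limit=100))
```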

Fixes: #8258

Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"

* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add test for permit cleanup
  test: querier_cache_test: add memory based cache eviction test
  reader_permit: add inactive state
  querier: insert(): account immediately evicted querier as resource based eviction
  reader_concurrency_semaphore: fix clear_inactive_reads()
  reader_concurrency_semaphore: make inactive_read_handle a weak reference
  reader_concurrency_semaphore: make evict() noexcept
  reader_concurrency_semaphore: update out-of-date comments
2021-03-21 16:24:54 +02:00
Nadav Har'El
ab75226626 test/cql-pytest: remove xfail from passing test
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.

Refs #7888.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
2021-03-21 16:02:30 +02:00
Avi Kivity
e2cd551880 Update seastar submodule
* seastar ea5e529f30...83339edb04 (21):
  > cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
  > Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes #8327.
  > file: Add option to refuse the append-challenged file
  > Merge "Teach io-tester to work on block device" from Pavel E
  > Merge "Cleanup files code" from Pavel E
  > install-dependencies: Support rhel-8.3
  > install-dependencies: Add some missing rh packages
  > file, reactor: reinstate RWF_NOWAIT support
  > file: Prevent fsxattr.fsx_extsize from overflow
  > cmake: enable clang's -Wno-error=#warnings if supported
  > cmake: harden seastar_supports_flag against inputs with spaces or #
  > cmake: fix seastar_supports_flag failing after first invocation
  > thread: Stop backtraces in main() on s390x architecture
  > intent: Explicitly declare constructors for references
  > test: file_io_test: parallel_overwrite: use testing::local_random_engine
  > util: log-impl: rework log_buf::inserter_iterator
  > rwlock: pass timeout parameter to get_units
  > concepts: require lib support to enable concepts
  > rpc: print more info on bad protocol magic
  > seastar-addr2line: strip input line to restore multiline support
  > log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
2021-03-21 15:58:10 +02:00
Nadav Har'El
10bf2ba60a cql-pytest: translate Cassandra's reproducers for issue #2962
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
cql-pytest framework.

This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing tests currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".

Refs #2962.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
2021-03-21 12:30:00 +02:00
Avi Kivity
75da8a8d81 Merge 'Fix the retry mechanism in Thrift frontend' from Piotr Sarna
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaced previous constant backoff policy.

Fixes #8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)

Closes #8318

* github.com:scylladb/scylla:
  thrift: add exponential backoff for retries
  thrift: fix and simplify retry logic
2021-03-21 12:26:13 +02:00
Avi Kivity
a78f43b071 Merge 'tracing: fast slow query tracing' from Ivan Prisyazhnyy
The set of patches introduces a new tracing mode - `fast slow query tracing`. In this mode, Scylla tracks only tracing sessions and omits all tracing events if the tracing context does not have a `full_tracing` state set.

Fixes #2572

Motivation
---

We want to run production systems with that option always enabled so we can always catch slow queries without overhead. The next step is to further optimize the cost of having tracing enabled, minimizing session context handling overhead so that it is as transparent to the end user as possible.

Fast tracing mode
---

To read the status do

    $ curl -v http://localhost:10000/storage_service/slow_query

To enable fast slow-query tracing

    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Potential optimizations
---

- remove tracing::begin(lazy_eval)
- replace tracing::begin(string) for enum to remove copying and memory allocations
- merge parameters allocations
- group parameters check for trace context
- delay formatting
- reuse prepared statement shared_ptr instead of both copying it and copying its query

Performance
---

100% cache hits
---

1 Core:

```
$ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

./cassandra-stress write n=100000 no-warmup -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=1 -mode native cql3

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=false
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done
```

|   | baseline (qps) | fast, slow (qps) | nofast, slow (qps) | %[1-fastslow/baseline] |
| -- | -- | -- | -- | -- |
|   | 29,018 | 26,468 | 23,591 | 8.79% |
|   | 28,909 | 26,274 | 23,584 | 9.11% |
|   | 28,900 | 26,547 | 23,598 | 8.14% |
|   | 28,921 | 26,669 | 23,596 | 7.79% |
|   | 28,821 | 26,385 | 23,601 | 8.45% |
| stdev | 70.24030182 | 150.9678774 | 6.670832032 |   |
| avg | 28,914 | 26,469 | 23,594 |   |
| stderr | 0.24% | 0.57% | 0.03% |   |
| %[avg/baseline] |   | **8.46%** | 18.40% |   |

8.46% performance degradation in `fast slow query mode` for pure in-memory workload with minimum traces.
18.40%  performance degradation in `original slow query mode` for pure in-memory workload with minimum traces.

0% cache hits
---

1GB memory, 1 Core:

    $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --memory 1G --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

2.4GB, 10000000 keys data:

    $ ./cassandra-stress write n=10000000 no-warmup -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 -mode native cql3
    $ curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true

CASSANDRA_STRESS prepared statements with BYPASS CACHE

    $ taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3

20000 reads IOPS, 100MB/s from disk

|   | baseline reads (qps) | fast, slow reads (qps) | %[1-fastslow/baseline] |
| -- | -- | -- | -- |
|   | 9,575 | 9,054 | 5.44% |
|   | 9,614 | 9,065 | 5.71% |
|   | 9,610 | 9,066 | 5.66% |
|   | 9,611 | 9,062 | 5.71% |
|   | 9,614 | 9,073 | 5.63% |
| stdev | 16.75410397 | 6.892024376 |   |
| avg | 9,605 | 9,064 |   |
| stderr | 0.17% | 0.08% |   |
| %[avg/baseline] |   | **5.63%** |   |

5.63% performance degradation in `fast slow query mode` for pure on-disk workload with minimum traces.

Closes #8314

* github.com:scylladb/scylla:
  tracing: fast mode unit test
  tracing: rest api for lightweight slow query tracing
  tracing: omit tracing session events and subsessions in fast mode
2021-03-21 12:15:17 +02:00
Dejan Mircevski
318f773d81 types: Unreverse tuple subtype for serialization
When a tuple value is serialized, we go through every element type and
use it to serialize element values.  But an element type can be
reversed, which is artificially different from the type of the value
being read.  This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.

Fixes #7902

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8316
2021-03-21 12:07:29 +02:00
Dejan Mircevski
0bd201d3ca cql3: Skip indexed column for CK restrictions
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns.  But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key.  We end up
with invalid clustering slice, which can cause problems downstream.

Fix this by skipping the indexed column when assembling the clustering
restrictions.
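
The fix can be pictured with a small sketch, assuming hypothetical column names and a made-up helper: when listing the columns that become clustering columns of the index table, the indexed column is left out because it is the index table's partition key.

```python
def index_clustering_columns(partition_cols, clustering_cols, indexed_col):
    """List base-table columns used for clustering restrictions on the index.

    The indexed column is skipped: in the index table it is the partition
    key, not a clustering column. All names here are illustrative only.
    """
    base_cols = ["token"] + list(partition_cols) + list(clustering_cols)
    return [col for col in base_cols if col != indexed_col]


# Index on clustering column c1 of a table with key ((p), c1, c2).
print(index_clustering_columns(["p"], ["c1", "c2"], indexed_col="c1"))
```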

Tests: unit (dev)

Fixes #7888

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8320
2021-03-21 09:52:06 +02:00
Avi Kivity
58b7f225ab keys: convert trichotomic comparators to return std::strong_ordering
A trichotomic comparator returning an int can easily be mistaken
for a less comparator as the return types are convertible.

Use the new std::strong_ordering instead.

A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.
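
The hazard is easy to reproduce in any language where an integer is accepted as a truth value. The following is a Python analogue with made-up helper names, not the C++ code from the patch: a three-way comparator passed where a less-than predicate is expected silently gives the wrong answer, because both -1 and 1 are truthy. In C++, std::strong_ordering avoids this since it does not convert implicitly to bool.

```python
def tri_compare(a, b):
    """Three-way comparison: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def is_sorted(items, less):
    """Expects a less-than predicate; misbehaves with a three-way comparator."""
    return all(not less(b, a) for a, b in zip(items, items[1:]))


data = [1, 2, 3]
print(is_sorted(data, lambda a, b: a < b))  # proper less-than predicate
print(is_sorted(data, tri_compare))         # three-way comparator misused
```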

Ref #1449.

Test: unit (dev)

Closes #8323
2021-03-21 09:30:43 +02:00
Avi Kivity
29a5047982 utils: error_injection: convert enable_if to concepts
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.

Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.

Closes #8322
2021-03-21 09:28:23 +02:00
Avi Kivity
c28d67dd7f types: time_point_to_string: convert enable_if to concepts
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.

Closes #8321
2021-03-21 09:11:40 +02:00
Tomasz Grabiec
88a019ba21 Merge "raft: respond with snapshot_reply to send_snapshot RPC" from Kostja
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checking that it's not an error. This has the following
issues:

- if the follower has a newer term and rejects the snapshot for
  that reason, the leader will not learn about a newer follower
  term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
  fsm::step() and thus may not follow the general Raft rules
  which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
  for every message becomes impossible.

Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding into fsm::step() on the leader.
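
A minimal sketch of the first issue in the list above, assuming a toy state machine: the `SnapshotReply` fields and the `ToyLeader` class are hypothetical and much simpler than the real fsm. Once the reply is fed through the same step function as every other message, the general Raft rule about higher terms applies to it as well.

```python
from dataclasses import dataclass


@dataclass
class SnapshotReply:
    current_term: int  # term of the follower that sent the reply
    success: bool      # whether the follower accepted the snapshot


class ToyLeader:
    def __init__(self, term):
        self.term = term
        self.is_leader = True

    def step(self, reply):
        # General Raft rule applied to every message: seeing a higher term
        # means this server must adopt it and stop acting as leader.
        if reply.current_term > self.term:
            self.term = reply.current_term
            self.is_leader = False


leader = ToyLeader(term=3)
leader.step(SnapshotReply(current_term=5, success=False))
print(leader.is_leader, leader.term)
```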

* scylla-dev/raft-send-snapshot-v2:
  raft: pass snapshot_reply into fsm::step()
  raft: respond with snapshot_reply to send_snapshot RPC
  raft: set follower's next_idx when switching to SNAPSHOT mode
  raft: set the current leader upon getting InstallSnapshot
2021-03-19 18:13:40 +01:00
Piotr Sarna
31d3854bb7 thrift: add exponential backoff for retries
The original backoff mechanism which just retries after 1ms
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.
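
A sketch of what such a policy could look like, not the patch itself: the 1ms starting delay is taken from the old constant backoff mentioned above, while the doubling factor and the exact 2000ms cap are assumptions standing in for "~2s".

```python
from itertools import islice


def backoff_delays(initial_ms=1, cap_ms=2000):
    """Yield retry delays in milliseconds, doubling up to a fixed cap."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * 2, cap_ms)


# First 13 delays: doubles from 1ms and then stays at the cap.
print(list(islice(backoff_delays(), 13)))
```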

Tests: manual, with cassandra-stress and browsing logs
2021-03-19 13:16:39 +01:00
Piotr Sarna
f81044d75d thrift: fix and simplify retry logic
The retry logic for Thrift frontend had two bugs:
1. Due to missing break in a switch statement,
   two retry calls were always performed instead of one,
   which acts a little bit like a Seastar forkbomb
2. The delayed action was not guarded with any gate,
   so it was theoretically possible to access a captured `this`
   pointer of an object which already got deallocated.

In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff
and immediate retries are not needed by anyone, while they cause
severe overload risk.

Tests: manual - a simple cassandra-stress invocation was able to crash
       scylla with a segfault:
       $ cassandra-stress write -mode thrift -rate threads=2000

Fixes #8317
2021-03-19 13:15:35 +01:00
Nadav Har'El
abab1d906c Merge 'sstables: convert enable_if to equivalent concepts' from Avi Kivity
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.

A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.

Closes #8315

* github.com:scylladb/scylla:
  sstables: vector write: convert to concepts
  sstables: check_truncated_and_assign: convert to concept
  sstables: convert write() to concepts
  sstables: convert write_vint() to concepts
  sstables: vector parse(): convert to concept
  sstables: convert parse() for a self-describing type to concept
  sstables: read_vint(): convert enable_if to concepts
  sstables: add concept for self-describing type
2021-03-18 23:09:34 +02:00
Raphael S. Carvalho
64d78eae6a tests: Add unit test for off-strategy sstable compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 16:56:00 -03:00
Avi Kivity
bf0c7d1340 sstables: vector write: convert to concepts
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
2021-03-18 19:26:54 +02:00
Avi Kivity
11636563d9 sstables: check_truncated_and_assign: convert to concept
Use std::integral instead of static_assert to reject non-integral
parameters.
2021-03-18 19:26:54 +02:00
Avi Kivity
42e3f33722 sstables: convert write() to concepts
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.
2021-03-18 19:26:43 +02:00
Avi Kivity
4832041857 sstables: convert write_vint() to concepts
Instead of a maze of deleted functions, enable_if, and static_assert,
use the standard std::integral concept.
2021-03-18 19:24:42 +02:00
Nadav Har'El
0b2cf21932 alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 seconds for a response,
and if it gets no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration we lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     doesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.
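
For illustration, the two settings described above could be expressed with botocore's `Config` object roughly as follows. This is a sketch rather than the test suite's actual configuration code; the helper name is made up, and the import is done lazily so the settings can be inspected without botocore installed.

```python
# Settings described in the commit message: no retries, 5-minute read timeout.
BOTO_CONFIG_KWARGS = {
    "read_timeout": 300,             # seconds
    "retries": {"max_attempts": 0},  # report failures immediately
}


def make_boto_config():
    """Build a botocore Config from the settings above (requires botocore)."""
    from botocore.config import Config
    return Config(**BOTO_CONFIG_KWARGS)


print(BOTO_CONFIG_KWARGS)
```

The resulting object would then be passed as the `config` argument when creating the boto3 client or resource used by the tests.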

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
2021-03-18 18:58:08 +02:00
Avi Kivity
777d48e78d sstables: vector parse(): convert to concept
The two vector parse() overloads select between integral members
and non-integral members. Use std::integral to constrain the
integral overload and leave the other unconstrained; C++ will choose
the more constrained version when it applies.
2021-03-18 18:48:11 +02:00
Avi Kivity
bc42aee7c1 sstables: convert parse() for a self-describing type to concept
This parse() overload uses "not integral and not enum" to reject
non-self-describing types. Express it directly with the self_describing
concept instead.
2021-03-18 18:47:00 +02:00
Avi Kivity
a96b8e8aed sstables: read_vint(): convert enable_if to concepts
Convert read_vint() to a concept. The explicitly deleted version
is no longer needed since wrongly-typed inputs will be rejected
by the constraint. Similarly the static assert can be dropped
for the same reason.
2021-03-18 18:45:05 +02:00
Avi Kivity
bba9c1c616 sstables: add concept for self-describing type
Our sstable parsing and writing code contains a self-describing
type concept, where a type can advertise its members via a
describe_type() member function with a specific protocol.

Formalize that into a C++ concept. This is a little tricky, since
describe_type() accepts a parameter that is itself a template, and
requires clauses only work with concrete types. To handle this problem,
create such a concrete example type and use it in the concept.
2021-03-18 17:52:54 +02:00
Botond Dénes
7980140549 test: test_utils: do_check()/do_require(): tone down log to trace
They are way too noisy to be at debug level.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318143547.101932-1-bdenes@scylladb.com>
2021-03-18 16:59:59 +02:00
Raphael S. Carvalho
65b09567dd table: Wire up off-strategy compaction on repair-based bootstrap and replace
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.

We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence isn't accounted for by the compaction backlog tracker yet.

Refs #5226.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c45d2e1d27 table: extend add_sstable_and_update_cache() for off-strategy
Function is extended to add sstable to maintenance set if requested
by the caller.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6ca2ac34ac sstables/compaction_manager: Add function to submit off-strategy work
This new variant will allow its caller to submit off-strategy job
asynchronously on behalf of a given table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
e0e5bf8285 table: Introduce off-strategy compaction on maintenance sstable set
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
effects on read amplification are avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6e95860e09 table: change non_staging_sstables() to filter out off-strategy sstables
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c64a156c53 table: Introduce maintenance sstable set
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.

Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:47 -03:00
Raphael S. Carvalho
1e7a444a8b table: Wire compound sstable set
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, maintenance
set will be introduced and will also be managed by the compound set.

So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.

By storing compound set in _sstables, functions which used _sstables for
creating reader, computing statistics, etc, will not have to be changed
when we introduce the maintenance set, so this approach keeps the code
change to a minimum.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:46:06 -03:00
Raphael S. Carvalho
42b309b43e table: prepare make_reader_excluding_sstables() to work with compound sstable set
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
4e142458eb table: prepare discard_sstables() to work with compound sstable set
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
d25822a030 table: extract add_sstable() common code into a function
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
e4b5f5ba33 sstable_set: Introduce compound sstable set
This new sstable set implementation is useful for combining operation of
multiple sstable sets, which can still be referenced individually via
its shared ptr reference.
It will be used when maintenance set is introduced in table, so a
compound set is required to allow both sets to have their operations
efficiently combined.
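
As a rough illustration of the idea (a Python stand-in, not the C++ implementation; the class and variable names are hypothetical), a compound set keeps references to its member sets instead of copying them, so each member can still be changed on its own:

```python
class CompoundSet:
    """Read-only view over several sets that are still owned individually."""

    def __init__(self, *member_sets):
        # Keep references, not copies: later changes to a member are visible here.
        self._members = member_sets

    def __iter__(self):
        for member in self._members:
            yield from member

    def __len__(self):
        return sum(len(member) for member in self._members)


main_set = ["sst-1"]
maintenance_set = ["sst-9"]
compound = CompoundSet(main_set, maintenance_set)

# Inserting into the main set directly is reflected in the compound view.
main_set.append("sst-2")
```

Code that only reads (creating readers, computing statistics) can iterate over `compound`, while inserts and removals go to the individual member sets.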

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:49 -03:00
Raphael S. Carvalho
1261519266 reshape: STCS: preserve token contiguity when reshaping disjoint sstables
When reshaping hundreds of disjoint sstables, like on bootstrap,
contiguity wasn't being preserved because the heuristic for picking
candidates didn't take into account their token range, which resulted
in reshape messing with the contiguity that could otherwise be
preserved by respecting the token order of the disjoint sstables.
In other words, sstables with the smallest first tokens should be
compacted first. By doing that, the contiguity is preserved even
across size tiers, after reshape has completed its possible multiple
rounds to get all the data in shape.
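
A simplified sketch of the ordering rule, in Python rather than the project's C++. The `SSTable` tuple, the function name and the fixed batch size are assumptions made for the example; the real heuristic also considers size tiers:

```python
from collections import namedtuple

# Hypothetical stand-in for a disjoint sstable and its token range.
SSTable = namedtuple("SSTable", ["name", "first_token", "last_token"])


def pick_reshape_batches(sstables, batch_size):
    """Group disjoint sstables into batches, smallest first token first."""
    ordered = sorted(sstables, key=lambda sst: sst.first_token)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


candidates = [
    SSTable("c", 200, 299),
    SSTable("a", 0, 99),
    SSTable("d", 300, 399),
    SSTable("b", 100, 199),
]
batches = pick_reshape_batches(candidates, batch_size=2)
```

Each batch covers one contiguous token range (`a`+`b`, then `c`+`d`), so the outputs of this round are still disjoint and can be ordered the same way in the next round.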

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:36:18 -03:00
Botond Dénes
ad02f313dd test: mutation_reader_test: add test for permit cleanup
Check that a permit correctly restores the units on the semaphore in
each state it can be destroyed in.
2021-03-18 16:18:22 +02:00
Raphael S. Carvalho
e53cedabb1 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate up to max(max_threshold, 32) sstables.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.
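
Read as "the larger of max_threshold and 32", the new limit can be sketched as follows. This is an illustration only: the function names are hypothetical, and whether the comparison is strict is an assumption:

```python
def relaxed_l0_tolerance(max_threshold, floor=32):
    """Number of level 0 sstables tolerated by relaxed-mode reshape (sketch)."""
    return max(max_threshold, floor)


def needs_l0_reshape(l0_sstable_count, max_threshold):
    # Assumption: reshape is needed only once the tolerance is exceeded.
    return l0_sstable_count > relaxed_l0_tolerance(max_threshold)
```

For example, a table with 5 sstables in level 0 crossed the old limit of min_threshold (4) but stays well below the new one.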

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
2021-03-18 15:58:21 +02:00
Botond Dénes
2b7c1bce86 scylla-gdb.py: add variant_member convenience function
Allow conveniently accessing the active member of an `std::variant`
instance.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318134427.92668-1-bdenes@scylladb.com>
2021-03-18 15:57:51 +02:00
Konstantin Osipov
fcc6e621f8 raft: pass snapshot_reply into fsm::step()
By the time we receive snapshot_reply from a follower
we may no longer be the leader. Follower term may be
different from snapshot term, e.g. the follower may
be aware of a new leader already and have a higher term.

We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
2021-03-18 16:56:46 +03:00
Konstantin Osipov
4afa662d62 raft: respond with snapshot_reply to send_snapshot RPC
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.

Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
  thus ensure that FSM testing results produced when using a test
  transport are representative of the real world uses of
  raft::rpc.
2021-03-18 16:56:42 +03:00
Konstantin Osipov
cb3314d756 raft: set follower's next_idx when switching to SNAPSHOT mode
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.

The change helps reduce protocol state managed outside FSM.
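
A small Python model of the rule, for illustration only. The class, method and state names are assumptions and do not mirror the C++ code:

```python
class FollowerProgress:
    """Leader-side bookkeeping for one follower (hypothetical model)."""

    PROBE = "PROBE"
    SNAPSHOT = "SNAPSHOT"

    def __init__(self, next_idx):
        self.state = self.PROBE
        self.next_idx = next_idx

    def become_snapshot(self, snapshot_idx):
        # Everything up to snapshot_idx is covered by the snapshot, so the
        # next log entry to replicate is the one right after it.
        self.state = self.SNAPSHOT
        self.next_idx = snapshot_idx + 1

    def become_probe(self):
        # Used if the transfer fails: the leader probes for the follower's
        # real position again before retrying.
        self.state = self.PROBE


progress = FollowerProgress(next_idx=3)
progress.become_snapshot(snapshot_idx=10)
```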
2021-03-18 16:35:11 +03:00
Ivan Prisyazhnyy
f00391af8b tracing: fast mode unit test
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:09 +02:00
Ivan Prisyazhnyy
7cbe2aa9c6 tracing: rest api for lightweight slow query tracing
The patch adds REST API support for the lightweight
slow query tracing (fast) mode that is implemented by
omitting all of the trace events during the tracing.

    $ curl -v http://localhost:10000/storage_service/slow_query
    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true
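
The same two requests can be built with Python's standard library. This sketch assumes the endpoint and the `fast`/`enable` parameters exactly as in the curl lines above; the function names are hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:10000/storage_service/slow_query"


def slow_query_status_request():
    """GET request that reads the current slow query tracing settings."""
    return Request(BASE, method="GET")


def slow_query_update_request(enable, fast):
    """POST request that enables/disables slow query tracing and its fast mode."""
    query = urlencode({"fast": str(fast).lower(), "enable": str(enable).lower()})
    return Request(f"{BASE}?{query}", method="POST")
```

Passing either object to `urllib.request.urlopen()` sends it to a node listening on the address above.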

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:05 +02:00
Ivan Prisyazhnyy
85fbca2049 tracing: omit tracing session events and subsessions in fast mode
If tracing::tracing::_ignore_trace_events is enabled then
the tracing system must ignore all session events
for non full_tracing sessions (probability tracing and
user requested) and creating subsessions with the
make_trace_info.

Patch introduces the slow query tracing fast mode that
omits all events during tracing.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:04:47 +02:00
Botond Dénes
c822f0d02a test: querier_cache_test: add memory based cache eviction test
Ensure that the memory consumption of querier cache entries is kept
under the limit.
2021-03-18 14:58:21 +02:00
Botond Dénes
a14bb4ba94 reader_permit: add inactive state
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore so they won't prevent their own
admission creating a deadlock. However, when in inactive state, we do
want to keep tabs on their resource consumption so we don't accumulate
too much of these inactive reads. So introduce a new state for these
non-admitted inactive reads. When entering the inactive state, the
permit registers its cost with the semaphore, and when unregistered as
inactive, it retracts it. This is a workaround (khm hack) until #4758 is
solved and all permits will be admitted on creation.
2021-03-18 14:58:21 +02:00
Botond Dénes
594636ebbf querier: insert(): account immediately evicted querier as resource based eviction
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
2021-03-18 14:57:57 +02:00
Botond Dénes
1a337d0ec1 reader_concurrency_semaphore: fix clear_inactive_reads()
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call to the destructor (to ensure inactive reads are cleaned up
under any circumstances), as well as a unit test.
2021-03-18 14:57:57 +02:00
Botond Dénes
581edc4e4e reader_concurrency_semaphore: make inactive_read_handle a weak reference
Having the handle keep an owning reference to the inactive read led to
awkward situations, where the inactive read is destroyed during eviction
in certain situations only (querier cache) and not in other cases.
Although the users didn't notice anything from this, it led to very
brittle code inside the reader concurrency semaphore. Among others, the
inactive read destructor has to be open coded in evict() which already
led to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore but the handle now keeps a weak
pointer to them. When destroyed the handle will destroy the inactive
read if it is still alive. When evicting the inactive read, it will
set the pointer in the handle to null.
2021-03-18 14:57:57 +02:00
Botond Dénes
cbc83b8b1b reader_concurrency_semaphore: make evict() noexcept
In the next patch it will be called from a destructor.
2021-03-18 14:57:57 +02:00
Botond Dénes
2d348e0211 reader_concurrency_semaphore: update out-of-date comments 2021-03-18 14:57:57 +02:00
Botond Dénes
3b8220f777 scylla-gdb.py: update w.r.t. storage_proxy::_hints_manager not being optional
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318110256.50137-1-bdenes@scylladb.com>
2021-03-18 12:47:57 +01:00
Piotr Sarna
2509b7dbde Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.
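
The hazard can be shown with a small Python analogy (an illustration only, not code from the series). The `Ordering` class is a hypothetical stand-in for std::strong_ordering, which likewise cannot be converted to a boolean:

```python
def tri_compare_int(a, b):
    """Trichotomic comparator returning a plain int: -1, 0 or 1."""
    return (a > b) - (a < b)


class Ordering:
    """Stand-in for std::strong_ordering: refuses to act as a boolean."""

    def __init__(self, sign):
        self.sign = sign

    def __eq__(self, other):
        return isinstance(other, Ordering) and self.sign == other.sign

    def __hash__(self):
        return hash(self.sign)

    def __bool__(self):
        raise TypeError("Ordering is not a boolean; compare it with LESS/EQUAL/GREATER")


LESS, EQUAL, GREATER = Ordering(-1), Ordering(0), Ordering(1)


def tri_compare(a, b):
    """Same comparison, but returning an Ordering instead of an int."""
    return Ordering((a > b) - (a < b))


def misuse_as_less(cmp, a, b):
    """The bug pattern: treating a tri-comparator's result as 'a < b'."""
    return bool(cmp(a, b))
```

With the int version, `misuse_as_less(tri_compare_int, 2, 1)` quietly answers true even though 2 is not less than 1. With the `Ordering` version the same mistake raises an error, much as the C++ conversion fails to compile.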

Closes #8225

* github.com:scylladb/scylla:
  dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
  pager: rephrase misleading comparison check
  test: total_order_checks: prepare for std::strong_ordering
  test: mutation_test: prepare merge_container for std::strong_ordering
  intrusive_array: prepare for std::strong_ordering
  utils: collection-concepts: prepare for std::strong_ordering
2021-03-18 11:51:54 +01:00
Avi Kivity
378556418c dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change is trivial.
2021-03-18 12:40:05 +02:00
Avi Kivity
4ead1a79ce pager: rephrase misleading comparison check
We check !result_of_tri_compare, which makes it look like we're
checking a boolean predicate, whereas we're really checking for
equality. Change to result_of_tri_compare == 0, which is less likely
to be confusing, and is also compatible with std::strong_ordering.
2021-03-18 12:40:05 +02:00
Avi Kivity
a5f17b9a2d test: total_order_checks: prepare for std::strong_ordering
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
2021-03-18 12:40:05 +02:00
Avi Kivity
f0092ae475 test: mutation_test: prepare merge_container for std::strong_ordering
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
2021-03-18 12:40:05 +02:00
Avi Kivity
fe0f983dfb intrusive_array: prepare for std::strong_ordering
Newer comparators can return std::strong_ordering, so don't
expect an int.
2021-03-18 12:40:05 +02:00
Avi Kivity
9fbe4850c9 utils: collection-concepts: prepare for std::strong_ordering
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
2021-03-18 12:40:03 +02:00
Piotr Sarna
0bcf584992 docs: mention --no-rebase in maintainer.md
For a default git config it's enough to pull with --no-ff
to ensure that a merge commit is created, but with a custom
configuration, it's better to also explicitly prevent
rebasing.

Message-Id: <7dc6027f1f38fa4db7435592a3b72308b1a08614.1616063525.git.sarna@scylladb.com>
2021-03-18 12:38:29 +02:00
Piotr Sarna
5a852d3812 Merge 'Decouple memory limiter sem from storage service' from Pavel
This set removes a few more calls to the global storage service and prevents
more of them from appearing in thrift, which is about to start using the memory
limiter semaphore too.

The set turns this semaphore into a sharded one living in the scope of
main(), makes others use the local instance and removes the no longer
needed bits from storage service.

tests: unit(dev)
branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem

* xemul_drop_memory_limiter:
  storage_service: Drop memory limiter
  memory_limiter: Use main-local instance everyehere
  main: Have local memory limiter and carry where needed
  memory_limiter: Encapsulate memory limiting facility
  cql_server: Remove semaphore getter fn from config
2021-03-18 11:29:32 +01:00
Pavel Emelyanov
dcdd207349 storage_service: Drop memory limiter
Nobody uses it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
f0a79574d4 memory_limiter: Use main-local instance everywhere
The cql_server and alternator both need the limiter, so
patch them to stop using storage service's one and use
the main-local one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
359e9caf54 main: Have local memory limiter and carry where needed
Prepare memory limiters to have non-global instance of
the service. For now the main-local instance is not
used and (!) is not stopped for real, just like the
storage_service's one is.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
4ca2ae1341 memory_limiter: Encapsulate memory limiting facility
The storage service carries a semaphore and a size_t value
to facilitate the memory limiting for client services.

This patch encapsulates both fields in a separate helper
class that will be used by whoever needs it without
messing with the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
c2f94fb527 cql_server: Remove semaphore getter fn from config
The cql_server() needs to get the memory limiter semaphore
from the local storage service instance. To make this happen
a callback is introduced on the config structure. The same
can be achieved in a simpler manner -- by providing the
local storage service instances directly.

Actually, the storage service will be removed in further
patches from this place, so this patch is mostly to get
rid of the callback from the config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Nadav Har'El
4a7d3175e9 test/alternator: make another test faster
The slowest test in test_streams.py is test_list_streams_paged. It is meant
to test the ListStreams operation with paging. The existing test repeated
its test four times, for four different stream types. However, there is
no reason to suspect that the ListStreams operation might somehow be
different for the four stream types... We already have other tests which
create streams of the four types, and use these streams - we don't
need the test for ListStreams to also test creating the four types.

By doing this test just once, not four times, we can save around 1.5
seconds of test time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
79af728335 test/alternator: make tracing test a bit faster
In the test test_tracing.py::test_tracing_all, we do some operations and
then need to wait until they appear in the tracing table.
The current code used an exponentially-increasing delay during this wait,
starting with 0.1 seconds and then doubling the delay until we find what
we're looking for.

However, it turns out that the delay until the data appears in the table
is deliberately chosen by Scylla - and is always around 2 seconds.
In this case, an exponential delay is really bad - we will usually wait
for around 1 second too long after the needed wait of 2 seconds.

So in this patch we replace the exponential delay by a constant delay -
we wait 0.3 seconds between each retry.

This change makes the test test_tracing.py::test_tracing_all finish
in a little over 2 seconds, instead of a little over 3 seconds
before this patch. We cannot reduce this 2 second time any further
unless we make the 2-second tracing delay configurable.
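
The arithmetic behind these numbers can be checked with a small simulation. It is a sketch that assumes each check is instantaneous and the data appears exactly 2 seconds in; the function names are hypothetical:

```python
import itertools


def exponential_delays_ms(first_ms):
    """Yield first_ms, 2*first_ms, 4*first_ms, ... (the old retry schedule)."""
    delay = first_ms
    while True:
        yield delay
        delay *= 2


def total_wait_ms(delays_ms, ready_at_ms):
    """Sleep through the schedule; return the time of the first check made at
    or after ready_at_ms."""
    elapsed = 0
    for delay in delays_ms:
        elapsed += delay
        if elapsed >= ready_at_ms:
            return elapsed


# Old schedule: checks at 100, 300, 700, 1500 and 3100 ms.
old_wait = total_wait_ms(exponential_delays_ms(100), ready_at_ms=2000)
# New schedule: checks every 300 ms, the first one past 2000 ms is at 2100 ms.
new_wait = total_wait_ms(itertools.repeat(300), ready_at_ms=2000)
```

Under these assumptions the old schedule overshoots the 2 second mark by 1.1 seconds and the new one by only 0.1 seconds.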

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
4e87f95b42 test/alternator: remove slow and unhelpful test
The test test_table.py::test_table_streams_on creates tables with various
stream types, and then immediately deletes them without testing anything.
This is a slow test (taking almost a full second on my laptop), and is
redundant because in test_streams.py we have tests which create tables
with streams in the same way - but then actually test that things work
with these streams. So this test might as well be removed, and this is
what we do in this patch.

Removing this test shaves another second from the Alternator test suite's
run time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
879656e3e0 test/alternator: make a test faster, safer and more correct
The test
test_condition_expression.py::test_condition_expression_with_forbidden_rmw
takes half a second to run (dev build, on my laptop), one of the slowest
tests in Alternator's test suite. Part of the reason was that it needlessly
set the same table to forbidden_rmw, multiple times.

Instead of doing that, we switch to using the test_table_s_forbid_rmw
fixture, which is a table like test_table_s but created just once in
forbid_rmw mode.

The result is a faster test (0.05 seconds instead of 0.5 seconds), but
also safer if we ever want to run tests in parallel. It also fixes a
bug in the test: At the end of the test, we intended to double-check
that although the forbid_rmw table forbids read-modify-write operations,
it does allow pure writes. Yet the test did this after clearing the
forbid_rmw mode... So after this patch the test verifies this on the
forbid_rmw table, as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
1c2e473e62 test/alternator: make a test faster
The test
test_condition_expression.py::test_condition_expression_with_permissive_write_isolation

Currently takes (on my laptop, dev build) a full two seconds, one of
the slowest tests. It is not surprising it is slow - it runs five other
tests three times each (for three different write isolation modes),
but it doesn't have to be this slow. Before this patch, for each of
the five tests we switch the write isolation mode three times, and
these switches involve schema changes and are fairly slow. So in
this patch we reverse the loop - and switch the write isolation mode
to the outer loop.
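
The effect of swapping the loops can be counted directly. In this sketch the mode and test names are placeholders, not the real ones:

```python
MODES = ["mode_a", "mode_b", "mode_c"]      # three write isolation modes (placeholders)
TESTS = [f"test_{i}" for i in range(5)]     # the five reused tests (placeholders)


def count_switches(runs):
    """Count how often the write isolation mode changes over a run sequence."""
    switches = 0
    current = None
    for mode, _test in runs:
        if mode != current:
            switches += 1
            current = mode
    return switches


# Before: tests in the outer loop, so every single run needs a mode switch.
tests_outer = [(mode, test) for test in TESTS for mode in MODES]
# After: modes in the outer loop, so each mode is set once for all five tests.
modes_outer = [(mode, test) for mode in MODES for test in TESTS]
```

Both orders run the same 15 combinations, but the first needs 15 schema-changing switches and the second only 3.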

This patch halves the runtime of this test - from two seconds to one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Takuya ASADA
d9a625c842 scylla_setup: don't run node-exporter setup when it's not installed
We need to run the package existence check before running the setup of
node-exporter.

Fixes #8276

Closes #8278
2021-03-18 11:24:18 +01:00
Avi Kivity
f038d1555c Merge 'Add more context to configure.py' from Piotr Sarna
This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker.

Closes #8267

* github.com:scylladb/scylla:
  configure: print more context if the linking attempt failed
  configure: provide more context on failed ./configure.py run
  configure: add verbose option to try_compile_and_link
2021-03-18 11:24:18 +01:00
Takuya ASADA
0424a41e30 tools/toolchain: stop ignoring error on install-dependencies.sh, run jmx/java script correctly
We should run install-dependencies.sh with the -e option so that errors
in the script are not ignored.
Also, need to add tools/jmx/install-dependencies.sh and
tools/java/install-dependencies.sh, to fix 'No such file or directory'
error on them.

Fixes #8293

Closes #8294

[avi: did not regenerate toolchain image, since no new packages are
      installed]
2021-03-18 11:24:18 +01:00
Avi Kivity
b91d6776a0 Update tools/java submodule
* tools/java fdc8fcc22c...7b66b7a0fc (1):
  > dist/redhat: add support SLES
2021-03-18 11:24:18 +01:00
Nadav Har'El
bd742f2951 Merge 'treewide: get rid of incorrect reinterpret casts' from Michał Chojnowski
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.

Closes #8241

* github.com:scylladb/scylla:
  treewide: get rid of unaligned_cast
  treewide: get rid of incorrect reinterpret casts
2021-03-18 11:24:18 +01:00
Benny Halevy
7862cad669 sstable_set: partitioned_sstable_set: clone: do clone all sstables
The existing implementation wrongfully shares _all sstables
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.
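
The share-on-copy mistake can be reproduced in miniature. This is a Python analogy with hypothetical names; Python has no use-after-free, but the unintended sharing is the same:

```python
class SSTableSet:
    """Toy set with one container, standing in for the `_all` list."""

    def __init__(self, all_sstables=None):
        self._all = all_sstables if all_sstables is not None else []

    def buggy_clone(self):
        # Bug: the "clone" points at the very same list as the original.
        return SSTableSet(self._all)

    def clone(self):
        # Fix: the clone gets its own copy of the list.
        return SSTableSet(list(self._all))

    def erase(self, sstable):
        self._all.remove(sstable)


original = SSTableSet(["sst-1", "sst-2"])
shared = original.buggy_clone()
shared.erase("sst-1")           # also removes it from `original`

fixed_original = SSTableSet(["sst-1", "sst-2"])
copy = fixed_original.clone()
copy.erase("sst-1")             # `fixed_original` is unaffected
```

In the C++ code the equivalent erase happened while another fiber was still traversing the shared container, which is what turned the sharing into a use-after-free.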

The regression was introduced in
c3b8757fa1.

Added a unit test that reproduces the share-on-copy issue
for partitioned_sstable_set (sstables::sstable_set).

Fixes #8274

Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Piotr Sarna
ea096de1b4 service, transport: avoid using private storage_service fields
... in the transport controller. Instead, simple getters would suffice.

Message-Id: <582a71d0c1b61edf0107f5a2ef96536c395972d0.1615988516.git.sarna@scylladb.com>
2021-03-18 11:15:59 +02:00
Nadav Har'El
42169b2eef Merge 'Alternator: add slow query logging' from Piotr Sarna
This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced.

In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but https://github.com/scylladb/scylla/issues/2572 's assumption is that this tracing might not always be available in the future.

This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`.

Tests:
unit(dev, alternator pytest)
manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced.

Fixes #8292

Example trace:

```cql
cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;

 parameters                                                                                  | duration
---------------------------------------------------------------------------------------------+----------
 {'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} |    75732
```

Closes #8298

* github.com:scylladb/scylla:
  alternator: add test for slow query logging
  alternator: allow enabling slow query logging
  tracing: allow providing a custom session record param
2021-03-18 11:15:59 +02:00
Avi Kivity
de45575ea9 Merge "Allow all supported compaction types to be stopped by nodetool stop" from Raphael
"
All compaction types can now be stopped with the nodetool stop
command, example: nodetool stop SCRUB

Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB,
INDEX_BUILD, RESHARD, UPGRADE, RESHAPE.
"

* 'stop_compaction_types_v2' of github.com:raphaelsc/scylla:
  compaction: Allow all supported compaction types to be stopped
  compaction: introduce function to map compaction name to respective type
  compaction: refactor mapping of compaction type to string
  compaction: move compaction_name() out of line
2021-03-18 11:15:59 +02:00
Botond Dénes
981699ae76 sstables: move promoted_index_blocks_reader into own header
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Botond Dénes
5859195b36 sstables: mx/parser.hh: add missing include
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093806.34858-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Benny Halevy
2e7677f76b sstables: sstable_set_impl: include mutation_reader.hh
To make sstables/sstable_set_impl.hh self-sufficient.
mutation_reader.hh provides position_reader_queue,
needed by time_series_sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210317094223.590067-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Konstantin Osipov
66c729da66 raft: set the current leader upon getting InstallSnapshot
If the current leader is set, the follower will not vote
for another candidate. This is also known as the "sticky leadership" rule.

Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
2021-03-18 08:36:57 +03:00
Michał Chojnowski
5c3385730b treewide: get rid of unaligned_cast
unaligned_cast violates strict aliasing rules. Replace it with
safe equivalents.
2021-03-17 17:00:41 +01:00
Michał Chojnowski
4e35befcf2 treewide: get rid of incorrect reinterpret casts
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
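
A minimal sketch of the copy-based approach, assuming trivially copyable
types. The helper names `read_unaligned` and `write_unaligned` are
hypothetical and are not taken from the patch:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not a Scylla API): read a T from possibly unaligned
// memory by copying its bytes into a fresh object instead of casting pointers.
template <typename T>
T read_unaligned(const void* src) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
    T value;
    std::memcpy(&value, src, sizeof(T));
    return value;
}

// Hypothetical helper: store a T into possibly unaligned memory the same way.
template <typename T>
void write_unaligned(void* dst, const T& value) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
    std::memcpy(dst, &value, sizeof(T));
}
```

For example, `read_unaligned<std::uint32_t>(buf + 1)` reads four bytes
starting at an odd offset without creating a misaligned or aliasing pointer.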
2021-03-17 17:00:38 +01:00
Piotr Sarna
efe734c575 alternator: add test for slow query logging
The test checks whether slow queries are properly logged
in the system_traces.node_slow_log system table.
The test is deterministic because it uses the threshold of 0ms
to qualify a query as slow, which effectively makes all queries
"slow enough".
2021-03-17 13:24:26 +01:00
Benny Halevy
6846319e65 partitioned_sstables_set: insert: propagate exception
Do not swallow the caught exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210316170821.496218-1-bhalevy@scylladb.com>
2021-03-17 13:29:03 +02:00
Piotr Sarna
f9adee70d2 alternator: allow enabling slow query logging
Alternator is now aware of the slow query logging configuration
and can start tracing slow queries.
2021-03-17 11:20:42 +01:00
Piotr Sarna
5386739354 tracing: allow providing a custom session record param
The mechanism of session record params is currently only used
to store query strings and a couple more params like consistency level,
but since we now have more frontends than just CQL and Thrift,
it would be nice to also allow the users to put custom parameters in
there.
An immediate first user of this mechanism would be alternator,
which is going to put the operation type under the "alternator_op" key.
The operation type is not part of the query string due to how DynamoDB's
protocol works - the op type is stored separately in the HTTP header.
While it's possible to extract the operation type from the session_id,
it might not be the case once #2572 is implemented.
2021-03-17 11:14:28 +01:00
Gleb Natapov
32d386d0d8 raft: fix use after free during logging in append_entries_reply()
As the existing comment explains, a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.

Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
2021-03-17 09:59:22 +02:00
Dejan Mircevski
8db24fc03b cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker is bound to null.  Throw
invalid_request_exception instead.
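
The shape of the fix can be sketched as follows. This is an illustration
only: `invalid_request_sketch`, `matches_in` and the error text are
made up for the example and do not mirror the cql3 code.

```cpp
#include <algorithm>
#include <optional>
#include <stdexcept>
#include <vector>

// Stand-in for the real exception type; only the idea is borrowed.
struct invalid_request_sketch : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// `bound_list` models the value bound to the `IN ?` marker; std::nullopt
// models a marker bound to null.
bool matches_in(int column_value, const std::optional<std::vector<int>>& bound_list) {
    if (!bound_list) {
        // Reject the request instead of dereferencing a missing list.
        throw invalid_request_sketch("null is not a valid value for an IN restriction");
    }
    return std::find(bound_list->begin(), bound_list->end(), column_value) != bound_list->end();
}
```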

Fixes #8265

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8287
2021-03-17 09:59:22 +02:00
Avi Kivity
1afd6fbe06 hashing: appending_hash: convert from enable_if to concepts
A little simpler to understand.

Closes #8288
2021-03-17 09:59:22 +02:00
Piotr Sarna
7961a28835 Merge 'storage_proxy: Include counter writes in...
...  `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz

With this change, coordinator prefers himself as the "counter leader", so if
another endpoint is chosen as the leader, we know that coordinator was not a
member of replica set. With this guarantee we can increment
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing different leader (that metric used to neglect the counter
updates).

The motivation for this change is to have a more reliable way of counting
non-token-aware queries.

Fixes #4337
Closes #8282

* github.com:scylladb/scylla:
  storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
  counters: Favor coordinator as leader
2021-03-17 09:59:22 +02:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

Iff we need to pre-fill segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Dejan Mircevski
992d5c6184 cql3/expr: Improve column printing
Before this change, we would print an expression like this:

((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)

Now, we print the same expression like this:

(c = 0000007b)

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8285
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
40121621f6 Merge "Kill some get_local_migration_manager() calls" from Pavel Emelyanov
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework is needed.

The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.

Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.

The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.

Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is a higher-level service than the storage proxy; it is the migration
manager that currently calls the storage proxy, not vice versa.

* xemul/br-kill-some-migration-managers-2:
  cql3: Get database directly from query processor
  thrift: Use query_processor::get_migration_manager()
  table_helper: Use query_processor::get_migration_manager()
  cql3: Use query_processor::get_migration_manager() (lambda captures cases)
  cql3: Use query_processor::get_migration_manager() (alter_type statement)
  cql3: Use query_processor::get_migration_manager() (trivial cases)
  query_processor: Keep migration manager onboard
  cql3: Pass query processor to announce_migration:s
  cql3: Switch to qp (almost) in schema-altering-stmt
  cql3: Change execute()'s 1st arg to query_processor
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
2065e2c912 partitioned_sstable_set: adjust select_sstable_runs() to work with compound set
compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
without this adjustment, selection of runs would fail because
function would try to unconditionally retrieve the run which may
live somewhere else.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
02b2df1ea9 sstable_set: move select_sstable_runs() into partitioned_sstable_set
after compound set is introduced, select_sstable_runs() will no longer
work because the sstable runs live in sstable_set, but they should
actually live in the sstable_set being written to.

Given that runs is a concept that belongs only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() to it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Avi Kivity
11308c05f4 Update tools/jmx submodule
* tools/jmx 15c1d4f...9c687b5 (1):
  > dist/redhat: add support SLES
2021-03-17 09:59:22 +02:00
Calle Wilund
a0745f9498 messaging_service: Enforce dc/rack membership iff required for non-tls connections
When internode_encryption is "rack" or "dc", we should enforce incoming
connections are from the appropriate address spaces iff answering on
non-tls socket.

This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.
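
The check done by the "mixed" handler might look roughly like this. It is
a sketch under assumptions: the `encrypt_what` enum, the `location` struct
and the treatment of the `none`/`all` modes are invented for the example,
and the real code asks the snitch for the remote address's dc and rack.

```cpp
#include <string>

// Hypothetical model of the internode_encryption setting.
enum class encrypt_what { none, rack, dc, all };

// Hypothetical result of a snitch lookup for an address.
struct location {
    std::string dc;
    std::string rack;
};

// Decide whether a connection arriving on the non-TLS socket may proceed.
bool accept_plaintext(encrypt_what mode, const location& local, const location& remote) {
    switch (mode) {
    case encrypt_what::none:
        return true;   // nothing is encrypted, plaintext is always fine
    case encrypt_what::all:
        return false;  // everything must use TLS
    case encrypt_what::dc:
        return local.dc == remote.dc;  // only cross-dc traffic needs TLS
    case encrypt_what::rack:
        return local.dc == remote.dc && local.rack == remote.rack;
    }
    return false;
}
```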

Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"

Note that IP-level checks are not exhaustive. If a user is also using
"require_client_auth" with a dc/rack TLS setting, we should warn them that
there is a possibility that someone could spoof their way past the
authentication.

Closes #8051
2021-03-17 09:59:22 +02:00
Avi Kivity
bcd41cb32d Merge 'Support installing our rpm to SLES' from Takuya ASADA
Basic SLES support was already added in f20736d93d, but only for the offline installer.
This fixes a few more problems with installing our rpm on SLES.
After this change, we can install our rpm on both CentOS/RHEL and SLES with a single image, like the unified deb.
SLES uses its own package manager, 'zypper', but it supports yum repositories, so no change is required for the repo.

Closes #8277

* github.com:scylladb/scylla:
  scylla_coredump_setup: support SLES
  scylla_setup: use rpm to check package availability for SLES
  dist: install optional packages for SLES
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
cc0bb92afe Merge "raft: provide a ticker for each raft server" from Pavel Solodovnikov
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.

The patch set also changes several other things in order
for tickers to work:

1. A bug in `raft_sys_table_storage` which caused an exception
   if `raft::server::start` is called without any persisted state.
2. `raft_services::add_server` now automatically calls
   `raft::server::start()` since a server instance should be started
   before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
   artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
   throwing "not implemented" exception in `abort()`.

* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
  raft/raft_services: provide a ticker for each raft server
  raft/raft_services: switch from plain `throw` to `on_internal_error`
  raft/raft_services: start server instance automatically in `add_server`
  raft: return ready future instead of throwing in schema_raft_state_machine
  raft: allow raft server to start with initial term 0
  raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
2021-03-17 09:59:22 +02:00
Nadav Har'El
e344f74858 Merge 'logalloc: improve background reclaim shares management' from Avi Kivity
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.

Fixes #8234.

Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)

Closes #8281

* github.com:scylladb/scylla:
  test: logalloc_test: harden background reclain test against cpu overcommit
  logalloc: background reclaim: use default scheduling group for adjusting shares
  logalloc: background reclaim: log shares adjustment under trace level
  logalloc: background reclaim: fix shares not updated by periodic timer
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
aaea8c6c7d raft/raft_services: provide a ticker for each raft server
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.

Note that the timer should start after the server calls
its `start()` method, because otherwise it would crash
since fsm is not initialized yet.

Currently, the tick interval is hardcoded to be 100ms.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
1496a3559f raft/raft_services: switch from plain throw to on_internal_error
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
975c9a8021 raft/raft_services: start server instance automatically in add_server
Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. raft protocol state machine.
Otherwise, it will likely result in a crash.

Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.

In case some exception happened inside `add_server`,
the `init` function will de-initialize what it already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break further initialization
process and, what is more important, will prevent raft
rpc verbs deinitialization. This will cause a crash in
`messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
0b3dba07bd raft: return ready future instead of throwing in schema_raft_state_machine
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
93c565a1bf raft: allow raft server to start with initial term 0
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.

This restriction is completely artificial and can be lifted
without any problems, which will be described below.

The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).

This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.

Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.

In either case term will never be 0, because:

1. If the term has changed, then it's naturally greater than 0,
   since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
   a vote request message. In such case we have already updated
   our term to the requester's term.

Switch to using an explicit optional in `fsm_output` so that
a reader doesn't have to think about the motivation behind this `if`
and can just check that the `term_and_vote` optional is engaged.
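
A simplified sketch of the idea, with stand-in types (the real
`fsm_output` and id types differ; `fsm_output_sketch` and
`needs_persist` are hypothetical names):

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Simplified stand-ins for the raft types.
using term_t = std::uint64_t;
using server_id = std::uint64_t;

struct fsm_output_sketch {
    // Engaged only when term or vote changed and must be persisted.
    std::optional<std::pair<term_t, server_id>> term_and_vote;
};

// What an io_fiber-like loop would check before persisting.
// Term 0 is a perfectly valid value here; only engagement matters.
bool needs_persist(const fsm_output_sketch& out) {
    return out.term_and_vote.has_value();
}
```

With this shape, a `term != 0` sentinel is no longer needed to signal
"nothing to persist".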

Given the motivation described above, the corresponding

    assert(_fsm->get_current_term() != term_t(0));

in `server_impl::start` is removed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
ae5f26adec raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fails with an
exception.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Juliusz Stasiewicz
f77d0f5439 storage_proxy: Include counter writes in writes_coordinator_outside_replica_set
Coordinator prefers himself as the "counter leader", so if another
endpoint is chosen as the leader, we know that coordinator was
not a member of replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).
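
The reasoning can be sketched like this. All names are hypothetical and
the fallback choice (first replica) is an assumption made only to keep the
example short:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using endpoint = std::string;

// Hypothetical stats holder; the field mirrors the metric name.
struct write_stats_sketch {
    std::uint64_t writes_coordinator_outside_replica_set = 0;
};

// Pick the counter leader, preferring the coordinator itself.
// Precondition: `replicas` is not empty.
endpoint pick_counter_leader(const std::vector<endpoint>& replicas,
                             const endpoint& self,
                             write_stats_sketch& stats) {
    if (std::find(replicas.begin(), replicas.end(), self) != replicas.end()) {
        return self;  // coordinator is a replica, so it leads
    }
    // Another endpoint leads, hence the coordinator is outside the replica set.
    ++stats.writes_coordinator_outside_replica_set;
    return replicas.front();
}
```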

Fixes #4337
2021-03-16 12:07:16 +01:00
Juliusz Stasiewicz
5689106b92 counters: Favor coordinator as leader
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
2021-03-16 12:07:13 +01:00
Pavel Emelyanov
a7a5ad4ded range_tombstone_stream: Remove unused methods
Both methods apply a list of tombstones to the stream. One
was unused even before the set, the other one became unused
after previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
2e6255c499 partition_snapshot_reader: Emit range tombstones on demand
Currently the reader gets all range tombstones from the given
range and places them into a stream. When filling the buffer
with fragments the range tombstones are extracted from the
stream one by one.

This is memory consuming, the reader's memory usage shouldn't
depend on the number of inhabitants in the partition range.

The patch implements the heap-based cursor for range tombstones
almost like it's done for rows.

The heap contains range_tombstone_list::iterator_ranges, the
tombstones are popped from the heap when needed, are applied
into the stream and then are emitted from it into the buffer.
The refresh_state() is called on each new range to set up the
iterators, and when lsa reports references invalidation to
refresh the iterators. To let the refresh_state revalidate the
iterators, the position at which the last range tombstone was
emitted is maintained.
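
As an illustration of the heap-of-iterator-ranges idea (not the reader's
actual code), here is a sketch that merges several sorted "versions" on
demand. Integers stand in for range tombstones, and `heap_cursor` is a
hypothetical name:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using version_t = std::vector<int>;  // one sorted list per partition version
using iter_range = std::pair<version_t::const_iterator, version_t::const_iterator>;

// Pops elements from several sorted versions in global order, holding only
// one iterator pair per version. `versions` must outlive the cursor.
class heap_cursor {
    std::vector<iter_range> _heap;

    // Orders the heap so that the range with the smallest head is on top.
    static bool later(const iter_range& a, const iter_range& b) {
        return *a.first > *b.first;
    }
public:
    explicit heap_cursor(const std::vector<version_t>& versions) {
        for (const auto& v : versions) {
            if (!v.empty()) {
                _heap.emplace_back(v.begin(), v.end());
            }
        }
        std::make_heap(_heap.begin(), _heap.end(), later);
    }

    bool empty() const { return _heap.empty(); }

    // Precondition: !empty().
    int pop() {
        std::pop_heap(_heap.begin(), _heap.end(), later);
        iter_range& top = _heap.back();
        int value = *top.first++;
        if (top.first == top.second) {
            _heap.pop_back();  // this version is exhausted
        } else {
            std::push_heap(_heap.begin(), _heap.end(), later);
        }
        return value;
    }
};
```

Memory use here depends on the number of versions, not on the number of
elements in the range, which is the property the patch is after.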

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
ef61f84426 partition_snapshot_reader: Introduce maybe_refresh_state
The existing refresh_state() is supposed to setup or revalidate
iterators to rows inside partition versions if needed. It will
be called in more than one place soon, so here's the helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
5e0a8130d4 partition_snapshot_reader: Move range tombstone stream member
The lsa_partition_reader is the helper sub-class for
partition_snapshot_reader that, among other things, is
responsible for filling the stream of range tombstones,
that's then used by the reader itself.

Next patches will change the way range tombstones are
emitted by the reader, so hide the stream inside the
helper subclass in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
755d993031 partition_snapshot_reader: Add reset_state method to helper class
This method "notifies" the lsa_reader helper class when the owning
reader moves to a new range. This method is now empty, but will be
used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:07:20 +03:00
Pavel Emelyanov
a387fbd984 partition_snapshot_reader: Downgrade heap comparator
Next patch will extend the comparator to manage heap of
range tombstones. Not to add yet another comparator to
it (and not to create another heap comparator class) just
use the comparator that's common for both -- rows and range
tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:06:19 +03:00
Pavel Emelyanov
2179014efa partition_snapshot_reader: Use on-demand comparators
There are already two raii-sh comparators on reader, next patch will
need to add the third. This just bloats the reader, the comparators
in question are state-less and can be created on demand for free.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:04:47 +03:00
Pavel Emelyanov
c8b2079705 range_tombstone_list: Add new slice() helper
There are two of them now -- one to return iterator_range that
covers the given query::clustering_range, the other to return
it for two given positions.

In the next patch a 3rd one is needed -- a slice() that returns an
iterator_range that

a) starts strictly after a given position
b) ends after the given clustering_range's end

It will be used to refresh the range tombstones iterators after
some of them have been emitted. The same thing is currently
done by partition_snapshot_reader's refresh_state wrt rows:

    if (last_row)
        start = rows.upper_bound(last_row) // continuation
    else
        start = rows.lower_bound(range.start) // initial

    end = rows.upper_bound(range.end) // end is the same in
                                      // either case

Respectively for range tombstones the goal is the same.
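
The pseudocode above can be made concrete with an ordered set. This is a
sketch only: `std::set<int>` stands in for the real containers and
`refresh` is a hypothetical name.

```cpp
#include <iterator>
#include <optional>
#include <set>
#include <utility>

using rows_t = std::set<int>;
using iterator_range = std::pair<rows_t::const_iterator, rows_t::const_iterator>;

// Recompute the [start, end) iterators for the range [range_start, range_end].
// If something was already emitted, continue strictly after it.
iterator_range refresh(const rows_t& rows, std::optional<int> last_emitted,
                       int range_start, int range_end) {
    auto start = last_emitted
        ? rows.upper_bound(*last_emitted)   // continuation
        : rows.lower_bound(range_start);    // initial
    auto end = rows.upper_bound(range_end); // same in either case
    return {start, end};
}
```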

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Pavel Emelyanov
7e1170ecb9 range_tombstone_list: Introduce iterator_range alias
The range_tombstone_list::slice() set of methods returns
a pair of iterators representing a range. In the next
patches this pair will be actively used, and it's handy
to have a shorter alias for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Piotr Sarna
2201c9b146 configure: print more context if the linking attempt failed
Previously, when a linking attempt failed, configure.py immediately
printed that neither lld nor gold was found, which might be misleading
if the linkers are installed, but the compilation failed anyway.
The printed information is now more specific, and combined with the
previous commit, it will also provide more information why the
compilation attempt failed.
2021-03-16 07:39:05 +01:00
Piotr Sarna
f86b879933 configure: provide more context on failed ./configure.py run
If the configuration step failed, it used to only inform that
it must be due to the wrong GCC version, which can be misleading.
For instance, trying to compile on clang with incorrect flags
also resulted in a "wrong GCC version" message.
Now, the message is more generic, but it also prints the stderr
output from the miscompilation, which may help pinpoint the problem:

$ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang
Note: neither lld nor gold found; using default system linker
Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 []

// clang pretends to be gcc (defined __GNUC__), so we
// must check it first
\#ifdef __clang__

\#if __clang_major__ < 10
    #error "MAJOR"
\#endif

\#elif defined(__GNUC__)

\#if __GNUC__ < 10
    #error "MAJOR"
\#elif __GNUC__ == 10
    #if __GNUC_MINOR__ < 1
        #error "MINOR"
    #elif __GNUC_MINOR__ == 1
        #if __GNUC_PATCHLEVEL__ < 1
            #error "PATCHLEVEL"
        #endif
    #endif
\#endif

\#else

\#error "Unrecognized compiler"

\#endif

int main() { return 0; }

clang-11: error: unknown argument: '-fhello'
distcc[4085341] ERROR: compile (null) on localhost failed

Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.
2021-03-16 07:39:03 +01:00
Piotr Sarna
6389246d6e configure: add verbose option to try_compile_and_link
Which will be useful later for providing more context
why a ./configure.py run failed.
2021-03-16 07:35:16 +01:00
Pavel Emelyanov
12e4269dce cql3: Get database directly from query processor
After previous patches some places in cql3 code take a
long path to get database reference:

  query processor -> storage proxy -> database

The query processor can provide the database reference
by itself, so take this chance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:36:04 +03:00
Pavel Emelyanov
fb49550943 thrift: Use query_processor::get_migration_manager()
Thrift needs the migration manager to call announce_<something> on
it, and currently it grabs the global migration manager instance.

Since the thrift handler has a query processor reference onboard and
the query processor can provide the migration manager reference,
it's time to remove a few more globals from the thrift code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:59 +03:00
Pavel Emelyanov
6dc9a16b4e table_helper: Use query_processor::get_migration_manager()
Now that the migration manager can be obtained from the query
processor, the table helper can also benefit from it and no longer
call for the global migration manager instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:53 +03:00
Pavel Emelyanov
a9646dd779 cql3: Use query_processor::get_migration_manager() (lambda captures cases)
There are few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they all are continuations of make_ready_future<>()s, so the
query processor can be simply captured by reference and used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:48 +03:00
Pavel Emelyanov
50e4eacd08 cql3: Use query_processor::get_migration_manager() (alter_type statement)
This statement needs the query processor one step below the
stack from its .announce_migration method. So here's the
dedicated patch for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:43 +03:00
Pavel Emelyanov
464e58abf7 cql3: Use query_processor::get_migration_manager() (trivial cases)
Most of the schema altering statements implementations can now
stop calling for global migration manager instance and get it
from the query processor.

Here are the trivial cases where the query processor is already
available at the place where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:36 +03:00
Pavel Emelyanov
1de235f4da query_processor: Keep migration manager onboard
The query processor sits above the migration manager
in the services layering: it's started after and (will be)
stopped before the migration manager.

The migration manager is needed in schema altering statements
which are called with query processor argument. They will
later get the migration manager from the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:58 +03:00
Pavel Emelyanov
1e8f0963f9 cql3: Pass query processor to announce_migration:s
Now that the only call to .announce_migration has the
query processor at hand -- pass it to the real statements.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
470928dd94 cql3: Switch to qp (almost) in schema-altering-stmt
The schema altering statements are all inherited from the same
base class which declares a pure virtual .announce_migration()
method. All the real statements are called with a storage proxy
argument, while they need the migration manager. So like in the
previous patch -- replace storage proxy with query processor.

While doing the replacement also get the database instance from
the query processor, not from the proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
26c115f379 cql3: Change execute()'s 1st arg to query_processor
Currently the statement's execute() method accepts storage
proxy as the first argument. This is enough for all of them
but schema altering ones, because the latter need to call
migration manager's announce.

To provide the migration manager to those who need it, some
higher-level service than the proxy is needed. The
query processor seems to be a good candidate for it.

That said -- all the .execute()s now accept the query
processor instead of the proxy and get the proxy itself from
the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Avi Kivity
65fea203d2 test: logalloc_test: harden background reclain test against cpu overcommit
Use thread CPU time instead of real time to avoid an overcommitted
machine from not being able to supply enough CPU for the test.
2021-03-15 13:54:49 +02:00
Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
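
A sketch of the before/after shape, under assumptions: the struct, its
fields and the exact form of the original conditional are invented for
illustration and are not copied from logalloc.

```cpp
// Hypothetical model of the background reclaimer's state.
struct reclaimer_sketch {
    bool main_loop_running = false;
    int shares = 1;
    int wakeups = 0;

    // Buggy shape: one conditional guards both actions, so shares are
    // never updated while the main loop is running.
    void adjust_shares_before(int new_shares) {
        if (!main_loop_running) {
            shares = new_shares;
            ++wakeups;
        }
    }

    // Fixed shape: shares are adjusted unconditionally; only the wakeup
    // is skipped when the main loop is already running.
    void adjust_shares_after(int new_shares) {
        shares = new_shares;
        if (!main_loop_running) {
            ++wakeups;
        }
    }
};
```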
2021-03-15 13:54:37 +02:00
Tomasz Grabiec
bf6c4e0b24 Merge "raft: consolidate tests in raft directory" from Alejo
Move boost tests to tests/raft and factor out common helpers.

* alejo/raft-tests-reorg-5-rebase-next-2:
  raft: tests: move common helpers to header
  raft: tests: move boost tests to tests/raft
2021-03-15 11:59:16 +01:00
Takuya ASADA
e8cfd5114f scylla_coredump_setup: support SLES
SLES requires installing the systemd-coredump package and enabling
systemd-coredump.socket to use systemd-coredump.
2021-03-15 19:19:56 +09:00
Takuya ASADA
13871ff1f8 scylla_setup: use rpm to check package availability for SLES
Use rpm to check scylla packages installed on SLES.
2021-03-15 19:18:44 +09:00
Takuya ASADA
e3b5ffcf14 dist: install optional packages for SLES
Support SUSE's own package manager, 'zypper', in the pkg_install()
function.
2021-03-15 19:17:48 +09:00
Alejo Sanchez
88063b6e3e raft: tests: move common helpers to header
Move common test helper functions and data structures to a common
helpers.hh header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

Iff we need to pre-fill segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
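
The sequence can be sketched with plain POSIX calls. This is not the
commitlog code, which uses Seastar's asynchronous file API; the helper
name `open_prefilled_dsync`, the 4096-byte chunk size and the file mode
are assumptions for the example.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: pre-fill `path` with `size` zero bytes through a
// descriptor opened without O_DSYNC, flush once, then re-open with O_DSYNC.
// Returns the O_DSYNC descriptor, or -1 on failure.
int open_prefilled_dsync(const char* path, std::size_t size) {
    int fd = ::open(path, O_WRONLY | O_CREAT, 0644);  // "normal" mode
    if (fd < 0) {
        return -1;
    }
    std::vector<char> zeros(4096, 0);
    std::size_t left = size;
    while (left > 0) {
        ssize_t n = ::write(fd, zeros.data(), std::min(left, zeros.size()));
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            ::close(fd);
            return -1;
        }
        left -= static_cast<std::size_t>(n);
    }
    // One flush for the whole pre-fill instead of one per write.
    if (::fdatasync(fd) != 0) {
        ::close(fd);
        return -1;
    }
    ::close(fd);
    return ::open(path, O_WRONLY | O_DSYNC);
}
```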

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00
Calle Wilund
ae3b8e6fdf commitlog: coroutinize allocate_segment_ex
To make further changes here easier to write and read.
2021-03-15 09:35:37 +00:00
Avi Kivity
f326a2253c Update tools/java submodule
* tools/java 2c6110500c...fdc8fcc22c (1):
  > sstableloader: Use compound "where" restrictions for clustering
2021-03-15 11:19:22 +02:00
Raphael S. Carvalho
7171244844 compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.

So the space requirement for cleanup to succeed can be up to the size of the
keyspace, increasing the chances of the node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also, all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
2021-03-14 14:31:26 +02:00
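The effect of the per-shard semaphore mentioned in the commit above can be modelled in a few lines. This is a sketch under assumptions: `cleanup_all`, `cleanup_one` and `max_parallel` are hypothetical names, and asyncio stands in for the Seastar reactor. It measures how many table cleanups are in flight at the same time.

```python
import asyncio


async def cleanup_all(tables, max_parallel=1):
    """Run a fake cleanup for every table and return the peak parallelism seen."""
    sem = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def cleanup_one(table):
        nonlocal running, peak
        async with sem:  # serializes cleanups when max_parallel == 1
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stand-in for rewriting the table's sstables
            running -= 1

    await asyncio.gather(*(cleanup_one(t) for t in tables))
    return peak


serialized_peak = asyncio.run(cleanup_all([f"t{i}" for i in range(10)]))
unlimited_peak = asyncio.run(cleanup_all([f"t{i}" for i in range(10)], max_parallel=10))
```

With the semaphore at 1, only one table holds buffers and temporary disk space at a time, which is what bounds the space and memory requirements described in the message.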
Nadav Har'El
d73934372d storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant at that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
2021-03-14 14:11:11 +02:00
Tomasz Grabiec
f2ecb4617e Merge "raft: implement prevoting stage in leader election" from Gleb
This is how the Raft PhD thesis explains the need for a prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have the "stable leader" extension that prevents
a spurious RequestVote from deposing an active leader, but AppendEntries with
a higher term will still do that, so the prevoting extension is also required.

* scylla-dev/raft-prevote-v5:
  raft: store leader and candidate state in state variant
  raft: add boost tests for prevoting
  raft: implement prevoting stage in leader election
  raft: reset the leader on entering candidate state
  raft: use modern unordered_set::contains instead of find in become_candidate
2021-03-12 11:15:51 +01:00
Gleb Natapov
e231186a7b raft: store leader and candidate state in state variant
We already have server state dependent state in fsm, so there is no need
to maintain "voters" and "tracker" optionals as well. The upside is that
optional and variant states cannot drift apart now.
2021-03-12 11:12:57 +02:00
Gleb Natapov
e17e7d57bd raft: add boost tests for prevoting 2021-03-12 11:12:57 +02:00
Gleb Natapov
1f868d516e raft: implement prevoting stage in leader election
This is how the Raft PhD thesis explains the need for a prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have the "stable leader" extension that prevents
a spurious RequestVote from deposing an active leader, but AppendEntries with
a higher term will still do that, so the prevoting extension is also required.
2021-03-12 11:09:21 +02:00
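The prevote rule quoted in the commit above can be expressed as a small decision function. This is a minimal sketch, not the fsm code from the patch: `Voter`, `would_grant` and `prevote_succeeds` are hypothetical names, a log position is modelled as a `(last_term, last_index)` tuple, and the candidate is assumed to count its own vote.

```python
from dataclasses import dataclass


@dataclass
class Voter:
    last_log: tuple          # (last_term, last_index) of the voter's log
    heard_from_leader: bool  # heartbeat seen within a baseline election timeout

    def would_grant(self, candidate_last_log):
        # Grant only if no valid leader was heard from recently and the
        # candidate's log is at least as up-to-date as the voter's own.
        return (not self.heard_from_leader
                and tuple(candidate_last_log) >= tuple(self.last_log))


def prevote_succeeds(candidate_last_log, reachable_voters, cluster_size):
    """Return True if the candidate may increment its term and start an election."""
    granted = 1 + sum(1 for v in reachable_voters if v.would_grant(candidate_last_log))
    return granted > cluster_size // 2
```

A partitioned server reaches no voters, so `prevote_succeeds` stays false and its term never grows; after it rejoins, the other servers have been hearing from the leader and still refuse, which is the non-disruptive behavior the message describes.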
Raphael S. Carvalho
f6fc32c8da table: use new sstable_set::for_each_sstable
for_each_sstable() is preferred over all() because it's guaranteed to
perform no copy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-2-raphaelsc@scylladb.com>
2021-03-11 18:47:17 +02:00
Raphael S. Carvalho
e7a6f3926a sstable_set: introduce for_each_sstable()
This new method is preferred over all() for iteration purposes, because
all() may have to copy sstables into a temporary.
For example, all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
2021-03-11 18:47:16 +02:00
Avi Kivity
486f6bf29c Merge "sstables: move format specific reader code to kl/, mx/" from Botond
"
Currently the sstable reader code is scattered across several source
files as follows (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
  from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
  byte stream;

This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.

This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.

This patchset only moves code around, no logical changes are made.

Tests: unit(dev)
"

* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
  sstables: get rid of mp_row_consumer.{hh,cc}
  sstables: get rid of row.hh
  sstables/mp_row_consumer.hh: remove unused struct new_mutation
  sstables: move mx specific context and consumer to mx/reader.cc
  sstables: move kl specific context and consumer to kl/reader.cc
  sstables: mv partition.cc sstable_mutation_reader.hh
2021-03-11 16:57:54 +02:00
Raphael S. Carvalho
6ff8bb4eac compaction: Allow all supported compaction types to be stopped
Let's make stop_compaction() use sstables::to_compaction_type(),
so all supported compaction types can now be aborted.

Refs #7738.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:30:11 -03:00
Raphael S. Carvalho
f1b8d5f20f compaction: introduce function to map compaction name to respective type
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:59 -03:00
Raphael S. Carvalho
a44bc233f5 compaction: refactor mapping of compaction type to string
This will make it easier to introduce new type and also to map type to
string and vice-versa, using reverse lookup.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:53 -03:00
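The "single table plus reverse lookup" idea from the three compaction commits above might look like the sketch below. The enum members and strings are assumptions made for illustration (only cleanup, scrub and upgrade are named in this log); the Python function names merely echo `compaction_name()` and `to_compaction_type()` and are not the C++ signatures.

```python
from enum import Enum, auto


class CompactionType(Enum):
    # Hypothetical subset of types, for illustration only.
    COMPACTION = auto()
    CLEANUP = auto()
    SCRUB = auto()
    UPGRADE = auto()


# One table is the single source of truth for both directions.
_NAMES = {
    CompactionType.COMPACTION: "COMPACTION",
    CompactionType.CLEANUP: "CLEANUP",
    CompactionType.SCRUB: "SCRUB",
    CompactionType.UPGRADE: "UPGRADE",
}


def compaction_name(ctype):
    """Map a compaction type to its string."""
    return _NAMES[ctype]


def to_compaction_type(name):
    """Reverse lookup: map a string back to its compaction type."""
    for ctype, type_name in _NAMES.items():
        if type_name == name:
            return ctype
    raise ValueError(f"unknown compaction type: {name}")
```

Adding a new type then means adding one table entry, and the string-to-type direction needed for stopping a compaction by name comes for free.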
Raphael S. Carvalho
503a0ea928 compaction: move compaction_name() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:46 -03:00
Botond Dénes
361ba473c7 sstables: get rid of mp_row_consumer.{hh,cc}
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
3ba782bddd sstables: get rid of row.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
f5b0657fa5 sstables/mp_row_consumer.hh: remove unused struct new_mutation 2021-03-11 12:17:13 +02:00
Botond Dénes
cecc7f8064 sstables: move mx specific context and consumer to mx/reader.cc
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
2021-03-11 12:17:13 +02:00
Botond Dénes
4e3ae9d913 sstables: move kl specific context and consumer to kl/reader.cc
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by tests is moved to
kl/reader_impl.hh, while code that can be hidden is moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
2021-03-11 12:17:13 +02:00
Botond Dénes
0ec040921d sstables: mv partition.cc sstable_mutation_reader.hh
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for an organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
2021-03-11 12:17:13 +02:00
Avi Kivity
a49c4ab754 Update tools/java submodule
* tools/java c5d9e8513e...2c6110500c (1):
  > cassandra.in.sh: Add path to rack/dc properties file to classpath

Fixes #7930.
2021-03-11 12:03:01 +02:00
Asias He
d5e6ba1ff1 repair: Shortcut when no followers to repair with
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with

However, currently node1 will walk through the repair procedure, reading
data from disk and calculating hashes, all of which is unnecessary.

This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.

Before:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1},
   ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
   round_nr_fast_path_already_synced=1456,
   round_nr_fast_path_same_combined_hashes=0,
   round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={{127.0.0.1, 2822972}},
   row_from_disk_nr={{127.0.0.1, 6218}},
   row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
   row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

Data was read from disk.

After:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
   sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
   round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
   rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={},
   row_from_disk_nr={},
   row_from_disk_bytes_per_sec={} MiB/s,
   row_from_disk_rows_per_sec={} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

No data was read from disk.

Fixes #8256

Closes #8257
2021-03-11 11:53:22 +02:00
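The shortcut from the commit above amounts to an early `continue` when the follower list for a range is empty. A minimal sketch, with hypothetical names (`repair_ranges`, `replicas_of`, `repair_one`) and no claim to match the real repair code:

```python
def repair_ranges(ranges, replicas_of, local, ignore_nodes, repair_one):
    """Repair every range that still has at least one follower.

    Returns the number of ranges that were actually repaired.
    """
    repaired = 0
    for rng in ranges:
        followers = [n for n in replicas_of(rng)
                     if n != local and n not in ignore_nodes]
        if not followers:
            # Nobody to repair with: skip reading from disk and hashing.
            continue
        repair_one(rng, followers)
        repaired += 1
    return repaired


nodes = ["127.0.0.1", "127.0.0.2", "127.0.0.3"]
calls = []
skipped_all = repair_ranges(range(3), lambda r: nodes, "127.0.0.1",
                            {"127.0.0.2", "127.0.0.3"},
                            lambda r, f: calls.append((r, f)))
```

With both peers ignored, as in the curl example from the message, no range is processed and `repair_one` is never called.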
Avi Kivity
c8f692e526 Merge 'cql3: Rewrite get_clustering_bounds() using expressions' from Dejan Mircevski
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause.  This will allow us to eventually drop the `restrictions` hierarchy altogether.

Tests: unit (dev, debug)

Closes #8227

* github.com:scylladb/scylla:
  cql3: Make get_clustering_bounds() use expressions
  cql3/expr: Add is_multi_column()
  cql3/expr: Add more operators to needs_filtering
  cql3: Replace CK-bound mode with comparison_order
  cql3/expr: Make to_range globally visible
  cql3: Gather slice-defining WHERE expressions
  cql3: Add statement_restrictions::_where
  test: Add unit tests for get_clustering_bounds
2021-03-11 11:46:52 +02:00
Gleb Natapov
a849246cfc raft: reset the leader on entering candidate state
Not resetting the leader causes vote requests to be ignored instead of
rejected, which makes a voting round take more time to fail and may
slow down new leader election.
2021-03-11 10:36:43 +02:00
Gleb Natapov
20d6bb36cd raft: use modern unordered_set::contains instead of find in become_candidate 2021-03-11 10:36:43 +02:00
Dejan Mircevski
990de02d28 cql3: Make get_clustering_bounds() use expressions
Use expressions instead of _clustering_columns_restrictions.  This is
a step towards replacing the entire restrictions class hierarchy with
expressions.

Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
8dac132581 cql3/expr: Add is_multi_column()
It will come in handy when we start using expressions to calculate the
clustering slice.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
1f591bd16e cql3/expr: Add more operators to needs_filtering
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them.  But that will likely change in the future,
so add them now to prevent problems down the road.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
c0c93982d0 cql3: Replace CK-bound mode with comparison_order
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator.  We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
7dfe471b5a cql3/expr: Make to_range globally visible
It will be used in statement_restrictions for calculating clustering
bounds.  And it will come in handy elsewhere in the future, I'm sure.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
28b5a372f8 cql3: Gather slice-defining WHERE expressions
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions.  Explain how to find all such
expressions in the WHERE clause.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
da096bfdce cql3: Add statement_restrictions::_where
... and collect all restrictions' expressions into it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
2525759027 test: Add unit tests for get_clustering_bounds
... as guardrails for the upcoming rewrite.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:17:26 -05:00
Calle Wilund
f44420f2c9 snapshot: Add filter to check for existing snapshot
Fixes #8212

Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to the check routine, which if non-empty lists the tables to check.

Use case is "scrub" which calls with a limited set of tables
to snapshot.

Closes #8240
2021-03-10 20:21:38 +02:00
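The optional filter described in the commit above can be sketched as follows. The function name, the `snapshots_by_table` mapping and the sample table names are hypothetical; the only behavior taken from the message is that an empty filter checks all tables and a non-empty one restricts the check.

```python
def snapshot_exists(snapshots_by_table, tag, table_filter=()):
    """Return True if a snapshot named `tag` exists in any table that is checked.

    An empty `table_filter` means "check every table".
    """
    tables = table_filter if table_filter else snapshots_by_table.keys()
    return any(tag in snapshots_by_table.get(t, set()) for t in tables)


existing = {"users": {"pre-scrub-1"}, "events": set()}
```

A scrub of `events` alone would pass `table_filter=["events"]` and no longer trip over the `pre-scrub-1` snapshot that only exists in `users`.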
Benny Halevy
ff5b42a0fa bytes_ostream: max_chunk_size: account for chunk header
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().

This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.

This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that chunks are normally allocated
in 128KB units in the write() path.

Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.

Refs #7950
Refs #8081

Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
2021-03-10 19:54:12 +02:00
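The arithmetic behind the commit above can be checked with a simplified model. The 16-byte header size is inferred from the "131088 (= 128K + 16)" figure in the message; `alloc_size` is a hypothetical helper that only models "payload capped at max_chunk_size, plus the chunk header".

```python
MAX_ALLOC_SIZE = 128 * 1024  # 128KB, the largest allocation we want to make
CHUNK_HEADER_SIZE = 16       # sizeof(chunk), inferred from 131088 = 128K + 16


def alloc_size(data_size, max_chunk_size):
    """Bytes allocated for one chunk: capped payload plus the chunk header."""
    return min(data_size, max_chunk_size) + CHUNK_HEADER_SIZE


# Before: max_chunk_size ignored the header, so a large write overshoots 128KB.
before = alloc_size(200_000, MAX_ALLOC_SIZE)
# After: max_chunk_size accounts for the header, so the allocation is exactly 128KB.
after = alloc_size(200_000, MAX_ALLOC_SIZE - CHUNK_HEADER_SIZE)
```

Here `before` is 131088 bytes, the failing allocation from the issue, while `after` is 131072 bytes.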
Asias He
268fa9d9fe main: Lower shares for main scheduling group
The main scheduling group has shares of 1000, which is as high as
the statement group.

From time to time, we see unexpected scheduling group leaking to the
main group, which causes a drop in query performance.

This patch reduces the main scheduling shares to 200, which is the same
as the maintenance scheduling group. It is a safer default in case code
leaks to the main scheduling group.

Refs: #7720

Closes #8243
2021-03-10 19:34:45 +02:00
Takuya ASADA
af8eae317b scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero
In an environment where the hard limit of coredump is set to zero, the
coredump test script will fail since the system does not generate a coredump.
To avoid this issue, set ulimit -c 0 before generating SEGV in the script.

Note that scylla-server.service can generate a coredump even with ulimit -c 0
because we set LimitCORE=infinity in its systemd unit file.

Fixes #8238

Closes #8245
2021-03-10 19:28:10 +02:00
Avi Kivity
5342d79461 Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
  sstable_set: move all() implementation into sstable_set_impl
  sstable_set: preparatory work to change sstable_set::all() api
  sstables: remove bag_sstable_set
2021-03-10 19:19:26 +02:00
Botond Dénes
cf28552357 mutation_test: test_mutation_diff_with_random_generator: compact input mutations
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.

Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
2021-03-10 16:28:14 +01:00
Raphael S. Carvalho
c3b8757fa1 sstable_set: move all() implementation into sstable_set_impl
The main motivation behind this is that by moving all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
with all sstables, which in turn may disagree with the respective
sstable_set_impl.

This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable into
both the compound set and the set managed by it, to guarantee
all() of the compound set would return the correct data, which would
be expensive and error prone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:13 -03:00
Raphael S. Carvalho
05b07c7161 sstable_set: preparatory work to change sstable_set::all() api
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.

this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.

so even range-based loops on all() have to keep a ref to the returned
list, to prevent the list from being prematurely destroyed.

so the following code
	for (auto& sst : *sstable_set.all()) { ...}
becomes
	for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:12 -03:00
Avi Kivity
746798fd56 Merge "sstables: get rid of data_consume_context" from Botond
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"

* 'data_consume_context_bye' of https://github.com/denesb/scylla:
  sstable: move data_consume_* factory methods to row.hh
  sstables: fold data_consume_context: into its users
  sstables: partition.cc: remove data_consume_* forward declarations
2021-03-10 16:45:32 +02:00
Nadav Har'El
a1725217e1 Merge 'alternator: coroutinize handle_api_request' from Piotr Sarna
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.

Tests: unit(dev)

Closes #8249

* github.com:scylladb/scylla:
  alternator: drop read_content_and_verify_signature
  alternator: coroutinize handle_api_request
2021-03-10 16:08:08 +02:00
Piotr Sarna
ba264e7199 alternator: drop read_content_and_verify_signature
The only use of this helper function was inlined in a bigger
coroutine, so it's no longer needed.
2021-03-10 14:42:53 +01:00
Piotr Sarna
35da51879f alternator: coroutinize handle_api_request
The indentation level is significantly reduced, and so is the number
of allocations.
The function signature is changed from taking an rvalue ref to taking
the unique_ptr by value, because otherwise the coroutine captures
the request as a reference, which results in use-after-free.
2021-03-10 14:42:52 +01:00
Botond Dénes
1aa2424dcf sstable: move data_consume_* factory methods to row.hh 2021-03-10 15:40:50 +02:00
Botond Dénes
a06465a8f3 sstables: fold data_consume_context: into its users
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
2021-03-10 15:38:58 +02:00
Botond Dénes
37eb547224 sstables: partition.cc: remove data_consume_* forward declarations
They don't seem to serve any purpose, everything builds fine without
them.
2021-03-10 15:23:54 +02:00
Raphael S. Carvalho
f7cc431477 compaction_manager: Fix use-after-free in rewrite_sstables()
Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
2021-03-10 13:18:38 +02:00
Nadav Har'El
f41dac2a3a alternator: avoid large contiguous allocation for request body
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then processed it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a
non-contiguous buffer, and that's the goal of this patch.

We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.

Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing completes. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.

Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code which needs to cross chunk boundaries work as expected.

Fixes #7213.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
2021-03-10 09:22:34 +01:00
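The chunked approach from the commit above relies on the fact that a digest can be computed incrementally, chunk by chunk, without ever joining the body into one contiguous buffer. A minimal sketch, assuming the integrity check needs a SHA-256 of the body (as AWS Signature Version 4 does); `read_chunked` and `sha256_of_chunks` are hypothetical names, and a Python list of `bytes` stands in for `vector<temporary_buffer<char>>`.

```python
import hashlib
import io


def read_chunked(stream, chunk_size):
    """Read the whole stream into a list of chunks instead of one big string."""
    chunks = []
    while True:
        buf = stream.read(chunk_size)
        if not buf:
            break
        chunks.append(buf)
    return chunks


def sha256_of_chunks(chunks):
    """Hash a non-contiguous body by feeding the digest one chunk at a time."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


body = b'{"RequestItems": {}}' * 10000
chunks = read_chunked(io.BytesIO(body), 64 * 1024)
```

The digest of the chunk list equals the digest of the contiguous body, so the verification result does not depend on where the chunk boundaries fall.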
Juliusz Stasiewicz
382545a614 docs: explain SSL/non-SSL and shard-aware CQL ports
I added a short description of shard-aware ports and updated the rules
for disabling ports and enabling SSL introduced by #7992.

Fixes #8146

Closes #8152
2021-03-09 22:48:30 +02:00
Tomasz Grabiec
c9c2beabc0 Merge "raft: replication tests as individual boost tests" from Alejo
* alejo/raft-tests-replication-boost-5:
  raft: replication test: use Seastar random generator
  raft: replication test: rename drop_replication
  raft: replication test: change to Boost test
  raft: replication test: id helper functions
  raft: replication test: improve handling connectivity
  raft: replication test: parametrize snapshots
  raft: replication test: parametrize drop_replication
  raft: replication test: remove unused configuration
  raft: replication test: add license
2021-03-09 17:58:59 +01:00
Pavel Emelyanov
096e452db9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
2021-03-09 17:57:52 +01:00
Alejo Sanchez
f67b85e2b3 raft: replication test: use Seastar random generator
Use the random generator provided by Seastar test suite.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
1bf10a87c6 raft: replication test: rename drop_replication
Rename drop_replication to packet_drops for readability.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
6e193ee3bf raft: replication test: change to Boost test
Change test/raft directory to Boost test type.

Run replication_test cases with their own test.

RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops.

The directory test/raft is changed to host Boost tests instead of unit.

While there, improve the documentation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
8d9c797954 raft: replication test: id helper functions
In raft the UUID 0 is a special case, so server ids start at 1.
Add two helper functions: one converts a local 0-based id to a raft
1-based UUID, the other converts from a UUID to a raft_id.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:50:12 -04:00
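The id conversion from the commit above can be sketched like this. The function names are hypothetical, and the second helper is assumed to be the inverse mapping back to the 0-based local id, which is one possible reading of "from UUID to raft_id".

```python
import uuid


def to_raft_uuid(local_id):
    """Convert a local 0-based id to a 1-based UUID, so UUID 0 is never produced."""
    return uuid.UUID(int=local_id + 1)


def to_local_id(raft_uuid):
    """Assumed inverse: convert a 1-based UUID back to the local 0-based id."""
    return raft_uuid.int - 1
```

For example, local server 0 maps to the UUID whose integer value is 1, and converting back yields 0 again.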
Alejo Sanchez
0ffa450222 raft: replication test: improve handling connectivity
Change global map of disconnected servers to a more intuitive class
connected. The class is callable for the most common case
connected(id).

Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0 based) and server_id (UUID, 1 based) ids.

The actual shared map is kept in a lw_shared_ptr.

The class is passed around to be copy-constructed which is practically
just creating a new lw_shared_ptr.

Internally it tracks disconnected servers but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" and "not disconnected id", without double negatives.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:39:29 -04:00
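The shape of the `connected` class from the commit above can be modelled as below. This is a sketch under assumptions: the Python class name, the `share()` method and the use of a `set` are hypothetical, with a shared set reference standing in for the `lw_shared_ptr` to the map.

```python
class Connected:
    """Tracks disconnected servers internally, but reads as "connected(id)"."""

    def __init__(self, disconnected=None):
        # All copies made via share() point at the same underlying set.
        self._disconnected = disconnected if disconnected is not None else set()

    def __call__(self, server_id):
        # The most common query: is this server connected?
        return server_id not in self._disconnected

    def connect(self, server_id):
        self._disconnected.discard(server_id)  # erase

    def disconnect(self, server_id):
        self._disconnected.add(server_id)      # insert

    def all(self):
        self._disconnected.clear()             # reconnect everyone

    def share(self):
        # Cheap "copy" that shares state, like copying a shared pointer.
        return Connected(self._disconnected)


connected = Connected()
view = connected.share()
connected.disconnect(2)
```

Because `view` shares state with `connected`, a disconnect made through one is visible through the other, which is what lets the object be passed around by copy.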
Alejo Sanchez
7a644f37d3 raft: replication test: parametrize snapshots
Snapshots and persisted snapshots are created per test instead of being globals.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
f72e89fcfe raft: replication test: parametrize drop_replication
Pass drop_replication down instead of keeping it global.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
5a03670f91 raft: replication test: remove unused configuration
Remove test case configuration as it's not implemented yet.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
efc6681cd6 raft: replication test: add license
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Piotr Sarna
d473bc9b06 Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
This is a reworked submission of #7686, which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load. We spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause an unrecoverable error, since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur in
two main but uncommon situations: in a mixed cluster (during an upgrade) and in a
small window after a view or base table is altered.

Fixes #7709

Closes #8091

* github.com:scylladb/scylla:
  database: Fix view schemas in place when loading
  global_schema_ptr: add support for view's base table
  materialized views: create view schemas with proper base table reference.
  materialized views: Extract fix legacy schema into its own logic
2021-03-09 16:27:34 +01:00
Asias He
61ac8d03b9 repair: Add ignore_nodes option
In some cases, a user may want to repair the cluster while ignoring a node
that is down. For example, running repair before a removenode operation that
removes a dead node.

Currently, repair ignores the dead node and keeps running without it, but
reports the repair as partial and as failed. It is hard to tell whether the
repair failed only because the dead node is absent or because of some other
error.

In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It will be much simpler, if one can
just specify the node to exclude explicitly.

In addition, we support an ignore nodes option in other node operations
like removenode. This change makes the interface for explicitly ignoring
a node more consistent.

Refs: #7806

Closes #8233
2021-03-09 16:03:13 +01:00
Gleb Natapov
2a41ad0b57 raft: add testing for non-voting members
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
2021-03-09 13:51:09 +01:00
Gleb Natapov
dd6ba3d507 raft: add non-voting member support
This patch adds support for non-voting members. A non-voting member is a
member whose vote is not counted for leader election or commit index
calculation purposes, and which cannot become a leader. Otherwise
it is a normal raft node. The state is needed to let new nodes catch
up their log without disturbing the cluster.

All kinds of transitions are allowed. A node may be added as a voting member
directly or it may be added as non-voting and then changed to be voting
one through additional configuration change. A node can be demoted from
voting to non-voting member through a configuration change as well.
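
The effect on quorum calculation can be sketched in Python as follows. This is an illustration under assumptions, not the code from this patch: the function names and the dictionary of match indexes are hypothetical.

```python
def has_election_quorum(votes_granted, voters):
    # Only votes from voting members count; non-voters are ignored.
    voters = set(voters)
    granted = len(set(votes_granted) & voters)
    return granted >= len(voters) // 2 + 1

def commit_index(match_index, voters):
    # Highest log index replicated on a majority of voting members.
    # Entries of match_index that belong to non-voters are not consulted.
    voters = set(voters)
    indexes = sorted((match_index[v] for v in voters), reverse=True)
    return indexes[len(voters) // 2]
```

With three voters, two grants from voters form a quorum, while one voter plus any number of non-voters does not.
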
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
2021-03-09 13:47:48 +01:00
Raphael S. Carvalho
863b95aa34 sstables: remove bag_sstable_set
bag_sstable_set can be replaced with partitioned_sstable_set, which
will provide the same functionality, given that L0 sstables go to
a "bag" rather than interval map. STCS, for example, will only
have L0 sstables, so it will get exactly the same behavior with
partitioned_sstable_set.

It also gives us the benefit of keeping the leveled sstables in
the interval map if user has switched from LCS to STCS, until
they're all compacted into size-tiered ssts.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-09 08:39:48 -03:00
Avi Kivity
9038a81317 treewide: drop SEASTAR_CONCEPT
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.

Closes #8236
2021-03-08 16:04:20 +01:00
Asias He
dc40184faa gossip: Handle timeout error in gossiper::do_shadow_round
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.

It is fine for some of the remote nodes to time out in the shadow round.
It is not necessary to talk to all nodes.
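
The idea of tolerating per-node timeouts can be sketched in Python as follows. This is not the gossiper code: `shadow_round` and the `get_endpoint_states` callback are hypothetical names, and the sketch does not model what happens if every node times out.

```python
def shadow_round(nodes, get_endpoint_states):
    responses = {}
    for node in nodes:
        try:
            responses[node] = get_endpoint_states(node)
        except TimeoutError:
            # A node that times out is skipped instead of failing the boot.
            continue
    return responses
```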

This patch fixes an issue we saw recently in our sct tests:

```
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO    | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...

ERR     | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```

Fixes #8187

Closes #8213
2021-03-08 13:03:41 +01:00
Nadav Har'El
28804a50f7 alternator-test: test that index can't be a name reference (#xyz)
We already have a test which verifies that DynamoDB and Alternator
do not allow an index in an attribute path - like a[0].b - to be
a value reference - a[:xyz].b. We forgot to verify that the index
also can't be a name reference - a[#xyz].b is a syntax error. So here
we add a test which confirms that this is indeed the case - DynamoDB
doesn't allow it, and neither does Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>
2021-03-08 10:17:19 +01:00
Avi Kivity
938761f49f types.cc: drop unused #include "compaction_garbage_collector.hh"
Garbage-collect unused #includes.

Closes #8232
2021-03-08 06:44:03 +01:00
Takuya ASADA
2d9feaacea scylla_raid_setup: don't abort using raiddev when array_state is 'clear'
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).

However, looking into /proc/mdstat on the AMI, it actually shows that no active md device is available:

ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>

We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel doc, the file may be available even if the array is stopped:

    clear

        No devices, no size, no level
        Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

So we should also check array_state != 'clear', not just array_state
existence.
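
A minimal sketch of such a check, assuming a helper named `md_in_use` (hypothetical, not the function in scylla_raid_setup) and a `sys_block` parameter added only so the sketch can be exercised against a temporary directory:

```python
import os

def md_in_use(name, sys_block="/sys/block"):
    path = os.path.join(sys_block, name, "md", "array_state")
    # No array_state file: the device is not an md array in use.
    if not os.path.exists(path):
        return False
    with open(path) as f:
        # 'clear' means no devices, no size, no level: treat it as free.
        return f.read().strip() != "clear"
```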

Fixes #8219

Closes #8220
2021-03-07 18:30:11 +02:00
Avi Kivity
1287a5e1d0 test: index_reader_assertions: fix misuse of trichotomic comparator in has_monotonic_positions
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.

Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.
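
The class of bug can be illustrated with a small Python sketch. It is not the test code itself: the function names are hypothetical, and the "fixed" variant assumes the intended invariant is that each position is greater than or equal to the previous one.

```python
def tri_compare(a, b):
    """Three-way comparison: negative, zero, or positive."""
    return (a > b) - (a < b)

def monotonic_buggy(positions):
    # Bug: the int result is used as a boolean "less than". Any non-zero
    # value is truthy, so this only rejects equal neighbours.
    return all(bool(tri_compare(p, c))
               for p, c in zip(positions, positions[1:]))

def monotonic_fixed(positions):
    # Compare the three-way result against zero explicitly.
    return all(tri_compare(c, p) >= 0
               for p, c in zip(positions, positions[1:]))
```

The buggy variant accepts a strictly decreasing sequence and rejects repeated positions, which is the opposite of what a monotonicity check should do.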

Ref #1449.

Closes #8222
2021-03-07 13:44:37 +02:00
Eliran Sinvani
0220786710 database: Fix view schemas in place when loading
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since once registered, this view version can be used
for writes, which is forbidden since we will spot a non-computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since computed
column is just involved in generating a value for this column when
creating a view update so the effect of this manipulation stays
internal.
The second stage of the in place fixing is to persist the
changes made in the in place fixing so the view is ready for
the next node restart in particular the `computed_columns` table.
2021-03-07 12:57:16 +02:00
Eliran Sinvani
04de770566 global_schema_ptr: add support for view's base table
Up until now, the global_schema_ptr object was a crack
through which a view schema with an uninitialized base
reference could sneak. Even if the schema itself contained a
base reference, the base schema didn't carry over to shards
different than the shard on which the global_schema_ptr was
created.
Since once the schema is in the registry it might be used for
everything (reads and writes), we also need to make sure that
global schemas for incomplete view schemas will not be created.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
9162748b18 materialized views: create view schemas with proper base table
reference.

Newly created view schemas don't always have their base info.
This is bad since such schemas support neither reads nor writes.
This leaves us vulnerable to a race condition where there is
an attempt to use this schema for read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for write and not only
reads. We do it for views created in the migration manager following
announcements and also for copied schemas.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
39cd9dae4e materialized views: Extract fix legacy schema into its own logic
We extract the logic for fixing the view schema into its own
logic as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since
it becomes a two liner wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
2021-03-07 12:50:42 +02:00
Takuya ASADA
53c7600da8 dist: increase fs.aio-max-nr value for other apps
The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the amount
Scylla uses. If other apps in the environment also try to use aio, the aio
slots will run out.
So increase the value by 65536 for other apps.
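
The resulting value can be worked out with a short Python sketch. The function and constant names are hypothetical; only the numbers come from this commit message.

```python
AIO_SLOTS_PER_CPU = 11026   # what Scylla itself uses per CPU
AIO_SLOTS_FOR_OTHERS = 65536  # headroom left for other applications

def aio_max_nr(cpus):
    return cpus * AIO_SLOTS_PER_CPU + AIO_SLOTS_FOR_OTHERS
```

For an 8-CPU machine this gives 8 * 11026 + 65536 = 153744.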

Related #8133

Closes #8228
2021-03-07 12:11:36 +02:00
Piotr Sarna
7106ca27e6 service: reduce continuation length for paxos pruning
A pair of (finally, handle_exception) is reduced to a single
use of then_wrapped(), which saves an allocation.

Message-Id: <01949e286db93397209435a85fcc46a8beef6d24.1614937462.git.sarna@scylladb.com>
2021-03-07 11:59:10 +02:00
Nadav Har'El
ad563c6279 Update tools/java submodule
Fixes an sstableloader bug where we quoted twice column names that
had to be quoted, and therefore failed on such tables - and in particular
Alternator tables which always have a column called ":attrs".

Fixes #8229

* tools/java 142f517a23...c5d9e8513e (1):
  > sstableloader: Only escape column names once
2021-03-07 10:33:49 +02:00
Botond Dénes
debaae41f9 mutation_partition: operator<<(mutation_partition::printer)
Include row tombstones in the row printout.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210305094106.210249-1-bdenes@scylladb.com>
2021-03-05 14:39:39 +02:00
Botond Dénes
45471419d0 multishard_mutation_query: re-enable reverse queries
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out, this is not true
and there is at least one client API (Thrift) which does allow reverse
range scans. So re-enable them.

Fixes: #8211

Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
2021-03-04 17:06:16 +02:00
Nadav Har'El
acfa180766 cql-pytest: recognize when Scylla crashes
Before this patch, if Scylla crashes during some test in cql-pytest, all
tests after it will fail because they can't connect to Scylla - and we can
get a report on hundreds of failures without a clear sign of where the real
problem was.

This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing CQL command after each
test.  If this CQL command fails, we conclude that Scylla crashed and
report the test in which this happened - and exit pytest instead of failing
a hundred more tests.

Fixes #8080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>
2021-03-04 16:00:00 +02:00
Raphael S. Carvalho
1226fc755f compaction_manager: Increase cleanup compaction resilience when low on disk space
In a scenario where a node is running out of disk space, which is a common
cause of cluster expansion, it's very important to clean up the smallest
files first to increase the chances of success when the biggest files are
reached down the road. That's possible given that cleanup operates on a
single file at a time, and that the smaller the file the smaller the
space requirement.
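
Why the order matters can be shown with a toy Python model. Everything here is an assumption made for illustration: files are `(size, reclaimable)` pairs, and cleaning a file needs `size - reclaimable` bytes of free space for the output before the input is deleted. The real space accounting in the compaction manager is not modelled.

```python
def cleanup_all(files, free_space):
    # Process files in the given order; fail as soon as one does not fit.
    for size, reclaimable in files:
        needed = size - reclaimable
        if needed > free_space:
            return False
        # Output written, input deleted: net gain is the reclaimed space.
        free_space += reclaimable
    return True

def smallest_first(files):
    return sorted(files, key=lambda f: f[0])
```

With 20 units free and files `[(60, 30), (10, 8), (30, 20)]`, the given order fails on the first file, while smallest-first frees enough space along the way for all three to succeed.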

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>
2021-03-04 15:11:06 +02:00
Botond Dénes
25367deb01 mutation_partition: make row::consume_with() exception safe
This function currently eagerly decrements `_size`, before `func()` is
invoked. If `func()` throws the consumption fails but the size remains
decremented. If this happens right at the last element in the row, the
`row::empty()` will incorrectly return `true`, even though there is
still one cell left in it. Move the decrement after the `func()`
invocation to avoid this by only decrementing if the consumption
was successful.
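
The ordering issue can be sketched in Python as follows. The `Row` class below is a hypothetical stand-in, not the C++ `row` class; it only shows the size being decremented after the callback returns.

```python
class Row:
    def __init__(self, cells):
        self._cells = list(cells)
        self._size = len(self._cells)

    def empty(self):
        return self._size == 0

    def consume_with(self, func):
        for i, cell in enumerate(self._cells):
            if cell is None:
                continue
            func(cell)              # may raise; nothing has been changed yet
            self._cells[i] = None
            self._size -= 1         # decrement only after success
```

If `func` raises on the last cell, `empty()` still reports `False`, matching the one cell that remains.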

Fixes: #8154

Tests: unit(mutation_test:release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>
2021-03-04 15:07:15 +02:00
Piotr Sarna
added53b7d Merge 'hints: use a soft disk space limit in hints commitlog' from Piotr Dulikowski
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.

Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
completely deadlock and no more hints are admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.

To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.

Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again

Fixes #8137

Closes #8206

* github.com:scylladb/scylla:
  hints_manager: don't use commitlog hard space limit
  commitlog: add an option to allow going over size limit
2021-03-04 12:24:05 +01:00
Tomasz Grabiec
d6a94a7db1 Merge 'Make dht::token tri_compare safer' from Avi Kivity
tri_compare() returns an int, which is dangerous as a tri_compare can
be misused where a less_compare is expected. To prevent such misuse,
convert the interval<> template to accept comparators that return
std::strong_ordering, and then convert dht::token's comparator to do
the same.

Ref #1449.

Closes #8181

* github.com:scylladb/scylla:
  dht: convert token tri_compare to std::strong_ordering
  interval: support C++20 three-way comparisons
2021-03-04 11:55:08 +01:00
Nadav Har'El
3e66a5cd43 Merge 'More Redis cleanups' from Pekka Enberg
This pull request removes seastar namespace imports from the header
files. There are some additional cleanups to make that easier and to
remove some commented out code.

Closes #8202

* github.com:scylladb/scylla:
  redis: Remove seastar namespace import from query_processor.hh
  redis: Switch to seastar::sharded<> in query_procesor.hh
  redis: Remove seastar namespace import from query_utils.hh
  redis: Remove seastar namespace import from reply.hh
  redis: Remove commented out code from options.hh
  redis: Remove seastar namespace import from options.hh
  redis: Remove seastar namespace import from service.hh
  redis: Switch to seastar::sharded<> in service.{hh,cc}
  redis: Remove unneeded include from keyspace_utils.hh
  redis: Remove seastar namespace import from keyspace_utils.hh
  redis: Remove seastar namespace import from command_factory.hh
  redis: Fix include path in command_factory.hh
  redis: Remove unneeded includes from command_factory.hh
2021-03-04 11:08:24 +02:00
Pekka Enberg
6066db7c90 Update tools/jmx submodule
* tools/jmx bac7d0b...15c1d4f (2):
  > StorageService: Add a method to return the uptime
  > Bump Jackson version in scylla-apiclient
2021-03-04 10:56:37 +02:00
Nadav Har'El
e12e57c915 Merge 'Fix alternator streams management regression' from Calle Wilund
Refs: #8012
Fixes: #8210

With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r. data.

Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.

Closes #8209

* github.com:scylladb/scylla:
  alternator::streams: Use better method for generation timestamp
  system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
  system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
2021-03-04 09:43:56 +02:00
Pekka Enberg
1d8a94f941 Update tools/jmx submodule
* tools/jmx c2fc96b...bac7d0b (1):
  > Merge 'Fix locking in APIBuilder.remove()' from Pekka Enberg
2021-03-03 18:30:48 +02:00
Calle Wilund
8bbc976ff1 alternator::streams: Use better method for generation timestamp
Get timestamp via system_distributed, instead of local gen.
2021-03-03 15:46:38 +00:00
Calle Wilund
5da0129775 system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
Since we have a table of cdc version timestamps, conveniently sorted
reversed, we can just query this and get the latest known gen ts.
2021-03-03 15:44:54 +00:00
Calle Wilund
5a69250d7e system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.

However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).

Fixed by doing reverse iteration.
2021-03-03 15:41:42 +00:00
Tomasz Grabiec
3cb01f218f Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev.git/raft-confchange-test-v4:
  raft: fix spelling
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::in_memory_size()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-03-03 16:29:40 +01:00
Tomasz Grabiec
0dc57db248 Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja"
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.

The version of the series that was merged was not the latest one. Revert
prior to merging the latest one.
2021-03-03 16:29:02 +01:00
Avi Kivity
facc7c370e Update tools/jmx submodule
* tools/jmx 8073af6...c2fc96b (1):
  > APIBuilder: Remove RW-lock in JMX server repository wrapper

Fixes #7991.
2021-03-03 15:41:09 +02:00
Avi Kivity
aae43e1a20 Merge 'Untyped_result_set: make non-copying and fragment retaining' from Calle Wilund
Refs #7961
Fixes #8014

The "untyped_result_set" object was created for small, internal access to cql-stored metadata.
It is nowadays used for rather more than that (cdc).
This has the potential of mixing badly with the fact that the type does deep copying of data
and linearizes all (not to mention handles multiple rows rather inefficiently).

Instead of doing a deep copy of input, we assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is and
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique

Closes #8015

* github.com:scylladb/scylla:
  untyped_result_set: Do not copy data from input store (retain fragmented views)
  result_generator: make visitor callback args explicit optionals
  listlike_partial_deserializing_iterator: expose templated collection routines
2021-03-03 13:13:18 +02:00
Nadav Har'El
4e3db5297a cql-pytest: rework tests for filtering leaving out most rows
Previously, we had two tests demonstrating issue #7966. But since then,
our understanding of this issue has improved which resulted in issue #8203,
so this patch improves those tests and makes them reproduce the new issue.

Importantly, we now know that this problem is not specific to a full-table
scan, and also happens in a single-partition scan, so we fix the test to
demonstrate this (instead of the old test, which missed the problem so
the test passed).

Both tests pass on Cassandra, and fail on Scylla.

Refs #8203.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>
2021-03-03 11:22:08 +01:00
Calle Wilund
e4d6c8904f untyped_result_set: Do not copy data from input store (retain fragmented views)
Refs #7961
Fixes #8014

Instead of doing a deep copy of input, we assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is and
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique
2021-03-03 10:19:46 +00:00
Calle Wilund
353730d4bb result_generator: make visitor callback args explicit optionals
This allows a visitor to separate temporaries (non-optional views)
from store backed views (optionals) when traversing.
2021-03-03 10:19:46 +00:00
Calle Wilund
bba43ce31a listlike_partial_deserializing_iterator: expose templated collection routines
To allow using fragmented types as input.
2021-03-03 10:19:46 +00:00
Nadav Har'El
0fea089b37 Merge 'Fix reading whole requests during shedding' from Piotr Sarna
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
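
The draining behaviour can be sketched in Python with an in-memory stream. The framing here is deliberately simplified and hypothetical (a 2-byte stream id followed by a 4-byte body length); it is not the actual CQL frame layout, and `MAX_BODY`, `frame` and `serve` are illustrative names.

```python
import io
import struct

HEADER = struct.Struct(">HI")  # hypothetical header: stream id, body length
MAX_BODY = 16                  # illustrative size limit

def frame(stream, body):
    return HEADER.pack(stream, len(body)) + body

def serve(sock):
    replies = []
    while True:
        raw = sock.read(HEADER.size)
        if len(raw) < HEADER.size:
            return replies
        stream, length = HEADER.unpack(raw)
        if length > MAX_BODY:
            # Drain the body so the next read starts at a valid header,
            # and answer on the request's own stream id.
            sock.read(length)
            replies.append((stream, "error: too large"))
            continue
        replies.append((stream, sock.read(length)))
```

Without the `sock.read(length)` in the shedding branch, the next header would be parsed out of the middle of the oversized body.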

Fixes #8193

Closes #8194

* github.com:scylladb/scylla:
  cql-pytest: add a shedding test
  transport: return error on correct stream during size shedding
  transport: return error on correct stream during shedding
  transport: skip the whole request if it is too large
  transport: skip the whole request during shedding
2021-03-03 08:52:48 +02:00
Piotr Sarna
4499f89916 cql-pytest: add a shedding test
This scylla-only test case tries to push a too-large request
to Scylla, and then retries with a smaller request, expecting
a success this time.

Refs #8193
2021-03-03 07:08:55 +01:00
Pekka Enberg
310b5c9592 redis: Fix license text in server.hh
The search and replace pattern went a bit overboard. Let's fix up the
license text.
Message-Id: <20210302171150.3346-1-penberg@scylladb.com>
2021-03-03 07:06:45 +01:00
Dejan Mircevski
05497fe14d cql3/maps: Drop redundant if condition
Accidentally introduced in 9eed26ca3d, it can never be true due to
code above it.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8201
2021-03-03 07:06:45 +01:00
Nadav Har'El
d6335b7fda test/alternator: better tests of oversized requests
Like DynamoDB, Alternator rejects requests larger than some fixed maximum
size (16MB). We had a test for this feature - test_too_large_request,
but it was too blunt, and missed two issues:

Refs #8195
Refs #8196

So this patch adds two better tests that reproduce these two issues:

First, test_too_large_request_chunked verifies that an oversized request
is detected even if the body is sent with chunked encoding.

Second, both tests - test_too_large_request_chunked and
test_too_large_request_content_length - verify that the rather limited
(and arguably buggy) Python HTTP client is able to read the 413 status
code - and doesn't report some generic I/O error.

Both tests pass on DynamoDB, but fail on Alternator because of these two
open issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Nadav Har'El
c6ca1ec643 cql-pytest: add reproducers for two filtering-related issues
The main goal of this patch is to add a reproducer for issue #7966, where
partition-range scan with filtering that begins with a long string of
non-matches aborts the query prematurely - but the same thing is fine with
a single-partition scan. The test, test_filtering_with_few_matches, is
marked as "xfail" because it still fails on Scylla. It passes on Cassandra.

I put a lot of effort into making this reproducer *fast* - the dev-build
test takes 0.4 seconds on my laptop. Earlier reproducers for the same
problem took as much as 30 seconds, but 0.4 seconds turns this test into
a viable regression test.

We also add a test, test_filter_on_unset, which reproduces issue #6295 (or
the duplicate #8122), which was already solved so this test passes.

Refs #6295
Refs #7966
Refs #8122

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Calle Wilund
58489dc003 cql3::restrictions: Add SCYLLA_CLUSTERING_BOUND keyword for sstableloader
Refs #8093
Refs /scylladb/scylla-tools-java#218

Adds keyword that can preface value tuples in (a, b, c) > (1, 2, 3)
expressions, forcing the restriction to bypass column sort order
treatment, and instead just create the raw ck bounds accordingly.

This is a very limited, and simple version, but since we only need
to cover this above exact syntax, this should be sufficient.

v2:
* Add small cql test
v3:
* Added comment in multi_column_restriction::slice, on what "mode" means and is for
* Added small document of our internal CQL extension keywords, including this.
v4:
* Added a few more cases to tests to verify multi-column restrictions
* Reworded docs a bit
v5:
* Fixed copy-paste error in comment
v6:
* Added negative (error) test cases
v7:
* Added check + reject of trying to combine SCYLLA_CLUST... slice and
  normal one

Closes #8094
2021-03-03 07:06:45 +01:00
Pekka Enberg
9d54a3e743 redis: Remove seastar namespace import from query_processor.hh 2021-03-02 18:39:30 +02:00
Pekka Enberg
27c5041c86 redis: Switch to seastar::sharded<> in query_procesor.hh 2021-03-02 18:38:41 +02:00
Pekka Enberg
ee8fe53b3c redis: Remove seastar namespace import from query_utils.hh 2021-03-02 18:37:31 +02:00
Pekka Enberg
c90c1ccd44 redis: Remove seastar namespace import from reply.hh 2021-03-02 18:36:30 +02:00
Pekka Enberg
1075d72780 redis: Remove commented out code from options.hh 2021-03-02 18:34:46 +02:00
Pekka Enberg
1c222fda65 redis: Remove seastar namespace import from options.hh 2021-03-02 18:34:30 +02:00
Pekka Enberg
d452c8f42e redis: Remove seastar namespace import from service.hh 2021-03-02 18:33:31 +02:00
Pekka Enberg
d0594a86aa redis: Switch to seastar::sharded<> in service.{hh,cc} 2021-03-02 18:30:39 +02:00
Avi Kivity
ee9db75210 Merge 'Clean up Redis transport layer' from Pekka Enberg
The Redis transport layer seems to have originated as a copy-paste of
the CQL transport layer. This pull request removes a bunch of unused
and commented out bits of code, and also does some minor cleanups
like organizing includes, to make the code more readable.

Closes #8198

* github.com:scylladb/scylla:
  redis: Remove unused to_bytes_view() function from server.cc
  redis: Remove unused tracing_request_type enum
  redis: Remove unneeded connection friend declaration
  redis: Remove unused process_request_executor friend declaration
  redis: Remove unused _request_cpu class member
  redis: Remove commented out code from server.hh
  redis: Remove duplicate request.hh include
  redis: Remove unused db::config forward declaration
  redis: Remove unused fmt_visitor forward declaration
  redis: Organize includes in server.{cc,hh}
  redis: Switch to seastar::sharded<>
  redis: Remove redundant access modifiers from server.hh
2021-03-02 18:27:38 +02:00
Pekka Enberg
097aaa6dc2 redis: Remove unneeded include from keyspace_utils.hh 2021-03-02 18:16:29 +02:00
Pekka Enberg
7f4de3f915 redis: Remove seastar namespace import from keyspace_utils.hh 2021-03-02 18:15:37 +02:00
Pekka Enberg
bf47b58b8a redis: Remove seastar namespace import from command_factory.hh 2021-03-02 18:13:49 +02:00
Pekka Enberg
92e257d5bd redis: Fix include path in command_factory.hh 2021-03-02 18:13:08 +02:00
Pekka Enberg
ac4b8e4534 redis: Remove unneeded includes from command_factory.hh 2021-03-02 18:12:30 +02:00
Piotr Dulikowski
376da49cf4 hints_manager: don't use commitlog hard space limit
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because they
use small-sized commitlogs to store hints (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.

If the 128MB limit is reached, the hints manager will get stuck. A
future which puts a hint into the commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.

By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
2021-03-02 16:53:50 +01:00
Piotr Sarna
8635094144 transport: return error on correct stream during size shedding
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.
2021-03-02 15:10:46 +01:00
Piotr Sarna
d6ea6937ee transport: return error on correct stream during shedding
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.
2021-03-02 15:10:46 +01:00
Pekka Enberg
01a785f561 redis: Remove unused to_bytes_view() function from server.cc 2021-03-02 14:29:52 +02:00
Pekka Enberg
fb6eecfae2 redis: Remove unused tracing_request_type enum 2021-03-02 14:29:52 +02:00
Pekka Enberg
8d79deb973 redis: Remove unneeded connection friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
ff81f7bc23 redis: Remove unused process_request_executor friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
87c5968602 redis: Remove unused _request_cpu class member 2021-03-02 14:29:51 +02:00
Pekka Enberg
11fa32e8c9 redis: Remove commented out code from server.hh 2021-03-02 14:29:51 +02:00
Pekka Enberg
ddab15c47f redis: Remove duplicate request.hh include 2021-03-02 14:29:51 +02:00
Pekka Enberg
07bd125a59 redis: Remove unused db::config forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
5a7e6b6c09 redis: Remove unused fmt_visitor forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
298bf19981 redis: Organize includes in server.{cc,hh} 2021-03-02 14:29:51 +02:00
Pekka Enberg
23c2f47054 redis: Switch to seastar::sharded<> 2021-03-02 14:29:51 +02:00
Pekka Enberg
7bd4ff9d75 redis: Remove redundant access modifiers from server.hh 2021-03-02 14:13:45 +02:00
Avi Kivity
5f4bf18387 Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros"
This reverts commit 31909515b3, reversing
changes made to ef97adc72a. It shows many
serious regressions in dtest.

Fixes #8197.
2021-03-02 13:21:22 +02:00
Takuya ASADA
870c3a28c1 scylla_setup: strip spaces of comma separated list
At the RAID prompt, we can type a disk list like this:
 /dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1

However, if the list contains spaces, it doesn't work:
 /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1

This is because the script mistakenly treats the space as part of a device path,
so we need to strip() each item of the input.
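
A minimal sketch of the fix described above. The helper name
`parse_disk_list` is an assumption made for illustration; the actual
scylla_setup code may be structured differently:

```python
def parse_disk_list(raw):
    """Split a comma separated device list, ignoring spaces around items."""
    # strip() each item so that " /dev/sdb1" is not treated as a device path;
    # items that are empty after stripping are dropped.
    return [item.strip() for item in raw.split(',') if item.strip()]


disks = parse_disk_list('/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1')
print(disks)
```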

Fixes #8174

Closes #8190
2021-03-02 12:48:18 +02:00
Piotr Sarna
4a24d7dca0 transport: skip the whole request if it is too large
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.
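
The idea can be sketched in Python as follows. This is not the actual
C++ transport code: `read_requests` and `max_body_size` are hypothetical
names, and the 9-byte header layout (version, flags, stream, opcode,
body length) is assumed from the CQL binary protocol v3 and later:

```python
import io
import struct

# version, flags, stream id, opcode, body length (big-endian, 9 bytes)
HEADER = struct.Struct('>BBhBi')


def read_requests(sock, max_body_size):
    """Yield (stream, opcode, body); body is None for shed requests."""
    while True:
        raw = sock.read(HEADER.size)
        if len(raw) < HEADER.size:
            return  # connection closed
        _version, _flags, stream, opcode, length = HEADER.unpack(raw)
        if length > max_body_size:
            # Shed the request, but still consume its body so that the
            # next iteration starts at a real frame header.
            skipped = 0
            while skipped < length:
                chunk = sock.read(min(4096, length - skipped))
                if not chunk:
                    return
                skipped += len(chunk)
            # The error response should reuse `stream`, not stream id 0.
            yield stream, opcode, None
            continue
        yield stream, opcode, sock.read(length)


# Demo: an oversized frame followed by a small one.
frames = (HEADER.pack(4, 0, 7, 0x07, 10) + b'x' * 10 +
          HEADER.pack(4, 0, 8, 0x07, 2) + b'ok')
for request in read_requests(io.BytesIO(frames), max_body_size=4):
    print(request)
```

Without the skipping loop, the second header would be parsed out of the
first request's leftover body bytes.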

Fixes #8193
2021-03-02 10:10:19 +01:00
Piotr Sarna
3eb7e768cb transport: skip the whole request during shedding
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Refs #8193
2021-03-02 10:10:19 +01:00
Avi Kivity
10364fca6e Merge "Build query::result directly in range scan queries" from Botond
"
Currently range scans build their results on the replica in the
`reconcilable_result` format, that -- as its name suggests -- is
normally used for reconciliation (read repair). As such this result
format is quite inefficient for normal queries: it contains all columns
and all tombstones in the requested range. These are all unnecessary for
normal queries which only want live data and only those columns that are
requested by the user.
Furthermore, as the coordinator works in terms of `query::result` for
normal queries anyway, this intermediate result has to be converted to
the final `query::result` format adding an unnecessary intermediate
conversion step.
This series gets rid of this problem by introducing
`query_data_on_all_shards()`, a variant of
`query_mutations_on_all_shards()` that builds `query::result` directly.
Reverse queries still use the old intermediate method behind the scenes.

Fixes #8061
Refs #7434

Tests: unit(release, debug)
"

* 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla:
  cql_query_test: add unit test for the more efficient range scan result format
  test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
  cql_query_test: test_query_limit: clean up scheduling groups
  storage_proxy: use query_data_on_all_shards() for data range scan queries
  query: partition_slice: add range_scan_data_variant option
  gms: add RANGE_SCAN_DATA_VARIANT cluster feature
  multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
  multishard_mutation_query: add query_data_on_all_shards()
  mutation_partition.cc: fix indentation
  query_result_builder: make it a public type
  multishard_mutation_query: generalize query code w.r.t. the result builder used
  multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
  multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
  multishar_mutation_query: do_query_mutations(): convert to coroutine
  multishard_mutation_query: read_page(): convert to coroutine
  multishard_mutation_query: extract page reading logic into separate method
2021-03-02 08:54:41 +02:00
Botond Dénes
257c295cff cql_query_test: add unit test for the more efficient range scan result format
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, also test single partition queries too.
2021-03-02 08:01:53 +02:00
Botond Dénes
af0a23e75c test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
To allow conveniently setting the scheduling group `func` is to be run
in.
2021-03-02 07:53:53 +02:00
Botond Dénes
fe280271a6 cql_query_test: test_query_limit: clean up scheduling groups
Destroy scheduling groups created for this test, so other tests can
create scheduling groups with the same name, without conflicts.
2021-03-02 07:53:53 +02:00
Botond Dénes
f8ce168c8e storage_proxy: use query_data_on_all_shards() for data range scan queries
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
   subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
   contains all data, including columns unselected by the query, dead
   rows and tombstones, which takes much more memory to build;

There is no reason to go through all this trouble, if there ever was one
in the past it doesn't stand anymore. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster support it, to avoid artificial
differences in page sizes due to how reconcilable result and
query::result calculate result size and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
2021-03-02 07:53:53 +02:00
Botond Dénes
f15551d23a query: partition_slice: add range_scan_data_variant option
Switching to the data variant of range scans has to be coordinated by
the coordinator to avoid replicas noticing the availability of the
respective feature at different times, resulting in some using the
mutation variant, some using the data variant.
So the plan is that it will be the coordinator's job to check the
cluster feature and set the option in the partition slice which will
tell the replicas to use the data variant for the query.
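
A rough sketch of this plan, using hypothetical Python stand-ins
(`PartitionSlice`, `prepare_range_scan`, `replica_variant`) rather than
the real C++ types:

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSlice:
    # Hypothetical stand-in for the option set carried by a partition slice.
    options: set = field(default_factory=set)


def prepare_range_scan(slice_, cluster_supports_feature):
    """Coordinator side: decide once, based on the cluster feature."""
    if cluster_supports_feature:
        slice_.options.add('range_scan_data_variant')
    return slice_


def replica_variant(slice_):
    """Replica side: obey the option, never check the feature locally."""
    if 'range_scan_data_variant' in slice_.options:
        return 'data'
    return 'mutation'


print(replica_variant(prepare_range_scan(PartitionSlice(), True)))
print(replica_variant(prepare_range_scan(PartitionSlice(), False)))
```

Since only the coordinator consults the feature, all replicas serving
one query use the same variant.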
2021-03-02 07:53:53 +02:00
Botond Dénes
5c84aa52db gms: add RANGE_SCAN_DATA_VARIANT cluster feature
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate page
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
2021-03-02 07:53:53 +02:00
Botond Dénes
0f0c3be63e multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
034cb81323 multishard_mutation_query: add query_data_on_all_shards()
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.

Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
df0f501ba2 mutation_partition.cc: fix indentation
Left broken from the previous patch.
2021-03-02 07:53:53 +02:00
Botond Dénes
950150c6df query_result_builder: make it a public type
We will want to use it in multishard_mutation_query.cc.
2021-03-02 07:53:53 +02:00
Botond Dénes
f19ab5cff1 multishard_mutation_query: generalize query code w.r.t. the result builder used
We want to add support to building `query::result` directly and reuse
the code path we use to build reconcilable result currently for it.
So templatize said code path on the result builder used. Since the
different result builders don't have a source level compatible interface
an adaptor class is used.
2021-03-02 07:53:53 +02:00
Botond Dénes
bddb0d35d6 multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.
2021-03-02 07:53:53 +02:00
Botond Dénes
b0b620b501 multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
5d85615698 multishar_mutation_query: do_query_mutations(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
8138bdb434 multishard_mutation_query: read_page(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used. This
change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
29195f67f1 multishard_mutation_query: extract page reading logic into separate method
The block of code moved also coincides with the scope in which the
reader has to be alive, making the code more clear.
2021-03-02 07:53:53 +02:00
Benny Halevy
baf5d05631 storage_service: use atomic_vector for lifecycle_subscribers
So it can be modified while walked to dispatch
subscribed event notifications.

In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.

Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.

Fixes #8143

Test: unit(release)
DTest:
  update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
  materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Benny Halevy
1ed04affab cql_server: event_notifier: unregister_subscriber in stop
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Avi Kivity
fe8f9039a2 Update seastar submodule
* seastar 803e790598...ea5e529f30 (3):
  > Merge "Teach io_tester to generate YAML output" from Pavel E
  > bitset: set_range: mark constructor constexpr
  > Update dpdk submodule
2021-03-01 20:34:35 +02:00
Avi Kivity
8747c684e0 Merge 'Move timeouts to client state' from Piotr Sarna
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests.

The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).

Closes #8140

* github.com:scylladb/scylla:
  treewide: remove timeout config from query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  service: add timeout config to client state
2021-03-01 20:34:35 +02:00
Raphael S. Carvalho
2cf0c4bbf1 compaction: Prevent cleanup and regular from compacting the same sstable
Due to regression introduced by 463d0ab, regular can compact in parallel a sstable
being compacted by cleanup, scrub or upgrade.

This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.

That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
cb0b8d1903 row_cache: Zap dummy entries when populating or reading a range
This will prevent accumulation of unnecessary dummy entries.

A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.

Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.

In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry since
it will not be needed, it will fall inside a broader continuous
range. This will be the case for time series workloads which scan with
a decreasing (newest) lower bound.

Refs #8153.

_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.

The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.

perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:

$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]

Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
761f89e55e api: Introduce system/drop_sstable_caches RESTful API
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.

Unlike lsa/compact, doesn't cause reactor stalls.

The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.

Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
2021-03-01 16:13:04 +02:00
Piotr Dulikowski
aa2df75321 commitlog: add an option to allow going over size limit
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.

This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
2021-03-01 14:16:05 +01:00
Takuya ASADA
d0297c599a dist: tune fs.aio-max-nr based on the number of cpus
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.

We need to tune the parameter based on the number of cpus, instead of
static setting.

Fixes #8133

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #8188
2021-03-01 14:18:24 +02:00
Avi Kivity
31909515b3 Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of a caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.

Fixes #2622

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8111

* github.com:scylladb/scylla:
  sstables: add test for checking the latency of updating the sstable_set in a table
  sstables: move column_family_test class from test/boost to test/lib
  sstables: use fast copying of the sstable_set instead of rebuilding it
  sstables: replace the sstable_set with a versioned structure
  sstables: remove potential ub
  sstables: make sstable_set constructor less error-prone
2021-03-01 14:16:36 +02:00
Avi Kivity
ef97adc72a Merge "Validate token monotonicity on the sstable write path" from Botond
"
We have recently seen out-of-order partitions getting into sstables
causing major disruption later on. Given the damage caused, it was again
raised that we should enable partition key monotonicity validation
unconditionally in the sstable write path. This was also raised in the
past but dismissed as key validation was suspected (but not measured) to
add considerable per-fragment overhead. One of the problems was that the
key monotonicity validation was all or nothing. It either validated all
(clustering and partition) key monotonicity or none of it.
This series takes a second look at this and solves the all-or-nothing
problem by making the configuration of the key monotonicity check more
fine grained, allowing for enabling just token monotonicity validation
separately, then enables it unconditionally.

Refs: #7623

Tests: unit(release)
"

* 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla:
  sstables: enable token monotonicity validation by default
  mutation_fragment_stream_validator: add token validation level
  mutation_fragment_stream_validating_filter: make validation levels more fine-grained
2021-03-01 11:23:51 +02:00
Amnon Heiman
0595596172 api/compaction_manager: add the compaction id in get_compaction
This patch adds the compaction id to the get_compaction structure.
While it was supported, it was not used and up until now wasn't needed.

After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions'
will include the compaction id.

Relates to #7927

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8186
2021-03-01 10:51:31 +02:00
Piotr Sarna
7936652322 db,view: improve verbosity of errors coming from view updates
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
        base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)

Fixes #8177

Closes #8178
2021-03-01 10:46:14 +02:00
Avi Kivity
86d8977c96 Update tools/python3 submodule
* tools/python3 199ac90...6f3bcbe (2):
  > Add support pip modules
  > create-relocatable-package.py: add support python libraries in /usr/local
2021-03-01 10:10:13 +02:00
Avi Kivity
8ac0d6d15d Update tools/jmx submodule
* tools/jmx bf8bb16...8073af6 (1):
  > CompactionManager: add the compaction id when available

Fixes #7927.
2021-03-01 10:09:16 +02:00
Takuya ASADA
4cf9b6988e scylla_coredump_setup: don't run apt-get when systemd-coredump is already installed
Check systemd-coredump existence before running apt-get install
systemd-coredump.

Closes #8185
2021-03-01 09:38:51 +02:00
Botond Dénes
f0b284dab8 sstables: enable token monotonicity validation by default
Partition key order violations in data written to sstables can be very
disruptive. All our components in the storage layers assume that
partitions are in order, which means that reading out-of-order
partitions triggers undefined behaviour. Computer scientists often joke
that undefined behaviour can erase your hard drive and in this case the
damage done by undefined behaviour caused by out-of-order partitions is
very close to that. The corruption is known to mutate causing crashes,
corrupting more data and even losing data. For this reason it is
imperative that out-of-order partitions cannot get into sstables. This
patch enables token monotonicity validation unconditionally in
the sstable writer. As partition key monotonicity checks involve a key
copy per partition, which might have an impact on the performance, we do
the next best thing instead and enable only token monotonicity
validation.
2021-03-01 07:49:23 +02:00
Botond Dénes
727bc0f5d4 mutation_fragment_stream_validator: add token validation level
In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
2021-03-01 07:49:23 +02:00
Botond Dénes
694f8a4ec6 mutation_fragment_stream_validating_filter: make validation levels more fine-grained
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.
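
To illustrate how separate levels could work, here is a sketch in
Python. The level names and `validate_stream` are assumptions made for
this example and need not match the C++ validator:

```python
from enum import IntEnum


class ValidationLevel(IntEnum):
    # Hypothetical level names; each level includes the checks below it.
    NONE = 0
    TOKEN = 1
    PARTITION_KEY = 2
    CLUSTERING_KEY = 3


def validate_stream(partitions, level):
    """partitions: iterable of (token, partition_key, clustering_keys)."""
    prev = None  # (token, partition_key) of the previous partition
    for token, key, clustering_keys in partitions:
        if prev is not None:
            # Cheap check: tokens must never go backwards.
            if level >= ValidationLevel.TOKEN and token < prev[0]:
                raise ValueError(f'token {token} is out of order')
            # Costlier check: needs the previous key to be kept around.
            if level >= ValidationLevel.PARTITION_KEY and (token, key) <= prev:
                raise ValueError(f'partition key {key!r} is out of order')
        if level >= ValidationLevel.CLUSTERING_KEY:
            pairs = zip(clustering_keys, clustering_keys[1:])
            if any(b <= a for a, b in pairs):
                raise ValueError(f'clustering keys of {key!r} out of order')
        prev = (token, key)


validate_stream([(1, 'a', [1, 2]), (1, 'b', [1]), (5, 'c', [])],
                ValidationLevel.CLUSTERING_KEY)
```

Two different keys that share a token can only be caught at the
partition key level, which is the gap token-only validation leaves open.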

As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
2021-03-01 07:49:23 +02:00
Avi Kivity
3cd2f00438 dht: convert token tri_compare to std::strong_ordering
Change token's tri_compare functions to return std::strong_ordering,
which is not convertible to bool and therefore not susceptible to
being misused where a less-compare is expected.

Two of the users (ring_position and decorated_key) have to undo
the conversion, since they still return int. A follow up will
convert them too.

Ref #1449.
2021-02-28 21:03:59 +02:00
Avi Kivity
d3d7698502 interval: support C++20 three-way comparisons
Allow the tri-comparator input to range functions to return
std::strong_ordering, e.g. the result of operator<=>. An int
input is still allowed, and coerced to std::strong_ordering by
tri-comparing it against zero. Once all users are converted, this
will be disallowed.

The clever code that performs boundary comparisons unfortunately
has to be dumbed down to conditionals. A helper
require_ordering_and_on_equal_return() is introduced that accepts
a comparison result between bound values, an expected comparison
result, and what to return if the bound value matches (this depends
on whether individual bounds are exclusive or inclusive, on
whether the bounds are start bounds or end bounds, and on the
sense of the comparison).

Unfortunately, the code is somewhat pessimized, and there is no
way to optimize it as the enum underlying std::strong_ordering
is hidden.
2021-02-28 21:03:25 +02:00
Avi Kivity
d980f550d1 Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```

Notice that max preemption latency is low in the second "read:" line.

Closes #8167

* github.com:scylladb/scylla:
  row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
  tests: perf: Introduce perf_row_cache_reads
  row_cache: Add metric for dummy row hits
2021-02-28 21:00:20 +02:00
Botond Dénes
1d9b5911fe time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This will lead to the sstable reader using a dangling
range and will result in returning no result for queries. This patch
fixes this bug and adds a unit test to cover this code path.

Fixes #8138.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
2021-02-26 23:57:25 +02:00
Botond Dénes
dd5a601aaa result_memory_accounter: abort unpaged queries hitting the global limit
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data than requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
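
The intended behavior after this patch can be sketched as below. The
function signature and the `QueryAborted` exception are hypothetical;
the real accounter is C++ and tracks more state:

```python
class QueryAborted(Exception):
    """Hypothetical error raised when an unpaged query is aborted."""


def check_and_update(used_local, used_global, local_limit, global_limit,
                     paged):
    """Return True to keep reading, False to end the current page."""
    over_limit = used_local > local_limit or used_global > global_limit
    if not over_limit:
        return True
    if paged:
        # A paged query can stop here and resume on the next page.
        return False
    # An unpaged query cannot resume, so a silent stop would return
    # less data than requested. Abort it instead, for both limits.
    raise QueryAborted('memory limit reached by unpaged query')


print(check_and_update(10, 10, 100, 100, paged=False))
print(check_and_update(10, 200, 100, 100, paged=True))
```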

Fixes: #8162

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
2021-02-26 23:43:16 +02:00
Botond Dénes
bc1fcd3db2 multishard_combining_reader: only read from needed shards
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. Walking in shard order was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
  they will produce; reading ahead on the next shard in shard order
  might not bring in data on the next shard the read will continue on.
  Shard order is only correct when starting a new range and shards are
  iterated over in the order they own tokens according to the sharding
  algorithm.
* Shards that may not have data relevant to the read range are also
  considered for read ahead.

After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.
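
The selection rule can be sketched as follows. This is a simplified Python illustration, not the reader's C++ code: tokens are plain integers, each shard contributes only the first token it is expected to produce, and the function and parameter names are hypothetical.

```python
import heapq


def shards_in_read_order(first_token_by_shard, range_end):
    """Return shard ids in the order the read would visit them.

    Shards whose first expected token lies past the end of the read range
    are never pushed to the heap, so they are neither read nor considered
    for read-ahead.
    """
    heap = [
        (token, shard)
        for shard, token in first_token_by_shard.items()
        if token <= range_end
    ]
    heapq.heapify(heap)
    order = []
    while heap:
        _, shard = heapq.heappop(heap)
        order.append(shard)
    return order


# Shard 2 has no data in the range and is skipped; the remaining shards
# are visited by first token, not by shard id.
print(shards_in_read_order({0: 50, 1: 10, 2: 200, 3: 30}, range_end=100))
```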

Fixes: #8161

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
2021-02-26 23:29:20 +02:00
Piotr Sarna
0e0282cdf1 Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).

Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is low-hanging fruit.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.

This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.

Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java

Closes #8172

* github.com:scylladb/scylla:
  cdc: move (most of) CDC generation management code to the new service
  cdc: coroutinize make_new_cdc_generation
  cdc: coroutinize update_streams_description
  cdc: introduce cdc::generation_service
  main: move cdc_service initialization just prior to storage_service initialization
2021-02-26 12:42:27 +01:00
Kamil Braun
e2f03e4aba cdc: move (most of) CDC generation management code to the new service
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

Previous commits have introduced a new service for managing CDC
generations. This commit moves most of the relevant code to this new
service.

However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
2021-02-26 12:06:12 +01:00
Tomasz Grabiec
b9c3b6c10f row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signalled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]

Notice that max preemption latency is low in the second "read:" line.
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
52e411df36 tests: perf: Introduce perf_row_cache_reads
Tests performance of various read patterns from the row cache.

Example:

$ build/release/test/perf/perf_row_cache_reads_g  -c1 -m200M
Filling memtable
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 156.288986 [ms], preemption: {count: 702, 99%: 0.545791 [ms], max: 0.537537 [ms]}, cache: 99/100 [MB]
read: 106.480766 [ms], preemption: {count: 6, 99%: 0.006866 [ms], max: 106.496168 [ms]}, cache: 99/100 [MB]
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
f0a3272a5f row_cache: Add metric for dummy row hits
This will help to diagnose performance problems related to the read
having to walk through a lot of dummy rows to fill the buffer.

Refs #8153
2021-02-25 18:26:01 +01:00
Piotr Sarna
c5214eb096 treewide: remove timeout config from query options
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
2021-02-25 17:20:27 +01:00
Piotr Sarna
f973e09454 cql3: use timeout config from client state instead of query options
... in batch statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
639d90d2d6 cql3: use timeout config from client state instead of query options
... in modification statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
b71665efe8 cql3: use timeout config from client state instead of query options
... in select statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
7ceafda70a service: add timeout config to client state
Future patches will use this per-connection timeout config
to allow setting different timeouts for each session,
based on roles.
2021-02-25 17:20:26 +01:00
Takuya ASADA
aabc67e386 dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name
dh_installinit --name <service> forces installation of debian/*.service
and debian/*.default files whose names do not match the package name.
If we have subpackages, the packager is responsible for renaming
debian/*.service to debian/<subpackage>.*service.

However, we are currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service.
The packaging system tries to find the destination package for the .service
file, does not find a subpackage name in it, and so picks the first
subpackage ordered by name, scylla-conf.

To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.

Fixes #8163

Closes #8164
2021-02-25 17:05:20 +02:00
Avi Kivity
032fdfe855 Update seastar submodule
* seastar e53a1059f9...803e790598 (9):
  > io_queue: Count total time spent in the queue
  > io_queue: Fix "delay" metrics
Fixes #8166.
  > file: expose disk offset alignment for overwrites
Ref #7663.
  > RPC: (client) retain local address and use on stream creation
  > rpc: sink_impl: align _{last,next}_seq_num to cache-line size
  > reactor: Fix outdated comment
  > fair_queue: Remove now dead ticket strictly_less method
  > io_queue: Double max request size
  > bitsets: set_iterator: correctly implement pre- and post-increment operators
2021-02-25 16:58:06 +02:00
Takuya ASADA
f3a82f4685 scylla_setup: allow running scylla_setup with strict umask setting
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod() with mode 0o644 after every file
creation so that the scylla user can read the files.
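
A minimal sketch of the pattern, assuming a hypothetical helper name (the actual scylla_setup code may be structured differently):

```python
import os


def write_world_readable(path, content):
    """Write a file, then force its mode to 0644 regardless of the umask.

    With a strict umask such as 0077, open() would create the file as 0600,
    which a different (non-root) user could not read. The explicit chmod
    makes the result independent of the caller's umask.
    """
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, 0o644)
```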

Note that perftune.yaml does not really need to be set to 0644, since
perftune.py runs as root, but we set it anyway to keep its permissions
aligned with the other files.

Fixes #8049

Closes #8119
2021-02-25 16:42:45 +02:00
Nadav Har'El
750d7903be cql-pytest: fix some comments in util.py
Fix some incorrect comments, pasted from other files or mentioning
wrong names. No other changes except comments.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210225133237.1403891-1-nyh@scylladb.com>
2021-02-25 16:00:20 +02:00
Raphael S. Carvalho
7bf0744d36 reshape/TWCS: Fix off-by-one in threshold check
A given time bucket should also be reshaped if its # of sstables
has reached the threshold.
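
The off-by-one can be pictured as a strict versus non-strict comparison. A minimal Python sketch, assuming the previous check used `>` (the message only says the check was off by one); the function name is hypothetical:

```python
def bucket_needs_reshape(sstable_count, threshold):
    """A time bucket qualifies once it has *reached* the threshold.

    A strict comparison (sstable_count > threshold) would skip a bucket
    holding exactly `threshold` sstables.
    """
    return sstable_count >= threshold
```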

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>
2021-02-24 15:12:40 +02:00
Raphael S. Carvalho
21608bd677 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
2021-02-24 15:11:19 +02:00
Tomasz Grabiec
ecb6c56a2a Merge 'lsa: background reclaim' from Avi Kivity
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and reclaim work can be done from a preemptible
context.

If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.

A unit test is added to verify it works.

Fixes #1634.

Closes #8044

* github.com:scylladb/scylla:
  test: logalloc_test: test background reclaim
  logalloc: reduce gap between std min_free and logalloc min_free
  logalloc: background reclaim
  logalloc: preemptible reclaim
2021-02-24 13:23:30 +01:00
Piotr Sarna
25f47561cb transport: fix an outdated comment
The comment mentions calling a lambda in-place, but the lambda
is no longer there since 2019!

Message-Id: <3903c84d5c151415409f28935e328b552dd548f8.1614155567.git.sarna@scylladb.com>
2021-02-24 11:14:01 +02:00
Avi Kivity
15d3797e97 test: logalloc_test: test background reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.

If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
2021-02-23 19:42:42 +02:00
Nadav Har'El
d905e71a90 Alternator: add support for CORS protocol
This patch adds to Alternator support for the CORS (Cross-Origin Resource
Sharing) protocol - a simple extension over the HTTP protocol which
browsers use when Javascript code contacts HTTP-based servers.

Although we usually think of Alternator as being used in a three-tier
application, in some setups there is no middle layer and the user's
browser, running Javascript code, wants to communicate directly with the
database. However, for security reasons, by default Javascript loaded
from domain X is not allowed to communicate with different domains Y.
The CORS protocol is meant to allow this, and Alternator needs to
participate in this protocol if it is to be used directly from Javascript
in browsers.

To implement CORS, Alternator needs to respond to the OPTIONS method
which it didn't allow before - with certain headers based on the
input headers. It also needs to do some of these things for the
regular methods (mostly, POST). The patch includes a comprehensive
test that runs against both Alternator and DynamoDB and shows that
Alternator handles these headers and methods the same as DynamoDB.
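
For readers unfamiliar with CORS, the preflight exchange looks roughly like the sketch below. The header names are the standard CORS ones; the concrete values, and the function itself, are illustrative assumptions and not necessarily what Alternator or DynamoDB return.

```python
def preflight_response_headers(request_headers):
    """Build response headers for an OPTIONS preflight request.

    `request_headers` is a dict of the incoming request's headers.
    A request without an Origin header is not a CORS request, so no
    CORS headers are added for it.
    """
    if "Origin" not in request_headers:
        return {}
    response = {
        # Assumption: allow any origin; a server may echo the Origin instead.
        "Access-Control-Allow-Origin": "*",
        # Echo the method the browser asked about, defaulting to POST.
        "Access-Control-Allow-Methods": request_headers.get(
            "Access-Control-Request-Method", "POST"
        ),
    }
    requested = request_headers.get("Access-Control-Request-Headers")
    if requested:
        # Tell the browser the headers it asked about are acceptable.
        response["Access-Control-Allow-Headers"] = requested
    return response
```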

Additionally, I tested manually a Javascript DynamoDB client - which
didn't work prior to this patch (the browser reported CORS errors),
and works after this patch.

Fixes #8025.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>
2021-02-23 13:15:03 +01:00
Asias He
7018377bd7 messaging_service: Move gossip ack message verb to gossip group
Fix a scheduling group leak:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

After the fix:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

Fixes #7986

Closes #8129
2021-02-23 10:10:00 +02:00
Tomasz Grabiec
fb1d3fe2cf table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.

Commit 9124a70 introduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.

This could cause the sstable writer to crash, or to generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-22 17:51:00 +02:00
Raphael S. Carvalho
81d773e5d8 compaction_manager: Redefine weight for better control of parallel compactions
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.

The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.

This is what we get with the current weight definition:
    weight=5	for sizes=[1K, 3K]
    weight=6	for sizes=[4K, 15K]
    weight=7	for sizes=[16K, 63K]
    weight=8	for sizes=[64K, 255K]
    weight=9	for sizes=[258K, 1019K]
    weight=10	for sizes=[1M, 3M]
    weight=11	for sizes=[4M, 15M]
    weight=12	for sizes=[16M, 63M]
    weight=13	for sizes=[64M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 12

Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.

To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.

Look at what we get with the new weight definition:
    weight=10	for sizes=[1K, 2M]
    weight=11	for sizes=[3M, 14M]
    weight=12	for sizes=[15M, 62M]
    weight=13	for sizes=[63M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 7
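
The two tables above are consistent with the following Python sketch. It assumes the weight is the floor of the base-4 logarithm and that the fixed tax is 1MB; the tax value is inferred from the second table, not stated in this message, and the function names are hypothetical.

```python
FIXED_SIZE_TAX = 1024 * 1024  # assumption: 1MB, inferred from the table above


def floor_log4(n):
    """Integer floor(log base 4 of n) for n >= 1, avoiding float rounding."""
    return (n.bit_length() - 1) // 2


def old_weight(total_size):
    """Weight as the log (base 4) of the total job size."""
    return floor_log4(total_size)


def new_weight(total_size):
    """Weight after adding the fixed tax to the size before taking the log."""
    return floor_log4(total_size + FIXED_SIZE_TAX)


K, M = 1024, 1024 * 1024
for size in (1 * K, 255 * K, 2 * M, 3 * M, 15 * M):
    print(size, old_weight(size), new_weight(size))
```

With the tax in place, every job below 1MB adds less than 1MB to a 1MB base, so the sum stays below 4MB and all such jobs share weight 10.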

Fixes #8124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
2021-02-22 15:50:29 +02:00
Asias He
554ab035dd main: Run init_server and join_cluster inside maintenance scheduling group
Currently, init_server and join_cluster which initiate the bootstrap and
replace operations on the new node run inside the main scheduling group.
We should run them inside the maintenance scheduling group to reduce the
impact on the user workload.

This patch fixes a scheduling group leak for bootstrap and replace operation.

Before:
[shard 0] storage_service - storage_service::bootstrap sg=main
[shard 0] repair - bootstrap_with_repair sg=main

After:
[shard 0] storage_service - storage_service::bootstrap sg=streaming
[shard 0] repair - bootstrap_with_repair sg=streaming

Fixes #8130

Closes #8131
2021-02-22 14:55:02 +02:00
Michał Chojnowski
a24f83852e atomic_cell: fix operator<< for atomic_cell_or_collection
operator<< used the wrong criterion for deciding whether the data is
stored as atomic_cell or collection_mutation, resulting in
catastrophic failure if it was used with frozen collections or UDTs.
Since frozen collections and UDTs are stored as atomic_cell, not
collection_mutation, the correct criterion is not is_collection(),
but is_multi_cell().

Closes #8134
2021-02-22 14:45:34 +02:00
Kamil Braun
022d7773f4 cdc: coroutinize make_new_cdc_generation 2021-02-22 12:47:44 +01:00
Kamil Braun
26ca9d6c33 cdc: coroutinize update_streams_description 2021-02-22 12:46:53 +01:00
Kamil Braun
d4937daaea cdc: introduce cdc::generation_service
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

The implementation is a stub for now: the service reacts to generation
changes by simply logging the event.

The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
2021-02-22 12:45:43 +01:00
Kamil Braun
8e72c33d7c main: move cdc_service initialization just prior to storage_service
initialization

As a preparation for introducing CDC generation management service.

cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.

The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
2021-02-22 12:43:10 +01:00
Liu Lan
d2378129a3 docs: fix invalid path in README.mds
Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #8126
2021-02-21 13:49:12 +02:00
Konstantin Osipov
95ee8e1b90 raft: fix spelling
Fix spelling of a few comments.
2021-02-19 22:56:26 +03:00
Pekka Enberg
d483922671 Update tools/java submodule
* tools/java 0187829d5e...142f517a23 (2):
  > nodetool: Enable resetlocalschema
  > sstableloader: Make progress printout less eager.
2021-02-19 12:37:04 +02:00
Avi Kivity
78d1afeabd Merge "Use radix tree to store cells on a row" from Pavel E
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes a waste of space, as its size is

   sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes

and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more than the size
of the structure itself.

Column ids are 32-bit integers that most likely come sequentially.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.

This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.
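
To illustrate the indexing idea only (a Python sketch, not the tree's actual C++ layout; it assumes the most significant chunk is consumed first, and the names are hypothetical):

```python
RADIX_BITS = 7
KEY_BITS = 32
RADIX_MASK = (1 << RADIX_BITS) - 1
LEVELS = -(-KEY_BITS // RADIX_BITS)  # ceil(32 / 7) == 5


def key_chunks(key):
    """Split a 32-bit key into 7-bit indexes, most significant chunk first.

    Each chunk selects a slot within the node at one level of the tree.
    """
    return [
        (key >> (RADIX_BITS * level)) & RADIX_MASK
        for level in reversed(range(LEVELS))
    ]


# Sequential column ids differ only in the last chunk until they cross a
# 128-id boundary, so they share the path through the upper levels.
for column_id in (0, 1, 127, 128):
    print(column_id, key_chunks(column_id))
```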

The most notable result is the memory footprint decrease, by up to 2x
for wide rows. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)

v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
  - changed functions not to return values via refs-arguments
  - fixed nested classes to properly use language constructors
  - renamed index_to to key_t to distinguish from node_index_t
  - improved recurring variadic templates not to use sentinel argument
  - use standard concepts

v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)

A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.

next todo:
- aarch64 version of 16-keys node search

tests: unit(dev), unit(debug for radix*), perf(dev)
"

* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
  test/memory_footpring: Print radix tree node sizes
  row: Remove old storages
  row: Prepare row::equal for switch
  row: Prepare row::difference for switch
  row: Introduce radix tree storage type
  row-equal: Re-declare the cells_equal lambda
  test: Add tests for radix tree
  utils: Compact radix tree
  array-search: Add helpers to search for a byte in array
  test/perf_collection: Add callback to check the speed of clone
  test/perf_mutation: Add option to run with more than 1 columns
  test/perf_mutation: Prepare to have several regular columns
  test/perf_mutation: Use builder to build schema
2021-02-18 21:19:14 +02:00
Nadav Har'El
02dde2aca1 cql-pytest: port Cassandra's unit test validation/entities/json_test
In this patch, we port validation/entities/json_test.java, containing
21 tests for various JSON-related operations - SELECT JSON, INSERT JSON,
and the fromJson() and toJson() functions.

In porting these tests, I uncovered 19 (!!) previously unknown bugs in
Scylla:

Refs #7911: Failed fromJson() should result in FunctionFailure error, not
            an internal error.
Refs #7912: fromJson() should allow null parameter.
Refs #7914: fromJson() integer overflow should cause an error, not silent
            wrap-around.
Refs #7915: fromJson() should accept "true" and "false" also as strings.
Refs #7944: fromJson() should not accept the empty string "" as a number.
Refs #7949: fromJson() fails to set a map<ascii, int>.
Refs #7954: fromJson() fails to set null tuple elements.
Refs #7972: toJson() truncates some doubles to integers.
Refs #7988: toJson() produces invalid JSON for columns with "time" type.
Refs #7997: toJson() is missing a timezone on timestamp.
Refs #8001: Documented unit "µs" not supported for assigning a "duration"
            type.
Refs #8002: toJson() of decimal type doesn't use exponents so can produce
            huge output.
Refs #8077: SELECT JSON output for function invocations should be
            compatible with Cassandra.
Refs #8078: SELECT JSON ignores the "AS" specification.
Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest
            error, not internal error.
Refs #8086: INSERT JSON cannot handle user-defined types with case-
            sensitive component names.
Refs #8087: SELECT JSON incorrectly quotes strings inside map keys.
Refs #8092: SELECT JSON missing null component after adding field to
            UDT definition.
Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY.

Due to these bugs, 8 out of the 21 tests here currently xfail and one
has to be skipped (issue #8100 causes the sanitizer to detect a use
after free, and crash Scylla).

As usual in this sort of test, all 21 tests pass when running against
Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>
2021-02-18 20:44:04 +02:00
Takuya ASADA
32d4ec6b8a scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_partitions() reports that / is /dev/root, aws_instance
mistakenly reports that the root partition is part of the ephemeral disks,
and RAID construction fails.
This patch resolves /dev/root to the actual device, which prevents the
error and reports the correct free disks.

Fixes #8055

Closes #8040
2021-02-18 20:25:45 +02:00
Avi Kivity
90a7f76fb6 Merge 'cdc: log: fix a use-after-free in process_bytes_visitor' from Michał Chojnowski
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Closes #8105

* github.com:scylladb/scylla:
  cdc: log: avoid an unnecessary copy
  cdc: log: fix use-after-free in process_bytes_visitor
2021-02-18 20:23:41 +02:00
Michał Chojnowski
96c22cf3f8 cdc: log: avoid an unnecessary copy
There is no need to copy `bytes_view` into `bytes` here.
2021-02-18 14:08:18 +01:00
Michał Chojnowski
8cc4f39472 cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Fixes #8117
2021-02-18 14:08:17 +01:00
Konstantin Osipov
32952a744a raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e49d5f89a5 raft: do not account for the same vote twice
While a conforming Raft implementation never sends a duplicate vote
from the same server, Raft's assumptions about the network permit
duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has already voted, and reject duplicate votes.
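
The bookkeeping can be sketched as follows (a Python illustration with hypothetical class and method names, covering a single, non-joint configuration only):

```python
class VoteTally:
    """Counts each member's vote at most once."""

    def __init__(self, members):
        self.members = set(members)
        self.voted = set()
        self.granted = 0

    def register_vote(self, server, granted):
        """Record a vote; return False if it was rejected."""
        if server not in self.members:
            return False  # vote from a non-member
        if server in self.voted:
            return False  # duplicate delivery of the same vote
        self.voted.add(server)
        if granted:
            self.granted += 1
        return True

    def has_majority(self):
        return self.granted > len(self.members) // 2
```

Without the `voted` set, a vote message delivered twice would be counted twice and could produce a false majority.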

A unit test follows.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
7ea064ac04 raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
4083026b65 raft: consistently use configuration from the log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
c4552ffb9a raft: add ostream serialization for enum vote_result 2021-02-18 16:04:44 +03:00
Konstantin Osipov
2ae04d8a47 raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:

Server  Match Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
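
The two commit indexes in this example can be reproduced with a short Python sketch (function names are hypothetical; the commit index of a configuration is taken as the highest index replicated on a majority of its members, and a joint configuration needs a majority in both halves):

```python
def quorum_match_index(members, match_index):
    """Highest index stored on a majority of `members`."""
    indexes = sorted((match_index[m] for m in members), reverse=True)
    # With n members a majority is n // 2 + 1 servers, so the answer is the
    # (n // 2 + 1)-th largest match index, i.e. position n // 2.
    return indexes[len(indexes) // 2]


def joint_commit_index(old_members, new_members, match_index):
    """In joint configuration both halves must have replicated the entry."""
    return min(
        quorum_match_index(old_members, match_index),
        quorum_match_index(new_members, match_index),
    )


match = {"A": 5, "B": 5, "C": 6, "D": 7, "E": 8}
print(joint_commit_index("AB", "ABCDE", match))
print(quorum_match_index("ABCDE", match))
```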
2021-02-18 16:04:44 +03:00
Konstantin Osipov
132db931da raft: add tracker test 2021-02-18 16:04:44 +03:00
Konstantin Osipov
6e3932bbc7 raft: tidy up follower_progress API
Make the API more explicit so it's available for testing.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
ed65a8635e raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e58a3e42ca raft: add a unit test for raft::log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect: if apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
cfe407b402 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, rename truncate_head()
to truncate_uncommitted().
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e0011c6e4d raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
805d52eb16 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug where _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
af8770da63 raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.
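
In sketch form (Python, with a hypothetical function name and the log reduced to a list of entry terms):

```python
def last_term(entry_terms, snapshot_term):
    """Term of the last log entry, falling back to the snapshot's term.

    Right after a snapshot that kept no trailing entries the log is empty,
    and the last known term is the one recorded in the snapshot.
    """
    if entry_terms:
        return entry_terms[-1]
    return snapshot_term
```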

A separate unit test for the issue will follow.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
04b4d97d6a raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
97a16c0f77 raft: extend single_node_is_quiet test 2021-02-18 16:04:43 +03:00
Avi Kivity
f0950e023d Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.

---

Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).

Closes #8116

* github.com:scylladb/scylla:
  tests: add a simple CDC cql pytest
  cdc: add config option to disable streams rewriting
  cdc: rewrite streams to the new description table
  cql3: query_processor: improve internal paged query API
  cdc: introduce no_generation_data_exception exception type
  docs: cdc: mention system.cdc_local table
  cdc: coroutinize do_update_streams_description
  sys_dist_ks: split CDC streams table partitions into clustered rows
  cdc: use chunked_vector for streams in streams_version
  cdc: remove `streams_version::expired` field
  system_distributed_keyspace: use mutation API to insert CDC streams
  storage_service: don't use `sys_dist_ks` before it is started
2021-02-18 12:49:43 +02:00
Kamil Braun
4bf28aad7a tests: add a simple CDC cql pytest 2021-02-18 11:44:59 +01:00
Kamil Braun
841f07e9b7 cdc: add config option to disable streams rewriting
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.

I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
2021-02-18 11:44:59 +01:00
Kamil Braun
9bdd000e97 cdc: rewrite streams to the new description table
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
2021-02-18 11:44:59 +01:00
Kamil Braun
4ef736a0a3 cql3: query_processor: improve internal paged query API
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.

This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
   is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
   size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
2021-02-18 11:44:59 +01:00
Kamil Braun
7c91894ddf cdc: introduce no_generation_data_exception exception type 2021-02-18 11:44:59 +01:00
Kamil Braun
99cc9b8051 docs: cdc: mention system.cdc_local table 2021-02-18 11:44:59 +01:00
Kamil Braun
44aab61aea cdc: coroutinize do_update_streams_description 2021-02-18 11:44:59 +01:00
Kamil Braun
67d4e5576d sys_dist_ks: split CDC streams table partitions into clustered rows
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
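
The two-phase procedure can be sketched as below. This is a toy in-memory model in Python, not the code in system_distributed_keyspace; `Replica`, `publish_generation`, and the `reachable` argument (the replicas that acknowledge the writes) are assumptions made for the illustration.

```python
class Replica:
    def __init__(self):
        self.streams = {}        # generation timestamp -> list of stream IDs
        self.timestamps = set()  # generation timestamps visible on this replica


def quorum(n: int) -> int:
    return n // 2 + 1


def publish_generation(replicas, reachable, ts, streams):
    """Two-phase insert; `reachable` are the replicas that ack the writes."""
    # Phase 1: write the streams and require acks from a quorum.
    for r in reachable:
        r.streams[ts] = list(streams)
    if len(reachable) < quorum(len(replicas)):
        raise RuntimeError("streams not on a quorum; timestamp not published")
    # Phase 2: only now make the generation visible to clients.
    for r in reachable:
        r.timestamps.add(ts)
```

In this model, a client that finds `ts` on any single replica can rely on at least a quorum of replicas holding the streams for `ts`, because the timestamp is never written before phase 1 succeeds.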
2021-02-18 11:44:59 +01:00
Kamil Braun
ba920361b3 cdc: use chunked_vector for streams in streams_version
The vector may get quite long (say... 1.6M stream IDs). We prevent a
large allocation by using utils::chunked_vector.
2021-02-18 11:44:59 +01:00
Kamil Braun
9ae4467970 cdc: remove streams_version::expired field
This field was not used anywhere.
2021-02-18 11:44:59 +01:00
Kamil Braun
3d7b990300 system_distributed_keyspace: use mutation API to insert CDC streams
The `storage_proxy::mutate` low-level API is much more powerful than
the CQL API. This power is not needed for this commit but for the next.
2021-02-18 11:44:59 +01:00
Kamil Braun
0df15ca8cc storage_service: don't use sys_dist_ks before it is started
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.

Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be caught by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.

Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
2021-02-18 11:44:59 +01:00
Tomasz Grabiec
f94f70cda8 Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev/raft-confchange-test:
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::length()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-02-18 10:55:59 +01:00
Raphael S. Carvalho
5206a97915 compaction: Fix leak of expired sstable in the backlog tracker
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it has to be.

to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.

Fixes #6054.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
2021-02-18 11:12:00 +02:00
Takuya ASADA
d7f202f900 dist/debian: fix renaming debian/scylla-* files rule
The current renaming rule for debian/scylla-* files is buggy: it fails to
install some .service files when a custom product name is specified.

Introduce regex-based rewriting instead of ad hoc renaming, and fix the
wrong renaming rule.

Fixes #8113

Closes #8114
2021-02-18 10:35:19 +02:00
Pekka Enberg
843bf57c3c Update tools/jmx submodule
* tools/jmx 949cefc...bf8bb16 (1):
  > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA
2021-02-18 10:35:00 +02:00
Botond Dénes
c3b4c3f451 evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
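
A minimal sketch of the interaction described above, in Python rather than the reader's actual C++. Ranges are modelled as inclusive integer pairs, and the class and method names (`EvictableReaderSketch`, `recreate_after`, `validate`) are hypothetical; only `_pr` and `_range_override` correspond to fields named in this message.

```python
class EvictableReaderSketch:
    def __init__(self, pr):
        self.pr = pr                # range requested by the user (lo, hi)
        self.range_override = None  # narrowed range after a recreate

    def effective_range(self):
        # When engaged, the override wins over `pr`.
        if self.range_override is not None:
            return self.range_override
        return self.pr

    def recreate_after(self, last_key):
        # Skip the partitions that were already read.
        lo, hi = self.effective_range()
        self.range_override = (max(lo, last_key + 1), hi)

    def fast_forward_to(self, pr):
        self.pr = pr
        self.range_override = None  # the fix: drop the stale override

    def validate(self, key):
        lo, hi = self.effective_range()
        return lo <= key <= hi
```

Without the reset in `fast_forward_to()`, `validate()` would check the first key of the new range against the old, narrowed range and reject it.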

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
2021-02-17 19:11:00 +02:00
Benny Halevy
4b46793c19 row_cache: scanning_and_populating_reader: add _read_next_partition flag
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.

Then, read_next_partition can close _reader only before overwriting it
with a new reader.  Otherwise, if _reader is always closed in the
`reader_finished` case, we end up hitting premature end_of_stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
2021-02-17 19:06:21 +02:00
Benny Halevy
57540dae42 mutation_query: mark reconcilable_result_builder constructor noexcept
With result_memory_accounter being nothrow move constructible,
reconcilable_result_builder does not throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>
2021-02-17 18:56:12 +02:00
Benny Halevy
92e0e84ee5 database: futurize remove
In preparation for futurizing the querier_cache api.

Coroutinize drop_column_family while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>
2021-02-17 18:52:53 +02:00
Benny Halevy
5263ab0e9d row_cache: read_context: use query-request is_single_partition helper
Rather than hand-coding the same logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>
2021-02-17 18:29:39 +02:00
Benny Halevy
35256d1b92 treewide: explicitly use flat_mutation_reader_opt
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.

Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
2021-02-17 17:57:34 +02:00
Avi Kivity
c63e26e26f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator
2021-02-17 15:43:53 +02:00
Piotr Jastrzebski
649f254863 cdc: Limit size of topology description
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-17 13:24:40 +01:00
Avi Kivity
001652815c Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578

Closes #8106

* github.com:scylladb/scylla:
  imr: switch back to open-coded description of structures
  utils: managed_bytes: add a few trivial helper methods
  utils: fragment_range: move FragmentedView helpers to fragment_range.hh
  utils: fragment_range: add single_fragmented_mutable_view
  utils: fragment_range: implement FragmentRange for fragment_range
  utils: mutable_view: add front()
  types: remove an unused helper function
  test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
  test: mutation_test: remove an obsolete assertion
  test: mutation_test: initialize an uninitialized variable
  test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
2021-02-17 13:40:16 +02:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
25a9569cc4 utils: managed_bytes: add a few trivial helper methods
We will use them in the upcoming IMR removal patch.
2021-02-16 23:43:07 +01:00
Michał Chojnowski
3f248ca7cc utils: fragment_range: move FragmentedView helpers to fragment_range.hh
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
8a06a576aa utils: fragment_range: add single_fragmented_mutable_view
We will use it later in the upcoming IMR removal patch.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
7b662b9315 utils: fragment_range: implement FragmentRange for fragment_range
This will allow us to pass FragmentedView instances to places where
FragmentRange is expected.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
f972f90193 utils: mutable_view: add front()
We will use it in the upcoming patches.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
9e591c6634 types: remove an unused helper function 2021-02-16 21:35:14 +01:00
Michał Chojnowski
6b8a69e01f test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
5b79d6ca4c test: mutation_test: remove an obsolete assertion
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
aa60f28a09 test: mutation_test: initialize an uninitialized variable
It was assumed to be zero-initialized, but C++ does not guarantee that.
It has to be initialized explicitly.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
52bd190bb3 test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
However, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.

Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.

The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
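
To see why the test passed only by coincidence, consider this toy model in Python. The batch size of 4 comes from the message above; the `MergerSketch` class and its methods are hypothetical and do not mirror mutation_reader_merger's real interface.

```python
class MergerSketch:
    BATCH = 4  # sstables are released in batches of this size

    def __init__(self):
        self.pending = []   # fully processed, not yet released
        self.released = []

    def on_exhausted(self, sst):
        self.pending.append(sst)
        if len(self.pending) >= self.BATCH:
            self.released.extend(self.pending)
            self.pending.clear()

    def fill_buffer(self, ssts):
        # Process a group of sstables in one call; the test observes
        # the state only between calls.
        for sst in ssts:
            self.on_exhausted(sst)
        return len(self.pending)
```

When every call processes a multiple of 4 sstables, an observer between calls never sees a processed-but-unreleased sstable; any other count leaves some pending.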
2021-02-16 21:35:14 +01:00
Konstantin Osipov
d293966366 raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
3478389d60 raft: do not account for the same vote twice
While duplicate votes are not allowed by Raft rules, it is possible
that a vote message is delivered multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has already voted, and reject duplicate votes.

A unit test follows.
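
The bookkeeping can be sketched like this. It is a simplified single-configuration model in Python, not the actual raft votes class; `Votes`, `register_vote`, and `won` are illustrative names.

```python
class Votes:
    def __init__(self, members):
        self.members = set(members)
        self.voted = set()  # everyone who has responded, granted or not
        self.granted = 0

    def register_vote(self, voter, granted: bool) -> bool:
        """Return True if the vote was counted."""
        if voter not in self.members:
            return False    # vote from a non-member
        if voter in self.voted:
            return False    # the same message was delivered again
        self.voted.add(voter)
        if granted:
            self.granted += 1
        return True

    def won(self) -> bool:
        return self.granted >= len(self.members) // 2 + 1
```

A re-delivered grant from the same server leaves `granted` unchanged, so a candidate cannot reach a majority on repeated messages alone.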
2021-02-16 23:15:16 +03:00
Konstantin Osipov
ffd38de5fe raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
b941ca9bae raft: consistently use configuration from the log 2021-02-16 23:15:16 +03:00
Konstantin Osipov
75eddaf493 raft: add ostream serialization for enum vote_result 2021-02-16 23:15:16 +03:00
Konstantin Osipov
e099003c7c raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
Server stable indexes are:

Server  Stable Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
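
The numbers above can be reproduced with a short calculation. This is a sketch in Python with hypothetical helper names (`committed`, `joint_committed`); it assumes the commit index of a configuration is the highest index stored on a majority of its members, and that a joint configuration needs a majority in both halves.

```python
def committed(config, stable):
    # Highest index that is stored on a majority of `config`.
    idxs = sorted((stable[s] for s in config), reverse=True)
    return idxs[len(config) // 2]


def joint_committed(first, second, stable):
    # In a joint configuration both halves must agree.
    return min(committed(first, stable), committed(second, stable))


stable = {"A": 5, "B": 5, "C": 6, "D": 7, "E": 8}
small = ["A", "B"]
large = ["A", "B", "C", "D", "E"]

print(joint_committed(small, large, stable))  # 5
print(committed(large, stable))               # 6
```

The jump from 5 to 6 depends only on which configuration is used, which is why the commit index can advance as soon as the joint configuration is left.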
2021-02-16 23:15:16 +03:00
Konstantin Osipov
1bdb3fc8a9 raft: add tracker test 2021-02-16 23:15:16 +03:00
Konstantin Osipov
63965f46f4 raft: tidy up follower_progress API
Make the API more explicit so it's available for testing.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
74879fab09 raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-16 23:15:12 +03:00
Konstantin Osipov
6ee3aedcc2 raft: add a unit test for raft::log 2021-02-16 23:12:01 +03:00
Konstantin Osipov
c35f029be1 raft: rename log::non_snapshoted_length() to log::length()
The old name was incorrect: in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_length(), it's a more
specific name.
2021-02-16 23:12:01 +03:00
Konstantin Osipov
9e1a652805 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, truncate_head() can be
called simply truncate().
2021-02-16 23:10:58 +03:00
Konstantin Osipov
f7fb788edf raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-16 23:07:58 +03:00
Konstantin Osipov
7236f081c1 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-16 23:06:23 +03:00
Konstantin Osipov
59ea383c7d raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-16 21:07:05 +03:00
Konstantin Osipov
6c14775b20 raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-16 21:05:44 +03:00
Nadav Har'El
946e63ee6e cql-pytest: remove "xfail" tag from two passing tests
Issue #7595 was already fixed last week, in commit
b6fb5ee912, so the two tests which failed
because of this issue no longer fail and their "xfail" tag can be removed.

Refs #7595.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>
2021-02-16 19:17:22 +02:00
Nadav Har'El
737c1c6cc7 cql-pytest: Additional JSON tests
This patch adds several additional tests to test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.

First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).

We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.

Second, this patch adds a test that verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.

Refs #8087
Refs #8092

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
2021-02-16 16:05:31 +01:00
Avi Kivity
2f3b265dac Update seastar submodule
* seastar 76cff58964...e53a1059f9 (18):
  > rpc: streaming sink: order outgoing messages
Fixes #7552.
  > http: fix compilation issues when using clang++
  > http/file_handler: normalize file-type for mime detection
  > http/mime_types: add support for svg+xml
  > reactor: simplify get_sched_stats()
  > Merge "output_stream: make api noexcept" from Benny
  > Merge " input_stream: make api noexcept" from Benny
  > rpc: mark 'protocol' class as final
  > tls: reloadable_certificate inotify flag is wrong
Fixes #8082.
  > cli: Ignore the --num-io-queues option
  > io_queue: Do not carry start time in lambda capture
  > fstream: Cancel all IO-s on file_data_source_impl close
  > http: add "Transfer-Encoding: chunked" handling
  > http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
  > http: add request content streaming
  > http: add reading/skipping all bytes in an input_stream
  > Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
  > seastar-addr2line: split multiple addresses on the same line
2021-02-16 16:19:26 +02:00
Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server as a way to avoid a
is unfortunate, as protocol.hh wasn't designed for inheritance, and
is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.

Closes #8084
2021-02-16 16:04:44 +02:00
Gleb Natapov
c9392095ce cql3: store cf_prop_defs as optional instead of shared_ptr
It being a shared_ptr is a remnant of the translation from Java.
Message-Id: <20210216123931.80280-3-gleb@scylladb.com>
2021-02-16 15:58:38 +02:00
Gleb Natapov
805da054e7 cql3: store cf_name as optional in cf_statement instead of shared_ptr
It being a shared_ptr is a remnant of the translation from Java.
Message-Id: <20210216123931.80280-2-gleb@scylladb.com>
2021-02-16 15:58:37 +02:00
Gleb Natapov
6335af625e cql3: assert that unengaged optional is not accessed in keyspace_element_name::get_keyspace()
Message-Id: <20210216085545.54753-2-gleb@scylladb.com>
2021-02-16 15:36:00 +02:00
Gleb Natapov
200ca974c3 Do not access potentially unengaged optional in keyspace_element_name
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
2021-02-16 15:35:59 +02:00
Botond Dénes
4d309fc34a repair: row_level: invoke on_internal_error() on out-of-order partitions
repair_writer::do_write() already has a partition compare for each
mutation fragment written, to determine whether the fragment belongs to
another partition or not. This equal compare can be converted to a
tri_compare at no extra cost, allowing for detecting out-of-order
partitions, in which case `on_internal_error()` is called.
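
The idea can be sketched as follows, in Python rather than the writer's C++. `WriterSketch` and its fields are hypothetical, partition keys are plain integers, and a `RuntimeError` stands in for `on_internal_error()`.

```python
def tri_compare(a, b) -> int:
    # Negative, zero or positive, like a three-way comparison.
    return (a > b) - (a < b)


class WriterSketch:
    def __init__(self):
        self.current = None          # key of the partition being written
        self.partitions_started = 0

    def write(self, partition_key):
        if self.current is None:
            cmp = 1                  # first fragment always opens a partition
        else:
            cmp = tri_compare(partition_key, self.current)
        if cmp < 0:
            # An equality check would treat this as "just another partition".
            raise RuntimeError("out-of-order partition")
        if cmp > 0:
            self.current = partition_key
            self.partitions_started += 1
        # cmp == 0: fragment belongs to the current partition
```

The same single comparison answers both questions: whether a new partition starts and whether the ordering was violated.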

Refs: #7623
Refs: #7552

Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
2021-02-16 15:31:40 +02:00
Benny Halevy
50ca693a02 main: disable stall detector during startup
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.

Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just reference
the respective github issues rather than auto-close them.

Refs #7150
Refs #5192
Refs #5960

Restore blocked_reactor_notify_ms right before
starting storage_proxy.  Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.

Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
2021-02-16 13:28:31 +02:00
Tomasz Grabiec
446ea07ac6 Merge "raft: server instance init and raft RPC handlers" from Pavel Solodovnikov
This series provides a `raft_services` class to create and store
raft schema changes server instances, and also wires up the RPC handlers
for Raft RPC verbs.

* manmanson/raft-api-server-handlers-v10:
  raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
  raft: move server address handling from `raft_rpc` to `raft_services` class
  raft: wire up schema Raft RPC handlers
  raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
  raft: pass `group_id` as an argument to raft rpc messages
  raft: use a named constant for pre-defined schema raft group
2021-02-16 11:14:50 +01:00
Pavel Solodovnikov
1ada0abf81 raft: share raft_gossip_failure_detector instance across multiple raft rpc instances
Store an instance inside `raft_services` and reuse it for
all raft groups created and managed by `raft_services` instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:12 +03:00
Pavel Solodovnikov
8c2a904dc8 raft: move server address handling from raft_rpc to raft_services class
This decouples `raft_gossip_failure_detector` from any
particular rpc instance and thus makes it possible
to share the same failure detector instance among all raft servers,
since they are managed in a centralized way by a `raft_services`
instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:06 +03:00
Pavel Solodovnikov
63cdf4694d raft: wire up schema Raft RPC handlers
This patch adds registration and de-registration of the
corresponding Raft RPC verbs handlers.

There is a new `raft_services` class that
is responsible for initializing the raft RPC verbs and
managing raft server instances.

The service inherits `seastar::peering_sharded_service<T>`,
because we need to route the request to the appropriate shard
which is handled by the `shard_for_group` function (currently
only handling schema raft group to land on shard 0, otherwise
throws an exception).
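
The routing rule described above can be sketched as follows. This is an illustration in Python, not the actual C++ code; the constant `SCHEMA_RAFT_GROUP_ID` and its value are hypothetical, as is the choice of `ValueError` for the thrown exception.

```python
SCHEMA_RAFT_GROUP_ID = 0  # hypothetical value; the real constant is not given here


def shard_for_group(group_id):
    """Return the shard that should handle requests for a raft group."""
    if group_id == SCHEMA_RAFT_GROUP_ID:
        return 0  # the schema raft group always lands on shard 0
    # No other groups are handled at this point in the series.
    raise ValueError(f"unsupported raft group id: {group_id}")
```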

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:08:59 +03:00
Nadav Har'El
1e1cbaf589 docs/alternator: clean up description of DynamoDB compatibility
We had Alternator's current compatibility with DynamoDB described in
two places - alternator.md and compatibility.md. This duplication was
not only unnecessary, in some places it led to inconsistent claims.

In general, the better description was in compatibility.md, so in
this patch we remove the compatibility section from alternator.md
and instead link to compatibility.md. There was a bit of information
that was missing in compatibility.md, so this patch adds it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>
2021-02-16 08:48:28 +01:00
Pavel Emelyanov
9baf1226dc test/memory_footprint: Print radix tree node sizes
After switching cells storage onto compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches to an rbtree with
24 bytes of per-cell overhead, and the radix tree improvement
takes off:

         64     7872        6056
        128    15040        9512
        256    29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)
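
For reference, the percentages in the table follow from the rounded tps values shown. A small sketch of the arithmetic (the helper name `percent_change` is made up for this note):

```python
def percent_change(master, patch):
    """Relative change of `patch` over `master`, in percent, one decimal."""
    return round((patch - master) / master * 100, 1)


one_column = percent_change(59.9, 57.6)    # the 1-column row
many_columns = percent_change(53.2, 57.4)  # the 32-column row
```

With the table's numbers this gives -3.8 for 1 column and 7.9 for 32 columns.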

A note on this. Last time the 1-column test was ~5% worse, which
was explained by the inline storage of 5 cells that's present in
the current implementation and was absent in the radix tree.

An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement from v2 surprisingly came from
tree::clone_from(), which in v2 was worked around by a slow
walk+emplace sequence, while this version has the optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Pavel Emelyanov
2053b1c202 row: Prepare row::equal for switch
Same as the previous patch, re-implement the row::equal to use
the radix_tree iterator for comparison of two index:cell sequences.

The std::equal() doesn't work here, since the predicate-fn needs
access to both iterators in order to call it.key() on them (a radix
tree API feature), while std::equal passes it only the T&s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:31:52 +03:00
Pavel Emelyanov
b5527b3635 row: Prepare row::difference for switch
The method effectively walks two sequences of <column_id, cell> pairs and
applies the difference to a separate row instance. The code added
is a copy of the same code below this hunk with the mechanical
substitution:

 c.first -> c.key()
 c.second -> c->cell
 it->first -> it.key()
 it->second -> it.cell

because first-s are column_id-s reported by the radix tree iterator's
.key() method and second-s are cells that were referenced by the
current code in get_..._vector() from boost::irange and are now
directly pointed to by the radix tree iterator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
f006acc853 row: Introduce radix tree storage type
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show what the operations
would look like. Later on the vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.

NB: All the added places have indentation deliberately broken, so
that next patch will just remove the surrounding (old) code away
and (most of) the new one will happen in its place instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
5f276b279e row-equal: Re-declare the cells_equal lambda
For further patching it's handy to have this helper to accept
column_id and atomic_cell_or_collection arguments, instead of
an std::pair of these two.

This is to facilitate next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
aa85bc790b test: Add tests for radix tree
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
a5bd68ae5d utils: Compact radix tree
The tree uses an integral type as a search key. On each level the local index
is the next 7 bits of the key, so a 32-bit key needs 5 levels.
The tree uses 2 memory packing techniques -- prefix compaction and growing
node layouts.
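
A minimal sketch of the key split, in Python rather than the tree's C++. It assumes the most significant bits are consumed first, which the text above does not state; `level_indices` and the constant names are made up for illustration.

```python
BITS_PER_LEVEL = 7
KEY_BITS = 32
LEVELS = -(-KEY_BITS // BITS_PER_LEVEL)  # ceil(32 / 7) == 5


def level_indices(key):
    """Split a 32-bit key into one 7-bit local index per tree level.

    Assumption: the top level gets the most significant bits, so it only
    ever sees 32 - 4 * 7 = 4 significant bits.
    """
    if not 0 <= key < (1 << KEY_BITS):
        raise ValueError("key does not fit in 32 bits")
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(key >> (BITS_PER_LEVEL * level)) & mask
            for level in reversed(range(LEVELS))]
```

For example, key 129 (= 1 * 128 + 1) maps to local indices `[0, 0, 0, 1, 1]` under this assumption.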

The prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with this only child and the child in
question keeps "prefix:length" pair on board, that's used to check if the
short-cut lookup took the correct path.

The growing node layouts make the nodes occupy only as much memory as needed
to keep the _present_ keys, and there are 2 kinds of layouts.

Direct layout is an array; intra-node search is plain indexing. The layout
storage grows in vector-like manner, but there's a special case for the
maximum-sized layout that helps avoiding some boundary checks.

Indirect layout keeps two arrays on board -- with values and with indices.
The intra-node search is thus a lookup in the latter array first. This
layout is used to save memory for sparse keys. Lookup is optimized with
SIMD instructions.
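
The difference between the two layouts can be sketched as below. This is a Python illustration only: the class names are hypothetical, growth of the direct layout is simplified to a plain list, and the linear scan in the indirect lookup stands in for the SIMD search mentioned above.

```python
class DirectNode:
    """Values stored by position: lookup is plain indexing."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def insert(self, index, value):
        if index >= len(self.slots):  # grow in a vector-like manner
            self.slots.extend([None] * (index + 1 - len(self.slots)))
        self.slots[index] = value

    def lookup(self, index):
        return self.slots[index] if index < len(self.slots) else None


class IndirectNode:
    """Two parallel arrays: local indices and the values they map to."""

    def __init__(self):
        self.indices = []  # one byte-sized (7-bit) local index per value
        self.values = []

    def insert(self, index, value):
        if not 0 <= index < 128:
            raise ValueError("local index must fit in 7 bits")
        if index in self.indices:
            self.values[self.indices.index(index)] = value
        else:
            self.indices.append(index)
            self.values.append(value)

    def lookup(self, index):
        try:
            pos = self.indices.index(index)  # scan of the index array
        except ValueError:
            return None
        return self.values[pos]
```

The indirect node pays one extra array search per lookup but stores only the keys that are present, which is the trade-off the text describes for sparse keys.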

Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not be memory efficient. At the same time lookup of a key in the
tree potentially walks several inner nodes, so speeding up search for
them is beneficial.

Leaf nodes are indirect, since they are 99% of memory and thus need to
be packed well. The single indirect lookup when searching in the tree
doesn't slow things down notably even on insertion stress test.

To summarize:
 * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
 * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
                 or header + 16 bytes bitmap + 128 objects

The header is
 - backreference (8 bytes)
 - prefix (4 bytes)
 - size, layout, capacity (1 byte each)

The iterator is one-directional (for simplicity), but that is enough for its main
target -- the sparse array of cells in a row. Also the iterator has an
.index() method that reports back the index of the entry at which it points.
This greatly simplifies the tree scans by the class row further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 19:25:10 +03:00
Piotr Sarna
495b7b5596 alternator: use unique_ptr for storing attribute paths
Previous commit eliminated the only copying of the attribute paths,
so it's now safe to make the object noncopyable.
Message-Id: <5468e8c17d3d42a03c1dd33706bbaac0c58959ce.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:59 +02:00
Piotr Sarna
7e1641224c alternator: batch: pass attrs_to_get by a shared pointer
The attrs_to_get object was previously copied, but that is quite
a heavyweight operation, since this object may contain an
instance of std::map or std::unordered_map.
To avoid copying whole maps, the object is wrapped in a shared
const pointer.
Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:56 +02:00
Tomasz Grabiec
f86108aef1 Merge "raft: move ticking to external code" from Alejo
As Gleb suggested in a previous review, remove ticker from raft and
leave calling tick() to external code.

While there, tick faster to speed up tests.

* https://github.com/alecco/scylla/tree/tests-17-remove-ticker:
  raft: replication test: reduce ticker from 100ms to 1ms
  raft: drop ticker from raft
2021-02-15 18:14:03 +02:00
Botond Dénes
c24f350846 scylla-gdb.py: nonwrapping_interval_printer: fix compatibility with 4.2+
Use the `_interval` member instead of the old `_range` field, but stay
compatible with pre 4.2 releases, falling back to `_range` when
`_interval` doesn't exist.
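
A sketch of the fallback pattern, written so that it runs outside gdb. It uses a plain dict in place of a `gdb.Value`, so a missing member shows up as `KeyError`; in a real gdb session the failed lookup would surface as a gdb exception instead. The helper name `interval_member` is hypothetical.

```python
def interval_member(val):
    """Return the wrapped interval, preferring the 4.2+ member name."""
    try:
        return val['_interval']
    except KeyError:  # assumption: stands in for gdb's missing-member error
        # Pre-4.2 releases still call the member `_range`.
        return val['_range']
```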

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>
2021-02-15 18:14:03 +02:00
Pavel Emelyanov
d43ad8738c array-search: Add helpers to search for a byte in array
The radix tree code will need to find an 8-bit value
in an array of some fixed size, so here are the helpers.

Those that lend themselves to SIMD have SIMD implementations for x86_64.

TODO: Add aarch64

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:47:59 +03:00
Pavel Emelyanov
0ad361b380 test/perf_collection: Add callback to check the speed of clone
In some places scylla clones collections of objects, so it's
sometimes needed to measure the speed of this operation.

This patch adds a placeholder for it, but no implementations
for any supported collections. It will be added soon for radix
tree.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:46:37 +03:00
Pavel Emelyanov
767253fe24 test/perf_mutation: Add option to run with more than 1 columns
The --column-count option makes the test generate a schema with
the given number of columns and makes the mutation maker
fill a random column with a value on each iteration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:45:42 +03:00
Pavel Emelyanov
fc84ab3418 test/perf_mutation: Prepare to have several regular columns
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:34 +03:00
Pavel Emelyanov
21adff2a41 test/perf_mutation: Use builder to build schema
The test will be taught to use more than one regular
column, so switch to builder in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:06 +03:00
Piotr Sarna
cbbb7f08a0 Merge 'Alternator: support nested attribute paths...
in all expressions' from Nadav Har'El.

This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator.  The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.

The first patch in the series fixes #8043, a bug in some error cases in
conditions, which was discovered while working on this series, and is
conceptually separate from the rest of the series.

Closes #8066

* github.com:scylladb/scylla:
  alternator: correct implementation of UpdateItem with nested attributes and ReturnValues
  alternator: fix bug in ReturnValues=UPDATED_NEW
  alternator: implemented nested attribute paths in UpdateExpression
  alternator: limit the depth of nested paths
  alternator: prepare for UpdateItem nested attribute paths
  alternator: overhaul ProjectionExpression hierarchy implementation
  alternator: make parsed::path object printable
  alternator-test: a few more ProjectionExpression conflict test cases
  alternator-test: improve tests for nested attributes in UpdateExpression
  alternator: support attribute paths in ConditionExpression, FilterExpression
  alternator-test: improve tests for nested attributes in ConditionExpression
  alternator: support attribute paths in ProjectionExpression
  alternator: overhaul attrs_to_get handling
  alternator-test: additional tests for attribute paths in ProjectionExpression
  alternator-test: harden attribute-path tests for ProjectionExpression
  alternator: fix ValidationException in FilterExpression - and more
2021-02-15 15:45:49 +02:00
Tomasz Grabiec
508f928220 tests: sstables: Test sstable write fails on missing partition_end mid-stream
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>
2021-02-15 15:45:49 +02:00
Benny Halevy
e532585126 test: sstables::test_env: do_with: futurize_invoke func
Otherwise, if `func` throws, test_env isn't stopped, as it should be.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210214190157.211858-1-bhalevy@scylladb.com>
2021-02-15 15:45:49 +02:00
Wojciech Mitros
1819be5ebc canonical_mutation: make the data type non-contiguous
The canonical_mutation type can contain a large mutation, particularly
when the mutation is a result of converting a big schema. Its data
was stored in a field of type 'bytes', which is contiguous and
may cause a large allocation.
This is fixed by simply changing the type to 'bytes_ostream', which is
fragmented. The change is compatible because the idl type 'bytes' is compatible
with 'bytes_ostream' as a result of dcf794b, and all of canonical_mutation's
methods use the field as an input stream (ser::as_input_stream), which can
be used on 'bytes_ostream' too.

Fixes #8074

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8075
2021-02-15 10:24:47 +01:00
Nadav Har'El
f884104eed cql-pytest: add more JSON tests
This patch adds several more tests reproducing bugs in toJson() and
SELECT JSON.

First add two xfailing tests reproducing two toJson() issues - #7988
and #8002. The first is that toJson() incorrectly formats values of the
"time" type - it should be a string but Scylla forgets the quotes.
The second is that toJson() formats "decimal" values as JSON numbers
without using an exponent, resulting in memory allocation failure
for numbers with high exponents, like 1e1000000000.

The actual test for 1e1000000000 has to be skipped because in
debug build mode we get a crash trying this huge allocation.
So instead, we check 1e1000 - this generates a string of 1000
characters, which is much too much (should just be "1e1000")
but doesn't crash.
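
The size difference is easy to see with Python's `decimal` module. This is only an illustration of the two notations, not of Scylla's or Cassandra's formatting code:

```python
from decimal import Decimal

value = Decimal("1e1000")

# Positional notation spells out every digit: '1' followed by 1000 zeros.
positional = format(value, "f")

# Scientific notation stays short no matter how large the exponent is.
scientific = str(value)

positional_length = len(positional)
```

Here `positional_length` is 1001 while `scientific` is just `'1E+1000'`, which is why an exponent such as 1000000000 turns into a huge allocation when no exponent is used in the output.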

Then we add a reproducing test for issue #8077: When using SELECT JSON
on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has
a specific way of formatting the result in JSON, and Scylla should do
it the same way unless we have a good reason not to.

As usual, the new tests pass on Cassandra and fail on Scylla, so they are marked
xfail.

Refs #7988
Refs #8002
Refs #8077.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>
2021-02-15 10:55:43 +02:00
Nadav Har'El
9e029f09e5 docs: improve CONTRIBUTING.md
Start improving CONTRIBUTING.md, as suggested in issue #8037:

1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md.
   This was mostly a pointer to Seastar's coding style anyway, so it's not
   helpful to have a separate file which hopeful developers will not find
   anyway.

2. Mention the Scylla developers mailing list, not just the Scylla users
   mailing list. The Scylla developers mailing list is where all the action
   happens, and it's very odd not to mention it.

3. The decision that github pull requests are forbidden was retracted
   a long time ago, so change the explanation on pull requests.

4. Some smaller phrasing changes.

Refs #8037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>
2021-02-14 22:09:24 +02:00
Nadav Har'El
8c9935359c build: Stir -O levels
Merged patch series from Pavel Emelyanov:

The default -O<> levels are considered to produce code that is slow and tedious to
test, so it's tempting to increase the level. On the other hand,
there were some complaints that recompile-mostly work would suffer
from slower builds.

This set tries to find a reasonable compromise -- raise the default opt
levels and provide the ability to configure one if needed.

* 'br-cxx-o-levels-2' of github.com:xemul/scylla:
	configure: Switch debug build from -O0 to -Og
	configure: Switch dev build from -O1 to -O2
	configure: Make -O flag configurable
2021-02-14 22:09:24 +02:00
Avi Kivity
cb4e1bb0b9 logalloc: reduce gap between std min_free and logalloc min_free
With the larger gap, logalloc reserved more memory for std than
the background reclaim threshold for running, so it was triggered
rarely.

With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
2021-02-14 19:09:29 +02:00
Avi Kivity
ca0c006b37 logalloc: background reclaim
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim
that cannot preempt until it achieved its goal).

The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.
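
One way to picture "shares proportional to the distance from the target" is the sketch below. Only the 60MB target comes from the text; the `MAX_SHARES` bound, the linear formula and the function name are assumptions, and the actual code may compute shares differently.

```python
FREE_MEMORY_TARGET = 60 * 1024 * 1024  # the 60MB target mentioned above
MAX_SHARES = 1000  # hypothetical upper bound, not taken from the source


def reclaimer_shares(free_memory):
    """Scale scheduling shares linearly with the free-memory shortfall."""
    shortfall = max(0, FREE_MEMORY_TARGET - free_memory)
    return MAX_SHARES * shortfall // FREE_MEMORY_TARGET
```

With this formula the reclaimer gets no shares once the cushion is full and the maximum when free memory is exhausted.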

I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
2021-02-14 19:09:29 +02:00
Avi Kivity
35076dd2d3 logalloc: preemptible reclaim
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.

Currently, preemption checks are on segment granularity. This is
probably too coarse, and should be refined later, but is already
better than the current granularity which does not allow preemption
until the entire requested memory size was reclaimed.
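
A rough Python sketch of preemption checks at segment granularity. It models segments as a list of reclaimable byte counts and the preemption check as a callback; both, along with the function name, are simplifications made for illustration.

```python
def reclaim(segments, goal, need_preempt=None):
    """Free whole segments until `goal` bytes are reclaimed.

    `segments` is a list of reclaimable sizes (consumed from the end).
    If `need_preempt` is given, it is consulted before each segment and
    reclaim stops as soon as it returns True, even if nothing was freed.
    """
    freed = 0
    while segments and freed < goal:
        if need_preempt is not None and need_preempt():
            break
        freed += segments.pop()
    return freed
```

Without the callback the loop behaves like the synchronous reclaim described above: it cannot stop until the goal is met or there is nothing left to free.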
2021-02-14 19:09:29 +02:00
Alejo Sanchez
5e49650146 raft: replication test: reduce ticker from 100ms to 1ms
To speed up the replication test, reduce the tick time from 100ms to 1ms.

Speed up: debug 3.7 to 2.5, dev 2.9 to 2.1 seconds

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:59:06 -04:00
Alejo Sanchez
b41a6822e8 raft: drop ticker from raft
Remove ticker callbacks from raft::server.
External code should periodically call raft::server::tick().

Update replication_test accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:41:42 -04:00
Nadav Har'El
ea338db581 cql-pytest: reproduce bug in setting time column with integer
This test reproduces issue #7987, where Scylla cannot set a time column
with an integer - whereas the documentation says this should be possible
and it also works in Cassandra.

The test file includes tests for both ways of setting a time column
(using an integer and a string), with both prepared and unprepared
statements, and demonstrates that only one combination fails in Scylla -
an unprepared statement with an integer. This test xfails on Scylla
and passes on Cassandra, and the rest pass on both.

Refs #7987.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210128215103.370723-1-nyh@scylladb.com>
2021-02-14 15:09:38 +02:00
Nadav Har'El
49cd9b3fd5 alternator: correct implementation of UpdateItem with nested attributes and ReturnValues
This patch fixes the last missing part of nested attribute support in
UpdateItem - returning the correct attributes when ReturnValues is requested.
When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD
or UPDATED_NEW, only the actual updated attribute a.b should be returned, not
the entire top-level attribute a as we did before this patch.
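
The intended filtering can be illustrated with plain dicts. This is a simplified Python sketch, not the real hierarchy_filter(): it handles only map members (no list indexes), assumes non-empty paths that do not overlap, and `filter_by_paths` is a made-up name.

```python
def filter_by_paths(item, paths):
    """Keep only the parts of `item` addressed by one of `paths`.

    Each path is a non-empty list of member names, e.g. ["a", "b"] for a.b.
    """
    result = {}
    for path in paths:
        value = item
        for key in path:
            if not isinstance(value, dict) or key not in value:
                break  # the path does not exist in this item
            value = value[key]
        else:
            # The whole path resolved: rebuild just that branch.
            dst = result
            for key in path[:-1]:
                dst = dst.setdefault(key, {})
            dst[path[-1]] = value
    return result
```

For an item `{"a": {"b": 1, "c": 2}}` and the path a.b, only `{"a": {"b": 1}}` is returned rather than the whole of `a`.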

This patch was made very simple because our existing hierarchy_filter()
function already does exactly the right thing, and can trivially be made to
accept any attribute_path_map<T> (in our case attribute_path_map<action>),
not just attrs_to_get as it did until now.

This patch also adds several more checks to the test in test_returnvalues.py
to improve the test's coverage even more. Interestingly, I discovered two
esoteric cases where DynamoDB does something which makes little sense, but
apparently simplified their implementation - but the beautiful thing is that
it also simplifies our implementation! See long comments about these two
cases in the test code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
964500e47a alternator: fix bug in ReturnValues=UPDATED_NEW
Commit 0c460927bf broke UpdateItem's
ReturnValues=UPDATED_NEW by moving previous_item while it is still
needed. None of the existing tests broke because none of them needed
previous_item after it was moved - but it started to break when we
add support for nested attribute paths, which need this previous_item.

So this patch returns the move to a copy, as it was before the
aforementioned patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
33685a683e alternator: implemented nested attribute paths in UpdateExpression
This patch adds full support for nested attribute paths (e.g., a.b[3].c)
in UpdateExpression. Previous patches already added such
support for ProjectionExpression, ConditionExpression and FilterExpression,
so this means the nested attribute paths feature is now complete, and we
remove the warning from the documents. However, there is one last loose
end to tie and we will do it in the next patch: After this patch, the
combination of UpdateExpression with nested attributes and ReturnValues
is still wrong, and the test for it in test_returnvalues.py still xfails.

Note that previous patches already implemented support for attribute paths
in expression evaluations - i.e., the right-hand side of UpdateExpression
actions, and in this patch we just needed to implement the left hand side:
When an update action is on an attribute a.b we need to read the entire
content of the top-level a (an RWM operation), modify just the b part of
its json with the result of the action, and finally write back the entire
content of a. Of course everything gets complicated by the fact that we
can have multiple actions on multiple pieces of the same JSON, and we also
need to detect overlapping and conflicting actions (we already have this
detection in the attribute_path_map<> class we introduced in a previous
patch).
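
The read-modify-write sequence for a single "SET a.b = :val" action can be sketched as follows. This is a Python illustration on plain dicts: it covers only map members and one action at a time, `set_nested` is a hypothetical name, and `KeyError` stands in for the validation error a real request would get.

```python
import copy


def set_nested(item, path, value):
    """Apply 'SET path = value' by rewriting the whole top-level attribute."""
    top, *rest = path
    if not rest:
        item[top] = value  # plain top-level assignment, no read needed
        return
    if top not in item:
        raise KeyError(f"top-level attribute {top!r} does not exist")
    new_top = copy.deepcopy(item[top])  # read the entire top-level value
    node = new_top
    for key in rest[:-1]:
        node = node[key]                # raises KeyError on a missing level
    node[rest[-1]] = value              # modify just the addressed part
    item[top] = new_top                 # write back the entire value
```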

I decided to leave one small esoteric difference, reproduced by the xfailing
test_update_expression.py::test_nested_attribute_remove_from_missing_item:
As expected, "SET x.y = :val" fails for an item if its attribute x doesn't
exist or the item itself does not exist. For the update expression
"REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly
silently passes if the entire item doesn't exist. Alternator does not
currently reproduce this oddity - it will fail this write as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7789606545 alternator: limit the depth of nested paths
DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d")
to 32 levels. This patch adds the same limit also to Alternator.

The exact value of this limit is less important (although it did make
sense to choose the same limit as DynamoDB does), but it's important
to have *some* limit: It's often convenient to handle paths with a
recursive algorithm, and if we allow unlimited path depth, it can
result in unlimited recursion depth, and a crash. Let's avoid this
possibility.

We detect the over-long path while building the parsed::path object
in the parser, and generate a parse error.
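
A toy version of the check, in Python. It treats a path as dot-separated names only and counts every component towards the limit; how the real parser counts depth (and how it treats index operators such as [2]) is not stated here, so take the counting rule as an assumption. `parse_path` and `MAX_PATH_DEPTH` are made-up names.

```python
MAX_PATH_DEPTH = 32  # the limit quoted above


def parse_path(text):
    """Split 'a.b.c' into components, rejecting over-long paths."""
    parts = text.split(".")
    if len(parts) > MAX_PATH_DEPTH:
        # Failing at parse time keeps recursive path handling bounded.
        raise ValueError(f"path is nested deeper than {MAX_PATH_DEPTH} levels")
    return parts
```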

This patch also includes a test that verifies that both Alternator
and DynamoDB have the same 32-level nesting limit on paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
4c7e27c688 alternator: prepare for UpdateItem nested attribute paths
This patch prepares UpdateItem for updating of nested attribute paths
(e.g., "SET a.b = :val"), but does not yet support them.

Instead of _update_expression holding an unsorted list of "actions",
we change it to hold an attribute_path_map of actions. This will allow
us to process all the actions on a top-level attribute together, and
moreover gets us "for free" the correct checking for overlapping and
conflicting updates - exactly the same checking we already had in
attribute_path_map for ProjectionExpression. Other than this change,
most of this patch is just code movement, not functional changes.

After this patch, the tests for update path overlap and conflict pass:
test_update_expression_multi_overlap_nested and
test_update_expression_multi_conflict_nested.

We can also mark test_update_expression_nested_attribute_rhs as passing -
this test involves an attribute path in the right-hand-side of an update,
but the left-hand-side is still a top-level attribute, so it works (it
actually worked before this patch - it started working when we implemented
attribute paths in expressions, for ConditionExpression and
FilterExpression).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7c5db2da83 alternator: overhaul ProjectionExpression hierarchy implementation
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths grouped by the top-level
attributes, and also detect overlapping and conflicting entries.

For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.

So in this patch we rewrite the data structure we had for ProjectionExpression
in a more generic manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
which includes all the overlap and conflict detecting logic.
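
A compact Python sketch of such a map and its checks. It is not the C++ template: `Node`, `add_path` and `PathError` are hypothetical names, string steps stand for .member and integer steps for [index]. Here "overlap" means one path is a prefix of (or equal to) another, and "conflict" means the same node is addressed both as a map and as a list.

```python
class PathError(Exception):
    pass


class Node:
    def __init__(self):
        self.has_value = False  # True on nodes where some path ends
        self.value = None
        self.children = {}      # str keys for .member, int keys for [index]


def add_path(root, path, value):
    """Add a path/value pair, rejecting overlapping or conflicting paths."""
    node = root
    for step in path:
        if node.has_value:
            raise PathError(f"overlapping path: {path}")
        if node.children and type(next(iter(node.children))) is not type(step):
            raise PathError(f"conflicting path: {path}")
        node = node.children.setdefault(step, Node())
    if node.has_value or node.children:
        raise PathError(f"overlapping path: {path}")
    node.has_value, node.value = True, value
```

Because the error carries the offending path, the message can name one of the clashing paths, similar to the improved error messages mentioned below.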

There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.

The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
f78d33dd73 alternator: make parsed::path object printable
Make the parsed::path object printable - which is useful for error messages.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
c2f18e56ea alternator-test: a few more ProjectionExpression conflict test cases
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
de62a8c2d3 alternator-test: improve tests for nested attributes in UpdateExpression
We already had many tests for nested attributes in UpdateExpression, but
this patch adds even more:

 * Test nested attribute in right-hand-side in assignment: z = a.c.x.
 * Test for making multiple changes to the same and different top-level
   attributes in the same update.
 * Additional cases of overlap between multiple changes.
 * Tests for conflict between multiple changes.
 * Tests for writing to a nested path on a non-existent attribute or item.
 * A stronger test that array append sorts the added items.

As this feature was not yet implemented, these tests fail on Alternator,
and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:24 +02:00
Pavel Solodovnikov
2ed445bfdd raft: raft_rpc: provide update_address_mapping and dispatcher functions
Provide several utility functions which will be used in rpc message
handlers:

1. `update_address_mapping` -- add a new (server_id -> inet_address)
   mapping for a `raft_rpc` instance.
   This is used to update rpc module with a caller address
   upon receiving an rpc message from a yet unknown server.
2. A set of dispatcher functions for every rpc call that forward calls
   to an appropriate `raft::rpc_server` instance (for which `raft::rpc`
   has a back-pointer).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-12 17:55:48 +03:00
Pavel Emelyanov
ffc9cc9aec range-streamer: Remove global storage service reference
The reference is used by range streamer and (!) storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.

Both places already have the database with config at hands
and can be simplified.

v2: spellchecking

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
2021-02-12 15:50:30 +01:00
Tomasz Grabiec
26aa000493 Merge "raft: replication test fixes" from Alejo
Fix rare debug mode hang and a minor fix.

* alejo/tests-16-fix-debug-hang-disruptive-ticks-master-v3:
  raft: replication test: fix debug mode hangs
  raft: replication test: remove unnecessary param
2021-02-11 20:35:35 +01:00
Nadav Har'El
a03a8a89a9 cql-pytest: fix flaky timeuuid_test.py
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.

The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.

What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
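
A minimal sketch of the truncation described above, using only Python's
standard uuid module. The helper names (timeuuid_to_unix_ms, make_timeuuid)
are hypothetical and are not taken from timeuuid_test.py; the point is only
that two timeuuids which differ by less than a millisecond compare equal once
their 100 ns timestamps are truncated to millisecond resolution.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def timeuuid_to_unix_ms(u: uuid.UUID) -> int:
    """Return the time stored in a version-1 UUID, truncated to whole milliseconds."""
    return (u.time - UUID_EPOCH_OFFSET_100NS) // 10_000


def make_timeuuid(unix_100ns: int) -> uuid.UUID:
    """Build a version-1 UUID holding the given time (hypothetical test helper)."""
    t = unix_100ns + UUID_EPOCH_OFFSET_100NS
    time_low = t & 0xFFFFFFFF
    time_mid = (t >> 32) & 0xFFFF
    time_hi_version = ((t >> 48) & 0x0FFF) | 0x1000  # version 1
    # clock_seq_hi_variant=0x80 selects the RFC 4122 variant; node is arbitrary.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))
```

Comparing timeuuid_to_unix_ms(u) against a millisecond timestamp is then
insensitive to any sub-millisecond increment stored in the timeuuid.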

Fixes #8060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
2021-02-11 19:03:58 +02:00
Alejo Sanchez
97338ab53f raft: replication test: fix debug mode hangs
In certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.

For example having servers A B C D E and only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.

But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.
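
The arithmetic behind the hang can be sketched as follows. This is an
illustration only; majority() and can_elect() are hypothetical names and not
part of the replication test.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def can_elect(cluster_size: int, votes_granted: int) -> bool:
    """True if a candidate holding votes_granted votes (its own included) wins."""
    return votes_granted >= majority(cluster_size)
```

With 5 servers the majority is 3, so when only A, B and C are connected, A
needs every one of them. If B turns candidate and refuses, A is left with 2
votes and cannot win.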

This patch disables timer ticks for all servers when running custom
elections and partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-11 11:42:31 -04:00
Pavel Solodovnikov
d8dfdfba1e raft: pass group_id as an argument to raft rpc messages
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:25:33 +03:00
Pavel Solodovnikov
3b50cdf1ed raft: use a named constant for pre-defined schema raft group
Introduce a static `schema_raft_state_machine::group_id` constant,
which denotes the raft group id for the schema changes server.

Also fix the comment on the state machine class declaration
to emphasize that the instance will be managed by shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:24:39 +03:00
Tomasz Grabiec
234f9dbe85 Merge 'Fix mixed cluster schema sync' from Eliran Sinvani
When a table is altered in a mixed cluster by a node with a more
recent version, the request can fail if there is a difference in
schema_features between the two versions.  This miniset handles the
two problems that prevent the sync.

Closes #8011

* github.com:scylladb/scylla:
  schema: recalculate digest when computed_columns feature is enabled
  schema tables: Remove mutations to unknown tables when adapting schema mutations
  schema tables: Register 'scylla_tables' versions that were sent to other nodes
2021-02-11 13:03:38 +01:00
Eliran Sinvani
63b794d104 schema: recalculate digest when computed_columns feature is enabled
The schema digest is affected by the computed_columns feature, which
means that we have to recalculate our schema digest when this feature is
enabled.
2021-02-11 13:48:58 +02:00
Eliran Sinvani
178ced9014 schema tables: Remove mutations to unknown tables when adapting schema
mutations

Whenever an alter table occurs, the mutations for the just-altered table
are sent over to all of the replicas from the coordinator.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables. More specifically, it keeps mutations
to the `computed_columns` table, which doesn't exist in, for example,
version 2019.1.
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
2021-02-11 13:48:55 +02:00
Eliran Sinvani
ff1ba9bc2b schema tables: Register 'scylla_tables' versions that were sent to other
nodes

In a mixed cluster there can be a situation where `scylla_tables` needs
to be sent over to another node because of a schema sync or because the
node pulls it because it is referenced by a frozen_mutation. The former
is not a problem since the sending node chooses the version to send.
However, the latter is problematic since `scylla_tables` versions are not
registered anywhere.
This registers every `scylla_tables` schema version which is used to adapt
mutations, since after this happens a schema pull for this version might
follow.
2021-02-11 13:47:16 +02:00
Takuya ASADA
856fe12e13 dist/debian: install scylla-node-exporter.service correctly
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".

Fixes #8054

Closes #8053
2021-02-11 12:19:38 +02:00
Benny Halevy
d01e7e7b58 stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
2021-02-11 12:05:32 +02:00
Wojciech Mitros
17634d141b sstables: add test for checking the latency of updating the sstable_set in a table
Added a test which measures the time it takes to replace sstables in a table's
sstable_set, using the leveled compaction strategy.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
693b4e0fcd sstables: move column_family_test class from test/boost to test/lib
Column_family_test allows performing private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
0feff8712e sstables: use fast copying of the sstable_set instead of rebuilding it
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.

The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
aa0cd940d6 sstables: replace the sstable_set with a versioned structure
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods (and some extra) without significantly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.

This causes most methods of the version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with their
descendants.
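
The idea of versions that record only their own changes can be sketched as
below. This is a simplified illustration under stated assumptions, not the
actual sstable_set_version: the class name SetVersion and its methods are
hypothetical, a copy is modelled as SetVersion(parent=original), and the
original is assumed to be left unmodified once it has a child.

```python
class SetVersion:
    """One version of a set: its ancestors' changes plus its own."""

    def __init__(self, parent=None):
        self.parent = parent
        self.added = set()
        self.erased = set()

    def contains(self, item):
        # The nearest version that mentions the item decides membership,
        # so the cost grows with the number of ancestors.
        version = self
        while version is not None:
            if item in version.added:
                return True
            if item in version.erased:
                return False
            version = version.parent
        return False

    def insert(self, item):
        if self.contains(item):
            return  # no actual effect, so nothing is recorded
        if item in self.erased:
            self.erased.remove(item)  # undo our own erase
        else:
            self.added.add(item)

    def erase(self, item):
        if not self.contains(item):
            return
        self.added.discard(item)
        if self.contains(item):  # still visible through an ancestor
            self.erased.add(item)

    def merge_parent(self):
        """Fold the parent's changes into this version.

        Only valid when this version is the parent's sole remaining child.
        """
        parent = self.parent
        if parent is None:
            return
        # An add in one version and an erase in the other cancel out.
        added_then_erased = parent.added & self.erased
        erased_then_added = parent.erased & self.added
        self.added = (self.added - erased_then_added) | (parent.added - added_then_erased)
        self.erased = (self.erased - added_then_erased) | (parent.erased - erased_then_added)
        self.parent = parent.parent
```

Creating a copy is constant-time here because the child starts with empty
change sets, and merge_parent() is what keeps the ancestor chain short.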

The strategy used for deciding when and with which of its children
a version should be merged depends heavily on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It is worth mentioning that when a copy is made, the copied set should not
be modified anymore, because that would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version associated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions (with
changes from the table), and smaller branches of versions that start
from a version from this chain, but are deleted soon after.
In such a case we can merge a version when it has exactly one descendant,
as this limits the number of concurrent ancestors of a version to the
number of copies of its ancestors that are concurrently used. During each
such merge, the parent version is removed and the child version is
modified so that all operations on it give the same results.

In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains a sstable_list but not every sstable_list has to be contained
by a sstable_set, and we also want to allow fast copying of sstable_lists,
so the reference to the sstable_set_version is kept by the sstable_lists
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.

Methods that access sstables that are elements of a certain sstable_set copy
(select, select_sstable_runs and sstable_list's iterator) get results
from containers that hold all sstables from all versions (which are stored
in a single, shared "versioned_sstable_set_data" structure), and then
filter out those sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added and erased sstables (an sstable can
still be added in one and erased in the second). It's worth noting that if
an sstable has been added in one of the merged sets and erased in the
second, the version that remains after merging doesn't need to have any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter if the sstable has
been erased before or after being added).

To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, in how many different versions
it has been added. When a change that adds an sstable gets merged with a change
that removes it, or when such a change simply gets deleted alongside its
associated version, this count is reduced, and when an sstable gets added to a
version that doesn't already contain it, this count is increased.

The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables into its containers, and erasing them in
the case of a caught exception.

Fixes #2622

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
48153a1e2c sstables: remove potential ub
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may be already deleted inside the loop too.
Fix by creating a local copy of the lw_shared_ptr, and get reference from it
in the loop.

Fixes #7605

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
e1b494633b sstables: make sstable_set constructor less error-prone
Adding a non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.

Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Shlomi Livne
718976e794 scylla_io_setup did not configure pre tuned gce instances correctly
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values
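
Why the operator matters can be shown with a small Python sketch. The
condition below is hypothetical and is not the one in scylla_io_setup; it only
demonstrates that `&` binds tighter than comparisons, so the expression is
parsed as a chained comparison rather than as two independent tests.

```python
def is_pretuned_buggy(nr_cpus: int, nr_disks: int) -> bool:
    # Parsed as the chained comparison: nr_cpus == (2 & nr_disks) == 1
    return nr_cpus == 2 & nr_disks == 1


def is_pretuned_fixed(nr_cpus: int, nr_disks: int) -> bool:
    # Two independent comparisons joined by the logical operator.
    return nr_cpus == 2 and nr_disks == 1
```

For nr_cpus=2 and nr_disks=1 the fixed version matches, while the buggy one
evaluates 2 & 1 to 0 and then compares 2 == 0, so it silently does not match.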

Fixes #7341

Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>

Closes #8019
2021-02-11 11:06:00 +02:00
Avi Kivity
9cbbf40710 Merge "register_inactive_read: error handling" from Benny
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.

However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.

This series separates the register_inactive_reader interface
into 2 parts:

1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and returns an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fancy at this point
that will require cleanup.

This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrency_semaphore
runs out of resources.

2. After register_inactive_reader succeeds in returning a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the i_r_h.

After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
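
The two-step protocol can be sketched in Python as below. This is a toy model
under stated assumptions, not the reader_concurrency_semaphore API: the class
InactiveReadRegistry, the integer handles, the fixed capacity and the evict()
method are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


class InactiveReadRegistry:
    """Toy stand-in for the semaphore's inactive-read bookkeeping (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._next_handle = 0
        # handle -> [reader, optional eviction callback]
        self._inactive: Dict[int, List] = {}

    def register_inactive_read(self, reader: str) -> Optional[int]:
        # Step 1: register only. No callback is known yet, so a refusal
        # leaves nothing behind for the caller to clean up.
        if len(self._inactive) >= self._capacity:
            return None
        handle = self._next_handle
        self._next_handle += 1
        self._inactive[handle] = [reader, None]
        return handle

    def set_notify_handler(self, handle: int, handler: Callable[[str], None]) -> None:
        # Step 2: attach the callback once the caller has set up its state.
        self._inactive[handle][1] = handler

    def evict(self, handle: int) -> None:
        # The callback runs only if one was attached in step 2.
        reader, handler = self._inactive.pop(handle)
        if handler is not None:
            handler(reader)
```

A caller sets up its local state (for example an index entry) only after step
1 returns a handle, so a refused registration needs no undo.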

querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.

inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object.  The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.

Test: unit(release), querier_cache_test(debug)
"

* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
  querier_cache: insert_querier: ignore errors to register inactive reader
  querier_cache: insert_querier: handle errors
  querier_utils: mark functions noexcept
  reader_concurrency_semaphore: register_inactive_read: make noexcept
  reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
  reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
  reader_concurrency_semaphore: inactive_read: use intrusive list
  reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
  reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
  reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
  reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
  reader_concurrency_semaphore: inactive_read_handle: swap definition order
  reader_lifecycle_policy: retire low level try_resume method
  reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
2021-02-10 19:09:21 +02:00
Alejo Sanchez
941eceb9c8 raft: replication test: remove unnecessary param
Remove unnecessary param from wait_log()

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-10 09:11:17 -04:00
Piotr Sarna
2aa4631148 test: fix a flaky timeout test depending on TTL
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.

Fixes #8062

Closes #8063
2021-02-10 14:20:02 +02:00
Piotr Sarna
8f98c0585f failure_detector: add a missing const qualifier
The mean() method is effectively const, so it should be marked as such.
Message-Id: <14dd39e8419136909fcf10508c34de3752faa7fe.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:37 +02:00
Piotr Sarna
aa39130a20 bounded_stats_queue: add missing const qualifiers
Most of the methods of this utility are effectively const.
Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:35 +02:00
Piotr Jastrzebski
390cef6a96 cdc: Extract create_stream_ids from topology_description_generator
This new function will be used in the following patches in additional
places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-10 10:24:06 +01:00
Gleb Natapov
d06d21bfae database: remove add_keyspace() function
It is no longer used.
Message-Id: <20210209175931.1796263-2-gleb@scylladb.com>
2021-02-10 00:36:02 +01:00
Nadav Har'El
54785604b4 Merge 'Add max concurrent requests to alternator' from Piotr Sarna
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297

Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.

This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.

Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed expected failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
`
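
A minimal sketch of this kind of load shedding, assuming a simple per-shard
counter. The class RequestLimiter and its admit() method are hypothetical and
do not reflect the alternator server code; only the config name, the error
name and the message format come from the description above.

```python
from contextlib import contextmanager


class RequestLimitExceeded(Exception):
    """Raised when a request is shed instead of being served."""


class RequestLimiter:
    def __init__(self, max_concurrent_requests_per_shard: int) -> None:
        self.limit = max_concurrent_requests_per_shard
        self.in_flight = 0
        self.requests_shed = 0  # would back a requests_shed metric

    @contextmanager
    def admit(self):
        # Shed the request up front instead of queueing it.
        if self.in_flight >= self.limit:
            self.requests_shed += 1
            raise RequestLimitExceeded(
                f"too many in-flight requests: {self.in_flight}")
        self.in_flight += 1
        try:
            yield
        finally:
            self.in_flight -= 1
```

Each request handler would run inside `with limiter.admit():`, so the counter
is decremented even when the handler raises.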

Fixes #7294

Closes #8039

* github.com:scylladb/scylla:
  alternator: server: return api_error instead of throwing
  alternator: add requests_shed metrics
  alternator: add handling max_concurrent_requests_per_shard
  alternator: add RequestLimitExceeded error
2021-02-09 19:55:31 +02:00
Gleb Natapov
d8345c67d9 Consolidate system and non system keyspace creation
The code that creates system keyspaces open-codes a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
2021-02-09 17:18:04 +01:00
Gleb Natapov
51037e94ec lwt: handle an error during prune operation
The error is benign but if it is not handled "unhandled exception" error
will be printed in the logs.

Message-Id: <20210209150313.GA1708015@scylladb.com>
2021-02-09 16:26:00 +01:00
Tomasz Grabiec
3dd9c5596a Merge 'Minor tweaks to the failure detector interface' from Piotr Sarna
The interface of the failure detector service is cleaned up a little:
 - an unimplemented method is removed (is_alive)
 - a return type of another method is fixed (arrival_samples)
 - a getter for the most recent successful update is added (last_update)

This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.

Closes #8052

* github.com:scylladb/scylla:
  failure_detector: add getting last update time point
  failure_detector: return arrival samples by const reference
  failure_detector: remove unimplemented is_alive method
2021-02-09 15:23:09 +01:00
Konstantin Osipov
86dec79c1b raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-09 17:07:25 +03:00
Konstantin Osipov
41387225c3 raft: extend single_node_is_quiet test
2021-02-09 17:04:13 +03:00
Piotr Sarna
4acc6fecf0 Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.
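
The kind of check involved can be sketched as follows. This is an illustration
only: check_dc_names() is a hypothetical helper, the error type and message
are made up, and the option layout (a "class" key plus one key per datacenter)
is an assumption about how replication options are passed.

```python
def check_dc_names(replication_options: dict, known_datacenters: set) -> None:
    """Reject replication options that name a datacenter the cluster doesn't know."""
    unknown = sorted(
        dc for dc in replication_options
        if dc != "class" and dc not in known_datacenters
    )
    if unknown:
        raise ValueError(f"unrecognized datacenter name(s): {', '.join(unknown)}")
```

With such a check, a keyspace that names `datacenter2` in a cluster that only
has `datacenter1` is rejected instead of being quietly accepted.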

Fixes #7595

Closes #8056

* github.com:scylladb/scylla:
  tests: Adjusted tests for DC checking in NTS
  locator: Check DC names in NTS
2021-02-09 14:45:20 +02:00
Botond Dénes
3d001b5587 query: use local limit for non-limited queries in mixed cluster
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre-fea5067df version to a
post-fea5067df one. Pre-fea5067df versions already had a limit for reverse
queries, which fea5067df generalized to cover non-paged ones too.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user had set a generous
limit for them before the upgrade, in the mixed cluster replicas fall back
to the much stricter 1MB limit, temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
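
The selection logic can be sketched as below. This is a simplified
illustration, not the actual query code: effective_max_result_size() and its
parameters are hypothetical, and None stands for "the coordinator is an old
node and sent no limit".

```python
from typing import Optional


def effective_max_result_size(coordinator_limit: Optional[int],
                              local_limit: int) -> int:
    """Pick the memory limit a replica applies to an otherwise unlimited query."""
    if coordinator_limit is not None:
        # New coordinator: every replica works with the same limit.
        return coordinator_limit
    # Old coordinator: use the locally configured limit rather than
    # a fixed 1MB constant.
    return local_limit
```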

Fixes: #8022

Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
2021-02-09 14:45:20 +02:00
Avi Kivity
2f50bf2029 Update seastar submodule
* seastar 4c7c5c7c4...76cff5896 (6):
  > rpc: Make is possible for rpc server instance to refuse connection
  > reactor: expose cumulative tasks processed statistic
  > fair_queue: add missing #include <optional>
  > reactor: optimize need_preempt() thread-local-storage access
  > Merge " Use reference for backend->reactor link" from Pavel E
  > test: coroutines: failed coroutine does not throw
2021-02-09 14:45:20 +02:00
Avi Kivity
37b41d7764 test: add missing #include <fstream>
std::ofstream is used, but there is no direct include for it. This
fails the build with libstdc++ 11.

Closes #8050
2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz
97bb15b2f2 tests: Adjusted tests for DC checking in NTS
The CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to the nonexistent
`datacenter2` and had to be removed.
2021-02-09 08:29:35 +01:00
Juliusz Stasiewicz
b6fb5ee912 locator: Check DC names in NTS
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

Fixes #7595
2021-02-09 07:04:17 +01:00
Benny Halevy
d2b8b3041d querier_cache: insert_querier: ignore errors to register inactive reader
Since the reader may normally be dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
9bdb8190ce querier_cache: insert_querier: handle errors
Make insert_querier exception safe.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
b8f935457a querier_utils: mark functions noexcept
They are all trivially noexcept.
Mark them so to simplify error handling assumptions in the
next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
6e92f07630 reader_concurrency_semaphore: register_inactive_read: make noexcept
Catch errors when allocating an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
46c2229b78 reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
Register the inactive reader first with no
evict_notify_handler and ttl.

Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to place the querier in the index and erase it
on eviction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
d752ea7e91 reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.

This saves an allocation and another potential
error case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
a12c9638b6 reader_concurrency_semaphore: inactive_read: use intrusive list
To simplify insertion and eviction into the inactive_reads container,
use an intrusive list that requires a single allocation for the
inactive_read object itself.

This allows passing a reference to the inactive_read
to evict it.

Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).

It is also safe to evict the inactive_reader while the
i_r_h is alive.  In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.

bi::auto_unlink will detect that it's already unlinked
when destroyed and do nothing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
f751e42bf9 reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:52:16 +02:00
Benny Halevy
81cd3d0c51 reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
So try_evict_one_inactive_read could be used also in do_wait_admission
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
e072199b8d reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
9c9b4c85ae reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
There is no need to look up the inactive_read if the i_r_h
is disengaged; it should not be registered, so just return
quickly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
769dff6c54 reader_concurrency_semaphore: inactive_read_handle: swap definition order
For using boost::intrusive::list for _inactive_reads.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
d565e3fb57 reader_lifecycle_policy: retire low level try_resume method
The caller can now just call sem.unregister_inactive_read(irh) directly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
4e8f29ef14 reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.

With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Nadav Har'El
e52785be08 alternator: support attribute paths in ConditionExpression, FilterExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ConditionExpression in conditional updates, and FilterExpression in
queries and scans. After this patch, all previously-xfailing tests in
test_projection_expression.py and test_filter_expression.py now pass.

The fix is simple: Both ConditionExpression and FilterExpression use the
function calculate_value() to calculate the value of the expression. When
this function calculates the value of a path, it mustn't just take the
top-level attribute - it needs to walk into the specific sub-object as
specified by the attribute path.
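
Walking into the sub-object can be sketched as below. This is a simplified
illustration on plain Python dicts and lists, not Alternator's
calculate_value(): resolve_path() is a hypothetical helper, and a path is
assumed to be already parsed into a top-level attribute name followed by
string steps (map members) and integer steps (list indexes).

```python
def resolve_path(item: dict, path: list):
    """Return the value a parsed attribute path points at, or None if absent."""
    # Start from the top-level attribute, then descend one step at a time.
    value = item.get(path[0])
    for step in path[1:]:
        if isinstance(step, int):
            # List index step, e.g. the [3] in a.d[3]
            if not isinstance(value, list) or not 0 <= step < len(value):
                return None
            value = value[step]
        else:
            # Map member step, e.g. the .b in a.b
            if not isinstance(value, dict) or step not in value:
                return None
            value = value[step]
    return value
```

For example, with the item {"a": {"b": {"c": 7}, "d": [0, 1, 2, 3]}}, the path
a.b.c is passed as ["a", "b", "c"] and a.d[3] as ["a", "d", 3]. A path that
runs into a missing member or a value of the wrong type yields None rather
than the top-level attribute.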

This is not the end of attribute path support, UpdateExpression and
ReturnValues are not yet fully supported. This will come in following
patches.

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Nadav Har'El
579c7b8dae alternator-test: improve tests for nested attributes in ConditionExpression
Strengthen the tests in test_condition_expression.py for nested attribute
paths (e.g., b.y[1]):

1. The test test_update_condition_nested_attributes only tested successful
   conditions involving nested attributes. Let's also add an *unsuccessful*
   condition, to verify we don't accidentally pass every condition involving
   a nested attribute.

2. Test a case where a non-existent nested attribute is involved in the
   condition.

3. In the test for an attribute path with references - "#name1.#name2",
   make sure the test doesn't pass if #name2 is silently ignored.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Piotr Sarna
faca59efa6 failure_detector: add getting last update time point
It can be useful to know how long ago an endpoint
last responded to a heartbeat.
2021-02-08 16:45:58 +01:00
Tomasz Grabiec
c16e4a0423 migration_manager: Propagate schema changes with reads like we do on writes
This fixes the problem where the coordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.

Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
2021-02-08 16:49:55 +02:00
Avi Kivity
4082f57edc Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
Refs #6148

Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.

This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments

Note: flush threshold is not exposed in scylla config (yet), because I am unsure of the wording, and even of whether it should be exposed.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.

Closes #7879

* github.com:scylladb/scylla:
  commitlog: Force earlier cycle/flush iff segment reserve is empty
  commitlog: Make segment allocation wait iff disk usage > max
  commitlog: Do partial (memtable) flushing based on threshold
  commitlog: Make flush threshold configurable
  table: Add a flush RP mark to table, and shortcut if not above
2021-02-08 16:44:05 +02:00
Avi Kivity
af2d1fa0de Update abseil submodule
Compiles with newer compilers.

Added new library wyhash.a to configure.py.

* abseil 1e3d25b...9c6a50f (51):
  > Export of internal Abseil changes
  >  Do not set mvsc linker flags for clang-cl (fixes #874) (#891)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for Elbrus 2000 (e2k) (#889)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing word 'library' in the 'status' description (#868)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Include the status library into the main README. (#863)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > fix build dll (#797)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix stacktrace on aarch64 architecture. Fixes #805 (#827)
  > moved deleted functions to public for better compiler errors. (#828)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
2021-02-08 15:41:46 +02:00
Gleb Natapov
b9a5aff7a6 distributed_loader: drop execute_futures function
execute_futures() is just a local reimplementation of
when_all_succeed(). Use the latter directly.

Message-Id: <20210208114816.GA1658725@scylladb.com>
2021-02-08 13:24:19 +01:00
Nadav Har'El
104ef5242b alternator: support attribute paths in ProjectionExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ProjectionExpression in the various operations where this parameter
is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all
xfailing tests in test_projection_expression.py now pass.

In the previous patch we remembered in the "attrs_to_get" object not only
the top-level attributes to read from the table, but also how to filter
from it only the desired pieces of the nested document. In this patch we
add a filter() function to do this filtering, and call it in the right
places to post-process the JSON objects we read from the table.
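
The post-processing step can be pictured with the following Python sketch. It is only an illustration under stated assumptions: `filter_value` and the nested-dict form of the selection are hypothetical, not the actual filter() signature, and the behavior for lists (selected elements are kept in order and compacted) is an assumption of this sketch.

```python
def filter_value(value, wanted):
    """Keep only the parts of `value` selected by `wanted`.

    `wanted` is None (keep the whole value) or a dict whose keys are member
    names (str) or list indexes (int) and whose values are nested selections.
    Returns None when nothing in `value` matches the selection.
    """
    if wanted is None:
        return value
    if isinstance(value, dict):
        out = {}
        for key, sub in wanted.items():
            if isinstance(key, str) and key in value:
                kept = filter_value(value[key], sub)
                if kept is not None:
                    out[key] = kept
        return out or None
    if isinstance(value, list):
        out = []
        for idx in sorted(k for k in wanted if isinstance(k, int)):
            if idx < len(value):
                kept = filter_value(value[idx], wanted[idx])
                if kept is not None:
                    out.append(kept)
        return out or None
    # A scalar cannot satisfy a nested selection.
    return None


attr_a = {"b": {"c": 1, "x": 2}, "d": [10, 20, 30]}
selection = {"b": {"c": None}, "d": {0: None, 2: None}}
print(filter_value(attr_a, selection))
```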

We also had to fix reference resolution in paths to resolve all the
components of the path (e.g., #name1.#name2) and not just the top-level
attribute.

This is not the end of attribute path support, there are still other
expressions (ConditionExpression, UpdateExpression, FilterExpression,
ReturnValues) where they are not yet supported. This will come in following
patches.

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
6340619e69 alternator: overhaul attrs_to_get handling
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.

However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.

So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.

This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.
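
A minimal Python sketch of how such a map of hierarchy filters could detect the two error kinds follows. The names `add_path` and `PathError` and the nested-dict representation are hypothetical and are not the types used in the patch; the sketch only assumes that a path is a top-level attribute name followed by member names (str) or list indexes (int).

```python
class PathError(ValueError):
    pass


def add_path(tree, path):
    """Record `path` in `tree`, raising PathError on overlap or conflict.

    `tree` maps each step to a subtree; None marks "the whole value is wanted".
    """
    node = tree
    for i, step in enumerate(path):
        last = i == len(path) - 1
        # Below the top level, all children of one node must be of the same
        # kind: either member names (a.b) or list indexes (a[1]).
        if i > 0 and node and type(next(iter(node))) is not type(step):
            raise PathError("conflicting paths")
        if step in node:
            # Either a shorter path already covers this one, or this path
            # would cover a longer one recorded earlier.
            if node[step] is None or last:
                raise PathError("overlapping paths")
            node = node[step]
        elif last:
            node[step] = None
        else:
            node[step] = {}
            node = node[step]
    return tree


attrs = {}
add_path(attrs, ["a", "b"])
add_path(attrs, ["x", 1, "y"])
print(attrs)
```

With this representation, requesting a.b and then a.b.c trips the overlap check, while requesting a.b and then a[1] trips the conflict check.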

After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
b2dbd56a3a alternator-test: additional tests for attribute paths in ProjectionExpression
This patch adds more tests for attribute paths in ProjectionExpression,
that deal with document paths which do not fit the content of the item -
e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an
integer or a dictionary.

Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB
fails this request as a "conflict". The reasoning is that no single
item can ever have both a.b and a[2] (the first is only valid for
dictionaries, the second for lists). It's not clear to me why we
still can't return whichever of the two actually is relevant, but
the fact is that DynamoDB does not allow it.

The new tests fail on Alternator (marked xfailed) and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
2a2c5563ba alternator-test: harden attribute-path tests for ProjectionExpression
We have 7 xfailing tests for usage of nested attribute paths (e.g.,
"a.b.c[7]") in a ProjectionExpression. But some of these tests were too
"easy" to pass - a trivial and *wrong* implementation that just ignores
the path and uses the top level attribute (in the above example, "a"),
would cause some of them to start passing.

So this patch strengthens these tests. They still pass on AWS DynamoDB,
and now continue to fail with the aforementioned broken implementation.

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
653610f4bc alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on DynamoDB.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always says that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
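
The idea of passing a "came from the query" boolean per operand can be sketched in Python as follows. This is a simplified illustration, not the check_compare() code: `check_lt`, `ValidationError` and the set of comparable types are assumptions made for the sketch.

```python
class ValidationError(Exception):
    pass


def _comparable(v):
    # Simplified stand-in for "a type the < operator supports".
    return isinstance(v, (int, float, str, bytes)) and not isinstance(v, bool)


def check_lt(v1, v1_from_query, v2, v2_from_query):
    """Evaluate v1 < v2, treating bad operands according to their origin."""
    operands = ((v1, v1_from_query), (v2, v2_from_query))
    # A bad operand written by the user in the query is a request error.
    for value, from_query in operands:
        if from_query and not _comparable(value):
            raise ValidationError("unsupported operand type in query")
    # A bad operand read from the item just means the condition is false.
    for value, _ in operands:
        if not _comparable(value):
            return False
    numeric = (int, float)
    if isinstance(v1, numeric) and isinstance(v2, numeric):
        return v1 < v2
    if type(v1) is type(v2):
        return v1 < v2
    return False  # different types never match


# Old syntax: first operand from the item, second from the query.
print(check_lt([1, 2], False, 3, True))
# New syntax: both operands may come from the item.
print(check_lt(1, False, 2, False))
```

In this sketch a list read from the item silently fails the condition, which is what lets a scan skip the item instead of aborting.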

The old duplicated code for check_BEGINS_WITH() - with a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:30 +02:00
Pavel Emelyanov
a05adb8538 database: Remove global storage proxy reference
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
2021-02-08 12:59:46 +01:00
Pavel Emelyanov
8490c9ff6a transport: Remove global storage service reference
On start the transport controller keeps the storage service
on server config's lambda just to let the server grab a
database config option.

The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance gets a local database with config.

As a nice side effect, transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
2021-02-08 12:58:49 +01:00
Piotr Sarna
d23584c8f7 failure_detector: return arrival samples by const reference
There's no point in always returning the whole map by value - callers
can decide to copy the map on their own if need be.
2021-02-08 11:50:32 +01:00
Piotr Sarna
445e6e44f4 failure_detector: remove unimplemented is_alive method
The method was never implemented, so it makes no sense to keep
it in the header.
2021-02-08 11:49:50 +01:00
Amnon Heiman
4498bb0a48 API: Fix aggregation in column_family
A few methods in the column_family API were doing the aggregation wrong,
specifically, bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007
2021-02-08 12:11:30 +02:00
Raphael S. Carvalho
e1261d10f1 table: Avoid useless allocations when updating cache on memtable flush completion
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
2021-02-06 20:03:33 +02:00
Pavel Emelyanov
7e68ed6a5d configure: Switch debug build from -O0 to -Og
Previous patch changed the -O flag for dev builds. This
had no effect on unit tests compile+run time, and was
aimed at improving the individual tests, dtest, stress-
and other tests runtimes.

This change is mainly focused on improving the debug-mode
full unit tests running, while keeping the debuggability:
the compile+run time gets ~10 minutes shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
4fd5ef92ae configure: Switch dev build from -O1 to -O2
Based on the original patch from Nadav.

The -O1-generated code is too slow. Raising the opt level
slows compilation down ~9%, but greatly improves the
testing time. E.g. running the alternator test alone is
2.5 times faster with -O2 (118 vs 48 seconds).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
7ced07d22c configure: Make -O flag configurable
It was noticed that current optimization levels do not
generate fast enough code for dev builds. On the other
hand just increasing the default optimization level will
make re-compile-mostly work much more frustrating.

The new configure.py option allows selecting the desired
-O option value by hand. Current hard-coded values are
used as defaults.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Botond Dénes
7910e745bc scylla-gdb.py: std_list: restore python2 compatibility
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.
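
A minimal sketch of the pattern, assuming a hypothetical `ListIterator` class and not the actual std_list code in scylla-gdb.py:

```python
class ListIterator(object):
    """Iterator usable from both Python 3 (__next__) and Python 2 (next)."""

    def __init__(self, values):
        self._it = iter(values)

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol.
        return next(self._it)

    # Python 2 looks for a method called next(); alias it to __next__.
    next = __next__


print(list(ListIterator([1, 2, 3])))
```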

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
2021-02-05 12:47:53 +01:00
Gleb Natapov
8dbe222331 raft: compile raft by default 2021-02-05 12:40:20 +01:00
Konstantin Osipov
adc87aa278 raft: re-lookup progress object after a configuration change
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.

The fix is to lookup the object one more time after maybe_commit().
2021-02-05 12:40:19 +01:00
Piotr Sarna
d7848750d8 alternator: server: return api_error instead of throwing
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.
2021-02-04 17:23:41 +01:00
Piotr Sarna
868e04e8e2 alternator: add requests_shed metrics
The counter shows the total number of requests shed due to overload.
2021-02-04 17:23:41 +01:00
Piotr Sarna
1b8c946ad7 alternator: add handling max_concurrent_requests_per_shard
The config value is already used to set an upper limit of concurrent
CQL requests, and now it's also abided by alternator.
Excessive requests result in returning RequestLimitExceeded error
to the client.
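
The shedding logic can be sketched as below. This is an illustrative Python model, not the Alternator server code: `RequestLimiter` is a hypothetical name, and the `requests_shed` counter mirrors the metric named in the related commit only by analogy.

```python
class RequestLimitExceeded(Exception):
    pass


class RequestLimiter:
    """Reject new requests once too many are already in flight."""

    def __init__(self, max_concurrent_requests):
        self.max_concurrent_requests = max_concurrent_requests
        self.in_flight = 0
        self.requests_shed = 0

    def __enter__(self):
        if self.in_flight >= self.max_concurrent_requests:
            # Shed the request instead of queueing it.
            self.requests_shed += 1
            raise RequestLimitExceeded(
                "too many in-flight requests: %d" % self.in_flight)
        self.in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self.in_flight -= 1
        return False  # do not swallow exceptions from the request handler


limiter = RequestLimiter(max_concurrent_requests=1)
with limiter:
    try:
        with limiter:
            pass
    except RequestLimitExceeded as e:
        print("shed:", e)
```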

Tests: manual
Running multiple concurrent requests via the test suite results in:
botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
2021-02-04 17:23:41 +01:00
Piotr Sarna
32dc692b8b alternator: add RequestLimitExceeded error
The error code is used when requests are shed due to crossing
the user-defined threshold of the rate of incoming requests.
2021-02-04 17:14:21 +01:00
Avi Kivity
7f3083739f Merge "sstables: Share partition index pages between readers" from Tomasz
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to create them
(temporary_buffer<>). This solution takes this approach.

This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.

Promoted index cursor is no longer created when the partition index
entry is parsed, but it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.

Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.

Load driver command line:

  ./scylla-bench                   \
       -workload uniform           \
       -mode read                  \
       --partition-count=10        \
       -clustering-row-count=1     \
       -concurrency 100

Scylla command line:

  scylla --developer-mode=1 -c1 -m1G --enable-cache=0

The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.

Before:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    4706    4706      0 35ms   30ms   27ms   25ms   24ms   21ms   21ms
   2s    4646    4646      0 42ms   31ms   31ms   27ms   25ms   21ms   22ms
 3.1s    4670    4670      0 40ms   27ms   26ms   25ms   25ms   21ms   21ms
 4.1s    4581    4581      0 39ms   33ms   33ms   27ms   26ms   21ms   22ms
 5.1s    4345    4345      0 40ms   37ms   35ms   32ms   31ms   21ms   23ms
 6.1s    4328    4328      0 49ms   40ms   34ms   32ms   31ms   22ms   23ms
 7.1s    4198    4198      0 45ms   36ms   35ms   31ms   30ms   22ms   24ms
 8.2s    3913    3913      0 51ms   50ms   50ms   39ms   35ms   24ms   26ms
 9.2s    4524    4524      0 34ms   31ms   30ms   28ms   27ms   21ms   22ms

After:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    7913    7913      0 25ms   25ms   20ms   15ms   14ms   12ms   13ms
   2s    7913    7913      0 18ms   18ms   18ms   16ms   14ms   12ms   13ms
   3s    8125    8125      0 20ms   20ms   17ms   15ms   14ms   12ms   12ms
   4s    5609    5609      0 41ms   35ms   29ms   28ms   27ms   13ms   18ms
 5.1s    8020    8020      0 18ms   17ms   17ms   15ms   14ms   12ms   13ms
 6.1s    7102    7102      0 27ms   27ms   24ms   19ms   18ms   13ms   14ms
 7.1s    5780    5780      0 26ms   26ms   26ms   23ms   22ms   17ms   18ms
 8.1s    6530    6530      0 37ms   34ms   26ms   22ms   20ms   15ms   15ms
 9.1s    7937    7937      0 19ms   19ms   17ms   17ms   16ms   12ms   13ms

Tests:

  - unit [release]
  - scylla-bench
"

* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
  sstables: Share partition index pages between readers
  sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
  sstables: index_reader: Do not store cluster index cursor inside partition indexes
2021-02-04 17:27:49 +02:00
Nadav Har'El
1953b1b006 alternator-test: increase timeout in tracing test
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).

It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)
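
The arithmetic behind these figures can be checked with a small Python sketch. It assumes the sleeps simply double from the first value up to and including the last one, and that the new limit follows the same doubling pattern; `total_sleep_tenths` is a hypothetical helper working in tenths of a second to avoid floating-point noise.

```python
def total_sleep_tenths(first_tenths, last_tenths):
    """Sum doubling sleeps from first to last, both in tenths of a second."""
    total = 0
    sleep = first_tenths
    while sleep <= last_tenths:
        total += sleep
        sleep *= 2
    return total


# Sleeps of 0.1, 0.2, ..., 3.2 seconds.
print(total_sleep_tenths(1, 32) / 10)
# Continuing the doubling up to 25.6 seconds.
print(total_sleep_tenths(1, 256) / 10)
```

Under these assumptions the sums come to 6.3 and 51.1 seconds, one tick short of the "roughly 6.4" and 51.2 quoted above.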

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
2021-02-04 17:17:07 +02:00
Tomasz Grabiec
63188abb87 sstables: Share partition index pages between readers
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.

Promoted index cursor is no longer created when the partition index entry
is parsed, but it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
2021-02-04 15:24:07 +01:00
Tomasz Grabiec
c232d71fc8 sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream() 2021-02-04 15:24:07 +01:00
Tomasz Grabiec
5ed559c8c6 sstables: index_reader: Do not store cluster index cursor inside partition indexes
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).

In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
2021-02-04 15:23:55 +01:00
Avi Kivity
713a159600 tools: toolchain: add simplified procedure for creating dbuild images
The current procedure for building images is complicated, as it
requires access to x86_64, aarch64, and s390x machines. Add an alternative
procedure that is fully automated, as it relies on emulation on a single
machine.

It is slow, but requires less attention.

Closes #8024
2021-02-04 15:37:36 +02:00
Avi Kivity
bd7fbcc0cf tools: toolchain: dbuild: keep original user's groups
The supplementary groups are removed by default, so add them back.
Supplementary groups are useful for group-shared directories like
ccache.

I added them to the podman-only branch since I don't know if this
works for docker. If a docker user verifies it works there too,
we can move it to the generic code.

Closes #8020
2021-02-04 15:36:55 +02:00
Gleb Natapov
e9043565b3 raft: add counters to raft server
The patch adds set of counters for various events inside raft
implementation to facilitate monitoring and debugging.

Message-Id: <20210204125313.GA1513786@scylladb.com>
2021-02-04 14:19:54 +01:00
Benny Halevy
f5fe8283cc test: reader_permit: do not include reader_concurrency_semaphore.hh in header file
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.

We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
2021-02-04 15:02:16 +02:00
Benny Halevy
338c190842 reader_concurrency_semaphore: inactive_read_handle: mark methods noexcept
All are trivially noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113327.1027792-1-bhalevy@scylladb.com>
2021-02-04 13:57:42 +02:00
Benny Halevy
ba4b8dd6e5 sstables: row.hh: no need to include reader_concurrency_semaphore.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>
2021-02-04 13:42:06 +02:00
Tomasz Grabiec
2e3f6a9622 tests: perf_fast_forward: Print output directory
Message-Id: <20210203180053.230627-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Tomasz Grabiec
e0ceb454c0 tests: perf_fast_forward: Print error hints to stdout
They point to lines printed to stdout, so should be aligned with them.
Message-Id: <20210203180016.230547-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Avi Kivity
fcd48adcc4 Update seastar submodule
* seastar b5b2ee53d...4c7c5c7c4 (1):
  > Merge "add support for printing backtraces on one line" from Benny

Fixes #5464.
2021-02-03 14:01:45 +02:00
Benny Halevy
ca6f5cb0bc test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
2021-02-03 12:21:20 +02:00
Pavel Solodovnikov
856b0b3a58 raft: introduce raft_gossip_failure_detector class
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.

Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
2021-02-03 10:45:18 +01:00
Tomasz Grabiec
f8ae46f294 Merge "raft: RPC module implementation" from Pavel Solodovnikov
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.

* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
  raft: introduce `raft_rpc` class
  raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
  configure.py: compile serializer.cc
2021-02-03 10:43:58 +01:00
Benny Halevy
55e3df8a72 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the first disk size.
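
The reordering amounts to the following pattern, shown here as a hedged Python sketch. `first_disk_size_if_enough` and its parameters are hypothetical and do not reproduce the scylla_util.py code.

```python
def first_disk_size_if_enough(ephemeral_disks, disk_sizes, min_disks=1):
    """Return the size of the first disk, or None if too few disks exist."""
    # Check the count first, so that indexing [0] can never raise IndexError.
    if len(ephemeral_disks) < min_disks:
        return None
    return disk_sizes[ephemeral_disks[0]]


print(first_disk_size_if_enough([], {}))
print(first_disk_size_if_enough(["nvme0n1"], {"nvme0n1": 1900}))
```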

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027
2021-02-03 11:30:18 +02:00
Avi Kivity
10606aadb5 Update tools/java submodule
* tools/java 78c8ef4f54...0187829d5e (1):
  > nodetool: alternate way to specify table name which includes a dot

Fixes #6521.
2021-02-03 11:27:33 +02:00
Botond Dénes
46b795b5fd mutation: consume(): add reverse mode
`mutation::consume()` is used by range scans to convert the intermediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.
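
In spirit, the new parameter does something like the Python sketch below. This is a deliberately simplified model: `consume`, its arguments and the flat list of clustering fragments are hypothetical, and details of the real `mutation::consume()` (static row, range tombstones, stop conditions) are left out.

```python
def consume(partition_start, clustering_fragments, consumer, in_reverse=False):
    """Feed a partition to `consumer`, optionally with clustering order reversed."""
    consumer(partition_start)
    fragments = reversed(clustering_fragments) if in_reverse else clustering_fragments
    for fragment in fragments:
        consumer(fragment)


seen = []
consume("partition-start", ["row1", "row2", "row3"], seen.append, in_reverse=True)
print(seen)
```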

Fixes: #8000

Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
2021-02-03 11:00:47 +02:00
Piotr Sarna
c03363b520 README: fix a dead link for building instructions
The link was outdated, since its destination was moved to
a subdirectory.

Message-Id: <b0e0eedaea4f26acf050a91ab9eed1ca37a838bb.1612338584.git.sarna@scylladb.com>
2021-02-03 10:59:50 +02:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Piotr Sarna
d395305ddd api: fix retrieving replied RPC messages
The API call referred to a nonexistent callback,
which is now renamed to better match the API path
and actually implemented.

Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>
2021-02-03 09:42:17 +02:00
Pekka Enberg
5670276163 Update seastar submodule
* seastar cb3aaf07...b5b2ee53 (1):
  > perftune.py: fix assignment after extend and add asserts
Fixes #8008
2021-02-02 15:27:13 +02:00
Tomasz Grabiec
873e732042 Merge "Switch partition rows onto B-tree" from Pavel Emelyanov
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.

The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree

Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:

  There's no copyable key at hand. The clustering key is the
  managed_bytes() that is not nothrow-copy-constructible, nor is
  it hashable for lookup, due to prefix lookup.

Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.

B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, which
have data only in leaf nodes. To avoid making the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.

To avoid invalidating iterators on insert/remove, the tree nodes keep
pointers to keys, not the keys themselves.

The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on random
insert/lookup test.

Numbers:

- memory_footprint: B-tree       master
  rows_entry size:  216          232

  1 row
   in-cache:        968          960     (because of dummy entry)
   in-memtable:     1006         1022

  100 rows
   in-cache:        50774        50856
   in-memtable:     50620        50918

- mutation_test:    B-tree       master
   tps.average:     891177       833896

- simple_query:     B-tree       master
   tps.median:      71807        71656
   tps.maximum:     71847        71708

* xemul/clustering-cache-over-btree-4:
  mutation_partition: Save one keys comparison
  partition_snapshot_row_cursor: Remove rows pointer
  mutation_partition: Use B-tree insertion sugar
  perf-test : Print B-tree sizes
  mutation_partition: Switch cache of rows onto B-tree
  partition_snapshot_reader: Rename cmp to less for explicity
  mutation_partition: Make insertion bullet-proof
  mutation_partition: Use tri-compare in non-set places
  flat_mutation_reader: Use clear() in destroy_current_mutation()
  rows_entry: Generalize compare
  utils: Intrusive B-tree (with tests)
  tests: Generalize bptree compaction test
  tests: Generalize bptree stress test
2021-02-02 12:26:02 +01:00
Tomasz Grabiec
75eb97b12c Merge 'Commitlog multi-entry write' from Calle Wilund
Fixes #7615

Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.

Small test included.

Closes #7616

* github.com:scylladb/scylla:
  commitlog_test: Add multi-entry write test
  commitlog: Add "add_entries" call to allow inputting N mutations
  commitlog: Make commitlog entries optionally multi-entry
  commitlog: Move entry_writer definition to cc file
2021-02-02 12:23:19 +01:00
Tomasz Grabiec
7b17969a6e Merge 'sstable: reader: preempt after every fragment' from Avi Kivity
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev, debug, release)

Fixes #7883.

Closes #7928

* github.com:scylladb/scylla:
  sstable: reader: preempt after every fragment
  clustering_range_walker: fix false discontiguity detected after a static row
2021-02-02 12:21:58 +01:00
Benny Halevy
0fecc78d88 user_function: throw on_internal_error if executed outside a seastar thread
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.

Add unit test that reproduces this scenario
and verifies the internal error exception.

Fixes #7977

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
2021-02-02 13:03:39 +02:00
Calle Wilund
720a47fe8a commitlog_test: Add multi-entry write test 2021-02-02 10:41:08 +00:00
Calle Wilund
c5f6125039 commitlog: Add "add_entries" call to allow inputting N mutations
Fixes #7615

Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to the segment, or none.

Returns rp_handle vector corresponding to the call vector.
2021-02-02 10:41:08 +00:00
Calle Wilund
5fcc2066ed commitlog: Make commitlog entries optionally multi-entry
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.

To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have multiple entries in the entry, totalling "size" bytes.
We checksum the entry header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.

When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
2021-02-02 10:41:08 +00:00
Calle Wilund
6bef3f9cc3 commitlog: Move entry_writer definition to cc file
Should not be public/visible
2021-02-02 10:32:44 +00:00
Juliusz Stasiewicz
29e4737a9b transport: Fix abort on certain configurations of native_transport_port(_ssl)
The reason was accessing the `configs` table out of index. Also,
native_transport_port-s can no longer be disabled by setting to 0,
as per the table below.

Rules for port/encryption (the same apply to shard_aware counterpart):

np  := native_transport_port.is_set()
nps := native_transport_port_ssl.is_set()
ceo := ceo.at("enabled") == "true"
eq  := native_transport_port_ssl() == native_transport_port()

+-----+-----+-----+-----+
|  np | nps | ceo |  eq |
+-----+-----+-----+-----+
|  0  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  0  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  0  |  1  |  0  |  *  |   =>   nonsense, don't listen
|  0  |  1  |  1  |  *  |   =>   listen on native_transport_port_ssl, encrypted
|  1  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  1  |  1  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  1  |  1  |  0  |   =>   listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted
|  1  |  1  |  1  |  1  |   =>   native_transport_port(_ssl), encrypted
+-----+-----+-----+-----+
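
The table can be read as a small decision function. The sketch below is only an illustration of the rules as written above, not the code from this patch; `listener` and `plan_listeners()` are hypothetical names.

```cpp
#include <string>
#include <vector>

// One listening socket: which option supplies the port, and whether it is encrypted.
struct listener {
    std::string port_option;
    bool encrypted;
    bool operator==(const listener& o) const {
        return port_option == o.port_option && encrypted == o.encrypted;
    }
};

// np/nps: the option is explicitly set; ceo: client encryption is enabled;
// eq: both options hold the same port number.
inline std::vector<listener> plan_listeners(bool np, bool nps, bool ceo, bool eq) {
    const std::string plain = "native_transport_port";
    const std::string ssl = "native_transport_port_ssl";
    if (!nps) {
        // Rows 1, 2, 5, 6: only the plain port matters, ceo decides encryption.
        return {listener{plain, ceo}};
    }
    if (!np) {
        // Rows 3, 4: only the SSL port is set; without encryption this is nonsense.
        if (!ceo) {
            return {};
        }
        return {listener{ssl, true}};
    }
    if (!ceo) {
        return {listener{plain, false}};   // row 7
    }
    if (eq) {
        return {listener{plain, true}};    // row 9: one shared port, encrypted
    }
    return {listener{plain, false}, listener{ssl, true}};   // row 8
}
```

For example, `plan_listeners(true, true, true, false)` yields two listeners, while `plan_listeners(false, true, false, false)` yields none.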

Fixes #7783
Fixes #7866

Closes #7992
2021-02-02 11:32:31 +02:00
Avi Kivity
285303b131 Update tools/jmx submodule
* tools/jmx 2c95650...949cefc (2):
  > dist/redhat: stop using systemd macros, call systemctl directly
  > Remove obsolete FIXME

See scylladb/scylla-jmx#94.
2021-02-02 11:29:36 +02:00
Takuya ASADA
7b310c591e dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005
2021-02-02 11:28:07 +02:00
Avi Kivity
da4fa0629a Merge "sstables: add sstable_origin to scylla_metadata" from Benny
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)

The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().

A get_origin() method was added to class sstable to retrieve
its origin.  It returns an empty string by default if the origin
is missing.

Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
it generates.  Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.

A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.

Test: unit(release)

Fixes #7880
"

* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
  compaction: log sstable origin
  sstables: scylla_metadata: add support for sstable_origin
  sstables: sstable_writer_config: add origin member
2021-02-02 10:35:11 +02:00
Pavel Emelyanov
54ddb5a70a mutation_partition: Save one keys comparison
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.

This 2nd comparison can be avoided:

- the 1st case needs a B-tree lower_bound API extension that reports
  whether the bound is a match or not.
- the 2nd one is covered with reusing tri-compare result from the
  1st comparison
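
A minimal sketch of a lower bound that also reports a match, assuming a sorted `std::vector<int>` in place of the B-tree. The names `tri_compare`, `bound_result` and `lower_bound_with_match` are made up for the illustration and are not the B-tree API.

```cpp
#include <cstddef>
#include <vector>

// Three-way comparator: negative, zero or positive, like memcmp.
inline int tri_compare(int a, int b) {
    return (a > b) - (a < b);
}

struct bound_result {
    std::size_t index;  // first position whose key is not less than the probe
    bool match;         // true when the key at that position equals the probe
};

// Binary search that remembers whether any comparison returned "equal",
// so the caller needs no extra comparison to learn if the bound is a hit.
inline bound_result lower_bound_with_match(const std::vector<int>& sorted_keys, int probe) {
    std::size_t lo = 0;
    std::size_t hi = sorted_keys.size();
    bool match = false;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const int c = tri_compare(sorted_keys[mid], probe);
        if (c < 0) {
            lo = mid + 1;
        } else {
            match = match || (c == 0);
            hi = mid;
        }
    }
    return {lo, match};
}
```

With a plain less-comparator the same search would return only the position, and the caller would have to compare once more to tell a hit from a miss.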

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
4ccce97396 partition_snapshot_row_cursor: Remove rows pointer
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
8e7c1e049b mutation_partition: Use B-tree insertion sugar
The B-tree .insert methods accept unique pointers and release them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
a92eb2f7a9 perf-test : Print B-tree sizes
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transformation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search
and linear extension of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per node is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footprint for B-tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
165255e2bd partition_snapshot_reader: Rename cmp to less for explicity
This is a less-comparator; cmp is used to denote tri-compare in this set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
ee9e104541 mutation_partition: Make insertion bullet-proof
The bi::intrusive::set::insert-s are non-throwing, so it's safe to add
new entry like this

  auto* ne = new entry;
  set.insert(ne);

and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of exception. There's already
a way for this:

   std::unique_ptr<entry> ne = std::make_unique<entry>();
   set.insert(*ne);
   ne.release();

so make every insertion into the set work this way in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
926f748a3d mutation_partition: Use tri-compare in non-set places
The mutation_partition::_rows will be switched to a B-tree with a tri
comparator, so to clearly identify the places not affected by it, switch
them to tri-compare in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
bfcd6a4bb7 flat_mutation_reader: Use clear() in destroy_current_mutation()
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of rows_entry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).

Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().

Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.

The code was copied from the B+ tree one, then heavily reworked, the internal
algorithms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.

As measured by the perf_collections test the B-tree with 8 keys is faster than
the std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
fewer pointers to set up and fewer instructions to execute when linking
node with root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception guarantees violations earlier

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a bit faster step-by-step
  tree destruction than plain begin+erase loop

v3:
- introduce "inline" root node that is embedded into tree object and in
  which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is pretty common case for rows cache

v2:
- introduce "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- 1-key root leaf
  and the first split, when the tree becomes 1-key root with two half-filled
  leaves. If the linear extension goes above NodeSize it can flatten even the
  2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan and the nodes
  while descending the tree gives ~+5% of fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost not to carry
  unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Pavel Emelyanov
6d63bdbefe tests: Generalize bptree compaction test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:59 +03:00
Pavel Emelyanov
8bdad0bb28 tests: Generalize bptree stress test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:57 +03:00
Avi Kivity
db4b9215dd sstable: reader: preempt after every fragment
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev)

Fixes #7883.
2021-02-01 19:32:07 +02:00
Avi Kivity
7634a90dd2 clustering_range_walker: fix false discontiguity detected after a static row
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.

This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.

A unit test is added.

Ref #7883.
2021-02-01 19:32:07 +02:00
Pavel Solodovnikov
9d17a654a6 raft: use null_sharder for raft tables
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>
2021-02-01 18:52:04 +02:00
Gleb Natapov
382ee066bf database: drop duplicated function
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.

Message-Id: <20210201135333.GA1403508@scylladb.com>
2021-02-01 18:52:04 +02:00
Tomasz Grabiec
eac9c1d80a Merge "raft: configuration changes with joint consensus" from Kostja
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.

Misc cleanups.

* scylla-dev/raft-config-changes-v2:
  raft: update README.md
  raft: add a simple test for configuration changes
  raft: joint consensus, wire up configuration changes in the API
  raft: joint consensus, count votes using joint config
  raft: joint consensus, wire up configuration changes in FSM
  raft: joint consensus, update progress tracker with joint configuration
  raft: joint consensus, don't store configuration in FSM
  raft: joint consensus, keep track of the last confchange index in the log
  raft: joint consensus, implement helpers in class configuration
  raft: joint consensus, use unordered_set for server_address list
  raft: joint consensus, switch configuration to joint
  raft: rename check_committed() to maybe_commit()
  raft: fix spelling and add comments
2021-02-01 18:52:04 +02:00
Benny Halevy
4b309e0829 compaction: log sstable origin
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
77328a936a sstables: scylla_metadata: add support for sstable_origin
Add new scylla_metadata_type::SSTableOrigin.
Store and retrieve an sstring to the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.

Add unit test to verify the sstable_origin extension
using both empty and a random string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin.  This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Nadav Har'El
75a4281bff cql-pytest: test the units supposed to be usable for "duration" type
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.

The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.

Refs: #8001.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
2021-02-01 11:05:10 +01:00
Avi Kivity
bb202db1ff Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify --keep-umask
option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build
2021-01-31 18:47:27 +02:00
Takuya ASADA
49e4f318a0 dist/offline_redhat: fix umask error
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify --keep-umask
option.

Fixes #6243
2021-01-31 21:37:49 +09:00
Takuya ASADA
74d7e31576 dist/offline_installer/redhat: support cross build
Support cross build by running CentOS 7 in Docker, so it is now possible
to build on Fedora.
Switching the container image is also supported; tested on Oracle Linux 7
and CentOS 7/8.
2021-01-31 21:37:49 +09:00
Avi Kivity
9271e4bf6e Update seastar submodule
* seastar 52d41277a...cb3aaf07e (2):
  > tls: reloadable_credentials_base: add_dir_watch: fix root dir detection
  > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
2021-01-31 13:28:45 +02:00
Raphael S. Carvalho
298d54ceb0 utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned
last fragment is unconditionally pushed to set of fragments, so if data
size is fragment-aligned, an empty fragment will be needlessly pushed to
the back of the fragment set.

note: i haven't tested if empty fragment at back of set will cause issues,
i think it won't, but this should be avoided anyway.
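
A rough sketch of the size calculation, assuming a hypothetical helper `fragment_sizes()` that only computes fragment lengths; the real code allocates buffers, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split data_size bytes into fragments of at most fragment_size bytes.
// The trailing partial fragment is added only when it is non-empty.
inline std::vector<std::size_t> fragment_sizes(std::size_t data_size, std::size_t fragment_size) {
    assert(fragment_size > 0);
    const std::size_t full_fragments = data_size / fragment_size;
    const std::size_t last_fragment_size = data_size % fragment_size;
    std::vector<std::size_t> sizes;
    // Reserve upfront so that filling the vector never reallocates.
    sizes.reserve(full_fragments + (last_fragment_size != 0 ? 1 : 0));
    sizes.assign(full_fragments, fragment_size);
    if (last_fragment_size != 0) {
        sizes.push_back(last_fragment_size);
    }
    return sizes;
}
```

For a fragment-aligned size such as 256 bytes with 128-byte fragments, this yields exactly two fragments and no empty one at the back.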

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
e745f1e697 utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
08e838d4b5 utils/fragmented_temporary_buffer: simplify allocate_to_fit()
1) reuse default_fragment_size for knowledge of max fragment size
2) fragments_count is not a good name as it doesn't include last non-full
fragment (if present), so rename it.
3) simplify calculation of last fragment size

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Pavel Solodovnikov
b9a280161d raft: introduce raft_rpc class
The patch contains a skeleton implementation for the Scylla-specific
Raft RPC module.

It uses `netw::messaging_service` as underlying mechanism to send
RPC messages.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:12:35 +03:00
Pavel Solodovnikov
1a979dbba2 raft: add Raft RPC verbs to messaging_service and wire up the RPC calls
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.

`send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when snapshot transfer
finishes or fails with an exception.

All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:11:17 +03:00
Pavel Solodovnikov
e30a55ba2f configure.py: compile serializer.cc
This file was not added to the configure.py,
which `raft_sys_table_storage` series was supposed to do.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:09:32 +03:00
Konstantin Osipov
a8f2fa7fa0 raft: update README.md 2021-01-29 22:07:08 +03:00
Konstantin Osipov
b7692af8bc raft: add a simple test for configuration changes
Test adding, removing, and replacing a node.

With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-29 22:07:08 +03:00
Konstantin Osipov
c7b5a60320 raft: joint consensus, wire up configuration changes in the API
Now that we've implemented joint consensus based configuration changes,
replace add_server()/remove_server() with a more general set_configuration().
2021-01-29 22:07:08 +03:00
Konstantin Osipov
afadc7c0a1 raft: joint consensus, count votes using joint config
Send RequestVote to a joint config.

We need to exclude self from the list of peers
if we're not part of the current configuration.
Avoid disrupting the cluster in this case.

Maintain separate status for previous and current config when counting
votes.
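
A minimal sketch of counting votes against both configurations, under the assumption that an election in joint mode is won only with a majority in each. `joint_votes`, `server_id` and the use of an empty `previous` set to mean "not joint" are illustration choices, not the types from this series.

```cpp
#include <cstddef>
#include <set>

using server_id = int;  // hypothetical; real ids are not plain integers

struct joint_votes {
    std::set<server_id> previous;  // C_old voters, empty when not in joint mode
    std::set<server_id> current;   // C_new voters
    std::set<server_id> granted;   // servers that granted their vote

    void register_vote(server_id from, bool vote_granted) {
        const bool is_voter = previous.count(from) != 0 || current.count(from) != 0;
        if (vote_granted && is_voter) {
            granted.insert(from);
        }
    }

    static bool has_majority(const std::set<server_id>& voters,
                             const std::set<server_id>& granted) {
        std::size_t yes = 0;
        for (server_id id : voters) {
            yes += granted.count(id);
        }
        return yes * 2 > voters.size();
    }

    // The status of each configuration is evaluated separately.
    bool won() const {
        const bool joint = !previous.empty();
        return has_majority(current, granted) && (!joint || has_majority(previous, granted));
    }
};
```

With C_old = {1, 2, 3} and C_new = {4, 5, 6}, votes from 4 and 5 alone are not enough; the candidate also needs two of the old servers.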
2021-01-29 22:07:08 +03:00
Konstantin Osipov
8b86d91754 raft: joint consensus, wire up configuration changes in FSM
When add_entry() with new configuration is submitted,
create a joint configuration and switch to it immediately.
Refuse to enter joint configuration if a configuration
change is already in progress.
When the leader has committed an entry with joint configuration,
append a new entry with final configuration and switch to it.

Resign leadership if the current leader is not part of a new
configuration.

When we change from A, B, C to B, C, D and the leader is A,
then, when C_new starts to be used, the leader is not part of
the current configuration, so it doesn't have to be in the tracker.
Do not try to find & advance leader progress unconditionally then.
2021-01-29 22:07:08 +03:00
Konstantin Osipov
18a684ba11 raft: joint consensus, update progress tracker with joint configuration
The leader doesn't have to be part of the current
configuration, so add a way to access follower_progress for the leader
only if it is present.

Upon configuration changes, preserve progress information
for intact nodes, remove for removed, and create a new progress
object for added nodes.

When tracking commit progress in joint configuration mode,
calculate two commit indexes for two configurations, and
choose the smallest one.
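
The rule can be sketched as below. This is an illustration under simplifying assumptions: `server_id`, `index_t`, `majority_match()` and `joint_commit_index()` are hypothetical names, and a missing progress entry is treated as match index 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

using server_id = int;            // hypothetical
using index_t = std::uint64_t;    // hypothetical log index type

// Highest log index known to be replicated on a majority of `voters`.
inline index_t majority_match(const std::vector<server_id>& voters,
                              const std::map<server_id, index_t>& match_idx) {
    assert(!voters.empty());
    std::vector<index_t> matched;
    matched.reserve(voters.size());
    for (server_id id : voters) {
        auto it = match_idx.find(id);
        matched.push_back(it == match_idx.end() ? 0 : it->second);
    }
    // Sort descending; the element at position n/2 is reached by a majority.
    std::sort(matched.begin(), matched.end(), std::greater<index_t>());
    return matched[matched.size() / 2];
}

// In joint mode an entry is committed only when both configurations agree,
// so the commit index is the smaller of the two per-configuration values.
inline index_t joint_commit_index(const std::vector<server_id>& c_old,
                                  const std::vector<server_id>& c_new,
                                  const std::map<server_id, index_t>& match_idx) {
    return std::min(majority_match(c_old, match_idx), majority_match(c_new, match_idx));
}
```

For match indexes {1: 10, 2: 10, 3: 5, 4: 3}, C_old = {1, 2, 3} gives 10 and C_new = {2, 3, 4} gives 5, so the joint commit index is 5.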
2021-01-29 22:07:08 +03:00
Konstantin Osipov
20df1955b2 raft: joint consensus, don't store configuration in FSM
In follower state, FSM doesn't know the current cluster
configuration.  Instead of trying to watch the follower log for
configuration changes to keep FSM copy up to date, remove it from
FSM altogether since the follower doesn't need it anyway.

When entering candidate or leader state, fetch the most recent
configuration from the log and initialize the state specific
state with it.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
b29181875c raft: joint consensus, keep track of the last confchange index in the log
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.

We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.

Remove all unused log constructors.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
6e128aa357 raft: joint consensus, implement helpers in class configuration 2021-01-29 22:07:07 +03:00
Konstantin Osipov
1ca738d9a2 raft: joint consensus, use unordered_set for server_address list 2021-01-29 22:07:07 +03:00
Konstantin Osipov
df944f953c raft: joint consensus, switch configuration to joint
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.

This means it must be stored in Raft log and snapshot
on the leader and followers.

This is most easily done if transitional configuration
is just a flavour of standard configuration.

In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
076e46af9e raft: rename check_committed() to maybe_commit()
This is what the function does, and it's the name
used in other implementations.
2021-01-29 22:07:07 +03:00
Gleb Natapov
aad0209b1c raft: fix spelling and add comments
Fix spelling errors in a few comments,
improve comments.

With fix-ups by Gleb Natapov <gleb@scylladb.com>
2021-01-29 22:07:07 +03:00
Pavel Emelyanov
575c992a35 test: Bring test_apply_monotonically_is_monotonic back to work
The idea of the monotonicity checking test is: try to apply
one random partition to another random one while sequentially
failing allocations. Each time an allocation fails (with the
bad_alloc exception) -- check that the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we have made sure that an
exception may pop up at any point of application and it
will be safe.

This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.

Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
2021-01-29 18:47:15 +01:00
Tomasz Grabiec
16eb4c6ce2 Merge "raft: system table backed persistency module" from Pavel Solodovnikov
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.

"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for persistency module and for future RPC implementation.

The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.

The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.

There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.

The patch-set also contains the test suite covering basic
functionality of the persistency module.

* manmanson/raft-api-impl-v11:
  raft/sys_table_storage: add basic tests for raft_sys_table_storage
  raft: introduce `raft_sys_table_storage` class
  utils: add `fragmented_temporary_buffer::allocate_to_fit`
  raft: add IDL definitions for raft types
  raft: create `system.raft` and `system.raft_snapshots` tables
  serializer: add `serializer<lw_shared_ptr<T>>` specialization
  serializer: add `deserialize` function overload for `bytes_ostream`
2021-01-29 11:40:39 +02:00
Pavel Solodovnikov
e309502c42 raft/sys_table_storage: add basic tests for raft_sys_table_storage
The test suite covers the most basic use cases for the system table
backed raft persistency module:
 * store/load vote and term
 * store/load snapshot
 * store snapshot with log tail truncation
 * store/load log entries
 * log truncation

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:27 +03:00
Pavel Solodovnikov
aebb1987b5 raft: introduce raft_sys_table_storage class
This is the implementation of raft persistency module that
uses `raft` system table as the underlying storage model.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:12 +03:00
Pavel Solodovnikov
d14dc030ac utils: add fragmented_temporary_buffer::allocate_to_fit
Introduce `fragmented_temporary_buffer::allocate_to_fit` static
function returning an instance of the buffer of a specified size.

The allocated buffer fragments have a size of at most 128kb.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.
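
A rough sketch of how a requested size could be split under that limit. The helper `fragment_sizes` is hypothetical, and the "full fragments first, remainder last" layout is an assumption for illustration, not necessarily what `allocate_to_fit` does:

```python
MAX_FRAGMENT_SIZE = 128 * 1024  # the 128kb limit mentioned above


def fragment_sizes(total_size, max_fragment=MAX_FRAGMENT_SIZE):
    """Sizes of the fragments needed to hold exactly `total_size` bytes."""
    if total_size < 0:
        raise ValueError("size must be non-negative")
    full, rest = divmod(total_size, max_fragment)
    # As many full-sized fragments as fit, plus one smaller tail if needed.
    return [max_fragment] * full + ([rest] if rest else [])


sizes = fragment_sizes(300 * 1024)
```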

This patch will be later needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:16 +03:00
Pavel Solodovnikov
e1504bbf0e raft: add IDL definitions for raft types
Changes to the `configuration` and `tagged_uint64` classes are needed
to overcome limitations of the IDL compiler tool, i.e. we need to
supply a constructor to the struct initializing all the
members (raft::configuration) and also need to make an accessor
function for private members (in case of raft::tagged_uint64).

All other structs mirror raft definitions in exactly the same way
they are declared in `raft.hh`.

`tagged_id` and `tagged_uint64` are used directly instead of their
typedef-ed companions defined in `raft.hh` since we don't want
to introduce indirect dependencies. In such case it can be guaranteed
that no accidental changes made outside of the idl file will affect idl
definitions.

This patch also fixes a minor typo in `snapshot_id_tag` struct used
in `snapshot_id` typedef.
2021-01-29 01:59:10 +03:00
Pavel Solodovnikov
cf5b8c4b79 raft: create system.raft and system.raft_snapshots tables
System raft table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:04 +03:00
Pavel Solodovnikov
83c26e542d serializer: add serializer<lw_shared_ptr<T>> specialization
This one works similar to `serializer<optional<T>>` and will be
later needed for serializing `raft::append_request`, which has
a field containing `lw_shared_ptr`.

Users to be warned, though: this code assumes that the pointer
is never null. This is done to mirror the serialize implementation
for `lw_shared_ptr:s` in the messaging_service.cc, which is
subject to being deleted in favor of the impl in the
`serializer_impl.hh`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:58:46 +03:00
Avi Kivity
b32ece6975 Update tools/java submodule
* tools/java 4a55b81941...78c8ef4f54 (1):
  > nodetool: do no treat table name with dot as a secondary index

Fixes #6521.
2021-01-28 16:16:47 +02:00
Kamil Braun
bf115e7d69 schema_tables: put schema tables on shard 0
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.

To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.

A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).

Closes #7947
2021-01-28 13:28:22 +02:00
Avi Kivity
32cdcc0c8b Merge "sstables: consolidate reader factory methods" from Botond
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses

This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.

This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.

One very sad discovery I made while working on this patch-set is that
we still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the
`mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.

Tests: unit(release, debug:v3)
"

* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
  cql_query_test: add unit test covering the non-optimal TWCS sstable read path
  sstable_mutation_reader: consolidate constructors
  tests: don't pass temporary ranges to readers
  sstables: sstable_mutation_reader: remove now unused whole sstable constructor
  sstables: stats: remove now unused sstable_partition_reads counter
  sstable: remove read_.*row.*_flat() methods
  tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
  sstables: pass partition_range to create_single_key_sstable_reader()
  sstables: sstable: add make_reader()
2021-01-28 12:05:06 +02:00
Botond Dénes
1e9ce62ee6 cql_query_test: add unit test covering the non-optimal TWCS sstable read path
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
2021-01-28 11:34:03 +02:00
Avi Kivity
55609f2033 Update seastar submodule
* seastar a287bb1a3...52d41277a (8):
  > fair_queue: Preempted requests got re-queued too far
  > scripts/perftune.py: remove repeated items after merging options from file
  > file.hh: Remove fair_queue.hh
  > Merge "Reloadable TLS certificate tolerance" from Calle
  > Merge "Cancellable IO" from Pavel E
  > abort-source: Improve the subscriptions management
  > fair_queue: Improve requests preemption while in pending state
  > http: add support for Default handler (/*)
2021-01-28 08:45:33 +01:00
Konstantin Osipov
b4f875f08e uuid: reduce code dependency on UUID_gen.hh
Do not include UUID_gen.hh in trace_state.hh and lists.hh
to reduce header level dependency on it.

Message-Id: <20210127173114.725761-2-kostja@scylladb.com>
2021-01-27 20:08:29 +02:00
Botond Dénes
6024ef5dad sstable_mutation_reader: consolidate constructors
The two remaining sstable reader constructors are very similar apart from the
content of the initializer lambda. Speaking of which, the two remaining
initializer lambdas can easily be merged into one too. So this patch
does just that: it consolidates the two constructors into one, and also
merges the initializer lambdas and extracts the result into a member
method. This means we have to store the previously captured variables as
members, but this is actually a good thing: when debugging we can see
the range and slice the reader is reading, and we are not actually
paying for it either -- they were already stored, just out of sight.
2021-01-27 17:38:17 +02:00
Botond Dénes
dd26a96e63 tests: don't pass temporary ranges to readers
The sstable_mutation_reader, like all other mutation readers expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally, in time some code came to rely on
it and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurrences were found.
2021-01-27 17:38:17 +02:00
Botond Dénes
43ad64db78 sstables: sstable_mutation_reader: remove now unused whole sstable constructor 2021-01-27 17:38:17 +02:00
Botond Dénes
ec6c540c30 sstables: stats: remove now unused sstable_partition_reads counter 2021-01-27 17:38:17 +02:00
Botond Dénes
5f18e9eb37 sstable: remove read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
c3b4e990a2 tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
080bc2ffec sstables: pass partition_range to create_single_key_sstable_reader()
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.

This in effect reverts 68663d0de.
2021-01-27 17:38:14 +02:00
Wojciech Mitros
a1f93e4297 api: use a list instead of a vector to remove a large allocation in api handler
Follow-up to #7917

The size of a cf::column_family_info is 224 bytes, so an std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Considering the vector is used only for iteration, it can be changed to
a non-contiguous list instead.
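
The arithmetic behind the 1MB figure can be checked with a short sketch. It assumes "1MB" means 1 MiB and ignores any extra capacity the vector may reserve while growing; the helper names are hypothetical:

```python
ENTRY_SIZE = 224          # bytes per column_family_info, from the text above
ONE_MIB = 1024 * 1024


def contiguous_bytes(entries):
    """Bytes a vector needs in one contiguous allocation for `entries` items."""
    return entries * ENTRY_SIZE


def min_entries_exceeding(limit):
    """Smallest entry count whose contiguous storage is larger than `limit`."""
    return limit // ENTRY_SIZE + 1


threshold = min_entries_exceeding(ONE_MIB)
```

Under these assumptions the threshold is a few thousand column families; a linked list allocates each node separately, so no single allocation grows with the table count.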

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7973
2021-01-27 16:02:07 +02:00
Avi Kivity
aec231ba2e Merge "Unify query paths" from Botond
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()

The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementations of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.

This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components however were created to work with
`flat_mutation_reader`, however introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.

As mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifests when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny however this difference proved to be completely theoretical,
as the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.

This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.

Tests: unit(release:v2, debug:v2, dev:v3)
"

* 'unified-query-path/v3' of https://github.com/denesb/scylla:
  mutation: remove now unused query() and query_compacted()
  treewide: use query_mutations() instead of mutation::query()
  mutation_test: test_query_digest: ensure digest is produced consistently
  mutation_query: introduce query_mutation()
  mutation_query: to_data_query_result(): migrate to standard query code
  mutation_query: move to_data_query_result() to mutation_partition.cc
  mutation: add consume()
  flat_mutation_reader: move mutation consumer concepts to separate header
  mutation compactor: query compaction: ignore purgeable tombstones
2021-01-27 15:58:47 +02:00
Botond Dénes
a5a8037f6e sstables: sstable: add make_reader()
This will be the only method to create sstable readers with. For now we
leave the other variants, they as well as their users will be removed in
a following patch.
2021-01-27 15:20:06 +02:00
Nadav Har'El
2113849a2b cql-pytest: reproducer for toJson() bug with doubles
This patch adds a cql-pytest, test_json.py::test_tojson_double(),
which reproduces issue #7972 - where toJson() prints some doubles
incorrectly - truncated to integers, but some it prints fine (I
still don't know why, this will need to be debugged).

The test is marked xfail: It fails on Scylla, and passes on Cassandra.

Refs #7972.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210127124338.297544-1-nyh@scylladb.com>
2021-01-27 14:00:25 +01:00
Pavel Solodovnikov
10b117aada raft: create dummy impl for schema changes state machine
This patch introduces `schema_raft_state_machine` class
which is currently just a dummy implementation throwing a
"not implemented" exception for every call.

Will be needed later to construct an instance of `raft::server`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>
2021-01-27 12:33:27 +01:00
Pavel Solodovnikov
223c823963 serializer: add deserialize function overload for bytes_ostream
For some reason we had a distinct specialization of `serialize`
function to handle `bytes_ostream` but not `deserialize`.

This will be used in the following patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-26 23:21:15 +03:00
Asias He
c82250e0cf gossip: Allow deferring advertise of local node to be up
Currently the replacing node sets the status as STATUS_UNKNOWN when it
starts gossip service for the first time before it sets the status to
HIBERNATE to start the replacing operation. This introduces the
following race:

1) Replacing node using the same IP address of the node to be replaced
starts gossip service without setting the gossip STATUS (will be seen as
STATUS_UNKNOWN by other nodes)

2) Replacing node waits for gossip to settle and learns status and
tokens of existing nodes

3) Replacing node announces the HIBERNATE STATUS.

After Step 1 and before Step 3, existing nodes will mark the replacing
node as UP, but haven't marked the replacing node as doing replacing
yet. As a result, the replacing node will not be excluded from the read
replicas and will be considered a target node to serve CQL reads.

To fix, we make the replacing node avoid responding to echo messages when it
is not ready.

Fixes #7312

Closes #7714
2021-01-26 19:02:11 +01:00
Pekka Enberg
9fc83ac627 Update tools/java submodule
* tools/java 8080009794...4a55b81941 (1):
  > cassandra.in.sh: remove debug message
2021-01-26 15:56:58 +02:00
Avi Kivity
90a6c3bd7a build: reduce release mode inline tuning on aarch64
I see a miscompile on aarch64 where a call to format("{}", uuid)
translates a function pointer to -1. When called, this crashes.

Reduce the inline threshold from 2500 to 600. This doesn't guarantee
no miscompiles but all the tests pass with this parameter.

Closes #7953
2021-01-26 11:14:42 +02:00
Tomasz Grabiec
90f6bb754e Merge "raft: replication tests: fixes for debug mode" from Alejo
The following patches fix issues seen occasionally in debug mode.

Notes:
    - In debug mode there's still the UB nullptr arithmetic warning.

* https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation:
  raft: replication test: wait for log propagation
  raft: replication test: move wait for log to a function
  raft: replication test: remove unused member
  raft: replication test: use later()
  raft: testing: remove election wait time and just yield
2021-01-26 11:14:42 +02:00
Avi Kivity
f58151d191 test: mutation_test: fix initialization order bug with thread local storage
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.

Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate or move
the early initialization.

Closes #7951
2021-01-26 11:14:42 +02:00
Nadav Har'El
356250f720 cql-pytest: tests for fromJson() failing to set tuple elements to null
This patch adds a test for trying to set a tuple element to null with
fromJson(), which works on Cassandra but fails on Scylla. So the test
xfails on Scylla. Reproduces issue #7954.

Refs #7954.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210124082311.126300-1-nyh@scylladb.com>
2021-01-26 11:14:42 +02:00
Avi Kivity
05c435dddc Merge "mutation readers: remove next_partition() workarounds" from Botond
"
`next_partition()` used to return void, so readers that had to call
future returning code had to work around this. Now that
`next_partition()` returns a future, we can get rid of these
workarounds.

Tests: unit(release, debug)
"

* 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
  mutation_reader: evictable_reader: remove next_partition() workaround
  mutation_reader: shard_reader: remove next_partition() workaround
  mutation_reader: foreign_reader: remove next_partition() workaround
2021-01-26 11:14:42 +02:00
Nadav Har'El
067330c08f Merge 'redis: support large redis message' from Takuya ASADA
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we need to return nullptr when we are not at the tail yet.

Fixes #7273

Closes #7903

* github.com:scylladb/scylla:
  redis: rename _args_size/_size_left
  redis: fix large message handling
2021-01-25 10:11:17 +02:00
Takuya ASADA
229940aaff redis: rename _args_size/_size_left
There are two types of numerical parameter in redis protocol:
 - *[0-9]+ defined array size
 - $[0-9]+ defined string size

Currently, array size is stored to args_count, and string size is
stored to _arg_size / _size_left.
It's a bit hard to understand since both use the same word "arg(s)", so let's
rename the string size variables to _bytes_count / _bytes_left.
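
To illustrate the two kinds of numeric parameter, here is a minimal sketch of parsing a redis command, not the parser used in the tree. The function `parse_command` is hypothetical; it returns `None` while the buffer does not yet hold the whole message, so the caller can read more data and retry:

```python
def parse_command(buf):
    """Return (args, bytes_consumed), or None if `buf` is still incomplete."""

    def read_line(pos):
        end = buf.find(b"\r\n", pos)
        if end < 0:
            return None              # header line not fully received yet
        return buf[pos:end], end + 2

    line = read_line(0)
    if line is None:
        return None
    header, pos = line
    if not header.startswith(b"*"):
        raise ValueError("expected array header")
    args_count = int(header[1:])     # *[0-9]+ : number of array elements

    args = []
    for _ in range(args_count):
        line = read_line(pos)
        if line is None:
            return None
        header, pos = line
        if not header.startswith(b"$"):
            raise ValueError("expected string header")
        bytes_count = int(header[1:])  # $[0-9]+ : string length in bytes
        if len(buf) < pos + bytes_count + 2:
            return None              # string body or its CRLF still missing
        args.append(buf[pos:pos + bytes_count])
        pos += bytes_count + 2       # skip the body and trailing CRLF (not validated here)
    return args, pos


message = b"*2\r\n$3\r\nGET\r\n$3\r\nkey\r\n"
complete = parse_command(message)
partial = parse_command(message[:10])
```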
2021-01-25 10:26:37 +09:00
Takuya ASADA
7a6ee9858f redis: fix large message handling
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we need to return nullptr when we are not at the tail yet.

Fixes #7273
2021-01-25 10:26:37 +09:00
Alejo Sanchez
0d694990cf raft: replication test: wait for log propagation
Wait until entries propagate after adding and before changing leader
using the same code as done for partitioning.

This fixes occasional hangs in debug mode when a test switches to a
different leader without leaving enough time for full propagation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:33:54 -04:00
Alejo Sanchez
4d1ec88f90 raft: replication test: move wait for log to a function
Move wait for log propagation to its own function for reuse.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
72f9b108e3 raft: replication test: remove unused member
Initial state doesn't need to specify total entries anymore.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
db95d6e7f1 raft: replication test: use later()
Instead of sleep 1us use later()

Also use later to yield after sending append entries in rpc test impl.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
f875ff72c9 raft: testing: remove election wait time and just yield
Replace sleep time for elect_me_leader with yield to speed things up.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Pekka Enberg
8258556832 Update tools/python3 submodule
* tools/python3 c579207...199ac90 (1):
  > dist: debian: adjust .orig tarball name for .rc releases
2021-01-24 21:30:59 +02:00
Gleb Natapov
020da49c89 storage_proxy: remove no longer needed range_slice_read_executor
After support for the mixed cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b,
range_slice_read_executor and never_speculating_read_executor became
identical, so remove the former for good.

Message-Id: <20210124122731.GA1122499@scylladb.com>
2021-01-24 14:45:22 +02:00
Benny Halevy
088f92e574 paxos_state: learn: fix injected error description
It was copy-pasted from another injection point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201220091439.3604201-1-bhalevy@scylladb.com>
2021-01-24 11:51:23 +02:00
Takuya ASADA
5d527bd17e scylla_ntp_setup: use chrony on all distributions
To simplify scylla_ntp_setup, use chrony on all distributions.

Closes #7922
2021-01-24 11:45:58 +02:00
Takuya ASADA
984dc44ebf dist: drop /etc/security/limits.d/scylla.conf
Drop limits.d conf file, since we don't use it.
We set these parameters via systemd unit file instead.

Fixes #7925

Closes #7941
2021-01-24 11:43:39 +02:00
Benny Halevy
1847d49971 test: test_env: pick the highest sstable version by default
If possible, test the highest sstable format version,
as it's the most used.

If there are pre-written sstables of an older version that we need
to load from the test directory, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
2021-01-24 10:38:55 +02:00
Botond Dénes
226088d12e mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
It's not used anymore.
2021-01-22 16:18:59 +02:00
Botond Dénes
4eb65b12a0 mutation_reader: evictable_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 16:18:30 +02:00
Botond Dénes
febd2feb4c mutation_reader: shard_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:53:05 +02:00
Botond Dénes
9c96d74b72 mutation: remove now unused query() and query_compacted() 2021-01-22 15:36:37 +02:00
Botond Dénes
1a3ee71b39 treewide: use query_mutations() instead of mutation::query()
We want to retire the latter.
2021-01-22 15:36:37 +02:00
Botond Dénes
81da6b756f mutation_reader: foreign_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:30:36 +02:00
Nadav Har'El
cb9e2ee00a cql-pytest: tests for fromJson() setting a map<ascii, int>
The fromJson() function can take a map JSON and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, scylla fails the
fromJson() function, with a misleading error.

Refs #7949.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
2021-01-22 14:29:25 +01:00
Botond Dénes
a9d726c7ba mutation_test: test_query_digest: ensure digest is produced consistently
Before we retire the mutation::query() code, expand the digest test to
check that the new code replacing it produces identical digest on all
possible equivalent mutations.
2021-01-22 15:27:48 +02:00
Botond Dénes
821ed96e0e mutation_query: introduce query_mutation()
This is a replacement of `mutation::query()`, but with an implementation
based on the standard query result building code.
This will allow us to migrate the remaining `mutation::query()` users
off of said method, which in turn will allow us to retire it finally.
2021-01-22 15:27:48 +02:00
Botond Dénes
c4f12221b8 mutation_query: to_data_query_result(): migrate to standard query code
Reimplement in terms of the standard query result building code. We want
to retire the alternative query result code in `mutation::query()` and
`to_data_query_result()` is one of the main users.
2021-01-22 15:27:48 +02:00
Botond Dénes
164582f33b mutation_query: move to_data_query_result() to mutation_partition.cc
We want to rewrite the above mentioned method's implementation in terms
of the standard query result building code (that of the `data_query()`
path), in order to retire the alternative query code in the mutation
class.
The `data_query()` code uses classes private to `mutation_partition.cc`
and instead of making these public, just move `to_data_query_result()`
to `mutation_partition.cc`.
2021-01-22 15:27:48 +02:00
Botond Dénes
d0c5f550a9 mutation: add consume()
This consume method accepts a `FlattenedConsumer`, the same one that the
name-sake `flat_mutation_reader::consume()` does. Indeed the main
purpose of this method is to allow using the standard query result
building stack with a mutation, the same way said stack is used with
mutation readers currently. This will allow us to replace the parallel
query result building code that currently exists in the
`mutation::query()` and friends, with the standard one.
2021-01-22 15:27:48 +02:00
Botond Dénes
9153f63135 flat_mutation_reader: move mutation consumer concepts to separate header
In the next patch we will want to use these concepts in `mutation.hh`. To
avoid pulling in the entire `flat_mutation_reader.hh` just for these,
and create a circular dependency in doing so, move them to a dedicated
header instead.
2021-01-22 15:27:48 +02:00
Botond Dénes
73808c12eb mutation compactor: query compaction: ignore purgeable tombstones
Currently query compaction keeps purgeable tombstones, which makes query
result building sensitive to whether the data was recently compacted or
not. In particular, different digests will be produced depending on
whether purgeable tombstones happened to be compacted (and thus purged)
or not. This means that two replicas can produce different digests for
the same data if one has compacted some purgeable tombstones and the
other has not.

To avoid this, drop purgeable tombstones during query compaction as
well.
2021-01-22 15:27:48 +02:00
Pavel Emelyanov
90d445464b compaction: Remove compaction_manager::enabled()
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used
and can even be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
2021-01-22 14:07:38 +02:00
Kamil Braun
570d15c7bc multishard_combining_reader: do not use smp::count
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader causes a segfault.

Fixes #7945.
2021-01-21 18:28:18 +02:00
Nadav Har'El
328be1ca7c cql-pytest: tests for fromJson() not accepting empty string as integer
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.

The tests in this patch check this for two integer types: int and
varint.

Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.

The tests demonstrating Scylla's bug are marked xfail, and those
demonstrating Cassandra's bug are marked "cassandra_bug" (which means
they are marked xfail only when running against Cassandra, but expected
to succeed on Scylla).

Refs #7944.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
2021-01-21 15:24:48 +01:00
Nadav Har'El
702b1b97bf cql: fix error return from execution of fromJson() and other functions
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.

This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:

1. Parse errors in fromJson()'s parameters are converted into a
   function_execution_exception.

2. Any exception during the execute() of a native_scalar_function_for
   function is converted into a function_execution_exception.
   In particular, fromJson() uses a native_scalar_function_for.

   Note, however, that when a function already took care to produce
   a specific Cassandra error, this error is passed through and not
   converted to a function_execution_exception. An example is
   blobAsText(), which can return an invalid_request error, so
   it is left as such and not converted. This also happens in Cassandra.
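
The two conversion rules above can be sketched in Python. This is only an
illustration of the wrap-or-pass-through pattern, not Scylla's C++ code:
the class names, the error-code strings and the helper `execute_native()`
are hypothetical stand-ins chosen for the example.

```python
import json


class CassandraError(Exception):
    """Hypothetical stand-in for cassandra_exception."""
    code = "SERVER_ERROR"


class InvalidRequest(CassandraError):
    code = "INVALID_REQUEST"


class FunctionExecutionError(CassandraError):
    """Hypothetical stand-in for function_execution_exception."""
    code = "FUNCTION_FAILURE"

    def __init__(self, func_name, detail):
        super().__init__(f"execution of '{func_name}' failed: {detail}")


def execute_native(func_name, impl, *args):
    """Run impl(*args) and convert unexpected failures to FUNCTION_FAILURE."""
    try:
        return impl(*args)
    except CassandraError:
        # The function already produced a specific error: pass it through.
        raise
    except Exception as e:
        raise FunctionExecutionError(func_name, e) from e


def from_json_int(text):
    """Toy fromJson() for an int column: accepts a number or a numeric string."""
    value = json.loads(text)
    if isinstance(value, str):
        value = int(value)  # the empty string raises ValueError here
    return value
```

With this shape, `execute_native("fromJson", from_json_int, '""')` raises
the function-failure error instead of a generic server error, while an
`InvalidRequest` raised inside a function reaches the caller unchanged.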

All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.

Fixes #7911

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
2021-01-21 15:21:13 +01:00
Nadav Har'El
49440d67ad Merge: Fix multiple issues with timeuuid type
Merged patch series by Konstantin Osipov:

"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.

The test coverage is extended for all of the above."

  uuid: add a comment warning against UUID::operator<
  uuid: replace slow versions of timeuuid compare with optimized/tested versions.
  test: add tests for legacy uuid compare & msb monotonicity
  test: add a test case for append/prepend limit
  test: add a test case for monotonicity of timeuuid least significant bits
  uuid: implement optimized timeuuid compare
  test: add a test case for list prepend/append with custom timestamp
  lists: rewrite list prepend to use append machinery
  lists: use query timestamp for list cell values during append
  uuid: fill in UUID node identifier part of UUID
  test: add a CQL test for list append/prepend operations
2021-01-21 13:20:07 +02:00
Konstantin Osipov
e18e2cb9f2 uuid: add a comment warning against UUID::operator< 2021-01-21 13:03:59 +03:00
Konstantin Osipov
845f6c667b uuid: replace slow versions of timeuuid compare with optimized/tested versions. 2021-01-21 13:03:59 +03:00
Konstantin Osipov
56d8d166cb test: add tests for legacy uuid compare & msb monotonicity 2021-01-21 13:03:59 +03:00
Konstantin Osipov
257c5b0879 test: add a test case for append/prepend limit 2021-01-21 13:03:59 +03:00
Konstantin Osipov
d6e65a3735 test: add a test case for monotonicity of timeuuid least significant bits
Ensure that timeuuid least significant bits are compared correctly.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
0af3758aff uuid: implement optimized timeuuid compare
Introduce uint64_t based comparator for serialized timeuuids.

Respect Cassandra legacy for timeuuid compare order.

Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.

This commit re-implements the implementations of these comparators in
types.cc and deprecates the respective implementations in types.cc. They
will be removed in a following patch.

A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
b4500a55c7 test: add a test case for list prepend/append with custom timestamp
Scylla now takes a custom timestamp into account when
executing list append/prepend operations. Test the new
semantics.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
232ce6f611 lists: rewrite list prepend to use append machinery
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.

After this patch, list prepend begins to honor user supplied timestamps.

If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.

Fixes #7611
2021-01-21 13:03:59 +03:00
Konstantin Osipov
2b8ce83eea lists: use query timestamp for list cell values during append
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.

Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.

This is reported as https://github.com/scylladb/scylla/issues/7611

A statement which appended multiple values to a list or a BATCH
generated its own microsecond-resolution timeuuid for each value:

BEGIN BATCH
  UPDATE ... SET a = a + [3]
  UPDATE ... SET a = a + [4]
APPLY BATCH

UPDATE ... SET a = a + [3, 4]

To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.

To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:

60 bits: time since start of GMT epoch, year 1582, represented
         in 100-nanosecond units
4 bits:  version
14 bits: clock sequence, a random number to avoid duplicates
         in case system clock is adjusted
2 bits:  type
48 bits: MAC address (or other hardware address)

The purpose of the clockseq bits, as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5,
is to reduce the probability of UUID collision in case the clock
goes back in time or the node id changes. The implementation should reset
it whenever one of these events may occur.

Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.

The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
  values of a statement or a batch. The time is unique and monotonic in
  case of LWT. Otherwise it's almost always monotonic, but may not be
  unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
  unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
  if time is auto-generated, MAC component contains a random (spoof)
  MAC address, re-created on each restart. The address is different
  at each shard.

The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code without increasing the
quality (uniqueness) of created list keys.

Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If the range is exceeded, an exception is thrown.

A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.
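
The field layout recalled above can be illustrated with Python's standard
`uuid` module. This is a sketch of the RFC 4122 version-1 layout only: the
helper `make_timeuuid()` is hypothetical, and it does not reproduce the
sign-complement of clockseq or the use of extra sub-microsecond bits that
the patch describes.

```python
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000

CLOCKSEQ_BITS = 14
EXTRA_TIME_BITS = 3

# Largest values representable in 14 bits and in 14 + 3 bits.
max_clockseq = (1 << CLOCKSEQ_BITS) - 1
max_extended = (1 << (CLOCKSEQ_BITS + EXTRA_TIME_BITS)) - 1


def make_timeuuid(unix_micros, clock_seq, node):
    """Pack a microsecond timestamp, a 14-bit clockseq and a 48-bit node id."""
    ts = unix_micros * 10 + UUID_EPOCH_OFFSET_100NS      # 60-bit, 100-ns units
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | (1 << 12)  # 4 version bits = 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # 2 variant bits
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Two values appended by one statement: same time, increasing sequence number.
first = make_timeuuid(1611226800000000, 0, 0x123456789ABC)
second = make_timeuuid(1611226800000000, 1, 0x123456789ABC)
```

Here `first.time == second.time`, while `first.clock_seq` and
`second.clock_seq` differ, which is the property the fix relies on for
values created within one statement or batch.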

Fixes #7611

test: unit (dev)
2021-01-21 13:03:59 +03:00
Konstantin Osipov
6d1781be36 uuid: fill in UUID node identifier part of UUID
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.

To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.

The shard id is mixed into higher bits of MAC, to reduce the
chances of a NIC collision within the same network.

With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.

E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l  = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.

At least we are now less likely to get any value lost.

Fixes #6208.

@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()

test: unit (dev)
2021-01-21 13:03:53 +03:00
Avi Kivity
4cfaab208e allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.

Note that managed_bytes is the only user.

Closes #7943
2021-01-21 11:15:13 +02:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pekka Enberg
7d98e05923 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-21 10:41:33 +02:00
Avi Kivity
daa0e964fc dbuild: avoid --pids-limit with podman and cgroupsv1
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.

To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.
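
The detection rule can be sketched as follows. dbuild itself is a shell
script; this Python version only illustrates the decision, and the helper
names and the default limit value are assumptions made for the example.

```python
def cgroup_v2_available(proc_filesystems_text):
    """Return True if 'cgroup2' is listed in the text of /proc/filesystems."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>cgroup2" or "<TAB>ext4".
        if fields and fields[-1] == "cgroup2":
            return True
    return False


def pids_limit_args(tool, proc_filesystems_text, limit="-1"):
    """Arguments to pass to the container tool (limit value is hypothetical)."""
    if tool == "podman" and not cgroup_v2_available(proc_filesystems_text):
        # cgroupsv1 with podman: omit the option. The user is expected to
        # set the pids limit in /etc/containers/containers.conf instead.
        return []
    return ["--pids-limit", limit]
```

On a real system the text would come from reading `/proc/filesystems`.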

Fixes #7938.

Closes #7939
2021-01-21 10:41:33 +02:00
Botond Dénes
4d581f1bb3 docs/README.md: guides: also mention running and debugging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>
2021-01-20 16:07:29 +02:00
Avi Kivity
f11a0700a8 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream
2021-01-20 16:07:29 +02:00
Pekka Enberg
6cc981d089 scylla: Add "--build-mode" command line option
This adds a "--build-mode" command line option to "scylla" executable:

$ ./build/dev/scylla --build-mode
dev

This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.

Closes #7865
2021-01-20 16:07:29 +02:00
Botond Dénes
7eb8c71342 tools/scylla-types: add link to cql3-type-mapping.md
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the cql3 type names to
the class names is not straight-forward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Botond Dénes
882ade7c6a types/scylla-sstable-index: update URL to cql3-type-mapping.md
Said document was recently moved but the URL was not updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Avi Kivity
114da51d73 Revert "commitlog: fix size of a write used to zero a segment"
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.

A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.

Reopens #5857.
2021-01-20 10:23:43 +02:00
Avi Kivity
586f16bf79 Merge "Cut snitch -> storage service dependency" from Pavel E
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.

This interdependency is relaxed:

- snitch gossips all its state itself without using the
  storage service as a mediator
- storage service listens for snitch updates with the help
  of self-breaking subscription

Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dangling pointers/references.

tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"

* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
  storage-service: Subscribe to snitch to update topology
  snitch: Introduce reconfiguration signal
  snitch: Always gossip snitch info itself
  snitch: Do gossip DC and RACK itself
  snitch: Add generic gossiping helper
2021-01-20 10:23:43 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
248449816b raft: fix snapshot transfer with existing log prefix
Current code that checks when snapshot has to be transferred does not
take into account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for snapshot transfer
condition.

Message-Id: <20210117095801.GB733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
1ab262e86b raft: test: change replication_test to submit one entry at a time
replication_test's state machine is not commutative, so if commands are
applied in a different order the states will be different as well. Since
the preemption check was added into co_await in seastar, even waiting for
a ready future can preempt, which causes reordering of simultaneously
submitted entries in debug mode. For a long time we tried to keep entry
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.
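
To see why ordering matters, consider a toy non-commutative state machine.
The commands below are purely illustrative and are not the ones used by
replication_test.

```python
def apply_all(commands, state=0):
    """Apply (op, arg) commands in order and return the final state."""
    for op, arg in commands:
        if op == "add":
            state += arg
        elif op == "mul":
            state *= arg
        else:
            raise ValueError(f"unknown op: {op}")
    return state


entries = [("add", 1), ("mul", 2)]
in_order = apply_all(entries)                    # (0 + 1) * 2
reordered = apply_all(list(reversed(entries)))   # (0 * 2) + 1
```

The two runs end in different states, so a test comparing the final state
against an expected value is only deterministic if entries are applied in
a fixed order, which submitting them one by one guarantees.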

Message-Id: <20210117095147.GA733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Benny Halevy
f29732573a mutation_writer: bucket_writer: add close
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().

_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way in the stack.  This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.

Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.
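
The close() pattern can be sketched with Python's asyncio. This is an
analogue of the idea, not the Seastar code: the class and method names
loosely mirror the commit message, and the queue-based consumer is an
assumption made for the example.

```python
import asyncio


async def consumer(queue):
    """Drain the queue until end-of-stream (None) or an injected error."""
    seen = []
    while True:
        item = await queue.get()
        if item is None:
            return seen
        if isinstance(item, Exception):
            raise item
        seen.append(item)


class BucketWriter:
    def __init__(self, consumer_fn):
        # Must be constructed while an event loop is running.
        self._queue = asyncio.Queue()
        self._consumer_task = asyncio.create_task(consumer_fn(self._queue))

    def consume(self, item):
        self._queue.put_nowait(item)

    def consume_end_of_stream(self):
        # Synchronous: only signals the end; waiting happens in close().
        self._queue.put_nowait(None)

    def abort(self, exc):
        self._queue.put_nowait(exc)

    async def close(self):
        # Always wait for the consumer, and drop the exception expected on
        # the abort path so that it is never left unobserved.
        try:
            await self._consumer_task
        except Exception:
            pass


async def demo():
    writer = BucketWriter(consumer)
    writer.consume("fragment")
    writer.abort(RuntimeError("aborted"))
    await writer.close()  # does not raise
    return "closed"
```

Calling close() on both the success and the error path means the
consumer's result is always awaited exactly once.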

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 19:03:58 +02:00
Benny Halevy
fc3f9a57ff mutation_writer/feed_writers: refactor bucket/shard writers
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.

This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_consumer_fut.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:48:01 +02:00
Benny Halevy
a9d91a2d09 mutation_writer: update bucket/shard writers consume_end_of_stream
After 61520a33d6
feed_writers doesn't call consume_end_of_stream
after abort() so no need to test
            if (!_handle.is_terminated()) {

and consume_end_of_stream is now called in then_wrapped
rather than `finally` so it's ok if it throws.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:44:40 +02:00
Kamil Braun
1a8630e6a7 transport: silence "broken pipe" and "connection reset by peer" errors
The code would already silence broken pipe exceptions since it's
expected when the other side closes the connection or when we shut down the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
   aforementioned two scenarios; the conditions that determine which of
   the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
   mainly during shutdown. The errors could happen once when trying to send
   a response to a request (`_write_buf.write(...)/flush(...)`) and then
   again when trying to close the connection in a `finally` block. These
   nested exceptions were not silenced.

The commit handles each of these cases.
Closes #7907.
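
A Python analogue of the classification described above is sketched here.
It assumes that a nested error is silenced only when every exception in
the chain is a broken pipe or a connection reset; that rule, and the
helper name, are assumptions for illustration rather than a description of
the actual C++ change.

```python
def is_benign_disconnect(exc):
    """True if exc and all exceptions nested under it are disconnect errors."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if not isinstance(exc, (BrokenPipeError, ConnectionResetError)):
            return False
        # Python records the error that was in flight in __context__,
        # roughly like an outer exception wrapping an inner one.
        exc = exc.__context__
    return True


def make_nested():
    """Fail once while 'writing', then again while 'closing' in finally."""
    try:
        try:
            raise BrokenPipeError("write failed")
        finally:
            raise ConnectionResetError("close failed")
    except ConnectionResetError as e:
        return e
```

`is_benign_disconnect(make_nested())` is true, while any chain containing
an unrelated error is still reported.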

Closes #7931
2021-01-19 10:30:17 +02:00
Tomasz Grabiec
94749b01eb Merge "futurize flat_mutation_reader::next_partition" from Benny
The main motivation for this patchset is to prepare
for adding an async close() method to flat_mutation_reader.

In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destroying it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.

Test: unit(release, debug)

* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
  flat_mutation_reader: return future from next_partition
  multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
  mutation_reader: filtering_reader: fill_buffer: futurize inner loop
  flat_mutation_reader::impl: consumer_adapter: futurize handle_result
  flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
  flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
  flat_mutation_reader:impl: get rid of _consume_done member
2021-01-19 10:19:03 +02:00
Alejo Sanchez
8a61e7defc raft: etcd unit tests: test log replication
etcd TestLogReplication

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
417b18aaad raft: boost test etcd: test fsm can vote from any state
etcd TestVoteFromAnyState

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
5a75c0e06a raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
Log truncation of follower when node re-gains leadership.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f14c44c686 raft: replication test: add etcd test for cycling leaders
This test cycles 3 nodes as leaders without adding entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.

This prevents occasional hangs in debug mode for incoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
948ae813e4 raft: etcd unit tests: initial boost tests
First batch of ported etcd raft unit tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:12 -04:00
Gleb Natapov
6d47a535b9 raft: combine append_request _receive and _send
Combine structs for append request send and receive into a single
struct.

Author:    Gleb Natapov <gleb@scylladb.com>
Date:      Mon Nov 23 14:33:14 2020 +0200
2021-01-18 12:24:13 -04:00
Konstantin Osipov
bf1a031bd6 test: add a CQL test for list append/prepend operations
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.

This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
2021-01-18 17:32:00 +03:00
Jenkins
faf71c6f75 release: prepare for 4.5.dev 2021-01-18 16:05:25 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / (40 MB/s per shard * 32 shards) / 60 s per min = 13 mins
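
The estimate can be checked with a few lines of Python. The helper name is
made up for the example, and 1TB is taken as 1,000,000 MB, which is an
assumption about the units used in the message.

```python
def load_and_stream_minutes(data_mb, mb_per_sec_per_shard, shards):
    """Minutes needed if every shard streams at the given rate in parallel."""
    node_rate_mb_per_sec = mb_per_sec_per_shard * shards
    return data_mb / node_rate_mb_per_sec / 60


# 1 TB of sstables, 40 MB/s per shard, 32 shards per node.
minutes = load_and_stream_minutes(1_000_000, 40, 32)
```

This gives roughly 13 minutes, in line with the figure quoted above.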

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Asias He
4d32d03172 storage_service: Introduce load_and_stream
=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / (40 MB/s per shard * 32 shards) / 60 s per min = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
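
For scripting, the same request URL can be assembled in Python. The helper
below is a hypothetical convenience that only builds the URL from the
template in the curl command and does not send anything; the address and
names in the example call are placeholders.

```python
from urllib.parse import quote, urlencode


def load_and_stream_url(ip, keyspace, table, port=10000):
    """Build the POST URL that triggers load and stream for one table."""
    query = urlencode({"cf": table, "load_and_stream": "true"})
    path = f"/storage_service/sstables/{quote(keyspace, safe='')}"
    return f"http://{ip}:{port}{path}?{query}"


url = load_and_stream_url("127.0.0.1", "ks", "tbl")
```

The resulting string can be passed to `curl -X POST` or to any HTTP client.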

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831
2021-01-18 16:32:33 +08:00
Asias He
28007f13f8 distributed_loader: Add get_sstables_from_upload_dir
This function scans sstables under the upload directory and returns a list of
sstables for each shard.

Refs #7831
2021-01-16 20:03:17 +08:00
Benny Halevy
29002e3b48 flat_mutation_reader: return future from next_partition
To allow it to asynchronously close underlying readers
on next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
ff931c2ecc multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
The reader_meta in _readers[shard] is created on shard 0 and must
be destroyed on it as well.

A following patch changes next_partition() to return a future<>
thus it introduces a continuation that requires access to `rm`.

We cannot move it down to the continuation safely, since it will be
wrongly destroyed in the invoked shard, so use do_with to hold it
in the scope of the calling shard until the invoked function
completes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
75c0c05f71 mutation_reader: filtering_reader: fill_buffer: futurize inner loop
Prepare for futurizing next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
cd4d082e51 flat_mutation_reader::impl: consumer_adapter: futurize handle_result
Prepare for futurizing next_partition.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
d8ae6d7591 flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
To support both sync and async consumers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
fdb3c59e35 flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
So that consumer_adapter and other consumers in the future
may return a future from consumer().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
515bed90bb flat_mutation_reader:impl: get rid of _consume_done member
It is only used in consume_pausable, that can easily do
without it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Pavel Emelyanov
d3ee8774ad storage-service: Subscribe to snitch to update topology
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.

Instead, make storage service subscribe to the
snitch reconfigure signal upon creation.

This finally makes snitch fully independent from storage
service.

In tests the snitch instance is not created, so check
for it before subscribing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
d1a2d0f894 snitch: Introduce reconfiguration signal
Add a notifier to snitch_base, to which others may subscribe, that
gets triggered when the snitch configuration changes.

For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.

The subscribe-trigger engine is based on scoped connection
from boost::signals2.
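
The subscribe-trigger idea with a scoped connection can be sketched in
Python. This mimics the concept behind boost::signals2 scoped connections
and is not the Scylla implementation; the `Signal` and `ScopedConnection`
classes are written for this example.

```python
class ScopedConnection:
    """Disconnects its callback when the scope (a with-block) ends."""

    def __init__(self, signal, slot_id):
        self._signal = signal
        self._slot_id = slot_id

    def disconnect(self):
        self._signal._disconnect(self._slot_id)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.disconnect()


class Signal:
    def __init__(self):
        self._slots = {}
        self._next_id = 0

    def connect(self, callback):
        slot_id = self._next_id
        self._next_id += 1
        self._slots[slot_id] = callback
        return ScopedConnection(self, slot_id)

    def _disconnect(self, slot_id):
        self._slots.pop(slot_id, None)

    def __call__(self):
        # Copy so callbacks may disconnect while the signal is firing.
        for callback in list(self._slots.values()):
            callback()


reconfigured = Signal()
updates = []
with reconfigured.connect(lambda: updates.append("update topology")):
    reconfigured()   # subscriber is notified
reconfigured()       # subscriber is gone, nothing happens
```

Because the connection is dropped when the subscriber's scope ends, the
notifier never calls into an object that no longer exists.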

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
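
The subscribe-trigger idea from the commit above can be sketched as follows. This is a minimal illustration using only the standard library, not the boost::signals2 machinery the commit mentions, and not the actual snitch_base interface. The names `reconfigure_notifier`, `subscription`, `subscribe` and `trigger` are hypothetical.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for a reconfiguration notifier (assumption: not the real snitch_base API).
class reconfigure_notifier {
    using slot_map = std::map<int, std::function<void()>>;
    std::shared_ptr<slot_map> _slots = std::make_shared<slot_map>();
    int _next_id = 0;
public:
    // RAII handle that unsubscribes on destruction, similar in spirit to a scoped connection.
    class subscription {
        std::weak_ptr<slot_map> _slots;
        int _id = -1;
    public:
        subscription() = default;
        subscription(std::weak_ptr<slot_map> slots, int id)
            : _slots(std::move(slots)), _id(id) {}
        subscription(subscription&& o) noexcept
            : _slots(std::move(o._slots)), _id(std::exchange(o._id, -1)) {}
        subscription& operator=(subscription&& o) noexcept {
            if (this != &o) {
                reset();
                _slots = std::move(o._slots);
                _id = std::exchange(o._id, -1);
            }
            return *this;
        }
        subscription(const subscription&) = delete;
        subscription& operator=(const subscription&) = delete;
        ~subscription() { reset(); }

        // Remove the callback if the notifier is still alive.
        void reset() {
            if (auto s = _slots.lock(); s && _id >= 0) {
                s->erase(_id);
            }
            _id = -1;
        }
    };

    // Register a callback; it stays registered while the returned handle lives.
    subscription subscribe(std::function<void()> fn) {
        int id = _next_id++;
        _slots->emplace(id, std::move(fn));
        return subscription(_slots, id);
    }

    // Invoke all current subscribers, e.g. after a config file was re-read.
    void trigger() {
        auto snapshot = *_slots; // copy, so a callback may unsubscribe while running
        for (auto& entry : snapshot) {
            entry.second();
        }
    }
};
```

In this sketch a subscriber holds the returned `subscription` as a member, so the callback is dropped automatically when the subscriber is destroyed and the notifier does not need to know who its subscribers are.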
Pavel Emelyanov
ca336409d7 snitch: Always gossip snitch info itself
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.

That said, gossip the snitch info via the snitch helper and remove
the storage_service one. This is the 2nd step in decoupling snitch
and storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
99e71bd1f6 snitch: Do gossip DC and RACK itself
This is the 2nd step in generalizing the snitch data gossiping
and at the same time the 1st step in decoupling storage service
and snitch.

During start, storage service starts gossiper, which notifies the
snitch with a .gossiper_starting() call, then the storage service
calls gossip_snitch_info.

This patch makes snitch itself do the last step.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
bc1a3a358d snitch: Add generic gossiping helper
Nowadays some snitch implementations gossip the INTERNAL_IP
value and storage_service gossips RACK and DC for all of them.

This functionality is going to be generalized and the first
step is making a common method for a snitch to gossip its
data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Calle Wilund
4be718ebfa commitlog: Force earlier cycle/flush iff segment reserve is empty
Attempt to hurry flushing/segment delete/recycle if we are trying
to get a segment for allocation, and reserve is empty when above
disk threshold. This is to minimize time waited in allocation semaphore.
2021-01-11 12:45:36 +00:00
Calle Wilund
be8c359a62 commitlog: Make segment allocation wait iff disk usage > max
Instead of allowing new segments to be added, explicitly wait
for either disk delete or recycle to happen iff current disk
usage is larger than limit.
2021-01-11 12:45:36 +00:00
Calle Wilund
ab55a1b4e6 commitlog: Do partial (memtable) flushing based on threshold
Instead of asking to flush data for all segments, just request
up to an RP where we get comfortably below disk usage threshold.
2021-01-11 12:45:10 +00:00
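
A rough sketch of the partial-flush idea in the commit above: walk segments oldest first and pick the replay position (RP) at which the remaining disk usage drops to a target. The `segment_info` layout, the function name `pick_flush_rp` and the use of plain integers for RPs are assumptions made for illustration, not the commitlog's actual data structures.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-segment summary (assumption: real segments carry much more state).
struct segment_info {
    std::uint64_t last_rp;     // highest replay position stored in the segment
    std::uint64_t size_bytes;  // disk space the segment occupies
};

// Returns the RP to flush up to so that usage falls to target_bytes or below,
// or std::nullopt if usage is already within the target.
std::optional<std::uint64_t> pick_flush_rp(const std::vector<segment_info>& oldest_first,
                                           std::uint64_t target_bytes) {
    std::uint64_t usage = 0;
    for (const auto& s : oldest_first) {
        usage += s.size_bytes;
    }
    std::optional<std::uint64_t> rp;
    for (const auto& s : oldest_first) {
        if (usage <= target_bytes) {
            break; // enough segments would be released already
        }
        usage -= s.size_bytes; // this segment becomes reclaimable once flushed
        rp = s.last_rp;
    }
    return rp;
}
```

The point of the sketch is that only the oldest segments need to be covered by the flush request, rather than all of them.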
Calle Wilund
7c84b16cd8 commitlog: Make flush threshold configurable 2021-01-05 18:16:09 +00:00
Calle Wilund
c3d95811da table: Add a flush RP mark to table, and shortcut if not above
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.

This is to (in future) support incremental CL flush.
2021-01-05 18:16:09 +00:00
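
The flush-mark shortcut described above might look roughly like this. It is a simplified sketch: `table_sketch`, its members, and modelling a replay position as a single integer are all hypothetical. Treating a request equal to the mark as already covered is also an assumption, since the commit only says "below".

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical simplified replay position: one monotonically increasing number.
using replay_position = std::uint64_t;

class table_sketch {
    replay_position _highest_rp = 0; // highest RP written to the table so far
    replay_position _flush_rp = 0;   // mark: everything up to here was flushed
    unsigned _flushes = 0;
public:
    void write(replay_position rp) {
        _highest_rp = std::max(_highest_rp, rp);
    }

    // Returns true if a flush was performed, false if it was skipped
    // because the requested RP is already covered by the last flush.
    bool flush(replay_position requested) {
        if (requested <= _flush_rp) {
            return false;
        }
        ++_flushes; // stand-in for the real (expensive) memtable flush
        _flush_rp = std::max(_highest_rp, requested);
        return true;
    }

    unsigned flush_count() const { return _flushes; }
};
```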
Asias He
84f482bde4 table: Add make_streaming_reader for given sstables set
Add a streaming reader that streams from a given sstables set.

Refs #7831
2020-12-30 08:32:42 +08:00
1802 changed files with 143019 additions and 83758 deletions

28
.github/CODEOWNERS vendored

@@ -4,7 +4,7 @@ auth/* @elcallio @vladzcloudius
# CACHE
row_cache* @tgrabiec @haaawk
*mutation* @tgrabiec @haaawk
tests/mvcc* @tgrabiec @haaawk
test/boost/mvcc* @tgrabiec @haaawk
# CDC
cdc/* @haaawk @kbr- @elcallio @piodul @jul-stas
@@ -19,13 +19,13 @@ db/batch* @elcallio
service/storage_proxy* @gleb-cloudius
# COMPACTION
sstables/compaction* @raphaelsc @nyh
compaction/* @raphaelsc @nyh
# CQL TRANSPORT LAYER
transport/* @penberg
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @penberg @psarna
cql3/* @tgrabiec @psarna @cvybhu
# COUNTERS
counters* @haaawk @jul-stas
@@ -35,7 +35,7 @@ tests/counter_test* @haaawk @jul-stas
gms/* @tgrabiec @asias
# DOCKER
dist/docker/* @penberg
dist/docker/*
# LSA
utils/logalloc* @tgrabiec
@@ -58,9 +58,9 @@ service/migration* @tgrabiec @nyh
schema* @tgrabiec @nyh
# SECONDARY INDEXES
db/index/* @nyh @penberg @psarna
cql3/statements/*index* @nyh @penberg @psarna
test/boost/*index* @nyh @penberg @psarna
db/index/* @nyh @psarna
cql3/statements/*index* @nyh @psarna
test/boost/*index* @nyh @psarna
# SSTABLES
sstables/* @tgrabiec @raphaelsc @nyh
@@ -78,10 +78,20 @@ db/hints/* @haaawk @piodul @vladzcloudius
# REDIS
redis/* @nyh @syuu1228
redis-test/* @nyh @syuu1228
test/redis/* @nyh @syuu1228
# READERS
reader_* @denesb
querier* @denesb
test/boost/mutation_reader_test.cc @denesb
test/boost/querier_cache_test.cc @denesb
# PYTEST-BASED CQL TESTS
test/cql-pytest/* @nyh
# RAFT
raft/* @kbr- @gleb-cloudius @kostja
test/raft/* @kbr- @gleb-cloudius @kostja
# HEAT-WEIGHTED LOAD BALANCING
db/heat_load_balance.* @nyh @gleb-cloudius


@@ -1,17 +1,16 @@
name: "CI Docs"
name: "Docs / Publish"
on:
push:
branches:
- master
paths:
- 'docs/**'
- "docs/**"
workflow_dispatch:
jobs:
release:
name: Build
runs-on: ubuntu-latest
env:
LATEST_VERSION: master
steps:
- name: Checkout
uses: actions/checkout@v2
@@ -23,11 +22,8 @@ jobs:
with:
python-version: 3.7
- name: Build docs
run: |
export PATH=$PATH:~/.local/bin
cd docs
make multiversion
run: make -C docs multiversion
- name: Deploy
run : ./docs/_utils/deploy.sh
run: ./docs/_utils/deploy.sh
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

25
.github/workflows/docs-pr@v1.yaml vendored Normal file

@@ -0,0 +1,25 @@
name: "Docs / Build PR"
on:
pull_request:
branches:
- master
paths:
- "docs/**"
jobs:
build:
name: Build
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
- name: Build docs
run: make -C docs test

2
.gitignore vendored

@@ -27,3 +27,5 @@ test/*/*.reject
.vscode
docs/_build
docs/poetry.lock
compile_commands.json
.ccls-cache/

2
.gitmodules vendored

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui


@@ -32,8 +32,13 @@ if(target_arch)
set(target_arch_flag "-march=${target_arch}")
endif()
set(cxx_coro_flag)
if (CMAKE_CXX_COMPILER_ID MATCHES GNU)
set(cxx_coro_flag -fcoroutines)
endif()
# Configure Seastar compile options to align with Scylla
set(Seastar_CXX_FLAGS -fcoroutines ${target_arch_flag} CACHE INTERNAL "" FORCE)
set(Seastar_CXX_FLAGS ${cxx_coro_flag} ${target_arch_flag} CACHE INTERNAL "" FORCE)
set(Seastar_CXX_DIALECT gnu++20 CACHE INTERNAL "" FORCE)
add_subdirectory(seastar)
@@ -96,7 +101,7 @@ endfunction()
scylla_generate_thrift(
TARGET scylla_thrift_gen_cassandra
VAR scylla_thrift_gen_cassandra_files
IN_FILE interface/cassandra.thrift
IN_FILE "${CMAKE_SOURCE_DIR}/interface/cassandra.thrift"
OUT_DIR ${scylla_gen_build_dir}
SERVICE Cassandra)
@@ -153,7 +158,7 @@ foreach(f ${antlr3_grammar_files})
scylla_generate_antlr3(
TARGET scylla_antlr3_gen_${grammar_file_name}
VAR scylla_antlr3_gen_${grammar_file_name}_files
IN_FILE ${f}
IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
OUT_DIR ${scylla_gen_build_dir}/${f_dir})
list(APPEND antlr3_gen_files "${scylla_antlr3_gen_${grammar_file_name}_files}")
endforeach()
@@ -162,7 +167,7 @@ endforeach()
seastar_generate_ragel(
TARGET scylla_ragel_gen_protocol_parser
VAR scylla_ragel_gen_protocol_parser_file
IN_FILE redis/protocol_parser.rl
IN_FILE "${CMAKE_SOURCE_DIR}/redis/protocol_parser.rl"
OUT_FILE ${scylla_gen_build_dir}/redis/protocol_parser.hh)
# Generate C++ sources from Swagger definitions
@@ -194,7 +199,7 @@ foreach(f ${swagger_files})
seastar_generate_swagger(
TARGET scylla_swagger_gen_${fname}
VAR scylla_swagger_gen_${fname}_files
IN_FILE "${f}"
IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
OUT_DIR "${scylla_gen_build_dir}/${dir}")
list(APPEND swagger_gen_files "${scylla_swagger_gen_${fname}_files}")
endforeach()
@@ -229,6 +234,7 @@ set(idl_serializers
idl/frozen_mutation.idl.hh
idl/frozen_schema.idl.hh
idl/gossip_digest.idl.hh
idl/hinted_handoff.idl.hh
idl/idl_test.idl.hh
idl/keys.idl.hh
idl/messaging_service.idl.hh
@@ -237,6 +243,7 @@ set(idl_serializers
idl/partition_checksum.idl.hh
idl/paxos.idl.hh
idl/query.idl.hh
idl/raft.idl.hh
idl/range.idl.hh
idl/read_command.idl.hh
idl/reconcilable_result.idl.hh
@@ -260,7 +267,7 @@ foreach(f ${idl_serializers})
scylla_generate_idl_serializer(
TARGET scylla_idl_gen_${idl_target}
VAR scylla_idl_gen_${idl_target}_files
IN_FILE ${f}
IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
OUT_FILE ${scylla_gen_build_dir}/${idl_dir}/${idl_out_hdr_name})
list(APPEND idl_gen_files "${scylla_idl_gen_${idl_target}_files}")
endforeach()
@@ -268,8 +275,8 @@ endforeach()
set(scylla_sources
absl-flat_hash_map.cc
alternator/auth.cc
alternator/base64.cc
alternator/conditions.cc
alternator/controller.cc
alternator/executor.cc
alternator/expressions.cc
alternator/serialization.cc
@@ -314,6 +321,7 @@ set(scylla_sources
auth/standard_role_manager.cc
auth/transitional.cc
bytes.cc
caching_options.cc
canonical_mutation.cc
cdc/cdc_partitioner.cc
cdc/generation.cc
@@ -322,6 +330,12 @@ set(scylla_sources
cdc/split.cc
clocks-impl.cc
collection_mutation.cc
compaction/compaction.cc
compaction/compaction_manager.cc
compaction/compaction_strategy.cc
compaction/leveled_compaction_strategy.cc
compaction/size_tiered_compaction_strategy.cc
compaction/time_window_compaction_strategy.cc
compress.cc
connection_notifier.cc
converting_mutation_partition_applier.cc
@@ -335,6 +349,7 @@ set(scylla_sources
cql3/constants.cc
cql3/cql3_type.cc
cql3/expr/expression.cc
cql3/expr/prepare_expr.cc
cql3/functions/aggregate_fcts.cc
cql3/functions/castas_fcts.cc
cql3/functions/error_injection_fcts.cc
@@ -345,6 +360,7 @@ set(scylla_sources
cql3/lists.cc
cql3/maps.cc
cql3/operation.cc
cql3/prepare_context.cc
cql3/query_options.cc
cql3/query_processor.cc
cql3/relation.cc
@@ -360,25 +376,32 @@ set(scylla_sources
cql3/sets.cc
cql3/single_column_relation.cc
cql3/statements/alter_keyspace_statement.cc
cql3/statements/alter_service_level_statement.cc
cql3/statements/alter_table_statement.cc
cql3/statements/alter_type_statement.cc
cql3/statements/alter_view_statement.cc
cql3/statements/attach_service_level_statement.cc
cql3/statements/authentication_statement.cc
cql3/statements/authorization_statement.cc
cql3/statements/batch_statement.cc
cql3/statements/cas_request.cc
cql3/statements/cf_prop_defs.cc
cql3/statements/cf_statement.cc
cql3/statements/create_aggregate_statement.cc
cql3/statements/create_function_statement.cc
cql3/statements/create_index_statement.cc
cql3/statements/create_keyspace_statement.cc
cql3/statements/create_service_level_statement.cc
cql3/statements/create_table_statement.cc
cql3/statements/create_type_statement.cc
cql3/statements/create_view_statement.cc
cql3/statements/delete_statement.cc
cql3/statements/detach_service_level_statement.cc
cql3/statements/drop_aggregate_statement.cc
cql3/statements/drop_function_statement.cc
cql3/statements/drop_index_statement.cc
cql3/statements/drop_keyspace_statement.cc
cql3/statements/drop_service_level_statement.cc
cql3/statements/drop_table_statement.cc
cql3/statements/drop_type_statement.cc
cql3/statements/drop_view_statement.cc
@@ -388,6 +411,8 @@ set(scylla_sources
cql3/statements/index_target.cc
cql3/statements/ks_prop_defs.cc
cql3/statements/list_permissions_statement.cc
cql3/statements/list_service_level_attachments_statement.cc
cql3/statements/list_service_level_statement.cc
cql3/statements/list_users_statement.cc
cql3/statements/modification_statement.cc
cql3/statements/permission_altering_statement.cc
@@ -397,21 +422,20 @@ set(scylla_sources
cql3/statements/role-management-statements.cc
cql3/statements/schema_altering_statement.cc
cql3/statements/select_statement.cc
cql3/statements/service_level_statement.cc
cql3/statements/sl_prop_defs.cc
cql3/statements/truncate_statement.cc
cql3/statements/update_statement.cc
cql3/statements/use_statement.cc
cql3/token_relation.cc
cql3/tuples.cc
cql3/type_json.cc
cql3/untyped_result_set.cc
cql3/update_parameters.cc
cql3/user_types.cc
cql3/ut_name.cc
cql3/util.cc
cql3/ut_name.cc
cql3/values.cc
cql3/variable_specifications.cc
data/cell.cc
database.cc
data_dictionary/data_dictionary.cc
db/batchlog_manager.cc
db/commitlog/commitlog.cc
db/commitlog/commitlog_entry.cc
@@ -422,8 +446,10 @@ set(scylla_sources
db/data_listeners.cc
db/extensions.cc
db/heat_load_balance.cc
db/hints/host_filter.cc
db/hints/manager.cc
db/hints/resource_manager.cc
db/hints/sync_point.cc
db/large_data_handler.cc
db/legacy_schema_migrator.cc
db/marshal/type_parser.cc
@@ -436,6 +462,7 @@ set(scylla_sources
db/view/row_locking.cc
db/view/view.cc
db/view/view_update_generator.cc
db/virtual_table.cc
dht/boot_strapper.cc
dht/i_partitioner.cc
dht/murmur3_partitioner.cc
@@ -447,17 +474,18 @@ set(scylla_sources
flat_mutation_reader.cc
frozen_mutation.cc
frozen_schema.cc
generic_server.cc
gms/application_state.cc
gms/endpoint_state.cc
gms/failure_detector.cc
gms/feature_service.cc
gms/gossip_digest_ack.cc
gms/gossip_digest_ack2.cc
gms/gossip_digest_ack.cc
gms/gossip_digest_syn.cc
gms/gossiper.cc
gms/inet_address.cc
gms/version_generator.cc
gms/versioned_value.cc
gms/version_generator.cc
hashers.cc
index/secondary_index.cc
index/secondary_index_manager.cc
@@ -465,6 +493,7 @@ set(scylla_sources
keys.cc
lister.cc
locator/abstract_replication_strategy.cc
locator/azure_snitch.cc
locator/ec2_multi_region_snitch.cc
locator/ec2_snitch.cc
locator/everywhere_replication_strategy.cc
@@ -478,33 +507,37 @@ set(scylla_sources
locator/simple_strategy.cc
locator/snitch_base.cc
locator/token_metadata.cc
lua.cc
lang/lua.cc
main.cc
memtable.cc
message/messaging_service.cc
multishard_mutation_query.cc
mutation.cc
raft/fsm.cc
raft/log.cc
raft/progress.cc
raft/raft.cc
raft/server.cc
mutation_fragment.cc
mutation_partition.cc
mutation_partition_serializer.cc
mutation_partition_view.cc
mutation_query.cc
mutation_reader.cc
mutation_writer/feed_writers.cc
mutation_writer/multishard_writer.cc
mutation_writer/partition_based_splitting_writer.cc
mutation_writer/shard_based_splitting_writer.cc
mutation_writer/timestamp_based_splitting_writer.cc
partition_slice_builder.cc
partition_version.cc
querier.cc
query-result-set.cc
query.cc
query-result-set.cc
raft/fsm.cc
raft/log.cc
raft/raft.cc
raft/server.cc
raft/tracker.cc
range_tombstone.cc
range_tombstone_list.cc
tombstone_gc_options.cc
tombstone_gc.cc
reader_concurrency_semaphore.cc
redis/abstract_command.cc
redis/command_factory.cc
@@ -518,15 +551,18 @@ set(scylla_sources
redis/server.cc
redis/service.cc
redis/stats.cc
release.cc
repair/repair.cc
repair/row_level.cc
replica/database.cc
replica/table.cc
row_cache.cc
schema.cc
schema_mutations.cc
schema_registry.cc
serializer.cc
service/client_state.cc
service/migration_manager.cc
service/migration_task.cc
service/misc_services.cc
service/pager/paging_state.cc
service/pager/query_pagers.cc
@@ -535,29 +571,33 @@ set(scylla_sources
service/paxos/prepare_summary.cc
service/paxos/proposal.cc
service/priority_manager.cc
service/qos/qos_common.cc
service/qos/service_level_controller.cc
service/qos/standard_service_level_distributed_data_accessor.cc
service/raft/raft_gossip_failure_detector.cc
service/raft/raft_group_registry.cc
service/raft/raft_rpc.cc
service/raft/raft_sys_table_storage.cc
service/raft/group0_state_machine.cc
service/storage_proxy.cc
service/storage_service.cc
sstables/compaction.cc
sstables/compaction_manager.cc
sstables/compaction_strategy.cc
sstables/compress.cc
sstables/integrity_checked_file_impl.cc
sstables/kl/writer.cc
sstables/leveled_compaction_strategy.cc
sstables/m_format_read_helpers.cc
sstables/kl/reader.cc
sstables/metadata_collector.cc
sstables/mp_row_consumer.cc
sstables/m_format_read_helpers.cc
sstables/mx/reader.cc
sstables/mx/writer.cc
sstables/partition.cc
sstables/prepended_input_stream.cc
sstables/random_access_reader.cc
sstables/size_tiered_compaction_strategy.cc
sstables/sstable_directory.cc
sstables/sstable_version.cc
sstables/sstable_mutation_reader.cc
sstables/sstables.cc
sstables/sstable_set.cc
sstables/sstables_manager.cc
sstables/time_window_compaction_strategy.cc
sstables/sstable_version.cc
sstables/writer.cc
streaming/consumer.cc
streaming/progress_info.cc
streaming/session_info.cc
streaming/stream_coordinator.cc
@@ -572,18 +612,19 @@ set(scylla_sources
streaming/stream_summary.cc
streaming/stream_task.cc
streaming/stream_transfer_task.cc
table.cc
table_helper.cc
thrift/controller.cc
thrift/handler.cc
thrift/server.cc
thrift/thrift_validation.cc
timeout_config.cc
tools/scylla-sstable-index.cc
tools/scylla-types.cc
tracing/traced_file.cc
tracing/trace_keyspace_helper.cc
tracing/trace_state.cc
tracing/traced_file.cc
tracing/tracing.cc
tracing/tracing_backend_registry.cc
tracing/tracing.cc
transport/controller.cc
transport/cql_protocol_extension.cc
transport/event.cc
@@ -592,10 +633,10 @@ set(scylla_sources
transport/server.cc
types.cc
unimplemented.cc
utils/UUID_gen.cc
utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc
utils/array-search.cc
utils/ascii.cc
utils/base64.cc
utils/big_decimal.cc
utils/bloom_calculations.cc
utils/bloom_filter.cc
@@ -610,6 +651,7 @@ set(scylla_sources
utils/file_lock.cc
utils/generation-number.cc
utils/gz/crc_combine.cc
utils/gz/gen_crc_combine_table.cc
utils/human_readable.cc
utils/i_filter.cc
utils/large_bitset.cc
@@ -625,10 +667,10 @@ set(scylla_sources
utils/updateable_value.cc
utils/utf8.cc
utils/uuid.cc
utils/UUID_gen.cc
validation.cc
vint-serialization.cc
zstd.cc
release.cc)
zstd.cc)
set(scylla_gen_sources
"${scylla_thrift_gen_cassandra_files}"
@@ -689,7 +731,7 @@ target_link_libraries(scylla PRIVATE
target_compile_options(scylla PRIVATE
-std=gnu++20
-fcoroutines # TODO: Clang does not have this flag, adjust to both variants
${cxx_coro_flag}
${target_arch_flag})
# Hacks needed to expose internal APIs for xxhash dependencies
target_compile_definitions(scylla PRIVATE XXH_PRIVATE_API HAVE_LZ4_COMPRESS_DEFAULT)
@@ -709,7 +751,7 @@ target_link_libraries(crc_combine_table PRIVATE seastar)
target_include_directories(crc_combine_table PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}")
target_compile_options(crc_combine_table PRIVATE
-std=gnu++20
-fcoroutines
${cxx_coro_flag}
${target_arch_flag})
add_dependencies(scylla crc_combine_table)
@@ -722,15 +764,15 @@ target_sources(scylla PRIVATE "${scylla_gen_build_dir}/utils/gz/crc_combine_tabl
###
### Generate version file and supply appropriate compile definitions for release.cc
###
execute_process(COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN RESULT_VARIABLE scylla_version_gen_res)
execute_process(COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN --output-dir "${CMAKE_BINARY_DIR}/gen" RESULT_VARIABLE scylla_version_gen_res)
if(scylla_version_gen_res)
message(SEND_ERROR "Version file generation failed. Return code: ${scylla_version_gen_res}")
endif()
file(READ build/SCYLLA-VERSION-FILE scylla_version)
file(READ "${CMAKE_BINARY_DIR}/gen/SCYLLA-VERSION-FILE" scylla_version)
string(STRIP "${scylla_version}" scylla_version)
file(READ build/SCYLLA-RELEASE-FILE scylla_release)
file(READ "${CMAKE_BINARY_DIR}/gen/SCYLLA-RELEASE-FILE" scylla_release)
string(STRIP "${scylla_release}" scylla_release)
get_property(release_cdefs SOURCE "${CMAKE_SOURCE_DIR}/release.cc" PROPERTY COMPILE_DEFINITIONS)
@@ -742,7 +784,7 @@ set_source_files_properties("${CMAKE_SOURCE_DIR}/release.cc" PROPERTIES COMPILE_
###
set(libdeflate_lib "${scylla_build_dir}/libdeflate/libdeflate.a")
add_custom_command(OUTPUT "${libdeflate_lib}"
COMMAND make -C libdeflate
COMMAND make -C "${CMAKE_SOURCE_DIR}/libdeflate"
BUILD_DIR=../build/${BUILD_TYPE}/libdeflate/
CC=${CMAKE_C_COMPILER}
"CFLAGS=${target_arch_flag}"


@@ -1,13 +1,20 @@
# Contributing
# Contributing to Scylla
## Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
## Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing Code to Scylla
## Contributing code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on GitHub](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md), so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so you should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.


@@ -172,12 +172,8 @@ and you will get output like this:
```
CQL QUERY LANGUAGE
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Pekka Enberg <penberg@scylladb.com> [maintainer]
Duarte Nunes <duarte@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
```
### Running Scylla
@@ -366,7 +362,27 @@ $ git remote update
$ git checkout -t local/my_local_seastar_branch
```
### Generating code coverage report
Install dependencies:
$ dnf install llvm # for llvm-profdata and llvm-cov
$ dnf install lcov # for genhtml
Instruct `configure.py` to generate build files for `coverage` mode:
$ ./configure.py --mode=coverage
Build the tests you want to run, then run them via `test.py` (important!):
$ ./test.py --mode=coverage [...]
Alternatively, you can run individual tests via `./scripts/coverage.py --run`.
Open the link printed at the end. Be horrified. Go and write more tests.
For more details see `./scripts/coverage.py --help`.
### Core dump debugging
Slides:
2018.11.20: https://www.slideshare.net/tomekgrabiec/scylla-core-dump-debugging-tools
See [debugging.md](debugging.md).


@@ -5,3 +5,7 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)
It includes files from https://github.com/bytecodealliance/wasmtime-cpp (owned by Bytecode Alliance), licensed with Apache License 2.0.


@@ -42,8 +42,8 @@ For further information, please see:
* [Docker image build documentation] for information on how to build Docker images.
[developer documentation]: HACKING.md
[build documentation]: docs/building.md
[docker image build documentation]: dist/docker/redhat/README.md
[build documentation]: docs/guides/building.md
[docker image build documentation]: dist/docker/debian/README.md
## Running Scylla
@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help
## Testing
See [test.py manual](docs/testing.md).
See [test.py manual](docs/guides/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and


@@ -1,15 +1,74 @@
#!/bin/sh
USAGE=$(cat <<-END
Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] -- generate Scylla version and build information files.
Options:
-h|--help show this help message.
-o|--output-dir PATH specify destination path at which the version files are to be created.
By default, the script will attempt to parse 'version' file
in the current directory, which should contain a string of
'\$version-\$release' form.
Otherwise, it will call 'git log' on the source tree (the
directory, which contains the script) to obtain current
commit hash and use it for building the version and release
strings.
The script assumes that it's called from the Scylla source
tree.
The files created are:
SCYLLA-VERSION-FILE
SCYLLA-RELEASE-FILE
SCYLLA-PRODUCT-FILE
By default, these files are created in the 'build'
subdirectory under the directory containing the script.
The destination directory can be overridden by
using '-o PATH' option.
END
)
while [ $# -gt 0 ]; do
opt="$1"
case $opt in
-h|--help)
echo "$USAGE"
exit 0
;;
-o|--output-dir)
OUTPUT_DIR="$2"
shift
shift
;;
*)
echo "Unexpected argument found: $1"
echo
echo "$USAGE"
exit 1
;;
esac
done
SCRIPT_DIR="$(dirname "$0")"
if [ -z "$OUTPUT_DIR" ]; then
OUTPUT_DIR="$SCRIPT_DIR/build"
fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=4.4.dev
VERSION=5.0.13
if test -f version
then
SCYLLA_VERSION=$(cat version | awk -F'-' '{print $1}')
SCYLLA_RELEASE=$(cat version | awk -F'-' '{print $2}')
else
DATE=$(date +%Y%m%d)
GIT_COMMIT=$(git log --pretty=format:'%h' -n 1)
DATE=$(date --utc +%Y%m%d)
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1)
SCYLLA_VERSION=$VERSION
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
@@ -19,16 +78,15 @@ else
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
fi
if [ -f build/SCYLLA-RELEASE-FILE ]; then
RELEASE_FILE=$(cat build/SCYLLA-RELEASE-FILE)
GIT_COMMIT_FILE=$(cat build/SCYLLA-RELEASE-FILE |cut -d . -f 3)
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi
fi
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"
mkdir -p build
echo "$SCYLLA_VERSION" > build/SCYLLA-VERSION-FILE
echo "$SCYLLA_RELEASE" > build/SCYLLA-RELEASE-FILE
echo "$PRODUCT" > build/SCYLLA-PRODUCT-FILE
mkdir -p "$OUTPUT_DIR"
echo "$SCYLLA_VERSION" > "$OUTPUT_DIR/SCYLLA-VERSION-FILE"
echo "$SCYLLA_RELEASE" > "$OUTPUT_DIR/SCYLLA-RELEASE-FILE"
echo "$PRODUCT" > "$OUTPUT_DIR/SCYLLA-PRODUCT-FILE"

2
abseil

Submodule abseil updated: 1e3d25b265...f70eadadd7


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2020 ScyllaDB
* Copyright (C) 2020-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "absl-flat_hash_map.hh"


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2020 ScyllaDB
* Copyright (C) 2020-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "alternator/error.hh"
@@ -24,7 +11,6 @@
#include <string>
#include <string_view>
#include <gnutls/crypto.h>
#include <seastar/util/defer.hh>
#include "hashers.hh"
#include "bytes.hh"
#include "alternator/auth.hh"
@@ -32,8 +18,12 @@
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "service/storage_proxy.hh"
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
#include "query-result-set.hh"
#include "cql3/result_set.hh"
#include <seastar/core/coroutine.hh>
namespace alternator {
@@ -62,6 +52,14 @@ static std::string apply_sha256(std::string_view msg) {
return to_hex(hasher.finalize());
}
static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {
sha256_hasher hasher;
for (const temporary_buffer<char>& buf : msg) {
hasher.update(buf.get(), buf.size());
}
return to_hex(hasher.finalize());
}
static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
@@ -91,7 +89,7 @@ void check_expiry(std::string_view signature_date) {
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {
auto amz_date_it = signed_headers_map.find("x-amz-date");
if (amz_date_it == signed_headers_map.end()) {
throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");
@@ -124,24 +122,36 @@ std::string get_signature(std::string_view access_key_id, std::string_view secre
return to_hex(bytes_view(reinterpret_cast<const int8_t*>(signature.data()), signature.size()));
}
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username) {
static const sstring query = format("SELECT salted_hash FROM {} WHERE {} = ?",
auth::meta::roles_table::qualified_name, auth::meta::roles_table::role_col_name);
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema("system_auth", "roles");
partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));
if (!salted_hash_col) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));
}
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));
auto cl = auth::password_authenticator::consistency_for_user(username);
auto& timeout = auth::internal_distributed_timeout_config();
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();
if (res->empty()) {
throw api_error::unrecognized_client(fmt::format("User not found: {}", username));
}
salted_hash = res->one().get_opt<sstring>("salted_hash");
if (!salted_hash) {
throw api_error::unrecognized_client(fmt::format("No password found for user: {}", username));
}
return make_ready_future<std::string>(*salted_hash);
});
service::client_state client_state{service::client_state::internal_tag()};
service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), empty_service_permit(), client_state));
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
if (result_set->empty()) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("User not found: {}", username)));
}
const bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
if (!salted_hash) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
}
co_return value_cast<sstring>(utf8_type->deserialize(*salted_hash));
}
}
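
The new `apply_sha256` overload above feeds each buffer of a chunked request body into the hasher in turn, so the body does not have to be copied into one contiguous string before signing. A minimal, self-contained sketch of that idea follows. It is not the Scylla code: the toy `fnv1a_hasher` (hypothetical) stands in for `sha256_hasher`, `std::vector<std::string>` stands in for `std::vector<temporary_buffer<char>>`, and `apply_hash` is an illustrative name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Toy streaming hasher (64-bit FNV-1a). Hypothetical stand-in for sha256_hasher:
// only the update()/finalize() shape matters here, not cryptographic strength.
class fnv1a_hasher {
    std::uint64_t _state = 14695981039346656037ull; // FNV offset basis
public:
    void update(const char* data, std::size_t size) {
        for (std::size_t i = 0; i < size; ++i) {
            _state ^= static_cast<unsigned char>(data[i]);
            _state *= 1099511628211ull; // FNV prime
        }
    }
    std::uint64_t finalize() const { return _state; }
};

// Hash a contiguous message.
std::uint64_t apply_hash(std::string_view msg) {
    fnv1a_hasher hasher;
    hasher.update(msg.data(), msg.size());
    return hasher.finalize();
}

// Hash a message that arrives split across several buffers,
// without concatenating the buffers first.
std::uint64_t apply_hash(const std::vector<std::string>& chunks) {
    fnv1a_hasher hasher;
    for (const std::string& buf : chunks) {
        hasher.update(buf.data(), buf.size());
    }
    return hasher.finalize();
}
```

Because the hasher only carries running state between `update()` calls, the chunked overload yields the same digest as hashing the concatenated body, wherever the chunk boundaries fall.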


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -27,20 +14,20 @@
#include "gc_clock.hh"
#include "utils/loading_cache.hh"
namespace cql3 {
class query_processor;
namespace service {
class storage_proxy;
}
namespace alternator {
using hmac_sha256_digest = std::array<char, 32>;
using key_cache = utils::loading_cache<std::string, std::string>;
using key_cache = utils::loading_cache<std::string, std::string, 1>;
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);
}


@@ -1,38 +0,0 @@
/*
* Copyright 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <string_view>
#include "bytes.hh"
#include "utils/rjson.hh"
std::string base64_encode(bytes_view);
bytes base64_decode(std::string_view);
inline bytes base64_decode(const rjson::value& v) {
return base64_decode(std::string_view(v.GetString(), v.GetStringLength()));
}
size_t base64_decoded_len(std::string_view str);
bool base64_begins_with(std::string_view base, std::string_view operand);


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <list>
@@ -28,7 +15,8 @@
#include <unordered_map>
#include "utils/rjson.hh"
#include "serialization.hh"
#include "base64.hh"
#include "utils/base64.hh"
#include "utils/rjson.hh"
#include <stdexcept>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/algorithm/cxx11/any_of.hpp>
@@ -123,7 +111,7 @@ struct rjson_engaged_ptr_comp {
// as internally they're stored in an array, and the order of elements is
// not important in set equality. See issue #5021
static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
if (set1.Size() != set2.Size()) {
if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {
return false;
}
std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
@@ -137,45 +125,107 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
}
return true;
}
// Moreover, the JSON being compared can be a nested document with outer
// layers of lists and maps and some inner set - and we need to get to that
// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).
static bool check_EQ(const rjson::value* v1, const rjson::value& v2);
static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {
return false;
}
auto it1 = list1.Begin();
auto it2 = list2.Begin();
while (it1 != list1.End()) {
// Note: Alternator limits an item's depth (rjson::parse() limits
// it to around 37 levels), so this recursion is safe.
if (!check_EQ(&*it1, *it2)) {
return false;
}
++it1;
++it2;
}
return true;
}
static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {
return false;
}
for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {
auto it2 = list2.FindMember(it1->name);
if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {
return false;
}
}
return true;
}
// Check if two JSON-encoded values match with the EQ relation
static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
if (!v1) {
return false;
}
if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
return check_EQ_for_sets(it1->value, it2->value);
if (it1->name != it2->name) {
return false;
}
if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {
return check_EQ_for_sets(it1->value, it2->value);
} else if(it1->name == "L") {
return check_EQ_for_lists(it1->value, it2->value);
} else if(it1->name == "M") {
return check_EQ_for_maps(it1->value, it2->value);
} else {
// Other, non-nested types (number, string, etc.) can be compared
// literally, comparing their JSON representation.
return it1->value == it2->value;
}
} else {
// If v1 and/or v2 are missing (IsNull()) the result should be false.
// In the unlikely case that the object is malformed (issue #8070),
// let's also return false.
return false;
}
return *v1 == v2;
}
// Check if two JSON-encoded values match with the NE relation
static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
return !v1 || *v1 != v2; // null is unequal to anything.
return !check_EQ(v1, v2);
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -200,7 +250,7 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2) {
if (kv1.name == "S" && kv2.name == "S") {
return rjson::to_string_view(kv1.value).find(rjson::to_string_view(kv2.value)) != std::string_view::npos;
} else if (kv1.name == "B" && kv2.name == "B") {
return base64_decode(kv1.value).find(base64_decode(kv2.value)) != bytes::npos;
return rjson::base64_decode(kv1.value).find(rjson::base64_decode(kv2.value)) != bytes::npos;
} else if (is_set_of(kv1.name, kv2.name)) {
for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {
if (*i == kv2.value) {
@@ -279,24 +329,40 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparison operators - lt, le, gt, ge, and between.
// Note that in particular, if the value is missing (v->IsNull()), this
// check returns false.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -308,9 +374,10 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));
}
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
return cmp(rjson::base64_decode(kv1.value), rjson::base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -341,56 +408,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error::validation("between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error::validation(
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(rjson::base64_decode(kv_v.value), rjson::base64_decode(kv_lb.value), rjson::base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -437,19 +519,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -461,7 +543,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -573,7 +656,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -604,13 +688,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));
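
The `check_EQ` changes in this file make equality recurse through lists and maps so that a set nested anywhere inside a document is still compared without regard to member order, and `check_NE` becomes the plain negation of `check_EQ`. The sketch below illustrates that recursion under simplifying assumptions: `attr_value` and the `make_*` helpers are hypothetical stand-ins for the `rjson::value` encoding, set members are plain strings, and a map is stored as parallel `keys`/`children` vectors of equal length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified AttributeValue. type is "S", "N", "B" (scalars),
// "SS", "NS", "BS" (sets), "L" (list) or "M" (map).
struct attr_value {
    std::string type;
    std::string scalar;                // payload of S, N, B
    std::vector<std::string> members;  // payload of SS, NS, BS
    std::vector<std::string> keys;     // map keys, parallel to children
    std::vector<attr_value> children;  // list items or map values
};

attr_value make_scalar(std::string type, std::string data) {
    attr_value v; v.type = std::move(type); v.scalar = std::move(data); return v;
}
attr_value make_set(std::string type, std::vector<std::string> members) {
    attr_value v; v.type = std::move(type); v.members = std::move(members); return v;
}
attr_value make_list(std::vector<attr_value> items) {
    attr_value v; v.type = "L"; v.children = std::move(items); return v;
}
attr_value make_map(std::vector<std::string> keys, std::vector<attr_value> values) {
    attr_value v; v.type = "M"; v.keys = std::move(keys); v.children = std::move(values); return v;
}

// A missing attribute (nullptr) or a type mismatch is never equal.
bool check_eq(const attr_value* v1, const attr_value& v2) {
    if (!v1 || v1->type != v2.type) {
        return false;
    }
    const std::string& t = v1->type;
    if (t == "SS" || t == "NS" || t == "BS") {
        // Sets: member order is irrelevant, so compare sorted copies.
        std::vector<std::string> a = v1->members;
        std::vector<std::string> b = v2.members;
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        return a == b;
    }
    if (t == "L") {
        // Lists: order matters, but each element is compared recursively.
        if (v1->children.size() != v2.children.size()) {
            return false;
        }
        for (std::size_t i = 0; i < v1->children.size(); ++i) {
            if (!check_eq(&v1->children[i], v2.children[i])) {
                return false;
            }
        }
        return true;
    }
    if (t == "M") {
        // Maps: same key count, and every key of v1 has an equal value in v2.
        if (v1->keys.size() != v2.keys.size()) {
            return false;
        }
        for (std::size_t i = 0; i < v1->keys.size(); ++i) {
            auto it = std::find(v2.keys.begin(), v2.keys.end(), v1->keys[i]);
            if (it == v2.keys.end()) {
                return false;
            }
            std::size_t j = static_cast<std::size_t>(it - v2.keys.begin());
            if (!check_eq(&v1->children[i], v2.children[j])) {
                return false;
            }
        }
        return true;
    }
    return v1->scalar == v2.scalar; // non-nested types compare literally
}

bool check_ne(const attr_value* v1, const attr_value& v2) {
    return !check_eq(v1, v2);
}
```

Defining NE as the negation of EQ keeps the two relations consistent for nested documents; comparing the raw representations directly would report two lists as unequal when they differ only in the member order of an inner set.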


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
/*
@@ -52,6 +39,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
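
The `v1_from_query` / `v2_from_query` flags added to `check_BEGINS_WITH` (and to `check_compare` and `check_BETWEEN`) separate two cases: an operand of the wrong type that was written in the request is a validation error, while the same problem in an attribute read from the stored item just means the condition does not match. A minimal sketch of that rule follows. `typed_value` and `validation_error` are hypothetical stand-ins for the `rjson::value` operand and `api_error::validation`, and binary payloads are compared as raw bytes rather than being base64-decoded.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical simplified operand: a type tag ("S", "B", "N", ...) and its payload.
struct typed_value {
    std::string type;
    std::string data;
};

// Stand-in for api_error::validation.
struct validation_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

static bool is_string_or_binary(const typed_value* v) {
    return v && (v->type == "S" || v->type == "B");
}

// Returns true if v1 begins with v2. A bad operand throws only when it came
// from the query itself; a bad operand taken from the stored item yields false.
bool check_begins_with(const typed_value* v1, const typed_value& v2,
                       bool v1_from_query, bool v2_from_query) {
    bool bad = false;
    if (!is_string_or_binary(v1)) {
        if (v1_from_query) {
            throw validation_error("begins_with() supports only string or binary type");
        }
        bad = true;
    }
    if (!is_string_or_binary(&v2)) {
        if (v2_from_query) {
            throw validation_error("begins_with() supports only string or binary type");
        }
        bad = true;
    }
    // Short-circuit: v1 is only dereferenced once it is known to be valid.
    if (bad || v1->type != v2.type) {
        return false;
    }
    return v1->data.compare(0, v2.data.size(), v2.data) == 0;
}
```

Both operands are checked before returning, so a malformed constant in the request is still reported even when the stored attribute is missing or has an unsuitable type.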

alternator/controller.cc Normal file

@@ -0,0 +1,153 @@
/*
* Copyright (C) 2021-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/net/dns.hh>
#include "controller.hh"
#include "server.hh"
#include "executor.hh"
#include "rmw_operation.hh"
#include "db/config.hh"
#include "cdc/generation_service.hh"
#include "service/memory_limiter.hh"
using namespace seastar;
namespace alternator {
static logging::logger logger("alternator_controller");
controller::controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
const db::config& config)
: _gossiper(gossiper)
, _proxy(proxy)
, _mm(mm)
, _sys_dist_ks(sys_dist_ks)
, _cdc_gen_svc(cdc_gen_svc)
, _memory_limiter(memory_limiter)
, _config(config)
{
}
sstring controller::name() const {
return "alternator";
}
sstring controller::protocol() const {
return "dynamodb";
}
sstring controller::protocol_version() const {
return version;
}
std::vector<socket_address> controller::listen_addresses() const {
return _listen_addresses;
}
future<> controller::start_server() {
return seastar::async([this] {
_listen_addresses.clear();
auto preferred = _config.listen_interface_prefer_ipv6() ? std::make_optional(net::inet_address::family::INET6) : std::nullopt;
auto family = _config.enable_ipv6_dns_lookup() || preferred ? std::nullopt : std::make_optional(net::inet_address::family::INET);
// Create an smp_service_group to be used for limiting the
// concurrency when forwarding Alternator request between
// shards - if necessary for LWT.
smp_service_group_config c;
c.max_nonlocal_requests = 5000;
_ssg = create_smp_service_group(c).get0();
rmw_operation::set_default_write_isolation(_config.alternator_write_isolation());
executor::set_default_timeout(std::chrono::milliseconds(_config.alternator_timeout_in_ms()));
net::inet_address addr = utils::resolve(_config.alternator_address, family).get0();
auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion
// failure when the controller object is destroyed in the exception
// unwinding.
std::optional<uint16_t> alternator_port;
if (_config.alternator_port()) {
alternator_port = _config.alternator_port();
_listen_addresses.push_back({addr, *alternator_port});
}
std::optional<uint16_t> alternator_https_port;
std::optional<tls::credentials_builder> creds;
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
creds.emplace();
auto opts = _config.alternator_encryption_options();
if (opts.empty()) {
// Earlier versions mistakenly configured Alternator's
// HTTPS parameters via the "server_encryption_option"
// configuration parameter. We *temporarily* continue
// to allow this, for backward compatibility.
opts = _config.server_encryption_options();
if (!opts.empty()) {
logger.warn("Setting server_encryption_options to configure "
"Alternator's HTTPS encryption is deprecated. Please "
"switch to setting alternator_encryption_options instead.");
}
}
opts.erase("require_client_auth");
opts.erase("truststore");
try {
utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();
} catch(...) {
logger.error("Failed to set up Alternator TLS credentials: {}", std::current_exception());
stop_server().get();
std::throw_with_nested(std::runtime_error("Failed to set up Alternator TLS credentials"));
}
}
bool alternator_enforce_authorization = _config.alternator_enforce_authorization();
_server.invoke_on_all(
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds), alternator_enforce_authorization] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds, alternator_enforce_authorization,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);
return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });
}).then([addr, alternator_port, alternator_https_port] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");
}).get();
});
}
future<> controller::stop_server() {
return seastar::async([this] {
if (!_ssg) {
return;
}
_server.stop().get();
_executor.stop().get();
_listen_addresses.clear();
destroy_smp_service_group(_ssg.value()).get();
});
}
future<> controller::request_stop_server() {
return stop_server();
}
}
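
The comment in `start_server()` states a rule that is easy to miss: once the executor and server services are started, every later failure path has to call `stop_server()` before propagating the error, and `stop_server()` itself is a no-op when nothing was started. The sketch below shows that shape in plain synchronous C++. `toy_controller` and its members are hypothetical; it uses no Seastar types, and unlike the code above it clears the optional in `stop()` so that a repeated stop is harmless.

```cpp
#include <exception>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, synchronous stand-in for alternator::controller.
class toy_controller {
    std::optional<int> _group;      // stands in for std::optional<smp_service_group>
    std::vector<std::string> _log;  // records lifecycle steps for inspection
public:
    const std::vector<std::string>& log() const { return _log; }
    bool running() const { return _group.has_value(); }

    // configure_tls stands in for any step that may fail after services are up.
    void start(const std::function<void()>& configure_tls) {
        _group = 1;
        _log.push_back("services started");
        try {
            configure_tls();
        } catch (...) {
            // Undo the partial start before the error leaves this function.
            stop();
            std::throw_with_nested(std::runtime_error("Failed to set up TLS credentials"));
        }
        _log.push_back("listening");
    }

    void stop() {
        if (!_group) {
            return; // never started, or already stopped
        }
        _log.push_back("services stopped");
        _group.reset();
    }
};
```

`std::throw_with_nested` keeps the original exception attached to the new one, so the caller sees the high-level message while the root cause remains available for logging.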

alternator/controller.hh Normal file

@@ -0,0 +1,82 @@
/*
* Copyright (C) 2021-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>
#include "protocol_server.hh"
namespace service {
class storage_proxy;
class migration_manager;
class memory_limiter;
}
namespace db {
class system_distributed_keyspace;
class config;
}
namespace cdc {
class generation_service;
}
namespace gms {
class gossiper;
}
namespace alternator {
// This is the official DynamoDB API version.
// It represents the last major reorganization of that API, and all the features
// that were added since did NOT increment this version string.
constexpr const char* version = "2012-08-10";
using namespace seastar;
class executor;
class server;
class controller : public protocol_server {
sharded<gms::gossiper>& _gossiper;
sharded<service::storage_proxy>& _proxy;
sharded<service::migration_manager>& _mm;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
sharded<service::memory_limiter>& _memory_limiter;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
sharded<executor> _executor;
sharded<server> _server;
std::optional<smp_service_group> _ssg;
public:
controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
const db::config& config);
virtual sstring name() const override;
virtual sstring protocol() const override;
virtual sstring protocol_version() const override;
virtual std::vector<socket_address> listen_addresses() const override;
virtual future<> start_server() override;
virtual future<> stop_server() override;
virtual future<> request_stop_server() override;
};
}


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -34,7 +21,7 @@ namespace alternator {
// "ResourceNotFoundException", and a human readable message.
// Eventually alternator::api_handler will convert a returned or thrown
// api_error into a JSON object, and that is returned to the user.
class api_error final {
class api_error final : public std::exception {
public:
using status_type = httpd::reply::status_type;
status_type _http_code;
@@ -80,9 +67,23 @@ public:
static api_error trimmed_data_access_exception(std::string msg) {
return api_error("TrimmedDataAccessException", std::move(msg));
}
static api_error request_limit_exceeded(std::string msg) {
return api_error("RequestLimitExceeded", std::move(msg));
}
static api_error serialization(std::string msg) {
return api_error("SerializationException", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
}
// Provide the "std::exception" interface, to make it easier to print this
// exception in log messages. Note that this function is *not* used to
// format the error to send it back to the client - server.cc has
// generate_error_reply() to format an api_error as the DynamoDB protocol
// requires.
virtual const char* what() const noexcept override;
mutable std::string _what_string;
};
}
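The hunk above makes `api_error` derive from `std::exception` and adds a `what()` override backed by a `mutable` string. The following is a minimal sketch of that pattern, not the actual Alternator implementation: the class name `demo_api_error`, the integer HTTP code, and the exact message layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-in for alternator::api_error; names and message
// format are assumptions, not taken from the real source.
class demo_api_error final : public std::exception {
public:
    int _http_code;
    std::string _type;
    std::string _msg;

    demo_api_error(std::string type, std::string msg, int http_code = 400)
        : _http_code(http_code), _type(std::move(type)), _msg(std::move(msg)) {}

    static demo_api_error serialization(std::string msg) {
        return demo_api_error("SerializationException", std::move(msg));
    }

    // what() is const and must return a pointer that stays valid after the
    // call, so the formatted text is cached in a mutable member and built
    // lazily on first use.
    const char* what() const noexcept override {
        if (_what_string.empty()) {
            _what_string = std::to_string(_http_code) + " " + _type + ": " + _msg;
        }
        return _what_string.c_str();
    }

private:
    mutable std::string _what_string;
};
```

With this in place the error can be caught as `const std::exception&` and logged through `what()`, while the client-facing reply is still produced separately from `_type` and `_msg`.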

File diff suppressed because it is too large.


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -27,9 +14,9 @@
#include <seastar/json/json_elements.hh>
#include <seastar/core/sharded.hh>
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
#include "service/client_state.hh"
#include "service_permit.hh"
#include "db/timeout_clock.hh"
#include "alternator/error.hh"
@@ -50,7 +37,17 @@ namespace cql3::selection {
}
namespace service {
class storage_service;
class storage_proxy;
}
namespace cdc {
class metadata;
}
namespace gms {
class gossiper;
}
namespace alternator {
@@ -63,6 +60,16 @@ public:
explicit make_jsonable(rjson::value&& value);
std::string to_json() const override;
};
/**
* Make a return type that serializes the object "streamed",
* i.e. directly to the HTTP output stream. Note: this is only useful for
* (very) large objects, since streaming has overhead of its own,
* but for massive lists of returned objects it can
* help avoid large allocations and many reallocations.
*/
json::json_return_type make_streamed(rjson::value&&);
struct json_string : public json::jsonable {
std::string _value;
public:
@@ -70,11 +77,82 @@ public:
std::string to_json() const override;
};
namespace parsed {
class path;
};
const std::map<sstring, sstring>& get_tags_of_table(schema_ptr schema);
std::optional<std::string> find_tag(const schema& s, const sstring& tag);
future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map);
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, which is then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectionExpression: for filtering from a full top-level attribute
// only the parts the user asked for in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// that at most one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
using attrs_to_get = attribute_path_map<std::monostate>;
class executor : public peering_sharded_service<executor> {
gms::gossiper& _gossiper;
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
service::storage_service& _ss;
cdc::metadata& _cdc_metadata;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator requests between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -87,8 +165,8 @@ public:
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _ssg(ssg) {}
executor(gms::gossiper& gossiper, service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, cdc::metadata& cdc_metadata, smp_service_group ssg)
: _gossiper(gossiper), _proxy(proxy), _mm(mm), _sdks(sdks), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -107,6 +185,8 @@ public:
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);
@@ -115,10 +195,6 @@ public:
future<> start();
future<> stop() { return make_ready_future<>(); }
future<> create_keyspace(std::string_view keyspace_name);
static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
static void set_default_timeout(db::timeout_clock::duration timeout);
@@ -140,19 +216,17 @@ public:
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::unordered_set<std::string>&);
const attrs_to_get&);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const std::unordered_set<std::string>&,
const attrs_to_get&,
rjson::value&,
bool = false);
void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;
void supplement_table_info(rjson::value& descr, const schema& schema) const;
void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;
static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
};
}
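The comment on `attribute_path_map` describes an `add()` function that rejects overlapping and conflicting paths, but its body is not part of this diff. The sketch below shows one way such a check could work on a node shaped like `attribute_path_map_node<T>`. The names `demo_node`, `demo_map`, `demo_add` and `path_op`, and the use of `std::invalid_argument`, are assumptions for illustration only.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// One step of a path: ".name" (string) or "[i]" (unsigned). Hypothetical alias.
using path_op = std::variant<std::string, unsigned>;

// Simplified node: holds data, or named children, or indexed children.
template <typename T>
struct demo_node {
    using members_t = std::unordered_map<std::string, std::unique_ptr<demo_node<T>>>;
    using indexes_t = std::map<unsigned, std::unique_ptr<demo_node<T>>>;
    std::optional<std::variant<T, members_t, indexes_t>> content;
};

template <typename T>
using demo_map = std::unordered_map<std::string, demo_node<T>>;

// Walk the path from the root, creating intermediate nodes as needed, and
// store `value` at the end. Throws on an overlapping or conflicting path.
template <typename T>
void demo_add(demo_map<T>& map, const std::string& root,
              const std::vector<path_op>& ops, T value) {
    using members_t = typename demo_node<T>::members_t;
    using indexes_t = typename demo_node<T>::indexes_t;
    demo_node<T>* n = &map[root];
    for (const path_op& op : ops) {
        // An ancestor already carries data: the new path is its descendant.
        if (n->content && std::holds_alternative<T>(*n->content)) {
            throw std::invalid_argument("overlapping paths");
        }
        std::unique_ptr<demo_node<T>>* slot = nullptr;
        if (const std::string* member = std::get_if<std::string>(&op)) {
            if (!n->content) {
                n->content.emplace(members_t());
            } else if (!std::holds_alternative<members_t>(*n->content)) {
                throw std::invalid_argument("conflicting paths");
            }
            slot = &std::get<members_t>(*n->content)[*member];
        } else {
            if (!n->content) {
                n->content.emplace(indexes_t());
            } else if (!std::holds_alternative<indexes_t>(*n->content)) {
                throw std::invalid_argument("conflicting paths");
            }
            slot = &std::get<indexes_t>(*n->content)[std::get<unsigned>(op)];
        }
        if (!*slot) {
            *slot = std::make_unique<demo_node<T>>();
        }
        n = slot->get();
    }
    // The target is identical to, or an ancestor of, an existing path.
    if (n->content) {
        throw std::invalid_argument("overlapping paths");
    }
    n->content.emplace(std::move(value));
}
```

Under these assumptions, adding `a.b.c` and then `a.b.d` succeeds, adding `a.b.c.d` afterwards is rejected as overlapping, and adding `a[1]` is rejected as conflicting because `a` already has members.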


@@ -1,27 +1,14 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "expressions.hh"
#include "serialization.hh"
#include "base64.hh"
#include "utils/base64.hh"
#include "conditions.hh"
#include "alternator/expressionsLexer.hpp"
#include "alternator/expressionsParser.hpp"
@@ -130,6 +117,27 @@ void condition_expression::append(condition_expression&& a, char op) {
}, _expression);
}
void path::check_depth_limit() {
if (1 + _operators.size() > depth_limit) {
throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));
}
}
std::ostream& operator<<(std::ostream& os, const path& p) {
os << p.root();
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
os << '.' << member;
},
[&] (unsigned index) {
os << '[' << index << ']';
}
}, op);
}
return os;
}
} // namespace parsed
// The following resolve_*() functions resolve references in parsed
@@ -151,10 +159,9 @@ void condition_expression::append(condition_expression&& a, char op) {
// we need to resolve the expression just once but then use it many times
// (once for each item to be filtered).
static void resolve_path(parsed::path& p,
static std::optional<std::string> resolve_path_component(const std::string& column_name,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
const std::string& column_name = p.root();
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error::validation(
@@ -166,7 +173,30 @@ static void resolve_path(parsed::path& p,
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
p.set_root(std::string(rjson::to_string_view(*value)));
return std::string(rjson::to_string_view(*value));
}
return std::nullopt;
}
static void resolve_path(parsed::path& p,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);
if (r) {
p.set_root(std::move(*r));
}
for (auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (std::string& s) {
r = resolve_path_component(s, expression_attribute_names, used_attribute_names);
if (r) {
s = std::move(*r);
}
},
[&] (unsigned index) {
// nothing to resolve
}
}, op);
}
}
@@ -385,24 +415,6 @@ void for_condition_expression_on(const parsed::condition_expression& ce, const n
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
// Take two JSON-encoded list values (remember that a list value is
// {"L": [...the actual list]}) and return the concatenation, again as
// a list value.
static rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2) {
const rjson::value* list1 = unwrap_list(v1);
const rjson::value* list2 = unwrap_list(v2);
if (!list1 || !list2) {
throw api_error::validation("UpdateExpression: list_append() given a non-list");
}
rjson::value cat = rjson::copy(*list1);
for (const auto& a : list2->GetArray()) {
rjson::push_back(cat, rjson::copy(a));
}
rjson::value ret = rjson::empty_object();
rjson::set(ret, "L", std::move(cat));
return ret;
}
// calculate_size() is ConditionExpression's size() function, i.e., it takes
// a JSON-encoded value and returns its "size" as defined differently for the
// different types - also as a JSON-encoded number.
@@ -439,11 +451,11 @@ static rjson::value calculate_size(const rjson::value& v) {
ret = base64_decoded_len(rjson::to_string_view(it->value));
} else {
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "null", rjson::value(true));
rjson::add(json_ret, "null", rjson::value(true));
return json_ret;
}
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "N", rjson::from_string(std::to_string(ret)));
rjson::add(json_ret, "N", rjson::from_string(std::to_string(ret)));
return json_ret;
}
@@ -462,7 +474,7 @@ static const rjson::value& calculate_value(const parsed::constant& c) {
static rjson::value to_bool_json(bool b) {
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "BOOL", rjson::value(b));
rjson::add(json_ret, "BOOL", rjson::value(b));
return json_ret;
}
@@ -487,7 +499,11 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return list_concatenate(v1, v2);
rjson::value ret = list_concatenate(v1, v2);
if (ret.IsNull()) {
throw api_error::validation("UpdateExpression: list_append() given a non-list");
}
return ret;
}
},
{"if_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -603,52 +619,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -667,6 +639,55 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
};
// Given a parsed::path and an item read from the table, extract the value
// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null
// value if the item or the requested attribute does not exist.
// Note that the item is assumed to be encoded in JSON using DynamoDB
// conventions - each level of a nested document is a map with one key -
// a type (e.g., "M" for map) - and its value is the representation of
// that value.
static rjson::value extract_path(const rjson::value* item,
const parsed::path& p, calculate_value_caller caller) {
if (!item) {
return rjson::null_value();
}
const rjson::value* v = rjson::find(*item, p.root());
if (!v) {
return rjson::null_value();
}
for (const auto& op : p.operators()) {
if (!v->IsObject() || v->MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed
// objects. But today Alternator does not validate the structure
// of nested documents before storing them, so this can happen on
// read.
throw api_error::validation(format("{}: malformed item read: {}", caller, *item));
}
const char* type = v->MemberBegin()->name.GetString();
v = &(v->MemberBegin()->value);
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (type[0] == 'M' && v->IsObject()) {
v = rjson::find(*v, member);
} else {
v = nullptr;
}
},
[&] (unsigned index) {
if (type[0] == 'L' && v->IsArray() && index < v->Size()) {
v = &(v->GetArray()[index]);
} else {
v = nullptr;
}
}
}, op);
if (!v) {
return rjson::null_value();
}
}
return rjson::copy(*v);
}
// Given a parsed::value, which can refer either to a constant value from
// ExpressionAttributeValues, to the value of some attribute, or to a function
// of other values, this function calculates the resulting value.
@@ -684,21 +705,12 @@ rjson::value calculate_value(const parsed::value& v,
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error::validation(
format("UpdateExpression: unknown function '{}' called.", f._function_name));
format("{}: unknown function '{}' called.", caller, f._function_name));
}
return function_it->second(caller, previous_item, f);
},
[&] (const parsed::path& p) -> rjson::value {
if (!previous_item) {
return rjson::null_value();
}
std::string update_path = p.root();
if (p.has_operators()) {
// FIXME: support this
throw api_error::validation("Reading attribute paths not yet implemented");
}
const rjson::value* previous_value = rjson::find(*previous_item, update_path);
return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
return extract_path(previous_item, p, caller);
}
}, v._value);
}
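The `extract_path()` function added above walks a document path such as `a.b[1]` through a nested item and returns a null value as soon as a step cannot be followed. The sketch below illustrates the same walk on a simplified in-memory model rather than on rjson values. The types `demo_value` and `demo_item`, the `make_*` helpers and `demo_extract_path` are all hypothetical, and a null pointer stands in for the null JSON value.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct demo_value;
using demo_value_ptr = std::shared_ptr<const demo_value>;

// Simplified stand-in for a DynamoDB-encoded value: a type tag plus payload.
struct demo_value {
    char type = 'S';                                 // 'S' string, 'M' map, 'L' list
    std::string scalar;                              // payload when type == 'S'
    std::map<std::string, demo_value_ptr> members;   // payload when type == 'M'
    std::vector<demo_value_ptr> elements;            // payload when type == 'L'
};

using demo_item = std::map<std::string, demo_value_ptr>;
using path_op = std::variant<std::string, unsigned>;

demo_value_ptr make_string(std::string s) {
    auto v = std::make_shared<demo_value>();
    v->scalar = std::move(s);
    return v;
}

demo_value_ptr make_map(std::map<std::string, demo_value_ptr> m) {
    auto v = std::make_shared<demo_value>();
    v->type = 'M';
    v->members = std::move(m);
    return v;
}

demo_value_ptr make_list(std::vector<demo_value_ptr> l) {
    auto v = std::make_shared<demo_value>();
    v->type = 'L';
    v->elements = std::move(l);
    return v;
}

// Follow root, then each operator. Returns nullptr if the root is missing,
// a member is applied to a non-map, or an index is applied to a non-list
// or is out of range.
const demo_value* demo_extract_path(const demo_item& item, const std::string& root,
                                    const std::vector<path_op>& ops) {
    auto root_it = item.find(root);
    const demo_value* v = root_it == item.end() ? nullptr : root_it->second.get();
    for (const path_op& op : ops) {
        if (!v) {
            return nullptr;
        }
        if (const std::string* member = std::get_if<std::string>(&op)) {
            auto it = v->members.find(*member);
            v = (v->type == 'M' && it != v->members.end()) ? it->second.get() : nullptr;
        } else {
            unsigned index = std::get<unsigned>(op);
            v = (v->type == 'L' && index < v->elements.size()) ? v->elements[index].get() : nullptr;
        }
    }
    return v;
}
```

For an item `{"a": {"b": ["x", "y"]}}` in this model, the path `a.b[1]` yields the string `y`, while `a.b[5]`, `a[0]` and any path under a missing root yield a null pointer.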


@@ -1,25 +1,9 @@
/*
* Copyright 2019 ScyllaDB
*
* This file is part of Scylla. See the LICENSE.PROPRIETARY file in the
* top-level directory for licensing information.
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
/*


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -49,15 +36,23 @@ class path {
// dot (e.g., ".xyz").
std::string _root;
std::vector<std::variant<std::string, unsigned>> _operators;
// It is useful to limit the depth of a user-specified path, because it
// allows us to use recursive algorithms without worrying about recursion
// depth. DynamoDB officially limits the length of paths to 32 components
// (including the root) so let's use the same limit.
static constexpr unsigned depth_limit = 32;
void check_depth_limit();
public:
void set_root(std::string root) {
_root = std::move(root);
}
void add_index(unsigned i) {
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string name) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
const std::string& root() const {
return _root;
@@ -65,6 +60,13 @@ public:
bool has_operators() const {
return !_operators.empty();
}
const std::vector<std::variant<std::string, unsigned>>& operators() const {
return _operators;
}
std::vector<std::variant<std::string, unsigned>>& operators() {
return _operators;
}
friend std::ostream& operator<<(std::ostream&, const path&);
};
// When an expression is first parsed, all constants are references, like
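The changes to `parsed::path` above add a depth limit of 32 components (including the root) and a stream operator that prints paths such as `a.b[2].c`. The sketch below combines both in a standalone class. `demo_path` is a hypothetical name, and `std::length_error` is used here in place of the `expressions_syntax_error` thrown by the real code.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for parsed::path.
class demo_path {
    std::string _root;
    std::vector<std::variant<std::string, unsigned>> _operators;
    // The root counts as one component, so at most 31 operators are allowed.
    static constexpr unsigned depth_limit = 32;

    void check_depth_limit() const {
        if (1 + _operators.size() > depth_limit) {
            throw std::length_error("Document path exceeded " +
                                    std::to_string(depth_limit) + " nesting levels");
        }
    }

public:
    explicit demo_path(std::string root) : _root(std::move(root)) {}

    void add_index(unsigned i) {
        _operators.emplace_back(i);
        check_depth_limit();
    }

    void add_dot(std::string name) {
        _operators.emplace_back(std::move(name));
        check_depth_limit();
    }

    // Print the root, then ".name" for members and "[i]" for indexes.
    friend std::ostream& operator<<(std::ostream& os, const demo_path& p) {
        os << p._root;
        for (const auto& op : p._operators) {
            if (const std::string* member = std::get_if<std::string>(&op)) {
                os << '.' << *member;
            } else {
                os << '[' << std::get<unsigned>(op) << ']';
            }
        }
        return os;
    }
};
```

Checking the limit at insertion time means any later recursive traversal of a path can assume a bounded depth without re-validating it.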


@@ -1,29 +1,15 @@
/*
* Copyright 2020 ScyllaDB
* Copyright 2020-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "seastarx.hh"
#include "service/storage_proxy.hh"
#include "service/storage_proxy.hh"
#include "service/paxos/cas_request.hh"
#include "utils/rjson.hh"
#include "executor.hh"


@@ -1,25 +1,13 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "base64.hh"
#include "utils/base64.hh"
#include "utils/rjson.hh"
#include "log.hh"
#include "serialization.hh"
#include "error.hh"
@@ -68,7 +56,7 @@ struct from_json_visitor {
bo.write(t.from_string(rjson::to_string_view(v)));
}
void operator()(const bytes_type_impl& t) const {
bo.write(base64_decode(v));
bo.write(rjson::base64_decode(v));
}
void operator()(const boolean_type_impl& t) const {
bo.write(boolean_type->decompose(v.GetBool()));
@@ -114,18 +102,18 @@ struct to_json_visitor {
void operator()(const decimal_type_impl& t) const {
auto s = to_json_string(*decimal_type, bytes(bv));
//FIXME(sarna): unnecessary copy
rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(s));
rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(s));
}
void operator()(const string_type_impl& t) {
rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(reinterpret_cast<const char *>(bv.data()), bv.size()));
rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(reinterpret_cast<const char *>(bv.data()), bv.size()));
}
void operator()(const bytes_type_impl& t) const {
std::string b64 = base64_encode(bv);
rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(b64));
rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(b64));
}
// default
void operator()(const abstract_type& t) const {
rjson::set_with_string_name(deserialized, type_ident, rjson::parse(to_json_string(t, bytes(bv))));
rjson::add_with_string_name(deserialized, type_ident, rjson::parse(to_json_string(t, bytes(bv))));
}
};
@@ -196,7 +184,7 @@ bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column
format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));
}
if (column.type == bytes_type) {
return base64_decode(it->value);
return rjson::base64_decode(it->value);
} else {
return column.type->from_string(rjson::to_string_view(it->value));
}
@@ -258,11 +246,9 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
throw api_error::validation(format("{}: expected number, found type '{}'", diagnostic, it->name));
}
try {
if (it->value.IsNumber()) {
// FIXME(sarna): should use big_decimal constructor with numeric values directly:
return big_decimal(rjson::print(it->value));
}
if (!it->value.IsString()) {
// We shouldn't reach here. Callers normally validate their input
// earlier with validate_value().
throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));
}
return big_decimal(rjson::to_string_view(it->value));
@@ -271,6 +257,21 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
}
}
std::optional<big_decimal> try_unwrap_number(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return std::nullopt;
}
auto it = v.MemberBegin();
if (it->name != "N" || !it->value.IsString()) {
return std::nullopt;
}
try {
return big_decimal(rjson::to_string_view(it->value));
} catch (const marshal_exception& e) {
return std::nullopt;
}
}
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return {"", nullptr};
@@ -278,7 +279,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&
auto it = v.MemberBegin();
const std::string it_key = it->name.GetString();
if (it_key != "SS" && it_key != "BS" && it_key != "NS") {
return {"", nullptr};
return {std::move(it_key), nullptr};
}
return std::make_pair(it_key, &(it->value));
}
@@ -301,7 +302,7 @@ rjson::value number_add(const rjson::value& v1, const rjson::value& v2) {
auto n2 = unwrap_number(v2, "UpdateExpression");
rjson::value ret = rjson::empty_object();
std::string str_ret = std::string((n1 + n2).to_string());
rjson::set(ret, "N", rjson::from_string(str_ret));
rjson::add(ret, "N", rjson::from_string(str_ret));
return ret;
}
@@ -310,7 +311,7 @@ rjson::value number_subtract(const rjson::value& v1, const rjson::value& v2) {
auto n2 = unwrap_number(v2, "UpdateExpression");
rjson::value ret = rjson::empty_object();
std::string str_ret = std::string((n1 - n2).to_string());
rjson::set(ret, "N", rjson::from_string(str_ret));
rjson::add(ret, "N", rjson::from_string(str_ret));
return ret;
}
@@ -336,7 +337,7 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {
}
}
rjson::value ret = rjson::empty_object();
rjson::set_with_string_name(ret, set1_type, std::move(sum));
rjson::add_with_string_name(ret, set1_type, std::move(sum));
return ret;
}
@@ -348,7 +349,7 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error::validation(format("Mismatched set types: {} and {}", set1_type, set2_type));
throw api_error::validation(format("Set DELETE type mismatch: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error::validation("UpdateExpression: DELETE operation can only be performed on a set");
@@ -364,7 +365,7 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
return std::nullopt;
}
rjson::value ret = rjson::empty_object();
rjson::set_with_string_name(ret, set1_type, rjson::empty_array());
rjson::add_with_string_name(ret, set1_type, rjson::empty_array());
rjson::value& result_set = ret[set1_type];
for (const auto& a : set1_raw) {
rjson::push_back(result_set, rjson::copy(a));
@@ -372,4 +373,23 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
return ret;
}
// Take two JSON-encoded list values (remember that a list value is
// {"L": [...the actual list]}) and return the concatenation, again as
// a list value.
// Returns a null value if one of the arguments is not actually a list.
rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2) {
const rjson::value* list1 = unwrap_list(v1);
const rjson::value* list2 = unwrap_list(v2);
if (!list1 || !list2) {
return rjson::null_value();
}
rjson::value cat = rjson::copy(*list1);
for (const auto& a : list2->GetArray()) {
rjson::push_back(cat, rjson::copy(a));
}
rjson::value ret = rjson::empty_object();
rjson::add(ret, "L", std::move(cat));
return ret;
}
}
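
Note on the `list_concatenate()` hunk above: the following is a minimal, self-contained sketch of the same idea, not the Alternator code itself. It assumes a hypothetical stand-in type `typed_value` in place of `rjson::value`, and it signals "not a list" with an empty `std::optional`, whereas the function in the diff returns a JSON null value.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a DynamoDB typed value such as {"L": [...]}.
// The real code works on rjson::value; this struct exists only for the sketch.
struct typed_value {
    std::string type;               // "L" for a list, "SS"/"NS"/"BS" for sets, ...
    std::vector<std::string> items; // the elements, kept as strings for brevity
};

// Concatenate two list values. Returns std::nullopt if either argument is
// not a list, mirroring the "null value" result described in the diff.
std::optional<typed_value> concat_lists(const typed_value& a, const typed_value& b) {
    if (a.type != "L" || b.type != "L") {
        return std::nullopt;
    }
    typed_value ret{"L", a.items};  // copy the first list
    ret.items.insert(ret.items.end(), b.items.begin(), b.items.end());
    return ret;
}
```

As in the diff, both inputs are left untouched and the result is a fresh value, so the caller decides what to do when one operand is not a list.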


@@ -1,28 +1,16 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <string>
#include <string_view>
#include <optional>
#include "types.hh"
#include "schema_fwd.hh"
#include "keys.hh"
@@ -64,6 +52,10 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);
// try_unwrap_number is like unwrap_number, but returns an unset optional
// when the given v does not encode a number.
std::optional<big_decimal> try_unwrap_number(const rjson::value& v);
// Check if a given JSON object encodes a set (i.e., it is a {"SS": [...]}, or "NS", "BS"),
// and return the set's type and a pointer to that set. If the object does not encode a set,
// the returned value is {"", nullptr}
@@ -85,5 +77,11 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2);
// DynamoDB does not allow empty sets, so if resulting set is empty, return
// an unset optional instead.
std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded list values (remember that a list value is
// {"L": [...the actual list]}) and return the concatenation, again as
// a list value.
// Returns a null value if one of the arguments is not actually a list.
rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2);
}

View File

@@ -1,36 +1,29 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "alternator/server.hh"
#include "log.hh"
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
#include "error.hh"
#include "utils/rjson.hh"
#include "auth.hh"
#include <cctype>
#include "cql3/query_processor.hh"
#include "service/storage_service.hh"
#include "service/storage_proxy.hh"
#include "locator/snitch_base.hh"
#include "gms/gossiper.hh"
#include "utils/overloaded_functor.hh"
#include "utils/fb_utilities.hh"
static logging::logger slogger("alternator-server");
@@ -59,6 +52,40 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
return tokens;
}
// Handle CORS (Cross-origin resource sharing) in the HTTP request:
// If the request has the "Origin" header specifying where the script which
// makes this request comes from, we need to reply with the header
// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.
// Additionally, if preflight==true (i.e., this is an OPTIONS request),
// the script can also "request" in headers that the server allows it to use
// some HTTP methods and headers in the followup request, and the server
// should respond by "allowing" them in the response headers.
// We also add the header "Access-Control-Expose-Headers" to let the script
// access additional headers in the response.
// This handle_CORS() should be used when handling any HTTP method - both the
// usual GET and POST, and also the "preflight" OPTIONS method.
static void handle_CORS(const request& req, reply& rep, bool preflight) {
if (!req.get_header("origin").empty()) {
rep.add_header("Access-Control-Allow-Origin", "*");
// This is the list that DynamoDB returns for expose headers. I am
// not sure why not just return "*" here, what's the risk?
rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");
if (preflight) {
sstring s = req.get_header("Access-Control-Request-Headers");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Headers", std::move(s));
}
s = req.get_header("Access-Control-Request-Method");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Methods", std::move(s));
}
// Our CORS response never changes anyway, so let the browser cache it
// for two hours (Chrome's maximum):
rep.add_header("Access-Control-Max-Age", "7200");
}
}
}
// DynamoDB HTTP error responses are structured as follows
// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
// Our handlers throw an exception to report an error. If the exception
@@ -93,6 +120,10 @@ public:
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
@@ -105,14 +136,16 @@ public:
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}), _type("json") { }
}) { }
api_handler(const api_handler&) = default;
future<std::unique_ptr<reply>> handle(const sstring& path,
std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
return _f_handle(std::move(req), std::move(rep)).then(
[this](std::unique_ptr<reply> rep) {
rep->done(_type);
rep->set_mime_type("application/x-amz-json-1.0");
rep->done();
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}
@@ -126,7 +159,6 @@ protected:
}
future_handler_function _f_handle;
sstring _type;
};
class gated_handler : public handler_base {
@@ -146,6 +178,7 @@ public:
health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
@@ -153,21 +186,25 @@ protected:
};
class local_nodelist_handler : public gated_handler {
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
public:
local_nodelist_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
local_nodelist_handler(seastar::gate& pending_requests, service::storage_proxy& proxy, gms::gossiper& gossiper)
: gated_handler(pending_requests)
, _proxy(proxy)
, _gossiper(gossiper) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
rjson::value results = rjson::empty_array();
// It's very easy to get a list of all live nodes on the cluster,
// using gms::get_local_gossiper().get_live_members(). But getting
// using _gossiper.get_live_members(). But getting
// just the list of live nodes in this DC needs more elaborate code:
sstring local_dc = locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(
utils::fb_utilities::get_broadcast_address());
std::unordered_set<gms::inet_address> local_dc_nodes =
service::get_local_storage_service().get_token_metadata().
get_topology().get_datacenter_endpoints().at(local_dc);
_proxy.get_token_metadata_ptr()->get_topology().get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (gms::get_local_gossiper().is_alive(ip)) {
if (_gossiper.is_alive(ip)) {
rjson::push_back(results, rjson::from_string(ip.to_sstring()));
}
}
@@ -178,10 +215,26 @@ protected:
}
};
future<> server::verify_signature(const request& req) {
// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS
// request before ("pre-flight") the main request. The response to this
// request can be empty, but needs to have the right headers (which we
// fill with handle_CORS())
class options_handler : public gated_handler {
public:
options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, true);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", sstring(""));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<>();
return make_ready_future<std::string>("<unauthenticated request>");
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
@@ -192,24 +245,31 @@ future<> server::verify_signature(const request& req) {
throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
}
std::string host = host_it->second;
std::vector<std::string_view> credentials_raw = split(authorization_it->second, ' ');
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
}
authorization_header.remove_prefix(pos+1);
std::string credential;
std::string user_signature;
std::string signed_headers_str;
std::vector<std::string_view> signed_headers;
for (std::string_view entry : credentials_raw) {
do {
// Either one of a comma or space can mark the end of an entry
pos = authorization_header.find_first_of(" ,");
std::string_view entry = authorization_header.substr(0, pos);
if (pos != std::string_view::npos) {
authorization_header.remove_prefix(pos + 1);
}
if (entry.empty()) {
continue;
}
std::vector<std::string_view> entry_split = split(entry, '=');
if (entry_split.size() != 2) {
if (entry != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Only AWS4-HMAC-SHA256 algorithm is supported. Found: {}", entry));
}
continue;
}
std::string_view auth_value = entry_split[1];
// Commas appear as an additional (quite redundant) delimiter
if (auth_value.back() == ',') {
auth_value.remove_suffix(1);
}
if (entry_split[0] == "Credential") {
credential = std::string(auth_value);
} else if (entry_split[0] == "Signature") {
@@ -219,7 +279,8 @@ future<> server::verify_signature(const request& req) {
signed_headers = split(auth_value, ';');
std::sort(signed_headers.begin(), signed_headers.end());
}
}
} while (pos != std::string_view::npos);
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error::validation(format("Incorrect credential information format: {}", credential));
@@ -243,10 +304,10 @@ future<> server::verify_signature(const request& req) {
}
}
auto cache_getter = [&qp = _qp] (std::string username) {
return get_key_from_roles(qp, std::move(username));
auto cache_getter = [&proxy = _proxy] (std::string username) {
return get_key_from_roles(proxy, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req,
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
@@ -256,52 +317,102 @@ future<> server::verify_signature(const request& req) {
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,
datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");
datestamp, signed_headers_str, signed_headers_map, content, region, service, "");
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
throw api_error::unrecognized_client("The security token included in the request is invalid.");
}
return user;
});
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());
return tracing_instance.create_session(tracing::trace_type::QUERY, props);
}
// truncated_content_view() prints a potentially long chunked_content for
// debugging purposes. In the common case when the content is not excessively
// long, it just returns a view into the given content, without any copying.
// But when the content is very long, it is truncated after some arbitrary
// max_len (or one chunk, whichever comes first), with "<truncated>" added at
// the end. To do this modification to the string, we need to create a new
// std::string, so the caller must pass us a reference to one, "buf", where
// we can store the content. The returned view is only alive for as long as this
// buf is kept alive.
static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
constexpr size_t max_len = 1024;
if (content.empty()) {
return std::string_view();
} else if (content.size() == 1 && content.begin()->size() <= max_len) {
return std::string_view(content.begin()->get(), content.begin()->size());
} else {
buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
return std::string_view(buf);
}
}
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, sstring_view op, const chunked_content& query) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
trace_state = create_tracing_session(tracing_instance);
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
tracing::set_username(trace_state, auth::authenticated_user(username));
}
return trace_state;
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
_executor._stats.total_operations++;
sstring target = req->get_header(TARGET);
std::vector<std::string_view> split_target = split(target, '.');
//NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
std::string op = split_target.empty() ? std::string() : std::string(split_target.back());
slogger.trace("Request: {} {} {}", op, req->content, req->_headers);
return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
throw api_error::unknown_operation(format("Unsupported operation {}", op));
}
return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
[this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(trace_state, op);
// JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
// FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
// Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
size_t mem_estimate = req->content.size() * 3 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
});
});
});
});
});
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// TODO: consider the case where req->content_length is missing. Maybe
// we need to take the content_length_limit and return some of the units
// when we finish read_content_and_verify_signature?
size_t mem_estimate = req->content_length * 2 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
assert(req->content_stream);
chunked_content content = co_await util::read_entire_stream(*req->content_stream);
auto username = co_await verify_signature(*req, content);
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
}
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
}
void server::set_routes(routes& r) {
@@ -322,17 +433,19 @@ void server::set_routes(routes& r) {
// consider this to be a security risk, because an attacker can already
// scan an entire subnet for nodes responding to the health request,
// or even just scan for open ports.
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests, _proxy, _gossiper));
r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));
}
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec, cql3::query_processor& qp)
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
, _qp(qp)
, _proxy(proxy)
, _gossiper(gossiper)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
@@ -389,6 +502,12 @@ server::server(executor& exec, cql3::query_processor& qp)
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));
}},
{"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_time_to_live(client_state, std::move(permit), std::move(json_request));
}},
{"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request));
}},
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_streams(client_state, std::move(permit), std::move(json_request));
}},
@@ -405,42 +524,37 @@ server::server(executor& exec, cql3::query_processor& qp)
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter) {
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
}
return seastar::async([this, addr, port, https_port, creds] {
try {
_executor.start().get();
_executor.start().get();
if (port) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
if (ep) {
slogger.warn("Exception loading {}: {}", files, ep);
} else {
slogger.info("Reloaded {}", files);
}
}).get0());
_https_server.listen(socket_address{addr, *https_port}).get();
_enabled_servers.push_back(std::ref(_https_server));
}
} catch (...) {
slogger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF", std::current_exception());
std::throw_with_nested(std::runtime_error(
format("Failed to set up Alternator HTTP server on {} port {}, TLS port {}",
addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF")));
if (port) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_content_streaming(true);
_https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
if (ep) {
slogger.warn("Exception loading {}: {}", files, ep);
} else {
slogger.info("Reloaded {}", files);
}
}).get0());
_https_server.listen(socket_address{addr, *https_port}).get();
_enabled_servers.push_back(std::ref(_https_server));
}
});
}
@@ -462,7 +576,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
return;
}
try {
_parsed_document = rjson::parse_yieldable(_raw_document);
_parsed_document = rjson::parse_yieldable(std::move(_raw_document));
_current_exception = nullptr;
} catch (...) {
_current_exception = std::current_exception();
@@ -472,12 +586,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
})) {
}
future<rjson::value> server::json_parser::parse(std::string_view content) {
future<rjson::value> server::json_parser::parse(chunked_content&& content) {
if (content.size() < yieldable_parsing_threshold) {
return make_ready_future<rjson::value>(rjson::parse(content));
return make_ready_future<rjson::value>(rjson::parse(std::move(content)));
}
return with_semaphore(_parsing_sem, 1, [this, content] {
_raw_document = content;
return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {
_raw_document = std::move(content);
_document_waiting.signal();
return _document_parsed.wait().then([this] {
if (_current_exception) {
@@ -495,5 +609,12 @@ future<> server::json_parser::stop() {
return std::move(_run_parse_json_thread);
}
const char* api_error::what() const noexcept {
if (_what_string.empty()) {
_what_string = format("{} {}: {}", _http_code, _type, _msg);
}
return _what_string.c_str();
}
}
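
Note on the `verify_signature()` hunk above: the new loop treats either a space or a comma as the end of an entry in the `Authorization` header. The sketch below isolates that tokenizing step under stated assumptions: `parse_authorization` is a hypothetical helper name, it reports errors with `std::invalid_argument` instead of `api_error`, and it splits each entry at the first `=` rather than calling the file's `split()` helper.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper (not part of Alternator): turns
//   "AWS4-HMAC-SHA256 Credential=..., SignedHeaders=..., Signature=..."
// into a key -> value map.
std::map<std::string, std::string> parse_authorization(std::string_view header) {
    constexpr std::string_view algorithm = "AWS4-HMAC-SHA256";
    auto pos = header.find_first_of(' ');
    if (pos == std::string_view::npos || header.substr(0, pos) != algorithm) {
        throw std::invalid_argument("Authorization header must use AWS4-HMAC-SHA256");
    }
    header.remove_prefix(pos + 1);

    std::map<std::string, std::string> fields;
    do {
        // Either a comma or a space can mark the end of an entry.
        pos = header.find_first_of(" ,");
        std::string_view entry = header.substr(0, pos);
        if (pos != std::string_view::npos) {
            header.remove_prefix(pos + 1);
        }
        if (entry.empty()) {
            continue; // e.g. the space that follows a comma
        }
        auto eq = entry.find('=');
        if (eq == std::string_view::npos) {
            throw std::invalid_argument("malformed Authorization entry");
        }
        fields.emplace(std::string(entry.substr(0, eq)), std::string(entry.substr(eq + 1)));
    } while (pos != std::string_view::npos);
    return fields;
}
```

Because empty entries are skipped, the same loop accepts both `a=1, b=2` and `a=1,b=2`, which is the reason the diff drops the old "strip a trailing comma" step.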


@@ -1,37 +1,28 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "alternator/executor.hh"
#include <seastar/core/future.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/http/httpd.hh>
#include <seastar/net/tls.hh>
#include <optional>
#include "alternator/auth.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
namespace alternator {
using chunked_content = rjson::chunked_content;
class server {
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
@@ -41,7 +32,8 @@ class server {
http_server _http_server;
http_server _https_server;
executor& _executor;
cql3::query_processor& _qp;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
key_cache _key_cache;
bool _enforce_authorization;
@@ -50,10 +42,11 @@ class server {
alternator_callbacks_map _callbacks;
semaphore* _memory_limiter;
utils::updateable_value<uint32_t> _max_concurrent_requests;
class json_parser {
static constexpr size_t yieldable_parsing_threshold = 16*KB;
std::string_view _raw_document;
chunked_content _raw_document;
rjson::value _parsed_document;
std::exception_ptr _current_exception;
semaphore _parsing_sem{1};
@@ -63,21 +56,25 @@ class server {
future<> _run_parse_json_thread;
public:
json_parser();
future<rjson::value> parse(std::string_view content);
// Moving a chunked_content into parse() allows parse() to free each
// chunk as soon as it is parsed, so when chunks are relatively small,
// we don't need to store the sum of unparsed and parsed sizes.
future<rjson::value> parse(chunked_content&& content);
future<> stop();
};
json_parser _json_parser;
public:
server(executor& executor, cql3::query_processor& qp);
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter);
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);
future<> verify_signature(const seastar::httpd::request& r);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
// If verification succeeds, returns the authenticated user's username
future<std::string> verify_signature(const seastar::httpd::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
};
}
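
The comment on `json_parser::parse(chunked_content&&)` above says that taking ownership of the chunks lets the parser free each one as soon as it has been consumed. The following is a minimal sketch of that idea only, not Scylla's parser: `chunked_content_sketch`, `parse_trace` and `consume_and_free` are hypothetical names, and a `std::vector<std::string>` stands in for the real buffer type.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for rjson::chunked_content: a sequence of owned buffers.
using chunked_content_sketch = std::vector<std::string>;

struct parse_trace {
    std::size_t bytes_consumed = 0;            // total input bytes handed to the "parser"
    std::vector<std::size_t> raw_bytes_left;   // unparsed bytes still held after each chunk
};

// Takes ownership of the input and releases every chunk right after it is
// consumed, so the raw input shrinks while the parsed result grows.
parse_trace consume_and_free(chunked_content_sketch&& content) {
    chunked_content_sketch chunks = std::move(content);
    std::size_t unparsed = 0;
    for (const auto& c : chunks) {
        unparsed += c.size();
    }
    parse_trace trace;
    for (auto& c : chunks) {
        trace.bytes_consumed += c.size();   // a real parser would feed `c` to the JSON reader here
        unparsed -= c.size();
        std::string().swap(c);              // release this chunk's storage immediately
        trace.raw_bytes_left.push_back(unparsed);
    }
    return trace;
}
```

With a `std::string_view` parameter the caller would have to keep the whole request body alive until parsing finished; moving the chunks in is what makes the early release possible.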


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "stats.hh"
@@ -38,6 +25,7 @@ stats::stats() : api_operations{} {
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
@@ -96,6 +84,8 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
seastar::metrics::make_total_operations("requests_shed", requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload.")),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -92,6 +79,7 @@ public:
uint64_t write_using_lwt = 0;
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
// CQL-derived stats
cql3::cql_stats cql_stats;
private:


@@ -1,22 +1,9 @@
/*
* Copyright 2020 ScyllaDB
* Copyright 2020-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <type_traits>
@@ -26,21 +13,22 @@
#include <seastar/json/formatter.hh>
#include "base64.hh"
#include "utils/base64.hh"
#include "log.hh"
#include "database.hh"
#include "db/config.hh"
#include "cdc/log.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/metadata.hh"
#include "db/system_distributed_keyspace.hh"
#include "utils/UUID_gen.hh"
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "cql3/type_json.hh"
#include "cql3/column_identifier.hh"
#include "schema_builder.hh"
#include "service/storage_service.hh"
#include "service/storage_proxy.hh"
#include "gms/feature.hh"
#include "gms/feature_service.hh"
@@ -88,7 +76,7 @@ struct rapidjson::internal::TypeHelper<ValueType, utils::UUID>
{};
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};
return db_clock::time_point{utils::UUID_gen::unix_timestamp(uuid)};
}
/**
@@ -153,24 +141,29 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
auto limit = rjson::get_opt<int>(request, "Limit").value_or(std::numeric_limits<int>::max());
auto streams_start = rjson::get_opt<stream_arn>(request, "ExclusiveStartStreamArn");
auto table = find_table(_proxy, request);
auto& db = _proxy.get_db().local();
auto& cfs = db.get_column_families();
auto i = cfs.begin();
auto e = cfs.end();
auto db = _proxy.data_dictionary();
auto cfs = db.get_tables();
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
// TODO: the unordered_map here is not really well suited for partial
// querying - we're sorting on local hash order, and creating a table
// between queries may or may not miss info. But that should be rare,
// and we can probably expect this to be a single call.
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id() < t2.schema()->id();
});
auto i = cfs.begin();
auto e = cfs.end();
if (streams_start) {
i = std::find_if(i, e, [&](const std::pair<utils::UUID, lw_shared_ptr<column_family>>& p) {
return p.first == streams_start
&& cdc::get_base_table(db, *p.second->schema())
&& is_alternator_keyspace(p.second->schema()->ks_name())
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())
;
});
if (i != e) {
@@ -184,7 +177,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
std::optional<stream_arn> last;
for (;limit > 0 && i != e; ++i) {
auto s = i->second->schema();
auto s = i->schema();
auto& ks_name = s->ks_name();
auto& cf_name = s->cf_name();
@@ -194,27 +187,27 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
if (table && ks_name != table->ks_name()) {
continue;
}
if (cdc::is_log_for_some_table(ks_name, cf_name)) {
if (table && table != cdc::get_base_table(db, *s)) {
if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {
if (table && table != cdc::get_base_table(db.real_database(), *s)) {
continue;
}
rjson::value new_entry = rjson::empty_object();
last = i->first;
rjson::set(new_entry, "StreamArn", *last);
rjson::set(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));
rjson::set(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));
last = i->schema()->id();
rjson::add(new_entry, "StreamArn", *last);
rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));
rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));
rjson::push_back(streams, std::move(new_entry));
--limit;
}
}
rjson::set(ret, "Streams", std::move(streams));
rjson::add(ret, "Streams", std::move(streams));
if (last) {
rjson::set(ret, "LastEvaluatedStreamArn", *last);
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
@@ -294,7 +287,8 @@ sequence_number::sequence_number(std::string_view v)
// view directly.
uint128_t tmp{std::string(v)};
// see above
return utils::UUID_gen::get_time_UUID_raw(uint64_t(tmp >> 64), uint64_t(tmp & std::numeric_limits<uint64_t>::max()));
return utils::UUID_gen::get_time_UUID_raw(utils::UUID_gen::decimicroseconds{uint64_t(tmp >> 64)},
uint64_t(tmp & std::numeric_limits<uint64_t>::max()));
}())
{}
@@ -421,7 +415,7 @@ using namespace std::string_literals;
* This will be a partial overlap, but it is the best we can do.
*/
static std::chrono::seconds confidence_interval(const database& db) {
static std::chrono::seconds confidence_interval(data_dictionary::database db) {
return std::chrono::seconds(db.get_config().alternator_streams_time_window_s());
}
@@ -439,12 +433,12 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");
schema_ptr schema, bs;
auto& db = _proxy.get_db().local();
auto db = _proxy.data_dictionary();
try {
auto& cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(stream_arn);
schema = cf.schema();
bs = cdc::get_base_table(_proxy.get_db().local(), *schema);
bs = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
}
@@ -470,8 +464,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto status = "DISABLED";
if (opts.enabled()) {
auto& metadata = _ss.get_cdc_metadata();
if (!metadata.streams_available()) {
if (!_cdc_metadata.streams_available()) {
status = "ENABLING";
} else {
status = "ENABLED";
@@ -480,18 +473,18 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto ttl = std::chrono::seconds(opts.ttl());
rjson::set(stream_desc, "StreamStatus", rjson::from_string(status));
rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));
stream_view_type type = cdc_options_to_steam_view_type(opts);
rjson::set(stream_desc, "StreamArn", alternator::stream_arn(schema->id()));
rjson::set(stream_desc, "StreamViewType", type);
rjson::set(stream_desc, "TableName", rjson::from_string(table_name(*bs)));
rjson::add(stream_desc, "StreamArn", alternator::stream_arn(schema->id()));
rjson::add(stream_desc, "StreamViewType", type);
rjson::add(stream_desc, "TableName", rjson::from_string(table_name(*bs)));
describe_key_schema(stream_desc, *bs);
if (!opts.enabled()) {
rjson::set(ret, "StreamDescription", std::move(stream_desc));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
}
@@ -499,19 +492,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// TODO: creation time
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
// cannot really "resume" query, must iterate all data. because we cannot query neither "time" (pk) > something,
// or on expired...
// TODO: maybe add secondary index to topology table to enable this?
return _sdks.cdc_get_versioned_streams({ normal_token_owners }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
auto i = topologies.lower_bound(low_ts);
// need first gen _intersecting_ the timestamp.
if (i != topologies.begin()) {
i = std::prev(i);
}
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
auto e = topologies.end();
auto prev = e;
@@ -519,9 +504,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
std::optional<shard_id> last;
// i is now at the youngest generation we include. make a mark of it.
auto first = i;
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left off.
if (shard_start) {
i = topologies.find(shard_start->time);
@@ -547,7 +530,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
};
// need a prev even if we are skipping stuff
if (i != first) {
if (i != topologies.begin()) {
prev = std::prev(i);
}
@@ -598,19 +581,19 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::set(shard, "ParentShardId", shard_id(prev->first, *pid));
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::set(shard, "ShardId", *last);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::set(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch().count())));
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::set(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch().count())));
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::set(shard, "SequenceNumberRange", std::move(range));
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
@@ -622,11 +605,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
}
if (last) {
rjson::set(stream_desc, "LastEvaluatedShardId", *last);
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
rjson::set(stream_desc, "Shards", std::move(shards));
rjson::set(ret, "StreamDescription", std::move(stream_desc));
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
});
@@ -734,18 +717,18 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
}
auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");
auto& db = _proxy.get_db().local();
auto db = _proxy.data_dictionary();
schema_ptr schema = nullptr;
std::optional<shard_id> sid;
try {
auto& cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(stream_arn);
schema = cf.schema();
sid = rjson::get<shard_id>(request, "ShardId");
} catch (...) {
}
if (!schema || !cdc::get_base_table(db, *schema) || !is_alternator_keyspace(schema->ks_name())) {
if (!schema || !cdc::get_base_table(db.real_database(), *schema) || !is_alternator_keyspace(schema->ks_name())) {
throw api_error::resource_not_found("Invalid StreamArn");
}
if (!sid) {
@@ -771,7 +754,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
inclusive_of_threshold = true;
break;
case shard_iterator_type::LATEST:
threshold = utils::UUID_gen::min_time_UUID((db_clock::now() - confidence_interval(db)).time_since_epoch().count());
threshold = utils::UUID_gen::min_time_UUID((db_clock::now() - confidence_interval(db)).time_since_epoch());
inclusive_of_threshold = true;
break;
}
@@ -779,7 +762,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
shard_iterator iter(stream_arn, *sid, threshold, inclusive_of_threshold);
auto ret = rjson::empty_object();
rjson::set(ret, "ShardIterator", iter);
rjson::add(ret, "ShardIterator", iter);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
}
@@ -822,12 +805,12 @@ future<executor::request_return_type> executor::get_records(client_state& client
throw api_error::validation("Limit must be 1 or more");
}
auto& db = _proxy.get_db().local();
auto db = _proxy.data_dictionary();
schema_ptr schema, base;
try {
auto& log_table = db.find_column_family(iter.table);
auto log_table = db.find_column_family(iter.table);
schema = log_table.schema();
base = cdc::get_base_table(db, *schema);
base = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
}
@@ -843,7 +826,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
dht::partition_range_vector partition_ranges{ dht::partition_range::make_singular(dht::decorate_key(*schema, pk)) };
auto high_ts = db_clock::now() - confidence_interval(db);
auto high_uuid = utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch().count());
auto high_uuid = utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch());
auto lo = clustering_key_prefix::from_exploded(*schema, { iter.threshold.serialize() });
auto hi = clustering_key_prefix::from_exploded(*schema, { high_uuid.serialize() });
@@ -855,16 +838,18 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
auto key_names = boost::copy_range<std::unordered_set<std::string>>(
auto key_names = boost::copy_range<attrs_to_get>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
| boost::adaptors::transformed([&] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([&] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
// Include all base table columns as values (in case pre or post is enabled).
// This will include attributes not stored in the frozen map column
auto attr_names = boost::copy_range<std::unordered_set<std::string>>(base->regular_columns()
auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
// this will include the :attrs column, which we will also force evaluating.
// But not having this set empty forces out any cdc columns from actual result
| boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
std::vector<const column_definition*> columns;
@@ -933,13 +918,13 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::set(record, "dynamodb", std::move(dynamodb));
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
// TODO: awsRegion?
rjson::set(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::set(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
@@ -954,10 +939,10 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::set(dynamodb, "Keys", std::move(keys));
rjson::set(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::set(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::set(dynamodb, "StreamViewType", type);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
//TODO: SizeInBytes
}
@@ -989,17 +974,17 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, true);
describe_single_item(*selection, row, key_names, item);
rjson::set(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::set(record, "eventName", "MODIFY");
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::set(record, "eventName", "INSERT");
rjson::add(record, "eventName", "INSERT");
break;
default:
rjson::set(record, "eventName", "REMOVE");
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
@@ -1013,7 +998,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::set(ret, "Records", std::move(records));
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
@@ -1022,13 +1007,15 @@ future<executor::request_return_type> executor::get_records(client_state& client
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice the end of shard and not return NextShardIterator.
rjson::set(ret, "NextShardIterator", next_iter);
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.add(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
}
// ugh. figure out if we are at end-of-shard
return cdc::get_local_streams_timestamp().then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret), nrecords](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
@@ -1041,23 +1028,27 @@ future<executor::request_return_type> executor::get_records(client_state& client
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch().count()), true);
rjson::set(ret, "NextShardIterator", iter);
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.add(std::chrono::steady_clock::now() - start_time);
// TODO: determine a better threshold...
if (nrecords > 10) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
});
});
}
void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder) const {
void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");
if (!stream_enabled || !stream_enabled->IsBool()) {
throw api_error::validation("StreamSpecification needs boolean StreamEnabled");
}
if (stream_enabled->GetBool()) {
auto& db = _proxy.get_db().local();
auto db = sp.data_dictionary();
if (!db.features().cluster_supports_cdc()) {
throw api_error::validation("StreamSpecification: streams (CDC) feature not enabled in cluster.");
@@ -1094,17 +1085,17 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
}
}
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema) const {
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp) {
auto& opts = schema.cdc_options();
if (opts.enabled()) {
auto& db = _proxy.get_db().local();
auto& cf = db.find_column_family(schema.ks_name(), cdc::log_name(schema.cf_name()));
auto db = sp.data_dictionary();
auto cf = db.find_table(schema.ks_name(), cdc::log_name(schema.cf_name()));
stream_arn arn(cf.schema()->id());
rjson::set(descr, "LatestStreamArn", arn);
rjson::set(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));
rjson::add(descr, "LatestStreamArn", arn);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));
auto stream_desc = rjson::empty_object();
rjson::set(stream_desc, "StreamEnabled", true);
rjson::add(stream_desc, "StreamEnabled", true);
auto mode = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
@@ -1114,8 +1105,8 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s
} else if (opts.postimage()) {
mode = stream_view_type::NEW_IMAGE;
}
rjson::set(stream_desc, "StreamViewType", mode);
rjson::set(descr, "StreamSpecification", std::move(stream_desc));
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
}
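
The `list_streams` change in this file sorts the tables by ID before paging, so that a listing resumed from `ExclusiveStartStreamArn` never returns the same entry twice. A minimal sketch of that paging scheme follows, assuming plain `int` IDs in place of table UUIDs; `page` and `list_page` are hypothetical names. Note one deliberate difference: the diff locates the start key with `std::find_if`, while this sketch uses `std::upper_bound`, so a start key that has disappeared in the meantime still resumes at the next larger ID.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct page {
    std::vector<int> items;             // IDs returned in this page
    std::optional<int> last_evaluated;  // resume key for the next call, if any
};

// Returns up to `limit` IDs in ascending order, starting strictly after
// `exclusive_start` when it is given.
page list_page(std::vector<int> ids, std::optional<int> exclusive_start, std::size_t limit) {
    assert(limit >= 1);  // mirrors "Limit must be 1 or more"
    std::sort(ids.begin(), ids.end());
    auto i = ids.begin();
    if (exclusive_start) {
        // First ID strictly greater than the resume key.
        i = std::upper_bound(ids.begin(), ids.end(), *exclusive_start);
    }
    page p;
    for (; i != ids.end() && p.items.size() < limit; ++i) {
        p.items.push_back(*i);
        p.last_evaluated = *i;
    }
    return p;
}
```

Because the order is total and stable across calls, an ID added between two pages is either picked up later (if it sorts after the resume key) or missed (if it sorts before it), but it can never cause a duplicate.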


@@ -1,22 +1,9 @@
/*
* Copyright 2019 ScyllaDB
* Copyright 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

783
alternator/ttl.cc Normal file

@@ -0,0 +1,783 @@
/*
* Copyright 2021-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <chrono>
#include <cstdint>
#include <optional>
#include <seastar/core/sstring.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <boost/multiprecision/cpp_int.hpp>
#include "gms/gossiper.hh"
#include "gms/inet_address.hh"
#include "inet_address_vectors.hh"
#include "locator/abstract_replication_strategy.hh"
#include "log.hh"
#include "gc_clock.hh"
#include "replica/database.hh"
#include "service_permit.hh"
#include "timestamp.hh"
#include "service/storage_proxy.hh"
#include "service/pager/paging_state.hh"
#include "service/pager/query_pagers.hh"
#include "gms/feature_service.hh"
#include "sstables/types.hh"
#include "mutation.hh"
#include "types.hh"
#include "types/map.hh"
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"
#include "utils/fb_utilities.hh"
#include "cql3/selection/selection.hh"
#include "cql3/values.hh"
#include "cql3/query_options.hh"
#include "cql3/column_identifier.hh"
#include "alternator/executor.hh"
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "dht/sharder.hh"
#include "ttl.hh"
static logging::logger tlogger("alternator_ttl");
namespace alternator {
// We write the expiration-time attribute enabled on a table using a
// tag TTL_TAG_KEY.
// Currently, the *value* of this tag is simply the name of the attribute,
// and the expiration scanner interprets it as an Alternator attribute name -
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column. Although this is designed for Alternator, it may
// be good enough for CQL as well (there, the ":attrs" column won't exist).
static const sstring TTL_TAG_KEY("system:ttl_attribute");
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.data_dictionary().features().cluster_supports_alternator_ttl()) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
}
schema_ptr schema = get_table(_proxy, request);
rjson::value* spec = rjson::find(request, "TimeToLiveSpecification");
if (!spec || !spec->IsObject()) {
co_return api_error::validation("UpdateTimeToLive missing mandatory TimeToLiveSpecification");
}
const rjson::value* v = rjson::find(*spec, "Enabled");
if (!v || !v->IsBool()) {
co_return api_error::validation("UpdateTimeToLive requires boolean Enabled");
}
bool enabled = v->GetBool();
v = rjson::find(*spec, "AttributeName");
if (!v || !v->IsString()) {
co_return api_error::validation("UpdateTimeToLive requires string AttributeName");
}
// Although the DynamoDB documentation specifies that attribute names
// should be between 1 and 64K bytes, in practice, it only allows
// between 1 and 255 bytes. There are no other limitations on which
// characters are allowed in the name.
if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {
co_return api_error::validation("The length of AttributeName must be between 1 and 255");
}
sstring attribute_name(v->GetString(), v->GetStringLength());
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
co_return api_error::validation("TTL is already enabled");
}
tags_map[TTL_TAG_KEY] = attribute_name;
} else {
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
co_return api_error::validation("TTL is already disabled");
} else if (i->second != attribute_name) {
co_return api_error::validation(format(
"Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",
attribute_name, i->second));
}
tags_map.erase(TTL_TAG_KEY);
}
co_await update_tags(_mm, schema, std::move(tags_map));
// Prepare the response, which contains a TimeToLiveSpecification
// basically identical to the request's
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveSpecification", std::move(*spec));
co_return make_jsonable(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.describe_time_to_live++;
schema_ptr schema = get_table(_proxy, request);
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
rjson::value desc = rjson::empty_object();
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
rjson::add(desc, "TimeToLiveStatus", "DISABLED");
} else {
rjson::add(desc, "TimeToLiveStatus", "ENABLED");
rjson::add(desc, "AttributeName", rjson::from_string(i->second));
}
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveDescription", std::move(desc));
co_return make_jsonable(std::move(response));
}
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via an UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
// An expiration thread on each shard periodically scans the items (i.e.,
// rows) owned by this shard, looking for items whose chosen expiration-time
// attribute indicates they are expired, and deletes those items.
// The expiration-time "attribute" can be either an actual Scylla column
// (must be numeric) or an Alternator "attribute" - i.e., an element in
// the ATTRS_COLUMN_NAME map<utf8,bytes> column where the numeric expiration
// time is encoded in DynamoDB's JSON encoding inside the bytes value.
// To avoid scanning the same items RF times in RF replicas, only one node is
// responsible for scanning a token range at a time. Normally, this is the
// node owning this range as a "primary range" (the first node in the ring
// with this range), but when this node is down, other nodes may take over
// (FIXME: this is not implemented yet).
// An expiration thread is responsible for all tables which need expiration
// scans. FIXME: explain how this is done with multiple tables - parallel,
// staggered, or what?
// The expiration thread scans items using CL=QUORUM to ensure that it reads
// a consistent expiration-time attribute. This means that the items are read
// locally and in addition QUORUM-1 additional nodes (one additional node
// when RF=3) need to read the data and send digests.
// FIXME: explain if we can read the exact attribute or the entire map.
// When the expiration thread decides that an item has expired and wants
// to delete it, it does it using a CL=QUORUM write. This allows this
// deletion to be visible for consistent (quorum) reads. The deletion,
// like user deletions, will also appear on the CDC log and therefore
// Alternator Streams if enabled (FIXME: explain how we mark the
// deletion different from user deletes. We don't do it yet.).
expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy)
: _db(db)
, _proxy(proxy)
{
//FIXME: add metrics for the service
//setup_metrics();
}
// Convert the big_decimal used to represent expiration time to an integer.
// Any fractional part is dropped. If the number is negative or invalid,
// 0 is returned, and if it's too high, the maximum unsigned long is returned.
static unsigned long bigdecimal_to_ul(const big_decimal& bd) {
// The big_decimal format has an integer mantissa of arbitrary length
// "unscaled_value" and then a (power of 10) exponent "scale".
if (bd.unscaled_value() <= 0) {
return 0;
}
if (bd.scale() == 0) {
// The fast path, when the expiration time is an integer, scale==0.
return static_cast<unsigned long>(bd.unscaled_value());
}
// Because the mantissa can be of arbitrary length, we work on it
// as a string. TODO: find a less ugly algorithm.
auto str = bd.unscaled_value().str();
if (bd.scale() > 0) {
int len = str.length();
if (len < bd.scale()) {
return 0;
}
str = str.substr(0, len-bd.scale());
} else {
if (bd.scale() < -20) {
return std::numeric_limits<unsigned long>::max();
}
for (int i = 0; i < -bd.scale(); i++) {
str.push_back('0');
}
}
// strtoul() returns ULONG_MAX if the number is too large, or 0 if not
// a number.
return strtoul(str.c_str(), nullptr, 10);
}
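
The truncation performed by bigdecimal_to_ul() can be illustrated outside of Scylla. The following is only a minimal sketch: `truncate_decimal` is a hypothetical name, and the (digit string, scale) pair is an assumed stand-in for big_decimal's unscaled_value() and scale(), where the represented value is unscaled * 10^(-scale).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for bigdecimal_to_ul(): `digits` is the decimal
// representation of the mantissa and `scale` is the power-of-10 exponent,
// so the value is digits * 10^(-scale).
unsigned long truncate_decimal(std::string digits, int scale) {
    // Negative, zero or missing mantissa: treated as 0.
    if (digits.empty() || digits[0] == '-' || digits == "0") {
        return 0;
    }
    if (scale > 0) {
        // Positive scale: the last `scale` digits are the fractional part.
        const auto drop = static_cast<std::size_t>(scale);
        if (digits.size() <= drop) {
            return 0; // the value is below 1
        }
        digits.resize(digits.size() - drop);
    } else if (scale < 0) {
        // Negative scale: multiply by a power of 10 by appending zeros.
        if (scale < -20) {
            return ULONG_MAX; // certainly too large to represent
        }
        digits.append(static_cast<std::size_t>(-scale), '0');
    }
    // strtoul() saturates at ULONG_MAX when the number is too large.
    return std::strtoul(digits.c_str(), nullptr, 10);
}
```

Under these assumptions, the mantissa "12345" with scale 2 (that is, 123.45) truncates to 123, and the mantissa "5" with scale -3 becomes 5000.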
// The following is_expired() functions all check if an item with the given
// expiration time has expired, according to the DynamoDB API rules.
// The rules are:
// 1. If the expiration time attribute's value is not a number type,
// the item is not expired.
// 2. The expiration time is measured in seconds since the UNIX epoch.
// 3. If the expiration time is more than 5 years in the past, it is assumed
// to be malformed and ignored - and the item does not expire.
static bool is_expired(gc_clock::time_point expiration_time, gc_clock::time_point now) {
return expiration_time <= now &&
expiration_time > now - std::chrono::years(5);
}
static bool is_expired(const big_decimal& expiration_time, gc_clock::time_point now) {
unsigned long t = bigdecimal_to_ul(expiration_time);
// We assume - and the assumption turns out to be correct - that the
// epoch of gc_clock::time_point and the one used by the DynamoDB protocol
// are the same (the UNIX epoch in UTC). The resolution (seconds) is also
// the same.
return is_expired(gc_clock::time_point(gc_clock::duration(std::chrono::seconds(t))), now);
}
static bool is_expired(const rjson::value& expiration_time, gc_clock::time_point now) {
std::optional<big_decimal> n = try_unwrap_number(expiration_time);
return n && is_expired(*n, now);
}
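
The expiration rules listed above the is_expired() overloads can be tried out in isolation. This is a hedged sketch using plain std::chrono seconds instead of gc_clock; `is_expired_sketch` and `five_years` are hypothetical names. The constant 31556952 is the number of seconds in one std::chrono::years unit (an average Gregorian year), written out so that the sketch does not require C++20.

```cpp
#include <cassert>
#include <chrono>

using seconds = std::chrono::seconds;

// Five average Gregorian years, the same length as std::chrono::years(5).
const seconds five_years{5LL * 31556952LL};

// Both arguments are seconds since the UNIX epoch.
// An item is expired if its expiration time has already passed, but not
// more than five years ago (older values are assumed to be malformed).
bool is_expired_sketch(seconds expiration, seconds now) {
    return expiration <= now && expiration > now - five_years;
}
```

With `now` set to 1700000000, an expiration time of 1699999990 counts as expired, while 1700000010 (in the future) and 1510658288 (about six years earlier) do not.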
// expire_item() expires an item - i.e., deletes it as appropriate for
// expiration - with CL=LOCAL_QUORUM and (FIXME!) in a way Alternator Streams
// understands it is an expiration event - not a user-initiated deletion.
static future<> expire_item(service::storage_proxy& proxy,
const service::query_state& qs,
const std::vector<bytes_opt>& row,
schema_ptr schema,
api::timestamp_type ts) {
// Prepare the row key to delete
// NOTICE: the order of columns is guaranteed by the fact that selection::wildcard
// is used, which indicates that columns appear in the order defined by
// schema::all_columns_in_select_order() - partition key columns go first,
// immediately followed by clustering key columns
std::vector<bytes> exploded_pk;
const unsigned pk_size = schema->partition_key_size();
const unsigned ck_size = schema->clustering_key_size();
for (unsigned c = 0; c < pk_size; ++c) {
const auto& row_c = row[c];
if (!row_c) {
// This shouldn't happen - all key columns must have values.
// But if it ever happens, let's just *not* expire the item.
// FIXME: log or increment a metric if this happens.
return make_ready_future<>();
}
exploded_pk.push_back(*row_c);
}
auto pk = partition_key::from_exploded(exploded_pk);
mutation m(schema, pk);
// If there's no clustering key, a tombstone should be created directly
// on a partition, not on a clustering row - otherwise it will look like
// an open-ended range tombstone, which will crash on KA/LA sstable format.
// See issue #6035
if (ck_size == 0) {
m.partition().apply(tombstone(ts, gc_clock::now()));
} else {
std::vector<bytes> exploded_ck;
for (unsigned c = pk_size; c < pk_size + ck_size; ++c) {
const auto& row_c = row[c];
if (!row_c) {
// This shouldn't happen - all key columns must have values.
// But if it ever happens, let's just *not* expire the item.
// FIXME: log or increment a metric if this happens.
return make_ready_future<>();
}
exploded_ck.push_back(*row_c);
}
auto ck = clustering_key::from_exploded(exploded_ck);
m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));
}
return proxy.mutate(std::vector<mutation>{std::move(m)},
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(), // FIXME - which timeout?
qs.get_trace_state(), qs.get_permit());
}
static size_t random_offset(size_t min, size_t max) {
static thread_local std::default_random_engine re{std::random_device{}()};
std::uniform_int_distribution<size_t> dist(min, max);
return dist(re);
}
// Get a list of secondary token ranges for the given node, and the primary
// node responsible for each of these token ranges.
// A "secondary range" is a range of tokens where for each token, the second
// node (in ring order) out of the RF replicas that hold this token is the
// given node.
// In the expiration scanner, we want to scan a secondary range but only if
// this range's primary node is down. For this we need to return not just
// a list of this node's secondary ranges - but also the primary owner of
// each of those ranges.
static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary_ranges(
const locator::effective_replication_map_ptr& erm,
gms::inet_address ep) {
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
std::vector<std::pair<dht::token_range, gms::inet_address>> ret;
if (sorted_tokens.empty()) {
on_internal_error(tlogger, "Token metadata is empty");
}
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
inet_address_vector_replica_set eps = erm->get_natural_endpoints(tok);
if (eps.size() <= 1 || eps[1] != ep) {
prev_tok = tok;
continue;
}
// Add the range (prev_tok, tok] to ret. However, if the range wraps
// around, split it to two non-wrapping ranges.
if (prev_tok < tok) {
ret.emplace_back(
dht::token_range{
dht::token_range::bound(prev_tok, false),
dht::token_range::bound(tok, true)},
eps[0]);
} else {
ret.emplace_back(
dht::token_range{
dht::token_range::bound(prev_tok, false),
std::nullopt},
eps[0]);
ret.emplace_back(
dht::token_range{
std::nullopt,
dht::token_range::bound(tok, true)},
eps[0]);
}
prev_tok = tok;
}
return ret;
}
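
The wrap-around handling in get_secondary_ranges() can be shown with plain integers. This is a simplified sketch, not the Scylla types: `simple_range` and `ring_ranges` are hypothetical, tokens are modeled as `long`, an empty optional stands for an open-ended bound, and the replica-ownership filter is left out so that every range (prev_tok, tok] is emitted.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical range type: (start, end], where an empty optional means
// the range is unbounded on that side.
struct simple_range {
    std::optional<long> start_exclusive;
    std::optional<long> end_inclusive;
};

// Walk the sorted tokens of a ring and emit the range (prev, tok] that
// precedes each token. The range ending at the first token wraps around
// the ring, so it is split into two non-wrapping ranges.
std::vector<simple_range> ring_ranges(const std::vector<long>& sorted_tokens) {
    std::vector<simple_range> ret;
    if (sorted_tokens.empty()) {
        return ret;
    }
    long prev = sorted_tokens.back();
    for (long tok : sorted_tokens) {
        if (prev < tok) {
            ret.push_back({prev, tok});
        } else {
            ret.push_back({prev, std::nullopt}); // (prev, +inf)
            ret.push_back({std::nullopt, tok});  // (-inf, tok]
        }
        prev = tok;
    }
    return ret;
}
```

For the tokens 10, 20 and 30 this yields four ranges: (30, +inf), (-inf, 10], (10, 20] and (20, 30].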
// A class for iterating over all the token ranges *owned* by this shard.
// To avoid code duplication, it is a template with two distinct cases -
// <primary> and <secondary>:
//
// In the <primary> case, we consider a token *owned* by this shard if:
// 1. This node is a replica for this token.
// 2. Moreover, this node is the *primary* replica of the token (i.e., the
// first replica in the ring).
// 3. In this node, this shard is responsible for this token.
// We will use this definition of which shard in the cluster owns which tokens
// to split the expiration scanner's work between all the shards of the
// system.
//
// In the <secondary> case, we consider a token *owned* by this shard if:
// 1. This node is the *secondary* replica for this token (i.e., the second
// replica in the ring).
// 2. The primary replica for this token is currently marked down.
// 3. In this node, this shard is responsible for this token.
// We use the <secondary> case to handle the possibility that some of the
// nodes in the system are down. A dead node will not be expiring
// the tokens owned by it, so we want the secondary owner to take over its
// primary ranges.
//
// FIXME: need to decide how to choose primary ranges in multi-DC setup!
// We could call get_primary_ranges_within_dc() below instead of get_primary_ranges().
// NOTICE: Iteration currently starts from a random token range in order to improve
// the chances of covering all ranges during a scan when restarts occur.
// A more deterministic way would be to regularly persist the scanning state,
// but that incurs overhead that we want to avoid if not needed.
enum primary_or_secondary_t {primary, secondary};
template<primary_or_secondary_t primary_or_secondary>
class token_ranges_owned_by_this_shard {
template<primary_or_secondary_t> class ranges_holder;
// ranges_holder<primary> holds just the primary ranges themselves
template<> class ranges_holder<primary> {
const dht::token_range_vector _token_ranges;
public:
ranges_holder(const locator::effective_replication_map_ptr& erm, gms::inet_address ep)
: _token_ranges(erm->get_primary_ranges(ep)) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i];
}
bool should_skip(std::size_t i) const {
return false;
}
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
template<> class ranges_holder<secondary> {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
gms::gossiper& _gossiper;
public:
ranges_holder(const locator::effective_replication_map_ptr& erm, gms::inet_address ep)
: _token_ranges(get_secondary_ranges(erm, ep))
, _gossiper(gms::get_local_gossiper()) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
}
// range i should be skipped if its primary owner is alive.
bool should_skip(std::size_t i) const {
return _gossiper.is_alive(_token_ranges[i].second);
}
};
schema_ptr _s;
// _token_ranges will contain a list of token ranges owned by this node.
// We'll further need to split each such range to the pieces owned by
// the current shard, using _intersecter.
const ranges_holder<primary_or_secondary> _token_ranges;
// NOTICE: _range_idx is used modulo _token_ranges size when accessing
// the data to ensure that it doesn't go out of bounds
size_t _range_idx;
size_t _end_idx;
std::optional<dht::selective_token_range_sharder> _intersecter;
public:
token_ranges_owned_by_this_shard(replica::database& db, schema_ptr s)
: _s(s)
, _token_ranges(db.find_keyspace(s->ks_name()).get_effective_replication_map(),
utils::fb_utilities::get_broadcast_address())
, _range_idx(random_offset(0, _token_ranges.size() - 1))
, _end_idx(_range_idx + _token_ranges.size())
{
tlogger.debug("Generating token ranges starting from base range {} of {}", _range_idx, _token_ranges.size());
}
// Return the next token_range owned by this shard, or nullopt when the
// iteration ends.
std::optional<dht::token_range> next() {
// We may need three or more iterations in the following loop if a
// vnode doesn't intersect with the given shard at all (such a small
// vnode is unlikely, but possible). The loop cannot be infinite
// because each iteration of the loop advances _range_idx.
for (;;) {
if (_intersecter) {
std::optional<dht::token_range> ret = _intersecter->next();
if (ret) {
return ret;
}
// done with this range, go to next one
++_range_idx;
_intersecter = std::nullopt;
}
if (_range_idx == _end_idx) {
return std::nullopt;
}
// If should_skip(), the range should be skipped. This happens for
// a secondary range whose primary owning node is still alive.
while (_token_ranges.should_skip(_range_idx % _token_ranges.size())) {
++_range_idx;
if (_range_idx == _end_idx) {
return std::nullopt;
}
}
_intersecter.emplace(_s->get_sharder(), _token_ranges[_range_idx % _token_ranges.size()], this_shard_id());
}
}
// Same as next(), just return a partition_range instead of token_range
std::optional<dht::partition_range> next_partition_range() {
std::optional<dht::token_range> ret = next();
if (ret) {
return dht::to_partition_range(*ret);
} else {
return std::nullopt;
}
}
};
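
The index handling in token_ranges_owned_by_this_shard (start at a random range, count up to start + size, and access elements modulo size) can be sketched without the sharder. `wrapping_index_iterator` is a hypothetical simplification: it yields indexes rather than token ranges, and the `skip` callback plays the role of should_skip().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Visits every index in [0, size) exactly once, beginning at `start` and
// wrapping around, and leaves out indexes for which skip(i) returns true.
class wrapping_index_iterator {
    std::size_t _size;
    std::size_t _idx;   // grows monotonically, used modulo _size
    std::size_t _end;   // start + size: one full lap around the list
    std::function<bool(std::size_t)> _skip;
public:
    wrapping_index_iterator(std::size_t size, std::size_t start,
                            std::function<bool(std::size_t)> skip)
        : _size(size), _idx(start), _end(start + size), _skip(std::move(skip)) {}

    // Returns the next index to process, or nullopt when the lap is done.
    std::optional<std::size_t> next() {
        while (_idx != _end) {
            const std::size_t i = _idx++ % _size;
            if (!_skip(i)) {
                return i;
            }
        }
        return std::nullopt;
    }
};
```

With four ranges, a start index of 2 and a callback that skips index 3, the iterator returns 2, 0 and 1, and then reports the end of the iteration.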
// Precomputed information needed to perform a scan on partition ranges
struct scan_ranges_context {
schema_ptr s;
bytes column_name;
std::optional<std::string> member;
::shared_ptr<cql3::selection::selection> selection;
std::unique_ptr<service::query_state> query_state_ptr;
std::unique_ptr<cql3::query_options> query_options;
::lw_shared_ptr<query::read_command> command;
scan_ranges_context(schema_ptr s, service::storage_proxy& proxy, bytes column_name, std::optional<std::string> member)
: s(s)
, column_name(column_name)
, member(member)
{
// FIXME: don't read the entire items - read only parts of it.
// We must read the key columns (to be able to delete) and also
// the requested attribute. If the requested attribute is a map's
// member we may be forced to read the entire map - but it would
// be good if we can read only the single item of the map - it
// should be possible (and a must for issue #7751!).
lw_shared_ptr<service::pager::paging_state> paging_state = nullptr;
auto regular_columns = boost::copy_range<query::column_id_vector>(
s->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
selection = cql3::selection::selection::wildcard(s);
query::partition_slice::option_set opts = selection->get_query_options();
opts.set<query::partition_slice::option::allow_short_read>();
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, empty_service_permit());
// FIXME: What should we do on multi-DC? Will we run the expiration on the same ranges on all
// DCs or only once for each range? If the latter, we need to change the CLs in the
// scanner and deleter.
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
query_options = std::make_unique<cql3::query_options>(cl, std::vector<cql3::raw_value>{});
query_options = std::make_unique<cql3::query_options>(std::move(query_options), std::move(paging_state));
}
};
// Scan data in a list of token ranges in one table, looking for expired
// items and deleting them.
// Because of issue #9167, partition_ranges must have a single partition
// for this code to work correctly.
static future<> scan_table_ranges(
service::storage_proxy& proxy,
const scan_ranges_context& scan_ctx,
dht::partition_range_vector&& partition_ranges,
abort_source& abort_source,
named_semaphore& page_sem)
{
const schema_ptr& s = scan_ctx.s;
assert (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,
*scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);
while (!p->is_exhausted()) {
if (abort_source.abort_requested()) {
co_return;
}
auto units = co_await get_units(page_sem, 1);
// We don't need to limit page size in number of rows because there is a
// builtin limit of the page's size in bytes. Setting this limit to 1
// is useful for debugging the paging code with moderate-size data.
uint32_t limit = std::numeric_limits<uint32_t>::max();
// FIXME: which timeout?
// FIXME: if read times out, need to retry it.
std::unique_ptr<cql3::result_set> rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
auto rows = rs->rows();
auto meta = rs->get_metadata().get_names();
std::optional<unsigned> expiration_column;
for (unsigned i = 0; i < meta.size(); i++) {
const cql3::column_specification& col = *meta[i];
if (col.name->name() == scan_ctx.column_name) {
expiration_column = i;
break;
}
}
if (!expiration_column) {
continue;
}
for (const auto& row : rows) {
const bytes_opt& cell = row[*expiration_column];
if (!cell) {
continue;
}
auto v = meta[*expiration_column]->type->deserialize(*cell);
bool expired = false;
// FIXME: don't recalculate "now" all the time
auto now = gc_clock::now();
if (scan_ctx.member) {
// In this case, the expiration-time attribute we're
// looking for is a member in a map, saved serialized
// into bytes using Alternator's serialization (basically
// a JSON serialized into bytes)
// FIXME: is it possible to find a specific member of a map
// without iterating through it like we do here and compare
// the key?
for (const auto& entry : value_cast<map_type_impl::native_type>(v)) {
std::string attr_name = value_cast<sstring>(entry.first);
if (value_cast<sstring>(entry.first) == *scan_ctx.member) {
bytes value = value_cast<bytes>(entry.second);
rjson::value json = deserialize_item(value);
expired = is_expired(json, now);
break;
}
}
} else {
// For a real column to contain an expiration time, it
// must be a numeric type.
// FIXME: Currently we only support decimal_type (which is
// what Alternator uses), but other numeric types can be
// supported as well to make this feature more useful in CQL.
// Note that kind::decimal is also checked above.
big_decimal n = value_cast<big_decimal>(v);
expired = is_expired(n, now);
}
if (expired) {
// FIXME: maybe don't recalculate new_timestamp() all the time
// FIXME: if expire_item() throws on timeout, we need to retry it.
auto ts = api::new_timestamp();
co_await expire_item(proxy, *scan_ctx.query_state_ptr, row, s, ts);
}
}
// FIXME: once in a while, persist p->state(), so on reboot
// we don't start from scratch.
}
}
// scan_table() scans data in one table "owned" by this shard, looking for
// expired items and deleting them.
// We consider each node to "own" its primary token ranges, i.e., the tokens
// for which this node is the first replica in the ring. Inside the node, each
// shard "owns" subranges of the node's token ranges - according to the node's
// sharding algorithm.
// When a node goes down, the token ranges owned by it will not be scanned
// and items in those token ranges will not expire, so in the future (FIXME)
// this function should additionally work on token ranges whose primary owner
// is down and this node is the range's secondary owner.
// If the TTL (expiration-time scanning) feature is not enabled for this
// table, scan_table() returns false without doing anything. Remember that the
// TTL feature may be enabled later so this function will need to be called
// again when the feature is enabled.
// Currently this function scans the entire table (or, rather the parts owned
// by this shard) at full rate, once. In the future (FIXME) we should consider
// how to pace this scan, how and when to repeat it, how to interleave or
// parallelize scanning of multiple tables, and how to continue scans after a
// reboot.
static future<bool> scan_table(
service::storage_proxy& proxy,
data_dictionary::database db,
schema_ptr s,
abort_source& abort_source,
named_semaphore& page_sem)
{
// Check if an expiration-time attribute is enabled for this table.
// If not, just return false immediately.
std::optional<std::string> attribute_name = find_tag(*s, TTL_TAG_KEY);
if (!attribute_name) {
co_return false;
}
// attribute_name may be one of the schema's columns (in Alternator, this
// means it's a key column), or an element in Alternator's attrs map
// encoded in Alternator's JSON encoding.
// FIXME: To make this less Alternator-specific, we should encode in the
// single key's value three things:
// 1. The name of a column
// 2. Optionally if column is a map, a member in the map
// 3. The deserializer for the value: CQL or Alternator (JSON).
// The deserializer can be guessed: If the given column or map item is
// numeric, it can be used directly. If it is a "bytes" type, it needs to
// be deserialized using Alternator's deserializer.
bytes column_name = to_bytes(*attribute_name);
const column_definition *cd = s->get_column_definition(column_name);
std::optional<std::string> member;
if (!cd) {
member = std::move(attribute_name);
column_name = bytes(executor::ATTRS_COLUMN_NAME);
cd = s->get_column_definition(column_name);
tlogger.info("table {} TTL enabled with attribute {} in {}", s->cf_name(), *member, executor::ATTRS_COLUMN_NAME);
} else {
tlogger.info("table {} TTL enabled with attribute {}", s->cf_name(), *attribute_name);
}
if (!cd) {
tlogger.info("table {} TTL column is missing, not scanning", s->cf_name());
co_return false;
}
data_type column_type = cd->type;
// Verify that the column has the right type: If "member" exists
// the column must be a map, and if it doesn't, the column must
// (currently) be a decimal_type. If the column has the wrong type
// nothing can get expired in this table, and it's pointless to
// scan it.
if ((member && column_type->get_kind() != abstract_type::kind::map) ||
(!member && column_type->get_kind() != abstract_type::kind::decimal)) {
tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());
co_return false;
}
// FIXME: need to pace the scan, not do it all at once.
// FIXME: consider if we should ask the scan without caching?
// can we use cache but not fill it?
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), s);
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
// them into one partition_range_vector.
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
// FIXME: if scanning a single range fails, including network errors,
// we fail the entire scan (and rescan from the beginning). Need to
// reconsider this. Saving the scan position might be a good enough
// solution for this problem.
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem);
}
// If each node only scans its own primary ranges, then when any node is
// down part of the token range will not get scanned. This can be viewed
// as acceptable (when the node comes back online, it will resume its scan),
// but as noted in issue #9787, we can allow more prompt expiration
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), s);
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem);
}
co_return true;
}
future<> expiration_service::run() {
// FIXME: don't just tight-loop, think about timing, pace, and
// store position in durable storage, etc.
// FIXME: think about working on different tables in parallel.
// also need to notice when a new table is added, a table is
// deleted or when ttl is enabled or disabled for a table!
for (;;) {
// _db.get_tables() may change under our feet during a
// long-living loop, so we must keep our own copy of the list of
// schemas.
std::vector<schema_ptr> schemas;
for (auto cf : _db.get_tables()) {
schemas.push_back(cf.schema());
}
for (schema_ptr s : schemas) {
co_await coroutine::maybe_yield();
if (shutting_down()) {
co_return;
}
try {
co_await scan_table(_proxy, _db, s, _abort_source, _page_sem);
} catch (...) {
// The scan of a table may fail in the middle for many
// reasons, including network failure and even the table
// being removed. We'll continue scanning this table later
// (if it still exists). In any case it's important to catch
// the exception and not let the scanning service die for
// good.
// If the table has been deleted, it is expected that the scan
// will fail at some point, and even a warning is excessive.
if (_db.has_schema(s->ks_name(), s->cf_name())) {
tlogger.warn("table {}.{} expiration scan failed: {}",
s->ks_name(), s->cf_name(), std::current_exception());
} else {
tlogger.info("expiration scan failed when table {}.{} was deleted",
s->ks_name(), s->cf_name());
}
}
}
// FIXME: replace this silly 1-second sleep by something smarter.
try {
co_await seastar::sleep_abortable(std::chrono::seconds(1), _abort_source);
} catch(seastar::sleep_aborted&) {}
}
}
future<> expiration_service::start() {
// Called by main() on each shard to start the expiration-service
// thread. Just runs run() in the background and allows stop().
if (_db.features().cluster_supports_alternator_ttl()) {
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
}
return make_ready_future<>();
}
future<> expiration_service::stop() {
if (_abort_source.abort_requested()) {
throw std::logic_error("expiration_service::stop() called a second time");
}
_abort_source.request_abort();
if (!_end) {
// If _end was not set, start() was never called
return make_ready_future<>();
}
return std::move(*_end);
}
} // namespace alternator

alternator/ttl.hh Normal file

@@ -0,0 +1,57 @@
/*
* Copyright 2021-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "seastarx.hh"
#include <seastar/core/sharded.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/core/semaphore.hh>
#include "data_dictionary/data_dictionary.hh"
namespace replica {
class database;
}
namespace service {
class storage_proxy;
}
namespace alternator {
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via an UpdateTimeToLive request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
data_dictionary::database _db;
service::storage_proxy& _proxy;
// _end is set by start(), and resolves when the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
std::optional<future<>> _end;
abort_source _abort_source;
// Ensures that at most 1 page of scan results at a time is processed by the TTL service
named_semaphore _page_sem{1, named_semaphore_exception_factory{"alternator_ttl"}};
bool shutting_down() { return _abort_source.abort_requested(); }
public:
// sharded_service<expiration_service>::start() creates this object on
// all shards, so calls this constructor on each shard. Later, the
// additional start() function should be invoked on all shards.
expiration_service(data_dictionary::database, service::storage_proxy&);
future<> start();
future<> run();
// sharded_service<expiration_service>::stop() calls the following stop()
// method on each shard. This stop() asks the service on this shard to
// shut down as quickly as it can. The returned future indicates when the
// service is no longer running.
// stop() may be called even before start(), but may only be called once -
// calling it twice will result in an exception.
future<> stop();
};
} // namespace alternator


@@ -89,7 +89,7 @@
"description":"true if the output of the major compaction should be split in several sstables",
"required":false,
"allowMultiple":false,
"type":"bool",
"type":"boolean",
"paramType":"query"
}
]


@@ -102,7 +102,47 @@
"parameters":[
{
"name":"type",
"description":"the type of compaction to stop. Can be one of: - COMPACTION - VALIDATION - CLEANUP - SCRUB - INDEX_BUILD",
"description":"The type of compaction to stop. Can be one of: COMPACTION | CLEANUP | SCRUB | UPGRADE | RESHAPE",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/compaction_manager/stop_keyspace_compaction/{keyspace}",
"operations":[
{
"method":"POST",
"summary":"Stop all running compaction-like tasks in the given keyspace and tables having the provided type.",
"type":"void",
"nickname":"stop_keyspace_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace to stop compaction in",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"tables",
"description":"Comma-seperated tables to stop compaction in",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type",
"description":"The type of compaction to stop. Can be one of: COMPACTION | CLEANUP | SCRUB | UPGRADE | RESHAPE",
"required":true,
"allowMultiple":false,
"type":"string",


@@ -7,6 +7,61 @@
"application/json"
],
"apis":[
{
"path":"/hinted_handoff/sync_point",
"operations":[
{
"method":"POST",
"summary":"Creates a hints sync point. It can be used to wait until hints between given nodes are replayed. A sync point allows you to wait for hints accumulated at the moment of its creation - it won't wait for hints generated later. A sync point is described entirely by its ID - there is no state kept server-side, so there is no need to delete it.",
"type":"string",
"nickname":"create_hints_sync_point",
"produces":[
"application/json"
],
"parameters":[
{
"name":"target_hosts",
"description":"A list of nodes towards which hints should be replayed. Multiple hosts can be listed by separating them with commas. If not provided or empty, the point will resolve when current hints towards all nodes in the cluster are sent.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Get the status of a hints sync point, possibly waiting for it to be reached.",
"type":"string",
"enum":[
"DONE",
"IN_PROGRESS"
],
"nickname":"get_hints_sync_point",
"produces":[
"application/json"
],
"parameters":[
{
"name":"id",
"description":"The ID of the hint sync point which should be checked or waited on",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the query returns even if hints are still being replayed. No value or 0 will cause the query to return immediately. A negative value will cause the query to wait until the sync point is reached",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
},
{
"path":"/hinted_handoff/hints",
"operations":[
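
Note on the hints sync point API: the two operations added above are meant to be used as a pair. A POST creates a sync point and returns its ID, and a GET with that ID reports `DONE` or `IN_PROGRESS`, optionally waiting up to `timeout` seconds. The sketch below only builds the two request URLs. The helper names and the base URL are hypothetical; the parameter names and the timeout semantics are the ones from the descriptions above.

```python
from urllib.parse import urlencode

SYNC_POINT_PATH = "/hinted_handoff/sync_point"


def create_sync_point_url(base_url, target_hosts=()):
    """URL for POST /hinted_handoff/sync_point (hypothetical helper).

    With no target_hosts the parameter is omitted, which per the API
    description means "hints towards all nodes in the cluster".
    """
    url = base_url.rstrip("/") + SYNC_POINT_PATH
    if target_hosts:
        # Multiple hosts are passed as one comma-separated string.
        url += "?" + urlencode({"target_hosts": ",".join(target_hosts)})
    return url


def wait_sync_point_url(base_url, sync_point_id, timeout_s=None):
    """URL for GET /hinted_handoff/sync_point (hypothetical helper).

    timeout_s follows the description above: None or 0 returns immediately,
    a negative value waits until the sync point is reached.
    """
    query = {"id": sync_point_id}
    if timeout_s is not None:
        query["timeout"] = int(timeout_s)
    return base_url.rstrip("/") + SYNC_POINT_PATH + "?" + urlencode(query)
```

Because the sync point is described entirely by its ID, no cleanup request is needed after the wait completes.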

@@ -76,7 +76,7 @@
"items":{
"type":"message_counter"
},
"nickname":"get_completed_messages",
"nickname":"get_replied_messages",
"produces":[
"application/json"
],
@@ -252,7 +252,7 @@
"UNUSED__STREAM_MUTATION",
"STREAM_MUTATION_DONE",
"COMPLETE_MESSAGE",
"REPAIR_CHECKSUM_RANGE",
"UNUSED__REPAIR_CHECKSUM_RANGE",
"GET_SCHEMA_VERSION"
]
}

@@ -104,6 +104,68 @@
}
]
},
{
"path":"/storage_service/toppartitions/",
"operations":[
{
"method":"GET",
"summary":"Toppartitions query",
"type":"toppartitions_query_results",
"nickname":"toppartitions_generic",
"produces":[
"application/json"
],
"parameters":[
{
"name":"table_filters",
"description":"Optional list of table name filters in keyspace:name format",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"keyspace_filters",
"description":"Optional list of keyspace filters",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"duration",
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"list_size",
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"capacity",
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/nodes/leaving",
"operations":[
@@ -562,7 +624,7 @@
},
{
"name":"kn",
"description":"Comma seperated keyspaces name to snapshot",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -570,11 +632,19 @@
},
{
"name":"cf",
"description":"the column family to snapshot",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"sf",
"description":"Skip flush. When set to \"true\", do not flush memtables before snapshotting (snapshot will not contain unflushed data)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -695,12 +765,44 @@
}
]
},
{
"path":"/storage_service/keyspace_offstrategy_compaction/{keyspace}",
"operations":[
{
"method":"POST",
"summary":"Perform offstrategy compaction, if needed, in a single keyspace",
"type":"boolean",
"nickname":"perform_keyspace_offstrategy_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace to operate on",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Comma-separated table names",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/keyspace_scrub/{keyspace}",
"operations":[
{
"method":"GET",
"summary":"Scrub (deserialize + reserialize at the latest version, skipping bad rows if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false",
"summary":"Scrub (deserialize + reserialize at the latest version, resolving corruptions if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false. Scrub has the following modes: Abort (default) - abort scrub if corruption is detected; Skip (same as `skip_corrupted=true`) skip over corrupt data, omitting them from the output; Segregate - segregate data into multiple sstables if needed, such that each sstable contains data with valid order; Validate - read (no rewrite) and validate data, logging any problems found.",
"type": "long",
"nickname":"scrub",
"produces":[
@@ -723,6 +825,33 @@
"type":"boolean",
"paramType":"query"
},
{
"name":"scrub_mode",
"description":"How to handle corrupt data (overrides 'skip_corrupted'); ",
"required":false,
"allowMultiple":false,
"type":"string",
"enum":[
"ABORT",
"SKIP",
"SEGREGATE",
"VALIDATE"
],
"paramType":"query"
},
{
"name":"quarantine_mode",
"description":"Controls whether to scrub quarantined sstables (default INCLUDE)",
"required":false,
"allowMultiple":false,
"type":"string",
"enum":[
"INCLUDE",
"EXCLUDE",
"ONLY"
],
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
@@ -970,6 +1099,14 @@
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"Which hosts are to ignore in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
@@ -1764,6 +1901,22 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"load_and_stream",
"description":"Load the sstables and stream to all replica nodes that own the data",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1874,6 +2027,14 @@
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"fast",
"description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -2372,6 +2533,10 @@
"threshold":{
"type":"long",
"description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"
},
"fast":{
"type":"boolean",
"description":"Is lightweight tracing mode enabled. In that mode tracing ignores events and tracks only sessions."
}
}
},
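
Note on the new scrub parameters: `scrub_mode` and `quarantine_mode` are both optional string enums on `/storage_service/keyspace_scrub/{keyspace}`. The sketch below shows how a client might validate them against the enum values declared above before building the GET URL. The helper name and the base URL are hypothetical, and the other scrub parameters are left out for brevity.

```python
from urllib.parse import quote, urlencode

# Enum values copied from the parameter declarations above.
SCRUB_MODES = ("ABORT", "SKIP", "SEGREGATE", "VALIDATE")
QUARANTINE_MODES = ("INCLUDE", "EXCLUDE", "ONLY")


def keyspace_scrub_url(base_url, keyspace, scrub_mode=None, quarantine_mode=None):
    """Build the GET URL for /storage_service/keyspace_scrub/{keyspace}.

    Hypothetical helper. Omitted parameters fall back to the defaults named
    in the API doc (ABORT for scrub_mode, INCLUDE for quarantine_mode).
    """
    query = {}
    if scrub_mode is not None:
        if scrub_mode not in SCRUB_MODES:
            raise ValueError(f"unknown scrub_mode: {scrub_mode!r}")
        query["scrub_mode"] = scrub_mode
    if quarantine_mode is not None:
        if quarantine_mode not in QUARANTINE_MODES:
            raise ValueError(f"unknown quarantine_mode: {quarantine_mode!r}")
        query["quarantine_mode"] = quarantine_mode
    url = base_url.rstrip("/") + "/storage_service/keyspace_scrub/" + quote(keyspace, safe="")
    return url + ("?" + urlencode(query) if query else "")
```

Since `scrub_mode` overrides `skip_corrupted`, a client that sets `scrub_mode` has no need to send the older flag as well.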

@@ -52,6 +52,22 @@
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[
{
"method":"POST",
"summary":"Drop in-memory caches for data which is in sstables",
"type":"void",
"nickname":"drop_sstable_caches",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/system/uptime_ms",
"operations":[

@@ -1,22 +1,9 @@
/*
* Copyright 2015 ScyllaDB
* Copyright 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api.hh"
@@ -49,7 +36,7 @@ namespace api {
static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {
try {
std::rethrow_exception(eptr);
} catch (const no_such_keyspace& ex) {
} catch (const replica::no_such_keyspace& ex) {
throw bad_param_exception(ex.what());
}
// We never going to get here
@@ -75,10 +62,10 @@ future<> set_server_init(http_context& ctx) {
});
}
future<> set_server_config(http_context& ctx) {
future<> set_server_config(http_context& ctx, const db::config& cfg) {
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([&ctx, rb02](routes& r) {
set_config(rb02, ctx, r);
return ctx.http_server.set_routes([&ctx, &cfg, rb02](routes& r) {
set_config(rb02, ctx, r, cfg);
});
}
@@ -109,12 +96,30 @@ future<> unset_rpc_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx) {
return register_api(ctx, "storage_service", "The storage service API", set_storage_service);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs] (http_context& ctx, routes& r) {
set_storage_service(ctx, r, ss, g.local(), cdc_gs);
});
}
future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms) {
return ctx.http_server.set_routes([&ctx, &ms] (routes& r) { set_repair(ctx, r, ms); });
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {
return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });
}
future<> unset_server_sstables_loader(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_sstables_loader(ctx, r); });
}
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb) {
return ctx.http_server.set_routes([&ctx, &vb] (routes& r) { set_view_builder(ctx, r, vb); });
}
future<> unset_server_view_builder(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_view_builder(ctx, r); });
}
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair) {
return ctx.http_server.set_routes([&ctx, &repair] (routes& r) { set_repair(ctx, r, repair); });
}
future<> unset_server_repair(http_context& ctx) {
@@ -133,9 +138,11 @@ future<> set_server_snitch(http_context& ctx) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);
}
future<> set_server_gossip(http_context& ctx) {
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g) {
return register_api(ctx, "gossiper",
"The gossiper API", set_gossiper);
"The gossiper API", [&g] (http_context& ctx, routes& r) {
set_gossiper(ctx, r, g.local());
});
}
future<> set_server_load_sstable(http_context& ctx) {
@@ -153,14 +160,22 @@ future<> unset_server_messaging_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_messaging_service(ctx, r); });
}
future<> set_server_storage_proxy(http_context& ctx) {
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss) {
return register_api(ctx, "storage_proxy",
"The storage proxy API", set_storage_proxy);
"The storage proxy API", [&ss] (http_context& ctx, routes& r) {
set_storage_proxy(ctx, r, ss);
});
}
future<> set_server_stream_manager(http_context& ctx) {
future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm) {
return register_api(ctx, "stream_manager",
"The stream manager API", set_stream_manager);
"The stream manager API", [&sm] (http_context& ctx, routes& r) {
set_stream_manager(ctx, r, sm);
});
}
future<> unset_server_stream_manager(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_stream_manager(ctx, r); });
}
future<> set_server_cache(http_context& ctx) {
@@ -168,13 +183,34 @@ future<> set_server_cache(http_context& ctx) {
"The cache service API", set_cache_service);
}
future<> set_server_gossip_settle(http_context& ctx) {
future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g) {
return register_api(ctx, "hinted_handoff",
"The hinted handoff API", [&g] (http_context& ctx, routes& r) {
set_hinted_handoff(ctx, r, g.local());
});
}
future<> unset_hinted_handoff(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_hinted_handoff(ctx, r); });
}
future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &g](routes& r) {
rb->register_function(r, "failure_detector",
"The failure detector API");
set_failure_detector(ctx, r, g.local());
});
}
future<> set_server_compaction_manager(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
rb->register_function(r, "failure_detector",
"The failure detector API");
set_failure_detector(ctx,r);
rb->register_function(r, "compaction_manager",
"The Compaction manager API");
set_compaction_manager(ctx, r);
});
}
@@ -182,18 +218,12 @@ future<> set_server_done(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
rb->register_function(r, "compaction_manager",
"The Compaction manager API");
set_compaction_manager(ctx, r);
rb->register_function(r, "lsa", "Log-structured allocator API");
set_lsa(ctx, r);
rb->register_function(r, "commitlog",
"The commit log API");
set_commitlog(ctx,r);
rb->register_function(r, "hinted_handoff",
"The hinted handoff API");
set_hinted_handoff(ctx, r);
rb->register_function(r, "collectd",
"The collectd API");
set_collectd(ctx, r);

@@ -1,22 +1,9 @@
/*
* Copyright 2015 ScyllaDB
* Copyright 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -29,6 +16,7 @@
#include <boost/units/detail/utility.hpp>
#include "api/api-doc/utils.json.hh"
#include "utils/histogram.hh"
#include "utils/estimated_histogram.hh"
#include <seastar/http/exception.hh>
#include "api_init.hh"
#include "seastarx.hh"
@@ -70,7 +58,7 @@ T map_sum(T&& dest, const S& src) {
for (auto i : src) {
dest[i.first] += i.second;
}
return dest;
return std::move(dest);
}
template <typename MAP>
@@ -93,13 +81,6 @@ inline std::vector<sstring> split(const sstring& text, const char* separator) {
return boost::split(tokens, text, boost::is_any_of(separator));
}
/**
* Split a column family parameter
*/
inline std::vector<sstring> split_cf(const sstring& cf) {
return split(cf, ",");
}
/**
* A helper function to sum values on an a distributed object that
* has a get_stats method.

@@ -3,32 +3,55 @@
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "database_fwd.hh"
#include "service/storage_proxy.hh"
#include <seastar/http/httpd.hh>
namespace service { class load_meter; }
namespace locator { class shared_token_metadata; }
#include <seastar/http/httpd.hh>
#include <seastar/core/future.hh>
#include "replica/database_fwd.hh"
#include "seastarx.hh"
namespace service {
class load_meter;
class storage_proxy;
class storage_service;
} // namespace service
class sstables_loader;
namespace streaming {
class stream_manager;
}
namespace locator {
class token_metadata;
class shared_token_metadata;
} // namespace locator
namespace cql_transport { class controller; }
class thrift_controller;
namespace db { class snapshot_ctl; }
namespace db {
class snapshot_ctl;
class config;
namespace view {
class view_builder;
}
}
namespace netw { class messaging_service; }
class repair_service;
namespace cdc { class generation_service; }
namespace gms {
class gossiper;
}
namespace api {
@@ -36,12 +59,12 @@ struct http_context {
sstring api_dir;
sstring api_doc;
httpd::http_server_control http_server;
distributed<database>& db;
distributed<replica::database>& db;
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
const sharded<locator::shared_token_metadata>& shared_token_metadata;
http_context(distributed<database>& _db,
http_context(distributed<replica::database>& _db,
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm) {
@@ -51,10 +74,14 @@ struct http_context {
};
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx);
future<> set_server_config(http_context& ctx, const db::config& cfg);
future<> set_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx);
future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);
future<> unset_server_view_builder(http_context& ctx);
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair);
future<> unset_server_repair(http_context& ctx);
future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl);
future<> unset_transport_controller(http_context& ctx);
@@ -62,14 +89,18 @@ future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);
future<> unset_rpc_controller(http_context& ctx);
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);
future<> unset_server_snapshot(http_context& ctx);
future<> set_server_gossip(http_context& ctx);
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_server_load_sstable(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> unset_server_messaging_service(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss);
future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm);
future<> unset_server_stream_manager(http_context& ctx);
future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g);
future<> unset_hinted_handoff(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
}

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "cache_service.hh"
@@ -208,7 +195,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([](database& db) -> uint64_t {
return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {
return db.row_cache_tracker().region().occupancy().used_space();
}, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {
return make_ready_future<json::json_return_type>(res);
@@ -216,26 +203,26 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_hits.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {
return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.count();
}, std::plus<uint64_t>());
});
cs::get_row_requests.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {
return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count();
}, std::plus<uint64_t>());
});
cs::get_row_hit_rate.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), [](const column_family& cf) {
return map_reduce_cf(ctx, ratio_holder(), [](const replica::column_family& cf) {
return ratio_holder(cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count(),
cf.get_row_cache().stats().hits.count());
}, std::plus<ratio_holder>());
});
cs::get_row_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -243,7 +230,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate() + cf.get_row_cache().stats().misses.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -253,7 +240,7 @@ void set_cache_service(http_context& ctx, routes& r) {
cs::get_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
// In origin row size is the weighted size.
// We currently do not support weights, so we use num entries instead
return ctx.db.map_reduce0([](database& db) -> uint64_t {
return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {
return db.row_cache_tracker().partitions();
}, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {
return make_ready_future<json::json_return_type>(res);
@@ -261,7 +248,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([](database& db) -> uint64_t {
return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {
return db.row_cache_tracker().partitions();
}, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {
return make_ready_future<json::json_return_type>(res);

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "collectd.hh"

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "column_family.hh"
@@ -24,10 +11,13 @@
#include <vector>
#include <seastar/http/exception.hh>
#include "sstables/sstables.hh"
#include "sstables/metadata_collector.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
#include "unimplemented.hh"
extern logging::logger apilog;
@@ -53,57 +43,57 @@ std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {
return std::make_tuple(name.substr(0, pos), name.substr(end));
}
const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const database& db) {
const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
try {
return db.find_uuid(ks, cf);
} catch (std::out_of_range& e) {
throw bad_param_exception(format("Column family '{}:{}' not found", ks, cf));
} catch (replica::no_such_column_family& e) {
throw bad_param_exception(e.what());
}
}
const utils::UUID& get_uuid(const sstring& name, const database& db) {
const utils::UUID& get_uuid(const sstring& name, const replica::database& db) {
auto [ks, cf] = parse_fully_qualified_cf_name(name);
return get_uuid(ks, cf, db);
}
future<> foreach_column_family(http_context& ctx, const sstring& name, function<void(column_family&)> f) {
future<> foreach_column_family(http_context& ctx, const sstring& name, function<void(replica::column_family&)> f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.invoke_on_all([f, uuid](database& db) {
return ctx.db.invoke_on_all([f, uuid](replica::database& db) {
f(db.find_column_family(uuid));
});
}
future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& name,
int64_t column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const column_family& cf) {
int64_t replica::column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {
return cf.get_stats().*f;
}, std::plus<int64_t>());
}
future<json::json_return_type> get_cf_stats(http_context& ctx,
int64_t column_family_stats::*f) {
return map_reduce_cf(ctx, int64_t(0), [f](const column_family& cf) {
int64_t replica::column_family_stats::*f) {
return map_reduce_cf(ctx, int64_t(0), [f](const replica::column_family& cf) {
return cf.get_stats().*f;
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats_count(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const column_family& cf) {
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {
return (cf.get_stats().*f).hist.count;
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([uuid, f](database& db) {
return ctx.db.map_reduce0([uuid, f](replica::database& db) {
// Histogram information is a sample of the actual load,
// so to estimate the sum we multiply the mean
// by the count. The information is gathered in nanoseconds
// but reported in microseconds.
column_family& cf = db.find_column_family(uuid);
replica::column_family& cf = db.find_column_family(uuid);
return ((cf.get_stats().*f).hist.count/1000.0) * (cf.get_stats().*f).hist.mean;
}, 0.0, std::plus<double>()).then([](double res) {
return make_ready_future<json::json_return_type>((int64_t)res);
@@ -112,16 +102,16 @@ static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const
static future<json::json_return_type> get_cf_stats_count(http_context& ctx,
utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
return map_reduce_cf(ctx, int64_t(0), [f](const column_family& cf) {
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
return map_reduce_cf(ctx, int64_t(0), [f](const replica::column_family& cf) {
return (cf.get_stats().*f).hist.count;
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const database& p) {
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).hist;},
utils::ihistogram(),
std::plus<utils::ihistogram>())
@@ -130,8 +120,8 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, const
});
}
static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
std::function<utils::ihistogram(const database&)> fun = [f] (const database& db) {
static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
std::function<utils::ihistogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::ihistogram res;
for (auto i : db.get_column_families()) {
res += (i.second->get_stats().*f).hist;
@@ -146,9 +136,9 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:
}
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const database& p) {
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).rate();},
utils::rate_moving_average_and_histogram(),
std::plus<utils::rate_moving_average_and_histogram>())
@@ -157,8 +147,8 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& c
});
}
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {
std::function<utils::rate_moving_average_and_histogram(const database&)> fun = [f] (const database& db) {
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
std::function<utils::rate_moving_average_and_histogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::rate_moving_average_and_histogram res;
for (auto i : db.get_column_families()) {
res += (i.second->get_stats().*f).rate();
@@ -173,30 +163,30 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ct
}
static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ctx, const sstring& name) {
return map_reduce_cf(ctx, name, int64_t(0), [](const column_family& cf) {
return map_reduce_cf(ctx, name, int64_t(0), [](const replica::column_family& cf) {
return cf.get_unleveled_sstables();
}, std::plus<int64_t>());
}
static int64_t min_partition_size(column_family& cf) {
static int64_t min_partition_size(replica::column_family& cf) {
int64_t res = INT64_MAX;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());
}
return (res == INT64_MAX) ? 0 : res;
}
static int64_t max_partition_size(column_family& cf) {
static int64_t max_partition_size(replica::column_family& cf) {
int64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);
}
return res;
}
static integral_ratio_holder mean_partition_size(column_family& cf) {
static integral_ratio_holder mean_partition_size(replica::column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto c = i->get_stats_metadata().estimated_partition_size.count();
res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;
res.total += c;
@@ -220,7 +210,7 @@ static json::json_return_type sum_map(const std::unordered_map<sstring, uint64_t
static future<json::json_return_type> sum_sstable(http_context& ctx, const sstring name, bool total) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([uuid, total](database& db) {
return ctx.db.map_reduce0([uuid, total](replica::database& db) {
std::unordered_map<sstring, uint64_t> m;
auto sstables = (total) ? db.find_column_family(uuid).get_sstables_including_compacted_undeleted() :
db.find_column_family(uuid).get_sstables();
@@ -236,7 +226,7 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, const sstr
static future<json::json_return_type> sum_sstable(http_context& ctx, bool total) {
return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](column_family& cf) {
return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](replica::column_family& cf) {
std::unordered_map<sstring, uint64_t> m;
auto sstables = (total) ? cf.get_sstables_including_compacted_undeleted() :
cf.get_sstables();
@@ -249,7 +239,7 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, bool total)
});
}
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f) {
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f) {
return map_reduce_cf_raw(ctx, name, utils::time_estimated_histogram(), f, utils::time_estimated_histogram_merge).then([](const utils::time_estimated_histogram& res) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(res));
});
@@ -272,9 +262,9 @@ public:
}
};
static double get_compression_ratio(column_family& cf) {
static double get_compression_ratio(replica::column_family& cf) {
sum_ratio<double> result;
for (auto i : *cf.get_sstables()) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto compression_ratio = i->get_compression_ratio();
if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {
result(compression_ratio);
@@ -311,7 +301,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){
vector<cf::column_family_info> res;
std::list<cf::column_family_info> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
cf::column_family_info info;
info.ks = i.first.first;
@@ -319,7 +309,7 @@ void set_column_family(http_context& ctx, routes& r) {
info.type = "ColumnFamilies";
res.push_back(info);
}
return make_ready_future<json::json_return_type>(json::stream_object(std::move(res)));
return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
});
cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
@@ -331,15 +321,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](replica::column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](column_family& cf) {
return map_reduce_cf(ctx, uint64_t{0}, [](replica::column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
@@ -351,25 +341,25 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().total_space();
}, std::plus<int64_t>());
});
cf::get_all_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().total_space();
}, std::plus<int64_t>());
});
cf::get_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
}, std::plus<int64_t>());
});
cf::get_all_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
}, std::plus<int64_t>());
});
@@ -384,14 +374,14 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().total_space();
}, std::plus<int64_t>());
});
cf::get_all_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return ctx.db.map_reduce0([](const database& db){
return ctx.db.map_reduce0([](const replica::database& db){
return db.dirty_memory_region_group().memory_used();
}, int64_t(0), std::plus<int64_t>()).then([](int res) {
return make_ready_future<json::json_return_type>(res);
@@ -400,31 +390,31 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().used_space();
}, std::plus<int64_t>());
});
cf::get_all_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
}, std::plus<int64_t>());
});
cf::get_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx,req->param["name"] ,&column_family_stats::memtable_switch_count);
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::memtable_switch_count);
});
cf::get_all_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family_stats::memtable_switch_count);
return get_cf_stats(ctx, &replica::column_family_stats::memtable_switch_count);
});
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_partition_size);
}
return res;
@@ -434,9 +424,9 @@ void set_column_family(http_context& ctx, routes& r) {
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
uint64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res += i->get_stats_metadata().estimated_partition_size.count();
}
return res;
@@ -445,9 +435,9 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
@@ -462,87 +452,87 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_pending_flushes.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx,req->param["name"] ,&column_family_stats::pending_flushes);
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::pending_flushes);
});
cf::get_all_pending_flushes.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family_stats::pending_flushes);
return get_cf_stats(ctx, &replica::column_family_stats::pending_flushes);
});
cf::get_read.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_count(ctx,req->param["name"] ,&column_family_stats::reads);
return get_cf_stats_count(ctx,req->param["name"] ,&replica::column_family_stats::reads);
});
cf::get_all_read.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_count(ctx, &column_family_stats::reads);
return get_cf_stats_count(ctx, &replica::column_family_stats::reads);
});
cf::get_write.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_count(ctx, req->param["name"] ,&column_family_stats::writes);
return get_cf_stats_count(ctx, req->param["name"] ,&replica::column_family_stats::writes);
});
cf::get_all_write.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_count(ctx, &column_family_stats::writes);
return get_cf_stats_count(ctx, &replica::column_family_stats::writes);
});
cf::get_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, req->param["name"], &column_family_stats::reads);
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
});
cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family_stats::reads);
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
});
cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx,req->param["name"] ,&column_family_stats::reads);
return get_cf_stats_sum(ctx,req->param["name"] ,&replica::column_family_stats::reads);
});
cf::get_write_latency.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx, req->param["name"] ,&column_family_stats::writes);
return get_cf_stats_sum(ctx, req->param["name"] ,&replica::column_family_stats::writes);
});
cf::get_all_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, &column_family_stats::writes);
return get_cf_histogram(ctx, &replica::column_family_stats::writes);
});
cf::get_all_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_rate_and_histogram(ctx, &column_family_stats::writes);
return get_cf_rate_and_histogram(ctx, &replica::column_family_stats::writes);
});
cf::get_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, req->param["name"], &column_family_stats::writes);
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
});
cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family_stats::writes);
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
});
cf::get_all_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, &column_family_stats::writes);
return get_cf_histogram(ctx, &replica::column_family_stats::writes);
});
cf::get_all_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_rate_and_histogram(ctx, &column_family_stats::writes);
return get_cf_rate_and_histogram(ctx, &replica::column_family_stats::writes);
});
cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
}, std::plus<int64_t>());
});
cf::get_all_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf);
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
}, std::plus<int64_t>());
});
cf::get_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx, req->param["name"], &column_family_stats::live_sstable_count);
return get_cf_stats(ctx, req->param["name"], &replica::column_family_stats::live_sstable_count);
});
cf::get_all_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family_stats::live_sstable_count);
return get_cf_stats(ctx, &replica::column_family_stats::live_sstable_count);
});
cf::get_unleveled_sstables.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -598,105 +588,115 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
});
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
});
cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
});
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -740,7 +740,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_true_snapshots_size.set(r, [&ctx] (std::unique_ptr<request> req) {
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.local().find_column_family(uuid).get_snapshot_details().then([](
const std::unordered_map<sstring, column_family::snapshot_details>& sd) {
const std::unordered_map<sstring, replica::column_family::snapshot_details>& sd) {
int64_t res = 0;
for (auto i : sd) {
res += i.second.total;
@@ -769,7 +769,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -777,7 +777,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_all_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -785,7 +785,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().misses.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -793,7 +793,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_all_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().misses.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -802,36 +802,36 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_prepare;
});
});
cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_accept;
});
});
cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_learn;
});
});
cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return cf.get_stats().estimated_sstable_per_read;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, req->param["name"], &column_family_stats::tombstone_scanned);
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::tombstone_scanned);
});
cf::get_live_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, req->param["name"], &column_family_stats::live_scanned);
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::live_scanned);
});
cf::get_col_update_time_delta_histogram.set(r, [] (std::unique_ptr<request> req) {
@@ -844,23 +844,29 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_auto_compaction.set(r, [&ctx] (const_req req) {
const utils::UUID& uuid = get_uuid(req.param["name"], ctx.db.local());
column_family& cf = ctx.db.local().find_column_family(uuid);
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
return !cf.is_auto_compaction_disabled_by_user();
});
cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {
cf.enable_auto_compaction();
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
cf.enable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
});
});
});
cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {
cf.disable_auto_compaction();
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
return cf.disable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
});
});
});
@@ -868,7 +874,7 @@ void set_column_family(http_context& ctx, routes& r) {
auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
return db::system_keyspace::load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_build_progress>& vb) mutable {
return db::system_keyspace::load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace_view_build_progress>& vb) mutable {
std::set<sstring> vp;
for (auto b : vb) {
if (b.view.first == ks) {
@@ -877,7 +883,7 @@ void set_column_family(http_context& ctx, routes& r) {
}
std::vector<sstring> res;
auto uuid = get_uuid(ks, cf_name, ctx.db.local());
column_family& cf = ctx.db.local().find_column_family(uuid);
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
res.reserve(cf.get_index_manager().list_indexes().size());
for (auto&& i : cf.get_index_manager().list_indexes()) {
if (!vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
@@ -905,8 +911,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_compression_ratio.set(r, [&ctx](std::unique_ptr<request> req) {
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce(sum_ratio<double>(), [uuid](database& db) {
column_family& cf = db.find_column_family(uuid);
return ctx.db.map_reduce(sum_ratio<double>(), [uuid](replica::database& db) {
replica::column_family& cf = db.find_column_family(uuid);
return make_ready_future<double>(get_compression_ratio(cf));
}).then([] (const double& result) {
return make_ready_future<json::json_return_type>(result);
@@ -914,20 +920,20 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_read;
});
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_write;
});
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<request> req) {
sstring strategy = req->get_query_param("class_name");
return foreach_column_family(ctx, req->param["name"], [strategy](column_family& cf) {
return foreach_column_family(ctx, req->param["name"], [strategy](replica::column_family& cf) {
cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -951,7 +957,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const column_family& cf) {
return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const replica::column_family& cf) {
return cf.sstable_count_per_level();
}, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {
return make_ready_future<json::json_return_type>(res);
@@ -962,7 +968,7 @@ void set_column_family(http_context& ctx, routes& r) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (database& db) {
return ctx.db.map_reduce0([key, uuid] (replica::database& db) {
return db.find_column_family(uuid).get_sstables_by_partition_key(key);
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
@@ -973,45 +979,20 @@ void set_column_family(http_context& ctx, routes& r) {
});
});
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {
auto name_param = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name_param);
auto name = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
name_param, duration.param, list_size.param, capacity.param);
name, duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {
return q.scatter().then([&q] {
return sleep(q.duration()).then([&q] {
return q.gather(q.capacity()).then([&q] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = topk_results.read.size();
results.write_cardinality = topk_results.write.size();
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx, true);
});
});
@@ -1019,7 +1000,7 @@ void set_column_family(http_context& ctx, routes& r) {
if (req->get_query_param("split_output") != "") {
fail(unimplemented::cause::API);
}
return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
return cf.compact_all_sstables();
}).then([] {
return make_ready_future<json::json_return_type>(json_void());

@@ -1,29 +1,16 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
#include "api/api-doc/column_family.json.hh"
#include "database.hh"
#include "replica/database.hh"
#include <seastar/core/future-util.hh>
#include <any>
@@ -31,17 +18,17 @@ namespace api {
void set_column_family(http_context& ctx, routes& r);
const utils::UUID& get_uuid(const sstring& name, const database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(column_family&)> f);
const utils::UUID& get_uuid(const sstring& name, const replica::database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer) {
auto uuid = get_uuid(name, ctx.db.local());
using mapper_type = std::function<std::unique_ptr<std::any>(database&)>;
using mapper_type = std::function<std::unique_ptr<std::any>(replica::database&)>;
using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;
return ctx.db.map_reduce0(mapper_type([mapper, uuid](database& db) {
return ctx.db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {
return std::make_unique<std::any>(I(mapper(db.find_column_family(uuid))));
}), std::make_unique<std::any>(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {
return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));
@@ -68,16 +55,16 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n
});
}
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f);
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f);
struct map_reduce_column_families_locally {
std::any init;
std::function<std::unique_ptr<std::any>(column_family&)> mapper;
std::function<std::unique_ptr<std::any>(replica::column_family&)> mapper;
std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;
future<std::unique_ptr<std::any>> operator()(database& db) const {
future<std::unique_ptr<std::any>> operator()(replica::database& db) const {
auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));
return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<table>>& i) {
*res = std::move(reducer(std::move(*res), mapper(*i.second.get())));
return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
*res = reducer(std::move(*res), mapper(*i.second.get()));
}).then([res] {
return std::move(*res);
});
@@ -87,9 +74,9 @@ struct map_reduce_column_families_locally {
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
using mapper_type = std::function<std::unique_ptr<std::any>(column_family&)>;
using mapper_type = std::function<std::unique_ptr<std::any>(replica::column_family&)>;
using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (column_family& cf) mutable {
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (replica::column_family& cf) mutable {
return std::make_unique<std::any>(I(mapper(cf)));
});
auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {
@@ -111,9 +98,12 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,
}
future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& name,
int64_t column_family_stats::*f);
int64_t replica::column_family_stats::*f);
future<json::json_return_type> get_cf_stats(http_context& ctx,
int64_t column_family_stats::*f);
int64_t replica::column_family_stats::*f);
std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);
}
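
The `map_reduce_cf_raw` helpers in this header erase the accumulator type behind `std::unique_ptr<std::any>` so that one non-template functor can be shipped to every shard, then cast back to `I` at the end. The following is a minimal single-threaded sketch of that wrapping idea only. `fake_table` and `map_reduce_local` are hypothetical names invented for illustration; the real code runs over `replica::database` shards through Seastar futures, which this sketch does not model.

```cpp
#include <any>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for replica::column_family; not the real Scylla type.
struct fake_table {
    long pending;
};

using erased = std::unique_ptr<std::any>;
using mapper_type = std::function<erased(fake_table&)>;
using reducer_type = std::function<erased(erased, erased)>;

// Fold every table through type-erased mapper/reducer wrappers,
// then recover the statically typed result I at the end.
template <class I, class Mapper, class Reducer>
I map_reduce_local(std::vector<fake_table>& tables, I init, Mapper mapper, Reducer reducer) {
    // Wrap the typed mapper so its result travels as std::any.
    mapper_type wrapped_mapper = [mapper = std::move(mapper)](fake_table& t) mutable {
        return std::make_unique<std::any>(I(mapper(t)));
    };
    // Wrap the typed reducer: unwrap both operands, combine, wrap again.
    reducer_type wrapped_reducer = [reducer = std::move(reducer)](erased a, erased b) mutable {
        return std::make_unique<std::any>(
            I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));
    };
    erased acc = std::make_unique<std::any>(std::move(init));
    for (auto& t : tables) {
        acc = wrapped_reducer(std::move(acc), wrapped_mapper(t));
    }
    return std::any_cast<I>(std::move(*acc));
}
```

Calling it with `0L`, a lambda returning `t.pending`, and `std::plus<long>()` sums the field across all tables, which is roughly the shape of the `get_pending_tasks` handler in the compaction manager API.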

@@ -1,28 +1,15 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "commitlog.hh"
#include "db/commitlog/commitlog.hh"
#include "api/api-doc/commitlog.json.hh"
#include "database.hh"
#include "replica/database.hh"
#include <vector>
namespace api {
@@ -31,7 +18,7 @@ template<typename T>
static auto acquire_cl_metric(http_context& ctx, std::function<T (db::commitlog*)> func) {
typedef T ret_type;
return ctx.db.map_reduce0([func = std::move(func)](database& db) {
return ctx.db.map_reduce0([func = std::move(func)](replica::database& db) {
if (db.commitlog() == nullptr) {
return make_ready_future<ret_type>();
}
@@ -47,7 +34,7 @@ void set_commitlog(http_context& ctx, routes& r) {
auto res = make_shared<std::vector<sstring>>();
return ctx.db.map_reduce([res](std::vector<sstring> names) {
res->insert(res->end(), names.begin(), names.end());
}, [](database& db) {
}, [](replica::database& db) {
if (db.commitlog() == nullptr) {
return make_ready_future<std::vector<sstring>>(std::vector<sstring>());
}

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,29 +1,21 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "compaction_manager.hh"
#include "sstables/compaction_manager.hh"
#include "compaction/compaction_manager.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
#include "unimplemented.hh"
#include "storage_service.hh"
#include <utility>
namespace api {
@@ -33,7 +25,7 @@ using namespace json;
static future<json::json_return_type> get_cm_stats(http_context& ctx,
int64_t compaction_manager::stats::*f) {
return ctx.db.map_reduce0([f](database& db) {
return ctx.db.map_reduce0([f](replica::database& db) {
return db.get_compaction_manager().get_stats().*f;
}, int64_t(0), std::plus<int64_t>()).then([](const int64_t& res) {
return make_ready_future<json::json_return_type>(res);
@@ -52,18 +44,19 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha
void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([](database& db) {
return ctx.db.map_reduce0([](replica::database& db) {
std::vector<cm::summary> summaries;
const compaction_manager& cm = db.get_compaction_manager();
for (const auto& c : cm.get_compactions()) {
cm::summary s;
s.ks = c->ks_name;
s.cf = c->cf_name;
s.id = c.compaction_uuid.to_sstring();
s.ks = c.ks_name;
s.cf = c.cf_name;
s.unit = "keys";
s.task_type = sstables::compaction_name(c->type);
s.completed = c->total_keys_written;
s.total = c->total_partitions;
s.task_type = sstables::compaction_name(c.type);
s.completed = c.total_keys_written;
s.total = c.total_partitions;
summaries.push_back(std::move(s));
}
return summaries;
@@ -73,11 +66,11 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([&ctx](database& db) {
return ctx.db.map_reduce0([&ctx](replica::database& db) {
return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&ctx, &db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<utils::UUID, seastar::lw_shared_ptr<table>>& i) {
table& cf = *i.second.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.get_compaction_strategy().estimated_pending_compactions(cf);
return do_for_each(db.get_column_families(), [&tasks](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
replica::table& cf = *i.second.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
return make_ready_future<>();
}).then([&tasks] {
return std::move(tasks);
@@ -107,17 +100,34 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::stop_compaction.set(r, [&ctx] (std::unique_ptr<request> req) {
auto type = req->get_query_param("type");
return ctx.db.invoke_on_all([type] (database& db) {
return ctx.db.invoke_on_all([type] (replica::database& db) {
auto& cm = db.get_compaction_manager();
cm.stop_compaction(type);
return cm.stop_compaction(type);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req->param);
auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");
if (table_names.empty()) {
table_names = map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
}
auto type = req->get_query_param("type");
co_await ctx.db.invoke_on_all([&ks_name, &table_names, type] (replica::database& db) {
auto& cm = db.get_compaction_manager();
return parallel_for_each(table_names, [&db, &cm, &ks_name, type] (sstring& table_name) {
auto& t = db.find_column_family(ks_name, table_name);
return cm.stop_compaction(type, &t);
});
});
co_return json_void();
});
cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf);
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
}, std::plus<int64_t>());
});

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,28 +1,14 @@
/*
* Copyright 2018 ScyllaDB
* Copyright 2018-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api/config.hh"
#include "api/api-doc/config.json.hh"
#include "db/config.hh"
#include "database.hh"
#include <sstream>
#include <boost/algorithm/string/replace.hpp>
@@ -89,11 +75,11 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc
namespace cs = httpd::config_json;
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r) {
rb->register_function(r, [&ctx] (output_stream<char>& os) {
return do_with(true, [&os, &ctx] (bool& first) {
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg) {
rb->register_function(r, [&cfg] (output_stream<char>& os) {
return do_with(true, [&os, &cfg] (bool& first) {
auto f = make_ready_future();
for (auto&& cfg_ref : ctx.db.local().get_config().values()) {
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
f = f.then([&os, &first, &cfg] {
return get_config_swagger_entry(cfg.name(), std::string(cfg.desc()), cfg.type_name(), first, os);
@@ -103,9 +89,9 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
});
});
cs::find_config_id.set(r, [&ctx] (const_req r) {
cs::find_config_id.set(r, [&cfg] (const_req r) {
auto id = r.param["id"];
for (auto&& cfg_ref : ctx.db.local().get_config().values()) {
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
if (id == cfg.name()) {
return cfg.value_as_json();

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2018 ScyllaDB
* Copyright (C) 2018-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -26,5 +13,5 @@
namespace api {
void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r);
void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r, const db::config& cfg);
}

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "locator/snitch_base.hh"

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2020 ScyllaDB
* Copyright (C) 2020-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api/api-doc/error_injection.json.hh"

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2019 ScyllaDB
* Copyright (C) 2019-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "failure_detector.hh"
@@ -28,10 +15,10 @@ namespace api {
namespace fd = httpd::failure_detector_json;
void set_failure_detector(http_context& ctx, routes& r) {
fd::get_all_endpoint_states.set(r, [](std::unique_ptr<request> req) {
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {
std::vector<fd::endpoint_state> res;
for (auto i : gms::get_local_gossiper().endpoint_state_map) {
for (auto i : g.endpoint_state_map) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
@@ -52,14 +39,14 @@ void set_failure_detector(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
fd::get_up_endpoint_count.set(r, [](std::unique_ptr<request> req) {
return gms::get_up_endpoint_count().then([](int res) {
fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
return gms::get_up_endpoint_count(g).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_down_endpoint_count.set(r, [](std::unique_ptr<request> req) {
return gms::get_down_endpoint_count().then([](int res) {
fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
return gms::get_down_endpoint_count(g).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
});
@@ -70,8 +57,8 @@ void set_failure_detector(http_context& ctx, routes& r) {
});
});
fd::get_simple_states.set(r, [] (std::unique_ptr<request> req) {
return gms::get_simple_states().then([](const std::map<sstring, sstring>& map) {
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
return gms::get_simple_states(g).then([](const std::map<sstring, sstring>& map) {
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(map));
});
});
@@ -83,8 +70,8 @@ void set_failure_detector(http_context& ctx, routes& r) {
});
});
fd::get_endpoint_state.set(r, [](std::unique_ptr<request> req) {
return gms::get_endpoint_state(req->param["addr"]).then([](const sstring& state) {
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return get_endpoint_state(g, req->param["addr"]).then([](const sstring& state) {
return make_ready_future<json::json_return_type>(state);
});
});

@@ -1,30 +1,23 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
namespace gms {
void set_failure_detector(http_context& ctx, routes& r);
class gossiper;
}
namespace api {
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g);
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "gossiper.hh"
@@ -26,50 +13,50 @@
namespace api {
using namespace json;
void set_gossiper(http_context& ctx, routes& r) {
httpd::gossiper_json::get_down_endpoint.set(r, [] (const_req req) {
auto res = gms::get_local_gossiper().get_unreachable_members();
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_unreachable_members();
return container_to_vec(res);
});
httpd::gossiper_json::get_live_endpoint.set(r, [] (const_req req) {
auto res = gms::get_local_gossiper().get_live_members();
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_live_members();
return container_to_vec(res);
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [] (const_req req) {
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {
gms::inet_address ep(req.param["addr"]);
return gms::get_local_gossiper().get_endpoint_downtime(ep);
return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [] (std::unique_ptr<request> req) {
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_local_gossiper().get_current_generation_number(ep).then([] (int res) {
return g.get_current_generation_number(ep).then([] (int res) {
return make_ready_future<json::json_return_type>(res);
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [] (std::unique_ptr<request> req) {
httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_local_gossiper().get_current_heart_beat_version(ep).then([] (int res) {
return g.get_current_heart_beat_version(ep).then([] (int res) {
return make_ready_future<json::json_return_type>(res);
});
});
httpd::gossiper_json::assassinate_endpoint.set(r, [](std::unique_ptr<request> req) {
httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<request> req) {
if (req->get_query_param("unsafe") != "True") {
return gms::get_local_gossiper().assassinate_endpoint(req->param["addr"]).then([] {
return g.assassinate_endpoint(req->param["addr"]).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
return gms::get_local_gossiper().unsafe_assassinate_endpoint(req->param["addr"]).then([] {
return g.unsafe_assassinate_endpoint(req->param["addr"]).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [](std::unique_ptr<request> req) {
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_local_gossiper().force_remove_endpoint(ep).then([] {
return g.force_remove_endpoint(ep).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});


@@ -1,30 +1,23 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
namespace gms {
void set_gossiper(http_context& ctx, routes& r);
class gossiper;
}
namespace api {
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g);
}
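
The change above replaces calls to the global `gms::get_local_gossiper()` with a `gms::gossiper&` that is passed into `set_gossiper` and captured by each handler. A minimal standalone sketch of that pattern follows. `sketch_gossiper`, `sketch_routes`, and the `"live_endpoint"` key are hypothetical stand-ins for illustration, not Scylla or Seastar types.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for gms::gossiper: only tracks live members.
struct sketch_gossiper {
    std::vector<std::string> live;
    const std::vector<std::string>& get_live_members() const { return live; }
};

// Hypothetical stand-in for the HTTP route table: name -> handler.
using sketch_handler = std::function<std::vector<std::string>()>;
struct sketch_routes {
    std::map<std::string, sketch_handler> handlers;
};

// The service is injected by reference and captured by the handler,
// instead of being looked up through a global accessor on every request.
void set_gossiper_sketch(sketch_routes& r, sketch_gossiper& g) {
    r.handlers["live_endpoint"] = [&g] { return g.get_live_members(); };
}

// Removes the handler so that nothing keeps referring to `g` afterwards.
void unset_gossiper_sketch(sketch_routes& r) {
    r.handlers.erase("live_endpoint");
}
```

Because the handler holds a reference, the injected object presumably has to outlive the registered route, which is consistent with the `unset_*` functions that appear elsewhere in this diff.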


@@ -1,33 +1,98 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <algorithm>
#include <vector>
#include "hinted_handoff.hh"
#include "api/api-doc/hinted_handoff.json.hh"
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "service/storage_proxy.hh"
namespace api {
using namespace json;
namespace hh = httpd::hinted_handoff_json;
void set_hinted_handoff(http_context& ctx, routes& r) {
void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g) {
hh::create_hints_sync_point.set(r, [&ctx, &g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto parse_hosts_list = [&g] (sstring arg) {
std::vector<sstring> hosts_str = split(arg, ",");
std::vector<gms::inet_address> hosts;
hosts.reserve(hosts_str.size());
if (hosts_str.empty()) {
// No target_hosts specified means that we should wait for hints for all nodes to be sent
const auto members_set = g.get_live_members();
std::copy(members_set.begin(), members_set.end(), std::back_inserter(hosts));
} else {
for (const auto& host_str : hosts_str) {
try {
gms::inet_address host;
host = gms::inet_address(host_str);
hosts.push_back(host);
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));
}
}
}
return hosts;
};
std::vector<gms::inet_address> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));
return ctx.sp.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {
return json::json_return_type(sync_point.encode());
});
});
hh::get_hints_sync_point.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
db::hints::sync_point sync_point;
const sstring encoded = req->get_query_param("id");
try {
sync_point = db::hints::sync_point::decode(encoded);
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse the sync point description {}: {}", encoded, e.what()));
}
lowres_clock::time_point deadline;
const sstring timeout_str = req->get_query_param("timeout");
try {
deadline = [&] {
if (timeout_str.empty()) {
// Empty string - don't wait at all, just check the status
return lowres_clock::time_point::min();
} else {
const auto timeout = std::stoll(timeout_str);
if (timeout >= 0) {
// Wait until the point is reached, or until `timeout` seconds elapse
return lowres_clock::now() + std::chrono::seconds(timeout);
} else {
// Negative value indicates infinite timeout
return lowres_clock::time_point::max();
}
}
} ();
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse the timeout parameter {}: {}", timeout_str, e.what()));
}
using return_type = hh::ns_get_hints_sync_point::get_hints_sync_point_return_type;
using return_type_wrapper = hh::ns_get_hints_sync_point::return_type_wrapper;
return ctx.sp.local().wait_for_hint_sync_point(std::move(sync_point), deadline).then([] {
return json::json_return_type(return_type_wrapper(return_type::DONE));
}).handle_exception_type([] (const timed_out_error&) {
return json::json_return_type(return_type_wrapper(return_type::IN_PROGRESS));
});
});
hh::list_endpoints_pending_hints.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
@@ -71,5 +136,16 @@ void set_hinted_handoff(http_context& ctx, routes& r) {
});
}
void unset_hinted_handoff(http_context& ctx, routes& r) {
hh::create_hints_sync_point.unset(r);
hh::get_hints_sync_point.unset(r);
hh::list_endpoints_pending_hints.unset(r);
hh::truncate_all_hints.unset(r);
hh::schedule_hint_delivery.unset(r);
hh::pause_hints_delivery.unset(r);
hh::get_create_hint_count.unset(r);
hh::get_not_stored_hints_count.unset(r);
}
}
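
The `get_hints_sync_point` handler above maps the `timeout` query parameter to a deadline in three cases: empty, non-negative, and negative. The sketch below isolates that mapping so it can be read and tested on its own. It is an illustration under assumptions: `std::chrono::steady_clock` stands in for Seastar's `lowres_clock`, and `sketch_clock` and `parse_deadline` are hypothetical names that do not exist in the Scylla source.

```cpp
#include <chrono>
#include <stdexcept>
#include <string>

// Assumption: steady_clock stands in for seastar's lowres_clock.
using sketch_clock = std::chrono::steady_clock;

// Maps the textual `timeout` parameter to a deadline:
//   ""      -> time_point::min(): do not wait, only check the status
//   N >= 0  -> now + N seconds
//   N < 0   -> time_point::max(): wait without a time limit
// std::stoll throws std::invalid_argument or std::out_of_range on bad input;
// the handler in the diff turns such errors into a bad_param_exception.
sketch_clock::time_point parse_deadline(const std::string& timeout_str,
                                        sketch_clock::time_point now) {
    if (timeout_str.empty()) {
        return sketch_clock::time_point::min();
    }
    const long long timeout = std::stoll(timeout_str);
    if (timeout >= 0) {
        return now + std::chrono::seconds(timeout);
    }
    return sketch_clock::time_point::max();
}
```

Passing `now` as an argument, rather than reading the clock inside the function, is a choice made only for this sketch so that the result is deterministic in a test.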


@@ -1,30 +1,24 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
namespace gms {
void set_hinted_handoff(http_context& ctx, routes& r);
class gossiper;
}
namespace api {
void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g);
void unset_hinted_handoff(http_context& ctx, routes& r);
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api/api-doc/lsa.json.hh"
@@ -26,6 +13,7 @@
#include <seastar/http/exception.hh>
#include "utils/logalloc.hh"
#include "log.hh"
#include "replica/database.hh"
namespace api {
@@ -34,7 +22,7 @@ static logging::logger alogger("lsa-api");
void set_lsa(http_context& ctx, routes& r) {
httpd::lsa_json::lsa_compact.set(r, [&ctx](std::unique_ptr<request> req) {
alogger.info("Triggering compaction");
return ctx.db.invoke_on_all([] (database&) {
return ctx.db.invoke_on_all([] (replica::database&) {
logalloc::shard_tracker().reclaim(std::numeric_limits<size_t>::max());
}).then([] {
return json::json_return_type(json::json_void());


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "messaging_service.hh"
@@ -96,6 +83,10 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
return c.get_stats().sent_messages;
}));
get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().replied;
}));
get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
// We don't have the same drop message mechanism
// as origin has.
@@ -155,6 +146,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
void unset_messaging_service(http_context& ctx, routes& r) {
get_timeout_messages.unset(r);
get_sent_messages.unset(r);
get_replied_messages.unset(r);
get_dropped_messages.unset(r);
get_exception_messages.unset(r);
get_pending_messages.unset(r);


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "storage_proxy.hh"
@@ -26,7 +13,7 @@
#include "service/storage_service.hh"
#include "db/config.hh"
#include "utils/histogram.hh"
#include "database.hh"
#include "replica/database.hh"
#include "seastar/core/scheduling_specific.hh"
namespace api {
@@ -193,7 +180,7 @@ sum_timer_stats_storage_proxy(distributed<proxy>& d,
});
}
void set_storage_proxy(http_context& ctx, routes& r) {
void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_service>& ss) {
sp::get_total_hints.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
@@ -363,8 +350,8 @@ void set_storage_proxy(http_context& ctx, routes& r) {
return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_background);
});
sp::get_schema_versions.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().describe_schema_versions().then([] (auto result) {
sp::get_schema_versions.set(r, [&ss](std::unique_ptr<request> req) {
return ss.local().describe_schema_versions().then([] (auto result) {
std::vector<sp::mapper_list> res;
for (auto e : result) {
sp::mapper_list entry;


@@ -1,30 +1,20 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
namespace service { class storage_service; }
namespace api {
void set_storage_proxy(http_context& ctx, routes& r);
void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_service>& ss);
}

File diff suppressed because it is too large


@@ -1,38 +1,53 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
namespace cql_transport { class controller; }
class thrift_controller;
namespace db { class snapshot_ctl; }
namespace db {
class snapshot_ctl;
namespace view {
class view_builder;
}
}
namespace netw { class messaging_service; }
class repair_service;
namespace cdc { class generation_service; }
class sstables_loader;
namespace gms {
class gossiper;
}
namespace api {
void set_storage_service(http_context& ctx, routes& r);
void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms);
// Verifies that the keyspace parameter is found; otherwise a bad_param_exception is thrown
// containing the description of the respective keyspace error.
sstring validate_keyspace(http_context& ctx, const parameters& param);
// Splits a request parameter assumed to hold a comma-separated list of table names and
// verifies that the tables are found; otherwise a bad_param_exception is thrown
// containing the description of the respective no_such_column_family error.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs);
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, routes& r);
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb);
void unset_view_builder(http_context& ctx, routes& r);
void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair);
void unset_repair(http_context& ctx, routes& r);
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl);
void unset_transport_controller(http_context& ctx, routes& r);
@@ -40,5 +55,6 @@ void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
void unset_rpc_controller(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "stream_manager.hh"
@@ -87,13 +74,13 @@ static hs::stream_state get_state(
return state;
}
void set_stream_manager(http_context& ctx, routes& r) {
void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_manager>& sm) {
hs::get_current_streams.set(r,
[] (std::unique_ptr<request> req) {
return streaming::get_stream_manager().invoke_on_all([] (auto& sm) {
[&sm] (std::unique_ptr<request> req) {
return sm.invoke_on_all([] (auto& sm) {
return sm.update_all_progress_info();
}).then([] {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {
}).then([&sm] {
return sm.map_reduce0([](streaming::stream_manager& stream) {
std::vector<hs::stream_state> res;
for (auto i : stream.get_initiated_streams()) {
res.push_back(get_state(*i.second.get()));
@@ -109,17 +96,17 @@ void set_stream_manager(http_context& ctx, routes& r) {
});
});
hs::get_all_active_streams_outbound.set(r, [](std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {
hs::get_all_active_streams_outbound.set(r, [&sm](std::unique_ptr<request> req) {
return sm.map_reduce0([](streaming::stream_manager& stream) {
return stream.get_initiated_streams().size();
}, 0, std::plus<int64_t>()).then([](int64_t res) {
return make_ready_future<json::json_return_type>(res);
});
});
hs::get_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {
hs::get_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
return streaming::get_stream_manager().map_reduce0([peer](streaming::stream_manager& sm) {
return sm.map_reduce0([peer](streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_received;
});
@@ -128,8 +115,8 @@ void set_stream_manager(http_context& ctx, routes& r) {
});
});
hs::get_all_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {
hs::get_all_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {
return sm.map_reduce0([](streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards().then([] (auto sbytes) {
return sbytes.bytes_received;
});
@@ -138,9 +125,9 @@ void set_stream_manager(http_context& ctx, routes& r) {
});
});
hs::get_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {
hs::get_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
return streaming::get_stream_manager().map_reduce0([peer] (streaming::stream_manager& sm) {
return sm.map_reduce0([peer] (streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_sent;
});
@@ -149,8 +136,8 @@ void set_stream_manager(http_context& ctx, routes& r) {
});
});
hs::get_all_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {
return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {
hs::get_all_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {
return sm.map_reduce0([](streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards().then([] (auto sbytes) {
return sbytes.bytes_sent;
});
@@ -160,4 +147,13 @@ void set_stream_manager(http_context& ctx, routes& r) {
});
}
void unset_stream_manager(http_context& ctx, routes& r) {
hs::get_current_streams.unset(r);
hs::get_all_active_streams_outbound.unset(r);
hs::get_total_incoming_bytes.unset(r);
hs::get_all_total_incoming_bytes.unset(r);
hs::get_total_outgoing_bytes.unset(r);
hs::get_all_total_outgoing_bytes.unset(r);
}
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -25,6 +12,7 @@
namespace api {
void set_stream_manager(http_context& ctx, routes& r);
void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_manager>& sm);
void unset_stream_manager(http_context& ctx, routes& r);
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api/api-doc/system.json.hh"
@@ -25,6 +12,9 @@
#include <seastar/core/reactor.hh>
#include <seastar/http/exception.hh>
#include "log.hh"
#include "replica/database.hh"
extern logging::logger apilog;
namespace api {
@@ -70,6 +60,16 @@ void set_system(http_context& ctx, routes& r) {
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (replica::database& db) {
return db.drop_caches();
}).then([] {
apilog.info("Caches dropped");
return json::json_return_type(json::json_void());
});
});
}
}


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2018 ScyllaDB
* Copyright (C) 2018-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "atomic_cell.hh"
@@ -24,142 +11,125 @@
#include "counters.hh"
#include "types.hh"
/// LSA migrator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
return atomic_cell_type::make_live_uninitialized(timestamp, size);
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
: _data(other._view) {
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
return right.ttl() <=> left.ttl();
}
}
} else {
// Both are deleted
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
<=> (uint64_t) right.deletion_time().time_since_epoch().count();
}
return std::strong_ordering::equal;
}
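The tie-breaking order in compare_atomic_cell_for_merge() can be easier to follow on a toy model. The following is a minimal sketch, not the Scylla implementation: the `cell` struct and the `live`, `expiring`, `dead` and `cmp` helpers are hypothetical stand-ins for atomic_cell_view, and the function returns an int (negative, zero, positive) instead of std::strong_ordering so that it also builds with pre-C++20 compilers. A positive result means the left cell wins the merge.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace merge_sketch {

// Hypothetical, simplified cell; the real code reads these from a serialized view.
struct cell {
    std::int64_t timestamp;
    bool is_live;
    bool has_ttl;                // only meaningful for live cells
    std::string value;           // only meaningful for live cells
    std::int64_t expiry;         // only meaningful when has_ttl
    std::int64_t ttl;            // only meaningful when has_ttl
    std::int64_t deletion_time;  // only meaningful for dead cells
};

inline cell live(std::int64_t ts, std::string v) {
    return cell{ts, true, false, std::move(v), 0, 0, 0};
}
inline cell expiring(std::int64_t ts, std::string v, std::int64_t expiry, std::int64_t ttl) {
    return cell{ts, true, true, std::move(v), expiry, ttl, 0};
}
inline cell dead(std::int64_t ts, std::int64_t deletion_time) {
    return cell{ts, false, false, std::string(), 0, 0, deletion_time};
}

// Three-way comparison helper returning -1, 0 or 1.
template <typename T>
int cmp(T a, T b) {
    return a < b ? -1 : (a > b ? 1 : 0);
}

inline int compare_for_merge(const cell& l, const cell& r) {
    // 1. Higher timestamp wins.
    if (l.timestamp != r.timestamp) {
        return cmp(l.timestamp, r.timestamp);
    }
    // 2. At equal timestamps a dead cell beats a live one.
    if (l.is_live != r.is_live) {
        return l.is_live ? -1 : 1;
    }
    if (l.is_live) {
        // 3. Compare values bytewise (std::string::compare treats chars as unsigned).
        int c = cmp(l.value.compare(r.value), 0);
        if (c != 0) {
            return c;
        }
        // 4. Prefer expiring cells over non-expiring ones.
        if (l.has_ttl != r.has_ttl) {
            return l.has_ttl ? 1 : -1;
        }
        if (l.has_ttl) {
            // 5. Later expiry wins; at equal expiry the smaller TTL
            //    (i.e. the cell written later) wins.
            if (l.expiry != r.expiry) {
                return cmp(l.expiry, r.expiry);
            }
            return cmp(r.ttl, l.ttl);
        }
        return 0;
    }
    // Both dead: compare deletion times as unsigned values.
    return cmp(static_cast<std::uint64_t>(l.deletion_time),
               static_cast<std::uint64_t>(r.deletion_time));
}

} // namespace merge_sketch
```

For example, under this model `compare_for_merge(live(5, "a"), dead(5, 100))` is negative, because a tombstone with the same timestamp takes precedence over the live write.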
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
if (_data.empty()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
return atomic_cell_or_collection(managed_bytes(_data));
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
: _data(acv._view)
{
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
if (_data.empty() || other._data.empty()) {
return _data.empty() && other._data.empty();
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
auto a = atomic_cell_view::from_bytes(type, _data);
auto b = atomic_cell_view::from_bytes(type, other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
@@ -191,44 +161,24 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = as_collection_mutation().data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::effective_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
return _data.external_memory_usage();
}
std::ostream&
operator<<(std::ostream& os, const atomic_cell_view& acv) {
if (acv.is_live()) {
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
fmt::print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
acv.is_counter_update()
? "counter_update_value=" + to_sstring(acv.counter_update_value())
: to_hex(acv.value().linearize()),
: to_hex(to_bytes(acv.value())),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
} else {
return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
fmt::print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
acv.timestamp(), acv.deletion_time().time_since_epoch().count());
}
return os;
}
std::ostream&
@@ -247,22 +197,22 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
} else {
cell_value_string_builder << "shards: ";
counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
cell_value_string_builder << ::join(", ", ccv.shards());
});
auto ccv = counter_cell_view(acv);
cell_value_string_builder << ::join(", ", ccv.shards());
}
} else {
cell_value_string_builder << type.to_string(acv.value().linearize());
cell_value_string_builder << type.to_string(to_bytes(acv.value()));
}
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
fmt::print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
cell_value_string_builder.str(),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
} else {
return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
fmt::print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",
acv.timestamp(), acv.deletion_time().time_since_epoch().count());
}
return os;
}
std::ostream&
@@ -271,12 +221,11 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
if (!p._cell._data.get()) {
if (p._cell._data.empty()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
if (p._cdef.type->is_multi_cell()) {
os << "collection ";
auto cmv = p._cell.as_collection_mutation();
os << collection_mutation_view::printer(*p._cdef.type, cmv);

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -27,11 +14,10 @@
#include "gc_clock.hh"
#include "utils/managed_bytes.hh"
#include <seastar/net/byteorder.hh>
#include <seastar/util/bool_class.hh>
#include <cstdint>
#include <iosfwd>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include <concepts>
#include "utils/fragmented_temporary_buffer.hh"
#include "serializer.hh"
@@ -40,41 +26,191 @@ class abstract_type;
class collection_type_impl;
class atomic_cell_or_collection;
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
using atomic_cell_value = managed_bytes;
template <mutable_view is_mutable>
using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;
using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;
using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {
auto out_view = managed_bytes_mutable_view(out);
out_view.remove_prefix(offset);
write<T>(out_view, val);
}
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value& out, unsigned offset, T val) {
auto out_view = atomic_cell_value_mutable_view(out);
set_field(out_view, offset, val);
}
template <FragmentRange Buffer>
static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
for (auto frag : value) {
write_fragmented(v, single_fragmented_view(frag));
}
}
template <typename T, FragmentedView Input>
requires std::is_trivial_v<T>
static T get_field(Input in, unsigned offset = 0) {
in.remove_prefix(offset);
return read_simple<T>(in);
}
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int64_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 8;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 8;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(atomic_cell_value_view cell) {
return cell.front() & COUNTER_UPDATE_FLAG;
}
static bool is_live(atomic_cell_value_view cell) {
return cell.front() & LIVE_FLAG;
}
static bool is_live_and_has_ttl(atomic_cell_value_view cell) {
return cell.front() & EXPIRY_FLAG;
}
static bool is_dead(atomic_cell_value_view cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(atomic_cell_value_view cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template <mutable_view is_mutable>
static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {
auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static atomic_cell_value_view value(managed_bytes_view cell) {
return do_get_value(cell);
}
static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(atomic_cell_value_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int64_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
return b;
}
template <mutable_view is_mutable>
friend class basic_atomic_cell_view;
friend class atomic_cell;
};
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
managed_bytes_basic_view<is_mutable> _view;
friend class atomic_cell;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
void set_view(managed_bytes_basic_view<is_mutable> v) {
_view = v;
}
basic_atomic_cell_view() = default;
explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }
friend class atomic_cell_or_collection;
public:
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return _view.is_counter_update();
return atomic_cell_type::is_counter_update(_view);
}
bool is_live() const {
return _view.is_live();
return atomic_cell_type::is_live(_view);
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -83,73 +219,69 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return _view.is_expiring();
return atomic_cell_type::is_live_and_has_ttl(_view);
}
bool is_dead(gc_clock::time_point now) const {
return !is_live() || has_expired(now);
return atomic_cell_type::is_dead(_view) || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return _view.timestamp();
return atomic_cell_type::timestamp(_view);
}
void set_timestamp(api::timestamp_type ts) {
_view.set_timestamp(ts);
atomic_cell_type::set_timestamp(_view, ts);
}
// Can be called on live cells only
data::basic_value_view<is_mutable> value() const {
return _view.value();
atomic_cell_value_basic_view<is_mutable> value() const {
return atomic_cell_type::value(_view);
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
return atomic_cell_type::value(_view).size();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return _view.counter_update_value();
return atomic_cell_type::counter_update_value(_view);
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? _view.deletion_time() : expiry() - ttl();
return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return _view.expiry();
return atomic_cell_type::expiry(_view);
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return _view.ttl();
return atomic_cell_type::ttl(_view);
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _view.serialize();
managed_bytes_view serialize() const {
return _view;
}
};
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
atomic_cell_view(managed_bytes_view v)
: basic_atomic_cell_view(v) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
atomic_cell_view(basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) {}
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {
return atomic_cell_view(v);
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {
return atomic_cell_view(managed_bytes_view(v));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
@@ -164,11 +296,11 @@ public:
};
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
atomic_cell_mutable_view(managed_bytes_mutable_view data)
: basic_atomic_cell_view(data) {}
public:
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {
return atomic_cell_mutable_view(v);
}
friend class atomic_cell;
@@ -177,26 +309,31 @@ public:
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
managed_bytes _data;
atomic_cell(managed_bytes b) : _data(std::move(b)) {
set_view(_data);
}
public:
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {
set_view(_data);
}
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&& o) {
_data = std::move(o._data);
set_view(_data);
return *this;
}
operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -208,6 +345,8 @@ public:
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -224,6 +363,13 @@ public:
return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);
}
}
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const managed_bytes_view& value, ttl_opt ttl, collection_member cm = collection_member::no) {
if (!ttl) {
return make_live(type, timestamp, value, cm);
} else {
return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);
}
}
static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
@@ -237,7 +383,7 @@ public:
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
std::strong_ordering compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
void merge_column(const abstract_type& def,
atomic_cell_or_collection& old,
const atomic_cell_or_collection& neww);
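
The serialized layout documented for atomic_cell_type (flags byte, timestamp, optional expiry and ttl, then the value) can be illustrated with a small standalone sketch. This is only an approximation under stated assumptions: it uses a plain std::vector<uint8_t> in place of managed_bytes, the names `buffer`, `put_field` and `value_offset_of` are hypothetical, and fields are copied in host byte order, which may differ from what the real write<T>() helper does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace layout_sketch {

using buffer = std::vector<std::uint8_t>;  // stand-in for managed_bytes

constexpr std::uint8_t live_flag = 0x01;
constexpr std::uint8_t expiry_flag = 0x02;

// Offsets mirror the constants declared in atomic_cell_type.
constexpr std::size_t flags_size = 1;
constexpr std::size_t timestamp_offset = flags_size;
constexpr std::size_t timestamp_size = 8;
constexpr std::size_t expiry_offset = timestamp_offset + timestamp_size;
constexpr std::size_t expiry_size = 8;
constexpr std::size_t deletion_time_offset = timestamp_offset + timestamp_size;
constexpr std::size_t deletion_time_size = 8;
constexpr std::size_t ttl_offset = expiry_offset + expiry_size;
constexpr std::size_t ttl_size = 4;

// Copies a trivially copyable field into the buffer (host byte order; assumption).
template <typename T>
void put_field(buffer& b, std::size_t offset, T val) {
    std::memcpy(b.data() + offset, &val, sizeof(T));
}

template <typename T>
T get_field(const buffer& b, std::size_t offset) {
    T val;
    std::memcpy(&val, b.data() + offset, sizeof(T));
    return val;
}

// <dead> := <flags = 0><timestamp><deletion_time>
inline buffer make_dead(std::int64_t timestamp, std::int64_t deletion_time) {
    buffer b(flags_size + timestamp_size + deletion_time_size);
    b[0] = 0;
    put_field(b, timestamp_offset, timestamp);
    put_field(b, deletion_time_offset, deletion_time);
    return b;
}

// <live> without TTL := <flags><timestamp><value>
inline buffer make_live(std::int64_t timestamp, const buffer& value) {
    const std::size_t value_offset = flags_size + timestamp_size;
    buffer b(value_offset + value.size());
    b[0] = live_flag;
    put_field(b, timestamp_offset, timestamp);
    std::copy(value.begin(), value.end(),
              b.begin() + static_cast<std::ptrdiff_t>(value_offset));
    return b;
}

// <live> with TTL := <flags><timestamp><expiry><ttl><value>
inline buffer make_live(std::int64_t timestamp, const buffer& value,
                        std::int64_t expiry, std::int32_t ttl) {
    const std::size_t value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
    buffer b(value_offset + value.size());
    b[0] = static_cast<std::uint8_t>(live_flag | expiry_flag);
    put_field(b, timestamp_offset, timestamp);
    put_field(b, expiry_offset, expiry);
    put_field(b, ttl_offset, ttl);
    std::copy(value.begin(), value.end(),
              b.begin() + static_cast<std::ptrdiff_t>(value_offset));
    return b;
}

// Where the value starts in a live cell, derived from the flags byte
// in the same way do_get_value() does.
inline std::size_t value_offset_of(const buffer& cell) {
    const bool has_expiry = (cell[0] & expiry_flag) != 0;
    return flags_size + timestamp_size + (has_expiry ? expiry_size + ttl_size : 0);
}

} // namespace layout_sketch
```

With these definitions a live cell holding a 3-byte value occupies 12 bytes without a TTL and 24 bytes with one, and a dead cell always occupies 17 bytes.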

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -52,9 +39,7 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
::feed_hash(h, counter_cell_view(cell));
return;
}
if (cell.is_live_and_has_ttl()) {

@@ -1,22 +1,9 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -24,22 +11,15 @@
#include "atomic_cell.hh"
#include "collection_mutation.hh"
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
// FIXME: This has made us lose the small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now anyway.
// Measure the actual impact, because any attempt to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with it as is.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
managed_bytes _data;
private:
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
@@ -49,20 +29,16 @@ public:
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return bool(_data);
return !_data.empty();
}
static constexpr bool can_use_mutable_view() {
return true;
}
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
@@ -82,12 +58,3 @@ public:
};
friend std::ostream& operator<<(std::ostream&, const printer&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}
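
The header comment in this file describes a holder with an "empty" state that moved-from objects fall back to, and the hunk changes the emptiness check from `bool(_data)` to `!_data.empty()`. A minimal sketch of that contract follows, assuming `std::string` as a stand-in for `managed_bytes`; the `cell_holder` name and its members are hypothetical and not part of Scylla's API.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for atomic_cell_or_collection; std::string plays
// the role of managed_bytes in this sketch.
class cell_holder {
    std::string _data;
public:
    cell_holder() = default;
    explicit cell_holder(std::string data) : _data(std::move(data)) {}

    // Clear the source explicitly: the standard only promises a
    // "valid but unspecified" state for a moved-from std::string.
    cell_holder(cell_holder&& other) noexcept : _data(std::move(other._data)) {
        other._data.clear();
    }
    cell_holder& operator=(cell_holder&& other) noexcept {
        if (this != &other) {
            _data = std::move(other._data);
            other._data.clear();
        }
        return *this;
    }

    // True when the holder contains serialized bytes.
    explicit operator bool() const noexcept { return !_data.empty(); }

    void swap(cell_holder& other) noexcept { _data.swap(other._data); }
};
```

With this shape, a default-constructed holder and a moved-from holder both test as `false`, which is the behavior the comment promises.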


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "auth/allow_all_authenticator.hh"


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "auth/allow_all_authorizer.hh"


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#include "auth/authenticated_user.hh"


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#pragma once


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2018 ScyllaDB
* Copyright (C) 2018-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "auth/authentication_options.hh"


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2018 ScyllaDB
* Copyright (C) 2018-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#include "auth/authenticator.hh"


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#pragma once
@@ -47,7 +19,6 @@
#include <stdexcept>
#include <unordered_map>
#include <boost/any.hpp>
#include <seastar/core/enum.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
@@ -75,6 +46,8 @@ class authenticated_user;
///
class authenticator {
public:
using ptr_type = std::unique_ptr<authenticator>;
///
/// The name of the key to be used for the user-name part of password authentication with \ref authenticate.
///


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#pragma once
@@ -91,6 +63,8 @@ public:
///
class authorizer {
public:
using ptr_type = std::unique_ptr<authorizer>;
virtual ~authorizer() = default;
virtual future<> start() = 0;


@@ -1,31 +1,19 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "auth/common.hh"
#include <seastar/core/shared_ptr.hh>
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "database.hh"
#include "replica/database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "timeout_config.hh"
@@ -34,9 +22,9 @@ namespace auth {
namespace meta {
constexpr std::string_view AUTH_KS("system_auth");
constexpr std::string_view USERS_CF("users");
constexpr std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
constinit const std::string_view AUTH_KS("system_auth");
constinit const std::string_view USERS_CF("users");
constinit const std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
}
@@ -45,15 +33,13 @@ static logging::logger auth_log("auth");
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func) {
struct empty_state { };
return delay_until_system_ready(as).then([&as, func = std::move(func)] () mutable {
return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {
return func().then_wrapped([] (auto&& f) -> std::optional<empty_state> {
if (f.failed()) {
auth_log.debug("Auth task failed with error, rescheduling: {}", f.get_exception());
return { };
}
return { empty_state() };
});
return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {
return func().then_wrapped([] (auto&& f) -> std::optional<empty_state> {
if (f.failed()) {
auth_log.debug("Auth task failed with error, rescheduling: {}", f.get_exception());
return { };
}
return { empty_state() };
});
}).discard_result();
}
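
The hunk above keeps the retry loop but drops the fixed start-up delay: the task is re-run until it succeeds, with an exception treated as "try again". A blocking, standard-library-only sketch of the same idea follows. The names `retry_until_value` and `run_until_success` are hypothetical, and the doubling policy is an assumption rather than a description of `exponential_backoff_retry`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <stdexcept>
#include <thread>
#include <utility>

// Hypothetical blocking analogue of a do_until_value helper: call `attempt`
// until it yields a value, doubling the pause after each failure and
// capping it at `max_delay`.
template <typename T>
T retry_until_value(std::chrono::milliseconds base_delay,
                    std::chrono::milliseconds max_delay,
                    const std::function<std::optional<T>()>& attempt) {
    auto delay = base_delay;
    for (;;) {
        if (auto result = attempt()) {
            return std::move(*result);
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }
}

struct empty_state {};

// Mirrors the lambda in the hunk: an exception means "retry", success
// yields a placeholder value. `task` must tolerate repeated invocation.
inline void run_until_success(const std::function<void()>& task) {
    retry_until_value<empty_state>(
        std::chrono::milliseconds(1), std::chrono::milliseconds(8),
        [&task]() -> std::optional<empty_state> {
            try {
                task();
            } catch (...) {
                return std::nullopt;
            }
            return empty_state{};
        });
}
```

The `std::optional<empty_state>` return type is what lets a value-returning retry helper drive a task that produces no value of its own.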
@@ -63,10 +49,9 @@ static future<> create_metadata_table_if_missing_impl(
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) {
static auto ignore_existing = [] (seastar::noncopyable_function<future<>()> func) {
return futurize_invoke(std::move(func)).handle_exception_type([] (exceptions::already_exists_exception& ignored) { });
};
auto& db = qp.db();
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql);
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
@@ -81,9 +66,14 @@ static future<> create_metadata_table_if_missing_impl(
schema_builder b(schema);
b.set_uuid(uuid);
schema_ptr table = b.build();
return ignore_existing([&mm, table = std::move(table)] () {
return mm.announce_new_column_family(table);
});
if (!db.has_schema(table->ks_name(), table->cf_name())) {
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
try {
co_return co_await mm.announce(co_await mm.prepare_new_column_family_announcement(table, ts), std::move(group0_guard));
} catch (exceptions::already_exists_exception&) {}
}
}
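
The new code path checks for the schema first and still swallows `already_exists_exception`, since another node may create the table between the check and the announcement. A minimal single-process sketch of that check-then-create pattern follows; `catalog`, `already_exists`, and `create_table_if_missing` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical exception type standing in for already_exists_exception.
struct already_exists : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical in-memory catalog standing in for the schema registry.
class catalog {
    std::set<std::string> _tables;
public:
    bool has_table(const std::string& name) const {
        return _tables.count(name) != 0;
    }
    void create_table(const std::string& name) {
        if (!_tables.insert(name).second) {
            throw already_exists(name);
        }
    }
};

// Skip the work when the table is known to exist, and tolerate losing a
// race with a concurrent creator.
inline void create_table_if_missing(catalog& c, const std::string& name) {
    if (c.has_table(name)) {
        return;
    }
    try {
        c.create_table(name);
    } catch (const already_exists&) {
        // Someone else created it first; the end state is what we wanted.
    }
}
```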
future<> create_metadata_table_if_missing(
@@ -94,12 +84,12 @@ future<> create_metadata_table_if_missing(
return futurize_invoke(create_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db, seastar::abort_source& as) {
future<> wait_for_schema_agreement(::service::migration_manager& mm, const replica::database& db, seastar::abort_source& as) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db, &as] {
as.check();
return db.get_version() != database::empty_version;
return db.get_version() != replica::database::empty_version;
}, pause).then([&mm, &as] {
return do_until([&mm, &as] {
as.check();
@@ -108,7 +98,7 @@ future<> wait_for_schema_agreement(::service::migration_manager& mm, const datab
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
static const auto t = 30s;
@@ -116,7 +106,9 @@ const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
#endif
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
static thread_local ::service::client_state cs(::service::client_state::internal_tag{}, tc);
static thread_local ::service::query_state qs(cs, empty_service_permit());
return qs;
}
}
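
`internal_distributed_query_state()` now returns a reference to function-local state: the timeout configuration stays a plain `static const`, while the client and query state are `static thread_local`. A small sketch of that pattern follows, with hypothetical `timeouts` and `query_ctx` types standing in for `timeout_config` and `::service::query_state`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for timeout_config.
struct timeouts {
    std::chrono::seconds read;
    std::chrono::seconds write;
};

// Hypothetical stand-in for service::query_state.
struct query_ctx {
    const timeouts& limits;
    int queries_issued = 0;
    explicit query_ctx(const timeouts& t) : limits(t) {}
};

query_ctx& internal_query_ctx() noexcept {
    // Immutable configuration: one instance shared by every thread.
    static const timeouts tc{std::chrono::seconds(5), std::chrono::seconds(5)};
    // Mutable state: one lazily constructed instance per thread.
    static thread_local query_ctx ctx(tc);
    return ctx;
}
```

Repeated calls on the same thread return the same object, so callers can pass the reference straight into a query call without managing its lifetime.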


@@ -1,22 +1,9 @@
/*
* Copyright (C) 2017 ScyllaDB
* Copyright (C) 2017-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
@@ -35,10 +22,14 @@
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
#include "service/query_state.hh"
using namespace std::chrono_literals;
namespace replica {
class database;
}
class timeout_config;
namespace service {
@@ -54,9 +45,9 @@ namespace auth {
namespace meta {
constexpr std::string_view DEFAULT_SUPERUSER_NAME("cassandra");
extern const std::string_view AUTH_KS;
extern const std::string_view USERS_CF;
extern const std::string_view AUTH_PACKAGE_NAME;
extern constinit const std::string_view AUTH_KS;
extern constinit const std::string_view USERS_CF;
extern constinit const std::string_view AUTH_PACKAGE_NAME;
}
@@ -69,10 +60,6 @@ future<> once_among_shards(Task&& f) {
return make_ready_future<>();
}
inline future<> delay_until_system_ready(seastar::abort_source& as) {
return sleep_abortable(15s, as);
}
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
@@ -82,11 +69,11 @@ future<> create_metadata_table_if_missing(
std::string_view cql,
::service::migration_manager&) noexcept;
future<> wait_for_schema_agreement(::service::migration_manager&, const database&, seastar::abort_source&);
future<> wait_for_schema_agreement(::service::migration_manager&, const replica::database&, seastar::abort_source&);
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
::service::query_state& internal_distributed_query_state() noexcept;
}


@@ -1,42 +1,14 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
* Copyright (C) 2016-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#include "auth/default_authorizer.hh"
@@ -61,7 +33,8 @@ extern "C" {
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "database.hh"
#include "replica/database.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -103,7 +76,6 @@ future<bool> default_authorizer::any_granted() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
@@ -116,8 +88,7 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -136,13 +107,13 @@ future<> default_authorizer::migrate_legacy_metadata() const {
}
future<> default_authorizer::start() {
static const sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d",
static const sstring create_table = fmt::format(
"CREATE TABLE {}.{} ("
"{} text,"
"{} text,"
"{} set<text>,"
"PRIMARY KEY({}, {})"
") WITH gc_grace_seconds={}",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
@@ -160,7 +131,7 @@ future<> default_authorizer::start() {
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db(), _as).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().real_database(), _as).get0();
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
@@ -197,7 +168,6 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -226,7 +196,7 @@ default_authorizer::modify(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -251,7 +221,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -278,7 +248,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -298,7 +268,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -315,7 +284,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {

Some files were not shown because too many files have changed in this diff.