Commit Graph

34881 Commits

Author SHA1 Message Date
Anna Stuchlik
b8ff392e8b doc: add info - support for FIPS-compliant systems
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.

This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.

Closes #15415

(cherry picked from commit fb635dccaa)
2023-09-18 09:17:59 +03:00
Raphael S. Carvalho
a65e5120ab compaction: base compaction throughput on amount of data read
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.

A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.

Fixes #14533.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14615

(cherry picked from commit 3b1829f0d8)
2023-09-14 21:30:22 +03:00
Jan Ciolek
cd9458eeb1 cql.g: make the parser reject INSERT JSON without a JSON value
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```

When no JSON value is specified, the query should be rejected.

Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.

A unit test is added to prevent regressions.

Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

\Closes #14785

(cherry picked from commit cbc97b41d4)
2023-09-14 21:07:21 +03:00
Nadav Har'El
e917b874f9 test/alternator: fix flaky test test_ttl_expiration_gsi_lsi
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!

So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.

This patch also fixes a small bug which prevented this test from running
on DynamoDB.

Fixes #14495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14496

(cherry picked from commit 599636b307)
2023-09-14 20:43:51 +03:00
Pavel Emelyanov
a27c391cba Update seastar submodule
* seastar 85147cfd...872e0bc6 (3):
  > rpc: Abort server connection streams on stop
  > rpc: Do not register stream to dying parent
  > rpc: Fix client-side stream registration race

refs: #13100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 12:33:30 +03:00
Beni Peled
455ab99b6c release: prepare for 5.2.8 scylla-5.2.8 2023-08-31 22:02:06 +03:00
Michał Chojnowski
adcf296bcf reader_concurrency_semaphore: fix a deadlock between stop() and execution_loop()
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.

Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.

A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.

After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.

Fixes #15198

Closes #15199

(cherry picked from commit 2000a09859)
2023-08-31 08:13:09 +03:00
Calle Wilund
198297a08a generic_server: Handle TLS error codes indicating broken pipe
Fixes  #14625

In broken pipe detection, handle also TLS error codes.

Requires https://github.com/scylladb/seastar/pull/1729

Closes #14626

(cherry picked from commit 890f1f4ad3)
2023-08-29 15:38:21 +03:00
Botond Dénes
9a9b5b691d Update seastar submodule
* seastar 534cb38c...85147cfd (1):
  > tls: Export error_category instance used by tls + some common error codes

Refs: #14625
2023-08-29 15:37:24 +03:00
Alejo Sanchez
610b682cf4 gms, service: replicate live endpoints on shard 0
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.

Move lock to the callers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13240

(cherry picked from commit da00052ad8)
2023-08-29 12:28:00 +02:00
Kamil Braun
05f4640360 Merge 'api: gossiper: get alive nodes after reaching current shard 0 version' from Alecco
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.

Closes #12540

* github.com:scylladb/scylladb:
  api: gossiper: fix alive nodes
  gms, service: lock live endpoint copy
  gms, service: live endpoint copy method

(cherry picked from commit b919373cce)
2023-08-29 12:27:52 +02:00
Kefu Chai
8ed58c7dca sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)

Closes #15150
2023-08-25 10:13:19 +03:00
Petr Gusev
a83c0a8bbc test_secondary_index_collections: change insert/create index order
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.

The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.

The patch just moves the index creation statements
from the second part to the first.

Fixes: #14076

Closes #14090

(cherry picked from commit 0415ac3d5f)

Closes #15101
2023-08-24 14:09:08 +03:00
Botond Dénes
df71753498 Merge '[Backport 5.2] distributed_loader: process_sstable_dir: do not verify snapshots' from Benny Halevy
This mini-series backports the fix for #12010 along with low-risk patches it depends on.

Fixes: #12010

Closes #15137

* github.com:scylladb/scylladb:
  distributed_loader: process_sstable_dir: do not verify snapshots
  utils/directories: verify_owner_and_mode: add recursive flag
  utils: Restore indentation after previous patch
  utils: Coroutinize verify_owner_and_mode()
2023-08-23 15:50:29 +03:00
Benny Halevy
6588ecd66f distributed_loader: process_sstable_dir: do not verify snapshots
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
2023-08-23 13:19:55 +03:00
Benny Halevy
03640cc15b utils/directories: verify_owner_and_mode: add recursive flag
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
6d4d576460 utils: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2eb88945ea)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
96aca473b4 utils: Coroutinize verify_owner_and_mode()
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
2023-08-23 13:19:30 +03:00
Aleksandra Martyniuk
29e6dc8c1b compaction: do not swallow compaction_stopped_exception for reshape
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.

Rethrow an exception in perfrom task for reshape compaction.

Fixes: #15058.

(cherry picked from commit e0ce711e4f)

Closes #15122
2023-08-23 12:11:58 +03:00
Vlad Zolotarov
9a414d440d scylla_raid_setup: make --online-discard argument useful
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.

Fixes #14963

Closes #14964

(cherry picked from commit e13a2b687d)
2023-08-22 10:40:37 +03:00
Anna Mikhlin
e0ebc95025 release: prepare for 5.2.7 scylla-5.2.7 2023-08-21 14:44:56 +03:00
Botond Dénes
b7ab42b61c Merge 'Ignore no such column family in repair' from Aleksandra Martyniuk
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.

When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.

Fixes: #13045

Closes #13068

* github.com:scylladb/scylladb:
  repair: fix indentation
  repair: continue user requested repair if no_such_column_family is thrown
  repair: add find_column_family_if_exists function

(cherry picked from commit 9859bae54f)
2023-08-20 19:49:21 +03:00
Botond Dénes
098baaef48 Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL  `counterasblob()` and `blobascounter()` functions (issue #14742).

As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.

Fixes #14501
Fixes #14742

Closes #14745

* github.com:scylladb/scylladb:
  test/cql-pytest: test confirming that casting to counter doesn't work
  cql: support casting of counter to other types
  cql: implement missing counterasblob() and blobascounter() functions
  cql: implement missing type functions for "counters" type

(cherry picked from commit a637ddd09c)
2023-08-13 14:53:48 +03:00
Nadav Har'El
e11561ef65 cql-pytest: translate Cassandra's tests for compact tables
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.

This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.

All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:

Already known issues:

Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
            restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k

New issues:

Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
             ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
             Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12816

(cherry picked from commit 328cdb2124)
2023-08-13 14:44:19 +03:00
Nadav Har'El
e03c21a83b cql-pytest: translate Cassandra's tests for CAST operations
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.

There are 13 tests, 9 of them currently xfail.

The failures are caused by one recently-discovered issue:

Refs #14501: Cannot Cast Counter To Double

and by three previously unknown or undocumented issues:

Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
             milliseconds
Refs #14522: Support CAST function not only in SELECT

Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14528

(cherry picked from commit f08bc83cb2)
2023-08-13 14:41:36 +03:00
Nadav Har'El
79b5befe65 test/cql-pytest: add tests for data casts and inf in sums
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:

1. A new test file, test_cast_data.py, for testing data casts (a
   CAST (...) as ... in a SELECT), starting with testing casts from
   varint to other types.

   The test uncovers a lot of interesting cases (it is heavily
   commented to explain these cases) but nothing there is wrong
   and all tests pass on Scylla.

2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
   that this caused #13551. In Cassandra and older Scylla, the sum
   returned a NaN. In Scylla today, it generates a misleading
   error message.

As usual, the tests were run on both Cassandra (4.1.1) and Scylla.

Refs #13551.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
2023-08-13 14:40:08 +03:00
Petr Gusev
aca9e41a44 topology.cc: remove_endpoint: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.

This is a backport of bcb1d7c to branch-5.2.

Refs: #14184

Closes #14893
2023-08-11 14:29:37 +03:00
Pavel Emelyanov
ff22807ed2 Update seastar submodule
* seastar 29a0e645...534cb38c (1):
  > rpc: Abort connection if send_entry() fails

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-09 11:30:57 +03:00
Botond Dénes
bcb8f6a8dd Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one

(cherry picked from commit a8feb7428d)
2023-08-09 10:20:53 +03:00
Kefu Chai
9ce3695a0d compaction_manager: prevent gc-only sstables from being compacted
before this change, there are chances that the temporary sstables
created for collecting the GC-able data create by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.

in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have good chance to
overlap the latter. these GC-only SSTables are assigned with
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.

so, in this change,

* to prevent sstables in the transient stage from being picked
  up by regular compactions, a new interface class is introduced
  so that the sstable is always added to registration before
  it is added to sstable set, and removed from registration after
  it is removed from sstable set. the struct helps to consolidate
  the regitration related logic in a single place, and helps to
  make it more obvious that the timespan of an sstable in
  the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
  overlap with the output sstable run. the run_id for the
  gc sstable run is created only when the gc sstable writer
  is created. because the gc sstables is not always created
  for all compactions.

please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` passes a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.

Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14725

(cherry picked from commit fdf61d2f7c)

Closes #14827
2023-08-04 09:59:10 +03:00
Patryk Jędrzejczak
4cd5847761 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4ad7)
2023-08-02 18:05:39 +02:00
Botond Dénes
2b7f1cd906 Update tools/java submodule
* tools/java 83b2168b19...f8f556d802 (1):
  > Use EstimatedHistogram in metricPercentilesAsArray

Fixes: #10089
2023-07-31 12:13:01 +03:00
Nadav Har'El
e34c62c567 Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()

(cherry picked from commit 056d04954c)
2023-07-31 03:43:44 -04:00
Nadav Har'El
992c50173a Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved.
We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT

(cherry picked from commit 49c8c06b1b)
2023-07-31 09:14:55 +03:00
Beni Peled
58acf071bf release: prepare for 5.2.6 scylla-5.2.6 2023-07-30 14:19:28 +03:00
Raphael S. Carvalho
d2369fc546 cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`

cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818

(cherry picked from commit 050ce9ef1d)
2023-07-28 13:56:28 +02:00
Kamil Braun
6273c4df35 test: use correct timestamp resolution in test_group0_history_clearing_old_entries
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.

Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.

Fix it by using microseconds everywhere.

Fixes #14653

Closes #14670

(cherry picked from commit 9d4b3c6036)
2023-07-27 15:46:37 +02:00
Raphael S. Carvalho
752984e774 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813

(cherry picked from commit 0ac43ea877)
2023-07-26 14:30:32 +03:00
Raphael S. Carvalho
986491447b table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1d8cb32a5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14764
2023-07-20 16:46:15 +03:00
Takuya ASADA
a05bb26cd6 scylla_fstrim_setup: start scylla-fstrim.timer on setup
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.

Also, unmask is unnecessary for enabling the timer.

Fixes #14249

Closes #14252

(cherry picked from commit c70a9cbffe)

Closes #14421
2023-07-18 16:05:09 +03:00
Michał Chojnowski
41aef6dc96 partition_snapshot_reader.hh: fix iterator invalidation in do_refresh_state
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.

Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.

Fixes #14696

Closes #14697
2023-07-17 14:20:37 +02:00
Botond Dénes
aa5e904c40 repair: Release permit earlier when the repair_reader is done
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0414)
2023-07-14 18:18:43 +03:00
Marcin Maliszkiewicz
eff2fe79b1 alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454

(cherry picked from commit 6424dd5ec4)
2023-07-13 23:27:46 +03:00
Nadav Har'El
ee8b26167b Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation

(cherry picked from commit d2e089777b)
2023-07-13 23:27:38 +03:00
Yaron Kaikov
02bc54d4b6 release: prepare for 5.2.5 scylla-5.2.5 2023-07-13 14:23:18 +03:00
Avi Kivity
c9a5c4c876 Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types

(cherry picked from commit a7c2c9f92b)
2023-07-12 15:31:48 +03:00
Tomasz Grabiec
1a6f4389ae Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14649
2023-07-12 10:09:56 +03:00
Calle Wilund
1088c3e24a storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14636
2023-07-12 09:24:56 +03:00
Botond Dénes
c9cb8dcfd0 Merge '[backport 5.2] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Backported from c25201c1a3. `view_build_test.cc` needed some tiny adjustments for the backport.

Closes #14619
Fixes #14503

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-11 15:04:23 +03:00
Takuya ASADA
91c1feec51 scylla_raid_setup: wipe filesystem signatures from specified disks
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip to populate /dev/disk/by-* since it
detected two filesystem signatures, because mkfs.xxx did not cleared previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.

Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple time on same disks).

Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.

Fixes #13737

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #13738

(cherry picked from commit fdceda20cc)
2023-07-11 15:00:03 +03:00