Commit Graph

34795 Commits

Author SHA1 Message Date
Tomasz Grabiec
cec9b2d114 mutation_partition_v2: Implement compact()
For convenience, will be used in unit tests.
2023-01-27 21:56:31 +01:00
Tomasz Grabiec
4317999ca4 cache_tracker: Extract insert(mutation_partition_v2&) 2023-01-27 21:56:31 +01:00
Tomasz Grabiec
c7f7377ea3 mvcc, mutation_partition: Document guarantees in case merging succeeds
It's not obvious that invariants for partial merge do not hold for a
completed merge.

This is due to the fact that an empty source partition, which is
always empty after merge, is always fully continuous.
2023-01-27 21:56:31 +01:00
Tomasz Grabiec
8ae78ffebd mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
Will be useful in testing to exhaustivaly test preemption scenarios.
2023-01-27 21:56:31 +01:00
Tomasz Grabiec
8883ac30cf mutation_partition_v2: Simplify get_continuity() 2023-01-27 21:56:31 +01:00
Tomasz Grabiec
d9e27abe87 row_cache: Distinguish dummy insertion site in trace log 2023-01-27 21:56:31 +01:00
Tomasz Grabiec
026f8cc1e7 db: Use mutation_partition_v2 in mvcc
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes #2578
Fixes #3288
Fixes #10587
2023-01-27 21:56:28 +01:00
Tomasz Grabiec
ccf3a13648 range_tombstone_change_merger: Introduce peek()
Returns the current tombstone without affecting state.
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
42f5a7189d readers: Extract range_tombstone_change_merger 2023-01-27 19:15:39 +01:00
Tomasz Grabiec
6b7473be53 mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
This is a prerequisite for using the cursor in memtable readers.

Non-evictable snapshots are those which live in memtables. Unlike
evictable snapshots, they don't have a dummy entry at position after
all clustering rows. In evictable snapshots, lookup always finds an
entry, not so with non-evictable snapshots. The cursor was not
prepared for this case, this patch handles it.
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
091ad8f6ee mvcc: partition_snapshot_row_cursor: Support digest calculation
Prerequisite for using in memtable reader.
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
195b40315a mutation_partition_v2: Store range tombstones together with rows
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs #10587
Refs #3288
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
7e6056b3cc db: Introduce mutation_partition_v2
Intended to be used in memtable/cache, as opposed to the old
mutation_partition which will be intended to be used as temporary
object.

The two will have different trade-offs regarding memory efficiency and
algorithms.

In this commit there is no change in logic, the class is mostly
copied. Some methods which are not needed on the v2 model were removed
from the interface.

Logic changes will be introduced in later commits.
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
806f698272 doc: Introduce docs/dev/mvcc.md
This extracts information which was there in row_cache.md, but is
relevant to MVCC in general.

It also makes adaptations and reflects the upcoming changes in this
series related to switching to the new mutation_partition_v2 model:

 - continuity in evictable snapshots can now overlap. This is needed
   to represent range tombstone information, which is linked to
   continuity information.

 - description of range tombstone representation was added
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
27882ff19e db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU 2023-01-27 19:15:39 +01:00
Tomasz Grabiec
a574a1cc4e db: Print range_tombstone bounds as position_in_partition
It's the standard now which replaced bound_view.

Will be consistent with how range tombstone bounds are represented in
mutation_partition_v2 (as rows_entry::position()).
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
40719c600c test: memtable_test: Relax test_segment_migration_during_flush
Partition version merging can now insert sentinels, which may
temporarily increase unspooled memory. It is no longer true that
unspooled monotonically decreases, which the test verified.  Relax it,
and only verify that unspooled is smaller than real dirty.
2023-01-27 19:15:39 +01:00
Tomasz Grabiec
31bcc3b861 test: cache_flat_mutation_reader: Avoid timestamp clash
api::new_timestamp() is not monotonic. In
test_single_row_and_tombstone_not_cached_single_row_range1, we
generate a deletion and an insertion in the deleted reange. If they
get the same timestamp, the inserted row will be covered.

This will surface after cache starts to compact rows with range tombstones.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
25683449e4 test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
When inserting range tombstones, the test uses api::new_timestamp(),
but when inserting rows, it uses a fixed timestamp of 1. This will be
problematic when rows get compacted with range tombstone, all rows
would get compacted away, which is not expected by the test. To fix
this, let's use the same timestamp source as range tombstones. This
way rows will get a later timestamp.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
71057412ed test: mvcc: Fix sporadic failures due to compact_for_compaction()
compact_for_compaction() will perform cell expiration based on
gc_clock::now(), which introduces sporadic mismatches due to expiry
status of a row marker.

Drop this, we can rely on compaction done by is_equal_to_compacted()
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
f908713290 test: lib: random_mutation_generator: Produce partition tombstone less often
This tombstone has a high chance of obliterating all data, which will
make tests which involve partition version merging not very
interesting. The result will be an empty partition with a
tombstone. Reduce its frequency, so that in MVCC there is a
significant chance of having live data in the combined entry where
individual versions come from the generator.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
3bf8052be4 test: lib: random_utils: Introduce with_probability() 2023-01-27 19:15:38 +01:00
Tomasz Grabiec
c386874e18 test: lib: Improve error message in has_same_continuity() 2023-01-27 19:15:38 +01:00
Tomasz Grabiec
08f68c5f20 test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker 2023-01-27 19:15:38 +01:00
Tomasz Grabiec
5aa8cb56a8 test: mvcc: Insert entries in the tracker
evictable snapshots must have all entries added to the
tracker. Partition version merging assumes this. Before this was
benign, but will start to trigger asserts in mutation_partition_v2.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
9d38997971 test: mvcc_test: Do not set dummy::no on non-clustering rows
This will trigger an assert in apply_monotonically() later in the
series, where this row would be merged with a dummy at the same
position. This row must not be marked as non-dummy, there is an
assumption that non-clustering positions are all dummies. There can't
be two entries with the same position an a different dummy status.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
f79072638d mutation_partition: Print full position in error report in append_clustered_row()
std::prev(i) can be dummy.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
6a305666a4 db: mutation_cleaner: Extract make_region_space_guard()
Will be used in more places.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
833e2a8d30 position_in_partition: Optimize equality check
We can avoid key comparsion if bound weights don't match.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
95b509afcd mvcc: Fix version merging state resetting
Upon entry to merge_partition_versions() we skip over versions which
are not referenced in order to start merging from the oldest
unreferenced version, which is good for performance. Later, we
reallocate version merging state if we detected such a move, so that
we don't reuse state allocated for a different version pair than
before. This check was using version_no, the counter of skipped
versions to detect this. But this only makes sense if each
merge_partition_versions() uses the same version pointer as a base. In
fact it doesn't, if we skip, we advance _version, so the skip is
persisted in the snapshot. It's enough to discard the version merging
state when we do that.

This shouldn't have effect on existing code base, since there is
currently no way to trigger the version skipping loggic.
2023-01-27 19:15:38 +01:00
Tomasz Grabiec
1c4b5b0b6b mutation_partition: apply_resume: Mark operator bool() as explicit 2023-01-27 19:15:38 +01:00
Kamil Braun
5eadea301e Merge 'pytest: start after ungraceful stop' from Alecco
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.

Fixes #12218

Closes #12630

* github.com:scylladb/scylladb:
  pytest: test start after ungraceful stop
  test.py: enable force_schema_commit_log
2023-01-26 12:08:33 +01:00
Botond Dénes
ba26770376 tools/schema_loader: data_dictionary_impl:try_find_table(): also check ks name
Although the number of keyspaces should mostly be 1 here, and thus the
chance of two tables from different keyspaces colliding is miniscule, it
is not zero. Better be safe than sorry, so match the keyspace name too
when looking up a table.

Closes #12627
2023-01-25 22:04:07 +02:00
Raphael S. Carvalho
87ee547120 table: Fix quadratic behavior when inserting sstables into tracker on schema change
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of backlog which complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.

Fixes #12499.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12593
2023-01-25 21:57:33 +02:00
Botond Dénes
bdd4b25c61 scylla-gdb.py: scylla memory: remove 'sstable reads' from semaphore names
This phrase is inaccurate and unnecessary. We know all lines in the
printout are for reads and they are semaphores: no need to repeat this
information on each line.
Example:

  Read Concurrency Semaphores:
    read:              0/100,             0/     41901096, queued: 0
    streaming:         0/ 10,             0/     41901096, queued: 0
    system:            0/ 10,             0/     41901096, queued: 0

Closes #12633
2023-01-25 21:55:27 +02:00
Nadav Har'El
f4f2d608d7 dbuild: fix path in example in README
The dbuild README has an example how to enable ccache, and required
modifying the PATH. Since recently, our docker image includes
required commands (cxxbridge) in /usr/local/bin, so the build will
fail if that directory isn't also in the path - so add it in the
example.

Also use the opportunity to fix the "/home/nyh" in one example to
"$HOME".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12631
2023-01-25 21:54:44 +02:00
Kamil Braun
a0ff33e777 test/pylib: scylla_cluster: don't leak server if stopping it fails
`ScyllaCluster.server_stop` had this piece of code:
```
        server = self.running.pop(server_id)
        if gracefully:
            await server.stop_gracefully()
        else:
            await server.stop()
        self.stopped[server_id] = server
```

We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).

Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.

Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.

Closes #12613
2023-01-25 16:58:02 +02:00
Alejo Sanchez
878cb45c24 pytest: test start after ungraceful stop
Test case for a start of a server after it was stopped suddenly (instead
of gracefully). This coud cause commitlog flush issues.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-25 14:49:27 +01:00
Alejo Sanchez
ccbd89f0cd test.py: enable force_schema_commit_log
To handle start after ungraceful stop, enable separate schema commit log
from server start.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-25 14:49:27 +01:00
Kamil Braun
5c886e59de Merge 'Enable Raft by default in new clusters' from Kamil Braun
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.

People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.

Also update the docs: cluster creation and node addition procedures.

Fixes #12572.

Closes #12585

* github.com:scylladb/scylladb:
  docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
  conf: enable `consistent_cluster_management` by default
2023-01-25 14:09:38 +01:00
Benny Halevy
82011fc489 dht: incremental_owned_ranges_checker: belongs_to_current_node: mark as const
Its _it member keeps state about the current range.
Although it's modified by the method, this is an implementation
detail that irrelevant to the caller, hence mark the
belongs_to_current_node method as const (and noexcept while
at it).

This allows the caller, cleanup_compaction, to use it from
inside a const method, without having to mark
its respective member as mutable too.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12634
2023-01-25 14:52:21 +02:00
Alexey Novikov
ce96b472d3 prevent populating cache with expired rows from sstables
change row purge condition for compacting_reader to remove all expired
rows to avoid read perfomance problems when there are many expired
tombstones in row cache

Refs #2252

Closes #12565
2023-01-25 12:59:40 +01:00
Kamil Braun
5bc7f0732e Merge 'test.py: manual cluster pool handling for Python suite' from Alecco
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.

Closes #12612

* github.com:scylladb/scylladb:
  test.py: manual cluster handling for PythonSuite
  test.py: stop cluster if PythonSuite fails to start
  test.py: minor fix for failed PythonSuite test
2023-01-24 17:37:55 +01:00
Nadav Har'El
b28818db06 Merge 'Make regexes in types.cc static and remove unnecessary tolower transform' from Marcin Maliszkiewicz
- makes all regexes static

If making regex compilation static
for uuid_type_impl and timeuuid_type_impl helps then it should
also help for timestamp_type and simple_date_type.

-  remove unnecessary tolower transform in simple_date_type_impl::from_sstring

Following function uses only decimal and '-' characters (see date_re). They are not
affected by tolower call in any way.

Aditionally std::strtoll supports "0x" prefixes but also accepts
upper case version "0X" so it's also not affected by tolower call.

get_simple_date_time only casts strings to integer types using
boost:lexical_cast so also not affected by tolower.

Finally, serialize only uses str to include it in an exception text
so tolower doesn't affect it in a positive way. It's even better
that input is displayed to the user as it was, not converted to lower
case.

Closes #12621

* github.com:scylladb/scylladb:
  types: remove unnecessary tolower transform in simple_date_type_impl::from_sstring
  types: make all regexes static
2023-01-24 16:13:59 +02:00
Pavel Emelyanov
f6e8b64334 snitch: Use set_my_dc_and_rack() on all shards
Most of snitch drivers set _my_dc and _my_rack with direct assignment
thus skipping the sanity checks for dc/rack being empty. On other shards
they call set_my_dc_and_rack() helper which warns the empty value and
replaces it with some defaults.

It's better to use the helper on all shards in order to have the same
dc/rack values everywhere.

refs: #12185

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12524
2023-01-24 14:17:06 +02:00
Nadav Har'El
55558e1bd7 test/alternator: check operation on invalid TableName
Issue #12538 suggested that maybe Alternator shouldn't bother reporting an
invalid table name in item operations like PutItem, and that it's enough
to report that the table doesn't exist. But the test added in this patch
shows that DynamoDB, like Alternator, reports the invalid table name in
this case - not just that the table doesn't exist.

That should make us think twice before acting on issue #12538. If we do
what this issue recommended, this test will need to be fixed (e.g., to
accept as correct both types of errors).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12608
2023-01-24 14:14:39 +02:00
Kefu Chai
4a0134a097 db: system_keyspace: take the reserved_memory into account
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.

after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.

Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12573
2023-01-24 14:07:44 +02:00
Anna Stuchlik
3cbe657b24 doc: fixes https://github.com/scylladb/scylla-docs/issues/3706, v2 of https://github.com/scylladb/scylladb/pull/11638, add a note about performance penalty in non-frozen connections vs frozen connections and UDT, add a link to the blog post about performance
Closes #12583
2023-01-24 13:16:58 +02:00
Alejo Sanchez
f236d518c6 test.py: manual cluster handling for PythonSuite
Instead of complex async with logic, use manual cluster pool handling.

Revert the discard() logic in Pool from a recent commit.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-24 11:38:17 +01:00
Alejo Sanchez
a6059e4bb7 test.py: stop cluster if PythonSuite fails to start
If cluster fails to start, stop it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-24 11:36:49 +01:00