Commit Graph

34149 Commits

Author SHA1 Message Date
Tomasz Grabiec
a46b2e4e4c Merge 'Make node replace procedure work with Raft' from Kamil Braun
We need to obtain the Raft ID of the replaced node during the shadow round and
place it in the address map. It won't be placed by the regular gossiping route
if we're replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it is not
guaranteed that background gossiping manages to update the address map before we
need it, especially in tests where we set ring_delay to 0 and disable
wait_for_gossip_to_settle. The shadow round, on the other hand, performs a
synchronous request (and if it fails during bootstrap, bootstrap will fail -
because we also won't be able to obtain the tokens and Host ID of the replaced
node).

Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.

Also remove an unconditional 60 seconds sleep from the replace code. Make it
dependent on ring_delay.

Enable the replace tests.

Modify some code related to removing servers from group 0 which depended on
storing IP addresses in the group 0 configuration.

Closes #12172

* github.com:scylladb/scylladb:
  test/topology: enable replace tests
  service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
  service: handle replace correctly with Raft enabled
  gms/gossiper: fetch RAFT_SERVER_ID during shadow round
  service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
2022-12-07 15:30:27 +01:00
Pavel Emelyanov
9bdea110a6 code: Reduce fanout of sstables(_manager)?.hh over headers
This change removes sstables.hh from some other headers replacing it
with version.hh and shared_sstable.hh. Also this drops
sstables_manager.hh from some more headers, because this header
propagates sstables.hh via self. That change is pretty straightforward,
but has a ricochet in database.hh that needs disk-error-handler.hh.

Without the patch, touching sstables/sstable.hh recompiles 409 targets;
with the patch, 299 targets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12222
2022-12-07 14:34:19 +02:00
Botond Dénes
57a4971962 Merge 'dirty_memory_manager: tidy up' from Avi Kivity
Tidy up namespaces, move code to the right file, and
move the whole thing to the replica module where it
belongs.

Closes #12219

* github.com:scylladb/scylladb:
  dirty_memory_manager: move implementation from database.cc
  dirty_memory_manager: move to replica module
  test: dirty_memory_manager_test: disambiguate classes named 'test_region_group'
  dirty_memory_manager: stop using using namespace
2022-12-07 14:25:59 +02:00
Avi Kivity
f7f5700289 dirty_memory_manager: move implementation from database.cc
A few leftover method implementations were left in database.cc
when dirty_memory_manager.cc was created. Move them to their
correct place now.
2022-12-06 22:28:54 +02:00
Avi Kivity
444de2831e dirty_memory_manager: move to replica module
It's a replica-side thing, so move it there. The related
flush_permit and sstable_write_permit are moved alongside.
2022-12-06 22:24:17 +02:00
Avi Kivity
a038a35ad6 test: dirty_memory_manager_test: disambiguate classes named 'test_region_group'
There are two similarly named classes: ::test_region_group and
dirty_memory_manager_logalloc::test_region_group. Rename the
former to ::raii_region_group (that's what it's for) and the
latter to ::test_region_group, to reduce confusion.
2022-12-06 22:20:38 +02:00
Avi Kivity
dfdae5ffa9 dirty_memory_manager: stop using using namespace
`using namespace` is pretty bad, especially in a header, as it
pollutes the namespace for everyone. Stop using it and qualify
names instead.
2022-12-06 21:37:38 +02:00
Avi Kivity
47a8fad2a2 Merge 'scylla-types: add serialize action' from Botond Dénes
Serializes the value that is an instance of a type. The opposite of `deserialize` (previously known as `print`).
All other actions operate on serialized values, yet up to now we were missing a way to go from human readable values to serialized ones. This prevented for example using `scylla types tokenof $pk` if one only had the human readable key value.
Example:

```
$ scylla types serialize -t Int32Type -- -1286905132
b34b62d4
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b 16
0010d00819896f6b11ea00000000001c571b000400000010
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b
0010d00819896f6b11ea00000000001c571b
```
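
The example outputs above are consistent with a 4-byte big-endian encoding for Int32Type and, for `--prefix-compound`, a 16-bit big-endian length in front of each component. A minimal sketch under those assumptions follows; the names `to_hex`, `serialize_int32` and `prefix_compound` are hypothetical and are not the scylla-types implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using bytes = std::vector<uint8_t>;

// Hex-encode a byte sequence: lowercase, no separators.
std::string to_hex(const bytes& b) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(b.size() * 2);
    for (uint8_t v : b) {
        out.push_back(digits[v >> 4]);
        out.push_back(digits[v & 0x0f]);
    }
    return out;
}

// Assumed Int32Type layout: 4 bytes, big-endian two's complement.
bytes serialize_int32(int32_t v) {
    const uint32_t u = static_cast<uint32_t>(v);
    return {uint8_t(u >> 24), uint8_t(u >> 16), uint8_t(u >> 8), uint8_t(u)};
}

// Assumed prefix-compound layout: every component is preceded by its
// length as a 16-bit big-endian integer; nothing follows the last one.
bytes prefix_compound(const std::vector<bytes>& components) {
    bytes out;
    for (const bytes& c : components) {
        assert(c.size() <= 0xffff);
        out.push_back(uint8_t(c.size() >> 8));
        out.push_back(uint8_t(c.size() & 0xff));
        out.insert(out.end(), c.begin(), c.end());
    }
    return out;
}
```

Under these assumptions, `to_hex(serialize_int32(-1286905132))` yields `b34b62d4`, and a compound of the 16 raw TimeUUID bytes followed by `serialize_int32(16)` would have the `0010...0004...` shape shown in the second example.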

Closes #12029

* github.com:scylladb/scylladb:
  docs: scylla-types.rst: add mention of per-operation --help
  tools/scylla-types: add serialize operation
  tools/scylla-types: prepare for action handlers with string arguments
  tools/scylla-types: s/print/deserialize/ operation
  docs: scylla-types.rst: document tokenof and shardof
  docs: scylla-types.rst: fix typo in compare operation description
2022-12-06 19:27:15 +02:00
Nadav Har'El
f275bfd57b Update CODEOWNERS file
Update the CODEOWNERS file with some people who joined different parts
of the project, and one person that left.

Note that despite its name, CODEOWNERS does not list "ownership" in any
strict sense of the word - it is more about who is willing and/or
knowledgeable enough to participate in reviewing changes to particular
files or directories. Github uses this file to automatically suggest
who should review a pull request.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12216
2022-12-06 19:26:03 +02:00
Benny Halevy
5007ded2c1 view: row_lock: lock_ck: serialize partition and row locking
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:

    1. lock_pk acquires an exclusive lock on the partition.
    2.a lock_ck attempts to acquire shared lock on the partition
        and any lock on the row. both cases currently use a fiber
        returning a future<rwlock::holder>.
    2.b since the partition is locked, the lock_partition times out
        returning an exceptional future.  lock_row has no such problem
        and succeeds, returning a future holding a rwlock::holder,
        pointing to the row lock.
    3.a the lock_holder previously returned by lock_pk is destroyed,
        calling `row_locker::unlock`
    3.b row_locker::unlock sees that the partition is not locked
        and erases it, including the row locks it contains.
    4.a when_all_succeeds continuation in lock_ck runs.  Since
        the lock_partition future failed, it destroys both futures.
    4.b the lock_row future is destroyed with the rwlock::holder value.
    4.c ~holder attempts to return the semaphore units to the row rwlock,
        but the latter was already destroyed in 3.b above.

Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above.

This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.

This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.
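
The serialized order can be illustrated with a toy model that does not use Seastar. Everything below is hypothetical (`toy_rwlock`, `toy_row_locker`, and the use of a failed try-lock as a stand-in for a lock attempt that timed out); it only demonstrates the invariant that a row lock is never held without the partition lock, so erasing an idle partition cannot destroy a row lock that is still in use.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy reader/writer lock. A failed try_* call stands in for a lock
// attempt that timed out in the real, future-based code.
struct toy_rwlock {
    int readers = 0;
    bool writer = false;
    bool try_read() { if (writer) return false; ++readers; return true; }
    bool try_write() { if (writer || readers > 0) return false; writer = true; return true; }
    void unlock_read() { assert(readers > 0); --readers; }
    void unlock_write() { assert(writer); writer = false; }
    bool idle() const { return !writer && readers == 0; }
};

struct toy_partition {
    toy_rwlock lock;
    std::map<std::string, toy_rwlock> rows;
};

class toy_row_locker {
    std::map<std::string, toy_partition> _partitions;

    // Drop the partition entry (and its row locks) once nobody holds it.
    void erase_if_idle(const std::string& pk) {
        auto it = _partitions.find(pk);
        if (it != _partitions.end() && it->second.lock.idle()) {
            _partitions.erase(it);
        }
    }
public:
    bool lock_pk(const std::string& pk) {
        return _partitions[pk].lock.try_write();
    }
    void unlock_pk(const std::string& pk) {
        _partitions.at(pk).lock.unlock_write();
        erase_if_idle(pk);
    }
    // Serialized acquisition: the row lock is attempted only once the
    // shared partition lock is held, so a failure on the partition
    // never leaves a row-lock holder behind.
    bool lock_ck(const std::string& pk, const std::string& ck) {
        toy_partition& p = _partitions[pk];
        if (!p.lock.try_read()) {
            return false;
        }
        if (!p.rows[ck].try_write()) {
            p.lock.unlock_read();
            erase_if_idle(pk);
            return false;
        }
        return true;
    }
    void unlock_ck(const std::string& pk, const std::string& ck) {
        toy_partition& p = _partitions.at(pk);
        p.rows.at(ck).unlock_write();
        p.lock.unlock_read();
        erase_if_idle(pk);
    }
    std::size_t partition_count() const { return _partitions.size(); }
};
```

In this model, a `lock_ck` call that fails on the partition lock returns before touching any row lock, which is the property the race in steps 2.b to 4.c violated.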

Fixes #12168

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12208
2022-12-06 16:29:46 +02:00
Botond Dénes
f017e9f1c6 docs: document the reader concurrency semaphore diagnostics dump
The diagnostics dumped by the reader concurrency semaphore are a pretty
common sight in logs as soon as a node becomes problematic. The reason
is that the reader concurrency semaphore acts as the canary in the coal
mine: it is the first that starts screaming when the node or workload is
unhealthy. This patch adds documentation of the content of the
diagnostics and how to diagnose common problems based on it.

Fixes: #10471

Closes #11970
2022-12-06 16:24:44 +02:00
Botond Dénes
c35cee7e2b docs: scylla-types.rst: add mention of per-operation --help
2022-12-06 14:47:28 +02:00
Botond Dénes
4f9799ce4f tools/scylla-types: add serialize operation
Takes human readable values and converts them to serialized hex encoded
format. Only regular atomic types are supported for now, no
collection/UDT/tuple support, not even in frozen form.
2022-12-06 14:46:53 +02:00
Botond Dénes
7c87655b4b tools/scylla-types: prepare for action handlers with string arguments
Currently all action handlers have bytes arguments, parsed from
hexadecimal string representations. We plan on adding a serialize
command which will require raw string arguments. Prepare the
infrastructure for supporting both types of action handlers.
2022-12-06 14:45:30 +02:00
Botond Dénes
15452730fb tools/scylla-types: s/print/deserialize/ operation
Soon we will have a serialize operation. Rename the current print
operation to deserialize in preparation for that. We want the two
operations (serialize and deserialize) to reflect their relation in
their names too.
2022-12-06 14:45:30 +02:00
Botond Dénes
f98e6552b4 docs: scylla-types.rst: document tokenof and shardof
These new actions were added recently but without the accompanying
documentation change. Make up for this now.
2022-12-06 14:45:30 +02:00
Botond Dénes
30c047cae6 docs: scylla-types.rst: fix typo in compare operation description
2022-12-06 14:45:23 +02:00
Botond Dénes
681bd62424 Update tools/java submodule
* tools/java ecab7cf7d6...1c4e1e7a7d (2):
  > Merge "Cqlsh serverless v2" from Karol Baryla
  > Update Java Driver version to 3.11.2.4
2022-12-06 09:06:09 +02:00
Botond Dénes
6a1dbffaaa Merge 'compaction_manager: coroutinize postponed_compactions_reevaluation' from Avi Kivity
Three lambdas were removed, simplifying the code.

Closes #12207

* github.com:scylladb/scylladb:
  compaction_manager: reindent postponed_compactions_reevaluation()
  compaction_manager: coroutinize postponed_compactions_reevaluation()
  compaction_manager: make postponed_compactions_reevaluation() return a future
2022-12-06 08:08:36 +02:00
Avi Kivity
2339a3fa06 database: remove continuation for updating statistics
update_write_metrics() is a continuation added solely for updating
statistics. Fold it into do_update to reduce an allocation in the
write path.

```console
$ ./artifacts/before --write --smp 1  2<&1 | grep insn
189930.77 tps ( 57.2 allocs/op,  13.2 tasks/op,   50994 insns/op,        0 errors)
189954.18 tps ( 57.2 allocs/op,  13.2 tasks/op,   51086 insns/op,        0 errors)
188623.86 tps ( 57.2 allocs/op,  13.2 tasks/op,   51083 insns/op,        0 errors)
190115.01 tps ( 57.2 allocs/op,  13.2 tasks/op,   51092 insns/op,        0 errors)
190173.71 tps ( 57.2 allocs/op,  13.2 tasks/op,   51083 insns/op,        0 errors)
median 189954.18 tps ( 57.2 allocs/op,  13.2 tasks/op,   51086 insns/op,        0 errors)
```

vs

```console
$ ./artifacts/after --write --smp 1  2<&1 | grep insn
190358.38 tps ( 56.2 allocs/op,  12.2 tasks/op,   50754 insns/op,        0 errors)
185222.78 tps ( 56.2 allocs/op,  12.2 tasks/op,   50789 insns/op,        0 errors)
184508.09 tps ( 56.2 allocs/op,  12.2 tasks/op,   50842 insns/op,        0 errors)
142099.47 tps ( 56.2 allocs/op,  12.2 tasks/op,   50825 insns/op,        0 errors)
190447.22 tps ( 56.2 allocs/op,  12.2 tasks/op,   50811 insns/op,        0 errors)
```

One allocation and ~300 instructions per operation saved.

update_write_metrics() is still called from other call sites, so it is
not removed.

Closes #12108
2022-12-06 07:04:17 +02:00
Botond Dénes
6daa1e973f Merge 'alternator: fix hangs related to TTL scanning' from Nadav Har'El
The first patch in this small series fixes a hang during shutdown when the expired-item scanning thread can hang in a retry loop instead of quitting.  These hangs were seen in some test runs (issue #12145).

The second patch is a failsafe against additional bugs like those solved by the first patch: if any bug causes the same page fetch to repeatedly time out, let's stop the attempts after 10 retries instead of retrying forever. When we stop the retries, a warning will be printed to the log, and Scylla will wait until the next scan period and start a new scan from scratch, from a random position in the database, instead of hanging, potentially forever, waiting for the same page.
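
The two behaviours described above (quit promptly on shutdown, and cap the retries of a single page) can be sketched as a bounded retry loop. This is an illustration only: `fetch_page_bounded`, `fetch_outcome` and the callback-based interface are hypothetical, and only the limit of 10 retries comes from the description.

```cpp
#include <cassert>
#include <functional>

// Limit taken from the description above; the name is hypothetical.
constexpr int max_page_retries = 10;

enum class fetch_outcome { fetched, gave_up, stopped };

// try_fetch: returns true when the page was fetched, false on a timeout.
// stopping:  returns true once shutdown has been requested.
// attempts:  set to the number of fetch attempts actually made.
fetch_outcome fetch_page_bounded(const std::function<bool()>& try_fetch,
                                 const std::function<bool()>& stopping,
                                 int& attempts) {
    attempts = 0;
    for (int retries = 0; ; ++retries) {
        if (stopping()) {
            // Shutdown: quit instead of looping on the same page.
            return fetch_outcome::stopped;
        }
        ++attempts;
        if (try_fetch()) {
            return fetch_outcome::fetched;
        }
        if (retries == max_page_retries) {
            // The first attempt plus 10 retries all timed out. The caller
            // is expected to log a warning and wait for the next scan period.
            return fetch_outcome::gave_up;
        }
    }
}
```

A caller that receives `gave_up` would abandon the current scan and start a fresh one at the next scan period rather than waiting on the same page.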

Closes #12152

* github.com:scylladb/scylladb:
  alternator ttl: in scanning thread, don't retry the same page too many times
  alternator: fix hang during shutdown of expiration-scanning thread
2022-12-06 06:44:22 +02:00
Botond Dénes
c5da96e6f7 Merge 'cql3: batch_statement: coroutinize get_mutations()' from Avi Kivity
As it has a do_with(), coroutinizing it is an automatic win.

Closes #12195

* github.com:scylladb/scylladb:
  cql3: batch_statement: reindent get_mutations()
  cql3: batch_statement: coroutinize get_mutations()
2022-12-06 06:41:44 +02:00
Avi Kivity
d2b1d2f695 compaction_manager: reindent postponed_compactions_reevaluation()
2022-12-05 22:02:27 +02:00
Avi Kivity
1669025736 compaction_manager: coroutinize postponed_compactions_reevaluation()
So much nicer.
2022-12-05 22:01:41 +02:00
Avi Kivity
d2c44cba77 compaction_manager: make postponed_compactions_reevaluation() return a future
postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.

Make it return a future instead of stashing its completion somewhere.
This makes it easier to convert it to a coroutine.
2022-12-05 21:58:48 +02:00
Avi Kivity
fe4d7fbdf2 Update abseil submodule
* abseil 7f3c0d78...4e5ff155 (125):
  > Add a compilation test for recursive hash map types
  > Add AbslStringify support for enum types in Substitute.
  > Use a c++14-style constexpr initialization if c++14 constexpr is available.
  > Move the vtable into a function to delay instantiation until the function is called. When the variable is a global the compiler is allowed to instantiate it more aggressively and it might happen before the types involved are complete. When it is inside a function the compiler can't instantiate it until after the functions are called.
  > Cosmetic reformatting in a test.
  > Reorder base64 unescape methods to be below the escaping methods.
  > Fixes many compilation issues that come from having no external CI coverage of the accelerated CRC implementation and some differences between the internal and external implementation.
  > Remove static initializer from mutex.h.
  > Import of CCTZ from GitHub.
  > Remove unused iostream include from crc32c.h
  > Fix MSVC builds that reject C-style arrays of size 0
  > Remove deprecated use of absl::ToCrc32c()
  > CRC: Make crc32c_t as a class for explicit control of operators
  > Convert the full parser into constexpr now that Abseil requires C++14, and use this parser for the static checker. This fixes some outstanding bugs where the static checker differed from the dynamic one. Also, fix `%v` to be accepted with POSIX syntax.
  > Write (more) directly into the structured buffer from StringifySink, including for (size_t, char) overload.
  > Avoid using the non-portable type __m128i_u.
  > Reduce flat_hash_{set,map} generated code size.
  > Use ABSL_HAVE_BUILTIN to fix -Wundef __has_builtin warning
  > Add a TODO for the deprecation of absl::aligned_storage_t
  > TSAN: Remove report_atomic_races=0 from CI now that it has been fixed
  > absl: fix Mutex TSan annotations
  > CMake: Remove trailing commas in `AbseilDll.cmake`
  > Fix AMD cpu detection.
  > CRC: Get CPU detection and hardware acceleration working on MSVC x86(_64)
  > Removing trailing period that can confuse a url in str_format.h.
  > Refactor btree iterator generation code into a base class rather than using ifdefs inside btree_iterator.
  > container.h: fix incorrect comments about the location of <numeric> algorithms.
  > Zero encoded_remaining when a string field doesn't fit, so that we don't leave partial data in the buffer (all decoders should ignore it anyway) and to be sure that we don't try to put any subsequent operands in either (there shouldn't be enough space).
  > Improve error messages when comparing btree iterators when generations are enabled.
  > Document the WebSafe* and *WithPadding variants more concisely, as deltas from Base64Encode.
  > Drop outdated comment about LogEntry copyability.
  > Release structured logging.
  > Minor formatting changes in preparation for structured logging...
  > Update absl::make_unique to reflect the C++14 minimum
  > Update Condition to allocate 24 bytes for MSVC platform pointers to methods.
  > Add missing include
  > Refactor "RAW: " prefix formatting into FormatLogPrefix.
  > Minor formatting changes due to internal refactoring
  > Fix typos
  > Add a new API for `extract_and_get_next()` in b-tree that returns both the extracted node and an iterator to the next element in the container.
  > Use AnyInvocable in internal thread_pool
  > Remove absl/time/internal/zoneinfo.inc.  It was used to guarantee availability of a few timezones for "time_test" and "time_benchmark", but (file-based) zoneinfo is now secured via existing Bazel data/env attributes, or new CMake environment settings.
  > Updated documentation on use of %v Also updated documentation around FormatSink and PutPaddedString
  > Use the correct Bazel copts in crc targets
  > Run the //absl/time timezone tests with a data dependency on, and a matching ${TZDIR} setting for, //absl/time/internal/cctz:zoneinfo.
  > Stop unnecessary clearing of fields in ~raw_hash_set.
  > Fix throw_delegate_test when using libc++ with shared libraries
  > CRC: Ensure SupportsArmCRC32PMULL() is defined
  > Improve error messages when comparing btree iterators.
  > Refactor the throw_delegate test into separate test cases
  > Replace std::atomic_flag with std::atomic<bool> to avoid the C++20 deprecation of ATOMIC_FLAG_INIT.
  > Add support for enum types with AbslStringify
  > Release the CRC library
  > Improve error messages when comparing swisstable iterators.
  > Auto increase inlined capacity whenever it does not affect class' size.
  > drop an unused dep
  > Factor out the internal helper AppendTruncated, which is used and redefined in a couple places, plus several more that have yet to be released.
  > Fix some invalid iterator bugs in btree_test.cc for multi{set,map} emplace{_hint} tests.
  > Force a conservative allocation for pointers to methods in Condition objects.
  > Fix a few lint findings in flags' usage.cc
  > Narrow some _MSC_VER checks to not catch clang-cl.
  > Small cleanups in logging test helpers
  > Import of CCTZ from GitHub.
  > Merge pull request abseil/abseil-cpp#1287 from GOGOYAO:patch-1
  > Merge pull request abseil/abseil-cpp#1307 from KindDragon:patch-1
  > Stop disabling some test warnings that have been fixed
  > Support logging of user-defined types that implement `AbslStringify()`
  > Eliminate span_internal::Min in favor of std::min, since Min conflicts with a macro in a third-party library.
  > Fix -Wimplicit-int-conversion.
  > Improve error messages when dereferencing invalid swisstable iterators.
  > Cord: Avoid leaking a node if SetExpectedChecksum() is called on an empty cord twice in a row.
  > Add a warning about extract invalidating iterators (not just the iterator of the element being extracted).
  > CMake: installed artifacts reflect the compiled ABI
  > Import of CCTZ from GitHub.
  > Import of CCTZ from GitHub.
  > Support empty Cords with an expected checksum
  > Move internal details from one source file to another more appropriate source file.
  > Removes `PutPaddedString()` function
  > Return uint8_t from CappedDamerauLevenshteinDistance.
  > Remove the unknown CMAKE_SYSTEM_PROCESSOR warning when configuring ABSL_RANDOM_RANDEN_COPTS
  > Enforce Visual Studio 2017 (MSVC++ 15.0) minimum
  > `absl::InlinedVector::swap` supports non-assignable types.
  > Improve b-tree error messages when dereferencing invalid iterators.
  > Mutex: Fix stall on single-core systems
  > Document Base64Unescape() padding
  > Fix sign conversion warnings in memory_test.cc.
  > Fix a sign conversion warning.
  > Fix a truncation warning on Windows 64-bit.
  > Use btree iterator subtraction instead of std::distance in erase_range() and count().
  > Eliminate use of internal interfaces and make the test portable and expose it to OSS.
  > Fix various warnings for _WIN32.
  > Disables StderrKnobsDefault due to order dependency
  > Implement btree_iterator::operator-, which is faster than std::distance for btree iterators.
  > Merge pull request abseil/abseil-cpp#1298 from rpjohnst:mingw-cmake-build
  > Implement function to calculate Damerau-Levenshtein distance between two strings.
  > Change per_thread_sem_test from size medium to size large.
  > Support stringification of user-defined types in AbslStringify in absl::Substitute.
  > Fix "unsafe narrowing" warnings in absl, 12/12.
  > Revert change to internal 'Rep', this causes issues for gdb
  > Reorganize InlineData into an inner Rep structure.
  > Remove internal `VLOG_xxx` macros
  > Import of CCTZ from GitHub.
  > `absl::InlinedVector` supports move assignment with non-assignable types.
  > Change Cord internal layout, which reduces store-load penalties on ARM
  > Detects accidental multiple invocations of AnyInvocable<R(...)&&>::operator()&& by producing an error in debug mode, and clarifies that the behavior is undefined in the general case.
  > Fix a bug in StrFormat. This issue would have been caught by any compile-time checking but can happen for incorrect formats parsed via ParsedFormat::New. Specifically, if a user were to add length modifiers with 'v', for example the incorrect format string "%hv", the ParsedFormat would incorrectly be allowed.
  > Adds documentation for stringification extension
  > CMake: Remove check_target calls which can be problematic in case of dependency cycle
  > Changes mutex unlock profiling
  > Add static_cast<void*> to the sources for trivial relocations to avoid spurious -Wdynamic-class-memaccess errors in the presence of other compilation errors.
  > Configure ABSL_CACHE_ALIGNED for clang-like and MSVC toolchains.
  > Fix "unsafe narrowing" warnings in absl, 11/n.
  > Eliminate use of internal interfaces
  > Merge pull request abseil/abseil-cpp#1289 from keith:ks/fix-more-clang-deprecated-builtins
  > Merge pull request abseil/abseil-cpp#1285 from jun-sheaf:patch-1
  > Delete LogEntry's copy ctor and assignment operator.
  > Make sinks provided to `AbslStringify()` usable with `absl::Format()`.
  > Cast unused variable to void
  > No changes in OSS.
  > No changes in OSS
  > Replace the kPower10ExponentTable array with a formula.
  > CMake: Mark absl::cord_test_helpers and absl::spy_hash_state PUBLIC
  > Use trivial relocation for transfers in swisstable and b-tree.
  > Merge pull request abseil/abseil-cpp#1284 from t0ny-peng:chore/remove-unused-class-in-variant
  > Removes the legacy spellings of the thread annotation macros/functions by default.

Closes #12201
2022-12-05 21:07:16 +02:00
Eliran Sinvani
5a5514d052 cql server: Only parallelize relevant cql requests
The cql server uses an execution stage to process and execute queries.
However, an execution stage is best utilized for a recurrent flow that
is called repeatedly, since that makes better use of the instruction
cache.
Up until now, every request was sent through the processing stage, but
most requests are not meant to be executed repeatedly with high volume.
This change processes and executes the data queries asynchronously,
through an execution stage, and all of the rest are processed one by
one, only continuing once the request has been done end to end.

Tests:
Unit tests in dev and debug.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #12202
2022-12-05 21:06:58 +02:00
Takuya ASADA
b7851ab1ec docker: fix locale on SSH shell
4ecc08c broke locale settings on SSH shell, since we dropped "update-locale".
To fix this without installing locales package, we need to manually specify
LANG=C.UTF-8 in /etc/default/locale.

see https://github.com/scylladb/scylla-cluster-tests/pull/5519

Closes #12197
2022-12-05 20:02:18 +02:00
Avi Kivity
6f2d060d12 Merge 'Make sstable_directory call sstable_manager for sstables' components' from Pavel Emelyanov
This PR hits two goals for the "object storage" effort:

1. Sstables loader "knows" that sstables components are stored in a Linux directory and uses utils/lister to access it. This is not going to work with sstables over object storage, the loader should be abstracted from the underlying storage.

2. Currently class keyspace and class column_family carry "datadir" and "all_datadirs" on board, which are paths on the local filesystem where sstable files are stored (those usually start with /var/lib/scylla/data). The paths include subdirs like "snapshots", "staging", etc. This is not going to look nice for object storage; the /var/lib/ prefix is excessive and meaningless in this case. Instead, ks and cf should know their "location" and some other component should know the directory in which the files are stored.

That said, this PR prepares distributed_loader and sstable_directory to stop using Linux paths explicitly by making both call sstables_manager to list and open sstable objects. After that, it will be possible to teach the manager to list sstables from object storage. This also opens the way to removing paths from the keyspace and column_family classes and replacing them with relative "location"s.

Closes #12128

* github.com:scylladb/scylladb:
  sstable_directory: Get components lister from manager
  sstable_directory: Extract directory lister
  sstable_directory: Remove sstable creation callback
  sstable_directory: Call manager to make sstables
  sstable_directory: Keep error handler generator
  sstable_directory: Keep schema_ptr
  sstable_directory: Use directory semaphore from manager
  sstable_directory: Keep reference on manager
  tests: Use sstables creation helper in some cases
  sstables_manager: Keep directory semaphore reference
  sstables, code: Wrap directory semaphore with concurrency
2022-12-05 18:54:17 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned, which causes the
node to retry. But the error is not transient. If a node tries to do one
of the operations above, it means it was part of the cluster at some
point, but since a node with the same id should not be added back to a
cluster, if it is not in the cluster now it never will be.

Return a new error not_a_member to a caller instead.

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00
Benny Halevy
c61083852c storage_service: handle_state_normal: calculate candidates_for_removal when replacing tokens
We currently try to detect a replaced node so as to insert it into
endpoints_to_remove when it has no owned tokens left.
However, for each token we first generate a multimap using
get_endpoint_to_token_map_for_reading().

There are 2 problems with that:

1. unless the replaced node owns a single token, this map will not
   be empty after erasing one token out of it, since the
   token metadata has not changed yet (this is done later with
   update_normal_tokens(owned_tokens, endpoint)).
2. generating this map for each token is inefficient, making the
   algorithm's complexity quadratic in the number of tokens...

This change copies the current token_to_endpoint map
to a temporary map and erases replaced tokens from it,
while maintaining a set of candidates_for_removal.

After traversing all replaced tokens, we check again
the `token_to_endpoint_map` erasing from `candidates_for_removal`
any endpoint that still owns tokens.
The leftover candidates are endpoints that own no tokens
and so they are added to `hosts_to_remove`.
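
The approach described above can be sketched with standard containers. This is a simplified illustration, not the storage_service code: the `token` and `endpoint` aliases and the function `endpoints_left_without_tokens` are hypothetical, and the real token metadata types differ.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using token = long;            // stand-in for the real token type
using endpoint = std::string;  // stand-in for an endpoint address

// Returns the endpoints that own no tokens once `new_owner` has taken
// over every token in `owned_tokens`.
std::set<endpoint> endpoints_left_without_tokens(
        const std::map<token, endpoint>& token_to_endpoint,
        const std::vector<token>& owned_tokens,
        const endpoint& new_owner) {
    // One copy up front, instead of rebuilding a map for every token.
    std::map<token, endpoint> remaining = token_to_endpoint;
    std::set<endpoint> candidates_for_removal;

    for (token t : owned_tokens) {
        auto it = remaining.find(t);
        if (it == remaining.end()) {
            continue;  // token had no previous owner
        }
        if (it->second != new_owner) {
            candidates_for_removal.insert(it->second);
        }
        remaining.erase(it);
    }

    // Second pass: any candidate that still owns a token is kept.
    for (const auto& entry : remaining) {
        candidates_for_removal.erase(entry.second);
    }
    return candidates_for_removal;
}
```

Each token is looked up and erased once and the remaining map is traversed once, so the work grows roughly linearly (up to logarithmic container costs) with the number of tokens rather than quadratically.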

Fixes #12082

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12141
2022-12-05 16:17:18 +01:00
Botond Dénes
3d620378d4 Merge 'view: coroutinize maybe_mark_view_as_built' from Avi Kivity
Simplifying it a little.

Closes #12171

* github.com:scylladb/scylladb:
  view: reindent maybe_mark_view_as_built
  view: coroutinize maybe_mark_view_as_built
2022-12-05 13:43:34 +02:00
Kamil Braun
3f8aaeeab9 test/topology: enable replace tests
Also add some TODOs for enhancing existing tests.
2022-12-05 11:50:07 +01:00
Kamil Braun
ee19411783 service/raft: report an error when Raft ID can't be found in raft_group0::remove_from_group0
Also simplify the code and improve logging in general.

The previous code did this: search for the ID in the address map. If it
couldn't be found, perform a read barrier and search again. If it again
couldn't be found, return.

This algorithm depended on the fact that IP addresses were stored in
group 0 configuration. The read barrier was used to obtain the most
recent configuration, and if the IP was not a part of address map after
the read barrier, that meant it's simply not a member of group 0.

This logic no longer applies so we can simplify the code.

Furthermore, when I was fixing the replace operation with Raft enabled,
at some point I had a "working" solution with all tests passing. But I
was suspicious and checked if the replaced node got removed from
group 0. It wasn't. So the replace finished "successfully", but we had
an additional (voting!) member of group 0 which didn't correspond to
a token ring member.

The last version of my fixes ensures that the node gets removed by the
replacing node. But the system is fragile and nothing prevents us from
breaking this again. At least log an error for now. Regression tests
will be added later.
2022-12-05 11:50:07 +01:00
Kamil Braun
4429885543 service: handle replace correctly with Raft enabled
We must place the Raft ID obtained during the shadow round in the
address map. It won't be placed by the regular gossiping route if we're
replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it
is not guaranteed that background gossiping manages to update the
address map before we need it, especially in tests where we set
ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round,
on the other hand, performs a synchronous request (and if it fails
during bootstrap, bootstrap will fail - because we also won't be able to
obtain the tokens and Host ID of the replaced node).

Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
2022-12-05 11:50:07 +01:00
Kamil Braun
45bb5bfb52 gms/gossiper: fetch RAFT_SERVER_ID during shadow round
During the replace operation we need the Raft ID of the replaced node.
The shadow round is used for fetching all necessary information before
the replace operation starts.
2022-12-05 11:50:07 +01:00
Kamil Braun
7222c2f9a1 service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
Most of the sleeps related to gossiping are based on `ring_delay`,
which is configurable and can be set to a lower value, e.g. during tests.

But for some reason there was one case where we slept for a hardcoded
value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds.

Use `2 * get_ring_delay()` instead. With the default value of
`ring_delay` (30 seconds) this will give the same behavior.
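
As a quick check of the equivalence claimed above, here is a small sketch. The names `old_hardcoded_sleep` and `replace_sleep` are hypothetical and do not come from the patch; the 60 and 30 second values are taken from this message.

```cpp
#include <chrono>

// Hypothetical names; the values come from the commit message above.
constexpr std::chrono::seconds old_hardcoded_sleep{60};  // BROADCAST_INTERVAL

// New behavior: the sleep scales with the configured ring_delay.
constexpr std::chrono::milliseconds replace_sleep(std::chrono::milliseconds ring_delay) {
    return 2 * ring_delay;
}

// With the default ring_delay of 30 seconds the sleep is unchanged.
static_assert(replace_sleep(std::chrono::seconds{30}) == old_hardcoded_sleep);
```

With `ring_delay` set to 0, as the tests mentioned in this series do, `replace_sleep` yields a zero-length sleep.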
2022-12-05 11:50:07 +01:00
Pavel Emelyanov
b5ede873f2 sstable_directory: Get components lister from manager
For now this is almost a no-op because the manager just calls back into
the sstables_directory code to create the lister.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
3f9b8c855d sstable_directory: Extract directory lister
Currently the utils/lister.cc code is used to list regular files in a
directory. This patch wraps the lister into a more abstract components
lister class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
abd3602b10 sstable_directory: Remove sstable creation callback
It's no longer used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
3d559391df sstable_directory: Call manager to make sstables
Now the directory code has everything it needs to create an sstable
object and can stop using the external lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
db657a8d1c sstable_directory: Keep error handler generator
Yet another continuation of the previous patch -- the IO error handler
generator is also needed to create sstables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
4281f4af42 sstable_directory: Keep schema_ptr
Continuation of the one-before-previous patch. In order to create an
sstable without an external lambda, the directory code needs the schema.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
8df1bcb907 sstable_directory: Use directory semaphore from manager
After the previous patch, the sstables_directory code no longer needs
the semaphore argument, because it can get one from the manager. This
makes the directory API shorter and simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
4da941e159 sstable_directory: Keep reference on manager
The sstables_directory accesses /var/lib/scylla/data in two ways -- it
lists files in it and opens sstables. The latter is abstracted with the
help of lambdas passed around, but the former (listing) is done by using
directory listers from utils.

Listing sstable components with a directory lister won't work for object
storage; the directory code will need to call some abstraction layer
instead. Opening sstables with the help of a lambda is a bit of
overkill; having the sstables manager at hand could make it much simpler.

That said, this patch makes sstables_directory reference sstables_manager
on start.

This change will also simplify directory semaphore usage (next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
784d78810a tests: Use sstables creation helper in some cases
Several test cases pass an sstable creation lambda to the
with_sstables_directory helper. There's a ready-to-use helper class that
does the same. The next patch will make additional use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
5e13ce2619 sstables_manager: Keep directory semaphore reference
Preparatory patch. The semaphore will be used by sstables_directory in
the next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:18 +03:00
Pavel Emelyanov
be8512d7cc sstables, code: Wrap directory semaphore with concurrency
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelism.

This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.
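
The idea of bundling the two can be sketched as follows. This is a simplified, non-blocking illustration under stated assumptions: `directory_semaphore_sketch` and its methods are hypothetical, and the real class wraps a Seastar semaphore rather than a plain counter.

```cpp
#include <cstddef>

// Hypothetical sketch: one object carries both the concurrency limit and
// the units that enforce it, so callers no longer pass them separately.
class directory_semaphore_sketch {
    std::size_t _concurrency;  // how many operations may run in parallel
    std::size_t _available;    // units currently free

public:
    explicit directory_semaphore_sketch(std::size_t concurrency)
        : _concurrency(concurrency), _available(concurrency) {}

    std::size_t concurrency() const { return _concurrency; }

    // Take one unit if any is free; a real semaphore would wait instead.
    bool try_acquire() {
        if (_available == 0) {
            return false;
        }
        --_available;
        return true;
    }

    // Return one unit, never exceeding the configured concurrency.
    void release() {
        if (_available < _concurrency) {
            ++_available;
        }
    }
};
```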

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 11:59:30 +03:00
Asias He
c6087cf3a0 repair: Reduce repair reader eviction with diff shard count
When the repair master and followers have different shard counts, the
repair followers need to create multi-shard readers. Each multi-shard
reader will create one local reader on each shard, N (smp::count) local
readers in total.

There is a hard limit on the number of readers that can work in parallel.
When there are more readers than this limit, the readers will start to
evict each other, causing buffers already read from disk to be dropped
and readers to be recreated, which is not very efficient.

To reduce reader eviction overhead, a global reader permit is introduced
which accounts for the multi-shard reader bloat.

With this patch, at any point in time, the number of readers created by
repair will not exceed the reader limit.
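
The accounting idea can be sketched like this. It is an assumption-laden illustration, not the patch's code: `repair_reader_budget_sketch` and its methods are hypothetical, and admission is modeled as an immediate yes/no rather than a wait on a semaphore.

```cpp
// Hypothetical sketch: a multi-shard reader is charged for all the local
// readers it will create (one per shard), so the total never exceeds the
// reader limit and readers do not have to evict each other.
class repair_reader_budget_sketch {
    unsigned _limit;     // maximum local readers allowed at once
    unsigned _used = 0;  // local readers currently accounted for

public:
    explicit repair_reader_budget_sketch(unsigned limit) : _limit(limit) {}

    // Admit a reader that needs `local_readers` units, if they fit.
    bool try_admit(unsigned local_readers) {
        if (local_readers > _limit - _used) {
            return false;
        }
        _used += local_readers;
        return true;
    }

    // Give back the units of a finished reader.
    void release(unsigned local_readers) { _used -= local_readers; }

    unsigned in_use() const { return _used; }
};
```

With a limit of 10 and 8 shards, a second multi-shard reader is held back until the first one releases its units, instead of being created and then evicted.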

Test Results:

1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2
shards, n2=8 shards, memory wanted =1

1.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2  (repair on n2)
[2022-11-23 17:45:24,770] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:45:53,869] Repair session 1
[2022-11-23 17:45:53,869] Repair session 1 finished

real    0m30.212s
user    0m1.680s
sys     0m0.222s

1.2)
[asias@hjpc2 mycluster]$ time nodetool  repair ks2  (repair on n1)
[2022-11-23 17:46:07,507] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:46:30,608] Repair session 1
[2022-11-23 17:46:30,608] Repair session 1 finished

real    0m24.241s
user    0m1.731s
sys     0m0.213s

2) with stream sem 10, repair global sem no_limit, 5 ranges in
parallel, n1=2 shards, n2=8 shards, memory wanted =1

2.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:49:49,301] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:01,414] Repair session 1
[2022-11-23 17:52:01,415] Repair session 1 finished

real    2m13.227s
user    0m1.752s
sys     0m0.218s

2.2)
[asias@hjpc2 mycluster]$ time nodetool  repair ks2 (repair on n1)
[2022-11-23 17:52:19,280] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:42,387] Repair session 1
[2022-11-23 17:52:42,387] Repair session 1 finished

real    0m24.196s
user    0m1.689s
sys     0m0.184s

Comparing 1.1) and 2.1) shows that eviction played a major role here.
The patch gives a 133s / 30s = 4.4X speedup in this setup.

Comparing 1.1) and 1.2) shows that even if we limit the readers, starting
the repair on the node with fewer shards is faster: 30s / 24s = 1.25X
(the total number of multishard readers is lower).

Fixes #12157

Closes #12158
2022-12-05 10:47:36 +02:00
Botond Dénes
1e20095547 Update tools/java submodule
* tools/java 1c06006447...ecab7cf7d6 (1):
  > Add VSCode files to gitignore
2022-12-05 09:54:51 +02:00