Many of the gossiper internal functions currently use seastar threads for historical reasons,
but since they are short living, the cost of spawning a seastar thread for them is excessive
and they can be simplified and made more efficient using coroutines.
Closes#15364
* github.com:scylladb/scylladb:
gossiper: reindent do_stop_gossiping
gossiper: coroutinize do_stop_gossiping
gossiper: reindent assassinate_endpoint
gossiper: coroutinize assassinate_endpoint
gossiper: coroutinize handle_ack2_msg
gossiper: handle_ack_msg: always log warning on exception
gossiper: reindent handle_ack_msg
gossiper: coroutinize handle_ack_msg
gossiper: reindent handle_syn_msg
gossiper: coroutinize handle_syn_msg
gossiper: message handlers: no need to capture shared_from_this
gossiper: add_local_application_state: throw internal error if endpoint state is not found
gossiper: coroutinize add_local_application_state
Simplify the function. It does not need to spawn
a seastar thread.
While at it, declare it as private since it's called
only internally by the gossiper (and on shard 0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike handle_syn_msg, the warning is currently printed only
`if (_ack_handlers.contains(from.addr))`.
Unclear why. It is interesting in any case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The handlers future is waited on under `background_msg`
which is closed in gossiper::stop so the instance is
already guranteed to be kept valid.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the function is called too early, the first get_endpoint_state_ptr
would throw an exception that is later caught and degraded
into a warning.
But that endpoint_state should never disappear after yielding,
so call on_internal_error in that case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
in this series, the packaging of tools modules are improved:
- package cqlsh also. as cqlsh should be redistributed as a part of the unified package
- use ${arch} in the postfix of the python3 package. the python3 package is not architecture independent.
- set the version with tide for `Scylla_VERSION`, so it can be reused elsewhere.
Refs #15241Closes#15369
* github.com:scylladb/scylladb:
build: cmake: build cqlsh as a submodule
build: cmake: always use the version with tilde
build: cmake: build python3 dist tarball with arch postfix
build: cmake: use the default comment message
This function practically returned true from inception.
In d38deef499
it started using messaging_service().knows_version(endpoint)
that also returns `true` unconditionally, to this day
So there's no point calling it since we can assume
that `uses_host_id` is true for all versions.
Closes#15343
* github.com:scylladb/scylladb:
storage_service: fixup indentation after last patch
gossiper: get rid of uses_host_id
since we always use tilde ("~") in the verson number,
let's just cache it as an internal variable in CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that `configure.py` always generate python3 dist tarball with
${arch} postfix, let's mirror this behavior. as `build_unified.sh`
uses this naming convention.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out "Generating submodule python3 in python3" is not
as informative as default one:
"/home/kefu/dev/scylladb/tools/python3/build/scylla-python3-5.4.0~dev-0.20230908.1668d434e458.noarch.tar.gz"
so let's drop the "COMMENT" argument.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change allows CMake to build the dist tarball for a certain build.
Refs https://github.com/scylladb/scylladb/issues/15241Closes#15352
* github.com:scylladb/scylladb:
build: cmake: add packaging support
build: cmake: enable build of seastar/apps/iotune
Currently minio starts with a bucket that has public anonymous access. Respectively, all tests use unsigned S3 requests. That was done for simplicity, and its better to apply some policy to the bucket and, consequentially, make tests sign their requests.
Other than the obvious benefit that we test requests signing in unit tests, another goal of this PR is to make it possible to simulate and test various error paths locally, e.g. #13745 and #13022Closes#14525
* github.com:scylladb/scylladb:
test/s3: Remove AWS_S3_EXTRA usage
test/s3: Run tests over non-anonymous bucket
test/minio: Create random temp user on start
code: Rename S3_PUBLIC_BUCKET_FOR_TEST
This is a workaround for the flakiness of the test where INSERT
statements following the rolling restart fail with "No host available"
exception. The hypothesis is that those INSERTS race with driver
reconnecting to the cluster and if INSERTs are attempted before
reconnection is finished, the driver will refuse to execute the
statements.
The real fix should be in the driver to join with reconnections but
before that is ready we want to fix CI flakiness.
Refs #14746Closes#15355
instead of fabricating a `/etc/password` manually, we can just
leave it to podman to add an entry in `/etc/password` in container.
as podman allows us to map user's account to the same UID in the
container. see
https://docs.podman.io/en/stable/markdown/options/userns.container.html.
this is not only a cosmetic change, it also avoid the permission denied
failure when accessing `/etc/passwd` in the container when selinux is
enabled. without this change, we would otherwise need to either add the
selinux lable to the bind volume with ':Z' option address the failure
like:
```
type=AVC msg=audit(1693449115.261:2599): avc: denied { open } for pid=2298247 comm="bash" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1693449115.263:2600): avc: denied { open } for pid=2298249 comm="id" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
```
found in `/var/log/audit/audit.log`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15230
Currently, mutation query on replica side will not respond with a result which doesn't have at least one live row. This causes problems if there is a lot of dead rows or partitions before we reach a live row, which stem from the fact that resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones,like regular queries do.
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (1 few ms changes can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111Closes#14923
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
scylla redistribute iotune, so let's enable the related building
options, so that we can built iotune on demand.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When `nodetool disablebinary` command executes its handler aborts listening sockets, shuts down all client connections _and_ (!) then waits for the connections to stop existing. Effectively the command tries to make sure that no activity initiated by a CQL query continues, even though client would never see its result (client sockets are closed)
This makes the disablebinary command hang for long sometimes, which is not really nice. The proposal is to wait for the connections to terminate in the background. So once disablebinary command exists what's guaranteed is that all client connections are aborted and new connections are not admitted, but some activity started by them may still be running (e.g. up until `nodetool drain` is issued). Driver-side sockets won't get the queries' results anyway.
The behavior of `disablebinary` is not documented wrt whether it should wait for CQL processing to stop or not, so technically we're not breaking anything. However, it can happen that it's a disruptive change and some setups may behave differently after it.
refs: #14031
refs: #14711Closes#14743
* github.com:scylladb/scylladb:
test/cql-pytest: Add enable|disable-binary test case
test.py: Add suite option to auto-dirty cluster after test
test/pylib: Add nodetool enable|disable-binary commands
transport: Shutdown server on disablebinary
generic_server: Introduce shutdown()
generic_server: Decouple server stopped from connection stopped
transport/controller: Coroutinize do_stop_server()
transport/controller: Coroutinize stop_server()
The test checks that `nodetool disablebinary` makes subsequent queries
fail and `nodetool enablebinary` lets client to establish new
connections.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
ScyllaCluster can be marked as 'dirty' which means that the cluster is
in unusable state (after test) and shouldn't be re-used by other tests
launched by test.py. For now this is only implemented via the cluster
manager class which is only available for topology tests.
Add a less flexible short-cut for cql-pytest-s via suite.yaml marking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and do the real "sharded::stop" in the background. On node shutdown
it needs to pick up all dangling background stopping.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method waits for listening sockets to stop listening and aborts the
connected sockets, but doesn't wait for the established connections to
finish processing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _stopped future resolves when all "sockets" stop -- listening and
connected ones. Furure patching will need to wait for listening sockets
to stop separately from connected ones.
Rename the `_stopped` to reflect what it is now while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's better to pass a disengaged optional when
the caller doesn't have the information rather than
passing the default dc_rack location so the latter
will never implicitly override a known endpoint dc/rack location.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15300
This function practically returned true from inception.
In d38deef499
It started using messaging_service().knows_version(endpoint)
that also returns `true` unconditionally, to this day
So there's no point calling it since we can assume
that `uses_host_id` is true for all versions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to be aligned with seastar's coding-style.md: scylladb uses seastar's
coding-style.md. so let's adhere to it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15345
this series adds `--node-exporter-dir` and `--build-dir` options to `create-relocatable-package.py`. this enables us to use create relocatable package from arbitrary build directories.
Refs #15241Closes#15299
* github.com:scylladb/scylladb:
create-relocatable-package.py: add --node-exporter-dir option
build: specify the build dir instead mode
so we can point `debian_files_gen.py` to builddir other than
'build', and can optionally use other output directory. this would
help to reduce the number of "magic numbers" in our building system.
Refs https://github.com/scylladb/scylladb/issues/15241Closes#15282
* github.com:scylladb/scylladb:
dist/debian: specify debian/* file encodings
dist/debian: wrap lines whose length exceeds 100 chars
dist/debian: add command line option for builddir
dist/debian: modularize debian_files_gen.py
The current read-loop fails to detect end-of-page and if the query
result buider cuts the page, it will just proceed to the next
partition. This will result in distorted query results, as the result
builder will request for the consumption to stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
Currently if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitely in the
json_mutation_stream_parser destructor. If the parser thread existed
successfully, this will be a no-op, but on the error-path, this will
ensure that the parser thread doesn't hang.