Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.
Use a coroutine to simplify error handling and
close the reader if an exception is caught.
Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
their and return an exceptional future early, since
the reader will not be moved to sst->write_components, that assumes
ownership over it and closes it in all cases.
Fixes#8776
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Too large requests are currently handled by the CQL server by
skipping them and sending back an error response.
That's however wasteful and dangerous: bogus request sizes
will force Scylla to potentially skip gigabytes of data
- and skipping is done by simply reading from the socket,
so it may results in gigabytes of bandwidth wasted.
Even if the request size is not bogus, closing the connection
forces users to adjust their request sizes, which should be done
anyway.
Originally, there was a bug in handling too large requests which
only read their headers and then left the connection in a broken,
undefined state, trying to interpret the rest of the large request
as a next CQL header. It was later fixed to skip the request, but
closing the connection is a safer thing to do.
Fixes#8798Closes#8800
"
There's a bunch of issues with starting and stopping of cql_server with
the help of cql_controller.
fixes: #8796
tests: manual(start + stop,
start + exception on cql_set_state()
)
unit not run, they don't mess with transport controller
"
* 'br-transport-stop-fixes' of https://github.com/xemul/scylla:
transport/controller: Stop server on state change failure too
transport/controller: Rollback server start on state change failure too
transport/controller: Do not leave _server uninitialized
transport/controller: Rework try-catch into defers
Feature requests, fixes, and OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided
To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
In this small series, I rewrite test/alternator/run to Python using the utility
functions developed for test/cql-pytest. In the future, we should do the same to
test/redis/run and test/scylla-gdb/run.
The benefit of this rewrite is less code duplication (all run scripts start with
the same duplicate code to deal with temporary directories, to run Scylla IP
addresses, etc.), but most importantly - in the future fixes we do to cql-pytest
(e.g., parameters needed to start Scylla efficiently, how to shut down Scylla,
etc.) will appear automatically in alternator test without needing to remember
to change both.
Another benefit is that test/alternator/run will now be Python, not a shell
script. This should make it easier to integrate it into test.py (refs #6212) in
the future - if we want to.
Closes#8792
* github.com:scylladb/scylla:
test/alternator: rewrite test/alternator/run script in Python
test/cql-pytest: make test run code more general
In the last year, four new features were added to DynamoDB which we
don't yet support - Kinesis Streams, PartiQL, Contributor Insights and
Export to S3. Let's document them as missing Alternator features, and
point to the four newly-created issues about these features.
Refs #8786
Refs #8787
Refs #8788
Refs #8789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210603125825.1179171-1-nyh@scylladb.com>
scylladb/seastar@e6463df8a0 ("smp: allow
having multiple instances of the smp class") changes the type of
seastar::smp::_qs from a unique_ptr to a regular pointer. Adjust for
that change, with a fallback to support older versions.
Closes#8784
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.
This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.
In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.
Fixes#8577
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8578
* github.com:scylladb/scylla:
commitlog: segment_manager::shutdown: abort on errors
commitlog: allocate_segment_ex: make_checked_file
std::bind() copies the bound parameters for safekeeping. Here this
includes expr, which can be quite heavyweight. Use std::ref() to
prevent copying. This is safe since the bound expression is executed
and discarded before has_supporting_index() returns.
Closes#8791
The current pull_github_pr.sh hard-codes the project name "scylladb/scylla".
Let's determine it automaticaly, from the git origin url.
This will allow using exactly the same script in other Scylla subprojects,
e.g., Seastar.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318142624.1794419-1-nyh@scylladb.com>
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Most RAFT packets are sent very rarely during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send function does not serve any purpose. Change the raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes since
replication protocol is more efficient if there is no packet loss, so
it is better to pause a sender than dropping packets inside the rpc. Rpc
is still allowed to drop append_requests if overloaded.
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers,
In https://github.com/scylladb/scylla/pull/7807 we added support for
Azure instance in Scylla.
The following changes are required in order machine-image to work:
1) fix wrong metadata URL and updating metadata path values (was
intreduce in
f627fcbb0c)
2) fix function naming which been used my machine image
3) add missing function which are reuqired by mahcine-image
4) cleanup unused functions
Closes#8596
The _node_ops_metrics is thread local, it is constructed when it
is first accessed.
If there are no node operations, the metrics will not be shown. To make the
metrics more consistent, init during startup.
Refs #8311Closes#8780
In commit 425e3b1182 (gossip: Introduce
direct failure detector), the call to notify_failure_detector inside ack
and ack2 msg handler was removed since there is no need to update the
old failure detector anymore. However, the timestamp for endpoit_state
is also updated inside notify_failure_detector. With the new failure
detector we still need the timestamp for endpoit_state. Otherwise, nodes
might be removed from gossip wrongly.
For example, as we saw in issue #8702:
INFO 2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2
has been silent for 5000ms, removing from gossip
To fix, update the timestamp as we do before in ack and ack2 msg
handler.
Fixes#8702Closes#8777
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#8773
When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).
This causes certain enterprise extensions to not be able
to disable themselves.
Fixed by filtering existing extensions by property keywords.
Unit test added.
Closes#8774
In test_tracing.py::test_slow_query_log, the was what looked like
an innocent 30-second timeout, but this was in fact a 8 minute
timeout - because it started with sleeping 1 second, then 2 seconds,
then 3, ... until 30 seconds. Such a high timeout is frustrating when
trying to debug failures in the test - which is only expected to take
2 seconds (and all of it because of an artificial timeout).
So fix the loop to stop iterating after 60 seconds (a compromise
between 30 seconds and 8 minutes...), sleeping a constant amount
between iterations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.
Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.
Closes#8767
Make sure to apply the mutation under the table's _async_gate.
Fixes#8790
Test: unit(dev), view_build_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8794
Fixes#8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
Closes#8766
Both streaming and repair call the distributed sstables writing with
equal lambdas each being ~30 lines of code. The only difference between
them is repair might request offstrategy compaction for new sstable.
Generalization of these two pieces save lines of codes and speeds the
release/repair/row_level.o compilation by half a minute (out of twelve).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210531133113.23003-1-xemul@scylladb.com>
Instead of attempting to universally set the proper environment
necessary for tests to generate profiling data such that coverage.py can
process it, allow each Test subclass to set up the environment as needed
by the specific Test variant.
With this we now have support for all current test types, including cql,
cql-pytest and alternator tests.
Yet another modifier for `--run`, allowing running the same executable
multiple times and then generating a coverage report across all runs.
This will also be used by test.py for those test suites (cql test) which
run the same executable multiple times, with different inputs.
Another modifier for `--run`, allowing to override the test executable
path. This is useful when the real test is ran through a run-script,
like in the case of cql-pytest.
If on stop the set_cql_state() throws the local sharded<cql_server>
will be left not stopped and will fail the respective assertion on
its destruction.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If set_cql_state() throws the cserver remains started. If this
happens on start before the controller stop defer action is
scheduled the destruction of controller will fain on assertion
that checks the _server must be stopped.
Effectively this is the fix of #8796
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If an exception happens after sharded<cql_server>.start() the
controller's _server pointer is left pointing to stopped sharded
server. This makes it impossible to start the server again (via
API) since the check for if (_server) will always be true.
This is the continuation of the ae4d5a60 fix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Waiting on index alone does not guarantee leader correct leader log
propagation. This patch add checking also the term of the leader's last
log entry.
This was exposed with occasional problems with packet drops.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>