As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferenced will always
give a valid reference.
This can be used as a building block for implementing cursors into
LSA-based caches.
Example simple use:
// LSA-managed
struct X : public lsa::weakly_referencable<X> {
int value;
};
lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
X* x = current_allocator().construct<X>();
return x->weak_from_this();
});
std::cout << x_ptr->value;
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
calss parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.
This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
After this patch, there is a singe index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.
As part of this, cached_file needed to be adjusted in the following ways.
The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.
When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.
In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializating a
temporary_buffer.
While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:
1) LSA segments only hold buffers, they don't hold metadata. They
also don't mix with standard allocations. So a 128K segment can
hold 32 4K buffers.
2) objects' life time is managed by lsa_buffer, an owning smart
pointer, which is automatically updated when buffers are migrated
to another segment. This makes LSA allocations easier to use and
off-loads metadata management to the client (which can keep the
lsa_buffer wherever he wants).
The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entabled object inside lsa_buffer.
The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
make_sstable_containing() already calls open_data(), so does
load(). This will trigger assertion failure added in a later patch:
assert(!_cached_index_file);
There is no need to call load() here.
It's an adpator between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation form
utils/lru.hh which can hold arbitrary element type.
Scylla doesn't allow unencrypted connections over encrypted CQL ports
(Cassandra does allow this, by setting "optional: true", but it's not
secure and not recommended). Here we add a test that in indeed, we can't
connect to an SSL port using an unencrypted connection.
The test passes on Scylla, and also on Cassandra (run it on Cassandra
with "test/cql-pytest/run-cassandra --ssl" - for which we added support
in a recent patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629121514.541042-1-nyh@scylladb.com>
Fixtures in conftest.py (e.g., the test_keyspace fixture) can be shared by
all tests in all source files, so they are marked with the "session"
scope: All the tests in the testing session may share the same instance.
This is fine.
Some of test files have additional fixtures for creating special tables
needed only in those files. Those were also, unnecessarily, marked
"session" scope as well. This means that these temporary tables are
only deleted at the very end of test suite, event though they can be
deleted at the end of the test file which needed them - other test
source files don't have access to it anyway. This is exactly what the
"module" fixture scope is, so this patch changes all the fixtures that
are private to one test file to use the "module" scope.
After this patch, the teardown of the last test in the suite goes down
from 0.26 seconds to just 0.06 seconds.
Another benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.
This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#8932
This patch adds support for the "--ssl" option in run-cassandra, which
will now be able, like run (which runs Scylla), to run Cassandra with
listening to a *SSL-encrypted* CQL connection. The "--ssl" option is also
passed to the tests, so they know to encrypt their CQL connections.
We already had support for this feature in the test/cql-pytest/run
script - which runs Scylla. Adding this also to the run-cassandra
script can help verify that a behavior we notice in Scylla's SSL support
and we want to add to a test - is also shared by Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629082532.535229-1-nyh@scylladb.com>
This class is used in only few places and does not have to be included
everywhere sstable class is needed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
instead of LA. Ability to write LA and KA sstables will be removed
by the following patches so we need to switch all the tests to write
newer sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Following patches will switch all sstable writing tests to use
the latest sstables format. compaction_with_fully_expired_table
contains some test for a LA specific behaviour so let's remove it
to make the switch possible.
For more context see https://github.com/scylladb/scylla/issues/2620
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Replace calls to make_compressed_file_k_l_format_input_stream
with calls to make_compressed_file_m_format_input_stream.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
check_compacted_sstables is used in compact_02 test which uses sstables
created by compact_sstables. The problem is that schema used in
check_compacted_sstables and compact_sstables is not the same.
The type of r1 column is different. This was not a problem when the
test was running on LA sstables but following patches will switch
all the tests to use MC and then sstable schema becomes validated
when reading the sstable and the test will fail such validation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Those tests check that created sstables have exactly the expected bytes
inside. This won't work with other sstable formats and writting LA/KA
sstables will be removed by the following patches so there's nothing
we can do with those tests but to remove them. Otherwise they will be
failing after LA/KA writting capability is removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and use it instead of all_sstable_versions in tests that check
writting of sstables. Following patches remove LA/KA writer so we
want tests to be ready for that and not break by trying to write LA/KA
sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
These features have been around for over 2 years and every reasonable
deployment should have them enabled.
The only case when those features could be not enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.
The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.
tests: unit(gdb)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
Fixes#8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes#8764
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
Move `raft_sys_table_storage_test` and `raft_address_map_test` to
`test/raft` folder since they naturally belong here, not in
`test/boost` folder.
Tests: unit(dev)
* manmanson/move_some_raft_tests_to_raft_folder:
test: raft: move `raft_address_map_test` to `raft` folder
test: raft: move `raft_sys_table_storage_test` to `raft` folder
configure: add extended raft testing dependencies
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it will be shutdown without leadership
transfer the cluster will be unavailable for leader election timeout at
least.
* scylla-dev/raft-stepdown-v4:
raft: test: test leadership transfer timeout
raft: allow to initiate leader stepdown process
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.
At the moment compaction just closes the reader unilaterally
and this yanks the reader from under the queue_reader_handle feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.
Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.
Fixes#8755
Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"
* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
queue_reader_handle: add get_exception method
queue_reader: close: set value on promises on detach from handle
Test that if leadership transfer cannot be done in configured time frame
fsm cancels the leadership transfer process. Also check that timeout_now
message is resent on each tick while leadership transfer is in progress.
This is a more informative name. Helps see that, say, group0
is a separate service and not bundle all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
The existing test_ssl.py which tests for Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite it's name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.
So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.
With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.
Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.
Refs #8837.
Refs #8827.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).
Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:
Lost `timeout_now` messages` (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as
lost `vote_request(force)` from the new candidate
(test_leader_transfer_lost_force_vote_request) don't impact the
election process following that and the leader is elected as normal.
* manmanson/leadership_transfer_tests_v3:
raft: etcd_test: test_transfer_non_member
raft: etcd_test: test_leader_transfer_ignore_proposal
raft: fsm_test: test_leader_transfer_lost_force_vote_request
raft: fsm_test: test_leader_transfer_lost_timeout_now
raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just the matter of accepting it in the grammar.
Fixes#8855Closes#8876
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.
In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).
Refs #8725Closes#8726
* github.com:scylladb/scylla:
test: database_test: add snapshot_skip_flush_works
api: storage_service/snapshots: support skip-flush option
snapshot: support skip_flush option
table: snapshot: add skip_flush option
api: storage_service/snapshots: add sf (skip_flush) option