Commit Graph

27165 Commits

Author SHA1 Message Date
Tomasz Grabiec
2c727f37fb api: Drop sstable index caches on system/drop_sstable_caches 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
f553db69f7 cached_file: Issue single I/O for the whole read range on miss
Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.

As an optimization, issue a single I/O if the first page is missing.

This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
2021-07-02 19:02:14 +02:00
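The idea in the commit above can be sketched in a few lines. This is a minimal, self-contained illustration and not Scylla's cached_file: the class name `page_cache_sketch`, the tiny 4-byte page size, and the in-memory string standing in for the file are all assumptions made for the example. It reads one possible interpretation of the message: on the first missing page, a single I/O covers the rest of the requested range and populates the cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a page cache in front of a file.
class page_cache_sketch {
    static constexpr std::size_t page_size = 4;   // tiny pages, for illustration only
    std::string _file;                            // stands in for the on-disk file
    std::map<std::size_t, std::string> _pages;    // page index -> cached page content
    std::size_t _io_count = 0;                    // number of simulated I/O requests

    // One simulated I/O: fetch pages [first, last] in a single request
    // and populate the cache with any page that is not there yet.
    void read_from_file(std::size_t first, std::size_t last) {
        ++_io_count;
        for (std::size_t p = first; p <= last; ++p) {
            _pages.emplace(p, _file.substr(p * page_size, page_size));
        }
    }
public:
    explicit page_cache_sketch(std::string file) : _file(std::move(file)) {}

    std::size_t io_count() const { return _io_count; }

    // Read [offset, offset + len), clamped to the end of the file.
    std::string read(std::size_t offset, std::size_t len) {
        if (len == 0 || offset >= _file.size()) {
            return {};
        }
        len = std::min(len, _file.size() - offset);
        const std::size_t first = offset / page_size;
        const std::size_t last = (offset + len - 1) / page_size;

        // Skip over pages that are already cached.
        std::size_t p = first;
        while (p <= last && _pages.count(p)) {
            ++p;
        }
        // On the first miss, issue one I/O for the remainder of the range
        // instead of one I/O per missing page.
        if (p <= last) {
            read_from_file(p, last);
        }

        std::string out;
        for (std::size_t q = first; q <= last; ++q) {
            out += _pages.at(q);
        }
        return out.substr(offset - first * page_size, len);
    }
};
```

A read that spans three uncached pages costs one simulated I/O here, and a later read over the same pages costs none.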
Tomasz Grabiec
6a6403d19d row_cache: cache_tracker: Do not register metrics when constructed for tests
Some tests will create two cache_tracker instances because of one
being embedded in the sstable test env.

This would lead to double registration of metrics, which raises a
run-time error. Avoid this by not registering metrics in Prometheus in
tests at all.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1f74863bf8 sstables, cached_file: Evict cache gently when sstable is destroyed
We must evict before the _cached_index_file associated with the
sstable goes away. Better to do it gently to avoid stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
f14576f4be sstables: Hide partition_index_cache implementation away from sstables.hh
Reduces scope of the header to index_reader.hh which reduces
recompilation time.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7d34799f3f sstables: Drop shared_index_lists alias 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
af4cc233c3 sstables: Destroy partition index cache gently
There could be a lot of them so we should clear it gently to avoid
reactor stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
9f957f1cf9 sstables: Cache partition index pages in LSA and link to LRU
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b3728f7d9b utils: Introduce lsa::weak_ptr<>
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferencing it will
always give a valid reference.

This can be used as a building block for implementing cursors into
LSA-based caches.

Example simple use:

     // LSA-managed
     struct X : public lsa::weakly_referencable<X> {
         int value;
     };

     lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
           X* x = current_allocator().construct<X>();
           return x->weak_from_this();
     });

     std::cout << x_ptr->value;
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2a852cd0c9 sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
The new names are less confusing.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
934824394a sstables, cached_file: Avoid copying buffers from cache when parsing promoted index 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7b6f18b4ed cached_file: Introduce get_page_units()
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
23bc19643f sstables: read: Document that primitive_consumer::read_32() is alloc-free
Callers will rely on it to assume that it does not invalidate
references to LSA objects.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b98e660a4a sstables: read: Count partition index page evictions 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
8360a64f73 sstables: Drop the _use_binary_search flag from index entries
It doesn't have to be set by the parser now that the cursors are
created lazily; pass it to the cursor when it's created.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
06e373e272 sstables: index_reader: Keep index objects under LSA
In preparation for caching index objects, manage them under LSA.

Implementation notes:

key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it.  Actual linearization should be rare since partition
keys are typically small.

Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:

  class parsed_promoted_index_entry;
  class parsed_partition_index_entry;

This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
20ef54e9ed lsa: chunked_managed_vector: Adapt more to managed_vector
For seamless transition.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
78e5b9fd85 utils: lsa: chunked_managed_vector: Make LSA-aware
The max chunk size is set to be 10% of segment size.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
856e4a539d test: chunked_managed_vector_test: Make exception_safe_class standard layout
Required by managed_vector<> due to its use of offsetof()

In preparation for switching chunked_managed_vector storage to
managed_vector<>.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
c87ea09535 lsa: Copy chunked_vector to chunked_managed_vector
In preparation for adapting it to LSA. Split into two steps to make
review easier.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1523a7d367 utils: managed_vector: Make clear_and_release() public
Will be needed by index reader to ensure that destructor doesn't
invoke the allocator so that all is destroyed in the desired
allocation context before the object is destroyed.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2b673478aa sstables: index_reader: Do not expose index_entry references
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.

This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a955e7971d sstables: index_reader: Don't store schema reference inside index_entry
To save space.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
9e7bf066a9 sstables: index_reader: Don't store file object inside promoted_index
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
86b135056c sstables: index_reader: Don't store front buffer inside promoted_index
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.

Makes the promoted_index structure easier to move to LSA.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
484e06d69b cached_file: Always start at offset 0
All current uses start at offset 0, so simplify the code by assuming it.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
078a6e422b sstables: Cache all index file reads
After this patch, there is a single index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces the amount of I/O on subsequent reads.

As part of this, cached_file needed to be adjusted in the following ways.

The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, cached_file is changed to keep buffers in LSA
using the lsa_buffer allocation method.

When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When a page
is not used, it only lives in the LSA space.

In subsequent patches, cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializing a
temporary_buffer.

While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
b5ca0eb2a2 lsa: Introduce lsa_buffer
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:

  1) LSA segments only hold buffers, they don't hold metadata. They
     also don't mix with standard allocations. So a 128K segment can
     hold 32 4K buffers.

  2) objects' lifetime is managed by lsa_buffer, an owning smart
     pointer, which is automatically updated when buffers are migrated
     to another segment. This makes LSA allocations easier to use and
     off-loads metadata management to the client (which can keep the
     lsa_buffer wherever it wants).

The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entangled object inside lsa_buffer.

The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a23f27034f lsa: Introduce entangled helper
Will be useful in building higher-level LSA tools.
2021-07-02 19:02:13 +02:00
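The lsa_buffer commit above describes a pair of "entangled" objects, one in the segment descriptor and one inside the owning pointer. A minimal sketch of such a pairing follows, assuming the essential property is that each side holds a pointer to its peer and fixes up that pointer when it is moved. The class name `entangled_sketch` and its member functions are hypothetical and are not taken from Scylla's helper.

```cpp
#include <cassert>
#include <utility>

// Hypothetical sketch: two objects that track each other's address.
// Moving either side updates the peer, so the link survives relocation.
class entangled_sketch {
    entangled_sketch* _peer = nullptr;
public:
    entangled_sketch() = default;
    entangled_sketch(const entangled_sketch&) = delete;
    entangled_sketch& operator=(const entangled_sketch&) = delete;

    // Take over the link from `other` and tell the peer about the new address.
    entangled_sketch(entangled_sketch&& other) noexcept : _peer(other._peer) {
        if (_peer) {
            _peer->_peer = this;
        }
        other._peer = nullptr;
    }

    entangled_sketch& operator=(entangled_sketch&& other) noexcept {
        if (this != &other) {
            reset();                 // drop our current link, if any
            _peer = other._peer;
            if (_peer) {
                _peer->_peer = this;
            }
            other._peer = nullptr;
        }
        return *this;
    }

    // Destroying one side leaves the other side unlinked, never dangling.
    ~entangled_sketch() { reset(); }

    // Pair two currently unlinked objects.
    static void link(entangled_sketch& a, entangled_sketch& b) noexcept {
        assert(!a._peer && !b._peer);
        a._peer = &b;
        b._peer = &a;
    }

    // Break the link on both sides.
    void reset() noexcept {
        if (_peer) {
            _peer->_peer = nullptr;
            _peer = nullptr;
        }
    }

    entangled_sketch* peer() const noexcept { return _peer; }
    explicit operator bool() const noexcept { return _peer != nullptr; }
};
```

With this shape, an allocator can move the object on its side (for example during compaction) and the owner's pointer stays valid without any lookup, which is the behaviour the commit message attributes to lsa_buffer.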
Tomasz Grabiec
056f14063e lsa: Encapsulate segment_descriptor::_free_space access
Prepares for reusing some of its bits for storing segment kind.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
019956739d cached_file: Switch to bplus::tree
In order to be able to move it to LSA later.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
f537d1a7e5 tests: sstables: Do not call open_data() twice
make_sstable_containing() already calls open_data(), so does
load(). This will trigger assertion failure added in a later patch:

   assert(!_cached_index_file);

There is no need to call load() here.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
627a2ef087 test: cached_file: Add test for eof_error 2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8fbea0b5b7 utils: cached_file: Introduce file wrapper
It's an adaptor between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8e2118069b sstables: cached_file: Account buffers returned by cached_file under read_permit
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
a5c72ed899 sstables, database: Keep cache_tracker reference inside sstables_manager
So that sstable code can pick it up for caching (lru and region).
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
4b51e0bf30 row_cache: Move cache_tracker to a separate header
It will be needed by the sstable layer to get to the LRU and the
LSA region. Split to avoid inclusion of the whole row_cache.hh
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element types.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
6b59c8cfb1 utils: Introduce general-purpose LRU
The LRU can link objects of different types, which is achieved by
having a virtual base class called "evictable" from which the linked
objects should inherit. When the object is removed from the LRU,
evictable::on_evicted() is called.

The container is non-owning.
2021-07-02 10:25:58 +02:00
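The design described in the commit above (a virtual "evictable" base class, an on_evicted() callback, and a non-owning container) can be sketched as follows. This is an illustrative reconstruction, not the contents of utils/lru.hh: the hand-written intrusive list, the `lru` member functions `add`, `remove`, `touch` and `evict`, and the two element types are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Base class for anything that can be linked into the LRU. The link
// pointers live inside the object, so the container never allocates
// and never owns its elements.
class evictable {
    friend class lru;
    evictable* _prev = nullptr;
    evictable* _next = nullptr;
    bool _linked = false;
public:
    virtual ~evictable() = default;
    // Called by lru::evict() after the object has been unlinked.
    virtual void on_evicted() = 0;
    bool is_linked() const noexcept { return _linked; }
};

// Non-owning LRU: head is least recently used, tail is most recently used.
class lru {
    evictable* _head = nullptr;
    evictable* _tail = nullptr;
public:
    // Link at the most-recently-used end.
    void add(evictable& e) noexcept {
        assert(!e._linked);
        e._prev = _tail;
        e._next = nullptr;
        if (_tail) { _tail->_next = &e; } else { _head = &e; }
        _tail = &e;
        e._linked = true;
    }
    // Unlink without notifying the object.
    void remove(evictable& e) noexcept {
        assert(e._linked);
        if (e._prev) { e._prev->_next = e._next; } else { _head = e._next; }
        if (e._next) { e._next->_prev = e._prev; } else { _tail = e._prev; }
        e._prev = nullptr;
        e._next = nullptr;
        e._linked = false;
    }
    // Mark as recently used.
    void touch(evictable& e) noexcept { remove(e); add(e); }
    // Unlink the least recently used object and notify it.
    // Returns false when the LRU is empty.
    bool evict() {
        if (!_head) { return false; }
        evictable& e = *_head;
        remove(e);
        e.on_evicted();
        return true;
    }
};

// Two unrelated element types sharing one LRU (both hypothetical).
struct cached_row final : evictable {
    std::vector<std::string>* log;
    explicit cached_row(std::vector<std::string>& l) : log(&l) {}
    void on_evicted() override { log->push_back("row"); }
};

struct cached_page final : evictable {
    std::vector<std::string>* log;
    explicit cached_page(std::vector<std::string>& l) : log(&l) {}
    void on_evicted() override { log->push_back("page"); }
};
```

Because eviction is dispatched through the virtual on_evicted(), the LRU does not need to know the concrete type of what it links, which is what allows rows and index pages to compete for the same memory budget.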
Nadav Har'El
7a5111c580 Merge 'messaging_service: do not listen on port 0' from Benny Halevy
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.

Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.

Fixes #8957

Test: unit(dev)
DTest: listen_ports_conf_test (modified)

Closes #8956

* github.com:scylladb/scylla:
  messaging_service: do_start_listen: improve info log accuracy
  messaging_service: never listen on port 0
2021-06-30 18:41:58 +03:00
Nadav Har'El
7ab48b405f CQL: always validate NetworkTopologyStrategy replication factor
The replication factor passed to NetworkTopologyStrategy (which we call
by the confusing name "auto expand") may or may not be used (see
explanation why in #8881), but regardless, we should validate that it's
a legal number and not some non-numeric junk, and we should report the error.

Before this patch, the two commands

CREATE KEYSPACE name WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
ALTER KEYSPACE name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 'foo' }

succeed despite the invalid replication factor "foo". After this patch,
the second command fails.

The problem fixed here is reproduced by the existing test
test_keyspace.py::test_alter_keyspace_invalid when switching it to use
NetworkTopologyStrategy, as suggested by issue #8638.

Fixes #8880
Refs #8881

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620100442.194610-1-nyh@scylladb.com>
2021-06-30 16:49:46 +03:00
Benny Halevy
51bc6c8b5a messaging_service: do_start_listen: improve info log accuracy
Make sure to log the info message when we actually
start listening.

Also, print a log message when listening on the
broadcast address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:25:21 +03:00
Benny Halevy
df442d4d24 messaging_service: never listen on port 0
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.

Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:24:54 +03:00
Nadav Har'El
029991bfc2 test/cql-pytest: test that SSL CQL port doesn't accept unencrypted connections
Scylla doesn't allow unencrypted connections over encrypted CQL ports
(Cassandra does allow this, by setting "optional: true", but it's not
secure and not recommended). Here we add a test verifying that, indeed,
we can't connect to an SSL port using an unencrypted connection.

The test passes on Scylla, and also on Cassandra (run it on Cassandra
with "test/cql-pytest/run-cassandra --ssl" - for which we added support
in a recent patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629121514.541042-1-nyh@scylladb.com>
2021-06-29 16:42:22 +03:00
Nadav Har'El
dc4c05b2e3 test/cql-pytest: switch some fixture scopes from "session" to "module"
Fixtures in conftest.py (e.g., the test_keyspace fixture) can be shared by
all tests in all source files, so they are marked with the "session"
scope: All the tests in the testing session may share the same instance.
This is fine.

Some of the test files have additional fixtures for creating special
tables needed only in those files. Those were, unnecessarily, marked
with the "session" scope as well. This means that these temporary tables
are only deleted at the very end of the test suite, even though they can
be deleted at the end of the test file which needed them - other test
source files don't have access to them anyway. This is exactly what the
"module" fixture scope is, so this patch changes all the fixtures that
are private to one test file to use the "module" scope.

After this patch, the teardown of the last test in the suite goes down
from 0.26 seconds to just 0.06 seconds.

Another benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.

This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #8932
2021-06-29 16:10:47 +03:00
Calle Wilund
a40b6a2f54 commitlog: Use disk file alignment info (with lower value if possible)
Previously, the disk block alignment of segments was hardcoded (due to
really old code). Now we use the value as declared in the actual file
opened. If we are using a previously written file (i.e. o_dsync), we
can even use the sometimes smaller "read" alignment.

Also allow config to completely override this with a disk alignment
config option (not exposed to global config yet, but can be).

v2:
* Use overwrite alignment if doing only overwrite
* Ensure to adjust actual alignment if/when doing file wrapping

v3:
* Kill alignment config param. Useless and unsafe.

Closes #8935
2021-06-29 16:00:49 +03:00
Nadav Har'El
7e4bef96af test/cql-pytest: support "--ssl" option in run-cassandra
This patch adds support for the "--ssl" option in run-cassandra, which
will now be able, like run (which runs Scylla), to run Cassandra
listening for *SSL-encrypted* CQL connections. The "--ssl" option is also
passed to the tests, so they know to encrypt their CQL connections.

We already had support for this feature in the test/cql-pytest/run
script - which runs Scylla. Adding this also to the run-cassandra
script can help verify that a behavior we notice in Scylla's SSL support
and we want to add to a test - is also shared by Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629082532.535229-1-nyh@scylladb.com>
2021-06-29 12:05:40 +03:00
Takuya ASADA
edd54a9463 reloc: add arch to relocatable package filename
Add architecture name for relocatable packages, to support distributing
both x86_64 version and aarch64 version.

Also create symlink from new filename to old filename to keep
compatibility with older scripts.

Fixes #8675

Closes #8709

[update tools/python3 submodule:

* tools/python3 ad04e8e...afe2e7f (1):
  > reloc: add arch to relocatable package filename
]
2021-06-28 15:01:09 +03:00
Avi Kivity
f660726773 Update seastar submodule
* seastar 0e48ba883...eaa00e761 (3):
  > memory: reduce statistics TLS initialization even more
  > Merge "Sanitize io-topology creation on start" from Pavel E
  > doc/prometheus: note that metric family is passed by query name
2021-06-28 11:52:36 +03:00
Botond Dénes
09309f5dbf reader_concurrency_semaphore: on_permit_created(): remove noexcept
The permit creation path enters the semaphore's permit gate in
on_permit_created(). Entering this gate can throw so this method is not
noexcept. Remove the noexcept specifier accordingly.
Also enter the gate before adding the permit to the permit list, to save
some work when this fails.

Fixes: #8933

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210628074941.32878-1-bdenes@scylladb.com>
2021-06-28 11:04:38 +03:00