Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.
But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.
(cherry picked from commit da7edc3a08)
Closesscylladb/scylladb#20976
This fixes a use-after-free bug when parsing clustering key across
pages.
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.
Fixes#20766
(cherry picked from commit 7670ee701a)
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.
In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.
It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
(cherry picked from commit c0fa49bab5)
There are two currently -- upload_sink_base and do_upload_file. This PR merges as much code as possible (spoiler: it's already mostly copy-n-pase-d, so squashing is pretty straightforward)
Closesscylladb/scylladb#20568
* github.com:scylladb/scylladb:
s3/client: Reuse class multipart_upload in do_upload_file
s3/client: Split upload_sink_base class into two
Uploading a file is implemented by the do_upload_file class. This class
re-implements a big portion of what's currently in multipart_upload one.
This patch makes the former class inherit from the latter and removes
all the duplication from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class implements two facilities -- multipart upload protocol itself
plus some common parts of upload_sink_impl (in fact -- only close() and
plugs put(packet)).
This patch aplits those two facilities into two classes. One of them
will be re-used later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
recently, we are observing errors like:
```
stderr: error running operation: rjson::error (JSON SCYLLA_ASSERT failed on condition 'false', at: 0x60d6c8e 0x4d853fd 0x50d3ac8 0x518f5cd 0x51c4a4b 0x5fad446)
```
we only passed `false` to the `RAPIDJSON_ASSERT()` macro, so what we
have is but the type of the error (rjson::error) and a backtrace.
would be better if we can have more information without recompiling
or fetching the debug symbols for decipher the backtrace.
Refs scylladb/scylladb#20533
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20539
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
It's required to make it possible to push lister into with_closeable().
Its requiremenent of nothrow-move-constructible doesn't accept
default-generated one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Directory lister comes with a filter function that tells lister which
entries to skip by its .get() method. For uniformity, add the same to
S3 bucket_lister.
After this change the lister reports shorter name in the returned
directory entry (with the prefix cut), so also need to tune up the unit
test respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch hides directory_lister and bucket_lister behind a common
facade. The intention is to provide a uniform API for sstable_directory
that it could use to list sstables' components wherever they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series fixes an issue where histogram Summaries return an infinite value.
It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.
The series adds a test for summaries with a specific test case for this scenario.
Fixes#20255
Need backport to 6.0, 6.1 and 2023.1 and above
Closesscylladb/scylladb#20257
* github.com:scylladb/scylladb:
test/estimated_histogram_test Add summary tests
utils/histogram.hh: Make summary support inifinite bucket.
This is prerequisite for "restore from object storage" feature. In order to collect the sstables in bucket one would need to list the bucket contents with the given prefix. The ListObjectsV2 provides a way for it and here's the respective s3::client extension.
Closesscylladb/scylladb#20120
* github.com:scylladb/scylladb:
test: Add test for s3::client::bucket_lister
s3_client: Add bucket lister
s3_client: Encode query parameter value for query-string
This patch handles an edge cases related to The infinite bucket
limit.
Summaries are the P50, P95, and P99 quantiles.
The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.
In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;
with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.
Fixes#20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
io_fiber/store_snapshot_descriptor now gets the actual number of items
preserved when the log is truncated, fixing extra entries remained after
log snapshot creation. Also removes incorrect check for the number of
truncated items in the
raft_sys_table_storage::store_snapshot_descriptor.
Minor change: Added error_injection test API for changing snapshot thresholds settings.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080Closesscylladb/scylladb#20095
* github.com:scylladb/scylladb:
raft: Ensure const correctness in applier_fiber.
raft: Invoke store_snapshot_descriptor with actually preserved items.
raft: Use raft_server_set_snapshot_thresholds in tests.
raft: Fix indentation in server.cc
raft: Add a test to check log size after truncation.
raft: Add raft_server_set_snapshot_thresholds injection.
utils: Ensure const correctness of injection_handler::get().
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.
Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.
Fixes#16826
Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.
* Requires backport to live versions. Issue hit in the filed with 2022.2.14
Closesscylladb/scylladb#16827
* github.com:scylladb/scylladb:
repair: do_rebuild_replace_with_repair: use source_dc only when safe
repair: replace_with_repair: pass the replace_node downstream
repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
repair: replace_rebuild_with_repair: pass ks_erms from caller
nodetool: rebuild: add force option
Add and use utils::optional_param to pass source_dc
~~~
utils/tagged_integer: remove conversion to underlying integer
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
~~~
No backport needed (refactoring).
The series is supposed to solve #20081.
Two patches in the series touch up code that is known to be (orthogonally) buggy; see
- `service/raft_sys_table_storage: tweak dead code` (#20080)
- `test/raft/replication: untag index_t in test_case::get_first_val()` (#20151)
Fixes for those (independent) issues will have to be rebased on this series, or this series will have to be rebased on those (due to context conflicts).
The series builds at every stage. The debug and release unit test suites pass at the end.
Closesscylladb/scylladb#20159
* github.com:scylladb/scylladb:
utils/tagged_integer: remove conversion to underlying integer
test/raft/randomized_nemesis_test: clean up remaining index_t usage
test/raft/randomized_nemesis_test: clean up index_t usage in store_snapshot()
test/raft/replication: clean up remaining index_t usage
test/raft/replication: take an "index_t start_idx" in create_log()
test/raft/replication: untag index_t in test_case::get_first_val()
test/raft/etcd_test: tag index_t and term_t for comparisons and subtractions
test/raft/fsm_test: tag index_t and term_t for comparisons and subtractions
test/raft/helpers: tighten compare_log_entries() param types
service/raft_sys_table_storage: tweak dead code
service/raft_sys_table_storage: simplify (snap.idx - preserve_log_entries)
service/raft_sys_table_storage: untag index_t and term_t for queries
raft/server: clean up index_t usage
raft/tracker: don't drop out of index_t space for subtraction
raft/fsm: clean up index_t and term_t usage
raft/log: clean up index_t usage
db/system_keyspace: promise a tagged integer from increment_and_get_generation()
gms/gossiper: return "strong_ordering" from compare_endpoint_startup()
gms/gossiper: get "int32_t" value of "gms::version_type" explicitly
To be used to force usage of source_dc, even
when it is unsafe for rebuild.
Update docs and add test/nodetool/test_rebuild.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.
This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lister resembles the directory_lister from util -- it returns
entries upon its .get() invocation, and should be .close()d at the end.
Internally the lister issues ListObjectsV2 request with provided prefix
and limits the server with the amount of entries returned not to consume
too much local memory (we don't have streaming XML parser for response).
If the result is indeed truncated, the subsequent calls include the
continuation token as per [1]
[1] https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When signing AWS query one need to prepare "query string" which is a
line looking like `encode(query_param)=encode(query_value)&...`. Encoded
are only the query parameter names and values. It was missing in current
code and so far worked because no encodable characters were used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#19960
Write path for sstables/commitlog need to handle the fact that IO extensions can
generate errors, some of which should be considered retry-able, and some that should,
similar to system IO errors, cause the node to go into isolate mode.
One option would of course be for extensions to simply generate std::system_errors,
with system_category and appropriate codes. But this is probably a bad idea, since
it makes it more muddy at which level an error happened, as well as limits the
expressibility of the error.
This adds three distinct types (sharing base) distinguishing permission, availabilty
and configuration errors. These are treated akin to EACCESS, ENOENT and EINVAL in
disk error handler and memtable write loop.
Tests updated to use and verify behaviour.
Closesscylladb/scylladb#19961
before this change, we use the default options for
performing read on the input. and the default options
is like
```c++
struct file_input_stream_options {
size_t buffer_size = 8192; ///< I/O buffer size
unsigned read_ahead = 0; ///< Maximum number of extra read-ahead operations
};
```
which is not able to offer good throughput when
reading from disk, when we stream to S3.
so, in this change, we use options which allows better throughput.
Refs 061def001d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20074
The container containing all verticies doesn't have to be a vector.
Allowing to pass any container that meet conditions, will make to
function more flexible.
instead of reinventing the wheel, let's use the existing one.
in this change, we trade the `div_ceil()` implementated in s3/client.cc
for the existing one in utils/div_ceil.hh . because we are not using
`std::lldiv()` anymore, the corresponding `#include <cstdlib>` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20000
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes#16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
minimum_part_size and aws_maximum_parts_in_piece are
AWS S3 related constraints, they can be reused out of
client::upload_sink and client::upload_jumbo_sink, so
in this change
* extract them out.
* use the user-defined literal with IEC prefix for
better readablity to define minimum_part_size
* add "aws_" prefix to `minimum_part_size` to be
more consistent.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The yaml/json representation for bool is true/false, but boost::lexical_cast
is 1/0. Specialize bool conversion to accept true/false (for yaml/json
compatibilty) and 1/0 (for backward compatibility). This provides
round-trip conversion for bool configs in system.config.
Configuration uses boost::lexical_cast to convert strings to native
values (e.g. bools/ints). However, boost::lexical_cast doesn't
recognize true/false for bool. Since we can't change boost::lexical_cast,
replace it with a wrapper that forwards directly to boost::lexical_cast.
In the next step, we'll specialize it for bool.
utils::in uses std::aligned_storage, which is deprecated. Rather than fixing it, replace its only
user with simpler code and remove it.
No backport needed as this isn't fixing a bug.
Closesscylladb/scylladb#19683
* github.com:scylladb/scylladb:
utils: remove utils/in.hh
gossiper: remove initializer-list overload of add_local_application_state()
The default configuration for replication_strategy_warn_list is
["SimpleStrategy"], but one cannot set this via CQL:
cqlsh> select * from system.config where name = 'replication_strategy_warn_list';
name | source | type | value
--------------------------------+---------+---------------------------+--------------------
replication_strategy_warn_list | default | replication strategy list | ["SimpleStrategy"]
(1 rows)
cqlsh> update system.config set value = '[NetworkTopologyStrategy]' where name = 'replication_strategy_warn_list';
cqlsh> select * from system.config where name = 'replication_strategy_warn_list';
name | source | type | value
--------------------------------+--------+---------------------------+-----------------------------
replication_strategy_warn_list | cql | replication strategy list | ["NetworkTopologyStrategy"]
(1 rows)
cqlsh> update system.config set value = '["NetworkTopologyStrategy"]' where name = 'replication_strategy_warn_list';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for system.config - received 0 responses and 1 failures from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1}
Fix by allowing quotes in enum_set parsing.
Bug present since 8c464b2ddb ("guardrails: restrict
replication strategy (RS)", 6.0).
Fixes#19604.
Closesscylladb/scylladb#19605
The following command had been executed to get the
list of headers that did not contain '#pragma once':
'grep -rnw . -e "#pragma once" --include *.hh -L'
This change adds missing include guard to headers
that did not contain any guard.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#19626
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.
This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.
Fixes https://github.com/scylladb/scylladb/issues/19552Closesscylladb/scylladb#19617
* github.com:scylladb/scylladb:
mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
logalloc: add hold_reserve
logalloc: generalize refill_emergency_reserve()
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.
This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.
In the next patch, we will want to do the thing as
refill_emergency_reserve() does, just with a quantity different
than _emergency_reserve_max. So we split off the shareable part
to a new function, and use it to implement refill_emergency_reserve().
They don't need to modify the captured objects. In fact, they must not
do it in the first place, because the request can be called more than
once and the buffers must not change between those invocations.
For the memory_sink_buffers there must be const method to get the vector
of temporary_buffers themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19599
config_file::add_deprecated_options() returns an lvalue reference
to a parameter which itself is an rvalue reference. In C++20 this
is bad practice (but not a bug in this case) as rvalue references
are not expected to live past the call. In C++23, it fails to compile.
Fix by accepting an lvalue reference for the parameter, and adjust the
caller.
Real S3 server is known to actively close connections, thus breaking S3
storage backend at random places. The recent http client update is more
robust against that, but the needed feature is OFF by default.
refs: scylladb/seastar#1883
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19461