CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.
notebaly, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. and clang-16 implemented CWG2631 in
https://reviews.llvm.org/D136554. But unfortunately, this change
was not backported to clang-15.
before switching over to clang-16, for using std::source_location::current()
as the default parameter and expect the behavior defined by CWG2631,
we have to use the compatible layer provided by Seastar. otherwise
we always end up having the source_location at the callee side, which
is not interesting under most circumstances.
so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). despite that
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate the "namespace compat" defined somewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.
see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14086
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current S3 uploading sink has implicit limit for the final file size that comes from two places. First, S3 protocol declares that uploading parts count from 1 to 10000 (inclusive). Second, uploading sink sends out parts once they grow above S3 minimal part size which is 5Mb. Since sstables puts data in 128kb (or smaller) portions, parts are almost exactly 5Mb in size, so the total uploading size cannot grow above ~50Gb. That's too low.
To break the limit the new sink (called jumbo sink) uses the UploadPartCopy S3 call that helps splicing several objects into one right on the server. Jumbo sink starts uploading parts into an intermediate temporary object called a piece and named ${original_object}_${piece_number}. When the number of parts in current piece grows above the configured limit the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background, meanwhile the new piece is created and subsequent data is put into it. When the sink is flushed the current piece is flushed as is and also squashed into the object.
The new jumbo sink is capable of uploading ~500Tb of data, which looks enough.
fixes: #13019Closes#13577
* github.com:scylladb/scylladb:
sstables: Switch data and index sink to use jumbo uploader
s3/test: Tune-up multipart upload test alignment
s3/test: Add jumbo upload test
s3/client: Wait for background upload fiber on close-abort
c3/client: Implement jumbo upload sink
s3/client: Move memory buffers to upload_sink from base
s3/client: Move last part upload out of finalize_upload()
s3/client: Merge do_flush() with upload_part()
s3/client: Rename upload_sink -> upload_sink_base
We don't use the return value of erase, so
we can allow it to return anything. We'll
need this for ring_mapping, since
boost::icl::interval_map::erase(it)
returns void.
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case client needs to abort the operation
it calls .close() without flush()ing. In this case the S3 API Abort is
made and the sink can be terminated. It's expected that background
fibers would resolve on their own eventually, but it's not quite the
case.
First, they hold units for the semaphore and the semaphore should be
alive by the time units are returned.
Second, the PUT (or copy) request can finish successfully and it may be
sitting in the reactor queue waiting for its continuation to get
scheduler. The continuation references sink via "this" capture to put
the part etag.
Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.
Said that -- background fibers must be waited upon on .close() if the
closing is aborting (if it's successfull close, then the fibers mush
have been picked up by final flush() call).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.
To make it work the new sink inherits from the uploading base class, but
instead of keeping memory_data_sink_buffers with parts it keeps a sink
to upload a temporary intermediate object with parts. When the object is
"full", i.e. the number of parts in it hits the limit, the object is
flushed, then copied into the target object with the S3 API call, then
deletes the intermediate object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the buffers manipulations now happen in the upload_sink class and
the respective member can be removed from base class. The base class
only messes with the buffers in its upload_part() call, but that's
unavoidable, as uploading part implies sending its contents which sits
in buffers.
Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change has two reasons. First, is to facilitate moving the
memory_data_sink_buffers from base class, i.e. -- continuation of the
previous patch. Also this fixes a corner case -- if final sink flush
happens right after the previous part was sent for uploading, the
finalization doesn't happen and sink closing aborts the upload even if
it was successful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_flush() helper is practically useless because what it does can be
done by the upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from base class to uploader class by next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will appear another sink that would implement multipart upload
with the help of copy-part functionality. Current uploading code is
going to be partially re-used, so this patch moves all of it into the
base class in advance. Next patches will pick needed parts.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add add a respective unit test.
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apprently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)
Implementing min/max correctly
Fixes#13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add basic test for tagged+integer arithmetic operations.
Remove const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This series changes the code to issue raft read
barrier before the pull which will guaranty that the keyspace is created
before the actual schema pull is performed.
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.
In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint one each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle, that's used to transfer sstable's file across shards, should keep aboard all it needs to re-create a file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owner by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep sharded peering service referenced by the sstables_manager that's initialized in main and passed though the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (it's overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patch seastar http::client, there's PR already doing that, once (if) merged there'll come one more fix on top.
refs: #13458
refs: #13369
refs: scylladb/seastar#1652Closes#13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
Add the global-factory onto the client that is
- cross-shard copyable
- generates a client from local storage_manager by given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry client's info via explicit
host name and endpoint config pointer. This is buggy, the latter pointer
is shard-local can cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the client is accessible directli via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is the part of keyspace configuration and
updating one in the object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the source of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible so with a little effort this can be done.
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes#13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.
To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`. It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When topology state sm reaches the
point when it needs to switch reads to a new topology, it passes
`request_read_new=true` in a call to `update_pending_ranges`. This
forces `update_pending_ranges` to compute the ranges based on new
topology and store them to the `interval_map`. On the data plane, when a
read on coordinator needs to decide which endpoints to use, it first
consults this `interval_map` in `token_metadata`, and only if it doesn't
contain a range for current token it uses normal endpoints from
`effective_replication_map`.
Closes#13376
* github.com:scylladb/scylladb:
storage_proxy, storage_service: use new read endpoints
storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
token_metadata: add unit test for endpoints_for_reading
token_metadata: add endpoints for reading
sequenced_set: add extract_set method
token_metadata_impl: extract maybe_migration_endpoints helper function
token_metadata_impl: introduce migration_info
token_metadata_impl: refactor update_pending_ranges
token_metadata: add unit tests
token_metadata: fix indentation
token_metadata_impl: return unique_ptr from clone functions
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.
Fixes#13752Closes#13774
* github.com:scylladb/scylladb:
raft, idl: restore internal::tagged_uint64 type
raft: define term_t as a tagged uint64_t
idl: gossip_digest: include required headers
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8Fixes#13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Therefore, let's implement it.
Fixes https://github.com/scylladb/scylladb/issues/13649.
Closes#13713
* github.com:scylladb/scylladb:
s3: Provide timestamps in the s3 file implementation
s3: Introduce get_object_stats()
s3: introduce get_object_header()
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes#13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving content size and
also last modified.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13765
functinoality wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so no need to keep it. in this change,
it is dropped in favor of <=>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13794
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).
One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.
In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.
Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.
Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
Becomes useful in later patches.
To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
x-ams-date, x-amz-content-sha256 and host headers into the signing list.
The content can be unsigned when sent over HTTPS, this is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar's factories work on socket_address, but in S3 we have
endpoint name which's a DNS name in case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.
First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).
Since s3 client creation is no-waiting code currently, the above
initialization is spawned in afiber and before creating the connection
this fiber is waited upon.
This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on endpoint config and makes test get the port number from
minio starting code via environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch -- extent the s3::client constructor to get
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are yet unused too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the client is constructed with socket_address which's prepared
by the caller from the endpoint string. That's not flexible engouh,
because s3 client needs to know the original endpoint string for two
reasons.
First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.
So this patch just relaxes the client constructor to accept the endpoint
string and hard-code the 9000 port. The latter is temporary, this is how
local tests' minio is started, but next patch will make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart-uploading is banned for unsigned
requests on the S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so there's an option to use S3
URLs without region name in them we could want to use some time.
To keep the described configuration the proposed place is the
object_storage.yaml file with the format
endpoints:
- name: a.b.c
port: 443
aws_key: 12345
aws_secret: abcdefghijklmnop
...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's unused.
Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).
Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.
This doesn't break anything since to_string(initializer_list)
wasn't used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kind of containers.
Modernize the implementation using templates based
on std::ranges::range and using fmt::join.
Extend unit test for formatting different types of ranges,
boost::transformed ranges, deque.
Fixes#13146
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
when compiling using GCC-13, it warns that:
```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
224 | sstring parse_multipart_upload_id(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
```
so it turns out that `rapidxml::xml_document<>` could be very large,
let's allocate it on heap instead of on the stack to address this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13722