Commit Graph

34 Commits

Author SHA1 Message Date
Pavel Emelyanov
c1c1752f88 s3/client: Replace sink flush semaphore with gate
Uploading sinks have an internal semaphore that limits the maximum
number of parts and pieces being uploaded in parallel to two. This
approach has several drawbacks.

1. The number is arbitrary. It could just as well be three, four or any
   other value

2. Jumbo upload in fact violates this parallelism, because the limit
   applies to the maximum number of pieces _and_ the maximum number of
   parts in each piece that can be uploaded in parallel. Thus jumbo
   upload results in four parts in parallel.

3. Multiple uploads don't sync with each other, so uploading N objects
   would result in N * 2 (or even N * 4 with jumbo) uploads in parallel.

4. A single upload could benefit from using more sockets if no other
   uploads happen in parallel. IOW -- the limit should be shard-wide,
   not single-upload-wide

Previous patches already put the per-shard parallelism under (some)
control, so this semaphore is in fact just a way to collect background
uploading fibers on final flush and thus can be replaced with a gate.

As a side effect, this fixes the issue that writes-after-flush
shouldn't happen (see #13320) -- once flushed, the upload gate is
closed and subsequent writes hit a gate-closed error.
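
The gate semantics described above can be sketched as a single-threaded toy model (loosely modeled on seastar::gate; the class and method names here are illustrative, and the real gate's close() also waits asynchronously for pending fibers):

```cpp
#include <stdexcept>

// Illustrative model of the gate: background uploads enter() it,
// flush() closes it, and any write-after-flush hits a gate-closed
// error -- exactly the behavior the commit message describes.
class upload_gate {
    int _pending = 0;
    bool _closed = false;
public:
    void enter() {                   // start of a background upload fiber
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_pending;
    }
    void leave() {                   // a background upload fiber resolved
        --_pending;
    }
    void close() {                   // final flush(): reject new entrants;
        _closed = true;              // the real gate also waits for
    }                                // _pending to drop to zero
    int pending() const { return _pending; }
};
```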

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:38:57 +03:00
Pavel Emelyanov
99b92f0ed8 s3/client: Configure different max-connections on http clients
After the previous patch, different sched groups got different http
clients. By default each client is started with 100 allowed
connections. This can be too much -- 100 * nr-sched-groups * smp::count
can be quite a huge number. Also, different groups should have
different parallelism, e.g. flush/compaction doesn't care that much
about latency and can use fewer sockets, while the query class benefits
from larger concurrency.

As a starter -- configure http clients with a maximum of shares/100
sockets. Thus the query class gets 10 sockets and flush/compaction -- 1.
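
A minimal sketch of this shares-based heuristic (the helper name is hypothetical; the real configuration lives in the client setup code):

```cpp
#include <algorithm>

// Hypothetical helper illustrating the shares/100 rule described above:
// a group with 1000 shares (query) gets 10 sockets, a group with 100
// shares (flush/compaction) gets 1; never drop below one connection.
unsigned max_connections_for(unsigned shares) {
    return std::max(1u, shares / 100);
}
```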

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:35:59 +03:00
Pavel Emelyanov
81d1bfce2a s3/client: Maintain several http clients on-board
The intent is to isolate workloads from different sched groups from each
other and not let one sched group consume all sockets from the http
client thus affecting requests made by other sched groups.

The contention happens on the maximum number of sockets an http client
may have (see scylladb/seastar#1652). If requests take time and the
client is asked to make more and more of them, it will eventually stop
spawning new connections and will get blocked internally, waiting for
running requests to complete and put a socket back into the pool. If
one sched group's workload (e.g. -- memtable flush) consumes all the
available sockets, then a workload from another group (e.g. -- query)
gets blocked, thus spoiling its latency (which is poor on its own, but
still)

After this change the S3 client maintains a sched_group:http_client
map, making sure different sched groups don't clash with each other, so
that e.g. query requests don't wait for flush/compaction to release a
socket.
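
The per-group map idea can be sketched with stand-in types (the real code keys by seastar::scheduling_group and stores seastar http clients; everything here is illustrative):

```cpp
#include <memory>
#include <unordered_map>

// Stand-in for the http client; the real one is seastar's.
struct http_client {
    unsigned max_connections = 100;
};

// Sketch: one http client per scheduling group, created lazily, so one
// group exhausting its socket pool cannot starve another group.
class s3_client {
    std::unordered_map<unsigned, std::shared_ptr<http_client>> _https;
public:
    std::shared_ptr<http_client> http_for_group(unsigned sched_group_id) {
        auto& slot = _https[sched_group_id];
        if (!slot) {
            slot = std::make_shared<http_client>();
        }
        return slot;
    }
};
```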

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:28:55 +03:00
Pavel Emelyanov
a8492a065b s3/client: Remove now unused http reference from sink and file
Now these two classes use client-> calls and don't need the http&
shortcut.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:28:30 +03:00
Pavel Emelyanov
b9ee0d385b s3/client: Add make_request() method
This helper call will serve several purposes.

First, it makes the necessary preparations to the request before
sending it, in particular -- calling authorize().

Second, there's a need to re-make requests that failed with a
"connection closed" error (see #13736).

Third, one S3 client is shared between different scheduling groups. In
order to isolate groups' workloads from each other, different http
clients should be used, and this helper will be in charge of selecting
one.
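
The three purposes can be sketched like this (a synchronous stand-in: the names, the single-retry policy, and the send callback are illustrative, not the real async code):

```cpp
#include <functional>
#include <stdexcept>
#include <string>

struct request {
    std::string path;
    bool authorized = false;
};

struct connection_closed_error : std::runtime_error {
    connection_closed_error() : std::runtime_error("connection closed") {}
};

// Sketch of make_request(): prepare (authorize) the request, send it
// via a per-group client (abstracted as a callback here), and re-make
// it once if the connection was closed under us.
std::string make_request(request rq,
                         const std::function<std::string(const request&)>& send) {
    rq.authorized = true;                    // first purpose: authorize()
    try {
        return send(rq);                     // third purpose: per-group client
    } catch (const connection_closed_error&) {
        return send(rq);                     // second purpose: re-make on close
    }
}
```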

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:19:19 +03:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
At that level no io_priority_class-es exist. Instead, all IO happens in
the context of the current sched-group. The file API no longer accepts
a prio class argument (and makes the io_intent arg mandatory for
impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching the file_impl-s is smaller, but also mechanical:
- replace the io_priority_class& argument with an io_intent* one
- pass the intent to the lower file (if applicable)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairy -- it needs to use the task
queues list for IO class names and shares, but to detect whether it
should, it checks whether the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Pavel Emelyanov
908d0d2e6a s3/client: Wait for background upload fiber on close-abort
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case the client needs to abort the
operation, it calls .close() without flush()ing. In this case the S3
API Abort is made and the sink can be terminated. It's expected that
the background fibers would resolve on their own eventually, but that's
not quite the case.

First, they hold units of the semaphore, and the semaphore must stay
alive until the units are returned.

Second, the PUT (or copy) request can finish successfully and may be
sitting in the reactor queue waiting for its continuation to get
scheduled. The continuation references the sink via its "this" capture
to put the part etag.

Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.

That said -- background fibers must be waited upon on .close() if the
closing is aborting (if it's a successful close, then the fibers must
have been picked up by the final flush() call).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
f9686926c2 s3/client: Implement jumbo upload sink
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.

To make it work the new sink inherits from the uploading base class,
but instead of keeping memory_data_sink_buffers with parts, it keeps a
sink to upload a temporary intermediate object with parts. When the
object is "full", i.e. the number of parts in it hits the limit, the
object is flushed, then copied into the target object with the S3 API
call, and then the intermediate object is deleted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
8fa3294ae1 s3/client: Move memory buffers to upload_sink from base
All buffer manipulations now happen in the upload_sink class and the
respective member can be removed from the base class. The base class
only touches the buffers in its upload_part() call, but that's
unavoidable, as uploading a part implies sending its contents, which
sit in the buffers.

Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
2ac5ecd659 s3/client: Move last part upload out of finalize_upload()
This change has two reasons. The first is to facilitate moving the
memory_data_sink_buffers out of the base class, i.e. -- a continuation
of the previous patch. Also, this fixes a corner case -- if the final
sink flush happens right after the previous part was sent for
uploading, the finalization doesn't happen and sink closing aborts the
upload even if it was successful.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
407b40c430 s3/client: Merge do_flush() with upload_part()
The do_flush() helper is practically useless, because what it does can
be done by upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from the base class to the uploader class in
the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
a88629227f s3/client: Rename upload_sink -> upload_sink_base
Another sink will appear that implements multipart upload with the help
of the copy-part functionality. The current uploading code is going to
be partially re-used, so this patch moves all of it into the base class
in advance. Next patches will pick the needed parts.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
613acba5d0 s3: Pick client from manager via handle
Add a global factory onto the client that

- is cross-shard copyable
- generates a client from the local storage_manager for a given endpoint

With that the s3 file handle is fixed: it now picks up shared s3
clients from the storage manager instead of creating its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
8ed9716f59 s3: Generalize s3 file handle
Currently the s3 file handle tries to carry the client's info via an
explicit host name and an endpoint config pointer. This is buggy: the
latter pointer is shard-local and cannot be transferred across shards.

This patch prepares the fix by abstracting the client handle part.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
63ff6744d8 s3: Live-update clients' configs
Now that the client is accessible directly via the storage_manager,
when the latter is requested to update its endpoint config, it can kick
the client to do the same.

The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage remain immutable.

Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is part of the keyspace configuration,
and updating it in object_storage.yaml will have no effect.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Raphael S. Carvalho
ad471e5846 s3: Provide timestamps in the s3 file implementation
SSTables rely on st.st_mtime for providing the creation time of the
data file, which in turn is used by features like tombstone compaction.

Fixes #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:51:12 -03:00
Raphael S. Carvalho
57661f0392 s3: Introduce get_object_stats()
get_object_stats() will be used for retrieving content size and
also last modified.

The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.

Refs #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:51:10 -03:00
Raphael S. Carvalho
da2ccc44a4 s3: introduce get_object_header()
This allows other functions to reuse the code to retrieve the
object header.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:49:52 -03:00
Pavel Emelyanov
98b9c205bb s3/client: Sign requests if configured
If the endpoint config specifies an AWS key, secret and region, all S3
requests get signed. The signature should have all the x-amz-...
headers included and should contain at least three of them. This patch
includes the x-amz-date, x-amz-content-sha256 and host headers in the
signing list. The content can be left unsigned when sent over HTTPS,
which is what this patch does.
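
For reference, the AWS SigV4 "SignedHeaders" component is built from such a list by lowercasing, sorting, and semicolon-joining the header names. A small sketch, detached from the real signing code (which also builds the canonical request and the HMAC-SHA256 key-derivation chain, omitted here):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Sketch: produce the SigV4 "SignedHeaders" string for the headers
// named above -- lowercase, sorted, semicolon-separated.
std::string signed_headers_list(std::vector<std::string> headers) {
    for (auto& h : headers) {
        std::transform(h.begin(), h.end(), h.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
    }
    std::sort(headers.begin(), headers.end());
    std::string out;
    for (const auto& h : headers) {
        if (!out.empty()) {
            out += ';';
        }
        out += h;
    }
    return out;
}
```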

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:37 +03:00
Pavel Emelyanov
3dd82485f6 s3/client: Add connection factory with DNS resolve and configurable HTTPS
Existing seastar factories work on a socket_address, but in S3 we have
an endpoint name, which is a DNS name in the case of real S3. So this
patch creates the http client for S3 with a custom connection factory
that does two things.

First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).

Since s3 client creation is currently no-wait code, the above
initialization is spawned in a fiber, and before creating the
connection this fiber is waited upon.

This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:19 +03:00
Pavel Emelyanov
3bec5ea2ce s3/client: Keep server port on config
Currently the code temporarily assumes that the endpoint port is 9000.
This is what the tests' local minio is started with. This patch keeps
the port number in the endpoint config and makes the test get the port
number from the minio starting code via the environment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
85f06ca556 s3/client: Construct it with config
Similar to the previous patch -- extend the s3::client constructor to
take the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are not yet used either.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
caf9e357c8 s3/client: Construct it with sstring endpoint
Currently the client is constructed with a socket_address, which is
prepared by the caller from the endpoint string. That's not flexible
enough, because the s3 client needs to know the original endpoint
string for two reasons.

First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.

So this patch just relaxes the client constructor to accept the
endpoint string and hard-codes the 9000 port. The latter is temporary
-- this is how the local tests' minio is started -- but the next patch
will make it configurable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
2f6aa5b52e code: Introduce conf/object_storage.yaml configuration file
In order to access a real S3 bucket, the client should use signed
requests over https. Partially this is due to security considerations,
partially it is unavoidable, because multipart uploading is banned for
unsigned requests on S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep the payload unsigned.

To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name

The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so since there's an option to use
S3 URLs without the region name in them, which we may want some time.

To keep the described configuration the proposed place is the
object_storage.yaml file with the format

endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...

When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:15 +03:00
Kefu Chai
37f1beade5 s3/client: do not allocate potentially big object on stack
when compiling using GCC-13, it warns that:

```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
  224 | sstring parse_multipart_upload_id(sstring& body) {
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~
```

so it turns out that `rapidxml::xml_document<>` can be very large;
let's allocate it on the heap instead of the stack to address this
issue.
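
The shape of the fix, with a hypothetical large type standing in for `rapidxml::xml_document<>` (which embeds a big internal allocation pool):

```cpp
#include <cstddef>
#include <memory>

// Stand-in for a type with a large footprint.
struct big_parser {
    char pool[64 * 1024];
};

// Before: `big_parser doc;` would put ~64KiB on this function's stack
// frame. After: the object lives on the heap and the frame holds only
// a unique_ptr.
std::size_t parse_on_heap() {
    auto doc = std::make_unique<big_parser>();
    return sizeof(*doc);
}
```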

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13722
2023-05-01 22:46:18 +03:00
Pavel Emelyanov
9a9dbffce3 s3/client: Zeroify stat by default
The s3::readable_file::stat() call returns a hand-crafted stat
structure with some fields set to sane values, most of them constants.
However, other fields remain uninitialized, which sometimes leads to
trouble. Better to fill the stat with zeroes now and revisit it later
for more sane values.
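
The fix amounts to value-initializing the struct before filling in the meaningful fields; a sketch (the function name and field choices are illustrative, not the real implementation):

```cpp
#include <sys/stat.h>

// Sketch: zero-fill the whole stat first, then set the few fields the
// implementation actually knows, so nothing is left uninitialized.
struct stat make_s3_stat(off_t size) {
    struct stat st{};              // every field starts at zero
    st.st_mode = S_IFREG | 0644;   // sane constant values...
    st.st_nlink = 1;
    st.st_size = size;             // ...and the known object size
    return st;
}
```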

fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13650
2023-04-25 09:53:47 +02:00
Pavel Emelyanov
30b6f34a0b s3/client: Explicitly set _upload_id empty when completing
The upload_sink::_upload_id remains empty until the upload starts,
stays non-empty while it proceeds, then becomes empty again after it
completes. The upload_started() method checks that, and on .close() a
started upload is aborted.

The final switch to empty is done by std::move()ing the upload id into
the completion request, but it's better to use std::exchange() to
emphasize the fact that the _upload_id becomes empty at that point for
a reason.
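
A minimal illustration of the std::move() vs std::exchange() difference, with the class stripped down to the relevant member (names are simplified):

```cpp
#include <string>
#include <utility>

// Stripped-down sketch: _upload_id being non-empty means "upload in
// flight"; completion must leave it empty so a later .close() does not
// try to abort an already-completed upload.
struct upload_sink {
    std::string _upload_id;

    bool upload_started() const { return !_upload_id.empty(); }

    std::string take_upload_id_for_completion() {
        // was: std::move(_upload_id) -- leaves it empty only as a
        // side effect of the move; std::exchange() states the intent.
        return std::exchange(_upload_id, {});
    }
};
```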

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13570
2023-04-20 17:32:08 +03:00
Pavel Emelyanov
7c7a3416c5 s3/client: Add comments about multipart upload completion message
The message length is pre-calculated to provide a correct
content-length request header. This math is not obvious and deserves a
comment.

Also, the final message preparation code implicitly checks whether any
part failed to upload. There's a comment about it in the upload_sink's
upload_part() method, but the finalization place deserves one too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:08:34 +03:00
Pavel Emelyanov
3f86bed600 s3/client: Fix succeeded/failed part upload final checking
When all part uploads complete, the final message is prepared and sent
out to the server. The preparation code is also responsible for
checking that all parts uploaded OK by verifying each part's etag is
non-empty. Into that check a misprint crept -- the whole list is
checked for emptiness, not the individual etag itself.
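
The corrected check, sketched out (names are illustrative): each etag is tested individually rather than testing the list as a whole.

```cpp
#include <string>
#include <vector>

// Sketch of the fixed finalization check: a part that failed to upload
// leaves its etag empty, so every etag must be inspected on its own.
bool all_parts_uploaded(const std::vector<std::string>& part_etags) {
    for (const auto& etag : part_etags) {
        if (etag.empty()) {        // buggy version tested part_etags.empty()
            return false;
        }
    }
    return true;
}
```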

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:08:15 +03:00
Pavel Emelyanov
79379760e6 s3/client: Fix parts to start from 1
The docs say that part numbers should start from 1, while the code
follows tradition and starts from 0. Minio is conveniently incompatible
in this sense, so the test had been passing so far. On real S3, part
number 0 ends up as a failed request.
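
The off-by-one in a nutshell (hypothetical helper): the zero-based part index used internally must be shifted to the 1-based PartNumber that S3 expects on the wire.

```cpp
// S3 PartNumber is 1-based (1..10000); internal part indices start at 0.
unsigned s3_part_number(unsigned part_index) {
    return part_index + 1;
}
```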

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 10:43:12 +03:00
Pavel Emelyanov
b1501d4261 s3/client: Don't use designated initialization of sys stat struct
It makes the compiler complain about mis-ordered initialization of
st_nlink vs st_mode on different arches. The current code (st_nlink
before st_mode) compiles fine on x86, but fails on ARM, which wants
st_mode to come before st_nlink. Changing the order would, apparently,
break the x86 build with a similar message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13499
2023-04-13 15:13:56 +03:00
Pavel Emelyanov
033fa107f8 utils: Add S3 readable file impl for random reads
Sometimes an sstable is used for random reads, sometimes -- for
streamed reads using an input stream. For both cases the file API can
be provided, because the S3 API allows random reads of arbitrary
lengths.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Pavel Emelyanov
a4a64149a6 utils: Add S3 data sink for multipart upload
Putting a large object into S3 using a plain PUT is a bad choice -- one
needs to collect the whole object in memory, then send it as a
content-length request with a plain body. Multipart upload puts less
stress on memory, but it has its limitation -- each part should be at
least 5Mb in size. For that reason using the file API doesn't work --
the file IO API operates with external memory buffers and the file impl
would only have raw pointers to them. In order to collect a 5Mb chunk
in RAM the impl would have to copy the memory, which is not good.
Unlike the file API, the data_sink API is more flexible, as it has
temporary buffers at hand and can cache them in a zero-copy manner.

Having said that, the S3 data_sink implementation is like this:

* put(buffer):
  move the buffer into the local cache; once the cache grows above 5Mb,
  send out the part

* flush:
  send out whatever is in cache, then send upload completion request

* close:
  check that the upload finished (in flush), abort the upload otherwise

Users of the API may (actually, should) wrap the sink with an
output_stream and use it as any other output_stream.
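
The put/flush flow above can be sketched as a compact toy model (the threshold is shrunk from 5Mb for illustration; names are hypothetical, and the real sink sends parts asynchronously over HTTP):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model of the described data_sink: put() caches buffers without
// copying their payload elsewhere, a part is "sent" when the cache
// crosses the threshold, and flush() drains the tail and completes
// the upload.
class multipart_sink {
    std::vector<std::string> _cache;
    std::size_t _cached = 0;
    const std::size_t _threshold;
public:
    std::vector<std::string> sent_parts;   // what upload-part would carry
    bool completed = false;

    explicit multipart_sink(std::size_t threshold) : _threshold(threshold) {}

    void put(std::string buf) {
        _cached += buf.size();
        _cache.push_back(std::move(buf));  // buffer moves in, zero-copy
        if (_cached >= _threshold) {
            send_part();
        }
    }
    void flush() {
        if (_cached > 0) {
            send_part();                   // last, possibly small, part
        }
        completed = true;                  // upload-completion request
    }
private:
    void send_part() {
        std::string part;
        for (auto& b : _cache) {
            part += b;
        }
        sent_parts.push_back(std::move(part));
        _cache.clear();
        _cached = 0;
    }
};
```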

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Pavel Emelyanov
3745b5c715 utils: Add S3 client with basic ops
Those include: HEAD to get the size, PUT to upload an object in one go,
GET to read the object as a contiguous buffer, and DELETE to drop one.

The client uses http client from seastar and just implements the S3
protocol using it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00