The get_object_contiguous() accepts optional range argument in a form of
offset:lengh and then converts it into first_byte:last_byte pair to
satisfy http's Range header range-specifier.
If the lat_byte, which is offset + lenght - 1, overflows 64-bits the
range specifier becomes invalid. According to RFC9110 servers may ignore
invalid ranges if they want to and this is what minio does.
The result is pretty interesting. Since the range is specified, client
expect PartialContent response, but since the range is ignored by server
the result is OK, as if the full object was requested. So instead of
some sane "overflow" error, the get_object_contiguous() fails with
status "success".
The fix is in pre-checking provided ranges and failing early
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
If S3 readable file is used inside file input stream, the latter may
call its read methods with position that is above file size. In that
case server replies with generic http error and the fact that the range
was invalid is encoded into reply body's xml.
That's not great to catch this via wrong reply status exception and xml
parsing all the more so we can know that the read is out-of-bound in
advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
S3-based sstables components are immutable, so every time stat is called
there's no need to ping server again.
But the main intention of this patch is to provide stats for read calls
in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now its plain updateable_value, but without the ..._source object the
updateable_value is just a no-op value holder. In order for the
observers to operate there must be the value source, updating it would
update the attached updateable values _and_ notify the observers.
In order for the config to be the u.v._source, config entries should be
comparable to each other, thus the <=> operator for it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When http request resolves with excpetion it makes sense to translate
the network exception into storage exceptio to make upper layers think
that it was some sort of IO error, not SUDDENLY and http one.
The translation is, for now, pretty simple:
- 404 and 3xx -> ENOENT
- 403(forbidden) and 401(unauthorized) -> EACCESS
- anything else -> EIO
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The jumbo sink is there to upload files that can be potentially larger
than 50Gb (10000*5Mb). For that the sink uploads a set of so called
"pieces" -- files up to 50Gb each -- then uses the copy-upload APi call
to squash the pieces together. After copying the piece is removed. In
case of a crash while uploading pieces remain in the bucket forever
which is not great.
This patch tags pieces with 'kind=piece' tag in order to tell pieces
from regular objects. This can be used, for example, by setting up the
lifecycle tag-based policy and collect dangling pieces eventually.
fixes: #13670
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16023
on debian derivatives librapidxml-dev installs rapidxml.h as
rapixml/rapidxml.hpp, so let's use it as a fallback.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15814
When the non-jumbo sink is flushed and notices that the real upload is
not started yet, it may just go ahead and PUT the buffers into the
object with the single request.
For jumbo sink the fallback is not implemented as it likely doesn't make
and any sense -- jumbo sinks are unlikely to produce less than 5Mb of
data so it's going to be dead code anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It may happen that wrapping up multipart upload fails too. However,
before sending the request the driver clears the _upload_id field thus
marking the whole process as "all is OK". So in case the finalization
method fails and thrown, the upload context remains on the server side
forever.
Fix this by keeping the _upload_id set, so even if finalization throws,
closing the uploader notices this and calls abort.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#15521
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5Mb. When the necessary amount of bytes is accumulated, the part uploading fibers starts in the background. On flush the sink waits for all the fibers to complete and handles failure of any.
Uploading parallelism is nowadays limited by the means of the http client max-connections parameter. However, when a part uploading fibers waits for it connection it keeps the 5Mb+ buffers on the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the number of background buffers S3 clients (and theirs http clients) may use.
Closesscylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
when accessing AWS resources, uses are allowed to long-term security
credentials, they can also the temporary credentials. but if the latter
are used, we have to pass a session token along with the keys.
see also https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
so, if we want to programatically get authenticated, we need to
set the "x-amz-security-token" header,
see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#UsingTemporarySecurityCredentials
so, in this change, we
1. add another member named `token` in `s3::endpoint_config::aws_config`
for storing "AWS_SESSION_TOKEN".
2. populate the setting from "object_storage.yaml" and
"$AWS_SESSION_TOKEN" environment variable.
3. set "x-amz-security-token" header if
`s3::endpoint_config::aws_config::token` is not empty.
this should allow us to test s3 client and s3 object store backend
with S3 bucket, with the temporary credentials.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15486
When a piece is uploaded it's first flushed, then upload-copy is issued.
Both happen in the background and if piece flush calls resolves with
exception the exception remains unhandled. That's OK, since upload
finalization code checks that some pieces didn't complete (for whatever
reason) and fails the whole uploading, however, the ignored exception is
reported in logs. Not nice.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#15491
as the size of `rapidxml::xml_document` size quite large, let's
allocate it on the heap. otherwise GCC 13.2.1 warns us like:
```
utils/s3/client.cc: In function ‘seastar::sstring s3::parse_multipart_copy_upload_etag(seastar::sstring&)’:
utils/s3/client.cc:455:9: warning: stack usage is 66208 bytes [-Wstack-usage=]
455 | sstring parse_multipart_copy_upload_etag(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15472
When uploading an object part, client spawns a background fiber that
keeps the buffers with data on the http request's write_body() lambda
capture. This generates unbound usage of memory with uploaded buffers
which is not nice. Even though s3 client is limited with http's client
max-connections parallelism, waiting for the available connection still
happens with buffers held in memory.
This patch makes the client claim the background memory from the
provided semaphore (which, in turn, sits on the shard-wide storage
manager instance). Once body writing is complete, the claimed units are
returned back to the semaphore allowing for more background writes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore will be used to cap memory consumption by client. This
patch makes sure the reference to a semaphore exists as an argument to
client's constructor, not more than that.
In scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked as this patch just pushes the needed argument all the
way around, next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series introduces a scylla-native nodetool. It is invokable via the main scylla executable as the other native tools we have. It uses the seastar's new `http::client` to connect to the specified node and execute the desired commands.
For now a single command is implemented: `nodetool compact`, invokable as `scylla nodetool compact`. Once all the boilerplate is added to create a new tool, implementing a single command is not too bad, in terms of code-bloat. Certainly not as clean as a python implementation would be, but good enough. The advantages of a C++ implementation is that all of us in the core team know C++ and that it is shipped right as part of the scylla executable..
Closes#14841
* github.com:scylladb/scylladb:
test: add nodetool tests
test.py: add ToolTestSuite and ToolTest
tools/scylla-nodetool: implement compact operation
tools/scylla-nodetool: implement basic scylla_rest_api_client
tools: introduce scylla-nodetool
utils: export dns_connection_factory from s3/client.cc to http.hh
utils/s3/client: pass logger to dns_connection_factory in constructor
tools/utils: tool_app_template::run_async(): also detect --help* as --help
We want to publish this class in a header so it can be used by others,
but it uses the s3 logger. We don't want future users to pollute the s3
logs, so allow users to pass their own loggers to the factory.
The added metrics include:
- http client metrics, which include the number of connections, the number of active connections and the number of new connections made so far
- IO metrics that mimic those for traditional IO -- total number of object read/write ops, total number of get/put/uploaded bytes and individual IO request delay (round-trip, including body transfer time)
fixes: #13369Closes#14494
* github.com:scylladb/scylladb:
s3/client: Add IO stats metrics
s3/client: Add HTTP client metrics
s3/client: Split make_request()
s3/client: Wrap http client with struct group_client
s3/client: Move client::stats to namespace scope
s3/client: Keep part size local variable
seastar has deprecated the overload which accepts `server_name`,
let's use the one which accepts `tls::tls_options`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15324
These metrics mimic the existing IO ones -- total number of read
operation, total number of read bytes and total read delay. And the same
for writing.
This patch makes no difference between wrting object with plain PUT vs
putting it with multipart uploading. Instead, it "measures" individual
IO writes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently an http client has several exported "numbers" regarding the
number of transport connections the client uses. This patch exports
those via S3 client's per-sched-group metrics and prepares the ground
for more metrics in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will appear another make_request() helper that'll do mostly the
same. This split will help to avoid code duplication
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The http-client is per-sched-group. Next patch will need to keep metrics
per-sched-group too and this sched-group -> http-client map is the good
place to put them on. Wrapping struct will allow extending it with
metrics
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stats is stats about object, not about client, so it's better if it
lives in namespace scope. Also it will avoid conflicts with client stats
that will be reported as metrics (later patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This serves two purposes. First, it fixes potential use-after-move since
the bufs are moved on lambda and bufs.size() are called in the same
statement with no defined evaluation order.
Second, this makes 'size' varable alive up to the time request is
complete thus making it possible to update stats with it (later patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with taggings, and
filter them based on them.
* test/pylib/minio_server.py: enable anonymous user to perform
more actions. because the tagging related ops are not enabled by
"mc anonymous set public", we have to enable them using "set-json"
subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14486
Uploading sinks have internal semaphore limiting the maximum number of
uploading parts and pieces with the value of two. This approach has
several drawbacks.
1. The number is random. It could as well be three, four and any other
2. Jumbo upload in fact violates this parallelizm, because it applies to
maximum number of pieces _and_ maximum number of parts in each piece
that can be uploaded in parallels. Thus jumbo upload results in four
parts in parallel.
3. Multiple uploads don't sync with each other, so uploading N objects
would result in N * 2 (or even N * 4 with jumbo) uploads in parallel.
4. Single upload could benefit from using more sockets if no other
uploads happen in parallel. IOW -- limit should be shard-wide, not
single-upload-wide
Previous patches already put the per-shard parallelizm under (some)
control, so this semaphore is in fact used as a way to collect
background uploading fibers on final flush and thus can be replaced with
a gate.
As a side effect, this fixes an issue that writes-after-flush shouldn't
happen (see #13320) -- when flushed the upload gate is closed and
subsequent writes would hit gate-closed error.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patch different sched groups got different http clients.
By default each client is started with 100 allowed connections. This can
be too much -- 100 * nr-sched-groups * smp::count can be quite huge
number. Also, different groups should have different parallelizm, e.g.
flush/compaction doesn't care that much about latency and can use fewer
sockets while query class is more welcome to have larger concurrency.
As a starter -- configure http clients with maximum shares/100 sockets.
Thus query class would have 10 and flush/compaction -- 1.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The intent is to isolate workloads from different sched groups from each
other and not let one sched group consume all sockets from the http
client thus affecting requests made by other sched groups.
The conention happens in the maximim number of socket an http client may
have (see scylladb/seastar#1652). If requests take time and client is
asked to make more and more it will eventually stop spawning new
connections and would get blocked internally waiting for running
requests to complete and put a socket back to pool. If a sched group
workload (e.g. -- memtable flush) consumes all the available sockets
then workload from another group (e.g. -- query) would be blocked thus
spoiling its latency (which is poor on its own, but still)
After this change S3 client maintains a sched_group:http_client map
thus making sure different sched groups don't clash with each other so
that e.g. query requests don't wait for flush/compaction to release a
socket.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper call will serve several purposes.
First, make necessary preparations to the request before making, in
particular -- calling authorize()
Second, there's the need to re-make requests that failed with
"connection closed" error (see #13736)
Third, one S3 client is shared between different scheduling groups. In
order to isolate groups' workload from each other different http clients
should be used, and this helper will be in change of selecting one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case client needs to abort the operation
it calls .close() without flush()ing. In this case the S3 API Abort is
made and the sink can be terminated. It's expected that background
fibers would resolve on their own eventually, but it's not quite the
case.
First, they hold units for the semaphore and the semaphore should be
alive by the time units are returned.
Second, the PUT (or copy) request can finish successfully and it may be
sitting in the reactor queue waiting for its continuation to get
scheduler. The continuation references sink via "this" capture to put
the part etag.
Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.
Said that -- background fibers must be waited upon on .close() if the
closing is aborting (if it's successfull close, then the fibers mush
have been picked up by final flush() call).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.
To make it work the new sink inherits from the uploading base class, but
instead of keeping memory_data_sink_buffers with parts it keeps a sink
to upload a temporary intermediate object with parts. When the object is
"full", i.e. the number of parts in it hits the limit, the object is
flushed, then copied into the target object with the S3 API call, then
deletes the intermediate object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the buffers manipulations now happen in the upload_sink class and
the respective member can be removed from base class. The base class
only messes with the buffers in its upload_part() call, but that's
unavoidable, as uploading part implies sending its contents which sits
in buffers.
Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change has two reasons. First, is to facilitate moving the
memory_data_sink_buffers from base class, i.e. -- continuation of the
previous patch. Also this fixes a corner case -- if final sink flush
happens right after the previous part was sent for uploading, the
finalization doesn't happen and sink closing aborts the upload even if
it was successful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_flush() helper is practically useless because what it does can be
done by the upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from base class to uploader class by next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will appear another sink that would implement multipart upload
with the help of copy-part functionality. Current uploading code is
going to be partially re-used, so this patch moves all of it into the
base class in advance. Next patches will pick needed parts.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the global-factory onto the client that is
- cross-shard copyable
- generates a client from local storage_manager by given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry client's info via explicit
host name and endpoint config pointer. This is buggy, the latter pointer
is shard-local can cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the client is accessible directli via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is the part of keyspace configuration and
updating one in the object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes#13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving content size and
also last modified.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
x-ams-date, x-amz-content-sha256 and host headers into the signing list.
The content can be unsigned when sent over HTTPS, this is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>