
1412 Commits

Author SHA1 Message Date
Pavel Emelyanov
3bec5ea2ce s3/client: Keep server port on config
Currently the code temporarily assumes that the endpoint port is 9000.
This is what the tests' local minio is started with. This patch keeps the
port number in the endpoint config and makes the test get the port number from
the minio starting code via the environment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
85f06ca556 s3/client: Construct it with config
Similar to the previous patch -- extend the s3::client constructor to get
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are not used yet either.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
caf9e357c8 s3/client: Construct it with sstring endpoint
Currently the client is constructed with a socket_address which is prepared
by the caller from the endpoint string. That's not flexible enough,
because s3 client needs to know the original endpoint string for two
reasons.

First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.

So this patch just relaxes the client constructor to accept the endpoint
string and hard-codes the 9000 port. The latter is temporary, this is how
local tests' minio is started, but the next patch will make it configurable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
2f6aa5b52e code: Introduce conf/object_storage.yaml configuration file
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart-uploading is banned for unsigned
requests on the S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.

To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name

The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so because S3 also accepts URLs
without a region name in them, which we may want to use some day.

To keep the described configuration the proposed place is the
object_storage.yaml file with the format

endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...

When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
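
A minimal sketch of how such a map could look once loaded, assuming a
plain struct per endpoint. This is not the actual db::config code: the
type names, the `find_endpoint()` helper and the `aws_region` field name
are hypothetical (the YAML above elides the remaining keys).

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical in-memory form of one `endpoints:` entry. `port`, `aws_key`
// and `aws_secret` mirror the YAML keys above; `aws_region` is an assumed name.
struct endpoint_config {
    std::uint16_t port = 443;
    std::string aws_key;
    std::string aws_secret;
    std::string aws_region;
};

// Endpoint name (the `name:` key) -> its configuration.
using endpoint_map = std::unordered_map<std::string, endpoint_config>;

// Returns the configuration for `name`, or std::nullopt if none is configured.
inline std::optional<endpoint_config> find_endpoint(const endpoint_map& endpoints,
                                                    const std::string& name) {
    auto it = endpoints.find(name);
    if (it == endpoints.end()) {
        return std::nullopt;
    }
    return it->second;
}
```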

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:15 +03:00
Benny Halevy
959a740dac utils: to_string: get rid of utils::join
Use `fmt::format("{}", fmt::join(...))` instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:59:58 +03:00
Benny Halevy
e6bcb1c8df utils: to_string: get rid of to_string(std::initializer_list)
It's unused.

Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).

Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.

This doesn't break anything since to_string(initializer_list)
wasn't used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
ba883859c7 utils: to_string: get rid of to_string(const Range&)
Use fmt::to_string instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
15c9f0f0df utils: to_string: generalize range helpers
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kinds of containers.

Modernize the implementation using templates based
on std::ranges::range and using fmt::join.

Extend unit test for formatting different types of ranges,
boost::transformed ranges, deque.

Fixes #13146

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
45153b58bd utils: chunked_vector: add std::ranges::range ctor
To be used in next patch for constructing
chunked_vector from an initializer_list.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Kefu Chai
37f1beade5 s3/client: do not allocate potentially big object on stack
when compiling using GCC-13, it warns that:

```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
  224 | sstring parse_multipart_upload_id(sstring& body) {
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~
```

so it turns out that `rapidxml::xml_document<>` could be very large,
let's allocate it on heap instead of on the stack to address this issue.
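
A minimal sketch of the fix pattern, not the actual client.cc code:
`big_document` is a hypothetical stand-in for `rapidxml::xml_document<>`,
and its 64 KiB size is an assumption picked to mirror the warning above.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for a parser object with a large inline memory pool.
struct big_document {
    std::array<char, 64 * 1024> pool{};
    std::size_t tags = 0;
};

// Counts '<' characters while keeping the big object off the stack.
inline std::size_t count_open_brackets(const std::string& body) {
    // Only the pointer lives in this stack frame; the 64 KiB live on the heap.
    auto doc = std::make_unique<big_document>();
    for (char c : body) {
        if (c == '<') {
            ++doc->tags;
        }
    }
    return doc->tags;
}
```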

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13722
2023-05-01 22:46:18 +03:00
Kefu Chai
43e9910fa0 utils/chunked_managed_vector: use operator<=> when appropriate
instead of crafting 4 operators manually, just delegate it to <=>.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13698
2023-04-28 15:59:08 +03:00
Kamil Braun
30cc07b40d Merge 'Introduce tablets' from Tomasz Grabiec
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.

Things achieved in this PR:

  - You can start a cluster and create a keyspace whose tables will use
    tablet-based replication. This is done by setting `initial_tablets`
    option:

    ```
        CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                        'replication_factor': 3,
                        'initial_tablets': 8};
    ```

    All tables created in such a keyspace will be tablet-based.

    Tablet-based replication is a trait, not a separate replication
    strategy. Tablets don't change the spirit of replication strategy, it
    just alters the way in which data ownership is managed. In theory, we
    could use it for other strategies as well like
    EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
    is augmented to support tablets.

  - You can create and drop tablet-based tables (no DDL language changes)

  - DML / DQL work with tablet-based tables

    Replicas for tablet-based tables are chosen from tablet metadata
    instead of token metadata

Things which are not yet implemented:

  - handling of views, indexes, CDC created on tablet-based tables
  - sharding is done using the old method, it ignores the shard allocated in tablet metadata
  - node operations (topology changes, repair, rebuild) are not handling tablet-based tables
  - not integrated with compaction groups
  - tablet allocator piggy-backs on tokens to choose replicas.
    Eventually we want to allocate based on current load, not statically

Closes #13387

* github.com:scylladb/scylladb:
  test: topology: Introduce test_tablets.py
  raft: Introduce 'raft_server_force_snapshot' error injection
  locator: network_topology_strategy: Support tablet replication
  service: Introduce tablet_allocator
  locator: Introduce tablet_aware_replication_strategy
  locator: Extract maybe_remove_node_being_replaced()
  dht: token_metadata: Introduce get_my_id()
  migration_manager: Send tablet metadata as part of schema pull
  storage_service: Load tablet metadata when reloading topology state
  storage_service: Load tablet metadata on boot and from group0 changes
  db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
  migration_notifier: Introduce before_drop_keyspace()
  migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
  test: perf: Introduce perf-tablets
  test: Introduce tablets_test
  test: lib: Do not override table id in create_table()
  utils, tablets: Introduce external_memory_usage()
  db: tablets: Add printers
  db: tablets: Add persistence layer
  dht: Use last_token_of_compaction_group() in split_token_range_msb()
  locator: Introduce tablet_metadata
  dht: Introduce first_token()
  dht: Introduce next_token()
  storage_proxy: Improve trace-level logging
  locator: token_metadata: Fix confusing comment on ring_range()
  dht, storage_proxy: Abstract token space splitting
  Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
  db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
  db: Introduce get_non_local_vnode_based_strategy_keyspaces()
  service: storage_proxy: Avoid copying keyspace name in write handler
  locator: Introduce per-table replication strategy
  treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
  locator: Introduce effective_replication_map
  locator: Rename effective_replication_map to vnode_effective_replication_map
  locator: effective_replication_map: Abstract get_pending_endpoints()
  db: Propagate feature_service to abstract_replication_strategy::validate_options()
  db: config: Introduce experimental "TABLETS" feature
  db: Log replication strategy for debugging purposes
  db: Log full exception on error in do_parse_schema_tables()
  db: keyspace: Remove non-const replication strategy getter
  config: Reformat
2023-04-27 09:40:18 +02:00
Kefu Chai
f5b05cf981 treewide: use defaulted operator!=() and operator==()
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined; the language now understands
that the comparison is symmetric.

fortunately, our operator!=() is always equivalent to
`! operator==()`, this matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.

in addition to the defaulted operator!=, C++20 also brings us
the defaulted operator==() -- the compiler is able to generate
operator==() as a member-wise comparison.
under some circumstances, this is exactly what we need. so,
in this change, if the operator==() is also implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is implemented using the default
generated one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.

sometimes, we fail to mark the operator== with the `const`
specifier, in this change, to fulfil the need of C++ standard,
and to be more correct, the `const` specifier is added.

also, to generate the defaulted operator==, the operand should
be `const class_name&`, but it is not always the case, in the
class of `version`, we use `version` as the parameter type, to
fulfill the need of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantic of the comparison operator. and is a more idiomatic
way to pass non-trivial struct as function parameters.

please note, because in C++20, both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric form of another variant. if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.

this change is a cleanup to modernize the code base with C++20
features.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13687
2023-04-27 10:24:46 +03:00
Botond Dénes
3e92bcaa20 Merge 'utils: redesign reusable_buffer' from Michał Chojnowski
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.

This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.

Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.

Closes #13324

* github.com:scylladb/scylladb:
  zstd: share buffers between compressor instances
  utils: redesign reusable_buffer
2023-04-27 09:09:09 +03:00
Michał Chojnowski
bf26a8c467 utils: redesign reusable_buffer
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (e.g. some compression
libraries, like LZ4).

Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.

An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely that it will be able
to handle future requests without reallocation, but also the more
memory it ties up.

If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.

However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.

The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.

This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.

The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
   seen in the last period.
   As long as at least one maximal (up to a power of 2) "normal" request
   appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
   power of 2.
3. The resize down period is no longer measured in number of requests
   but in real time.

Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.

Fixes #13437
2023-04-26 22:09:17 +02:00
Botond Dénes
8765442f3f Merge 'utils: add basic_xx_hasher' from Benny Halevy
Consolidate `bytes_view_hasher` and abstract_replication_strategy `factory_key_hasher` which are the same into a reusable utils::basic_xx_hasher.

To be used in a followup series for netw::msg_addr.

Closes #13530

* github.com:scylladb/scylladb:
  utils: hashing: use simple_xx_hasher
  utils: hashing: add simple_xx_hasher
  utils: hashers: add HasherReturning concept
  hashing: move static_assert to source file
2023-04-25 09:53:47 +02:00
Pavel Emelyanov
9a9dbffce3 s3/client: Zeroify stat by default
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to some sane values, most are constants. However,
other fields remain not initialized which leads to troubles sometimes.
Better to fill the stat with zeroes and later revisit it for more sane
values.
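
A minimal sketch of the approach, assuming a POSIX system. The helper
name and the concrete field values are assumptions; this is not the
actual s3::readable_file::stat() code.

```cpp
#include <sys/stat.h>
#include <sys/types.h>

// Builds a stat for a read-only regular file of the given size.
inline struct stat make_object_stat(off_t size) {
    struct stat st{};                // value-initialization zeroes every field
    st.st_mode = S_IFREG | S_IRUSR;  // assumed values, assigned member by member
    st.st_nlink = 1;                 // so the field order in the struct does not matter
    st.st_size = size;
    return st;
}
```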

fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13650
2023-04-25 09:53:47 +02:00
Benny Halevy
f4fefec343 utils: hashing: add simple_xx_hasher
And a respective unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 14:06:43 +03:00
Benny Halevy
b638dddf1b utils: hashers: add HasherReturning concept
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().

HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 14:06:40 +03:00
Benny Halevy
a765472b8b hashing: move static_assert to source file
No need to check it inline in the header.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 12:23:03 +03:00
Tomasz Grabiec
5a24984147 utils, tablets: Introduce external_memory_usage() 2023-04-24 10:49:37 +02:00
Botond Dénes
864d27f9af Merge 'clear_gently: handle null unique_ptr and optional values' from Benny Halevy
This series adds handling of null std::unique_ptr to utils::clear_gently
and handling of std::optional and seastar::optimized_optional (both engaged and disengaged cases).

Also, unit tests were added to test the above cases.
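
A minimal synchronous sketch of the null and disengaged checks. The real
utils::clear_gently is asynchronous and returns a future; the
`clear_sketch()` overloads below are hypothetical and only show where the
checks go.

```cpp
#include <memory>
#include <optional>
#include <vector>

// Declare all overloads up front so they can call each other.
template <typename T> void clear_sketch(std::vector<T>& v);
template <typename T> void clear_sketch(std::unique_ptr<T>& p);
template <typename T> void clear_sketch(std::optional<T>& o);

template <typename T>
void clear_sketch(std::vector<T>& v) {
    v.clear();
}

template <typename T>
void clear_sketch(std::unique_ptr<T>& p) {
    if (p) {  // a null pointer has nothing to clear and must not be dereferenced
        clear_sketch(*p);
    }
}

template <typename T>
void clear_sketch(std::optional<T>& o) {
    if (o) {  // same for a disengaged optional
        clear_sketch(*o);
    }
}
```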

Fixes #13636

Closes #13638

* github.com:scylladb/scylladb:
  utils: clear_gently: add variants for optional values
  utils: clear_gently: do not clear null unique_ptr
2023-04-24 10:27:32 +03:00
Benny Halevy
002865018f utils: clear_gently: add variants for optional values
Implement clear_gently for std::optional<T>
and seastar::optimized_optional<T> and respective
unit tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 21:34:02 +03:00
Benny Halevy
12877ad026 utils: clear_gently: do not clear null unique_ptr
Otherwise the null pointer is dereferenced.

Add a unit test reproducing the issue
and testing this fix.

Fixes #13636

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 21:33:11 +03:00
Benny Halevy
d1817e9e1b utils: move generation-number to gms
Although get_generation_number implementation is
completely generic, it is used exclusively to seed
the gossip generation number.

Following patches will define a strong gms::generation_id
type and this function should return it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Benny Halevy
f5f566bdd8 utils: add tagged_integer
A generic template for defining strongly typed
integer types.

Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 08:37:32 +03:00
Kefu Chai
a2aa133822 treewide: use std::lexicographical_compare_three_way
the standard library offers
`std::lexicographical_compare_three_way()`, and we never use the
last two additional parameters, which are not provided by
`std::lexicographical_compare_three_way()`. there is no need to have
a homebrew version of the trichotomic compare function.

in this change,

* all occurrences of `lexicographical_tri_compare()` are replaced
  with `std::lexicographical_compare_three_way()`.
* `lexicographical_tri_compare()` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13615
2023-04-21 14:28:18 +03:00
Botond Dénes
10c1f1dc80 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
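
A simplified sketch of the resolution problem, assuming the tombstone
deletes every entry strictly older than a bound derived from the new
entry's timestamp. `tombstone_covers()` is a hypothetical helper, not
the actual system_keyspace code.

```cpp
#include <chrono>

// With gc_older_than = 0 the tombstone should delete every entry older than
// the new one. `Resolution` is the precision used to compute the bound.
template <typename Resolution>
bool tombstone_covers(std::chrono::microseconds new_entry,
                      std::chrono::microseconds old_entry) {
    // Truncating the new entry's timestamp can move the bound below old_entry.
    auto bound = std::chrono::duration_cast<Resolution>(new_entry);
    return old_entry < bound;
}
```

For two schema changes at 1000100us and 1000900us, the millisecond bound
is 1000ms, which is not above the older entry, so it survives; the
microsecond bound covers it.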

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID
2023-04-21 14:08:56 +03:00
Kamil Braun
218a056825 utils: UUID_gen: accept decimicroseconds in min_time_UUID
The function now accepts higher-resolution duration types, such as
microsecond resolution timestamps. Will be used by the next commit.
2023-04-21 10:33:02 +02:00
Pavel Emelyanov
30b6f34a0b s3/client: Explicitly set _upload_id empty when completing
The upload_sink::_upload_id remains empty until upload starts, remains
non-empty while it proceeds, then becomes empty again after it
completes. The upload_started() method checks that, and on .close()
a started upload is aborted.

The final switch to empty is done by std::move()ing the upload id into
the completion request, but it's better to use std::exchange() to emphasize
the fact that the _upload_id becomes empty at that point for a reason.
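
A minimal sketch of the difference, using a hypothetical
`upload_sink_sketch` with a std::string id; `take_upload_id()` is an
illustrative helper, not a method of the real upload_sink.

```cpp
#include <string>
#include <utility>

struct upload_sink_sketch {
    std::string _upload_id;

    bool upload_started() const { return !_upload_id.empty(); }

    // std::exchange() returns the old id and explicitly leaves an empty one
    // behind, while a moved-from string is only in a valid but unspecified state.
    std::string take_upload_id() {
        return std::exchange(_upload_id, {});
    }
};
```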

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13570
2023-04-20 17:32:08 +03:00
Nadav Har'El
5b792dde68 Merge 'Extend aws_sigv4 code to suit S3 client needs' from Pavel Emelyanov
The AWS signature-generating code was moved from alternator some time ago as is. Now it's clear in which places it should be extended to work for the S3 client as well. The enhancements are

- Support UNSIGNED-PAYLOAD to omit calculating checksums for request body
- Include full URL path into the signature, not just hard-coded "/" string
- Don't check datestamp expiration if not asked for

This is a part of #13493

Closes #13535

* github.com:scylladb/scylladb:
  utils/aws: Brush up the aws_sigv4.hh header
  utils/aws: Export timepoint formatter
  utils/aws: Omit datestamp expiration checks when not needed
  utils/aws: Add canonical-uri argument
  utils/aws: Support unsigned-payload signatures
2023-04-18 16:33:52 +03:00
Avi Kivity
7724223134 Merge 'utils: big_decimal: optimize big_decimal::compare() and use <=> operator' from Kefu Chai
in this series, we use the <=> operator to replace `big_decimal::compare()` for better readability. also, we trade the chained ternary expression for a more verbose if-else statement for better performance and readability.

Closes #13478

* github.com:scylladb/scylladb:
  utils: big_decimal: replace compare() with <=> operator
  utils: big_decimal: optimize big_decimal::compare()
2023-04-17 14:33:53 +03:00
Pavel Emelyanov
d09d6adbf4 utils/aws: Brush up the aws_sigv4.hh header
Add lost pragma-once directive.

Remove the hashers.hh inclusion. It was carried in when the whole code
was detached from alternator (f5de0582c8), but this header is not needed
in the header, only in the .cc file which uses sha256_hasher.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:16:45 +03:00
Pavel Emelyanov
792490e095 utils/aws: Export timepoint formatter
The format of timestamp for AWS requests is defined in documentation,
there's already the code that prepares it in this form. This patch
exports this method so that S3 client could use it in next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:14:45 +03:00
Pavel Emelyanov
706b60a0b0 utils/aws: Omit datestamp expiration checks when not needed
The signing code is used in two ways -- by alternator to verify the
arrived signed request and by S3 client to prepare the signed request.
In the former case date expiration check is performed, but for the
latter this is not required, because date stamp is most likely now (or
close to it).

So this patch makes the orig_datestamp argument optional, meaning that
expiration checks can be omitted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:14:45 +03:00
Pavel Emelyanov
c5ccef078a utils/aws: Add canonical-uri argument
Current signing code hard-codes the "/" as the URL, likely this just
works for alternator. For S3 client the URL would include bucket and
object name and should thus become the argument, not constant.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:14:45 +03:00
Pavel Emelyanov
8eabe9c4ef utils/aws: Support unsigned-payload signatures
For S3 signing the whole request payload can be too resource consuming.
Fortunately, payload signing is only enforced if used with plain http,
but with real S3 we're going to use signed requests over https only (see
next patch why).

That said, the patch turns body-content into an optional reference (i.e. --
a pointer) so that the signing code could inject the UNSIGNED-PAYLOAD
mark instead of the payload signature and omit heavy payload signing.
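
A minimal sketch of the optional-reference idea. `payload_hash_field()`
and its `hex_digest` callback are hypothetical; the real code uses
sha256_hasher, which is replaced by a caller-supplied function here to
keep the example self-contained.

```cpp
#include <string>

// Returns the payload hash field used when building the signature: the literal
// UNSIGNED-PAYLOAD mark when no body is given, the body digest otherwise.
template <typename Hasher>
std::string payload_hash_field(const std::string* body, Hasher&& hex_digest) {
    if (body == nullptr) {
        return "UNSIGNED-PAYLOAD";  // skip hashing a potentially large payload
    }
    return hex_digest(*body);
}
```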

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:14:45 +03:00
Pavel Emelyanov
7c7a3416c5 s3/client: Add comments about multipart upload completion message
The message length is pre-calculated in advance to provide correct
content-length request header. This math is not obvious and deserves a
comment.

Also, the final message preparation code is also implicitly checking
if any part failed to upload. There's a comment in the upload_sink's
upload_part() method about it, but the finalization place deserves one
too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:08:34 +03:00
Pavel Emelyanov
3f86bed600 s3/client: Fix succeeded/failed part upload final checking
When all part uploads complete, the final message is prepared and sent
out to the server. The preparation code is also responsible for checking
that all parts uploaded OK by checking the part etag to be non-empty. In
that check a misprint crept in -- the whole list is checked to be empty,
not the individual etag itself.
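
A minimal sketch of the intended check, assuming the etags are kept in a
vector of strings; `all_parts_uploaded()` is a hypothetical helper, not
the actual upload_sink code.

```cpp
#include <string>
#include <vector>

// A part failed to upload if its etag is still empty.
inline bool all_parts_uploaded(const std::vector<std::string>& part_etags) {
    for (const auto& etag : part_etags) {
        // The misprint amounted to testing part_etags.empty() here, which is
        // never true inside the loop, so failed parts went unnoticed.
        if (etag.empty()) {
            return false;
        }
    }
    return true;
}
```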

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 11:08:15 +03:00
Pavel Emelyanov
79379760e6 s3/client: Fix parts to start from 1
The docs say that part numbers should start from 1, while the code follows
the tradition and starts from 0. Minio is conveniently incompatible in
this sense, so the test had been passing so far. On real S3, part number 0
ends up with a failed request.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-17 10:43:12 +03:00
Kefu Chai
6bb32efac0 utils: big_decimal: replace compare() with <=> operator
now that we are using C++20, it'd be more convenient if we can use
the <=> operator for comparing. the compiler synthesizes the other relational
operators for us if the <=> operator is defined. so the code is more
compact.

in this change, `big_decimal::compare()` is replaced with `operator<=>`,
and its caller is updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-15 12:52:30 +08:00
Kefu Chai
e991e6087e utils: big_decimal: optimize big_decimal::compare()
before this change, in the worst case, the underlying
`number::compare()` gets called twice, as it is used by Boost.Multiprecision
to implement the comparison operators of `number`. but since we can
have the result in one go, there is no need to perform the
comparison multiple times.

so, in this change, we just call `number::compare()` explicitly,
and use it to implement `compare()`. this should save a call of
`number::compare()`. also, the chained ternary expression is
replaced using if-else statement for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-15 12:52:30 +08:00
Pavel Emelyanov
b1501d4261 s3/client: Don't use designated initialization of sys stat struct
It makes the compiler complain about mis-ordered initialization of st_nlink
vs st_mode on different arches. Current code (st_nlink before st_mode)
compiled fine on x86, but fails on ARM which wants st_mode to come
before st_nlink. Changing the order would, apparently, break x86 build
with similar message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13499
2023-04-13 15:13:56 +03:00
Botond Dénes
0c51f72ad6 Merge 'utils, mutation: replace operator<<(..) with fmt formatter' from Kefu Chai
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `tombstone` and `shadowable_tombstone` without the help of fmt::ostream. their `operator<<(ostream,..)` are dropped, as there are no users of them anymore.

Refs #13245

Closes #13474

* github.com:scylladb/scylladb:
  mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone>
  utils: specialize fmt::formatter<optional<>>
2023-04-12 09:32:56 +03:00
Kefu Chai
ff202723c6 utils: big_decimal: specialize fmt::formatter<big_decimal>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `big_decimal` without the help of `operator<<`. this operator
is dropped in this change, as all its callers are now using fmtlib
for formatting. we might need to use fmtlib to implement `big_decimal::to_string()`,
and use `fmt::to_string()` instead, but let's leave it for a follow-up
change.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13479
2023-04-12 09:20:50 +03:00
Kefu Chai
c980bd54ad utils: specialize fmt::formatter<optional<>>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `optional<T>` without the help of `operator<<()`.

this change also enables us to ditch more `operator<<()`s in future.
as we are relying on `operator<<(ostream&, const optional<T>&)` for
printing instances of `optional<T>`, and `operator<<(ostream&, const optional<T>&)`
in turn uses the `operator<<(ostream&, const T&)`. so, the new
specialization of `fmt::formatter<optional<>>` will remove yet
another caller of these operators.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-12 10:57:03 +08:00
Kefu Chai
59579d5876 utils: fragment_range: specialize fmt::formatter<FragmentedView>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes that fulfill the requirements of the `FragmentedView`
concept without the help of the template function `to_hex()`. this function is
dropped in this change, as all its callers now use fmtlib
for formatting. the helper `fragment_to_hex()` is dropped
as well, as its only caller was `to_hex()`.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13471
2023-04-11 16:09:38 +03:00
Botond Dénes
05b381bfa2 Merge 'Simple S3 storage for sstables' from Pavel Emelyanov
The PR adds an sstables storage backend that keeps all component files as S3 objects, and a system.sstables_registry ownership table that keeps track of which sstable objects belong to the local node and what their names are.

When a keyspace is configured with 'STORAGE = { 'type': 'S3' }' the respective class table object eventually gets the storage_options instance pointing to the target S3 endpoint and bucket. All the sstables created for that table attach the S3 storage implementation that maintains components' files as S3 objects. Writing to and reading from components is handled by the S3 client facilities from utils/. Changing the sstable state, that is, moving between normal, staging and quarantine states, is not yet implemented, but would eventually happen by updating entries in the sstables registry.

To keep track of which node owns which objects, to provide bucket-wide uniqueness of object names and to maintain sstable state the storage driver keeps records in the system.sstables_registry ownership table. The table maps sstable location and generation to the object format, version, status-state (*) and (!) unique identifier (some time soon this identifier is supposed to be replaced with UUID sstables generations). The component object name is thus s3://bucket/uuid/component_basename. The registry is also used on boot. The distributed loader picks up sstables from all the tables found in schema and for S3-backed keyspaces it lists entries in the registry to a) identify those and b) get their unique S3-side identifiers to open by name.

(*) About sstable's status and state.

The state field is the part of today's sstable path on disk -- staging, quarantine, normal (root table data dir), etc. Since S3 doesn't have the renaming facility, moving sstable between those states is only possible by updating the entry in the registry. This is not yet implemented in this set (#13017)

The status field tracks the sstable's transition through its creation-deletion lifecycle. It starts with the 'creating' status, which corresponds to today's TemporaryTOC file. After being created and written to, the sstable moves into the 'sealed' status, which corresponds to today's normal sstable with the TOC file. To delete an sstable atomically, it first moves into the 'removing' status, which is equivalent to being in the deletion log for the on-disk sstable. Once removed from the bucket, the entry is removed from the registry.
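
The status lifecycle above can be sketched as a tiny state machine. This is only an illustration: the enum and both function names are hypothetical, and the registry's real column types are not shown in this description.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names, for illustration only.
enum class sstable_status { creating, sealed, removing };

// Returns the next status in the creation-deletion lifecycle, or nullopt
// when the entry is in 'removing' and is to be dropped from the registry.
constexpr std::optional<sstable_status> next_status(sstable_status s) {
    switch (s) {
    case sstable_status::creating: return sstable_status::sealed;
    case sstable_status::sealed:   return sstable_status::removing;
    case sstable_status::removing: return std::nullopt;
    }
    return std::nullopt;
}

// The on-disk counterpart of each status, as described above.
constexpr std::string_view on_disk_equivalent(sstable_status s) {
    switch (s) {
    case sstable_status::creating: return "TemporaryTOC file present";
    case sstable_status::sealed:   return "TOC file present";
    case sstable_status::removing: return "listed in the deletion log";
    }
    return "unknown";
}
```

The point of the one-way chain is that a crash at any step leaves the registry entry in a status that tells the next boot whether the sstable is usable or still has to be cleaned up.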

To play with:

1. Start minio (installed by install-dependencies.sh)
```
export MINIO_ROOT_USER=${root_user}
export MINIO_ROOT_PASSWORD=${root_pass}
mkdir -p ${root_directory}
minio server ${root_directory}
```

2. Configure minio CLI, create anonymous bucket
```
mc config host rm local
mc config host add local http://127.0.0.1:9000 ${root_user} ${root_pass}
mc mb local/sstables
mc anonymous set public local/sstables
```

3. Start Scylla with object-storage feature enabled
``` scylla ... --experimental-features=keyspace-storage-options --workdir ${as_usual}```

4. Create KS with S3 storage
``` create keyspace ... storage = { 'type': 'S3', 'endpoint': '127.0.0.1:9000', 'bucket': 'sstables' };```

The S3 client has a logger named "s3"; it's useful to turn it on with `trace` verbosity.

Closes #12523

* github.com:scylladb/scylladb:
  test: Add object-storage test
  distributed_loader: Print storage type when populating
  sstable_directory: Add ownership table components lister
  sstable_directory: Make components_lister and API
  sstable_directory: Create components lister based on storage options
  sstables: Add S3 storage implementation
  system_keyspace: Add ownership table
  system_keyspace: Plug to user sstables manager too
  sstable: Make storage instance based on storage options
  sstable_directory: Keep storage_options aboard
  sstable: Virtualize the helper that gets on-disk stats for sstable
  sstable, storage: Virtualize data sink making for small components
  sstable, storage: Virtualize data sink making for Data and Index
  sstable/writer: Shuffle writer::init_file_writers()
  sstable: Make storage an API
  utils: Add S3 readable file impl for random reads
  utils: Add S3 data sink for multipart upload
  utils: Add S3 client with basic ops
  cql-pytest: Add option to run scylla over stable directory
  test.py: Equip it with minio server
  sstables: Detach write_toc() helper
2023-04-11 08:17:25 +03:00
Pavel Emelyanov
033fa107f8 utils: Add S3 readable file impl for random reads
Sometimes an sstable is used for random read, sometimes -- for streamed
read using the input stream. For both cases the file API can be
provided, because S3 API allows random reads of arbitrary lengths.
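
A random read of this kind maps naturally onto an HTTP range request, which S3 object reads accept. A minimal sketch of building the header value, assuming the read is described by an offset and a non-zero length; the helper name is hypothetical and this is not necessarily how the client in this patch does it:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: builds the value of an HTTP Range header for reading
// `len` bytes starting at byte `offset`. Both ends of a byte range are
// inclusive, so the last requested byte is offset + len - 1.
// The caller must pass len > 0.
std::string range_header_value(uint64_t offset, uint64_t len) {
    return "bytes=" + std::to_string(offset) + "-" + std::to_string(offset + len - 1);
}
```

For example, reading 4096 bytes from the start of an object would ask for `bytes=0-4095`.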

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Pavel Emelyanov
a4a64149a6 utils: Add S3 data sink for multipart upload
Putting a large object into S3 using plain PUT is a bad choice -- one needs
to collect the whole object in memory, then send it as a content-length
request with plain body. Multipart upload puts less stress on memory,
but it has its limitation -- each part should be
at least 5Mb in size. For that reason using the file API doesn't work --
the file IO API operates with external memory buffers and the file impl
would only have raw pointers to them. In order to collect a 5Mb chunk in
RAM the impl would have to copy the memory, which is not good. Unlike the
file API, the data_sink API is more flexible, as it has temporary buffers at
hand and can cache them in a zero-copy manner.

Having said that, the S3 data_sink implementation is like this:

* put(buffer):
  move the buffer into local cache, once the local cache grows above 5Mb
  send out the part

* flush:
  send out whatever is in cache, then send upload completion request

* close:
  check that the upload finished (in flush), abort the upload otherwise

User of the API may (actually should) wrap the sink with output_stream
and use it as any other output_stream.
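
The put/flush/close scheme above can be sketched with a synchronous stand-in. Everything here is hypothetical: the class name, the callback that plays the role of sending a part, and the use of std::string as the buffer type. The real sink is asynchronous and talks to S3.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, synchronous stand-in for the sink described above.
class buffering_sink {
public:
    using buffer = std::string;
    // Called with the buffers that make up one part. Buffers are moved
    // along, their payload is never copied.
    using send_part_fn = std::function<void(std::vector<buffer>)>;

    buffering_sink(std::size_t min_part_size, send_part_fn send_part)
        : _min_part_size(min_part_size), _send_part(std::move(send_part)) {}

    // put: take ownership of the buffer, send a part once enough is cached.
    void put(buffer buf) {
        _cached_bytes += buf.size();
        _cache.push_back(std::move(buf));
        if (_cached_bytes >= _min_part_size) {
            send_cached();
        }
    }

    // flush: send whatever is left, then mark the upload as completed
    // (this stands in for the upload completion request).
    void flush() {
        if (!_cache.empty()) {
            send_cached();
        }
        _completed = true;
    }

    // close: returns true if the upload finished in flush(). Returning
    // false after dropping the cache stands in for aborting the upload.
    bool close() {
        if (!_completed) {
            _cache.clear();
            _cached_bytes = 0;
        }
        return _completed;
    }

private:
    void send_cached() {
        _send_part(std::exchange(_cache, {}));
        _cached_bytes = 0;
    }

    std::size_t _min_part_size;
    send_part_fn _send_part;
    std::vector<buffer> _cache;
    std::size_t _cached_bytes = 0;
    bool _completed = false;
};
```

With a 5Mb threshold, every part except possibly the last one sent from flush() meets the minimum part size, and the sink never holds much more than one part in memory.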

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00