This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:
After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.
A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.
All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.
This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.
Closesscylladb/scylladb#22032
* github.com:scylladb/scylladb:
messaging_service: use advanced_rpc_compression::tracker for compression
message/dictionary_service: introduce dictionary_service
service: make Raft group 0 aware of system.dicts
db/system_keyspace: add system.dicts
utils: add advanced_rpc_compressor
utils: add dict_trainer
utils: introduce reservoir_sampling
utils: introduce alien_worker
utils: add stream_compressor
So we can use it for defining other small_vector
deriving their internal capacity from another small_vector
type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.
The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Expose metrics about the effectiveness of the compression.
3. Allow dynamically switching algorithms and dictionaries on a running
connection, without any extra waits.
The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.
The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.
In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.
Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
We are planning to improve some usages of compression in Scylla
(in which we compress small blocks of data) by pre-training
compression dictionaries on similar data seen so far.
For example, many RPC messages have similar structure
(and likely similar data), so the similarity could be exploited
for better compression. This can be achieved e.g. by training
a dictionary on the RPC traffic, and compressing subsequent
RPC messages against that dictionary.
To work well, the training should be fed a representative sample
of the compressible data. Such a sample can be approached by
taking a random subset (of some given reasonable size) of the data,
with uniform probability.
For our purposes, we need an online algorithm for this -- one
which can select the random k-subset from a stream of arbitrary
size (e.g. all RPC traffic over an hour), while requiring only
the necessary minimum of memory.
This is a known problem, called "reservoir sampling".
This PR introduces `reservoir_sampler`, which implements
an optimal algorithm for reservoir sampling.
Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`,
which uses it to select a random sample of pages from a stream of bytes.
Introduces a util which launches a new OS thread and accepts
callables for concurrent execution.
Meant to be created once at startup and used until shutdown,
for running nonpreemptible, 3rd party, non-interactive code.
Note: this new utility is almost identical to wasm::alien_thread_runner.
Maybe we should unify them.
Adds utilities for "advanced" methods of compression with lz4
and zstd -- with streaming (a history buffer persisted across messages)
and/or precomputed dictionaries.
This patch is mostly just glue needed to use the underlying
libraries with discontiguous input and output buffers, and for reusing the
same compressor context objects across messages. It doesn't contain
any innovations of its own.
There is one "design decision" in the patch. The block format of LZ4
doesn't contain the length of the compressed blocks. At decompression
time, that length must be delivered to the decompressor by a channel
separate to the compressed block itself. In `lz4_cstream`, we deal
with that by prepending a variable-length integer containing the
compressed size to each compressed block. This is suboptimal for
single-fragment messages, since the user of lz4_cstream is likely
going to remember the length of the whole message anyway,
which makes the length prepended to the block redundant.
But a loss of 1 byte is probably acceptable for most uses.
Fixes#20717
Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()".
Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader.
There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored.
Closesscylladb/scylladb#21567
* github.com:scylladb/scylladb:
test_backup: Add restore abort test case
sstables_loader: Make restore task abortable
distributed_loader: Add optional abort_source to get_sstables_from_object_store
s3_storage: Add optional abort_source to params/object
s3::client: Make "readable_file" abortable
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.
in this change, we drop the support fmt < 10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21847
Currently, the tri-compare operator for big_decimal (operator <=>), uses
a precise but potentially very expensive algorithm for comparing the
numbers: it first brings them to the same scale, then compares the
normalized unscaled values. big_decimal has abritrary precisions,
therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in huge amount of
memory allocated and stalls. If this type is used int he primary key of
a table, these comparisons can make the node completely unresponsive.
This patch adds the following fast-paths to operator <=>:
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where the there is a big
difference between the two numbers. This algorithm works only with the
scales and is able to compare the two numbers by using only one division
and some additions and substractions. This algorithm is imprecise and
when the numbers are closer than its confidence window, it will
fall-back to the current slow but precise tri-compare.
All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers, which would
previously make the node unresponsive, now compare in constant-time.
Fixes: scylladb/scylladb#21716Closesscylladb/scylladb#21715
Adds optional abortable source to "readable_file" interface.
Note: the abortable aspect is not preserved across a "dup()" call
however, since these objects are generally not used in a cross-shard
fashion, it should be ok.
Schema of system tables is defined statically and table_schema_version needs to be explicitly set in code like this:
```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```
Whenever schema is changed, the schema version needs to change, otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.
It's not obvious that one needs to do that and developers often forget to do that. There were several instances of mistakes of omission, some caught during review, some not, e.g.: 31ea74b96e.
This patch changes definitions to call the new `schema_builder::with_hash_version()`, which will make the schema builder compute version from schema definition so that changes of the schema will automatically change the version. This way we no longer rely on the developer to remember to bump the version offset.
All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.
Closesscylladb/scylladb#21602
* github.com:scylladb/scylladb:
system_tables: Compute schema version automatically
schema_builder: Introduce with_hash_version()
schema: Store raw_view_info in schema::raw_schema
schema: Remove dead comment
hashing: Add hasher for unordered_map
hashing: Add hasher for unique_ptr
hashing: Add hasher for double
[avi: add missing include <memory> to hashing.hh]
Our "sstring_view" is an historic alias for the standard std::string_view.
The patch changes the last remaining random uses of this old alias across
our source directory to the standard type name.
After this patch, there are no more uses of the "sstring_view" alias.
It will be removed in the following patch.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The later includes the former and in addition to `seastar::format()`,
`print.hh` also provides helpers like `seastar::fprint()` and
`seastar::print()`, which are deprecated and not used by scylladb.
Previously, we include `seastar/core/print.hh` for using
`seastar::format()`. and in seastar 5b04939e, we extracted
`seastar::format()` into `seastar/core/format.hh`. this allows us
to include a much smaller header.
In this change, we just include `seastar/core/format.hh` in place of
`seastar/core/print.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21574
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):
forward_jump_clocks(std::chrono::seconds(60*60*24*31));
The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.
When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:
ttl_opt topology_request_tracking_mutation_builder::ttl() const {
return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
- std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
}
_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.
The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().
Closesscylladb/scylladb#21558
1. Add `retry_strategy` interface and default implementation for exponential back-off retry strategy.
2. Add new S3 related errors, also introduce additional errors to describe pure http errors that has no additional information in the body.
3. Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html. Introduce `aws_exception` to carry the original `aws_error`.
4. Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy.
5. Add tests to cover retries, and retry exhaustion. Also add tests for jumbo upload.
6. Add the S3 proxy which is used to randomly inject retryable S3 errors to test the "retry" part of the S3 client. Switch the `s3_test` to use the S3 proxy. `s3_tests` set afloat `put_object` problem that was causing segmentation when retrying, fixed.
7. Extend the `s3_test` to use both `minio` and `proxy` configurations.
8. Add parameter to the proxy to seed the error injection randomization to make it replayable.
fixes: #20611fixes: #20613Closesscylladb/scylladb#21054
* github.com:scylladb/scylladb:
aws_errors: Make error messages more verbose.
test: Make the minio proxy randomization re-playable
test/boost/s3_test: add error injection scenarios to existing test suite
test: Switch `s3_test` to use proxy
test: Add more tests
client: Stop returning error on `DELETE` in multipart upload abortion
client: Fix sigsegv when retrying
client: Add retries
client: Adjust `map_s3_client_exception` to return exception instance
aws_errors: Change aws_error::parse to return std::optional<>
aws_errors: Add http errors mapping into aws_error
client: Add aws_exception mapping
aws_error: Add `aws_exeption` to carry original `aws_error`
aws_errors: Add new error codes
client: Introduce retry strategy
in seastar e96932b05f394b27cd0101e24f0584736795b50f, we stopped
including unused `fmt/ostream.h`. this helped to reduce the header
dependency. but this also broke the build of scylladb, as we rely
on the `fmt/ostream.h` indirectly included by seastar's header project.
in this change, we include `fmt/iostream.h` and `iostream` explictly
when we are using the declarations in them. this enables us to
- bump up the seastar submodule
- potentially reduce the header dependency as we will be able to
include seastar/core/format.hh instead of a more bloated
seastar/core/print.hh after bumping up seastar submodule
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21494
Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy
Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html.
Also move the expected http status check to the `make_s3_error_handler` since the http::client::make_request call is done with `nullopt` - we want to manage all the aws errors handling in s3 client to prevent the http client to validate it and fail before we have a chance to analyze the error properly
"Unfuturize" the `map_s3_client_exception` since the retryable client is going to be implemented using coroutines and no `future` is needed here, just to save unnecessary `co_await` on it
Sample the clock once to avoid the filter returning different results.
Range algorithms may use multiple passes, so it's better to return
consistent results.
Closesscylladb/scylladb#21400
Fixes#20716
Adds optional abort_source to all s3 client operations. If provided, will propagate to actual HTTP client and allow for aborting actual net op.
Note: this uses an abort source per call, not a client-local one.
This is for two reasons:
1.) The usage pattern of the client object is to create it outside the eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for some operation usage.
Also adds forward usage of task abort_source in backup tasks upload s3 call, making it more readily abort-able.
Closesscylladb/scylladb#21431
* github.com:scylladb/scylladb:
backup_task: Use task abort source in s3 client call
s3::client: Make operations (individually) abortable
to_string.hh does not use this header, neither is it obliged to
expose the content of this header. so, let's remove this include.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21440
Refs #20716
Adds optional abort_source to all s3 client operations. If provided, will
propagate to actual HTTP client and allow for aborting actual net op.
Note: this uses an abort source per call, not a client-local one.
This is for two reasons:
1.) The usage pattern of the client object is to create it outside the
eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for
some operation usage.
The overload was introduced by a8b14b0227 (utils: add timeout error
injection with lambda), but is only used by the test nowadays.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21377
Optimize the various constructors a little, and add an std::from_range_t
constructor.
Minor improvement, so no backports.
Closesscylladb/scylladb#21399
* github.com:scylladb/scylladb:
utils: chunked_vector: add from_range_t constructor
utils: chunked_vector: optimize initializer_list constructor
utils: chunked_vector: iterator constructor: copy spanwise
utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
Task manager GET /status method returns two counters that reflect task progress -- total and completed. To make caller reason about their meaning, additionally there's progress_units field next to those counters.
This patch implements this progress report for backup task. The units are bytes, the total counter is total size of files that are being uploaded, and the completed counter is total amount of bytes successfully sent with PUT requests. To get the counters, the client::upload_file() is extended to calculate those.
fixes#20653Closesscylladb/scylladb#21144
* github.com:scylladb/scylladb:
backup_task: Report uploading progress
s3/client: Account upload progress for real
s3/client: Introduce upload_progress
s3: Extract client_fwd.hh
std::ranges::to<> has a little protocol with containers. Implement it
to get optimized construction.
Similar to the iterator pair constructor, if the range's size can be
obtained (even with an O(N) algorithm), favor that to avoid reallocations.
Copy elements spanwise to promote optimization to memcpy when possible.
Instead of copying element-by-element, copy contiguous spans. This
is much faster if the input is a span and the constructor is trivial,
since the whole thing translates to a memcpy.
Make the two branches constexpr to reduce work for the compiler in
optimizing the other branch away.
For a forward iterator, prefer a two pass algorithm to first count
the number of elements, reserver, then copy the elements, to a single
pass algorithm that involves reallocation and copying.
The wrapper object denotes that injection should run a handler and
wait_for_message() on it. Wrapper carries the timeout used to call the
mentioned method. It's currently unused, next patches will start enjoing
it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).
Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.
Fixes#21325Closesscylladb/scylladb#21323
* github.com:scylladb/scylladb:
utils: cached_file: Mark permit as awaiting on page miss
utils: cached_file: Push resource_unit management down to cached_file
Before upload starts file size is checked, so this is the place that
updates progress.total counter.
Uploading a file happens by reading unit_size bytes from file input
stream and writing the buffer into http body writer stream. This is the
place to update progress.uploaded counter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a structure with "total" and "uploaded" counters that's passed
by user to client::upload_file() method so that client would update it
with the progress.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to export some simple structures to users without the need to
include client.hh itself (rather large already)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of inject() overloads check if the injection is enabled, then
optionally clear the one-shot one, then do the injection. Everything
but doing the injection is implemented in the enter() method, it's
perfectly worth re-using one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21285