The PR adds sstables storage backend that keeps all component files as S3 objects and system.sstables_registry ownership table that keeps track of what sstables objects belong to local node and their names.
When a keyspace is configured with 'STORAGE = { 'type': 'S3' }' the respective class table object eventually gets the storage_options instance pointing to the target S3 endpoint and bucket. All the sstables created for that table attach the S3 storage implementation that maintains components' files as S3 objects. Writing to and reading from components is handled by the S3 client facilities from utils/. Changing the sstable state, which is -- moving between normal, staging and quarantine states -- is not yet implemented, but would eventually happen by updating entries in the sstables registry.
To keep track of which node owns which objects, to provide bucket-wide uniqueness of object names and to maintain sstable state the storage driver keeps records in the system.sstables_registry ownership table. The table maps sstable location and generation to the object format, version, status-state (*) and (!) unique identifier (some time soon this identifier is supposed to be replaced with UUID sstables generations). The component object name is thus s3://bucket/uuid/component_basename. The registry is also used on boot. The distributed loader picks up sstables from all the tables found in schema and for S3-backed keyspaces it lists entries in the registry to a) identify those and b) get their unique S3-side identifiers to open by name.
(*) About sstable's status and state.
The state field is the part of today's sstable path on disk -- staging, quarantine, normal (root table data dir), etc. Since S3 doesn't have the renaming facility, moving sstable between those states is only possible by updating the entry in the registry. This is not yet implemented in this set (#13017)
The status field tracks sstable' transition through its creation-deletion. It first starts with 'creating' status which corresponds to the today's TemporaryTOC file. After being created and written to the sstable moves into 'sealed' state which corresponds to the today's normal sstable being with the TOC file. To delete sstable atomically it first moves into 'removing' state which is equivalent to being in the deletion-log for the on-disk sstable. Once removed from the bucket, the entry is removed from the registry.
To play with:
1. Start minio (installed by install-dependencies.sh)
```
export MINIO_ROOT_USER=${root_user}
export MINIO_ROOT_PASSWORD=${root_pass}
mkdir -p ${root_directory}
minio server ${root_directory}
```
2. Configure minio CLI, create anonymous bucket
```
mc config host rm local
mc config host add local http://127.0.0.1:9000 ${root_user} ${root_pass}
mc mb local/sstables
mc anonymous set public local/sstables
```
3. Start Scylla with object-storage feature enabled
``` scylla ... --experimental-features=keyspace-storage-options --workdir ${as_usual}```
4. Create KS with S3 storage
``` create keyspace ... storage = { 'type': 'S3', 'endpoint': '127.0.0.1:9000', 'bucket': 'sstables' };```
The S3 client has a logger named "s3", it's useful to use on with `trace` verbosity.
Closes#12523
* github.com:scylladb/scylladb:
test: Add object-storage test
distributed_loader: Print storage type when populating
sstable_directory: Add ownership table components lister
sstable_directory: Make components_lister and API
sstable_directory: Create components lister based on storage options
sstables: Add S3 storage implementation
system_keyspace: Add ownership table
system_keyspace: Plug to user sstables manager too
sstable: Make storage instance based on storage options
sstable_directory: Keep storage_options aboard
sstable: Virtualize the helper that gets on-disk stats for sstable
sstable, storage: Virtualize data sink making for small components
sstable, storage: Virtualize data sink making for Data and Index
sstable/writer: Shuffle writer::init_file_writers()
sstable: Make storage an API
utils: Add S3 readable file impl for random reads
utils: Add S3 data sink for multipart upload
utils: Add S3 client with basic ops
cql-pytest: Add option to run scylla over stable directory
test.py: Equip it with minio server
sstables: Detach write_toc() helper
Sometimes an sstable is used for random read, sometimes -- for streamed
read using the input stream. For both cases the file API can be
provided, because S3 API allows random reads of arbitrary lengths.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Putting a large object into S3 using plain PUT is bad choice -- one need
to collect the whole object in memory, then send it as a content-length
request with plain body. Less memory stress is by using multipart
upload, but multipart upload has its limitation -- each part should be
at least 5Mb in size. For that reason using file API doesn't work --
file IO API operates with external memory buffers and the file impl
would only have raw pointers to it. In order to collect 5Mb of chunk in
RAM the impl would have to copy the memory which is not good. Unlike the
file API data_sink API is more flexible, as it has temporary buffers at
hand and can cache them in zero-copy manner.
Having sad that, the S3 data_sink implementation is like this:
* put(buffer):
move the buffer into local cache, once the local cache grows above 5Mb
send out the part
* flush:
send out whatever is in cache, then send upload completion request
* close:
check that the upload finihsed (in flush), abort the upload otherwise
User of the API may (actually should) wrap the sink with output_stream
and use it as any other output_stream.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those include -- HEAD to get size, PUT to upload object in one go, GET
to read the object as contigious buffer and DELETE to drop one.
The client uses http client from seastar and just implements the S3
protocol using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
S3 client cannot perform anonymous multipart uploads into any real S3
buckets regardless of their configuration. Since multipart upload is
essential part of the sstables backend, we need to implement the
authorisation support for the client early.
(side note): with minio anonymous multipart upload works, with aws s3
anonymous PUT and DELETE can be configured, it's exactly the combination
of aws + multipart upload that does need authorization.
Fortunately, the signature generation and signature checking code is
symmetrical and we have the checking option already in alternator :) So
what this patch does is just moves the alternator::get_signature()
helper into utils/. A sad side effect of that is all tests now need to
link with gnutls :( that is used to compute the hash value itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13428
Currently, aggregate functions are implemented in a statefull manner.
The accumulator is stored internally in an aggregate_function::aggregate,
requiring each query to instantiate new instances (see
aggregate_function_selector's constructor, and note how it's called
from selector::new_instance()).
This makes aggregates hard to use in expressions, since expressions
are stateless (with state only provided to evaluate()). To facilitate
migration towards stateless expressions, we define a
stateless_aggregate_function (modeled after user-defined aggregates,
which are already stateless). This new struct defines the aggregate
in terms of three scalar functions: one to aggregate a new input into
an accumulator (provided in the first parameter), one to finalize an
accumulator into a result, and one to reduce two accumulators for
parallelized aggregation.
All existing native aggregate functions are converted to the new model, and
the old interface is removed. This series does not yet convert selectors to
expressions, but it does remove one of the obstacles.
Performance evaluation: I created a table with a million ints on a single-node cluster, and ran the avg() function on them. I measured the number of instructions executed with `perf stat -p $(pgrep scylla) -e instructions` while the query was running. The query executed from cache, memtables were flushed beforehand. The instruction count per row increased from roughly 49k to roughly 52k, indicating 3k extra instructions per row. While 3k instructions to execute a function is huge, it is currently dwarfed by other overhead (and will be even less important in a cluster where it CL>1 will cause non-coordinator code to run multiple times).
Closes#13105
* github.com:scylladb/scylladb:
cql3/selection, forward_service: use use stateless_aggregate_function directly
db: functions: fold stateless_aggregate_function_adapter into aggregate_function
cql3: functions: simplify accumulator_for template
cql3: functions: base user-defined aggregates on stateless aggregates
cql3: functions: drop native_aggregate_function
cql3: functions: reimplement count(column) statelessly
cql3: functions: reimplement avg() statelessly
cql3: functions: reimplement sum() statelessly
cql3: functions: change wide accumulator type to varint
cql3: functions: unreverse types for min/max
cql3: functions: rename make_{min,max}_dynamic_function
cql3: functions: reimplement min/max statelessly
cql3: functions: reimplement count(*) statelessly
cql3: functions: simplify creating native functions even more
cql3: functions: add helpers for automating marshalling for scalar functions
types: fix big_decimal constructor from literal 0
cql3: functions: add helper class for internal scalar functions
db: functions: add stateless aggregate functions
db, cql3: move scalar_function from cql3/functions to db/functions
this series applies some random cleanups to bloom_filter. these cleanups were the side products when the author was working on #13314 .
Closes#13315
* github.com:scylladb/scylladb:
bloom_filter: mark internal help function static
bloom_filter: add more constness to false positive rate tables
bloom_filter: use vector::back() when appropriate
no need to use `size - 1` for accessing the last element in a vector,
let's just use `vector::back()` for more compacted code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
UUID_test uses lexicograhical_compare from the types module. This
is a layering violation, since UUIDs are at a much lower level than
the database type system. In practical terms, this cause link failures
with gcc due to some thread-local-storage variables defined in types.hh
but not provided by any object, since we don't link with types.o in this
test.
Fix by extracting the relevant functions into a new header.
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print UUID without using ostream<<. also, this change re-implements some formatting helpers using fmtlib for better performance and less dependencies on operator<<(), but we cannot drop it at this moment, as quite a few caller sites are still using operator<<(ostream&, const UUID&) and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add `fmt::formatter<UUID>`
* add `fmt::formatter<tagged_uuid<T>>`
* implement `UUID::to_string()` using `fmt::to_string()`
* implement `operator<<(std::ostream&, const UUID&)` with `fmt::print()`, this should help to improve the performance when printing uuid, as `fmt::print()` does not materialize a string when printing the uuid.
* treewide: use fmtlib when printing UUID
Refs #13245Closes#13246
* github.com:scylladb/scylladb:
treewide: use fmtlib when printing UUID
utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print UUID without using ostream<<. also, this change reimplements
some formatting helpers using fmtlib for better performance and less
dependencies on operator<<(), but we cannot drop it at this moment,
as quite a few caller sites are still using operator<<(ostream&, const UUID&)
and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add fmt::formatter<UUID>
* add fmt::formatter<tagged_uuid<T>>
* implement UUID::to_string() using fmt::to_string()
* implement operator<<(std::ostream&, const UUID&) with fmt::print(),
this should help to improve the performance when printing uuid, as
fmt::print() does not materialize a string when printing the uuid.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
To make it possible to move the class member away resetting to be be
empty at the same time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13208
`join` can easily be confused with boost::algorithm::join
so make it more visible that we're using scylla's
utils implementation.
Also, move `struct print_with_comma` to utils::internal.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, big_decimal(0) will select the big_decimal(string_view)
constructor (via 0 -> const char* -> string_view conversions).
0 is important for initializing aggregates, so fix it ahead of
using it.
these dependencies were found when trying to compile
`user_function_test`. whenever a library libfoo references another one,
say, libbar, the corresponding linkage from libfoo to libbar is added.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so to reduce the number of conditional banches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoid sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
* throw marshal_exception if not the whole string is parsed, we
should error out if the parsed string contains gabage at the end.
before this change, we silent accept uuid like
"ce84997b-6ea2-4468-9f02-8a65abf4wxyz", and parses it as
"ce84997b-6ea2-4468-9f02-8a65abf4". this is not correct.
* throw marshal_exception if stoull() throws,
`stoull()` throws if it fails to parse a string to an unsigned long
long, we should translate the exception to `marshal_exception`, so
we can handle these exception in a consistent manner.
test is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13069
despite that RapidJSON is a header-only library, we still need to
find it and "link" against it for adding the include directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we include cryptopp/ headers, we need find it and link against
it explicitly, instead of relying on seastar to do this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- treewide: do not define/capture unused variables
- sstables/sstables: mark dummy variable for loop [[maybe_unused]]
- util/result_try: reference this explicitly
- raft: reference this explicitly
- idl-compiler: mark captured this used
- build: reenable unused-{variable,lambda-capture} warnings
Closes#12915
* github.com:scylladb/scylladb:
build: reenable unused-{variable,lambda-capture} warnings
test: reader_concurrency_semaphore_test: define target_memory in debug mode
api::failure_detector: mark set_phi_convict_threshold unimplemented
test: memtable_test: mark dummy variable for loop [[maybe_unused]]
idl-compiler: mark captured this used
raft: reference this explicitly
util/result_try: reference this explicitly
sstables/sstables: mark dummy variable for loop [[maybe_unused]]
treewide: do not define/capture unused variables
service: storage_service: clear _node_ops in batch
small_vector should be feature-wise compatible with std::vector<>,
let's add operator<=> for it.
also, there is not needd to define operator!=() explicitly, C++20
define this for us if operator==() is defined, so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13032
quote from Avi's comment
> It's supposed to be illegal to call handle(...) without this->,
> because handle() is a dependent name (but many compilers don't
> insist, gcc is stricter here). So two error messages competed,
> and "unused this capture" won.
without this change, Clang complains that `this` is not used with
`-Wunused-lambda-capture`.
in this change, `this` is used. in this change, `this` is explicitly
referenced to silence Clang's warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as, in C++20, compiler is able to generate the operator==() for us,
and the default generated one is identical to what we have now.
also, in C++20, operator!=() is generated by compiler if operator==()
is defined, so we can dispense with the former.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of the family of comparison operators, just define <=>. as
in C++20, compiler will define all six comparison operators for us.
in this change, the operator<=> is defined, so we can more compacted
code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, `seastar_memory_segment_store_backend`
is class with virtual method, but it does not have a virtual
dtor. but we do use a unique_ptr<segment_store_backend> to
manage the lifecycle of an intance of its derived class.
to enable the compiler to call the right dtor, we should
mark the base class's dtor as virtual. this should address
following warings from Clang-17:
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on non-final 'logalloc::seastar_memory_segment_store_backend' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<logalloc::seastar_memory_segment_store_backend>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/utils/logalloc.cc:812:20: note: in instantiation of member function 'std::unique_ptr<logalloc::seastar_memory_segment_store_backend>::~unique_ptr' requested here
: _backend(std::make_unique<seastar_memory_segment_store_backend>())
^
```
and
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on 'logalloc::segment_store_backend' that is abstract but has non-virtual destructor [-Werror,-Wdelete-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<logalloc::segment_store_backend>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/utils/logalloc.cc:811:5: note: in instantiation of member function 'std::unique_ptr<logalloc::segment_store_backend>::~unique_ptr' requested here
contiguous_memory_segment_store()
^
```
Fixes#12872
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12873
- date: drop implicitly generated ctor
- date: use std::in_range() to check for invalid year
Closes#12878
* github.com:scylladb/scylladb:
date: use std::in_range() to check for invalid year
date: drop implicitly generated ctor
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
for better readability, and to silence following warning
from Clang 17:
```
/home/kefu/dev/scylladb/utils/date.h:5965:25: error: result of comparison of constant 9223372036854775807 with expression of type 'int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
Y <= static_cast<int64_t>(year::max())))
~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/utils/date.h:5964:57: error: result of comparison of constant -9223372036854775808 with expression of type 'int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
if (!(static_cast<int64_t>(year::min()) <= Y &&
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as one of its member variable does not have default constructor.
this silences following warning from Clang-17:
```
/home/kefu/dev/scylladb/utils/date.h:708:5: error: explicitly defaulted default constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
year_month_weekday() = default;
^
/home/kefu/dev/scylladb/utils/date.h:705:27: note: default constructor of 'year_month_weekday' is implicitly deleted because field 'wdi_' has no default constructor
date::weekday_indexed wdi_;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as one of the (indirected) member variables has a user-declared move
ctor, this prevents the compiler from generating the default copy ctor
or assignment operator for the classes containing `timer`.
```
/home/kefu/dev/scylladb/utils/histogram.hh:440:5: warning: explicitly defaulted copy constructor is implicitly deleted [-Wdefaulted-function-deleted]
timed_rate_moving_average_and_histogram(const timed_rate_moving_average_and_histogram&) = default;
^
/home/kefu/dev/scylladb/utils/histogram.hh:437:31: note: copy constructor of 'timed_rate_moving_average_and_histogram' is implicitly deleted because field 'met' has a deleted copy constructor
timed_rate_moving_average met;
^
/home/kefu/dev/scylladb/utils/histogram.hh:298:17: note: copy constructor of 'timed_rate_moving_average' is implicitly deleted because field '_timer' has a deleted copy constructor
meter_timer _timer;
^
/home/kefu/dev/scylladb/utils/histogram.hh:212:13: note: copy constructor of 'meter_timer' is implicitly deleted because field '_timer' has a deleted copy constructor
timer<> _timer;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/timer.hh:111:5: note: copy constructor is implicitly deleted because 'timer<>' has a user-declared move constructor
timer(timer&& t) noexcept : _sg(t._sg), _callback(std::move(t._callback)), _expiry(std::move(t._expiry)), _period(std::move(t._period)),
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fixes: https://github.com/scylladb/scylladb/issues/12309Closes#12720
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: use recent_entries_map to store rate_limits in pinger. Fixes#12309
utils: introduce recent_entries_map datatype to track least recent visited entries.
Will be used by MVCC tests which don't want (can't) deal with the
row_cache as the container but work with the partition_entry directly.
Currently, rows_entry::on_evicted() assumes that it's embedded in
row_cache and would segfault when trying to evict the contining
partition entry which is not embedded in row_cache. The solution is to
call evict_shallow() from mvcc_tests, which does not attempt to evict
the containing partition_entry.