read_mutation_from_flat_mutation_reader might throw
so we need to close the reader returned from
ms.make_fragment_v1_stream also on the error
path to avoid the internal error abort when the
reader is destroyed while opened.
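The close-on-error-path pattern described above can be sketched as follows. This is a simplified, synchronous illustration only: `toy_reader`, `read_one` and `read_and_close` are hypothetical names, and the real code works with futures rather than plain try/catch.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a reader that must be closed before destruction.
struct toy_reader {
    int* close_count;  // lets the caller observe how often close() ran
    bool closed = false;
    explicit toy_reader(int* counter) : close_count(counter) {}
    void close() {
        closed = true;
        ++*close_count;
    }
};

// Hypothetical read step that may throw, like the read in the message above.
std::string read_one(toy_reader&, bool fail) {
    if (fail) {
        throw std::runtime_error("read failed");
    }
    return "mutation";
}

// Close the reader on both the success path and the error path.
std::string read_and_close(bool fail, int* close_count) {
    toy_reader reader(close_count);
    try {
        std::string m = read_one(reader, fail);
        reader.close();
        return m;
    } catch (...) {
        reader.close();  // without this, the reader is destroyed while open
        throw;
    }
}
```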
Fixes #14098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14099
CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.
notably, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of the compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. clang-16 implemented CWG2631 in
https://reviews.llvm.org/D136554, but unfortunately this change
was not backported to clang-15.
until we switch over to clang-16, anywhere we pass std::source_location::current()
as the default parameter and expect the behavior defined by CWG2631,
we have to use the compatibility layer provided by Seastar. otherwise
we always end up with the source_location of the callee side, which
is not interesting under most circumstances.
so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). despite that
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate the "namespace compat" defined somewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.
see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.
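a minimal sketch of the default-argument idiom discussed above. it assumes GCC or Clang, because it relies on the `__builtin_FILE()` and `__builtin_LINE()` compiler builtins rather than on `std::source_location`; `toy_source_location` and `caller_line` are hypothetical names and this is not the Seastar implementation.

```cpp
// Assumption: GCC or Clang. __builtin_FILE() and __builtin_LINE() are
// compiler builtins, not standard C++.
struct toy_source_location {
    const char* file;
    unsigned line;

    // When used as default arguments, the builtins are evaluated where the
    // call is written, so the result describes the caller.
    static toy_source_location current(const char* file = __builtin_FILE(),
                                       unsigned line = __builtin_LINE()) {
        return {file, line};
    }
};

// The idiom: a defaulted location parameter captures the call site.
unsigned caller_line(toy_source_location loc = toy_source_location::current()) {
    return loc.line;
}
```

a call such as `caller_line()` is then expected to report the line of the call itself, not the line where `caller_line` is declared.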
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14086
The manager is in charge of updating IO bandwidth on the respective prio class. Currently it uses the global priority manager, but the effort to unify sched classes will require it to use a non-global streaming sched group. After the patch the sched class field is unused, but it's a preparation for the huge (really huge) "switch to seastar API level 7" patch.
ref: #13963
Closes #13997
* github.com:scylladb/scylladb:
stream_manager: Add streaming sched group copy
cql_test_env: Move sched groups initialization up
Several test cases use common operations on files like existence checking, content comparing, etc. with the help of home-brew local helpers. The set makes use of some existing seastar:: ones and generalizes others into test/lib/. The primary intent here is `57 insertions(+), 135 deletions(-)`
Closes #13936
* github.com:scylladb/scylladb:
test: Generalize touch_file() into test_utils.*
test/database: Generalize file/dir touch and exists checks
test/sstables: Use seastar::file_exists() to check
test/sstables: Remove sstdesc
test/sstables: Use compare_files from utils/ in sstable_test
test/sstables: Use compare_files() from utils/ in sstable_3_x_test
test/util: Add compare_file() helpers
The way index_reader maintains io_priority_class can be relaxed a bit. The main intent is to shorten the #13963 final patch a bit; as a side effect index_reader gets its portion of API polishing.
ref: #13963
Closes #13992
* github.com:scylladb/scylladb:
index_reader: Introduce and use default arguments to constructor
index_reader: Use _pc field in get_file_input_stream_options() directly
index_reader: Move index_reader::get_file_input_stream_options to private: block
The manager in question is responsible for maintaining the streaming
class IO bandwidth update. Nowadays it does it via priority manager's
global streaming IO priority class field, but it will need to switch to
streaming sched group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The streaming manager will need to keep its copy of
streaming/maintenance group, so groups should be created early.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of creators of index_reader construct it with default prio class,
null trace pointer and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.
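A rough sketch of what such defaulted constructor arguments look like. All `toy_*` names are hypothetical stand-ins, not the real index_reader signature.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the real types.
enum class toy_use_caching { no, yes };
struct toy_trace_state {};
struct toy_priority_class {};

inline const toy_priority_class& toy_default_priority_class() {
    static const toy_priority_class pc{};
    return pc;
}

class toy_index_reader {
public:
    const toy_priority_class& pc;
    std::shared_ptr<toy_trace_state> trace;
    toy_use_caching caching;

    // Most callers want exactly these defaults, so they can omit all three.
    explicit toy_index_reader(
            const toy_priority_class& pc_ = toy_default_priority_class(),
            std::shared_ptr<toy_trace_state> trace_ = nullptr,
            toy_use_caching caching_ = toy_use_caching::yes)
        : pc(pc_), trace(std::move(trace_)), caching(caching_) {}
};
```

A typical creator then shrinks to `toy_index_reader reader;`, and only the unusual callers spell the arguments out.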
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As described in https://github.com/scylladb/scylladb/issues/8638,
we're moving away from `SimpleStrategy`, in the future
it will become deprecated.
We should remove all uses of it and replace them
with `NetworkTopologyStrategy`.
This change replaces `SimpleStrategy` with
`NetworkTopologyStrategy` in all unit tests,
or at least in the ones where it was reasonable to do so.
Some of the tests were written explicitly to test the
`SimpleStrategy` strategy, or changing the keyspace from
`SimpleStrategy` to `NetworkTopologyStrategy`.
These tests were left intact.
It's still a feature that is supported,
even if it's slowly getting deprecated.
The typical way to use `NetworkTopologyStrategy` is
to specify a replication factor for each datacenter.
This could be a bit cumbersome: we would have to fetch
the list of datacenters, set the repfactors, etc.
Luckily there is another way - we can just specify
a replication factor to use for each existing
datacenter, like this:
```cql
CREATE KEYSPACE {} WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1};
```
This makes the change rather straightforward - just replace all
instances of `'SimpleStrategy'` with `'NetworkTopologyStrategy'`.
Refs: https://github.com/scylladb/scylladb/issues/8638
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13990
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
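The retry-until-done policy with a no-progress timeout can be sketched like this. It is a synchronous illustration with injected callbacks; `cleanup_until_done`, `attempt` and `sleep_for` are hypothetical names, and the real code is asynchronous.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// attempt() cleans up whatever is currently eligible and returns how many
// sstables still need cleanup. sleep_for() is injected so the sketch can be
// driven without real waiting.
bool cleanup_until_done(
        const std::function<std::size_t()>& attempt,
        const std::function<void(std::chrono::seconds)>& sleep_for,
        std::chrono::seconds retry_interval = std::chrono::seconds(1),
        std::chrono::seconds no_progress_timeout = std::chrono::minutes(5)) {
    std::size_t last_pending = attempt();
    std::chrono::seconds stalled{0};
    while (last_pending > 0) {
        if (stalled >= no_progress_timeout) {
            return false;  // no progress for too long: give up
        }
        sleep_for(retry_interval);
        stalled += retry_interval;
        std::size_t pending = attempt();
        if (pending < last_pending) {
            stalled = std::chrono::seconds(0);  // progress resets the timer
        }
        last_pending = pending;
    }
    return true;  // every candidate was cleaned up
}
```

The timer measures time since the last progress, not total elapsed time, so a slow but advancing cleanup is not cut short.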
Fixes #9559
Closes #13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
This implicit link is pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place where enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Although the gossiper already has a system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837
Closes #13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
perform_cleanup may be waiting for those sstables
to become eligible for cleanup so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.
in this change,
* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now an updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so `executor::_timeout_in_ms` can
be initialized on per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
the option to executor in its ctor.
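a simplified sketch of the mechanism listed above, using only the standard library. `toy_option`, `toy_updateable_value` and `toy_executor` are hypothetical stand-ins for the real named_value<>, updateable_value and executor, and the shared pointer is just one easy way to model "observing" an option.

```cpp
#include <chrono>
#include <memory>

// Hypothetical stand-in for a live-updateable config option.
class toy_option {
    std::shared_ptr<int> _value;
public:
    explicit toy_option(int initial) : _value(std::make_shared<int>(initial)) {}
    void set(int v) { *_value = v; }
    std::shared_ptr<const int> source() const { return _value; }
};

// Hypothetical stand-in for a value that follows its option.
class toy_updateable_value {
    std::shared_ptr<const int> _source;
public:
    toy_updateable_value() = default;  // disconnected
    explicit toy_updateable_value(const toy_option& opt) : _source(opt.source()) {}
    int get() const { return _source ? *_source : 0; }
};

class toy_executor {
    // One copy per thread, mirroring the per-shard value in the description.
    static thread_local toy_updateable_value s_default_timeout_in_ms;
public:
    // ctor connects the per-thread value to the option...
    explicit toy_executor(const toy_option& timeout_in_ms) {
        s_default_timeout_in_ms = toy_updateable_value(timeout_in_ms);
    }
    // ...and the dtor disconnects it again.
    ~toy_executor() { s_default_timeout_in_ms = toy_updateable_value(); }

    static std::chrono::milliseconds default_timeout() {
        return std::chrono::milliseconds(s_default_timeout_in_ms.get());
    }
};

thread_local toy_updateable_value toy_executor::s_default_timeout_in_ms;
```

with this shape, changing the option while an executor is alive is visible through `default_timeout()` right away, with no separate setter call.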
Fixes #12232
Closes #13300
* github.com:scylladb/scylladb:
alternator: split the param list of executor ctor into multi lines
alternator,config: make alternator_timeout_in_ms live-updateable
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc, which allow disabling tombstone garbage collection (GC) in compaction.
Mimics the existing APIs /column_family/autocompaction and /storage_service/autocompaction.
The column_family variant must specify a single table only, following the existing convention,
whereas the storage_service one can specify an entire keyspace, or a subset of tables in a keyspace.
column_family API usage
-----
```
The table name must be in keyspace:name format
Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```
storage_service API usage
-----
```
Tables can be specified using a comma-separated list.
Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Enable GC on a subset of tables
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```
Closes #13793
* github.com:scylladb/scylladb:
test: Test new API for disabling tombstone GC
test: rest_api: extract common testing code into generic functions
Add API to disable tombstone GC in compaction
api: storage_service: restore indentation
api: storage_service: extract code to set attribute for a set of tables
tests: Test new option for disabling tombstone GC in compaction
compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
table: Allow tombstone GC in compaction to be disabled on user request
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.
In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint on each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle, that's used to transfer sstable's file across shards, should keep aboard all it needs to re-create a file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owned by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep sharded peering service referenced by the sstables_manager that's initialized in main and passed through the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager takes over the object_storage_config updating facilities from the database (which is overloaded even without them). Later the manager will also be in charge of collecting and exporting S3 metrics. Limiting the number of S3 connections also needs a patch to seastar http::client; there's a PR already doing that, and once (if) it is merged one more fix will come on top.
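A toy model of the chosen approach, assuming one manager object per shard: each manager hands out a single shared client per endpoint, and a file handle carries only the names needed to pick the destination shard's client. All `toy_*` names and `to_file` are hypothetical, and cross-shard mechanics are left out.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct toy_s3_client {
    std::string endpoint;
};

// One instance per shard: at most one client per configured endpoint.
class toy_storage_manager {
    std::map<std::string, std::shared_ptr<toy_s3_client>> _clients;
public:
    std::shared_ptr<toy_s3_client> get_client(const std::string& endpoint) {
        auto& slot = _clients[endpoint];
        if (!slot) {
            slot = std::make_shared<toy_s3_client>(toy_s3_client{endpoint});
        }
        return slot;
    }
    std::size_t client_count() const { return _clients.size(); }
};

// The handle keeps no client, only what is needed to find one later.
struct toy_file_handle {
    std::string endpoint;
    std::string object_name;
};

struct toy_file {
    std::shared_ptr<toy_s3_client> client;
    std::string object_name;
};

// Re-create the file on the destination shard using that shard's manager.
toy_file to_file(const toy_file_handle& handle, toy_storage_manager& local_manager) {
    return toy_file{local_manager.get_client(handle.endpoint), handle.object_name};
}
```

Because every sstable, file and sink goes through `get_client()`, the number of clients per shard is bounded by the number of configured endpoints.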
refs: #13458
refs: #13369
refs: scylladb/seastar#1652
Closes #13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Right now the map<endpoint, config> sits on the sstables manager and its
update is governed by database (because it's peering and can kick other
shards to update it as well).
Having the sharded<storage_manager> at hand frees the database from
the need to update configs and keeps sstables_manager a bit smaller.
Also this will allow keeping s3 clients shared between sstables via this
map in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever sstables_manager needs
to work with the storage (spoiler: only S3 one). It's main-local sharded
peering service, so that container() call can be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
the series drops some of the callers using SSTable generation as integer. as the generation of SSTable is but an identifier, we should not use it as an integer out of generation_type's implementation.
Closes #13845
* github.com:scylladb/scylladb:
test: drop unused helper functions
test: sstable_mutation_test: avoid using helper using generation_type::int_t
test: sstable_move_test: avoid using helper using generation_type::int_t
test: sstable_*test: avoid using helper using generation_type::int_t
test: sstable_3_x_test: do not use reuseable_sst() accepting integer
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default parameter, which is a
generation created using 1. this preserves the existing behavior.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the sources of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible, so with a little effort this can be done.
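To illustrate why a fragmented representation avoids large contiguous allocations, here is a toy model. `toy_fragmented_bytes` is hypothetical and far simpler than managed_bytes; its `with_linearized()` only mirrors the general idea of the helper of the same name listed in this series, not its actual signature.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Stores a value as bounded-size fragments instead of one large buffer.
class toy_fragmented_bytes {
    std::vector<std::string> _fragments;
    std::size_t _size;
public:
    toy_fragmented_bytes(const std::string& data, std::size_t max_fragment_size)
        : _size(data.size()) {
        max_fragment_size = std::max<std::size_t>(max_fragment_size, 1);
        for (std::size_t off = 0; off < data.size(); off += max_fragment_size) {
            _fragments.push_back(data.substr(off, max_fragment_size));
        }
    }

    std::size_t size() const { return _size; }
    std::size_t fragment_count() const { return _fragments.size(); }
    bool is_linearized() const { return _fragments.size() <= 1; }

    // Calls func with a contiguous view; copies only when there is more
    // than one fragment.
    template <typename Func>
    auto with_linearized(Func&& func) const {
        if (_fragments.size() == 1) {
            return func(_fragments.front());
        }
        std::string joined;
        joined.reserve(_size);
        for (const auto& fragment : _fragments) {
            joined += fragment;
        }
        return func(joined);
    }
};
```

Code that can work fragment by fragment never pays for the join; only callers that insist on contiguous bytes do.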
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes #13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.
Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.
The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
this is one of the changes to reduce the usage of integer-based generations
in tests. in future, we will need to expand the tests to exercise UUID-based
generations, or at least to be neutral to the underlying generation's
identifier type. so, removing the helpers which only accept `generation_type::int_t`
helps us to make this happen.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we introduced the linkage to Boost::unit_test_framework in
fe70333c19, this library is used by
test/lib/test_utils.cc, so update CMake accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13781
There are two of them currently with slightly different declarations. Better to leave only one.
Closes #13772
* github.com:scylladb/scylladb:
test: Deduplicate test::filename() static overload
test: Make test::filename return fs::path
The current S3 client was tested over minio and it takes a few more touches to work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is whether the request _body_ is included in the signature calculation. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body must be included in the signature calculation), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), which in turn requires HTTPS.
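For illustration, this is roughly how the "unsigned payload" choice shows up in the text that gets signed under AWS Signature Version 4: the literal `UNSIGNED-PAYLOAD` takes the place of the body hash. The sketch builds only the canonical request string; hashing and HMAC steps are omitted, `canonical_request_unsigned_payload` is a hypothetical helper, and it is not the code from this PR.

```cpp
#include <map>
#include <string>

// Builds the SigV4 canonical request for an unsigned-payload request.
// Assumptions: header names are already lowercase and values are trimmed;
// std::map keeps the headers sorted by name, as SigV4 requires.
std::string canonical_request_unsigned_payload(
        const std::string& method,
        const std::string& canonical_uri,
        const std::string& canonical_query,
        std::map<std::string, std::string> headers) {
    // The body is not hashed; this literal is sent and signed instead.
    headers["x-amz-content-sha256"] = "UNSIGNED-PAYLOAD";

    std::string canonical_headers;
    std::string signed_headers;
    for (const auto& header : headers) {
        canonical_headers += header.first + ":" + header.second + "\n";
        if (!signed_headers.empty()) {
            signed_headers += ";";
        }
        signed_headers += header.first;
    }
    return method + "\n" + canonical_uri + "\n" + canonical_query + "\n"
         + canonical_headers + "\n" + signed_headers + "\n"
         + "UNSIGNED-PAYLOAD";
}
```

Since the body never enters this string, its integrity is not covered by the signature, which is why the transport has to be HTTPS.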
So what this set does is make the existing S3 client code sign requests. In order to sign a request the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit S3 client needs.
In order to properly support HTTPS the PR adds a special connection factory to be used with the seastar http client. The factory resolves AWS endpoint names via DNS and configures gnutls systemtrust.
fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped; callers are patched to use
the non-static one, making the code a tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different and they sometimes
know it is. For that they use test wrapper that has access to private
members and may make assumptions about meaning of sstable::filename().
That said, the test::filename() should return fs::path, not sstring.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the
move to work, some existing dependencies of raft_group0 are dropped,
since they are not really needed during the object creation.
Fixes #13522
In case an sstable unit test case is run individually, it would fail
with exception saying that S3_... environment is not set. It's better to
skip the test-case rather than fail. If someone wants to run it from
shell, it will have to prepare S3 server (minio/AWS public bucket) and
provide proper environment for the test-case.
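The skip-instead-of-fail decision can be sketched as a small predicate. The variable name is passed in by the caller, since the exact S3_... names are not spelled out here; `should_skip_without_env` is a hypothetical helper, not the one used by the tests.

```cpp
#include <cstdlib>
#include <string>

// Returns true when the named environment variable is missing or empty,
// i.e. when a test case that depends on it should be skipped.
bool should_skip_without_env(const std::string& var_name) {
    const char* value = std::getenv(var_name.c_str());
    return value == nullptr || *value == '\0';
}
```

A test case would check this first and return early with a "skipped" message rather than throwing.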
refs: #13569
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13755
raft_group0 does not really depend on cdc::generation_service; it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of passing it at creation.
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on endpoint config and makes test get the port number from
minio starting code via environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart-uploading is banned for unsigned
requests on the S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so because S3 URLs without a region
name in them are an option we may want to use some time.
To keep the described configuration the proposed place is the
object_storage.yaml file with the format
  endpoints:
    - name: a.b.c
      port: 443
      aws_key: 12345
      aws_secret: abcdefghijklmnop
      ...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently there are only 2 tests for S3 -- the pure client test and the compound object_store test that launches scylla, creates an s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit tests for sstables functionality, part of which can run over S3 storage too.
This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.
fixes: #13015
Closes #13569
* github.com:scylladb/scylladb:
test: Make resharding test run over s3 too
test: Add lambda to fetch bloom filter size
test: Tune resharding test use of sstable::test_env
test: Make datafile test case run over s3 too
test: Propagate storage options to table_for_test
test: Add support for s3 storage_options in config
test: Outline sstables::test_env::do_with_async()
test: Keep storage options on sstable_test_env config
sstables: Add and call storage::destroy()
sstables: Coroutinize sstable::destroy()
Teach table_for_tests to use any storage options, not just local ones. For
now the only user that passes non-local options is sstables::test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is the helper that makes these options based
on the env left by minio launcher from test.py.
Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that it could be set to s3 by the test case on demand. Default is
local storage which uses env's tempdir or explicit path argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In many cases we trigger offstrategy compaction opportunistically
even when there's nothing to do. In this case we still print
lots of info-level messages to the log and call
`run_offstrategy_compaction`, which wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
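The early bail-out has roughly this shape. `toy_table` and `maybe_run_offstrategy` are hypothetical, and the debug log is modelled as a plain vector of strings.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a table with a maintenance set.
struct toy_table {
    std::vector<std::string> maintenance_set;
    std::vector<std::string> debug_log;
    int compactions_run = 0;
};

// Returns true only if off-strategy compaction actually ran.
bool maybe_run_offstrategy(toy_table& table) {
    if (table.maintenance_set.empty()) {
        // Nothing to do: a single debug-level message, no further work.
        table.debug_log.push_back("Skipping off-strategy compaction");
        return false;
    }
    ++table.compactions_run;
    table.maintenance_set.clear();
    return true;
}
```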
Fixes #13466
Also, add a group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
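As an illustration of the identifier format quoted above, a formatter could look like this. `toy_group_id` is a hypothetical struct; the fields of the real group_id class are not described here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical identifier for one compaction group of a table.
struct toy_group_id {
    std::string ks_name;
    std::string cf_name;
    std::size_t idx;
    std::size_t total;

    // Renders "ks_name.cf_name compaction_group=idx/total".
    std::string to_string() const {
        return ks_name + "." + cf_name + " compaction_group="
             + std::to_string(idx) + "/" + std::to_string(total);
    }
};
```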
Fixes #13467
Closes #13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
bytes_on_disk is the sum of all sstable components.
As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than an additional
step after all components were already parsed.
Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.
This simplifies s3 life as it no longer has to care about feeding
in bytes_on_disk, which is currently limited to data and index
sizes only.
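A toy model of the incremental accounting: each component read or written adds its size to the running total, so no separate summing pass is needed. `toy_sstable` and its method signatures are hypothetical and omit the actual parsing and writing.

```cpp
#include <cstdint>
#include <string>

class toy_sstable {
    std::uint64_t _bytes_on_disk = 0;
public:
    // The file size is known before parsing, so account for it right away.
    void read_simple(const std::string& component, std::uint64_t file_size) {
        (void)component;  // parsing omitted in this sketch
        _bytes_on_disk += file_size;
    }

    // The write offset after the component equals the bytes written for it.
    void write_simple(const std::string& component, const std::string& payload) {
        (void)component;  // actual write omitted in this sketch
        _bytes_on_disk += payload.size();
    }

    std::uint64_t bytes_on_disk() const { return _bytes_on_disk; }
};
```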
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of a replication strategy; they
just alter the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
Now gossiper doesn't need those two as its dependencies, so they can be
removed, making the code shorter and the dependency graph simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>