The current S3 client was tested against MinIO, and it takes a few more touches to make it work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is whether the request _body_ is included in the signature calculation. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body must be included in the signature calculation), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is make the existing S3 client code sign requests. In order to sign a request the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature-generating code was previously merged (moved from the alternator code) and updated to suit the S3 client's needs.
In order to properly support HTTPS the PR adds a special connection factory to be used with the seastar http client. The factory resolves AWS endpoint names via DNS and configures the GnuTLS system trust store.
Fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the
move to work, some existing dependencies of raft_group0 are dropped,
since they are not really needed during object creation.
Fixes #13522
raft_group0 does not really depend on cdc::generation_service; it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of during its creation.
If the endpoint config specifies an AWS key, secret and region, all the
S3 requests get signed. The signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes the
x-amz-date, x-amz-content-sha256 and host headers in the signing list.
The content can be left unsigned when sent over HTTPS, which is what this
patch does.
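For illustration, a hedged sketch of that minimal header set with AWS Signature V4 unsigned payload (setting `x-amz-content-sha256` to the literal `UNSIGNED-PAYLOAD` is what excludes the body from the signature); the endpoint and date values below are made up:
```
#include <map>
#include <string>

// Illustration only -- not the actual utils/s3 code. Header names follow the
// commit description; the endpoint and date values are invented examples.
std::map<std::string, std::string> headers_to_sign = {
    {"host", "bucket.s3.us-east-1.amazonaws.com"},
    {"x-amz-date", "20230501T120000Z"},
    {"x-amz-content-sha256", "UNSIGNED-PAYLOAD"}, // body excluded from signature calculation
};
```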
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar factories work on a socket_address, but in S3 we have
an endpoint name, which is a DNS name in the case of real S3. So this patch
creates the http client for S3 with a custom connection factory that
does two things.
First, it resolves the provided endpoint name into an address.
Second, it loads the trust file from the provided file path (or sets system
trust if configured that way).
Since S3 client creation is currently non-waiting code, the above
initialization is spawned in a fiber, and this fiber is waited upon
before creating the connection.
This code probably deserves to live in seastar, but for now it can land
next to utils/s3/client.cc.
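A minimal sketch of those two steps with seastar primitives (net::dns::resolve_name, tls::certificate_credentials, tls::connect); the function below is hypothetical and only illustrates the idea, the real factory plugs into the http client's connection factory interface:
```
#include <optional>
#include <seastar/core/coroutine.hh>
#include <seastar/net/dns.hh>
#include <seastar/net/tls.hh>

using namespace seastar;

// Hypothetical helper, not the actual factory: resolve the endpoint name,
// set up TLS trust (file-based or system), then connect over TLS.
future<connected_socket> connect_to_s3(sstring endpoint, uint16_t port,
                                        std::optional<sstring> trust_file) {
    net::inet_address addr = co_await net::dns::resolve_name(endpoint);
    auto creds = seastar::make_shared<tls::certificate_credentials>();
    if (trust_file) {
        co_await creds->set_x509_trust_file(*trust_file, tls::x509_crt_format::PEM);
    } else {
        co_await creds->set_system_trust();
    }
    co_return co_await tls::connect(creds, socket_address(addr, port), endpoint);
}
```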
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is the port the tests' local MinIO is started with. This patch keeps the
port number in the endpoint config and makes the test get the port number
from the MinIO-starting code via an environment variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access a real S3 bucket, the client should use signed requests
over HTTPS. Partially this is due to security considerations, partially
it is unavoidable, because multipart uploading is banned for unsigned
requests on S3. Also, signed requests over plain HTTP require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure HTTPS and keep the payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so because there's an option to use S3
URLs without a region name in them, which we may want to use some time.
The proposed place to keep the described configuration is the
object_storage.yaml file, with the following format:
endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...
When loaded, the map goes into db::config and will later be propagated
down to the sstables code (see the next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, an alternative to the
current vnode-based replication. The vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side: divide the resources of a replica shard into tablets, with the goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard of a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the groundwork for them.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy;
they just alter the way in which data ownership is managed. In theory, we
could use them for other strategies as well, like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
raft_group0 does not really depend on migration_manager; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of during its creation.
raft_group0 does not really depend on query_processor; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of during its creation.
raft_group0 does not really depend on storage_service; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of during its creation.
Currently, this is responsible for injecting mutations of system.tablets into
schema changes.
Note that not all migrations are handled currently. Dependent view or
CDC table drops are not handled.
All users of the global proxy are gone (*), so the proxy can be made fully local to main/cql_test_env.
(*) one test case still needs it, but it can get it via cql_test_env
Closes #13616
* github.com:scylladb/scylladb:
code: Remove global proxy
schema_change_test: Use proxy from cql_test_env
test: Carry proxy reference on cql_test_env
Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`).
The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The
difference is that V2 lives in `system_distributed_keyspace` and writes to it
are distributed using regular `storage_proxy` replication mechanisms based on
the token ring. The V3 table lives in `system_keyspace` and any mutations
written to it will go through group 0.
Extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`; it stores the UUID of a newly introduced CDC
generation, which is used as the partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
When the topology coordinator handles a request for a node to join, calculate a new
CDC generation using the bootstrapping node's tokens, translate it to mutation
format, and insert this mutation into the CDC_GENERATIONS_V3 table through group 0
at the same time we assign tokens to the node in Raft topology. The partition
key for this data is stored in the bootstrapping node's `ring_slice`.
After inserting the new CDC generation data, we need to pick a timestamp for this
generation and commit it, telling all nodes in the cluster to start using the
generation for CDC log writes once their clocks cross that timestamp.
We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping
node's `ring_slice` - which serves as the key to the table where the CDC
generation data is stored - and combines it with a timestamp which it generates
a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by
default 1 minute). This gives us a CDC generation ID which we commit into the
topology state as the `current_cdc_generation_id` while switching the saga to
the next step, `write_both_read_old`.
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description tables so
CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living in the
`system_distributed` keyspace) so it requires `token_metadata` to be nonempty.
We need a hack for the case of bootstrapping the first node in the cluster -
turning the tokens into normal tokens earlier in the procedure in
`token_metadata`, but this is fine for the single-node case since no streaming
is happening.
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data structures
that are used when coordinating writes to CDC log tables.
We include the current CDC generation data in topology snapshot transfers.
Some fixes and refactors included.
Closes #13385
* github.com:scylladb/scylladb:
docs: cdc: describe generation changes using group 0 topology coordinator
cdc: generation_service: add a FIXME
cdc: generation_service: add legacy_ prefix for gossiper-based functions
storage_service: include current CDC generation data in topology snapshots
db: system_keyspace: introduce `query_mutations` with range/slice
storage_service: hold group 0 apply mutex when reading topology snapshot
service: raft_group0_client: introduce `hold_read_apply_mutex`
storage_service: use CDC generations introduced by Raft topology
raft topology: publish new CDC generation to the user description tables
raft topology: commit a new CDC generation on node bootstrap
raft topology: create new CDC generation data during node bootstrap
service: topology_state_machine: make topology::find const
db: system_keyspace: small refactor of `load_topology_state`
cdc: generation: extract pure parts of `make_new_generation` outside
db: system_keyspace: add storage for CDC generations managed by group 0
service: topology_state_machine: better error checking for state name (de)serialization
service: raft: plumbing `cdc::generation_service&`
cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter
cdc: generation: make `topology_description_generator::get_sharding_info` a parameter
sys_dist_ks: make `get_cdc_generation_mutations` public
sys_dist_ks: move find_schema outside `get_cdc_generation_mutations`
sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations`
service/raft: group0_state_machine: signal topology state machine in `load_snapshot`
No code needs the global proxy anymore. Keep on-stack values in main and
cql_test_env and keep the pointer in the debug:: namespace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that the routes referencing and using ctx.sp don't step on a proxy
that's going to be removed (not now, but some time later) from under
them on shutdown.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch reads the relabel config from a file if it exists. A problem
with the file or metrics would stop Scylla from starting. This is on
purpose, as it's a configuration problem that should be addressed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
As a first step towards using host_id to identify nodes instead of IP addresses,
this series introduces a node abstraction, kept in the topology,
indexed by both host_id and endpoint.
The revised interface also allows callers to handle cases where nodes
are not found in the topology more gracefully by introducing `find_node()` functions
that look up nodes by host_id or inet_address and take a `must_exist` parameter:
if false (the default parameter value), `find_node()` returns nullptr if the node is not found;
if true, `find_node()` throws an internal error, since this indicates a violation of the internal
assumption that the node must exist in the topology.
Callers that can handle missing nodes should use the more permissive flavor
and handle the !find_node() case gracefully.
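A self-contained, hypothetical sketch of the lookup behavior described above (the real locator::topology keys nodes by host_id / inet_address, returns locator::node*, and reports the error via the internal-error machinery):
```
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-ins, not the real locator types.
struct node_sketch {
    int idx;
};

class topology_sketch {
    std::unordered_map<int, node_sketch> _nodes; // stand-in for the host_id index
public:
    const node_sketch* find_node(int id, bool must_exist = false) const {
        auto it = _nodes.find(id);
        if (it != _nodes.end()) {
            return &it->second;
        }
        if (must_exist) {
            // stand-in for the internal error: the node was assumed to exist
            throw std::runtime_error("node not found in topology");
        }
        return nullptr; // permissive flavor: caller handles the missing node
    }
};
```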
Closes #11987
* github.com:scylladb/scylladb:
topology: add node state
topology: remove dead code
locator: add class node
topology: rename update_endpoint to add_or_update_endpoint
topology: define get_{rack,datacenter} inline
shared_token_metadata: mutate_token_metadata: replicate to all shards
locator: endpoint_dc_rack: refactor default_location
locator: endpoint_dc_rack: define default operator==
test: storage_proxy_test: provide valid endpoint_dc_rack
And keep per-node information (idx, host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by the topology on several indices:
idx, host_id, endpoint, current/pending, per dc, per dc/rack.
The node index is a shorthand identifier for the node.
node* and the index are valid while the respective topology instance is valid.
To use them, the caller must hold on to the topology / token_metadata object
(e.g. via a token_metadata_ptr or effective_replication_map).
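A minimal, hypothetical sketch of such a per-node record (field names follow the description above; the real locator::node uses the proper host_id / inet_address / endpoint_dc_rack types and accessors):
```
#include <string>

// Hypothetical stand-in types; real code uses locator::host_id,
// gms::inet_address and locator::endpoint_dc_rack.
struct dc_rack { std::string dc, rack; };

// Per-node record as described above: a shorthand index plus identity,
// location and pending state, owned and indexed by the topology.
struct node_record {
    unsigned idx;            // valid only while the owning topology instance lives
    std::string host_id;
    std::string endpoint;
    dc_rack location;
    bool is_pending;
};
```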
Refs #6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
topology: add node idx
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The commitlog api originally implied that the commitlog_directory would contain files from a single commitlog instance. This is checked in segment_manager::list_descriptors: if it encounters a file with an unknown prefix, an exception occurs in `commitlog::descriptor::descriptor`, which is logged at the `WARN` level.
A new schema commitlog was added recently, which shares the filesystem directory with the main commitlog. This causes warnings to be emitted on each boot. This patch solves the warnings problem by moving the schema commitlog to a separate directory. In addition, the user can employ the new `schema_commitlog_directory` parameter to move the schema commitlog to another disk drive.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog) is also scheduled for 5.3, and it already requires a clean rolling restart (no cl segments to replay), we don't need to specifically handle upgrade here.
Fixes: #11867
Closes #13263
* github.com:scylladb/scylladb:
commitlog: use separate directory for schema commitlog
schema commitlog: fix commitlog_total_space_in_mb initialization
The commitlog api originally implied that
the commitlog_directory would contain files
from a single commitlog instance. This is
checked in segment_manager::list_descriptors:
if it encounters a file with an unknown
prefix, an exception occurs in
commitlog::descriptor::descriptor, which is
logged at the WARN level.
A new schema commitlog was added recently,
which shares the filesystem directory with
the main commitlog. This causes warnings
to be emitted on each boot. This patch
solves the warnings problem by moving
the schema commitlog to a separate directory.
In addition, the user can employ the new
schema_commitlog_directory parameter to move
the schema commitlog to another disk drive.
By default, the schema commitlog directory is
nested in the commitlog_directory. This can help
avoid problems during an upgrade if the
commitlog_directory in the custom scylla.yaml
is located on a separate disk partition.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog)
is also scheduled for 5.3, and it already
requires a clean rolling restart (no cl
segments to replay), we don't need to
specifically handle upgrade here.
Fixes: #11867
The wasm engine is moved from replica::database to the query_processor.
The wasm instance cache and compilation thread runner were already there,
but now they're also initialized in the query_processor constructor.
By moving the initialization to the constructor, we can now
be certain that all wasm-related objects (wasm instance cache,
compilation thread runner, and wasm engine, which was already
passed in the constructor) are initialized when we try to use
them because we have to use the query processor to access them
anyway.
The change is also motivated by the fact that we're planning
to take Wasm UDFs out of experimental, after which they should
stop getting special treatment.
Closes #13311
* github.com:scylladb/scylladb:
wasm: move wasm initialization to query_processor constructor
wasm: return wasm instance cache as a reference instead of a pointer
wasm: move wasm engine to query_processor
By moving the initialization to the constructor, we can now
be certain that all wasm-related objects (wasm instance cache,
compilation thread runner, and wasm engine, which was already
passed in the constructor) are initialized when we try to use
them because we have to use the query processor to access them
anyway.
The change is also motivated by the fact that we're planning
to take Wasm UDFs out of experimental, after which they should
stop getting special treatment.
The builder will need the generator for view_builder::consumer in one of the
next patches.
The builder is a standalone service that starts among the last, and no
other services need the builder as a dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The generator will be responsible for spreading view updates with the
help of the mutate_MV helper. The latter needs the storage proxy to operate, so
the generator gets this dependency in advance.
There's no need to change the start/stop order at the moment: the generator
already starts after and stops before the proxy. Also, services that have
the generator as a dependency are not required by the proxy (even indirectly), so
no circular dependency is introduced at this point.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm engine is used for compiling and executing Wasm UDFs, so
the query_processor is a more appropriate location for it than
replica::database, especially because the wasm instance cache
and the wasm alien thread runner are already there.
This patch also reduces the number of wasm engines to 1, shared by
all shards, as recommended by the wasmtime developers.
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee an atomic write to the commitlog.
Fixes: #12642
Closes #13134
* github.com:scylladb/scylladb:
test_raft_upgrade: add a test for schema commit log feature
scylla_cluster.py: add start flag to server_add
ServerInfo: drop host_id
scylla_cluster.py: add config to server_add
scylla_cluster.py: add expected_error to server_start
scylla_cluster.py: ScyllaServer.start, refactor error reporting
scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
raft: check if schema commitlog is initialized
raft: move raft initialization after init_system_keyspace
database: rename before_schema_keyspace_init->maybe_init_schema_commitlog
raft: use schema commitlog for raft tables
init_system_keyspace: refactoring towards explicit load phases
Refuse to boot if neither the schema commitlog feature
nor force_schema_commit_log is set. For the upgrade
procedure the user should wait until
the schema commitlog feature is enabled before
enabling consistent_cluster_management.
Raft tables are loaded on the second call to
init_system_keyspace, so it seems more logical
to move initialization after it. This is not
necessary right now since raft tables are not used
in this initialization logic, but it may
change in the future and cause troubles.
We are going to move the raft tables from the first
load phase to the second. This means the second
init_system_keyspace call will load raft tables along
with the schema, making the name of this function imprecise.
We aim (#12642) to use the schema commit log
for raft tables. Now they are loaded at
the first call to init_system_keyspace in
main.cc, but the schema commitlog is only
initialized shortly before the second
call. This is important, since the schema
commitlog initialization
(database::before_schema_keyspace_init)
needs to access schema commitlog feature,
which is loaded from system.scylla_local
and therefore is only available after the
first init_system_keyspace call.
So the idea is to defer the loading of the raft tables
until the second call to init_system_keyspace,
just as it works for schema tables.
For this we need a tool to mark which tables
should be loaded in the first or second phase.
To do this, in this patch we introduce system_table_load_phase
enum. It's set in the schema_static_props for schema tables.
It replaces the system_keyspace::table_selector in the
signature of init_system_keyspace.
The call site for populate_keyspace in init_system_keyspace
was changed: table_selector.contains_keyspace was replaced with
db.local().has_keyspace. This check prevents calling
populate_keyspace(system_schema) on phase1, but allows for
populate_keyspace(system) on phase2 (to init raft tables).
On this second call some tables from system keyspace
(e.g. system.local) may have already been populated on phase1.
This check protects from double-populating them, since every
populated cf is marked as ready_for_writes.
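A hedged sketch of that marker (names follow the description above; the real enum and schema_static_props definitions in the schema code may differ):
```
// Hypothetical sketch, not the actual definitions.
enum class system_table_load_phase {
    phase1,   // loaded by the first init_system_keyspace call
    phase2,   // loaded by the second call, together with schema (and now raft) tables
};

struct schema_static_props {
    // ... other static properties ...
    system_table_load_phase load_phase = system_table_load_phase::phase1;
};
```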
This patch extends a previous patch that added these metrics globally:
- cql_requests_count
- cql_request_bytes
- cql_response_bytes
This patch adds a "scheduling_group_name" label to these metrics and changes corresponding
counters to be accounted on a per-scheduling-group level.
As a bonus this patch also marks all 3 metrics as 'skip_when_empty'.
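A rough sketch of what the per-scheduling-group label could look like with the seastar metrics API; the function, group name and variable names below are assumptions for illustration, not the actual call site in the patch:
```
#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

namespace sm = seastar::metrics;

// Assumed illustration: one label instance per scheduling group, one counter
// series per group, with skip_when_empty so idle groups are not exported.
static const sm::label sg_label("scheduling_group_name");

void register_cql_requests_counter(sm::metric_groups& metrics,
                                   const seastar::sstring& sg_name,
                                   const uint64_t& requests_count) {
    metrics.add_group("transport", {
        sm::make_counter("cql_requests_count", requests_count,
                         sm::description("Counts the total number of CQL requests"),
                         {sg_label(sg_name)})
            .set_skip_when_empty(),
    });
}
```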
Ref #13061
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20230321201412.3004845-1-vladz@scylladb.com>
To apply topology_change commands, group0_state_machine needs access
to the storage service, to support topology changes over raft.
Message-Id: <20230316112801.1004602-10-gleb@scylladb.com>
On start, Scylla checks if the option is set. Nowadays this is useless, as
the option was removed from seastar (see the 9e34779c update).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13148
Under some circumstances, service_level_controller renames service
levels for internal purposes. However, the per-service-level metrics
registered by storage_proxy keep the name seen at first registration
time. This sometimes leads to mislabeled metrics.
Fix that by re-registering the metrics after scheduling groups
are renamed.
Fixes scylladb/scylla-enterprise#2755
Closes #13174
A simplified, more direct version of "dependency injection":
the caller/initiator (main/cql_test_env) provides the set of
services it will eventually start. A configurable can remember
these, and use them, at least after the "start" notification.
Closes #13037
Allows a configurable to subscribe to life cycle notifications for the Scylla app,
i.e. do stuff on start/stop.
Also allow configurables in cql_test_env
v2:
* Fix camel casing
* Make callbacks future<> (should have been. mismerge?)
Closes #13035
The compilation of wasm UDFs is performed by a call to a foreign
function, which cannot be broken up with yield points and, as a
result, causes long reactor stalls for big UDFs.
We avoid them by submitting the compilation task to a non-seastar
std::thread, and retrieving the result using seastar::alien.
The thread is created at the start of the program. It executes
tasks from a queue in an infinite loop.
All seastar shards reference the thread through a std::shared_ptr
to an `alien_thread_runner`.
Considering that the compilation takes a long time anyway, the
alien_thread_runner is implemented with a focus on simplicity more
than on performance. The tasks are stored in an std::queue; reading
from and writing to it are synchronized with an std::mutex, and an
std::condition_variable is used to wait until the queue has elements.
When the destructor of the alien runner is called, an std::nullopt
sentinel is pushed to the queue, and after all remaining tasks are
finished and the sentinel is read, the thread finishes.
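A self-contained sketch of the queue-plus-sentinel pattern described above (illustration only; the real alien_thread_runner also hands results back to the reactor via seastar::alien and differs in detail):
```
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Hypothetical sketch: a worker thread consuming tasks from a mutex/condvar
// protected queue, stopped by a std::nullopt sentinel pushed from the destructor.
class alien_thread_runner_sketch {
    std::mutex _mut;
    std::condition_variable _cv;
    std::queue<std::optional<std::function<void()>>> _q; // nullopt == stop sentinel
    std::thread _thread;
public:
    alien_thread_runner_sketch() : _thread([this] {
        for (;;) {
            std::optional<std::function<void()>> task;
            {
                std::unique_lock lk(_mut);
                _cv.wait(lk, [this] { return !_q.empty(); });
                task = std::move(_q.front());
                _q.pop();
            }
            if (!task) {
                return; // sentinel read after all earlier tasks: finish the thread
            }
            (*task)(); // run the (long) compilation task off the reactor
        }
    }) {}

    void submit(std::function<void()> f) {
        {
            std::lock_guard lk(_mut);
            _q.emplace(std::move(f));
        }
        _cv.notify_one();
    }

    ~alien_thread_runner_sketch() {
        {
            std::lock_guard lk(_mut);
            _q.emplace(std::nullopt);
        }
        _cv.notify_one();
        _thread.join();
    }
};
```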
Fixes #12904
Closes #13051
* github.com:scylladb/scylladb:
wasm: move compilation to an alien thread
wasm: convert compilation to a future
The compilation of wasm UDFs is performed by a call to a foreign
function, which cannot be broken up with yield points and, as a
result, causes long reactor stalls for big UDFs.
We avoid them by submitting the compilation task to a non-seastar
std::thread, and retrieving the result using seastar::alien.
The thread is created at the start of the program. It executes
tasks from a queue in an infinite loop.
All seastar shards reference the thread through a std::shared_ptr
to an `alien_thread_runner`.
Considering that the compilation takes a long time anyway, the
alien_thread_runner is implemented with a focus on simplicity more
than on performance. The tasks are stored in an std::queue; reading
from and writing to it are synchronized with an std::mutex, and an
std::condition_variable is used to wait until the queue has elements.
When the destructor of the alien runner is called, an std::nullopt
sentinel is pushed to the queue, and after all remaining tasks are
finished and the sentinel is read, the thread finishes.
There are two places that do it -- the commitlog and batchlog replayers. Both can have a local system-keyspace reference and use the system keyspace's local query processor for it. The peering save_truncation_record() is not that simple and is not patched by this PR.
Closes #13087
* github.com:scylladb/scylladb:
system_keyspace: Unstatic get_truncation_record()
system_keyspace: Unstatic get_truncated_at()
batchlog_manager: Add system_keyspace dependency
main: Swap batchlog manager and system keyspace starts
system_keyspace: Unstatic get_truncated_position()
system_keyspace: Remove unused method
commitlog: Create commitlog_replayer with system keyspace
test: Make cql_test_env::get_system_keyspace() return sharded
commiltlog: Line-up field definitions
This change also includes a change to main, to make this commit compile.
See below:
* seastar 9b6e181e42...9cbc1fe889 (46):
> Merge 'Make io-tester jobs share sched classes' from Pavel Emelyanov
> io_tester.md: Update the `rps` configuration option description
> io_tester: Add option to limit total number of requests sent
> Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov
> io_tester: Add option to share classes between jobs
> rpc: Abort connection if send_entry() fails
> Merge 'build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS' from Kefu Chai
> build: cooking.sh: use the same BUILD_SHARED_LIBS when building ingredients
> build: cooking.sh: use the same generator when building ingredients
> core/memory: handle `strerror_r` returning static string
> Merge 'build, rpc: lz4 related cleanups' from Kefu Chai
> build, rpc: do not support lz4 < 1.7.3
> build: set the correct version when finding lz4
> build: include CheckSymbolExists
> rpc: do not include lz4.h in header
> build: set CMP0135 for Cooking.cmake
> docs: drop building-*.md
> Merge 'seastar-addr2line: cleanups' from Kefu Chai
> seastar-addr2line: refactor tests using unittest
> seastar-addr2line: extract do_test() and main()
> seastar-addr2line: do not import unused modules
> scheduling: add a `rename` callback to scheduling_group_key_config
> reactor: syscall thread: wakeup up reactor with finer granularity
> build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS
> build: extract dpdk_extra_cflags out
> core/sstring: remove a temporary variable
> Merge 'treewide: include what we use, and add a checkheaders target' from Kefu Chai
> perftune.py: auto-select the same number of IRQ cores on each NUMA
> prometheus: remove unused headers
> core/sstring: define <=> operator for sstring
> Merge 'core: s/reserve_additional_memory/reserve_additional_memory_per_shard/' from Kefu Chai
> include: do not include <concepts> directly
> coding_style: note on self-contained header requirement
> circileci: build checkheaders in addition to default target
> build: add checkheaders target
> net/toeplitz: s/u_int/unsigned/
> net/tcp-stack: add forward declaration for seastar::socket
> core, net, util: include used headers
* main: set reserved memory for wasm on per-shard basis
This change is a follow-up of
f05d612da8 and
4a0134a097.
It depends on the related change in Seastar to reserve
additional memory on a per-shard basis.
Per Wojciech Mitros's comment:
> it should have probably been 50MB per shard
In other words, we always execute the same set of UDFs on all
shards, and while one cannot predict the number of shards, one can
roughly estimate the amount of memory a regular (set of) UDFs
would use, so a per-shard setting makes more sense.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The manager will need the system keyspace to get the truncation record from, so add it
explicitly. The start-stop sequence now allows that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former needs the latter to get truncation records from and will thus
need it as an explicit dependency. In order to have that, the batchlog
manager needs to start after the system keyspace. This works, as starting
the batchlog manager doesn't do anything that's required by the system
keyspace. This is indirectly proven by cql-test-env, in which the batchlog
manager starts later than it does in main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>