The lifecycle of the reader lifecycle policy and all the resources the
readers use is now enclosed in that of the multishard reader, thanks to
its close() method. We can now remove all the workarounds we had in place
to keep various resources alive until background reader cleanup finishes;
when this method returns, the semaphore is safe to destroy. This in turn
will enable us to get rid of all the machinery we have in place to deal
with the semaphore having to outlive the lifecycle policy without a clear
point in time at which it is safe to destroy it.
The test reader lifecycle policy has a mode in which it wants to ensure
that all inactive readers are evicted, so tests can stress the reader
recreation logic. For this it currently employs a trick of creating a
waiter on the semaphore. I don't even know how this works (or if it does
at all), but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive readers are immediately evicted, while still allowing a single
read to be admitted at any time.
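As an illustration of why a single count with no memory forces eviction,
here is a toy model (not Scylla's actual reader_concurrency_semaphore; all
names below are made up): a read that goes inactive keeps holding its only
unit, so admitting the next read always has to evict it first.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for the real semaphore: one unit of count, no memory budget.
class toy_read_semaphore {
    int _count;
    std::deque<std::function<void()>> _inactive; // eviction hooks of parked reads
public:
    explicit toy_read_semaphore(int count) : _count(count) {}

    // A read that pauses is registered as inactive; it still holds its unit
    // but may be evicted whenever a new admission needs that unit back.
    void register_inactive(std::function<void()> evict) {
        _inactive.push_back(std::move(evict));
    }

    // Admit a new read, evicting inactive reads until a unit frees up.
    bool try_admit() {
        while (_count == 0 && !_inactive.empty()) {
            auto evict = std::move(_inactive.front());
            _inactive.pop_front();
            evict();    // the evicted read gives its unit back
            ++_count;
        }
        if (_count == 0) {
            return false;
        }
        --_count;
        return true;
    }
};

int main() {
    toy_read_semaphore sem(1);      // single count, as in the test policy
    assert(sem.try_admit());        // the first read takes the only unit
    bool evicted = false;
    sem.register_inactive([&] { evicted = true; });  // the read pauses
    assert(sem.try_admit());        // the next admission evicts it immediately
    assert(evicted);
}
```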
Currently shard_reader::close() (its caller) goes to the remote shard,
copies all fragments left there back to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But `close()` can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
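A sketch of the resulting shape (the types and the exact `destroy_reader()`
signature are assumptions for illustration, not the real code):

```cpp
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <utility>

// Sketch only: `Reader` and `LifecyclePolicy` stand in for the real Scylla
// types. The point: close(), unlike the old stop(), can wait on
// smp::submit_to(), so the reader is simply destroyed on the shard it lives
// on, mirroring how create_reader() is invoked.
template <typename LifecyclePolicy, typename Reader>
seastar::future<> destroy_on_owning_shard(unsigned owning_shard,
                                          LifecyclePolicy& policy,
                                          Reader reader) {
    return seastar::smp::submit_to(owning_shard,
            [&policy, reader = std::move(reader)] () mutable {
        // Runs on the reader's home shard; no copying of leftover fragments
        // back and forth between shards.
        return policy.destroy_reader(std::move(reader));
    });
}
```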
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate`, which was put in place so it
could wait on the background stopping of readers.
A consequence of this move is that errors that might happen while
stopping the reader are now handled in the shard reader, rather than in
every lifecycle policy implementation.
Merged patch series by Pavel Emelyanov:
Alternator start and stop code sits inside main()
and is a big piece of code out there. Having it all in main
complicates reworking the start-stop sequences; it is much
handier to have it in alternator/.
This set moves the mentioned code into a transport- and thrift-
like controller model. While doing so, one more call to the global
storage service goes away.
* 'br-alternator-clientize' of https://github.com/xemul/scylla:
alternator: Move start-stop code into controller
alternator: Move the whole starting code into a sched group
alternator: Dont capture db, use cfg
alternator: Controller skeleton
alternator: Controller basement
alternator: Drop storage service from executor
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
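A minimal sketch of the validation, with toy stand-ins for the real types
(only the check-and-throw pattern is the point here):

```cpp
#include <stdexcept>
#include <utility>

// Toy stand-in for the real exception type.
struct malformed_sstable_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Reject misordered first/last keys up front, both on the write path (before
// sealing) and on the open/load path, instead of tripping an assert much
// later in leveled_manifest::overlapping. `Compare` is any tri-compare
// returning <0, 0 or >0.
template <typename Key, typename Compare>
void set_first_and_last_keys(Key& first_out, Key& last_out,
                             Key first, Key last, Compare cmp) {
    if (cmp(first, last) > 0) {
        throw malformed_sstable_exception("first and last keys are misordered");
    }
    first_out = std::move(first);
    last_out = std::move(last);
}
```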
This series also fixes issues with unit tests with
regard to first/last keys so that they won't fail the
validation.
Refs #8772
Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
* tag 'validate-first-and-last-keys-ordering-v1':
sstable: validate first and last keys ordering
test: lib: reusable_sst: save unexpected errors
test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
test: sstable_test: define primary key in schema for compressed sstable
* scylla-dev/raft-group-0-part-1-rebase:
raft: (service) pass Raft service into storage_service
raft: (service) add comments for boot steps
raft: add ordering for raft::server_address based on id
raft: (internal) simplify construction of tagged_id
raft: (internal) tagged_id minor improvements
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.
Since the Raft service depends on the query processor, and the
query processor depends on the storage service, break the
dependency loop by splitting Raft initialization into two steps:
first start an under-constructed instance of the "sharded" Raft
service, which accepts an under-constructed instance of the
"sharded" query_processor and is then passed into the storage
service start function; then load the local state of Raft groups
from system tables once the query processor starts.
Consistently abbreviate the raft_services instance as raft_svcs,
as is the convention in Scylla.
Update the tests.
reusable_sst tries opening an sstable using
all sstable format versions in descending order.
It is expected to see "file not found" if the
actual sstable version is not the latest one.
That said, we may hit another error if the sstable
is malformed in any way, so do not override
this kind of error if "file not found" errors
are hit after it, and return the unexpected error
instead.
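The retry logic, sketched below with assumed names (not the actual test
helper): the first error that is not "file not found" is remembered and
reported, even if later, older-version attempts fail with "file not found".

```cpp
#include <exception>
#include <vector>

// Try opening with each format version from newest to oldest. `try_open(v)`
// either returns the opened sstable or throws; `is_file_not_found(ep)` tells
// the expected miss apart from a real failure.
template <typename Version, typename TryOpen, typename IsFileNotFound>
auto open_newest_version(const std::vector<Version>& versions_desc,
                         TryOpen try_open, IsFileNotFound is_file_not_found) {
    std::exception_ptr unexpected;   // first "real" error, if any
    std::exception_ptr last;         // whatever the last attempt threw
    for (const auto& v : versions_desc) {
        try {
            return try_open(v);      // success, done
        } catch (...) {
            last = std::current_exception();
            if (!unexpected && !is_file_not_found(last)) {
                unexpected = last;   // don't let later misses override this
            }
        }
    }
    // Assumes versions_desc is non-empty, so at least one exception was caught.
    std::rethrow_exception(unexpected ? unexpected : last);
}
```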
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.
---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold above which step
3 will always fail; retrying the operation won't help). Also, the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in a new keyspace that uses EverywhereStrategy so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, a restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID.
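A hypothetical sketch of this new ordering (every helper name below is made
up for illustration; only utils::make_random_uuid() and db_clock are real):
the data goes in first under a random UUID, the timestamp is chosen only
after the write completes, and both are then gossiped together.

```cpp
// Hypothetical sketch -- write_generation_data_v2(), ring_delay() and
// gossip_generation_id() do not exist under these names, and the shape of
// cdc::generation_id_v2 is assumed; the real work is done by the cdc
// generation service.
future<cdc::generation_id_v2> make_and_broadcast_generation(generation_data data) {
    auto uuid = utils::make_random_uuid();                     // partition key first
    co_await write_generation_data_v2(uuid, std::move(data));  // CL=ALL, every node
    auto ts = db_clock::now() + 2 * ring_delay();              // timestamp chosen last
    cdc::generation_id_v2 id{ts, uuid};                        // assumed shape
    co_await gossip_generation_id(id);                         // readers use CL=ONE
    co_return id;
}
```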
Observe that after a node learns through gossip about a generation
broadcast using this new method, it will retrieve the generation's data
very quickly, since the node is one of the replicas and can use CL=ONE
because the data was written using CL=ALL.
The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
Fixes #7961.
---
For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.
dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally
Closes #8643
* github.com:scylladb/scylla:
docs: update cdc.md with info about the new internal table
sys_dist_ks: don't create old CDC generations table on service initialization
sys_dist_ks: rename all_tables() to ensured_tables()
cdc: when creating new generations, use format v2 if possible
main: pass feature_service to cdc::generation_service
gms: introduce CDC_GENERATIONS_V2 feature
cdc: introduce retrieve_generation_data
test: cdc: include new generations table in permissions test
sys_dist_ks: increase timeout for create_cdc_desc
sys_dist_ks: new table for exchanging CDC generations
tree-wide: introduce cdc::generation_id_v2
The purpose of the class in question is to start the sharded storage
service to make its global instance alive. I don't know when exactly
it happened, but no code that instantiates this wrapper really needs
the global storage service any more.
Ref: #2795
tests: unit(dev), perf_sstable(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
This warning triggers when a range-for loop ("for (auto x : range)") causes
non-trivial copies, prompting the developer to iterate by reference
instead. A few minor violations in the test suite are corrected.
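For illustration, the kind of change the warning asks for:

```cpp
#include <string>
#include <vector>

void example(const std::vector<std::string>& names) {
    // Warns: each iteration makes a non-trivial copy of a std::string.
    for (auto name : names) {
        (void)name;
    }
    // Fixed: iterate by (const) reference, no copy per element.
    for (const auto& name : names) {
        (void)name;
    }
}
```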
Closes #8699
Now the RPC module has some basic test coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in the `rpc_test` implementation.
Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.
This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.
It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.
* manmanson/raft-rpc-tests-v9-alt3:
raft: add tests for RPC module
test: add CHECK_EVENTUALLY_EQUAL utility macro
raft: replication_test: reset test rpc network between test runs
raft: replication_test: extract tickers initialization into a separate func
raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
raft: replication_test: introduce `test_server` aggregate struct
raft: replication_test: support voter<->learner configuration changes
raft: remove duplicate `create_command` function from `replication_test`
raft: avoid 'using' statements in raft testing helpers header
"
The patch set is an assorted collection of header cleanups, e.g.:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
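For illustration, the second kind of cleanup in a hypothetical header (the
names below are made up): when a type is only used by reference or pointer,
a forward declaration replaces the heavy include, which can then move into
the .cc file.

```cpp
// some_component.hh (hypothetical)

// Before: #include "database.hh"  -- pulls the full definition into every includer.
class database;                    // forward declaration is enough here

class some_component {
public:
    explicit some_component(database& db) : _db(db) {}  // used by reference only
private:
    database& _db;
};
```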
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for a 24-core/48-thread CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for the `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
"
There is a lot of global stuff in repair -- a bunch of pointers to
sharded services, the tracker, the map of metas (maybe more). This set
removes the first group; all those services have become main-local
recently. Along the way a call to the global storage proxy is dropped.
To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by reference
from main and spreads them internally where needed. The tracker and other
stuff is left global, but the tracker is now a candidate for merging
with the now-sharded repair_service, since it emulates the sharded
concept internally.
Overall the change is:
- make repair_service sharded and put all dependencies on it at start (see the sketch below)
- have sharded<repair_service> in the API and the storage service
- carry the service reference down to repair_info and repair_meta
constructions to give them the dependencies
- use the needed services in _info and _meta methods
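A sketch of the first bullet's shape (the repair_service constructor
parameters are assumptions, not the real ones): main starts the sharded
service once, passing the other sharded containers by reference, and each
shard-local instance picks its local dependencies from them.

```cpp
#include <seastar/core/sharded.hh>
#include <functional>

seastar::sharded<repair_service> repair;

// Called from main: start one repair_service instance per shard, handing it
// the sharded containers it depends on (assumed dependency list); each
// shard-local constructor can then grab the .local() instances it needs.
seastar::future<> start_repair(seastar::sharded<netw::messaging_service>& ms,
                               seastar::sharded<database>& db,
                               seastar::sharded<service::migration_manager>& mm) {
    return repair.start(std::ref(ms), std::ref(db), std::ref(mm));
}

// Stopped from main as well, tearing down every shard-local instance.
seastar::future<> stop_repair() {
    return repair.stop();
}
```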
tests: unit(dev), dtest.repair(dev)
"
* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
repair: Drop most of globals from repair
repair: Use local references in messaging handler checks
repair: Use local references in create_writer()
repair: Construct repair_meta with local references
repair: Keep more stuff on repair_info
repair: Kill bunch of global usages from insert_repair_meta
repair: Pass repair service down to meta insertion
repair: Keep local migration manager on repair_info
repair: Move unused db captures
repair: Remove unused ms captures
repair: Construct repair_info with service
repair: Loop over repair sharded container
repair: Make sync_data_using_repair a method
repair: Use repair from storage service
repair: Keep repair on storage service
repair: Make do_repair_start a method
repair: Pass repair_service through the API until do_repair_start
repair: Fix indentation after previous patch
repair: Split sync_data_using_repair
repair: Turn repair_range a repair_info method
...
It would be good to have a `CHECK` variant in addition
to the existing `REQUIRE_EVENTUALLY_EQUAL` macro. It will be used
in the Raft RPC tests.
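A hypothetical sketch of what such a macro could look like (not the actual
implementation): it retries the comparison with a growing sleep and, unlike
the REQUIRE flavour, records the failure with BOOST_CHECK_EQUAL so the test
keeps running.

```cpp
#include <boost/test/unit_test.hpp>
#include <seastar/core/sleep.hh>
#include <chrono>

// Must be used inside a seastar coroutine (it co_awaits the sleeps).
#define CHECK_EVENTUALLY_EQUAL(a, b) \
    do { \
        for (int retry = 0; retry < 10 && (a) != (b); ++retry) { \
            co_await seastar::sleep(std::chrono::milliseconds(1 << retry)); \
        } \
        BOOST_CHECK_EQUAL(a, b); \
    } while (0)
```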
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
A minimal alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is all the more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling groups on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but a different id. So
create the scheduling groups once on first access and keep them around
for as long as the test executable is running.
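A sketch of the create-once approach (the group names and shares below are
examples, not the real set main creates):

```cpp
#include <seastar/core/scheduling.hh>
#include <seastar/core/coroutine.hh>
#include <optional>

struct test_scheduling_groups {
    seastar::scheduling_group statement;
    seastar::scheduling_group compaction;
    seastar::scheduling_group streaming;
};

// Lazily create the groups on first access and cache them for the lifetime
// of the test executable, so repeated cql_test_env setup/teardown never
// tries to re-create a group with the same name but a different id.
seastar::future<test_scheduling_groups> get_scheduling_groups() {
    static std::optional<test_scheduling_groups> groups;
    if (!groups) {
        auto statement = co_await seastar::create_scheduling_group("statement", 1000);
        auto compaction = co_await seastar::create_scheduling_group("compaction", 1000);
        auto streaming = co_await seastar::create_scheduling_group("streaming", 200);
        groups = test_scheduling_groups{statement, compaction, streaming};
    }
    co_return *groups;
}
```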
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
The storage service calls a bunch of do_something_with_repair() methods. All
of them need the local repair_service, and the only way to get it is to
keep it on the storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Ref: #7617
This series adds timeout parameters to service levels.
Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).
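An illustrative sketch of that precedence (a plain C++ stand-in, not the
actual qos code):

```cpp
#include <chrono>
#include <optional>

// Resolve the timeout that applies to a request: an explicit per-query
// USING TIMEOUT wins, then the timeout of the service level attached to the
// role, and finally the scylla.yaml default.
std::chrono::milliseconds effective_timeout(
        std::optional<std::chrono::milliseconds> per_query,      // USING TIMEOUT
        std::optional<std::chrono::milliseconds> service_level,  // ATTACHed level
        std::chrono::milliseconds yaml_default) {
    if (per_query) {
        return *per_query;
    }
    if (service_level) {
        return *service_level;
    }
    return yaml_default;
}
```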
Closes #7913
* github.com:scylladb/scylla:
docs: add a paragraph describing service level timeouts
test: add per-service-level timeout tests
test: add refreshing client state
transport: add updating per-service-level params
client_state: allow updating per service level params
qos: allow returning combined service level options
qos: add a way of merging service level options
cql3: add preserving default values for per-sl timeouts
qos: make getting service level public
qos: make finding service level public
treewide: remove service level controller from query state
treewide: propagate service level to client state
sstables: disambiguate boost::find
cql3: add a timeout column to LIST SERVICE LEVEL statement
db: add extracting service level info via CQL
types: add a missing translation for cql_duration
cql3: allow unsetting service level timeouts
cql3: add validating service level timeout values
db: add setting service level params via system_distributed
cql3: add fetching service level attrs in ALTER and CREATE
cql3: add timeout to service level params
qos: add timeout to service level info
db,sys_dist_ks: add timeout to the service level table
migration_manager: allow table updates with timestamp
cql3: allow a null keyword for CQL properties
With a helper client state refresher, some attributes
that are usually only refreshed after a client disconnects
and then reconnects can be verified in the test suite.
The only reason why the storage service keeps a reference to the migration
notifier is that the latter was needed by cdc before the previous patch.
Now cdc gets the notifier directly from main, so the storage service is
a bit more off the hook.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only way db_context's migration notifier reference is set up
is via the cdc_service -> db_context::builder -> .build() chain of calls.
Since the builder's optional notifier reference is always
disengaged (the .with_migration_notifier is removed by the previous
patch), the only possible notifier reference there is the one from the
storage service which, in turn, is the same as in main.cc.
Given that -- push the notifier reference onto db_context directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This patchset adds future-returning close methods to all
flat_mutation_reader-s and makes sure that all readers
are explicitly closed and waited for.
The main motivation for doing so is to provide a path
for cancelling outstanding i/o requests via the input_stream
close (see https://github.com/scylladb/seastar/issues/859)
and waiting until they complete.
This series also introduces a stop
method to reader_concurrency_semaphore, to be used when
shutting down the database, instead of calling
clear_inactive_readers in the database destructor.
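A sketch of the closing discipline the series enforces (the reader type is a
generic stand-in for flat_mutation_reader): the reader is always closed and
the close is waited for, even when consuming it throws.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <exception>

template <typename Reader, typename Consume>
seastar::future<> consume_and_close(Reader reader, Consume consume) {
    std::exception_ptr ex;
    try {
        co_await consume(reader);         // use the reader
    } catch (...) {
        ex = std::current_exception();    // remember the failure...
    }
    co_await reader.close();              // ...but always close and wait for it
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```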
The series does not change microbenchmark performance in a significant way.
It looks like the results are within the tests' jitter.
- perf_simple_query: (in transactions per second, more is better)
before: median 184701.83 tps (90 allocs/op, 20 tasks/op)
after: median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%)
- perf_mutation_readers: (in time per iteration, less is better)
combined.one_row 65.042ns -> 57.961ns (-10.9%)
combined.single_active 46.634us -> 46.216us ( -0.9%)
combined.many_overlapping 364.752us -> 371.507us ( +1.9%)
combined.disjoint_interleaved 43.634us -> 43.448us ( -0.4%)
combined.disjoint_ranges 43.011us -> 42.991us ( -0.0%)
combined.overlapping_partitions_disjoint_rows 57.609us -> 58.820us ( +2.1%)
clustering_combined.ranges_generic 93.464ns -> 96.236ns ( +3.0%)
clustering_combined.ranges_specialized 86.537ns -> 87.645ns ( +1.3%)
memtable.one_partition_one_row 903.546ns -> 957.639ns ( +6.0%)
memtable.one_partition_many_rows 6.474us -> 6.444us ( -0.5%)
memtable.one_large_partition 905.593us -> 878.271us ( -3.0%)
memtable.many_partitions_one_row 13.815us -> 14.718us ( +6.5%)
memtable.many_partitions_many_rows 161.250us -> 158.590us ( -1.6%)
memtable.many_large_partitions 24.237ms -> 23.348ms ( -3.7%)
average -0.02%
Fixes #1076
Refs #2927
Test: unit(release, debug)
Perf: perf_mutation_readers, perf_simple_query (release)
Dtest: next-gating(release),
materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug)
"
* tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits)
mutation_reader: shard_reader: get rid of stop
mutation_reader: multishard_combining_reader: get rid of destructor
flat_mutation_reader: abort if not closed before destroyed
flat_mutation_reader: require close
repair: row_level_repair: run: close repair_meta when done
repair: repair_reader: close underlying reader on_end_of_stream
perf: everywhere: close flat_mutation_reader when done
test: everywhere: close flat_mutation_reader when done
mutation_partition: counter_write_query: close reader when done
index: built_indexes_reader: implement close
mutation_writer: multishard_writer: close readers when done
mutation_writer: feed_writer: close reader when done
table: for_all_partitions_slow: close iteration_step reader when done
view_builder: stop: close all build_step readers
stream_transfer_task: execute: close send_info reader when done
view_update_generator: start: close staging_sstable_reader when done
view: build_progress_virtual_reader: implement close method
view: generate_view_updates: close builder readers when done
view_builder: initialize_reader_at_current_token: close reader before reassigning it
view_builder: do_build_step: close build_step reader when done
...
Make sure to close the unregistered inactive_read
before it's destroyed, if the unregistered reader_opt
is engaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than explicitly generating it in all callers
and then not using the argument at all.
Prepare for providing a different exception_ptr
from a stop() path, to be introduced in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>