"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit. This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.
The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.
Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.
$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10
Before:
102500.38 tps ( 75.1 allocs/op, 12.1 tasks/op, 45620 insns/op)
After:
103957.53 tps ( 75.1 allocs/op, 12.1 tasks/op, 45372 insns/op)
Test: unit(dev)
DTest:
repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"
* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
flat_mutation_reader: get rid of timeout parameter
reader_concurrency_semaphore: use permit timeout for admission
reader_concurrency_semaphore: adjust reactivated reader timeout
multishard_mutation_query: create_reader: validate saved reader permit
repair: row_level: read_mutation_fragment: set reader timeout
flat_mutation_reader: maybe_timed_out: use permit timeout
test: sstable_datafile_test: add sstable_reader_with_timeout
reader_permit: add timeout member
With data segregation on repair, thousands of sstables are potentially
added to maintenance set which causes high latency due to stalls.
That's because N*M sstables are created by a repair,
where N = # of ranges
and M = # of segregations
For TWCS, M = # of windows.
Assuming N = 768 and M = 20, ~15k sstables end up in sstable set
To fix this problem, let's avoid performing data segregation in repair,
as offstrategy will already perform the segregation anyway.
So from now on, only N non-overlapping sstables will be added to set.
Read amplification isn't affected because a query will only touch one
sstable in maintenance set.
When offstrategy starts, it will pick all sstables from set and
compact them in a single step while performing data segregation,
so data is properly laid out before integrated into the main set.
tests:
- sstable_compaction_test.twcs_reshape_with_disjoint_set_test
- mode(dev)
- manual test using repair-based bootstrap
Fixes#9199.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210824185043.76475-1-raphaelsc@scylladb.com>
As a preparation for up-front admission, add a permit parameter to
`make_streaming_reader()`, which will be the admitted permit once we
switch to up-front admission. For now it has to be a non-admitted
permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic "streaming" one.
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.
Use a coroutine to simplify error handling and
close the reader if an exception is caught.
Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
their and return an exceptional future early, since
the reader will not be moved to sst->write_components, that assumes
ownership over it and closes it in all cases.
Fixes#8776
Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_while_table_is_dropped_test (dev, debug) w/ https://github.com/scylladb/scylla/pull/8635#issuecomment-856661138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8782
* github.com:scylladb/scylla:
streaming: make_streaming_consumer: close reader on errors
streaming: make_streaming_consumer: coroutinize returned function
The off-strategy compaction is now enabled for repair based node
operations. It is not bound to repair based node operations though. It
makes sense to enable it for streaming based node operations too.
Fixes#8820Closes#8821
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.
Use a coroutine to simplify error handling and
close the reader if an exception is caught.
Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
their and return an exceptional future early, since
the reader will not be moved to sst->write_components, that assumes
ownership over it and closes it in all cases.
Fixes#8776
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Both streaming and repair call the distributed sstables writing with
equal lambdas each being ~30 lines of code. The only difference between
them is repair might request offstrategy compaction for new sstable.
Generalization of these two pieces save lines of codes and speeds the
release/repair/row_level.o compilation by half a minute (out of twelve).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210531133113.23003-1-xemul@scylladb.com>
"
This patchset adds future-returning close methods to all
flat_mutation_reader-s and makes sure that all readers
are explicitly closed and waited for.
The main motivation for doing so is for providing a path
for cancelling outstanding i/o requests via a the input_stream
close (See https://github.com/scylladb/seastar/issues/859)
and wait until they complete.
Also, this series also introduces a stop
method to reader_concurrency_semaphore to be used when
shutting down the database, instead of calling
clear_inactive_readers in the database destructor.
The series does not change microbenchmarks performance in a significant way.
It looks like the results are within the tests' jitter.
- perf_simple_query: (in transactions per second, more is better)
before: median 184701.83 tps (90 allocs/op, 20 tasks/op)
after: median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%)
- perf_mutation_readers: (in time per iteration, less is better)
combined.one_row 65.042ns -> 57.961ns (-10.9%)
combined.single_active 46.634us -> 46.216us ( -0.9%)
combined.many_overlapping 364.752us -> 371.507us ( +1.9%)
combined.disjoint_interleaved 43.634us -> 43.448us ( -0.4%)
combined.disjoint_ranges 43.011us -> 42.991us ( -0.0%)
combined.overlapping_partitions_disjoint_rows 57.609us -> 58.820us ( +2.1%)
clustering_combined.ranges_generic 93.464ns -> 96.236ns ( +3.0%)
clustering_combined.ranges_specialized 86.537ns -> 87.645ns ( +1.3%)
memtable.one_partition_one_row 903.546ns -> 957.639ns ( +6.0%)
memtable.one_partition_many_rows 6.474us -> 6.444us ( -0.5%)
memtable.one_large_partition 905.593us -> 878.271us ( -3.0%)
memtable.many_partitions_one_row 13.815us -> 14.718us ( +6.5%)
memtable.many_partitions_many_rows 161.250us -> 158.590us ( -1.6%)
memtable.many_large_partitions 24.237ms -> 23.348ms ( -3.7%)
average -0.02%
Fixes#1076
Refs #2927
Test: unit(release, debug)
Perf: perf_mutation_readers, perf_simple_query (release)
Dtest: next-gating(release),
materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug)
"
* tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits)
mutation_reader: shard_reader: get rid of stop
mutation_reader: multishard_combining_reader: get rid of destructor
flat_mutation_reader: abort if not closed before destroyed
flat_mutation_reader: require close
repair: row_level_repair: run: close repair_meta when done
repair: repair_reader: close underlying reader on_end_of_stream
perf: everywhere: close flat_mutation_reader when done
test: everywhere: close flat_mutation_reader when done
mutation_partition: counter_write_query: close reader when done
index: built_indexes_reader: implement close
mutation_writer: multishard_writer: close readers when done
mutation_writer: feed_writer: close reader when done
table: for_all_partitions_slow: close iteration_step reader when done
view_builder: stop: close all build_step readers
stream_transfer_task: execute: close send_info reader when done
view_update_generator: start: close staging_sstable_reader when done
view: build_progress_virtual_reader: implement close method
view: generate_view_updates: close builder readers when done
view_builder: initialize_reader_at_current_token: close reader before reassigning it
view_builder: do_build_step: close build_step reader when done
...
These two helpers are now namespace-scoped methods, but both
need the migration manager instance inside. All their callers
are now patched to have the migration manager at hands, so
the helpers can be turned into methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is dangerous to print a formatted string as is, like
sslog.warn(err.c_str()) since it might hold curly braces ('{}')
and those require respective runtime args.
Instead, it should be logged as e.g. sslog.warn("{}", err.c_str()).
This will prevent issues like #8436.
Refs #8436
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-2-bhalevy@scylladb.com>
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parmeters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_phaser and _streaming_flush_date are
no longer used to syncrhonize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.
Closes#7454
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kind of
reads, but different code paths the read took.
As not all read can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
Streaming with RPC stream is supported for over 2 years and upgrades
are only allowed from versions which already have the support,
so the checks are hereby dropped.
... and tests. Printin a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Showing raw pointer values in logs is not considered to be good
practice. However, for debugging/tracing this might be helpful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is useful to distinguish if the repair is a regular repair or used
for node operations.
In addition, log the keyspace and tables are repaired.
Fixes#7086
There are 4 places that call this helper:
- storage proxy. Callers are rpc verb handlers and already have the proxy
at hands from which they can get the messaging service instance
- repair. There's local-global messaging instance at hands, and the caller
is in verb handler too
- streaming. The caller is verb handler, which is unregistered on stop, so
the messaging service instance can be captured
- migration manager itself. The caller already uses "this", so the messaging
service instance can be get from it
The better approach would be to make get_schema_definition be the method of
migration_manager, but the manager is stopped for real on shutdown, thus
referencing it from the callers might not be safe and needs revisiting. At
the same time the messaging service is always alive, so using its reference
is safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Streaming uses messaging, init it with itw own reference.
Nowadays the whole streaming subsystem uses global static references on the
needed services. This is not nice, but still better than just using code-wide
globals, so treat the messaging service here the same way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
5 services register handlers in messaging, but not all of them
have clear unregistration methods.
Summary:
migration_manager: everything is in place, no changes
gossiper: ditto
proxy: some verbs unregistration is missing
repair: no unregistration at all
streaming: ditto
This patch adds the needed unregistration methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.
Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
"
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:
```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.
In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```
Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.
After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```
Since 127.0.0.5 was dead, the source node check failed, so the bootstrap
operation.
Ta fix, we ignore the table of RF 1 when it is unable to find a source
node to stream.
Fixes#6351
"
* asias-fix_bootstrap_with_rf_one_in_range_streamer:
range_streamer: Handle table of RF 1 in get_range_fetch_map
streaming: Use separate streaming reason for replace operation
Current sender sends stream_mutation_fragments_cmd::end_of_stream to
receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream to
be written into the sstable when a partition is not closed yet.
In addition, use mutation_fragment_stream_validator to valid the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.
Refs: #6478
There's a static global instance of needed services and helpers
for it in streaming code. This is not great to use them, but at
least this change unifies different pieces of streaming code and
removes the storage_service.hh from streaming_session.cc (the
streaming_sessio.hh doesn't include it either).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of ac998e95 -- the sched group is
switched by messaging service for a verb, no need to do
it by hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>