The memory usage is now maintained and updated on each change to the
mutation fragment, so it need not be recalculated on a call to
`memory_usage()`; hence the schema parameter is unused and can be
removed.
The memory usage of mutation fragments is now tracked throughout their
lifetime via a reader permit. This was the last major (to my current
knowledge) untracked piece of the reader pipeline.
We want to start tracking the memory consumption of mutation fragments.
For this we need the schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed
to the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
We will soon want to update the memory consumption of a mutation
fragment after each modification done to it. To do that safely we have
to forbid direct access to the underlying data and instead have callers
pass a lambda doing their modifications.
Call sites that only used this method to move the fragment away are
converted to use `as_mutation_start() &&`.
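As a rough illustration of the pattern (not the actual Scylla
interface; the fragment and member names below are invented), a plain
mutable accessor can be split into an rvalue-qualified consuming
overload plus a lambda-taking mutator that recomputes the tracked size
right after the caller's modifications:

    // Hypothetical sketch of the accessor split described above.
    #include <cstddef>
    #include <utility>

    struct clustering_row {
        std::size_t external_memory_usage() const { return 0; } // placeholder
    };

    class fragment {
        clustering_row _row;
        std::size_t _memory = 0; // kept up to date on every change

    public:
        // Consuming accessor: only callable on rvalues, so callers can
        // still move the payload out but cannot mutate it in place.
        clustering_row as_clustering_row() && { return std::move(_row); }

        // Mutation goes through a lambda, so the tracked size can be
        // recalculated immediately after the modification.
        template <typename Func>
        void mutate_as_clustering_row(Func&& func) {
            std::forward<Func>(func)(_row);
            _memory = sizeof(*this) + _row.external_memory_usage();
        }
    };

The move-away case then reads as `std::move(frag).as_clustering_row()`,
which is what the `as_*() &&` conversions in these patches amount to.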
We will soon want to update the memory consumption of a mutation
fragment after each modification done to it. To do that safely we have
to forbid direct access to the underlying data and instead have callers
pass a lambda doing their modifications.
Call sites that only used this method to move the fragment away are
converted to use `as_range_tombstone() &&`.
We will soon want to update the memory consumption of a mutation
fragment after each modification done to it. To do that safely we have
to forbid direct access to the underlying data and instead have callers
pass a lambda doing their modifications.
Call sites that only used this method to move the fragment away are
converted to use `as_clustering_row() &&`.
We will soon want to update the memory consumption of a mutation
fragment after each modification done to it. To do that safely we have
to forbid direct access to the underlying data and instead have callers
pass a lambda doing their modifications.
Call sites that only used this method to move the fragment away are
converted to use `as_static_row() &&`.
Via a tracked_allocator. Although the memory allocations made by the
_buffer shouldn't dominate the memory consumption of the read itself,
they can still be a significant portion that scales with the number of
readers in the read.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track the memory consumption of
`_buffer`.
This can be used with standard containers, and with any other container
using the std::allocator interface, to track the allocations they make
via a reader_permit.
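A minimal sketch of the idea, assuming C++20 and a permit-like object
exposing consume/signal calls for memory (the names below are
illustrative, not the actual reader_permit API):

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Stand-in for the real permit; only what the allocator needs.
    struct permit {
        std::size_t consumed = 0;
        void consume_memory(std::size_t n) { consumed += n; }
        void signal_memory(std::size_t n) { consumed -= n; }
    };

    // std::allocator-compatible adaptor reporting every (de)allocation.
    template <typename T>
    class tracking_allocator {
        permit* _permit;
        template <typename U> friend class tracking_allocator;
    public:
        using value_type = T;

        explicit tracking_allocator(permit& p) noexcept : _permit(&p) {}
        template <typename U>
        tracking_allocator(const tracking_allocator<U>& o) noexcept : _permit(o._permit) {}

        T* allocate(std::size_t n) {
            T* p = std::allocator<T>{}.allocate(n);
            _permit->consume_memory(n * sizeof(T));
            return p;
        }
        void deallocate(T* p, std::size_t n) noexcept {
            std::allocator<T>{}.deallocate(p, n);
            _permit->signal_memory(n * sizeof(T));
        }

        friend bool operator==(const tracking_allocator&, const tracking_allocator&) = default;
    };

    // Usage: a container whose allocations are charged to the permit.
    // permit p;
    // std::vector<int, tracking_allocator<int>> v{tracking_allocator<int>(p)};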
In the next patches we plan to start tracking the memory consumption of
the actual allocations made by the circular_buffer<mutation_fragment>,
as well as the memory consumed by the mutation fragments.
This means that readers will start consuming memory off the permit right
after being constructed. Ironically this can prevent the reader from
being admitted, due to its own pre-admission memory consumption. To
prevent this, hold off on forwarding the memory consumption to the
semaphore until the permit is actually admitted.
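The mechanism could look roughly like this sketch (hypothetical names,
not the actual reader_permit implementation): consumption registered
before admission is accumulated locally and only flushed to the
semaphore once the permit is admitted.

    #include <cstddef>

    // Stand-in for the semaphore's memory accounting.
    struct semaphore {
        std::size_t used = 0;
        void consume(std::size_t n) { used += n; }
        void signal(std::size_t n) { used -= n; }
    };

    // Hypothetical permit that defers forwarding until admission.
    class permit {
        semaphore& _sem;
        bool _admitted = false;
        std::size_t _pending = 0;   // consumed before admission
    public:
        explicit permit(semaphore& sem) : _sem(sem) {}

        void consume_memory(std::size_t n) {
            if (_admitted) {
                _sem.consume(n);    // forward immediately once admitted
            } else {
                _pending += n;      // otherwise just remember it locally
            }
        }

        void on_admission() {
            _admitted = true;
            _sem.consume(_pending); // flush the pre-admission consumption
            _pending = 0;
        }
    };

This way a reader's own pre-admission buffer cannot count against its
admission.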
Track all resources consumed through the permit inside the permit. This
allows querying how much memory each read is consuming (as there should
be one read per permit). Although this might be interesting, especially
when debugging OOM cores, the real reason we are doing this is to be
able to forward resource consumption to the semaphore only
post-admission. More on this in the patch introducing it.
Another advantage of tracking resources consumed through the permit is
that we can now detect resource leaks in the permit destructor and
report them. Even if it is just a case of the holder of the resources
wanting to release them later, releasing them after the permit is
destroyed would be a use-after-free.
In the next patches the reader permit will gain members that are shared
across all instances of the same permit. To facilitate this, move all
internals into an impl class, of which the permit stores a shared
pointer. We use a shared_ptr to avoid defining `impl` in the header.
This is how the reader permit started out in the beginning. We've come
full circle. :)
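The shape of the change is roughly the following (an illustrative
sketch, not the exact Scylla declarations):

    // reader_permit.hh -- `impl` is only forward-declared here, so its
    // definition can stay in the .cc file.
    #include <cstddef>
    #include <memory>

    class reader_permit {
        class impl;                   // defined in reader_permit.cc
        std::shared_ptr<impl> _impl;  // shared by all copies of the permit
    public:
        explicit reader_permit(std::shared_ptr<impl> i) : _impl(std::move(i)) {}
        void consume_memory(std::size_t n);
    };

    // reader_permit.cc
    // class reader_permit::impl { ... state shared by all copies ... };
    // void reader_permit::consume_memory(std::size_t n) { /* _impl->... */ }

Unlike unique_ptr, shared_ptr type-erases its deleter at construction
time, so the header compiles and the permit can be destroyed without
the definition of `impl` being visible.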
And do all consuming and signalling through these methods. These
operations will soon be more involved than the simple forwarding they do
today, so we want to centralize them in a single method pair.
In the next patches we want to introduce per-permit resource tracking --
that is, have each permit track the amount of resources consumed through
it. For this, we need all consumption to happen through a permit, and
not directly with the semaphore.
To ensure progress at all times. This is due to evictable readers, which
still hold on to a buffer even when their underlying reader is evicted.
As we are introducing buffer and mutation fragment tracking in the next
patches, these readers will hold on to memory even in this state, so it
may theoretically happen that even though no readers are admitted (all
count resources are available), no reader can be admitted due to lack of
memory. To prevent such deadlocks we now always admit one reader if all
count resources are available.
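A sketch of the resulting admission rule (field names invented for
illustration): admit when the request fits, or when every count unit is
free, regardless of how much memory is still held.

    #include <cstddef>

    struct resources {
        int count;
        std::size_t memory;
    };

    struct semaphore_state {
        resources total;      // configured limits
        resources available;  // currently free

        // Admit if the request fits, or -- to guarantee progress -- if
        // nothing is admitted at all (all count units are free), even
        // when evicted-but-buffering readers still hold memory.
        bool can_admit(const resources& requested) const {
            bool fits = requested.count <= available.count
                     && requested.memory <= available.memory;
            bool nothing_admitted = available.count == total.count;
            return fits || nothing_admitted;
        }
    };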
Current code uses a single counter to produce multiple buffers' worth of
data. This relies on carry-over from one buffer to the other, which
happens to work with the current memory accounting but is very fragile.
Account for each buffer separately, resetting the counter between them.
The test consumes all resources off the semaphore, leaving just enough
to admit a single reader. However, this amount is calculated based on
the base cost of readers, and as we are going to track reader buffers as
well, the amount of memory consumed will become much less predictable.
So, to make sure background readers can finish during shutdown, release
all the consumed resources before leaving scope.
There is no point in continuing to process the entire buffer once a
failure was found. Especially since an early failure might introduce
conditions that are not handled in the normal flow path. We could handle
these, but there is no point in the added complexity; at this point the
test has failed anyway.
Some tests rely on `consume*()` calls on the permit taking effect
immediately. Soon this will only be true once the permit has been
admitted, so make sure the permit is admitted in these tests.
Currently per-shard reader contexts are cleaned up as soon as the reader
itself is destroyed. This causes two problems:
* Continuations attached to the reader's destroy future might rely on
parts of the context being kept alive -- like the semaphore.
* Shard 0's semaphore is special, as it will be used to account for
buffers allocated by the multishard reader itself, so it has to stay
alive until after all readers are destroyed.
This patch changes this so that contexts are destroyed only when the
lifecycle policy itself is destroyed.
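In sketch form (invented names, not the actual lifecycle policy
interface), the policy rather than the readers now owns the per-shard
contexts, so they live until the policy itself is destroyed:

    #include <memory>
    #include <vector>

    struct reader_context_sketch { /* semaphore, ... */ };

    class lifecycle_policy_sketch {
        std::vector<std::shared_ptr<reader_context_sketch>> _contexts;
    public:
        std::shared_ptr<reader_context_sketch> get_or_create(unsigned shard) {
            if (_contexts.size() <= shard) {
                _contexts.resize(shard + 1);
            }
            if (!_contexts[shard]) {
                _contexts[shard] = std::make_shared<reader_context_sketch>();
            }
            return _contexts[shard];
        }
        // The contexts are destroyed together with the policy, not when
        // the readers using them are destroyed.
    };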
* seastar e215023c7...292ba734b (4):
> future: Fix move of futures of reference type
> doc: fix hyper link to tutorial.html
> tutorial: fix formatting of code block
> README.md: fix the formatting of table
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.
Fixes: #7208
Tests: unit(dev), mutation_reader_test:debug (v4)
* botond/evictable-reader-validate-buffer/v5:
mutation_reader_test: add unit test for evictable reader self-validation
evictable_reader: validate buffer after recreation the underlying
evictable_reader: update_next_position(): only use peek'd position on partition boundary
mutation_reader_test: add unit test for evictable reader range tombstone trimming
evictable_reader: trim range tombstones to the read clustering range
position_in_partition_view: add position_in_partition_view before_key() overload
flat_mutation_reader: add buffer() accessor
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader. Several things are checked:
* The partition is the expected one -- the one we were in the middle of
or the next if we stopped at partition boundaries.
* The partition is in the read range.
* The first fragment in the partition is the expected one -- it has an
equal or larger position than the next expected fragment.
* The fragment is in the clustering range as defined by the slice.
As these validations are only done on the slow-path of recreating an
evicted reader, no performance impact is expected.
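A much simplified sketch of such a check (the position and fragment
types below are placeholders, not the actual position_in_partition and
mutation_fragment interfaces):

    #include <stdexcept>

    struct position { long value; };
    struct fragment { position pos; };

    // Validate that the recreated reader resumes at or after the
    // position we expect, failing the read instead of letting an
    // out-of-order fragment into the stream.
    inline void validate_resume_position(const fragment& first,
                                         const position& expected_next) {
        if (first.pos.value < expected_next.value) {
            throw std::runtime_error(
                "evictable reader restarted from an unexpected position");
        }
    }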
`evictable_reader::update_next_position()` is used to record the position the
reader will continue from in the next buffer fill. This position is used to
create the partition slice when the underlying reader is evicted and has
to be recreated. There is an optimization in this method -- if the
underlying reader's buffer is not empty we peek at the first fragment in
it and use its position as the next position. This is however
problematic for buffer validation on reader recreation (introduced in
the next patch), because using the next row's position as the next
position allows range tombstones to be emitted with
before_key(next_pos.key()), which would trigger a validation failure.
Instead of working around this, just drop this optimization for
mid-partition positions; it is inconsequential anyway.
We keep it where it is important: when we detect that we are at a
partition boundary. In this case we can avoid reading the current
partition altogether when recreating the reader.
Currently mutation sources are allowed to emit range tombstones that are
outside the clustering read range if they are relevant to it. For
example, a read of the clustering range [ck100, +inf) might start with:
range_tombstone{start={ck1, -1}, end={ck200, 1}},
clustering_row{ck100}
The range tombstone is relevant to the range and to its first row, so it
is emitted first, but its position (its start) is outside the read
range. This is normally fine, but it poses a problem for the evictable
reader. When the underlying reader is evicted and has to be recreated
from a certain clustering position, this results in out-of-order
mutation fragments being inserted into the middle of the stream. That is
not fine anymore, as the monotonicity guarantee of the stream is
violated. The real solution would be to require all mutation sources to
trim range tombstones to their read range, but that is a lot of work.
Until that is done, as a workaround, we do this trimming in the
evictable reader itself.
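The workaround amounts to clamping the tombstone's bounds to the read
range before emitting it; roughly (positions are modelled as plain
integers here, while the real code compares position_in_partition
values with a schema-aware comparator):

    struct range_tombstone_sketch {
        int start;
        int end;
    };

    struct clustering_range_sketch {
        int start;
        int end;
    };

    // Clamp the tombstone so its positions never fall outside the read
    // range; the deletion it represents within the range is unchanged.
    inline range_tombstone_sketch trim_to_range(range_tombstone_sketch rt,
                                                const clustering_range_sketch& range) {
        if (rt.start < range.start) {
            rt.start = range.start;
        }
        if (rt.end > range.end) {
            rt.end = range.end;
        }
        return rt;
    }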
The "mode" variable name is used everywhere, usually in a loop.
Therefore, rename the global "mode" to "checkheaders_mode" so that if
your code block happens to be outside of a loop, you don't accidentally
use the globally visible "mode" and spend hours debugging why it's
always "dev".
Spotted by Yaron Kaikov.
Message-Id: <20200924112237.315817-1-penberg@scylladb.com>
The script pull_github_pr.sh uses git merge's "--log" option to put in
the merge commit the list of titles of the individual patches being
merged in. This list is useful when later searching the log for the merge
which introduced a specific feature.
Unfortunately, "--log" defaults to cutting off the list of commit titles
at 20 lines. For most merges involving fewer than 20 commits, this makes
no difference. But some merges include more than 20 commits, and get
a truncated list, for no good reason. If someone worked hard to create a
patch set with 40 patches, the last thing we should be worried about is
that the merge commit message will be 20 lines longer.
Unfortunately, there appears to be no way to tell "--log" to not limit
the length at all. So I chose an arbitrary limit of 1000. I don't think
we ever had a patch set in Scylla which exceeded that limit. Yet :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200924114403.817893-1-nyh@scylladb.com>
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.
The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decommissioned or removed, it instead drains hints managers for all
endpoints.
In the case of decommission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.
End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.
To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
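The fix boils down to a check like the following (a sketch with made-up
names, not the actual hints manager code): once draining has started,
store_hint refuses to lazily create new per-endpoint managers.

    #include <map>
    #include <string>

    // Greatly simplified sketch of the idea behind the fix.
    class hints_manager_sketch {
        std::map<std::string, int> _ep_managers; // endpoint -> state
        bool _draining_all = false;  // set on decommission/removenode

    public:
        bool store_hint(const std::string& endpoint) {
            if (_draining_all) {
                // Refuse to create a per-endpoint manager that the
                // ongoing drain would destroy without draining it.
                return false;
            }
            _ep_managers.try_emplace(endpoint, 0);
            return true;
        }

        void drain_all() {
            _draining_all = true;
            // ... drain all existing per-endpoint managers, then:
            _ep_managers.clear();
        }
    };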
Fixes #7257
Closes #7278
Currently, sstable_manager is used to create sstables, but it loses track
of them immediately afterwards. This series makes an sstable's life fully
contained within its sstable_manager.
The first practical impact (implemented in this series) is that file removal
stops being a background job; instead it is tracked by the sstable_manager,
so when the sstable_manager is stopped, you know that all of its sstable
activity is complete.
Later, we can make use of this to track the data size on disk, but this is not
implemented here.
Closes #7253
* github.com:scylladb/scylla:
sstables: remove background_jobs(), await_background_jobs()
sstables: make sstables_manager take charge of closing sstables
test: test_env: hold sstables_manager with a unique_ptr
test: drop test_sstable_manager
test: sstables::test_env: take ownership of manager
test: broken_sstable_test: prepare for asynchronously closed sstables_manager
test: sstable_utils: close test_env after use
test: sstable_test: dont leak shared_sstable outside its test_env's lifetime
test: sstables::test_env: close self in do_with helpers
test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
test: view_build_test: prepare for asynchronously closed sstables_manager
test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
test: sstable_directory_test: prepare for asynchronously closed sstables_manager
test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
test: sstable_3_x_test: remove test_sstables_manager references
test: schema_changes_test: drop use of test_sstables_manager
mutation_test: adjust for column_family_test_config accepting an sstables_manager
test: lib: sstable_utils: stop using test_sstables_manager
test: sstables test_env: introduce manager() accessor
test: sstables test_env: introduce do_with_async_sharded()
test: sstables test_env: introduce do_with_async_returning()
test: lib: sstable test_env: prepare for life as a sharded<> service
test: schema_changes_test: properly close sstables::test_env
test: sstable_mutation_test: avoid constructing temporary sstables::test_env
test: mutation_reader_test: avoid constructing temporary sstables::test_env
test: sstable_3_x_test: avoid constructing temporary sstables::test_env
test: lib: test_services: pass sstables_manager to column_family_test_config
test: lib: sstables test_env: implement tests_env::manager()
test: sstable_test: detemplate write_and_validate_sst()
test: sstable_test_env: detemplate do_with_async()
test: sstable_datafile_test: drop bad 'return'
table: clear sstable set when stopping
table: prevent table::stop() race with table::query()
database: close sstable_manager:s
sstables_manager: introduce a stub close()
sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
test: sstable_datafile_test: reorder table stop in compaction_manager_test
test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
view: view_update_generator: drop references to sstables when stopping