Disabling compaction of a region is currently done in order to keep
references into it valid. But disabling compaction alone is not enough;
we also need to disable eviction, since it invalidates references as
well. Rather than introducing another type of lock, compaction and
eviction are controlled together, generalized as "reclaiming" (hence
the reclaim_lock).
The goal is to make allocation less likely to fail. With the async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points, but this bound is difficult to
enforce. The sync reclaimer lifts this limitation.
Also, allocations which previously could not be satisfied because of
fragmentation now have a higher chance of succeeding, although,
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid such
allocations.
The downside of sync reclaiming is that references into regions may now
be invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably only
temporarily.
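
A minimal usage sketch of pinning with the lock; apart from reclaim_lock
itself (named above), the accessors here are assumed for illustration:

    logalloc::region& region = cache_region();   // hypothetical accessor
    {
        logalloc::reclaim_lock lock(region);     // blocks compaction and eviction
        auto& entry = find_entry(region, key);   // reference into the region...
        process(entry);                          // ...stays valid even if this
                                                 // allocates and triggers reclaim
    }                                            // unlocked; keep the scope short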
* seastar 5176352...68fee6c (1):
> Merge "Memory reclamation infrastructure follow-up" from Tomasz
Adjusted logalloc::tracker's reclaimer to fit new API
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SSTables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables, are inspected for high-water marks, and replay then proceeds from
those marks to amend mutations potentially lost in a crash
* Note that a CPU count change is "handled" only insofar as shard matching is
done against the _previous_ run's shards, not the current ones.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like in origin. Partly because I am lazy, but also partly because our
serialization format differs, and we currently have no tools to do anything
useful with such files.
* No replay filtering (origin allows a system property to designate a filter
file detailing which keyspace/CFs to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could
not really come up with a good one given the existing test
infrastructure (it is tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), then kill -9 + restart.
This of course does not fully validate that the resulting DB is 100%
identical to the one at the time of the kill -9, but at least it verifies
that replay took place and that mutations were applied.
(Note that origin also lacks validity testing)"
Fixes #98.
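
A sketch of the segment ID scheme described above; the bit layout and
all names here are assumptions for illustration, not the actual
implementation:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>

    static uint64_t next_segment_id(uint64_t prev_base, unsigned shard) {
        using namespace std::chrono;
        uint64_t now_ms = duration_cast<milliseconds>(
                system_clock::now().time_since_epoch()).count();
        // max() keeps IDs monotonic even if the wall clock jumps backwards
        uint64_t base = std::max(prev_base + 1, now_ms);
        // fold in the shard so IDs from different shards cannot collide
        return (base << 10) | (shard & 0x3ff);   // hypothetical layout
    }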
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
By using a recognized idiom, gcc can optimize the unaligned little-endian
load into a single instruction (actually less than an instruction, since it
combines it with a succeeding arithmetic operation).
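
A minimal sketch of the idiom in question, shown for a 32-bit value:
memcpy into an integer, which gcc folds into a single unaligned load
(a little-endian host is assumed here; otherwise add a byte swap):

    #include <cstdint>
    #include <cstring>

    inline uint32_t read_le_u32(const void* p) {
        uint32_t v;
        std::memcpy(&v, p, sizeof(v));  // compiles to one mov on x86
        return v;   // no byte-by-byte shifting and or-ing in the generated code
    }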
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables are inspected for high water mark, and then replayed from
those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
per _previous_ runs shards, not current.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like origin. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cf:s to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations where applied.
(Note that origin also lacks validity testing)"
When stopping a task, we shouldn't retry a compaction: if we are
removing a cf and an error happens, we would push the cf back into
the back of the queue, and that would possibly lead to a
use-after-free.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses 2 tasks for
handling compactions.
The solution is to not duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Instead of failing normal allocations when the seastar allocator cannot
allocate a segment, provide a generous reserve. An allocation failure
will now be satisfied from the reserve, but it will still trigger a
reclaim. This allows hiding low-memory conditions from the user.
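
The gist of the failure path, as an illustrative pseudo-flow (not
seastar's actual code; the helper names are assumed):

    void* allocate(size_t n) {
        if (void* p = try_allocate(n)) {
            return p;
        }
        void* p = take_from_reserve(n);   // generous reserve absorbs the failure
        trigger_reclaim();                // start reclaiming to refill the reserve
        return p;                         // the caller never sees the shortage
    }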
Like boost::dynamic_bitset, but less capable. On the other hand, it
avoids the very large allocations incurred by the bloom filter's bitset
on even moderately sized sstables.
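
A minimal sketch of the idea, with illustrative names and an assumed
chunk size (not the actual class): the bits live in fixed-size chunks,
so no single allocation grows with the filter size:

    #include <cstdint>
    #include <memory>
    #include <vector>

    class chunked_bitset {
        static constexpr size_t bits_per_chunk = 128 * 1024 * 8;  // 128 KiB chunks
        std::vector<std::unique_ptr<uint64_t[]>> _chunks;
    public:
        explicit chunked_bitset(size_t nbits) {
            for (size_t n = 0; n < nbits; n += bits_per_chunk) {
                // make_unique value-initializes, so all bits start cleared
                _chunks.push_back(std::make_unique<uint64_t[]>(bits_per_chunk / 64));
            }
        }
        void set(size_t i) {
            _chunks[i / bits_per_chunk][i % bits_per_chunk / 64] |= uint64_t(1) << (i % 64);
        }
        bool test(size_t i) const {
            return _chunks[i / bits_per_chunk][i % bits_per_chunk / 64] >> (i % 64) & 1;
        }
    };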
To stop a compaction manager task, we first close the gate used by
compaction, then break the semaphore via semaphore::broken().
The problem is that semaphore::broken() only signals current waiters,
so subsequent semaphore::wait() calls would still succeed and the task
would remain alive forever.
The fix is to signal the semaphore, forcing the task to exit via a gate
exception, so we no longer rely on semaphore::broken() for finishing
the task. That's possible because we access the gate right after
waiting on the semaphore.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
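
The resulting pattern, roughly; a simplified sketch, not the actual
compaction manager code (the real code also enters the gate around the
compaction work itself so that stop() waits for it):

    seastar::future<> task_loop() {
        return seastar::keep_doing([this] {
            return _sem.wait().then([this] {
                _gate.check();    // throws gate_closed_exception after stop()
                return do_compaction();
            });
        }).handle_exception_type([] (const seastar::gate_closed_exception&) {
            // expected exit path when stopping
        });
    }

    seastar::future<> stop() {
        auto f = _gate.close();   // closes the gate, waits for entered work
        _sem.signal();            // wake the fiber; its next gate check throws
        return f;
    }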
We need a way to remove a column family from the compaction manager,
because when dropping a column family we must make sure the compaction
manager no longer holds a reference to it.
The compaction manager queue is now a queue of column_family, allowing
us to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.
The testcase for the compaction manager was also adapted and improved.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This heavily used function shows up in many places in the profile (as part
of other functions), so it's worth optimizing by eliminating the special
case for the standard allocator. Use a statically allocated object instead;
a non-thread-local object is fine since it has no data members.
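
The assumed shape of the change (names are illustrative): one shared,
stateless instance replaces the per-call special case:

    // Safe to share across shards: the object carries no data members.
    static standard_allocation_strategy standard_allocation_strategy_instance;

    allocation_strategy& standard_allocator() {
        return standard_allocation_strategy_instance;
    }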
While #152 is still open, we need to allow moderately sized allocations
to succeed. Extend the segment size to 256k, which allows threads to be
allocated.
Fixes #151.
"Histograms are used to collect latency information, in Origin, many of the
operations are timed, this is a potential performance issue. This series adds
an option to sample the operations, where small amount will be timed and the
most will only be counted.
This will give an estimation for the statistics, while keeping an accurate
count of the total events and have neglectible performance impact.
The first to use the modified histogram are the column family for their read
and write."
Conflicts:
database.hh
To free memory, we need to allocate memory: in LSA compaction, we convert
N segments with an average occupancy of (N-1)/N into N-1 new segments.
However, to do that we need to allocate segments, which we may not be able
to do due to the very low-memory condition that caused us to compact in
the first place.
Fix by introducing a segment reserve, which we normally try to keep full.
During low-memory conditions, we temporarily allow allocating from the
emergency reserve.
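
To make the numbers concrete: with N = 8 and 7/8 average occupancy, eight
segments compact into seven fully packed ones, but each copy step needs a
fresh segment before an old one can be released, so at least one spare
segment must be available at exactly the moment memory is scarcest; the
reserve provides that spare.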
Currently, each column family creates a fiber to handle compaction requests
in parallel with the rest of the system. If there are N column families,
N compactions could be running in parallel, which is definitely horrible.
To solve that problem, a per-database compaction manager is introduced
here. The compaction manager services compaction requests from N column
families. Parallelism is made available by creating more than one fiber
to service the requests; that is, N compaction requests will be served
by M fibers.
A submitted compaction request goes to a job queue shared among all
fibers, and the fiber with the lowest number of pending jobs is
signalled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
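
Condensed, the scheme looks roughly like this. An illustrative sketch
with a single shared semaphore; the actual code tracks per-fiber pending
counts so it can signal the least-loaded fiber:

    struct compaction_manager_sketch {
        std::deque<column_family*> _queue;   // pending compaction requests
        seastar::semaphore _pending{0};      // one unit per queued request

        void start(unsigned m) {
            for (unsigned i = 0; i < m; ++i) {
                // one service fiber; its future is kept elsewhere in real code
                (void)seastar::keep_doing([this] {
                    return _pending.wait().then([this] {
                        auto cf = _queue.front();
                        _queue.pop_front();
                        return compact(cf);  // hypothetical helper
                    });
                });
            }
        }
        void submit(column_family* cf) {
            _queue.push_back(cf);
            _pending.signal();               // wake one idle fiber
        }
    };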
The histogram object is used both as a general counter for the number of
events and for statistics and sampling.
This changes the histogram implementation so that it supports sparse
sampling while keeping the total number of events accurate.
The implementation includes the following:
Remove the template nature of the histogram, as it is used only for
timers, and use the name ihistogram instead.
If in the future we need a histogram for other types, we can reuse the
histogram name for it.
A total counter was added that counts the number of events that are part
of the statistics calculation.
Helper methods were added to ihistogram to handle the latency
counter object.
Based on the sample mask, it marks the latency object as started when
the bitwise AND of the counter and the mask is non-zero, and its mark
method accepts the latency object: if the latency was not started, it
is not added, and only the 'count' counter, which counts the total
number of events, is incremented.
This should reduce the impact of latency calculation to a negligible
effect.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
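
A sketch of the sampling logic, using the is_start() helper described in
the following message. The names, the mask value, and the exact sampling
condition are assumptions based on the description above (chosen here so
that only a small fraction of events is timed):

    struct ihistogram_sketch {
        uint64_t count = 0;              // total events, always accurate
        uint64_t sample_mask = 0x7F;     // assumed: time ~1 in 128 events

        void set_latency(latency_counter& lc) {
            if ((count & sample_mask) == 0) {  // sampling condition (assumption)
                lc.start();                    // only sampled events get a timestamp
            }
        }
        void mark(latency_counter& lc) {
            ++count;                           // every event is counted
            if (lc.is_start()) {               // only sampled events...
                add(lc.stop());                // ...feed the histogram buckets
            }
        }
    };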
When doing a sparse latency check, it is required to know whether a
latency object was started.
This returns true if the start timer was set.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
When the LSA reclaimer cannot reclaim more space by compaction, it
will reclaim memory by evicting from evictable regions.
Currently the only evictable region is the one owned by the row cache.
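
In other words, simplified and with illustrative names:

    size_t reclaim(size_t target) {
        size_t freed = compact_lsa_regions(target);  // first, defragment
        if (freed < target) {                        // still short? evict from
            freed += evict(target - freed);          // evictable regions
        }                                            // (today: the row cache)
        return freed;
    }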
Requiring alignment means that there must be 64K of contiguous space
to allocate each 32K segment. When memory is fragmented, we may fail
to allocate such a segment even though there is plenty of free space.
This especially hurts forward progress of compaction, which frees
segments randomly and relies on the fact that freeing a segment makes
it available to the next segment request.
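
To make the arithmetic concrete: a 32K-aligned 32K segment can begin at
only one address within any 32K-aligned window, so in the worst case a
free range must stretch to nearly 64K (32K of payload plus up to 32K - 1
bytes of alignment padding) before it is guaranteed to contain a usable
slot.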