Commit Graph

143 Commits

Author SHA1 Message Date
Tomasz Grabiec
c82325a76c lsa: Make region evictor signal forward progress
In some cases region may be in a state where it is not empty and
nothing could be evicted from it. For example when creating the first
entry, reclaimer may get invoked during creation before it gets
linked. We therefore can't rely on emptiness as a stop condition for
reclamation; the eviction function shall signal us if it made forward
progress.
2015-09-06 21:25:44 +02:00
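The stop condition described in c82325a76c can be illustrated with a minimal sketch. The names `eviction_result` and `reclaim` are hypothetical and are not taken from the logalloc sources; the point is only that the loop ends when the evictor reports no progress, rather than when the region looks empty.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical names; not the actual logalloc API.
enum class eviction_result { made_progress, no_progress };

// Calls `evict` until `target` bytes were released or the evictor reports
// that it could not make progress. `evict` stores the number of bytes it
// freed in its argument. Returns the total number of bytes released.
std::size_t reclaim(std::size_t target,
                    const std::function<eviction_result(std::size_t&)>& evict) {
    std::size_t released = 0;
    while (released < target) {
        std::size_t freed = 0;
        if (evict(freed) == eviction_result::no_progress) {
            break;  // the region may be non-empty, yet nothing is evictable
        }
        released += freed;
    }
    return released;
}
```

This sketch assumes an evictor that reports `made_progress` always frees a non-zero number of bytes; otherwise the loop would not terminate.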
Tomasz Grabiec
94f0db933f lsa: Fix typo in the word 'emergency' 2015-09-06 21:24:59 +02:00
Tomasz Grabiec
200562abe7 lsa: Reclaim over-max segments from segment pool reserve 2015-09-06 21:24:59 +02:00
Tomasz Grabiec
d022a1a4a3 lsa: Introduce allocating_section
Related to #259. In some cases we need to allocate memory and hold
reclaim lock at the same time. If that region holds most of the
reclaimable memory, allocations inside that code section may
fail. allocating_section is a work-around of the problem. It learns
how big reserves should be from past execution of critical section and
tries to ensure proper reserves before entering the section.
2015-09-06 21:24:59 +02:00
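A rough sketch of the "learning" behaviour described in d022a1a4a3, under stated assumptions: `reserve_pool`, `allocating_section_sketch`, the doubling policy and the `max_reserve` cap are all invented for illustration and do not reflect the real implementation.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for the segment pool reserve.
struct reserve_pool {
    std::size_t segments = 0;
    void ensure(std::size_t n) { if (segments < n) segments = n; }
};

class allocating_section_sketch {
    static constexpr std::size_t max_reserve = 1024;  // assumed cap
    std::size_t _reserve = 1;                         // remembered across calls
public:
    std::size_t reserve() const { return _reserve; }

    // Runs fn(pool). On std::bad_alloc the remembered reserve is doubled
    // and the section is retried, so later executions start well provisioned.
    template <typename Func>
    auto operator()(reserve_pool& pool, Func&& fn) {
        for (;;) {
            pool.ensure(_reserve);
            try {
                return fn(pool);
            } catch (const std::bad_alloc&) {
                if (_reserve >= max_reserve) {
                    throw;  // give up instead of retrying forever
                }
                _reserve *= 2;
            }
        }
    }
};
```

Because the section may run more than once, the callable has to be safe to re-execute from the start.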
Tomasz Grabiec
3caad2294b lsa: Tolerate empty segments when region is destroyed
Sometimes we may close an empty active segment, if all data in it was
evicted. Normally segments are removed as soon as the last object in
it is freed, but if the segment is already empty when closed, no one is
supposed to call free on it. Such segments would be quickly reclaimed
during compaction, but it's possible that we will destroy the region
before they're reclaimed by compaction. Currently we would fail on an
assertion which checks that there are no segments. This change fixes
the problem by handling empty closed segments when region is
destroyed.
2015-09-06 21:24:59 +02:00
Tomasz Grabiec
c37aa73051 lsa: Drop alignment requirement from segment 2015-09-06 21:24:59 +02:00
Tomasz Grabiec
2c1536b5a7 lsa: Make free() path noexcept
Memory releasing is invoked from destructors so should not throw. As a
consequence it should not allocate memory, so emergency segment pool
was switched from std::deque<> to an alloc-free intrusive stack.
2015-09-06 21:24:59 +02:00
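For 2c1536b5a7, an allocation-free intrusive stack might look roughly like the following. The `segment` and `segment_stack` types are hypothetical; the real emergency pool may be structured differently. The link pointer is stored in the segment itself, so pushing cannot allocate and therefore cannot throw.

```cpp
#include <cstddef>

// Hypothetical segment type: the link lives inside the segment itself,
// so putting it on the stack needs no extra allocation.
struct segment {
    segment* next = nullptr;
};

class segment_stack {
    segment* _head = nullptr;
    std::size_t _size = 0;
public:
    void push(segment* s) noexcept {
        s->next = _head;
        _head = s;
        ++_size;
    }
    // Returns nullptr when the stack is empty.
    segment* pop() noexcept {
        segment* s = _head;
        if (s) {
            _head = s->next;
            s->next = nullptr;
            --_size;
        }
        return s;
    }
    std::size_t size() const noexcept { return _size; }
    bool empty() const noexcept { return _head == nullptr; }
};
```

By contrast, `std::deque<>::push_back` may need to allocate a new block, which is what makes it unsuitable on a path called from destructors.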
Tomasz Grabiec
f404a238bb allocation_strategy: Make construct() exception-safe 2015-09-06 21:24:59 +02:00
Tomasz Grabiec
0d7cdab0ff utils: managed_vector: Make copy assignment exception-safe 2015-09-06 21:24:58 +02:00
Tomasz Grabiec
81d81c81f2 utils: managed_vector: Make copy constructor exception-safe
Copying values may throw (std::bad_alloc for example), which will
result in a leak because destructor will not be called when
constructor throws.

May manifest as the following assertion failure:

utils/logalloc.cc:672: virtual logalloc::region_impl::~region_impl(): Assertion `_active->is_empty()' failed.
2015-09-06 21:24:58 +02:00
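The leak described in 81d81c81f2 follows from a general C++ rule: when a constructor throws, the destructor of that object does not run. A minimal sketch of the usual remedy, using a hypothetical `vec_sketch` container (not the real managed_vector) and a `tracked` helper type that exists only to make the failure observable:

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>

// Illustration-only helper: counts live instances, can throw on copy.
static int tracked_live = 0;
static int copies_until_throw = -1;  // -1: never throw
struct tracked {
    tracked() { ++tracked_live; }
    tracked(const tracked&) {
        if (copies_until_throw == 0) throw std::bad_alloc();
        if (copies_until_throw > 0) --copies_until_throw;
        ++tracked_live;
    }
    ~tracked() { --tracked_live; }
};

template <typename T>
class vec_sketch {
    T* _data;
    std::size_t _size = 0;
    std::size_t _capacity;

    void release() noexcept {
        for (std::size_t i = 0; i < _size; ++i) _data[i].~T();
        ::operator delete(_data);
        _data = nullptr;
        _size = 0;
    }
public:
    explicit vec_sketch(std::size_t capacity)
        : _data(static_cast<T*>(::operator new(capacity * sizeof(T))))
        , _capacity(capacity) {}

    vec_sketch(const vec_sketch& other)
        : _data(static_cast<T*>(::operator new(other._capacity * sizeof(T))))
        , _capacity(other._capacity) {
        try {
            for (; _size < other._size; ++_size) {
                new (_data + _size) T(other._data[_size]);
            }
        } catch (...) {
            release();  // ~vec_sketch() will not run for this object
            throw;
        }
    }
    vec_sketch& operator=(const vec_sketch&) = delete;
    ~vec_sketch() { release(); }

    void push_back(const T& v) {
        if (_size == _capacity) throw std::length_error("vec_sketch is full");
        new (_data + _size) T(v);
        ++_size;
    }
    std::size_t size() const noexcept { return _size; }
};
```

Without the `catch` block, the elements copied before the failure and the buffer itself would stay allocated, which in an LSA region would plausibly surface as the non-empty-segment assertion quoted above.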
Tomasz Grabiec
fa8d530cc2 lsa: Add ability to trace reclaiming latency 2015-09-06 21:24:58 +02:00
Tomasz Grabiec
25e4f10c14 utils/file_lock: Fix compilation error
Fixes complaint about ignored result:

utils/file_lock.cc: In destructor 'utils::file_lock::impl::~impl()':
utils/file_lock.cc:29:33: error: ignoring return value of 'int lockf(int, int, __off_t)', declared with attribute warn_unused_result [-Werror=unused-result]
2015-09-04 13:00:54 +02:00
Avi Kivity
ef582753c0 Merge "Lock files for CL+Data dirs" from Calle
"Acquires and maintains lock files during execution.
Lock files are deleted on "clean" exit, and re-taken on uncontended
startup.

Fixes #34 "
2015-09-01 19:18:46 +03:00
Calle Wilund
e3abde50f9 Add lock file helper class "file_lock"
Using posix "lockf" per-fd region locking
2015-09-01 17:50:18 +02:00
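A lock-file helper built on POSIX `lockf` could be sketched as below. This is not the code from e3abde50f9: the class name, error handling, and the choice of `F_TLOCK` (fail immediately if contended) are assumptions. It also shows one way to consume the `lockf` return value in the destructor, which is what the later compilation fix 25e4f10c14 is about.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <string>
#include <system_error>
#include <utility>

class file_lock_sketch {
    std::string _path;
    int _fd;
public:
    explicit file_lock_sketch(std::string path)
        : _path(std::move(path))
        , _fd(::open(_path.c_str(), O_RDWR | O_CREAT, 0644)) {
        if (_fd < 0) {
            throw std::system_error(errno, std::system_category(), "open " + _path);
        }
        // Length 0 locks from the current offset to the end of the file.
        if (::lockf(_fd, F_TLOCK, 0) != 0) {
            int err = errno;
            ::close(_fd);
            throw std::system_error(err, std::system_category(), "lockf " + _path);
        }
    }
    file_lock_sketch(const file_lock_sketch&) = delete;
    file_lock_sketch& operator=(const file_lock_sketch&) = delete;

    ~file_lock_sketch() {
        // glibc marks lockf() warn_unused_result, so the result is checked
        // even though a destructor can do little about a failure.
        if (::lockf(_fd, F_ULOCK, 0) != 0) {
            // Closing the descriptor below drops the lock anyway.
        }
        ::close(_fd);
        ::unlink(_path.c_str());  // remove the lock file on clean exit
    }
    const std::string& path() const noexcept { return _path; }
};
```

Note that `lockf` locks are held per process, so this guards against a second process using the same directory, not against a second lock object in the same process.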
Tomasz Grabiec
870e9e5729 lsa: Replace compaction_lock with broader reclaim_lock
Disabling compaction of a region is currently done in order to keep
the references valid. But disabling only compaction is not enough, we
also need to disable eviction, as it also invalidates
references. Rather than introducing another type of lock, compaction
and eviction are controlled together, generalized as "reclaiming"
(hence the reclaim_lock).
2015-09-01 17:29:04 +03:00
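The idea in 870e9e5729 of one lock covering both compaction and eviction can be sketched as a small RAII guard. `region_sketch` and `reclaim_lock_sketch` are hypothetical stand-ins; the real classes have a different interface.

```cpp
// Hypothetical region with a single switch that both reclamation paths consult.
class region_sketch {
    bool _reclaiming_enabled = true;
public:
    bool reclaiming_enabled() const noexcept { return _reclaiming_enabled; }
    void set_reclaiming_enabled(bool e) noexcept { _reclaiming_enabled = e; }
    // Each returns true if the operation would be allowed to run.
    bool try_compact() const noexcept { return _reclaiming_enabled; }
    bool try_evict() const noexcept { return _reclaiming_enabled; }
};

// Disables reclaiming for its lifetime and restores the previous state
// afterwards, so references into the region stay valid and locks can nest.
class reclaim_lock_sketch {
    region_sketch& _region;
    bool _prev;
public:
    explicit reclaim_lock_sketch(region_sketch& r) noexcept
        : _region(r), _prev(r.reclaiming_enabled()) {
        _region.set_reclaiming_enabled(false);
    }
    reclaim_lock_sketch(const reclaim_lock_sketch&) = delete;
    reclaim_lock_sketch& operator=(const reclaim_lock_sketch&) = delete;
    ~reclaim_lock_sketch() { _region.set_reclaiming_enabled(_prev); }
};
```

Saving and restoring the previous state, rather than unconditionally re-enabling, is what keeps an inner lock from releasing an outer one.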
Tomasz Grabiec
48569651ea lsa: Fix calculation of bytes.non_lsa_used_space 2015-09-01 17:29:03 +03:00
Tomasz Grabiec
d20fae96a2 lsa: Make reclaimer run synchronously with allocations
The goal is to make allocation less likely to fail. With async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points. This bound is difficult to
enforce though. Sync reclaimer lifts this limitation.

Also, allocations which could not be satisfied before because of
fragmentation now will have higher chances of succeeding, although
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid them.

Downside of sync reclaiming is that now references into regions may be
invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
2015-08-31 21:50:18 +02:00
Tomasz Grabiec
6105c05dbe lsa: Introduce compaction_lock helper 2015-08-31 21:50:17 +02:00
Tomasz Grabiec
42dce17c82 lsa: Fix documentation for eviction functions 2015-08-31 21:50:17 +02:00
Avi Kivity
203b349722 Merge seastar upstream
* seastar 5176352...68fee6c (1):
  > Merge "Memory reclamation infrastructure follow-up" from Tomasz

Adjusted logalloc::tracker's reclaimer to fit new API
2015-08-31 20:01:07 +03:00
Paweł Dziepak
8d0419b621 managed_bytes: simplify empty()
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-08-31 17:29:16 +02:00
Paweł Dziepak
0343fddeb0 managed_bytes: simplify default constructor
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-08-31 17:29:16 +02:00
Paweł Dziepak
956c27e021 utils: add LSA-aware vector capable of internal storage
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-08-31 17:29:13 +02:00
Avi Kivity
702de43ce3 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
  per _previous_ runs shards, not current.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations were applied.
(Note that origin also lacks validity testing)"

Fixes #98.
2015-08-31 15:58:12 +03:00
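The segment ID scheme mentioned in 702de43ce3, "max(previous ID, wall clock time in ms) + shard info", might be sketched as follows. The bit layout (shard in the top 8 bits) and the use of `previous + 1` to keep IDs strictly increasing are assumptions made for this example; the merge message does not specify either.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed layout: shard in the top 8 bits, base ID in the low 56 bits.
constexpr unsigned shard_bits = 8;
constexpr unsigned base_bits = 64 - shard_bits;
constexpr std::uint64_t base_mask = (std::uint64_t(1) << base_bits) - 1;

// Base ID: wall clock in ms, unless that would not exceed the previous ID.
std::uint64_t next_base_id(std::uint64_t previous_base, std::uint64_t now_ms) {
    return std::max(previous_base + 1, now_ms);
}

// Combines the shard number and the base ID into one 64-bit segment ID.
std::uint64_t make_segment_id(unsigned shard, std::uint64_t base) {
    return (std::uint64_t(shard) << base_bits) | (base & base_mask);
}

unsigned shard_of(std::uint64_t id) {
    return static_cast<unsigned>(id >> base_bits);
}

std::uint64_t base_of(std::uint64_t id) {
    return id & base_mask;
}
```

Embedding the shard in the ID is what lets a later run match left-over segments to the shard that wrote them, even if the CPU count has changed.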
Calle Wilund
c040565bf9 runtime: expose boot_time
(boot == app start, I did not rename the var).
2015-08-31 14:29:45 +02:00
Avi Kivity
8c69098c89 Merge "Optimize memtable's scanning_reader" from Tomasz
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
2015-08-31 15:07:27 +03:00
Tomasz Grabiec
110a55886c lsa: Introduce region::compaction_counter() 2015-08-31 13:58:42 +02:00
Tomasz Grabiec
9ad3dbe592 lsa: Add region::compaction_enabled() 2015-08-31 13:58:42 +02:00
Tomasz Grabiec
048387782a lsa: Rename region::set_compactible() to set_compaction_enabled()
To avoid confusion with region_impl::is_compactible() when the getter
is added.
2015-08-31 13:58:42 +02:00
Avi Kivity
f171d71c16 utils: optimize murmur3_hash data fetch
By using a recognized idiom, gcc can optimize the unaligned little endian
load as a single instruction (actually less than an instruction, as it
combines it with a succeeding arithmetic operation).
2015-08-28 12:37:43 +03:00
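The commit f171d71c16 does not quote the idiom it uses. One commonly recognized form, shown here as an assumption rather than as the actual change, assembles the value byte by byte with shifts. Optimizing compilers such as gcc can often collapse this into a single load on little-endian targets, and it has no alignment or aliasing problems.

```cpp
#include <cstdint>

// Reads 8 bytes starting at p as a little-endian 64-bit value.
// Works for any alignment of p and on any host byte order.
std::uint64_t read_le64(const unsigned char* p) {
    return  std::uint64_t(p[0])
         | (std::uint64_t(p[1]) << 8)
         | (std::uint64_t(p[2]) << 16)
         | (std::uint64_t(p[3]) << 24)
         | (std::uint64_t(p[4]) << 32)
         | (std::uint64_t(p[5]) << 40)
         | (std::uint64_t(p[6]) << 48)
         | (std::uint64_t(p[7]) << 56);
}
```

Whether the compiler actually emits a single instruction depends on the compiler version, target and optimization level, so it is worth checking the generated assembly.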
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Avi Kivity
4e3c9c5493 Merge "compaction manager fixes" from Raphael 2015-08-27 11:05:26 +03:00
Avi Kivity
43a4491043 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
  per _previous_ runs shards, not current.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations were applied.
(Note that origin also lacks validity testing)"
2015-08-27 10:53:36 +03:00
Calle Wilund
45d07d2744 runtime: expose boot_time
(boot == app start, I did not rename the var).
2015-08-25 09:14:40 +02:00
Avi Kivity
0617aecb62 lsa: downgrade "no compactible pool" warning to trace
It's a fairly standard condition.
2015-08-24 17:26:48 +02:00
Raphael S. Carvalho
41b6d430c0 compaction_manager: do not retry compaction if stopping task
If stopping a task, we shouldn't retry a compaction because if
removing a cf, we would push back the cf into the back of the
queue if an error happened, and that would possibly lead to a
use-after-free.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-24 11:23:24 -03:00
Raphael S. Carvalho
4c9c144987 compaction_manager: avoid concurrent compaction on the same cf
It was noticed that the same sstable files could be selected for
compaction if concurrent compaction happens on the same cf.
That's possible because compaction manager uses 2 tasks for
handling compactions.

Solution is to not duplicate cf in the compaction manager queue,
and re-schedule compaction for a cf if needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-24 11:11:47 -03:00
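The approach in 4c9c144987, keeping each column family in the queue at most once and re-scheduling when needed, can be sketched like this. Column families are represented by name, and `compaction_queue_sketch` with its `submit`/`pop`/`done` methods is hypothetical; the real compaction manager works with column_family objects and seastar tasks.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

class compaction_queue_sketch {
    std::deque<std::string> _queue;
    std::unordered_set<std::string> _active;  // queued or being compacted
public:
    // Returns false if cf is already queued or currently being compacted.
    bool submit(const std::string& cf) {
        if (!_active.insert(cf).second) {
            return false;
        }
        _queue.push_back(cf);
        return true;
    }
    // Takes the next cf to compact; it stays "active" until done() is called.
    bool pop(std::string& out) {
        if (_queue.empty()) {
            return false;
        }
        out = std::move(_queue.front());
        _queue.pop_front();
        return true;
    }
    // Marks a compaction as finished and optionally re-schedules the cf.
    void done(const std::string& cf, bool needs_more) {
        _active.erase(cf);
        if (needs_more) {
            submit(cf);
        }
    }
    std::size_t pending() const noexcept { return _queue.size(); }
};
```

Keeping a column family in the active set while it is being compacted is what prevents a second task from picking the same sstables.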
Avi Kivity
77b3212c88 lsa: provide a fallback during normal allocation
Instead of failing normal allocations when the seastar allocator cannot
allocate a segment, provide a generous reserve.  An allocation failure
will now be satisfied from the reserve, but it will still trigger a
reclaim.  This allows hiding low-memory conditions from the user.
2015-08-23 16:38:04 +03:00
Avi Kivity
1bb840bb72 sstables: use large_bitset in bloom filter
Avoids allocation failures due to multi-megabyte filters.
2015-08-23 12:22:49 +03:00
Avi Kivity
e928bcaf19 utils: introduce large_bitset
Like boost::dynamic_bitset, but less capable.  On the other hand it avoids
very large allocations, which are incurred by the bloom filter's bitset
on even moderately sized sstables.
2015-08-23 12:22:49 +03:00
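A bitset that avoids one very large allocation, in the spirit of e928bcaf19, can be built from fixed-size chunks. The 128 KiB chunk size and the class name are assumptions for this sketch; the real large_bitset may choose differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class large_bitset_sketch {
    using word = std::uint64_t;
    static constexpr std::size_t bits_per_word = 64;
    static constexpr std::size_t bits_per_chunk = 128 * 1024 * 8;  // assumed
    static constexpr std::size_t words_per_chunk = bits_per_chunk / bits_per_word;

    std::size_t _nr_bits;
    std::vector<std::unique_ptr<word[]>> _chunks;

    word& word_at(std::size_t i) const {
        std::size_t in_chunk = i % bits_per_chunk;
        return _chunks[i / bits_per_chunk][in_chunk / bits_per_word];
    }
    static word mask(std::size_t i) { return word(1) << (i % bits_per_word); }
public:
    // Allocates ceil(nr_bits / bits_per_chunk) zeroed chunks, each a
    // separate moderate-size allocation.
    explicit large_bitset_sketch(std::size_t nr_bits) : _nr_bits(nr_bits) {
        std::size_t nr_chunks = (nr_bits + bits_per_chunk - 1) / bits_per_chunk;
        _chunks.reserve(nr_chunks);
        for (std::size_t c = 0; c < nr_chunks; ++c) {
            _chunks.push_back(std::make_unique<word[]>(words_per_chunk));
        }
    }
    std::size_t size() const noexcept { return _nr_bits; }
    std::size_t chunk_count() const noexcept { return _chunks.size(); }
    bool test(std::size_t i) const { return (word_at(i) & mask(i)) != 0; }
    void set(std::size_t i) { word_at(i) |= mask(i); }
    void clear(std::size_t i) { word_at(i) &= ~mask(i); }
};
```

The trade-off is one extra indirection per access in exchange for never asking the allocator for a contiguous multi-megabyte block.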
Raphael S. Carvalho
c6ea25c5fb compaction_manager: fix compaction_manager::stop
For stopping a task of compaction manager, we first close the gate
used by compaction then bust semaphore via semaphore::broken().

The problem is that semaphore::broken() only signals waiters, and so
subsequent semaphore::wait() calls would succeed and the task would
remain alive forever.
The fix is to signal semaphore, forcing the task to exit via gate
exception, so we will no longer rely on semaphore::broken() for
finishing the task. That's possible because we try to access the
gate right after we waited on semaphore.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-22 20:38:12 +03:00
Avi Kivity
f531f36a44 lsa: fix types in logs 2015-08-20 15:29:08 +03:00
Avi Kivity
9012f991bf logalloc: really allow dipping into the emergency pool during reclaim
The RAII wrapper for the emergency pool was invoked without an object,
and so had no effect.
2015-08-20 12:10:03 +03:00
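The class of bug fixed in 9012f991bf is easy to reproduce in isolation: an RAII guard created as an unnamed temporary is destroyed at the end of the same statement, so it protects nothing. The types below are hypothetical and only mimic the situation.

```cpp
// Hypothetical emergency pool with a flag that a guard toggles.
struct emergency_pool_sketch {
    bool allowed = false;
};

class allow_emergency_guard {
    emergency_pool_sketch& _pool;
    bool _prev;
public:
    explicit allow_emergency_guard(emergency_pool_sketch& p)
        : _pool(p), _prev(p.allowed) {
        _pool.allowed = true;
    }
    allow_emergency_guard(const allow_emergency_guard&) = delete;
    allow_emergency_guard& operator=(const allow_emergency_guard&) = delete;
    ~allow_emergency_guard() { _pool.allowed = _prev; }
};

// Buggy: the guard is an unnamed temporary, destroyed before the next line.
bool reclaim_buggy(emergency_pool_sketch& pool) {
    allow_emergency_guard{pool};
    return pool.allowed;  // false: the guard is already gone
}

// Fixed: the guard is a named object that lives until the end of the scope.
bool reclaim_fixed(emergency_pool_sketch& pool) {
    allow_emergency_guard guard{pool};
    return pool.allowed;  // true
}
```

Both versions compile cleanly, which is why this mistake tends to go unnoticed until the guarded behaviour is observed to be missing.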
Avi Kivity
9ed2bbb25c lsa: introduce region_group
A region_group is a nestable group of regions, for cumulative statistics
purposes.
2015-08-19 19:36:40 +03:00
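Nestable groups with cumulative statistics, as introduced in 9ed2bbb25c, can be sketched with a parent pointer and an update that walks up the hierarchy. The class and method names are hypothetical, and the only statistic tracked here is a byte count.

```cpp
#include <cstddef>

class region_group_sketch {
    region_group_sketch* _parent;
    std::size_t _total = 0;
public:
    explicit region_group_sketch(region_group_sketch* parent = nullptr) noexcept
        : _parent(parent) {}

    // Bytes used by this group, including all nested groups.
    std::size_t memory_used() const noexcept { return _total; }

    // Applies a change in usage to this group and to every ancestor.
    void update(std::ptrdiff_t delta) noexcept {
        for (region_group_sketch* g = this; g; g = g->_parent) {
            // Unsigned modular arithmetic makes negative deltas subtract.
            g->_total += static_cast<std::size_t>(delta);
        }
    }
};
```

Propagating on update keeps reads O(1) at every level, at the cost of an update that is linear in the nesting depth.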
Avi Kivity
71aad57ca8 lsa: make region::impl a top-level class
Makes using forward declarations possible.
2015-08-19 14:43:17 +03:00
Avi Kivity
5252d5ec9b managed_bytes: fix self-assignment 2015-08-19 11:18:07 +03:00
Avi Kivity
00f39c4e1a managed_bytes: add small string optimization 2015-08-19 11:18:07 +03:00
Raphael S. Carvalho
820ba6f4d2 adapt compaction manager for column family removal
We need a way to remove a column family from the compaction manager
because when dropping a column family we need to make sure that the
compaction manager doesn't hold a reference to it anymore.

So compaction manager queue is now of column_family, allowing us
to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.

Testcase for compaction manager was also adapted and improved.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-18 11:38:06 +03:00
Avi Kivity
608c0b8460 Merge "initial work on compaction manager API" from Raphael 2015-08-17 17:24:13 +03:00
Avi Kivity
932ddc328c logalloc: optimize current_allocation_strategy()
This heavily used function shows up in many places in the profile (as part
of other functions), so it's worth optimizing by eliminating the special
case for the standard allocator.  Use a statically allocated object instead.

(a non-thread-local object is fine since it has no data members).
2015-08-17 16:51:10 +03:00
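The optimization in 932ddc328c can be pictured as follows: instead of treating "no strategy set" as a special case on every call, the current-strategy pointer always points at some object, and the default is a statically allocated standard strategy. All names besides `current_allocation_strategy` are invented for this sketch, and the real interface has more members.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct allocation_strategy_sketch {
    virtual ~allocation_strategy_sketch() = default;
    virtual void* alloc(std::size_t n) = 0;
    virtual void free(void* p) noexcept = 0;
};

struct standard_strategy_sketch final : allocation_strategy_sketch {
    void* alloc(std::size_t n) override {
        void* p = std::malloc(n);
        if (!p) throw std::bad_alloc();
        return p;
    }
    void free(void* p) noexcept override { std::free(p); }
};

// One shared instance is fine across threads: it has no data members.
static standard_strategy_sketch standard_instance;

// Per-thread current strategy, never null.
static thread_local allocation_strategy_sketch* current_strategy = &standard_instance;

allocation_strategy_sketch& current_allocation_strategy() noexcept {
    return *current_strategy;  // no branch for the standard-allocator case
}
```

The hot path becomes a single thread-local load, with the special case moved into the initial value of the pointer.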