Commit Graph

45 Commits

Author SHA1 Message Date
Tomasz Grabiec
dc08cb46bb commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f7197dabf8)
2016-12-07 15:56:08 +02:00
Glauber Costa
1578d7363a commitlog: rework blocking logic
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.

That is obviously something we must do, but the current approach to it
is problematic for two main reasons:

1) It forces the requests that trigger a write to wait on the current
   write to finish. That is excessive; ideally we would wait for one
   particular write to finish, not necessarily the current one. That
   is made worse by the fact that when a write is followed by a flush
   (happens when we move to a new segment), then we must wait for
   *all* writes in that segment to finish.

1) it casts concurrency in terms of writes instead of memory, which
   makes the aforementioned problem a lot worse: if we have very big
   buffers in flight and we must wait for them to finish, that can
   take a long time, often in the order of seconds, causing timeouts.

The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.

This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.

To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:56:36 -04:00
Glauber Costa
a50996f376 commitlog: calculate segment-independent size of mutations
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.

This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"

Extracted here so it sits clearly in a different patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Calle Wilund
e667dcc3d0 commitlog: Make segment->segment_manager relation shared pointer
The segment->segment_manager pointer has, until now, been a raw pointer,
which in a way is sensible, since making circular shared pointer
relations is in general bad. However, since the code and life cycle
of segments has evolved quite a bit since that initial relation
was defined, becoming both more and then suddenly, in a sense,
less, asynchronous over time, the usage of the relation is in fact
more consistent with a shared pointer, in that a segment needs to
access its manager to properly do things like write and flush.

These two ops in particular depend on accessing the segment manager
in a way that might be fine even using raw pointers, if it was not
again for that little annoying thing of continuation reordering.

So, lets just make the relation a shared pointer, solving the issue
of whether the manager is alive when a segment accesses it. If it
has been "released" (shut down), the existing mechanisms (gate)
will then trigger and prevent any actual _actions_ from taking
place. And we don't have to complicate anything else even more.

Only "big" change is that we need to explicitly orphan all
segments in commitlog destructor (segment_manager is essentially
a p-impl).

This fixes some spurious crashes in nightly unit tests.

Fixes #966.

Message-Id: <1456838735-17108-1-git-send-email-calle@scylladb.com>
2016-03-01 16:48:28 +02:00
Calle Wilund
f2c5315d33 commitlog: Add write/flush limits
Configured on start (for now - and dummy values at that). 
When shard write/flush count reaches limit, and incoming ops will queue
until previous ones finish. 

Consequently, if an allocation op forces a write, which blocks, any 
other incoming allocations will also queue up to provide back pressure.
2016-01-26 10:19:24 +00:00
Calle Wilund
7628a4dfe0 commitlog: Add some feedback/measurement methods
Suitable to derive "back pressure" from.
2016-01-26 09:47:14 +00:00
Paweł Dziepak
a877905bd4 commitlog: allow adding entries using commitlog_entry_writer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-01-13 10:17:45 +01:00
Paweł Dziepak
9d74268234 commitlog: introduce entry_writer
Current commitlog interface requires writers to specify the size of a
new entry which cannot depend on the segment to which the entry is
written.
If column mappings are going to be stored in the commitlog that's not
enough since we don't know whether column mapping needs to be written
until we known in which segment the entry is going to be stored.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-01-13 10:13:26 +01:00
Tomasz Grabiec
c0ac7b3a73 commitlog: Wrap subscription in a unique_ptr<> to make it nothrow movable
future<> will require nothrow move constructible types.
2015-12-07 09:50:28 +01:00
Tomasz Grabiec
657841922a Mark move constructors noexcept when possible 2015-12-07 09:50:27 +01:00
Calle Wilund
262f44948d commitlog: Add get_flush_count method (for testing) 2015-11-23 15:42:45 +01:00
Calle Wilund
2fe2320490 commitlog: Make reading segments with crc/data errors non-fatal
Parser object now attempts to skip past/terminate parsing on corrupted
entries/chunks (as detected by invalid sizes/crc:s). The amount of data
skipped is kept track of (as well as we can estimate - pre-allocation
makes it tricky), and at the end of parsing/reporting, IFF errors 
occurred, and exception detailing the failures is thrown (since 
subsciption has little mechanism to deal with this otherwise). 

Thus a caller can decide how to deal with data corruption, but will be
given as many entries as possible.
2015-11-23 15:42:45 +01:00
Calle Wilund
dcabf8c1d2 Commitlog: Pre-allocate "reserve" segments
Refs #356

Pre-allocates N segments from timer task. N is "adaptive" in that it is
increased (to a max) every time segement acquisition is forced to allocate
a new instead of picking from pre-alloc (reserve) list. The idea is that it is
easier to adapt how many segments we consume per timer quanta than the timer
quanta itself.

Also does disk pressure check and flush from timer task now. Note that the
check is still only done max once every new segment.

Some logging cleanup/betterment also to make behaviour easier to trace.

Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that is can be a "half" file due to power fail
etc). This might need revisiting as well.

With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking op (disk).
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilebrium. But this should only be during a breif
warmup.

v2: Fixed timestamp not being reset on reserve acquire
2015-09-21 13:04:39 +02:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Avi Kivity
dcdc925b86 Revert "Commitlog: Pre-allocate "reserve" segments"
This reverts commit cbf3b63853, due to
reports of increased latency (instead of the opposite).
2015-09-19 09:26:39 +03:00
Calle Wilund
cbf3b63853 Commitlog: Pre-allocate "reserve" segments
Refs #356

Pre-allocates N segments from timer task. N is "adaptive" in that it is
increased (to a max) every time segement acquisition is forced to allocate
a new instead of picking from pre-alloc (reserve) list. The idea is that it is
easier to adapt how many segments we consume per timer quanta than the timer
quanta itself.

Also does disk pressure check and flush from timer task now. Note that the
check is still only done max once every new segment.

Some logging cleanup/betterment also to make behaviour easier to trace.

Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that is can be a "half" file due to power fail
etc). This might need revisiting as well.

With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking op (disk).
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilebrium. But this should only be during a breif
warmup.
2015-09-17 19:54:28 +03:00
Calle Wilund
456246dfd5 Commitlog: Add a gate + shutdown method
* Gate ensures we don't add data into a segment after close
* Shutdown closes all segments for business and prohibits new segments
2015-09-08 11:53:41 +02:00
Calle Wilund
4ed95b7020 Commitlog: Add sync_all_segments()
For #293 - allows explicit flush to disk (not close!) of all active segments
2015-09-07 20:31:59 +02:00
Calle Wilund
d614143f5e Commitlog/database: Fixup series "Commit log flush request on disk overflow"
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)

Refs #262

Commit of original series was done on stale version (v2) due to authors
inability to multitask and update git repos.

v3:
* Removed future<> return value from callbacks. I.e. flush callback is now
  only fully syncronous over actual call
2015-09-07 21:29:19 +03:00
Calle Wilund
fdb921afb2 Commitlog: Add flushing of segment CF:s on disk overflow
* Do not throw away commitlog segments on disk size overflow. 
  Issue a flush request (i.e. calculate RP we want to free unto, 
  and for all dirty CF:s, do a request).
  "Abstracted" as registerable callback. I.e. DB:s responsibility 
  to actually do something with it.
2015-09-07 13:21:43 +02:00
Calle Wilund
841dd32a8a Commitlog: divide max on-disk-size by num cpus
To try to keep the resulting limit as configured
2015-09-07 13:13:46 +02:00
Calle Wilund
1814f89730 Commitlog: Add some more metrics + accessors for json API
Fixes #99

Adding missing commitlog metrics to the rest API.

v2: Mis-send (clumsy fingers)
v3: Use map_reduce0 + subroutine for nicer code
v4: rebased on current master
v5: rebased yet again.

Since the _second_ file in this previous patch set was commited, and is
dependent on this very change below to even compile, some expediency might be
warranted.
2015-09-01 10:15:33 +03:00
Calle Wilund
bbf82e80d0 Commitlog: Allow skipping X bytes in commit log reader
Also refactor reader into named methods for debugging sanity.
2015-08-31 14:29:49 +02:00
Calle Wilund
02d2bef1f2 Commitlog: Expose convinience method "list_existing_segments" 2015-08-31 14:29:48 +02:00
Calle Wilund
19052b3c09 Commitlog: Expose list_existing_descriptors 2015-08-31 14:29:48 +02:00
Calle Wilund
e068ffb5a5 Commitlog: Make file reader provide replay_position for entries 2015-08-31 14:29:47 +02:00
Calle Wilund
41b1ad8600 Commitlog: Make descriptor type visible/usable from outside 2015-08-31 14:29:47 +02:00
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Calle Wilund
86a97fea4c Commitlog: Allow skipping X bytes in commit log reader
Also refactor reader into named methods for debugging sanity.
2015-08-25 09:41:55 +02:00
Calle Wilund
4364d72ca3 Commitlog: Expose convinience method "list_existing_segments" 2015-08-25 09:41:54 +02:00
Calle Wilund
a3a02968ab Commitlog: Expose list_existing_descriptors 2015-08-25 09:41:54 +02:00
Calle Wilund
fcb87471b9 Commitlog: Make file reader provide replay_position for entries 2015-08-25 09:40:53 +02:00
Calle Wilund
db6370ad87 Commitlog: Make descriptor type visible/usable from outside 2015-08-25 09:40:53 +02:00
Calle Wilund
6ac6d644be Commitlog: add logging
Note: pretty lame logging, but modeled after origin.
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-08-10 18:42:41 +03:00
Calle Wilund
0729580f84 Separate replay_position into its own header
Less include bloat...

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-06-01 14:17:43 +03:00
Amnon Heiman
b95dabba38 Expose the segment names in commit log
This adds a method to return a vector with full-path to the active
segment names. It will be used by the API.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-05-19 15:27:52 +03:00
Calle Wilund
46e0676a7d commitlog: Add reader stream/subscription
Generic read-all-stream from a commit log segmen file.
Provides a byte view for each data entry, doing CRC checks and padding skips.

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-05-06 16:45:13 +03:00
Pekka Enberg
bf1734c480 db/commitlog: Use 'pragma once' in commitlog.hh
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-06 09:47:42 +03:00
Pekka Enberg
a32ae69b2b db/commitlog: Minor formatting fixes
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-06 09:47:21 +03:00
Pekka Enberg
9920d58b70 db/commitlog: Use C++ type aliases
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-06 09:43:04 +03:00
Calle Wilund
2f4e7a00f6 Use db/config object in main, database etc
* Uses config object to augument/impl options parsing
* Database now holds config obj
* Commitlog can now be inited with global config obj.
2015-04-29 18:01:17 +02:00
Calle Wilund
aeb83f2874 Add commitlog to db + use it in storage_proxy/handler
* A commitlog is created in "work" dirs when initing the db
  from a datadir. However, since we have neither disk data storage,
  nor replay capability yet (and no real db config), the settings 
  are basically to just write in-memory serialization, write them to 
  disk and then discard them. So in fact, pointless. But at least using
  the log...
* Moved the actual "apply" of mutation into database. If a commitlog
  is active, add an entry to it before applying mutation.
2015-04-29 10:10:21 +02:00
Calle Wilund
9979ee8d45 Modify commit log to use dataoutput
Both internal usage and external interface.
2015-04-01 10:08:00 +02:00
Calle Wilund
054f9ed082 Initial commit log support.
Implements a cassandra-file-compatible segmented log
of "mutations". Handles "batch" and "periodic" mode like
stock version. Also includes "dirty" management for the
interaction log/memtable.

Supports:
* add
* sync/flush
* clear dirty bits (thus discarding segments)

Many more estoric stock functions not yet implemented.

Missing: Storage management. Does not deal with total
size on disk of segments yet. Nor does it have any provisions
for dealing with active buffer bloat should async writes stall.

[avi: adjust for future<>::rescue() removal]
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-03-05 11:06:09 +02:00