Commit Graph

421 Commits

Author SHA1 Message Date
Avi Kivity
dcdc925b86 Revert "Commitlog: Pre-allocate "reserve" segments"
This reverts commit cbf3b63853, due to
reports of increased latency (instead of the opposite).
2015-09-19 09:26:39 +03:00
Calle Wilund
cbf3b63853 Commitlog: Pre-allocate "reserve" segments
Refs #356

Pre-allocates N segments from timer task. N is "adaptive" in that it is
increased (to a max) every time segement acquisition is forced to allocate
a new instead of picking from pre-alloc (reserve) list. The idea is that it is
easier to adapt how many segments we consume per timer quanta than the timer
quanta itself.

Also does disk pressure check and flush from timer task now. Note that the
check is still only done max once every new segment.

Some logging cleanup/betterment also to make behaviour easier to trace.

Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that is can be a "half" file due to power fail
etc). This might need revisiting as well.

With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking op (disk).
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilebrium. But this should only be during a breif
warmup.
2015-09-17 19:54:28 +03:00
Calle Wilund
b512192b3b Commitlog: Fix some timing/latency issues with sync
Refs #356

* Move sync time setting to sync initiate to help prevent double syncs
* Change add_mutation to only do explicit sync with wait if time elapsed
  since last is 2x sync window
* Do not wait for sync when moving to new segment in alloc path
* Initiate _sync_time properly.
* Add some tracing log messages to help debug
2015-09-16 20:07:25 +03:00
Calle Wilund
d42ff89e83 Config: Promote logging of unhandled options to warning
Fixes #222
2015-09-16 15:43:53 +03:00
Calle Wilund
bf727b2272 config.cc : add logging of unset attributes
Helps checking for missing stuff in scylla.yaml
2015-09-16 15:43:35 +03:00
Calle Wilund
8172717ba0 config.hh : update some default values to match scylla.conf 2015-09-16 15:43:35 +03:00
Calle Wilund
04562b23b4 commitlog_replayer: More correct fix for reordering issue in replay
* Removes previous, accidental fix that got committed.
* Instead just do not give RP:s to replay mutations. This is same as in Origin,
  and just as/more correct, since we intend to flush the data to sstables
  asap anyway
2015-09-16 15:41:17 +03:00
Avi Kivity
cab2148141 Merge "partial sstable handling" from Raphael
closes issue #75.
2015-09-13 12:03:50 +03:00
Gleb Natapov
17e54d0604 add logger for consistency level calculation 2015-09-13 11:59:17 +03:00
Raphael S. Carvalho
c729ea36e1 commitlog: guard commit log replay against reordering
After killing scylla in the middle of a write, the next scylla
instance failed to finish commit log replay, showing the following
error message:

scylla: core/future.hh:448: void promise<T>::set_value(A&& ...)
[with A = {}; T = {}]: Assertion `_state' failed.

After a long debug session, I figured out that check_valid_rp() was
triggering the exception replay_position_reordered_exception, which
means replay position reordering.

Looking at 8b9a63a3c6, I noticed that database::apply is guarded
against reodering, but commitlog replay code is not.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-12 06:17:14 -03:00
Gleb Natapov
04d2bef55b give preference to local data during query
Until dynamic snitch is implemented this is better than nothing.

Fixes #322
2015-09-10 15:45:20 +03:00
Pekka Enberg
1f7fa18970 db/schema_tables: Fix create keyspace notification
We need to send out the notification for all created keyspaces, not just
for the first one.

Spotted during code inspection.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-09-10 10:35:59 +03:00
Avi Kivity
6ccb0b15b5 Merge "Move the API configuration from command line to configuration" from Amnon
"It moves the API configuration from the command line argument to the general
config, it also move the api-doc directory to be configurable instead of hard
coded."
2015-09-09 12:35:03 +03:00
Gleb Natapov
df468504b6 schema_table: convert code to use distributed<storage_proxy> instead of storage_proxy&
All database code was converted to is when storage_proxy was made
distributed, but then new code was written to use storage_proxy& again.
Passing distributed<> object is safer since it can be passed between
shards safely. There was a patch to fix one such case yesterday, I found
one more while converting.
2015-09-09 10:19:30 +03:00
Calle Wilund
d46a95242a Config: Fix type where alias destination was copied instead of referenced
Fixes #310

Missing '&'.
(And no, cannot make the type non-copyable, since we want to copy config
 objects).
2015-09-08 16:54:04 +03:00
Calle Wilund
456246dfd5 Commitlog: Add a gate + shutdown method
* Gate ensures we don't add data into a segment after close
* Shutdown closes all segments for business and prohibits new segments
2015-09-08 11:53:41 +02:00
Calle Wilund
d666c747e3 Commitlog: Just add some more verbosity 2015-09-08 11:16:38 +02:00
Avi Kivity
a95d3f9cf5 Merge "Commitlog shutdown" from Calle
"Refs #293

* Add a commitlog::sync_all_segments, that explicitly forces all pending
  disk writes
* Only delete segments from disk IFF they are marked clean. Thus on partial
  shutdown or whatnot, even if CL is destroyed (destructor runs) disk files
  not yet clean visavi sstables are preserved and replayable
* Do a sync_all_segments first of all in database::stop.

Exactly what to not stop in main I leave up to others discretion, or at least
another patch."
2015-09-08 11:11:18 +03:00
Tomasz Grabiec
15ae1a92cb Merge branch 'pdziepak/compaction-remove-items/v4' from seastar-dev.git
From Pawel:

This series makes compaction remove items that are no longer items:
 - expired cells are changed into tombstones
 - items covered by higher level tombstones are removed
 - expired tombstones are removed if possible

Fixes #70.
Fixes #71.
2015-09-08 09:23:00 +02:00
Amnon Heiman
8be2ee54aa configuration: Add the API configuration to the general configuration
This adds the API configuration parameters to the configurtion, so it
will be possible to take them from the configuration file or from the
command line.

The following configuration were defined:
api_port
api_address
api_ui_dir
api_doc_dir

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-08 02:56:47 +03:00
Paweł Dziepak
64949e8339 schema: make gc_grace_seconds() return gc_clock::duration
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-07 21:14:41 +02:00
Calle Wilund
256c0550bf Commitlog: Only delete segments on disk if they are marked clean
For #293 - i.e. allow more or less coherent shutdown/destruction of the
commitlog while retaining disk data.
(tests still clear stuff explicitly).
2015-09-07 20:32:01 +02:00
Calle Wilund
4ed95b7020 Commitlog: Add sync_all_segments()
For #293 - allows explicit flush to disk (not close!) of all active segments
2015-09-07 20:31:59 +02:00
Calle Wilund
d614143f5e Commitlog/database: Fixup series "Commit log flush request on disk overflow"
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)

Refs #262

Commit of original series was done on stale version (v2) due to authors
inability to multitask and update git repos.

v3:
* Removed future<> return value from callbacks. I.e. flush callback is now
  only fully syncronous over actual call
2015-09-07 21:29:19 +03:00
Avi Kivity
dee9060b12 Merge "Commit log flush request on disk overflow" from Calle
"Fixes #262

Handles CL disk size exceeding configured max size by calling flush handlers
for each dirty CF id / high replay_position mark. (Instead of uncontrolled
delete as previously).

* Increased default max disk size to 8GB. Same as Origin/scylla.yaml (so no
   real change, but synced).
* Divide the max disk size by cpus (so sum of all shards == max)
* Abstract flush callbacks in CL
* Handler in DB that initiates memtable->sstable writes when called.

Note that the flush request is done "syncronously" in new_segment() (i.e.
when getting a new segment and crossing threshold). This is however more or
less congruent with Origin, which will do a request-sync in the corresponding
case.
Actual dealing with the request should at least in production code however be
done async, and in DB it is, i.e. we initiate sstable writes. Hopefully
they finish soon, and CL segments will be released (before next segment is
allocated).

If the flush request does _not_ eventually result in any CF:s becoming
clean and segments released we could potentially be issuing flushes
repeatedly, but never more often than on every new segment."
2015-09-07 18:46:48 +03:00
Gleb Natapov
da242146b6 do not pass storage_proxy reference across cpus
storage_proxy instances are per cpu, so they cannot be passed around to
other cpus.
2015-09-07 17:16:29 +02:00
Calle Wilund
fdb921afb2 Commitlog: Add flushing of segment CF:s on disk overflow
* Do not throw away commitlog segments on disk size overflow. 
  Issue a flush request (i.e. calculate RP we want to free unto, 
  and for all dirty CF:s, do a request).
  "Abstracted" as registerable callback. I.e. DB:s responsibility 
  to actually do something with it.
2015-09-07 13:21:43 +02:00
Calle Wilund
31f2dcb342 Config: change commilog max size on disk to be in sync with scylla.yaml 2015-09-07 13:13:51 +02:00
Calle Wilund
841dd32a8a Commitlog: divide max on-disk-size by num cpus
To try to keep the resulting limit as configured
2015-09-07 13:13:46 +02:00
Asias He
f89a25562c storage_service: Fix is_auto_bootstrap
Get the value from cfg option.
2015-09-07 12:53:58 +03:00
Asias He
7cc768a864 gossip: Fix wrong cluster name and partitioner name
Right now, gossip returns hard coded cluster and partitioner name.

  sstring get_cluster_name() {
      // FIXME: DatabaseDescriptor.getClusterName()
      return "my_cluster_name";
  }
  sstring get_partitioner_name() {
      // FIXME: DatabaseDescriptor.getPartitionerName()
      return "my_partitioner_name";
  }

Fix it by setting the correct name from configure option.

With this

   cqlsh 127.0.0.$i -e "SELECT * from system.local;

returns correct cluster_name.

Fixes #291
2015-09-07 09:21:18 +03:00
Avi Kivity
42dc29619d Merge "Optimize mutation copies and moves" from Paweł
"This series deals with copies and moves of mutation. The former are dealt
with by adding std::move() and missing 'mutable' (in case of lambdas). The
latter are improved by storing mutation_partition externally thus removing
the need for moving mutation_partition each time mutation is moved.

Storing mutation_partition externally is obviously trading the cost of
move constructor for the cost of allocation which shows in perf_mutation
results since mutations aren't moved in that test.

perf_mutation (-c 1):
before: 3289520.06 tps
after:  3183023.37 tps
diff: -3.24%

perf_simple_query (read):
before: 526954.05 tps
after:  577225.16 tps
diff +9.54%

perf_simple_query (write):
before: 731832.70 tps
after:  734923.60 tps
diff: +0.42%

Fixes #150 (well, not completely)."
2015-09-03 12:05:28 +03:00
Paweł Dziepak
830a86258b db: avoid copying mutations
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-03 10:30:32 +02:00
Paweł Dziepak
ddec2b4d09 batchlog_manager: pass mutations by const ref
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-03 10:30:29 +02:00
Paweł Dziepak
8188896eb7 schema_tables: add missing mutable
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-03 10:30:25 +02:00
Glauber Costa
28f315fad4 system_keyspace: keep msg alive when needed
Fixes #266

Some callsites are fine: if we just get the message and process it, as is the
case with check_health for instance, msg will be alive and all is good. But if
we return a future inside the processing, msg must be kept alive. Classic bug,
appearing again.

Pekka saw this in practice in another bug. We haven't seen anything that is
related to this, but it is certainly wrong.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-09-03 09:11:07 +03:00
Pekka Enberg
ce39f9d57a db/system_keyspace: Fix use-after-free in build_dc_rack_info()
Fixes #264.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-09-02 16:37:34 +03:00
Calle Wilund
d95101664d Commitlog: Don't throw exceptions on unrecognized files in CL dir 2015-09-01 14:23:03 +02:00
Calle Wilund
1814f89730 Commitlog: Add some more metrics + accessors for json API
Fixes #99

Adding missing commitlog metrics to the rest API.

v2: Mis-send (clumsy fingers)
v3: Use map_reduce0 + subroutine for nicer code
v4: rebased on current master
v5: rebased yet again.

Since the _second_ file in this previous patch set was commited, and is
dependent on this very change below to even compile, some expediency might be
warranted.
2015-09-01 10:15:33 +03:00
Calle Wilund
9ba84e458a Commitlog: Handle partial writes in segment::cycle
* Fixes #247
* Re-introduce test_allocation_failure, but allow for the "failure" to not
  happen. I.e. if run with low memory settings, the test will check that
  allocation failure is graceful. With lots of memory it will check partial
  write.
2015-08-31 20:02:05 +03:00
Calle Wilund
d3a01072af CommitLogReplayer: Java -> C++
Initial implementation
2015-08-31 14:29:50 +02:00
Calle Wilund
bbf82e80d0 Commitlog: Allow skipping X bytes in commit log reader
Also refactor reader into named methods for debugging sanity.
2015-08-31 14:29:49 +02:00
Calle Wilund
da9ea641e5 Commitlog: Handle full paths in descriptor file name parse. 2015-08-31 14:29:48 +02:00
Calle Wilund
02d2bef1f2 Commitlog: Expose convinience method "list_existing_segments" 2015-08-31 14:29:48 +02:00
Calle Wilund
19052b3c09 Commitlog: Expose list_existing_descriptors 2015-08-31 14:29:48 +02:00
Calle Wilund
e068ffb5a5 Commitlog: Make file reader provide replay_position for entries 2015-08-31 14:29:47 +02:00
Calle Wilund
41b1ad8600 Commitlog: Make descriptor type visible/usable from outside 2015-08-31 14:29:47 +02:00
Calle Wilund
ea38b223bd Commitlog: change the ID generation scheme
* Make it more like origin, i.e. based on wall clock time of app start
* Encode shard ID in the, RP segement ID, to ensure RP:s and segement names
  are unique per shard
2015-08-31 14:29:46 +02:00
Calle Wilund
0fcf7e3e91 Commitlog: Make "position" type 32-bit to align replay_position with
Origin

* Note: removed commitlog_test:test_allocation_failure because with 
  segments limited to 4GB -> mutation limited to 2GB, actually forcing
  a fail is not guaranteed or even likely.
2015-08-31 14:29:44 +02:00
Calle Wilund
3f1a91b89c Commitlog: do not eagerly create first segment on init
Deferring makes it easier to separate old segments from new, which in turn
helps replay logic.
2015-08-31 13:11:44 +02:00