Refs #356
Pre-allocates N segments from the timer task. N is "adaptive" in that it is
increased (up to a max) every time segment acquisition is forced to allocate
a new segment instead of picking one from the pre-allocated (reserve) list.
The idea is that it is easier to adapt how many segments we consume per timer
quantum than the timer quantum itself.
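A rough sketch of the adaptive reserve described above (a toy model; `segment_reserve` and its fields are illustrative names, not the actual commitlog code):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model of the adaptive reserve: the timer task tops the reserve up to
// `target`, and a miss in the allocation path grows `target` (capped at
// `max`) so the next timer quantum pre-allocates more.
struct segment_reserve {
    std::deque<int> free;   // stand-in for pre-allocated segments
    size_t target = 2;      // how many segments the timer keeps around
    size_t max = 16;        // upper bound on adaptation
    int next_id = 0;

    void timer_tick() {     // runs once per timer quantum
        while (free.size() < target) {
            free.push_back(next_id++);  // "allocate" a reserve segment
        }
    }

    int acquire() {
        if (!free.empty()) {
            int s = free.front();
            free.pop_front();
            return s;
        }
        // Reserve miss: allocate inline and adapt the target upward.
        if (target < max) {
            ++target;
        }
        return next_id++;
    }
};
```

The point being that a miss only grows `target`; the blocking allocation work stays on the timer path after warmup.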
Also does the disk pressure check and flush from the timer task now. Note that
the check is still done at most once per new segment.
Also some logging cleanup/improvement to make behaviour easier to trace.
Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that it can be a "half" file due to power fail
etc). This might need revisiting as well.
With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking (disk) op.
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilibrium. But this should only happen during a brief
warmup.
Refs #356
* Move sync time setting to sync initiate to help prevent double syncs
* Change add_mutation to only do an explicit sync-with-wait if the time
elapsed since the last one is 2x the sync window
* Do not wait for sync when moving to new segment in alloc path
* Initialize _sync_time properly.
* Add some tracing log messages to help debug
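The 2x-sync-window rule above could be sketched like this (a hypothetical `sync_policy` type, not the actual commitlog code):

```cpp
#include <cassert>
#include <chrono>

using clk = std::chrono::steady_clock;

// add_mutation only waits for an explicit sync when the last initiated
// sync is older than twice the configured sync period; otherwise the
// periodic timer is trusted to catch up. Setting last_sync at sync
// *initiation* (not completion) is what prevents double syncs.
struct sync_policy {
    clk::duration sync_period;
    clk::time_point last_sync;  // set when a sync is initiated

    bool should_wait_for_sync(clk::time_point now) const {
        return (now - last_sync) >= 2 * sync_period;
    }
};
```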
* Removes the previous, accidental fix that got committed.
* Instead, just do not give RPs to replayed mutations. This is the same as in
Origin, and just as correct or more so, since we intend to flush the data to
sstables ASAP anyway
After killing scylla in the middle of a write, the next scylla
instance failed to finish commit log replay, showing the following
error message:
scylla: core/future.hh:448: void promise<T>::set_value(A&& ...)
[with A = {}; T = {}]: Assertion `_state' failed.
After a long debug session, I figured out that check_valid_rp() was
triggering the exception replay_position_reordered_exception, which
means replay position reordering.
Looking at 8b9a63a3c6, I noticed that database::apply is guarded
against reordering, but the commitlog replay code is not.
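The guard in question can be illustrated roughly as follows (a toy model; the real `check_valid_rp` and `replay_position` live in the scylla codebase and differ in detail):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// A replay position (segment id, offset) must be monotonically
// increasing as mutations are applied; seeing a smaller one than the
// last applied position indicates reordering.
struct replay_position {
    uint64_t id;   // segment id
    uint32_t pos;  // offset within segment
    bool operator<(const replay_position& o) const {
        return id < o.id || (id == o.id && pos < o.pos);
    }
};

struct replay_state {
    replay_position last{0, 0};
    void check_valid_rp(replay_position rp) {
        if (rp < last) {
            // In scylla this is replay_position_reordered_exception.
            throw std::runtime_error("replay_position reordered");
        }
        last = rp;
    }
};
```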
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
We need to send out the notification for all created keyspaces, not just
for the first one.
Spotted during code inspection.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
"It moves the API configuration from the command line arguments to the general
config; it also moves the api-doc directory to be configurable instead of
hard-coded."
All database code was converted to it when storage_proxy was made
distributed, but then new code was written to use storage_proxy& again.
Passing a distributed<> object is safer since it can be passed between
shards safely. There was a patch to fix one such case yesterday; I found
one more while converting.
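Why a distributed<> reference is shard-safe while a plain reference is not can be shown with a minimal stand-in for seastar's `distributed<>` (everything here is a toy model, not the seastar API):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the current shard id (seastar would provide this).
thread_local int current_shard = 0;

// The wrapper holds one instance per shard and local() resolves to the
// current shard's copy, so the wrapper reference itself carries no
// shard affinity -- unlike a T&, which pins one shard's instance.
template <typename T>
struct distributed {
    std::vector<T> instances;
    explicit distributed(int shards) : instances(shards) {}
    T& local() { return instances[current_shard]; }
};

struct storage_proxy { int writes = 0; };
```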
"Refs #293
* Add a commitlog::sync_all_segments, that explicitly forces all pending
disk writes
* Only delete segments from disk IFF they are marked clean. Thus on partial
shutdown or whatnot, even if the CL is destroyed (destructor runs), disk files
not yet clean vis-à-vis sstables are preserved and replayable
* Do a sync_all_segments first of all in database::stop.
Exactly what not to stop in main I leave up to others' discretion, or at least
another patch."
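The clean/dirty deletion rule above amounts to something like this (a toy sketch with hypothetical names):

```cpp
#include <cassert>
#include <set>

// A segment tracks which CFs still have un-flushed data in it; its
// file may be deleted on close only if that set is empty, i.e. the
// segment is "clean" vis-à-vis sstables. Otherwise the file is kept
// on disk for replay.
struct segment {
    std::set<int> dirty_cfs;  // CFs whose data is not yet in sstables
    void mark_clean(int cf) { dirty_cfs.erase(cf); }
    bool delete_file_on_close() const { return dirty_cfs.empty(); }
};
```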
From Pawel:
This series makes compaction remove items that are no longer live:
- expired cells are changed into tombstones
- items covered by higher level tombstones are removed
- expired tombstones are removed if possible
Fixes #70.
Fixes #71.
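The three rules in the list above can be sketched with a toy cell model (all names hypothetical; the real compaction logic also tracks deletion times and local deletion markers):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct cell {
    int64_t timestamp;
    std::optional<int64_t> expiry;  // set for live cells with a TTL
    bool is_tombstone = false;
};

// Returns the cell as it should appear after compaction, or nullopt if
// it can be dropped entirely.
std::optional<cell> compact(cell c, int64_t now,
                            std::optional<int64_t> covering_tombstone_ts,
                            int64_t gc_grace) {
    if (!c.is_tombstone && c.expiry && *c.expiry <= now) {
        c.is_tombstone = true;  // expired cell -> tombstone
        c.expiry.reset();
    }
    if (covering_tombstone_ts && c.timestamp <= *covering_tombstone_ts) {
        return std::nullopt;    // shadowed by a higher-level tombstone
    }
    if (c.is_tombstone && c.timestamp + gc_grace <= now) {
        return std::nullopt;    // expired tombstone, purgeable
    }
    return c;
}
```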
This adds the API configuration parameters to the configuration, so it
will be possible to take them from the configuration file or from the
command line.
The following configuration options were defined:
api_port
api_address
api_ui_dir
api_doc_dir
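In scylla.yaml form, the new options might look like this (values purely illustrative):

```yaml
# Hypothetical scylla.yaml fragment exercising the new options
api_port: 10000
api_address: 127.0.0.1
api_ui_dir: swagger-ui/dist/
api_doc_dir: api/api-doc/
```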
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)
Refs #262
Commit of the original series was done on a stale version (v2) due to the
author's inability to multitask and update git repos.
v3:
* Removed future<> return value from callbacks. I.e. the flush callback is now
only fully synchronous over the actual call
"Fixes #262
Handles CL disk size exceeding the configured max size by calling flush
handlers for each dirty CF id / high replay_position mark (instead of the
uncontrolled delete done previously).
* Increased default max disk size to 8GB. Same as Origin/scylla.yaml (so no
real change, but synced).
* Divide the max disk size by cpus (so sum of all shards == max)
* Abstract flush callbacks in CL
* Handler in DB that initiates memtable->sstable writes when called.
Note that the flush request is done "synchronously" in new_segment() (i.e.
when getting a new segment and crossing the threshold). This is however more or
less congruent with Origin, which will do a request-sync in the corresponding
case.
Actually dealing with the request should however, at least in production code,
be done asynchronously, and in DB it is, i.e. we initiate sstable writes.
Hopefully they finish soon, and CL segments will be released (before the next
segment is allocated).
If the flush request does _not_ eventually result in any CFs becoming
clean and segments released, we could potentially be issuing flushes
repeatedly, but never more often than on every new segment."
* Do not throw away commitlog segments on disk size overflow.
Issue a flush request instead (i.e. calculate the RP we want to free up to,
and for all dirty CFs, do a request).
"Abstracted" as a registerable callback, i.e. it is the DB's responsibility
to actually do something with it.
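The callback abstraction described above could look roughly like this (a toy model; names like `register_flush_handler` are illustrative, not the actual CL API):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// When total CL disk usage crosses the per-shard max, the commitlog
// does not delete segments; it invokes a registered callback for every
// dirty CF with the replay position it wants freed up to. The DB side
// registers a handler that initiates memtable->sstable flushes.
struct commitlog {
    using flush_handler = std::function<void(int cf_id, uint64_t rp)>;
    flush_handler on_flush;
    std::map<int, uint64_t> dirty;  // cf_id -> highest replay position
    uint64_t disk_usage = 0;
    uint64_t max_disk_size = 0;

    void register_flush_handler(flush_handler h) { on_flush = std::move(h); }

    void maybe_request_flush() {
        if (disk_usage <= max_disk_size || !on_flush) {
            return;
        }
        for (auto& [cf, rp] : dirty) {  // ask every dirty CF to flush
            on_flush(cf, rp);
        }
    }
};
```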
Right now, gossip returns hard coded cluster and partitioner name.
    sstring get_cluster_name() {
        // FIXME: DatabaseDescriptor.getClusterName()
        return "my_cluster_name";
    }

    sstring get_partitioner_name() {
        // FIXME: DatabaseDescriptor.getPartitionerName()
        return "my_partitioner_name";
    }
Fix it by setting the correct names from the config options.
With this,

    cqlsh 127.0.0.$i -e "SELECT * from system.local;"

returns the correct cluster_name.
Fixes #291
"This series deals with copies and moves of mutation. The former are dealt
with by adding std::move() and missing 'mutable' (in case of lambdas). The
latter are improved by storing mutation_partition externally thus removing
the need for moving mutation_partition each time mutation is moved.
Storing mutation_partition externally is obviously trading the cost of
move constructor for the cost of allocation which shows in perf_mutation
results since mutations aren't moved in that test.
perf_mutation (-c 1):
before: 3289520.06 tps
after: 3183023.37 tps
diff: -3.24%
perf_simple_query (read):
before: 526954.05 tps
after: 577225.16 tps
diff +9.54%
perf_simple_query (write):
before: 731832.70 tps
after: 734923.60 tps
diff: +0.42%
Fixes #150 (well, not completely)."
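The move-vs-allocation trade-off described above can be shown with a toy model (not the real mutation type):

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Keeping the (large) partition state behind a unique_ptr makes moving
// a mutation a pointer transfer, at the cost of one allocation per
// mutation -- which is exactly why perf_mutation (no moves) regresses
// slightly while the query paths (lots of moves) improve.
struct mutation_partition {
    std::vector<int> rows;  // stand-in for the heavy per-partition state
};

struct mutation {
    std::unique_ptr<mutation_partition> p =
        std::make_unique<mutation_partition>();  // up-front allocation
    mutation() = default;
    mutation(mutation&&) = default;              // move = pointer transfer
    mutation& operator=(mutation&&) = default;
};
```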
Fixes #266
Some callsites are fine: if we just get the message and process it, as is the
case with check_health for instance, msg will be alive and all is good. But if
we return a future inside the processing, msg must be kept alive. Classic bug,
appearing again.
Pekka saw this in practice in another bug. We haven't seen anything that is
related to this, but it is certainly wrong.
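The lifetime bug pattern described above, in a minimal non-seastar form (illustrative only):

```cpp
#include <functional>
#include <string>
#include <utility>

// If processing returns deferred work that refers to msg, msg must be
// moved into the continuation. Capturing by reference (the bug) leaves
// a dangling reference once the caller's frame unwinds:
//
//   return [&msg] { return msg; };   // BAD: dangles after return
//
// Moving it in keeps msg alive exactly as long as the deferred work:
std::function<std::string()> process(std::string msg) {
    return [m = std::move(msg)] { return m; };  // GOOD: owns the message
}
```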
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Fixes#99
Adding missing commitlog metrics to the rest API.
v2: Mis-sent (clumsy fingers)
v3: Use map_reduce0 + subroutine for nicer code
v4: rebased on current master
v5: rebased yet again.
Since the _second_ file in this previous patch set was committed, and is
dependent on this very change below to even compile, some expediency might be
warranted.
* Fixes#247
* Re-introduce test_allocation_failure, but allow for the "failure" to not
happen. I.e. if run with low memory settings, the test will check that
allocation failure is graceful. With lots of memory it will check partial
write.
* Make it more like Origin, i.e. based on wall clock time of app start
* Encode the shard ID in the RP segment ID, to ensure RPs and segment names
  are unique per shard (as in Origin)
* Note: removed commitlog_test:test_allocation_failure because with
segments limited to 4GB -> mutation limited to 2GB, actually forcing
a fail is not guaranteed or even likely.