"Refs #293
* Add a commitlog::sync_all_segments, that explicitly forces all pending
disk writes
* Only delete segments from disk IFF they are marked clean. Thus on partial
shutdown or whatnot, even if CL is destroyed (destructor runs) disk files
not yet clean visavi sstables are preserved and replayable
* Do a sync_all_segments first of all in database::stop.
Exactly what to not stop in main I leave up to others discretion, or at least
another patch."
From Pawel:
This series makes compaction remove items that are no longer live
(sketched below):
- expired cells are changed into tombstones
- items covered by higher-level tombstones are removed
- expired tombstones are removed where possible
Fixes #70.
Fixes #71.
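A minimal sketch of the three rules above applied to a single cell; all
types and accessors here are simplified stand-ins, not the actual
compaction code:

#include <cstdint>
#include <optional>

struct tombstone { int64_t timestamp; };
struct cell {
    bool live;
    int64_t timestamp;
    int64_t expiry;         // when a live cell expires
    int64_t deletion_time;  // when a dead cell was deleted
};

std::optional<cell> compact_cell(cell c, int64_t now,
                                 tombstone covering, int64_t gc_before) {
    if (c.live && c.expiry <= now) {
        c.live = false;                // expired cell -> tombstone
        c.deletion_time = c.expiry;
    }
    if (!c.live && c.timestamp <= covering.timestamp) {
        return std::nullopt;           // covered by a higher-level tombstone
    }
    if (!c.live && c.deletion_time < gc_before) {
        return std::nullopt;           // expired tombstone, safe to drop
    }
    return c;                          // still live or not yet collectible
}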
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)
Refs #262
The commit of the original series was done on a stale version (v2) due to the
author's inability to multitask and update git repos.
v3:
* Removed the future<> return value from callbacks. I.e. the flush callback
is now fully synchronous over the actual call.
"Fixes #262
Handles CL disk size exceeding configured max size by calling flush handlers
for each dirty CF id / high replay_position mark. (Instead of uncontrolled
delete as previously).
* Increased default max disk size to 8GB. Same as Origin/scylla.yaml (so no
real change, but synced).
* Divide the max disk size by cpus (so sum of all shards == max)
* Abstract flush callbacks in CL
* Handler in DB that initiates memtable->sstable writes when called.
Note that the flush request is done "syncronously" in new_segment() (i.e.
when getting a new segment and crossing threshold). This is however more or
less congruent with Origin, which will do a request-sync in the corresponding
case.
Actual dealing with the request should at least in production code however be
done async, and in DB it is, i.e. we initiate sstable writes. Hopefully
they finish soon, and CL segments will be released (before next segment is
allocated).
If the flush request does _not_ eventually result in any CF:s becoming
clean and segments released we could potentially be issuing flushes
repeatedly, but never more often than on every new segment."
* Do not throw away commitlog segments on disk size overflow.
Instead, issue a flush request (i.e. calculate the RP we want to free up to,
and for all dirty CFs, do a request).
"Abstracted" as a registerable callback, i.e. it is the DB's responsibility
to actually do something with it (sketched below).
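A minimal sketch of the callback abstraction described above; all names and
types here are simplified stand-ins, not the actual API:

#include <cstdint>
#include <functional>
#include <utility>

using cf_id_type = uint64_t;                       // stand-in CF id
struct replay_position { unsigned shard; uint64_t pos; };

// The CL only knows "someone should flush CF <id> up to <rp>"; the DB
// registers a handler that turns that into memtable->sstable writes.
// Note the handler returns void (v3): it is synchronous over the call,
// while any flushing it triggers runs asynchronously in the DB.
using flush_handler = std::function<void(cf_id_type, replay_position)>;

class commitlog {
    flush_handler _on_overflow;
public:
    void add_flush_handler(flush_handler h) {
        _on_overflow = std::move(h);
    }
    void on_disk_limit_exceeded(cf_id_type cf, replay_position rp) {
        if (_on_overflow) {
            _on_overflow(cf, rp); // a request only; segments are freed later
        }
    }
};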
Right now, gossip returns hard-coded cluster and partitioner names:
sstring get_cluster_name() {
    // FIXME: DatabaseDescriptor.getClusterName()
    return "my_cluster_name";
}

sstring get_partitioner_name() {
    // FIXME: DatabaseDescriptor.getPartitionerName()
    return "my_partitioner_name";
}
Fix it by setting the correct name from the configuration option.
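A minimal sketch of the fix; the config accessor names are assumptions
about the shape of the change, not the exact code:

sstring get_cluster_name() {
    // Return the value loaded from the configuration (scylla.yaml or the
    // command line) instead of a hard-coded constant.
    return _db.local().get_config().cluster_name();
}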
With this,
cqlsh 127.0.0.$i -e "SELECT * from system.local;"
returns the correct cluster_name.
Fixes #291
"This series deals with copies and moves of mutation. The former are dealt
with by adding std::move() and missing 'mutable' (in case of lambdas). The
latter are improved by storing mutation_partition externally thus removing
the need for moving mutation_partition each time mutation is moved.
Storing mutation_partition externally is obviously trading the cost of
move constructor for the cost of allocation which shows in perf_mutation
results since mutations aren't moved in that test.
perf_mutation (-c 1):
before: 3289520.06 tps
after: 3183023.37 tps
diff: -3.24%
perf_simple_query (read):
before: 526954.05 tps
after: 577225.16 tps
diff +9.54%
perf_simple_query (write):
before: 731832.70 tps
after: 734923.60 tps
diff: +0.42%
Fixes#150 (well, not completely)."
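A minimal sketch of the layout change described in the series (simplified;
the real mutation holds more state):

#include <memory>

struct mutation_partition { /* large; expensive to move field by field */ };

class mutation {
    // Stored externally: moving a mutation is now a single pointer swap,
    // at the price of one allocation per mutation (which is what shows up
    // in the perf_mutation numbers above).
    std::unique_ptr<mutation_partition> _p;
public:
    mutation() : _p(std::make_unique<mutation_partition>()) {}
    mutation(mutation&&) noexcept = default;
    mutation_partition& partition() { return *_p; }
};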
Fixes #266
Some call sites are fine: if we just get the message and process it, as is
the case with check_health for instance, msg will be alive and all is good.
But if we return a future inside the processing, msg must be kept alive (see
the sketch below). A classic bug, appearing again.
Pekka saw this in practice in another bug. We haven't seen anything related
to this one yet, but it is certainly wrong.
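A minimal sketch of the pattern in question; msg_type and process() are
hypothetical stand-ins:

#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
using namespace seastar;

struct msg_type {};
future<> process(msg_type&); // assume this returns a still-pending future

// BUGGY: msg may be destroyed before the returned future resolves, since
// process() only holds a reference to it.
future<> handle_buggy(lw_shared_ptr<msg_type> msg) {
    return process(*msg);
}

// FIXED: capturing msg by value in a continuation keeps it alive until
// the processing has actually finished.
future<> handle_fixed(lw_shared_ptr<msg_type> msg) {
    return process(*msg).finally([msg] {});
}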
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Fixes #99
Adding the missing commitlog metrics to the REST API.
v2: Mis-send (clumsy fingers)
v3: Use map_reduce0 + a subroutine for nicer code (sketched below)
v4: rebased on current master
v5: rebased yet again.
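A minimal sketch of the map_reduce0 approach from v3; the metric accessor
is an assumed stand-in, not the actual commitlog API:

// Sum a per-shard commitlog counter across all shards.
future<uint64_t> total_pending_writes(distributed<database>& db) {
    return db.map_reduce0(
        [](database& local) {                           // mapper, per shard
            return local.commitlog()->pending_writes(); // assumed accessor
        },
        uint64_t(0),                                    // initial value
        std::plus<uint64_t>());                         // reducer
}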
Since the _second_ file in this previous patch set was committed, and is
dependent on this very change below to even compile, some expediency might
be warranted.
* Fixes #247
* Re-introduce test_allocation_failure, but allow for the "failure" to not
happen. I.e. if run with low memory settings, the test will check that
allocation failure is graceful; with lots of memory it will check the
partial write (see the sketch below).
* Make it more like Origin, i.e. based on the wall clock time of app start.
* Encode the shard ID in the RP segment ID, to ensure RPs and segment names
are unique per shard, as in Origin.
* Note: removed commitlog_test:test_allocation_failure because, with
segments limited to 4GB -> mutations limited to 2GB, actually forcing
a failure is not guaranteed or even likely.
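A minimal sketch of the relaxed test described above, in seastar test
style; the commitlog helpers are assumed names:

SEASTAR_TEST_CASE(test_allocation_failure) {
    return make_test_commitlog().then([](auto log) {   // assumed helper
        const size_t too_big = log->max_record_size() + 1;
        return log->allocate(too_big)                  // assumed entry point
            .then([](auto&& rp) {
                // Lots of memory: the (partial) write went through - pass.
            })
            .handle_exception_type([](const std::bad_alloc&) {
                // Low memory: graceful allocation failure - also a pass.
            });
    });
}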
Before:
host_id in system.local is empty
After:
host_id in system.local is inserted correctly
This fixes a nasty problem where we would always get a new host_id when
booting up a node with existing data.
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables are inspected for high water mark, and then replayed from
those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
per _previous_ runs shards, not current.
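A minimal sketch of the ID scheme described above; the bit layout is an
illustration only, not the actual encoding:

#include <algorithm>
#include <chrono>
#include <cstdint>

// Derive the next segment ID: monotonically increasing, loosely tied to
// wall clock time, with the shard folded into the low bits so replay
// positions and segment names stay unique per shard.
uint64_t next_segment_id(uint64_t prev_id, unsigned shard) {
    using namespace std::chrono;
    uint64_t ms = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()).count();
    uint64_t base = std::max(prev_id + 1, ms);
    // Illustrative layout: up to 2^10 shards in the low bits.
    return (base << 10) | (shard & 0x3ff);
}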
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob
files as in Origin. Partly because I am lazy, but also partly because our
serial format differs, and we currently have no tools to do anything useful
with such files.
* No replay filtering (Origin allows a system property to designate a
filter file detailing which keyspace/CFs to replay). Partly because we have
no system properties.
There is no unit test for the commit log replayer (yet),
because I could not really come up with a good one given the test
infrastructure that exists (it is tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), then kill -9 + restart.
This of course does not fully validate whether the resulting DB is 100%
identical to the one at the time of the kill -9, but it at least verifies
that replay took place and mutations were applied.
(Note that Origin also lacks validity testing.)"
We set the status to COMPLETED in join_token_ring:
set_bootstrap_state(db::system_keyspace::bootstrap_state::COMPLETED)
but
cqlsh 127.0.0.$i -e "SELECT * from system.local;"
shows
bootstrapped -> IN_PROGRESS
The static sstring state_name is the bad boy (see the sketch below).
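A minimal sketch of the bug pattern, with simplified names (the real code
lives in db::system_keyspace):

// BUGGY: a function-local static is initialized exactly once, on the
// first call, so the stored name never follows later state changes.
sstring state_name_buggy(bootstrap_state s) {
    static sstring name = name_of(s); // first call wins forever
    return name;
}

// FIXED: recompute the name on every call.
sstring state_name_fixed(bootstrap_state s) {
    return name_of(s);
}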