"Here's another round of cleanups to the CQL code. Nothing exciting here,
mostly moving code to source files, which makes changing the code less
painful in terms of compilation times."
My plan was originally to have two separate sets of tests: one for the index,
and one for the data. With most of the code having ended up in the .hh file anyway,
this distinction became a bit pointless.
Let's put everything here.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as Origin:
max(previous ID, wall clock time in ms), plus shard info (for us).
* Sstables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they and the existing
sstables are inspected for the high water mark, and the log is then replayed
from that mark to recover mutations potentially lost in a crash (a sketch of
the idea follows this list).
* Note that a CPU count change is "handled" only insofar as shard matching is
done against the _previous_ run's shards, not the current ones.
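For illustration, here is a rough, self-contained sketch of the high-water-mark
idea mentioned above; the types and names are toy stand-ins, not Scylla's actual
replay_position or commitlog classes:

// Sketch only: each sstable records the latest commit log position it covers;
// replay re-applies only entries written after the maximum of those marks.
#include <algorithm>
#include <cstdint>
#include <vector>

struct replay_position {
    uint64_t segment_id; // monotonically increasing commit log segment id
    uint32_t offset;     // byte offset within that segment
    bool operator<(const replay_position& o) const {
        return segment_id != o.segment_id ? segment_id < o.segment_id
                                          : offset < o.offset;
    }
};

// The high water mark is the latest position already made durable in sstables.
replay_position high_water_mark(const std::vector<replay_position>& sstable_marks) {
    replay_position hwm{0, 0};
    for (const auto& rp : sstable_marks) {
        hwm = std::max(hwm, rp);
    }
    return hwm;
}

// Entries at or before the mark are already in sstables and can be skipped.
bool should_replay(const replay_position& entry, const replay_position& hwm) {
    return hwm < entry;
}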
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding as far
as I know, so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
as in Origin. Partly because I am lazy, but also partly because our serialization
format differs, and we currently have no tools to do anything useful with such files.
* No replay filtering (Origin allows a system property to designate a filter
file detailing which keyspaces/CFs to replay), partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could not
really come up with a good one given the existing test infrastructure (it is
tricky to kill things just "right").
The functionality is verified by manual testing, i.e., running scylla,
building up data (cassandra-stress), then kill -9 + restart.
This of course does not fully validate that the resulting DB is 100% identical
to the one at the moment of the kill -9, but it at least verifies that replay
took place and mutations were applied.
(Note that Origin also lacks validity testing.)"
Like boost::dynamic_bitset, but less capable. On the other hand, it avoids
the very large allocations incurred by the bloom filter's bitset on even
moderately sized sstables.
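A minimal sketch of the shape such a bitset can take (illustrative only, not
the actual class added here): the bits live in independently allocated,
fixed-size chunks, so even a huge bloom filter never needs one contiguous
multi-megabyte buffer.

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class chunked_bitset {
    static constexpr std::size_t bits_per_chunk = 128 * 1024 * 8; // 128 KiB chunks
    std::vector<std::unique_ptr<uint64_t[]>> _chunks;             // one small allocation each
    std::size_t _nbits;
public:
    explicit chunked_bitset(std::size_t nbits) : _nbits(nbits) {
        std::size_t nchunks = (nbits + bits_per_chunk - 1) / bits_per_chunk;
        for (std::size_t i = 0; i < nchunks; i++) {
            // make_unique value-initializes the words to zero
            _chunks.push_back(std::make_unique<uint64_t[]>(bits_per_chunk / 64));
        }
    }
    void set(std::size_t i) {
        _chunks[i / bits_per_chunk][(i % bits_per_chunk) / 64] |= uint64_t(1) << (i % 64);
    }
    bool test(std::size_t i) const {
        return (_chunks[i / bits_per_chunk][(i % bits_per_chunk) / 64] >> (i % 64)) & 1;
    }
    std::size_t size() const { return _nbits; }
};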
This heavily used function shows up in many places in the profile (as part
of other functions), so it's worth optimizing by eliminating the special
case for the standard allocator. Use a statically allocated object instead.
(A non-thread-local object is fine since it has no data members.)
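A minimal sketch of the pattern, with illustrative names (not the actual
Scylla allocation-strategy API): instead of branching on "is the standard
allocator in use?" in the hot function, the default case is a single
statically allocated, stateless object, so the hot path is always the same
pointer dispatch.

#include <cstddef>
#include <cstdlib>

struct allocation_strategy {
    virtual void* alloc(std::size_t n) = 0;
    virtual void free(void* p) = 0;
    virtual ~allocation_strategy() = default;
};

struct standard_allocation_strategy final : allocation_strategy {
    void* alloc(std::size_t n) override { return std::malloc(n); }
    void free(void* p) override { std::free(p); }
};

// Stateless, so one shared, non-thread-local instance is safe for all shards.
static standard_allocation_strategy standard_strategy;

// Every thread's "current allocator" starts out pointing at the static object,
// so the hot path never needs a special case for the standard allocator.
thread_local allocation_strategy* current_allocator = &standard_strategy;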
"This series introduces the i_endpoint_snitch::reset_snitch() static method
that allows to replace the current (global) snitch instance with the new one.
This is done in an (per-shard) atomic way transparent so anyone holding a reference
to snitch_ptr.
This series starts with some cleanups, adds the above method and the unit test
that verifies its functionality."
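A minimal sketch of the mechanism this describes (toy types, not the real
i_endpoint_snitch/snitch_ptr classes): callers hold a reference to a per-shard
wrapper rather than to the snitch itself, so swapping the wrapped instance is
invisible to them; since each shard is a single thread, the swap needs no
locking within the shard.

#include <memory>
#include <string>
#include <utility>

struct snitch {                       // stand-in for i_endpoint_snitch
    virtual std::string get_rack() const = 0;
    virtual ~snitch() = default;
};

class snitch_wrapper {                // stand-in for snitch_ptr
    std::unique_ptr<snitch> _impl;
public:
    explicit snitch_wrapper(std::unique_ptr<snitch> s) : _impl(std::move(s)) {}
    snitch* operator->() const { return _impl.get(); }
    // reset(): swap in a new implementation; anyone holding a reference to
    // this wrapper transparently sees the new snitch on the next call.
    void reset(std::unique_ptr<snitch> new_snitch) {
        _impl = std::move(new_snitch);
    }
};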
"I am currently looking at the performance of our index_read, since it was in
the past pinpointed at the source of problems.
While the read side is the one that is mostly interesting, I would like to test
both - besides anything else, it is easier to test reads after writes so we
don't have to create synthetic data with outside tools.
This patch introduces the write side benchmark (read side will hopefully come
tomorrow). While the write side is, as mentioned, not the most interesting
part, I did see some standing from the flamegraph that allowed me to optimize
one particular function, yielding a 8.6 % improvement."
This is a test that allows us to measure the performance of our sstable index
reads and writes (currently only writes are implemented). A lot of potentially
common code is put into a header, which will make writing new tests easier if
needed.
We don't want to take shortcuts for this, so all reading and writing is done
through public sstable interfaces.
For writing, there is no way to write the index without writing the datafile.
But because we are only writing the primary key, the datafile will not contain
anything else. This is the closest we can get to index-only testing with the
public interfaces.
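The timing side of such a benchmark is simple; a generic sketch of its
skeleton follows (illustrative only - the real test drives the public sstable
writer, and write_one_key here is just a placeholder callable): write N
partitions that carry nothing but their key, and report throughput.

#include <chrono>
#include <cstddef>
#include <cstdio>

// write_one_key(i) is whatever writes a single key-only partition through the
// public sstable interface; it is passed in so this skeleton stays generic.
template <typename WriteOneKey>
void run_index_write_benchmark(std::size_t n, WriteOneKey write_one_key) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; i++) {
        write_one_key(i);
    }
    auto secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%zu keys in %.3f s (%.0f keys/s)\n", n, secs, n / secs);
}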
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Currently, each column family creates a fiber to handle compaction requests
in parallel with the rest of the system. If there are N column families, N
compactions could be running in parallel, which is definitely horrible.
To solve that problem, a per-database compaction manager is introduced here.
The compaction manager services compaction requests from all N column
families. Parallelism is provided by creating more than one fiber to service
the requests; that is, N compaction requests will be served by M fibers.
A submitted compaction request goes to a job queue shared between all fibers,
and the fiber with the fewest pending jobs is signalled.
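A self-contained toy model of that scheme (std::thread stands in for a fiber;
the names are illustrative, not the actual compaction_manager API): jobs land
in one shared queue, and submission wakes the worker with the fewest
outstanding wake-ups.

#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class compaction_manager_sketch {
    struct worker {
        std::condition_variable cv;
        int pending = 0;                     // wake-ups handed to this worker
    };
    std::mutex _mu;
    std::deque<std::function<void()>> _jobs; // job queue shared by all workers
    std::vector<std::unique_ptr<worker>> _workers;
    std::vector<std::thread> _threads;
    bool _stopping = false;
public:
    explicit compaction_manager_sketch(int m) {
        for (int i = 0; i < m; i++) {
            _workers.push_back(std::make_unique<worker>());
            _threads.emplace_back([this, w = _workers.back().get()] {
                std::unique_lock<std::mutex> lk(_mu);
                while (true) {
                    w->cv.wait(lk, [&] { return _stopping || w->pending > 0; });
                    if (_stopping) {
                        return;              // pending jobs are dropped in this toy
                    }
                    w->pending--;
                    auto job = std::move(_jobs.front());
                    _jobs.pop_front();
                    lk.unlock();
                    job();                   // run the compaction request
                    lk.lock();
                }
            });
        }
    }
    // Submit a request: enqueue it and signal the least-loaded worker.
    void submit(std::function<void()> job) {
        std::lock_guard<std::mutex> lk(_mu);
        _jobs.push_back(std::move(job));
        worker* least = _workers.front().get();
        for (auto& w : _workers) {
            if (w->pending < least->pending) {
                least = w.get();
            }
        }
        least->pending++;
        least->cv.notify_one();
    }
    ~compaction_manager_sketch() {
        { std::lock_guard<std::mutex> lk(_mu); _stopping = true; }
        for (auto& w : _workers) {
            w->cv.notify_one();
        }
        for (auto& t : _threads) {
            t.join();
        }
    }
};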
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
There's nothing legacy about it, so rename legacy_schema_tables to
schema_tables. The naming comes from a Cassandra 3.x development branch
which is not relevant for us in the near future.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
This patch adds the beginning of node repair support. Repair is initiated
on a node using the REST API. For example, to repair all the column families
in the "try1" keyspace, you can use:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
I tested that the repair already works (it exchanges mutations with all the
other replicas and successfully repairs them), so I think this can be
committed, but more work is needed to complete it:
1. Repair options are not yet supported (range repair, sequential/parallel
repair, choice of hosts, datacenters and column families, etc.).
2. *All* the data of the keyspace is exchanged - Merkle Trees (or an
alternative optimization) and partial data exchange haven't been
implemented yet.
3. Full repair for nodes with multiple separate ranges is not yet
implemented correctly. E.g., consider 10 nodes with vnodes and RF=2:
each vnode's range has a different host as a replica, so we need
to exchange each key range separately with a different remote host.
4. Our repair operation returns a numeric operation id (like Origin),
but we don't yet provide any means to use this id to check on ongoing
repairs like Origin allows.
5. Error handling, logging, etc., needs to be improved.
6. SMP nodes (with multiple shards) should work correctly (thanks to
Asias's latest patch for SMP mutation streaming) but haven't been
tested.
7. Incremental repair is not supported (see
http://www.datastax.com/dev/blog/more-efficient-repairs)
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds a "ninja clean", better than the current "ninja -t clean".
Ninja's "ninja -t clean" is a nice trick, designed to save the Makefile writer
the tedious chore of listing the targets to remove, by automatically gathering
this list. But our build system, following OSv's, actually uses a much
cooler (and better) trick: all build files are generated in a single
subdirectory, "build/", and cleaning the build products is as simple as
"rm -rf build".
So this patch adds a target, "ninja clean", which does exactly this (rm -rf
build). "ninja clean" is not only easier to type than "ninja -t clean", it
also has one important benefit: When the ninja rules change, "ninja -t clean"
doesn't remember to delete now-defunct targets, and they stay behind. On my
build machine, "ninja -t clean" left behind almost a gigabyte of old crap.
Moreover, when the ninja file changes drastically (as it changed a few days
ago), not cleaning up everything can even cause new builds to break - e.g.,
when something was previously a file and now needs to be a directory.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch introduces an init.cc file which hosts all the initialization
code. The benefits are: 1) we can share initialization code with the test
code; 2) all the service startup dependency/ordering code is in one
single place instead of scattered everywhere.
The utils file will hold general-purpose utilities that need to be used by
multiple modules.
As a start, it holds the histogram definition.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Instead of trying to second-guess the seastar build system, always rebuild
libseastar.a. Specify restat = 1 so that binaries are only relinked if
something truly changed.
The idea is to reuse the same testing code on any mutation_source, for
example on a memtable.
The range query test cases are now part of a generic mutation_source
test suite.
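A small, self-contained illustration of the pattern (toy types, not the actual
mutation/mutation_source classes): the suite is parameterized by a "populate"
factory, so the same assertions run against any backend that can build one -
here an ordered in-memory map standing in for a memtable.

#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using mutation = std::pair<int, std::string>;        // toy key -> value
struct mutation_source {                             // toy read-only interface
    std::function<std::vector<mutation>(int lo, int hi)> read_range;
};
using populate_fn = std::function<mutation_source(std::vector<mutation>)>;

// The generic suite: any backend that can implement populate_fn gets the same
// range-query coverage for free.
void run_mutation_source_tests(const populate_fn& populate) {
    mutation_source ms = populate({{1, "a"}, {3, "b"}, {7, "c"}});
    assert(ms.read_range(0, 5).size() == 2);         // keys 1 and 3
    assert(ms.read_range(4, 10).size() == 1);        // key 7
}

int main() {
    // Example backend: an in-memory map playing the role of a memtable.
    run_mutation_source_tests([](std::vector<mutation> rows) {
        auto table = std::make_shared<std::map<int, std::string>>(rows.begin(), rows.end());
        return mutation_source{[table](int lo, int hi) {
            std::vector<mutation> out;
            for (auto it = table->lower_bound(lo); it != table->end() && it->first <= hi; ++it) {
                out.emplace_back(it->first, it->second);
            }
            return out;
        }};
    });
}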
The functionality is similar to RuntimeMXBean.getUptime(), which is needed
in the schema pulling logic.
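A minimal sketch of one way to provide it (illustrative, assuming a
steady-clock timestamp captured at startup): record the start time once and
report elapsed milliseconds, mirroring what the JVM's RuntimeMXBean.getUptime()
returns.

#include <chrono>
#include <cstdint>

// Captured once during static initialization, i.e. at process startup.
static const auto process_start = std::chrono::steady_clock::now();

// Milliseconds the process has been up.
int64_t uptime_ms() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - process_start).count();
}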
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>