The reader has a field for the sstable, but we are not initializing it, so the
sstable can be destroyed before we finish our job. It seems to work here, but
transposing this code to the test case crashed it, which means that at some
point we will crash here as well.
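A minimal sketch of the fix's idea, assuming a simplified setup: the reader holds a shared pointer to the sstable and initializes it in its constructor, so the sstable outlives every other owner for as long as the reader exists. `sstable` and `sstable_reader` here are hypothetical stand-ins, and `std::shared_ptr` is used in place of whatever shared pointer type the real code uses.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an sstable; only its lifetime matters here.
struct sstable {
    std::string name;
    explicit sstable(std::string n) : name(std::move(n)) {}
};

// The reader initializes its sstable field in the constructor, which keeps
// the sstable alive until the reader itself is destroyed.
class sstable_reader {
    std::shared_ptr<sstable> _sst;
public:
    explicit sstable_reader(std::shared_ptr<sstable> sst) : _sst(std::move(sst)) {}
    const std::string& source() const { return _sst->name; }
};
```

Even if the caller drops its own pointer right after creating the reader, the reader can still use the sstable safely.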
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Read-ahead will require that we close input_streams. As part of that
we have to close sstables, and mutation_readers (which encapsulate
input_streams). This is part 1 of a patchset series to do that.
(The overarching goal is to enable read-ahead for sstables, see #244)
Conflicts:
sstables/compaction.cc
Using a lambda for implementing a mutation_reader is nifty, but does not
allow us to add methods.
Switch to a class-based implementation in anticipation of adding a close()
method.
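A sketch of the class-based shape, assuming a synchronous reader over an in-memory vector (the real mutation_reader returns futures and reads from sstables). `mutation` and `vector_mutation_reader` are hypothetical names. `operator()` replaces the old lambda body, and `close()` becomes possible because the reader is now a class.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation; only the key matters here.
struct mutation {
    int key;
};

// Class-based reader: operator() plays the role the lambda used to play,
// and close() is a method the lambda form could not offer.
class vector_mutation_reader {
    std::vector<mutation> _data;
    std::size_t _pos = 0;
    bool _closed = false;
public:
    explicit vector_mutation_reader(std::vector<mutation> data) : _data(std::move(data)) {}

    // Next mutation, or std::nullopt at end of stream or after close().
    std::optional<mutation> operator()() {
        if (_closed || _pos == _data.size()) {
            return std::nullopt;
        }
        return _data[_pos++];
    }

    // Releases resources; an sstable-backed reader would close its
    // input_stream here.
    void close() {
        _closed = true;
        _data.clear();
    }

    bool closed() const { return _closed; }
};
```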
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.
This patch establishes a hard limit of 50% of all memory. Above that,
new requests are not allowed to start. This allows the system some time
to clean up memory.
Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.
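A minimal sketch of the hard limit as admission control, assuming dirty memory is tracked as a byte counter that grows on writes and shrinks on flush. The class and method names are hypothetical. Only the 50% threshold and the "new requests are not allowed to start" rule come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hard admission limit on dirty (not yet flushed) memory.
class dirty_memory_limiter {
    std::size_t _total;
    std::size_t _dirty = 0;
public:
    explicit dirty_memory_limiter(std::size_t total_memory) : _total(total_memory) {}

    // Hard limit: 50% of all memory.
    std::size_t hard_limit() const { return _total / 2; }

    // A new request may start only while dirty memory is below the limit.
    bool may_start_request() const { return _dirty < hard_limit(); }

    void on_write(std::size_t bytes) { _dirty += bytes; }

    // Called when a memtable is flushed and its memory becomes reclaimable.
    void on_flush(std::size_t bytes) { _dirty -= std::min(bytes, _dirty); }
};
```

Requests already in flight are not stopped. Only new ones are held back until flushing brings dirty memory below the threshold again.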
Tested with a mixed read/write load: after reads start to dominate writes
(due to the proliferation of small sstables and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as origin:
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they and the existing
sstables are inspected for a high water mark, and the segments are then
replayed from those marks to amend mutations potentially lost in a crash
* Note that a change in CPU count is "handled" insofar as shard matching is
per the _previous_ run's shards, not the current ones.
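A sketch of the segment ID scheme from the list above, with two assumptions that are not in the description. First, the bit layout is hypothetical: the shard sits in the top 16 bits and the millisecond timestamp in the low 48 bits. Second, the sketch takes max(previous + 1, now) rather than max(previous, now), so that two segments created in the same millisecond still get distinct, increasing IDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical layout: [ shard : 16 bits | timestamp in ms : 48 bits ].
constexpr unsigned shard_bits = 16;
constexpr unsigned ts_bits = 64 - shard_bits;
constexpr std::uint64_t ts_mask = (std::uint64_t(1) << ts_bits) - 1;

// Next ID never goes backwards, even if the wall clock does.
std::uint64_t next_segment_id(std::uint64_t previous_id, std::uint64_t now_ms, unsigned shard) {
    std::uint64_t ts = std::max((previous_id & ts_mask) + 1, now_ms & ts_mask);
    return (std::uint64_t(shard) << ts_bits) | (ts & ts_mask);
}

// Replay can recover the shard that wrote a segment, which is what lets
// shard matching follow the previous run's shards.
unsigned shard_of(std::uint64_t id) { return unsigned(id >> ts_bits); }
std::uint64_t timestamp_of(std::uint64_t id) { return id & ts_mask; }
```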
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like origin does. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it.
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspaces/CFs to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could not
really come up with a good one given the existing test infrastructure (it is
tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not fully validate that the resulting DB is 100% identical
to the one at the time of the kill -9, but it at least verified that replay
took place and that mutations were applied.
(Note that origin also lacks validity testing)"
"This series adds the missing code from origin to support this functionality.
While doing so, some methods were changed to be const where that was more
appropriate, and const versions of a few methods were added where both
variants were required."
This patch adds get_non_system_keyspaces, as found in origin, and exposes
the replication strategy through the get_replication_strategy method.
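A sketch of what get_non_system_keyspaces computes, assuming keyspaces are identified by name and that the set of system keyspace names is fixed. The list `{"system", "system_traces"}` is an illustrative assumption, not the actual set the patch uses, and this free function is not the real database method.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns every keyspace name that is not a system keyspace, in input order.
std::vector<std::string> get_non_system_keyspaces(const std::vector<std::string>& all) {
    // Assumed set of system keyspace names (illustrative only).
    static const std::vector<std::string> system_keyspaces = {"system", "system_traces"};
    std::vector<std::string> result;
    for (const auto& ks : all) {
        if (std::find(system_keyspaces.begin(), system_keyspaces.end(), ks) == system_keyspaces.end()) {
            result.push_back(ks);
        }
    }
    return result;
}
```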
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Make the exceptions created inside database::find_column_family() return
a readable message from their what() method.
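A sketch of the general technique, assuming the exception derives from `std::runtime_error` and builds its message in the constructor, so that `what()` returns it unchanged. The class name and message format are hypothetical, not the ones in the patch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical exception: the readable message is composed once, up front,
// and std::runtime_error::what() returns it.
class no_such_column_family : public std::runtime_error {
public:
    no_such_column_family(const std::string& ks, const std::string& cf)
        : std::runtime_error("Can't find a column family " + cf + " in keyspace " + ks) {}
};
```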
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Add an API function that returns the count of sstables in L0 if the leveled
compaction strategy is enabled, and 0 otherwise. Currently we don't
support the leveled compaction strategy, so the function always returns
zero.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses 2 tasks for
handling compactions.
The solution is to never duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
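A single-threaded sketch of the de-duplication rule. A cf is queued at most once. When a request arrives for a cf that is currently being compacted, the cf is flagged and re-submitted after that compaction finishes. `column_family` is an opaque hypothetical handle here, and the bookkeeping sets are an illustrative choice, not the manager's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_set>

struct column_family {};  // opaque handle (hypothetical)

class compaction_queue {
    std::deque<column_family*> _queue;
    std::unordered_set<column_family*> _queued;       // cfs currently in _queue
    std::unordered_set<column_family*> _running;      // cfs being compacted
    std::unordered_set<column_family*> _needs_rerun;  // resubmit when done
public:
    void submit(column_family* cf) {
        if (_running.count(cf)) {
            _needs_rerun.insert(cf);  // don't let a second task grab it now
            return;
        }
        if (_queued.insert(cf).second) {
            _queue.push_back(cf);     // first request for this cf
        }
    }

    // A task takes the next cf; since it is unique in the queue, no other
    // task can select the same sstables concurrently.
    column_family* start_next() {
        if (_queue.empty()) {
            return nullptr;
        }
        column_family* cf = _queue.front();
        _queue.pop_front();
        _queued.erase(cf);
        _running.insert(cf);
        return cf;
    }

    void finish(column_family* cf) {
        _running.erase(cf);
        if (_needs_rerun.erase(cf)) {
            submit(cf);  // re-schedule compaction for this cf
        }
    }

    std::size_t size() const { return _queue.size(); }
};
```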
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
We need a way to remove a column family from the compaction manager
because when dropping a column family we need to make sure that the
compaction manager doesn't hold a reference to it anymore.
So the compaction manager queue now holds column families, allowing us
to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.
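A sketch of the remove path, assuming continuation callbacks stand in for the futures the real code would return. `remove()` drops every queued request for the cf, then runs `on_done` either immediately or once the in-flight compaction of that cf terminates. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct column_family {};  // opaque handle (hypothetical)

class compaction_manager {
    std::deque<column_family*> _queue;
    // Running compactions, each with the continuations waiting for it.
    std::map<column_family*, std::vector<std::function<void()>>> _ongoing;
public:
    void submit(column_family* cf) { _queue.push_back(cf); }
    std::size_t pending() const { return _queue.size(); }

    column_family* start_next() {
        if (_queue.empty()) {
            return nullptr;
        }
        column_family* cf = _queue.front();
        _queue.pop_front();
        _ongoing[cf];  // mark as running, no waiters yet
        return cf;
    }

    void finish(column_family* cf) {
        auto it = _ongoing.find(cf);
        if (it == _ongoing.end()) {
            return;
        }
        auto waiters = std::move(it->second);
        _ongoing.erase(it);
        for (auto& w : waiters) {
            w();  // the compaction has terminated: wake up removers
        }
    }

    // Cancel queued requests for cf, then call on_done once no compaction
    // of cf is running (immediately if none is).
    void remove(column_family* cf, std::function<void()> on_done) {
        _queue.erase(std::remove(_queue.begin(), _queue.end(), cf), _queue.end());
        auto it = _ongoing.find(cf);
        if (it == _ongoing.end()) {
            on_done();
            return;
        }
        it->second.push_back(std::move(on_done));
    }
};
```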
The compaction manager test case was also adapted and improved.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
We can catch most errors when we try to load an sstable. But if the TOC file is
the one missing, we won't try to load the sstable at all. This case is still an
invalid case, but it is way easier for us to treat it by waiting for all files
to be loaded, and then checking if we saw a file during scan_dir, without its
corresponding TOC.
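A sketch of the post-scan check, assuming the scanned file names have already been parsed into (generation, component) pairs. After the whole directory has been seen, every generation with at least one component but no TOC is reported. The `"TOC.txt"` component string follows Cassandra's component naming and is an assumption here.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One parsed directory entry (hypothetical representation).
struct seen_component {
    long generation;
    std::string component;
};

// Generations for which some file was seen, but not the TOC.
std::vector<long> generations_missing_toc(const std::vector<seen_component>& seen) {
    std::set<long> all, with_toc;
    for (const auto& c : seen) {
        all.insert(c.generation);
        if (c.component == "TOC.txt") {
            with_toc.insert(c.generation);
        }
    }
    std::vector<long> missing;
    for (long g : all) {
        if (!with_toc.count(g)) {
            missing.push_back(g);
        }
    }
    return missing;
}
```

Running the check only after the scan completes avoids any dependence on the order in which the directory returns its entries.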
Fixes #114
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Currently, each column family creates a fiber to handle compaction requests
in parallel to the system. If there are N column families, N compactions
could be running in parallel, which is definitely horrible.
To solve that problem, a per-database compaction manager is introduced here.
The compaction manager services compaction requests from N column
families. Parallelism is provided by creating more than one fiber to
service the requests; in other words, N compaction requests are
served by M fibers.
A submitted compaction request goes to a job queue shared by
all fibers, and the fiber with the fewest pending jobs is
signalled.
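A sketch of the dispatch policy only, assuming workers are modelled as pending-job counters instead of fibers. Requests go into one shared queue, and each submission signals the worker with the fewest pending jobs. The class and method names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// N requests (one per column family) served by M workers.
class compaction_dispatcher {
    std::deque<std::string> _requests;  // shared queue of column family names
    std::vector<unsigned> _pending;     // pending jobs per worker
public:
    explicit compaction_dispatcher(unsigned workers) : _pending(workers, 0) {}

    // Enqueue a request and return the index of the signalled worker.
    // Ties go to the lowest index, since std::min_element returns the first minimum.
    unsigned submit(std::string cf) {
        _requests.push_back(std::move(cf));
        auto it = std::min_element(_pending.begin(), _pending.end());
        ++*it;
        return static_cast<unsigned>(it - _pending.begin());
    }

    // A signalled worker serves the oldest request from the shared queue.
    std::string run_one(unsigned worker) {
        assert(!_requests.empty() && _pending.at(worker) > 0);
        std::string cf = std::move(_requests.front());
        _requests.pop_front();
        --_pending[worker];
        return cf;
    }
};
```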
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Without this, Cassandra won't even try to read our sstables. The containing
directories will be ignored.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Let's change the default generated tables to ka, which is the format that is present
in Origin.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
A ka file has a slightly different name on disk. Change the
parser so that it can deal with both.
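A sketch of a parser that accepts both layouts, using `std::regex`. The two name patterns are assumptions based on the usual Cassandra conventions: ka files look like `<keyspace>-<cf>-ka-<generation>-<component>` (for example `ks-cf-ka-1-Data.db`), and la files look like `la-<generation>-<format>-<component>` (for example `la-1-big-Data.db`). The struct and function names are hypothetical, and the real parser may differ.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

struct entry_descriptor {
    std::string version;
    long generation;
    std::string component;
};

std::optional<entry_descriptor> parse_sstable_name(const std::string& fname) {
    // la: la-<generation>-<format>-<component>
    static const std::regex la(R"(la-(\d+)-(\w+)-(.*))");
    // ka: <keyspace>-<cf>-ka-<generation>-<component>; \w excludes '-'
    static const std::regex ka(R"((\w+)-(\w+)-ka-(\d+)-(.*))");
    std::smatch m;
    if (std::regex_match(fname, m, la)) {
        return entry_descriptor{"la", std::stol(m[1].str()), m[3].str()};
    }
    if (std::regex_match(fname, m, ka)) {
        return entry_descriptor{"ka", std::stol(m[3].str()), m[4].str()};
    }
    return std::nullopt;  // neither layout matched
}
```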
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
When a schema is available, we use it. However, we have, by now, way too many
tests. Some of them use tables for which we don't even know the schema. It would
have been a massive amount of work to require a schema for all of them - so I am
keeping both constructors around.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
It is currently only used to log a message, and for that we have an sstable
method that will do just fine. Keeping the name around only forces us to pass
it along through all the captures. Remove it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
ASan does not like commit 05c23c7f73
("database: Add create_keyspace_on_all() helper"):
==8112==WARNING: AddressSanitizer failed to allocate 0x7f88b84fc690 bytes
==8112==AddressSanitizer's allocator is terminating the process instead of returning 0
==8112==If you don't like this behavior set allocator_may_return_null=1
==8112==Sanitizer CHECK failed: ../../../../libsanitizer/sanitizer_common/sanitizer_allocator.cc:147 ((0)) != (0) (0, 0)
I was not able to determine the source of the bug. Make ASan happy by
reverting the code movement and using the "cpu zero" trick we use for
table creation.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
When probing for the type, I made the classic mistake of passing, as a
parameter, part of a structure that is moved into the capture. That is
what broke our tests.
But also, when stat'ing, de.name gives us only the name relative to the
directory being scanned. We need to prepend the directory for stat to succeed.
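A sketch of both fixes, assuming a hypothetical `probe_type()` that records the path it was asked to stat and runs its continuation immediately (the real code is asynchronous). The full path is computed from `de.name` before `de` is moved into the lambda, and it includes the directory.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct directory_entry {
    std::string name;
};

// Hypothetical stand-in for an asynchronous stat: records the probed path
// and runs the continuation right away.
std::vector<std::string> probed_paths;

template <typename Continuation>
void probe_type(std::string path, Continuation cont) {
    probed_paths.push_back(std::move(path));
    cont();
}

void scan_one(const std::string& dir, directory_entry de, std::vector<std::string>& names) {
    // Buggy form: probe_type(de.name, [de = std::move(de)] { ... });
    // Function arguments are evaluated in an unspecified order, so the
    // lambda may move de.name away before the path argument is built.
    // Fixed form: build the full path (directory + entry name) first.
    std::string path = dir + "/" + de.name;
    probe_type(std::move(path), [de = std::move(de), &names] { names.push_back(de.name); });
}
```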
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Our directory scanner currently requires a type to be passed, and we have a
FIXME saying that we should stat when there is none. On some filesystems,
in particular XFS, getdents won't return a type, meaning we have to probe
it manually.
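A sketch of the fallback using plain POSIX `<dirent.h>` and `stat()` rather than the project's own directory-listing API. The entry's `d_type` is trusted when the filesystem filled it in. On `DT_UNKNOWN`, as XFS may report, the entry is stat'ed using the directory-qualified path. The function name is hypothetical.

```cpp
#include <cassert>
#include <dirent.h>
#include <string>
#include <sys/stat.h>

// Resolve an entry's type, probing with stat() only when getdents gave none.
unsigned char entry_type(unsigned char d_type, const std::string& dir, const std::string& name) {
    if (d_type != DT_UNKNOWN) {
        return d_type;  // the filesystem already told us
    }
    struct stat st;
    // d_name is relative to the scanned directory, so prepend it.
    if (::stat((dir + "/" + name).c_str(), &st) != 0) {
        return DT_UNKNOWN;  // probe failed; leave the decision to the caller
    }
    if (S_ISDIR(st.st_mode)) {
        return DT_DIR;
    }
    if (S_ISREG(st.st_mode)) {
        return DT_REG;
    }
    return DT_UNKNOWN;
}
```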
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>