Adapt our compaction code to start writing a new sstable if the
one being written reaches its maximum size. The leveled strategy
relies on this concept. If a strategy other than leveled is being used,
everything works as before.
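A minimal sketch of the rollover idea, using hypothetical, simplified names rather than the actual compaction code:
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified writer: the real code streams rows into sstable
// writers; here "size" is just a byte counter.
struct sstable_writer {
    uint64_t written = 0;
    void write(const std::string& row) { written += row.size(); }
};

// Roll over to a new output sstable once the one being written reaches
// max_sstable_size; with an effectively unlimited size this degenerates to a
// single output, which is what non-leveled strategies get.
std::vector<std::unique_ptr<sstable_writer>>
compact(const std::vector<std::string>& rows, uint64_t max_sstable_size) {
    std::vector<std::unique_ptr<sstable_writer>> outputs;
    outputs.push_back(std::make_unique<sstable_writer>());
    for (const auto& row : rows) {
        if (outputs.back()->written >= max_sstable_size) {
            outputs.push_back(std::make_unique<sstable_writer>());
        }
        outputs.back()->write(row);
    }
    return outputs;
}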
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Prior to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize, and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
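The fix boils down to the following pattern; a sketch with simplified, illustrative names (the real code goes through invoke_on_all on the sharded database):
#include <chrono>
#include <cstddef>

// Illustrative stand-in for running a lambda on every shard; the real code
// uses seastar's invoke_on_all().
template <typename Func>
void invoke_on_all(std::size_t shard_count, Func f) {
    for (std::size_t shard = 0; shard < shard_count; ++shard) {
        f(shard);
    }
}

void truncate_all_shards(std::size_t shard_count) {
    // Compute the truncation time once, on the coordinating shard...
    const auto truncated_at = std::chrono::system_clock::now();
    // ...and hand the very same value to every shard, so all of them agree on
    // the snapshot name instead of each computing a slightly different time.
    invoke_on_all(shard_count, [truncated_at](std::size_t shard) {
        (void)shard;
        (void)truncated_at; // each shard snapshots/truncates using truncated_at
    });
}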
Signed-off-by: Glauber Costa <glommer@scylladb.com>
From Pekka:
This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
table_test CQL test in dtest passes:
[penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
table_test (cql_tests.TestCQL) ... ok
----------------------------------------------------------------------
Ran 1 test in 23.841s
OK
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.
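Roughly, the ordering looks like this (a simplified, synchronous sketch; the real code is future-based and the types here are stand-ins):
#include <map>
#include <string>
#include <utility>

struct keyspace {};
struct column_family {};

// The new truncate() variant: takes the keyspace and column_family by
// reference, so it can still be used after the cf was removed from the
// lookup tables. Body elided; illustration only.
void truncate(keyspace&, column_family&) {}

struct database_sketch {
    keyspace ks;
    std::map<std::string, column_family> column_families;

    void drop_column_family(const std::string& name) {
        auto it = column_families.find(name);
        if (it == column_families.end()) {
            return;
        }
        // Remove the cf from the lookup tables first, so no new request can
        // reach it, and only then truncate it; the opposite order would race
        // with concurrent operations resolving the cf by name.
        column_family cf = std::move(it->second);
        column_families.erase(it);
        truncate(ks, cf);
    }
};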
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Currently, we control the incremental backup behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.
The solution is to move it to the column family (and to the keyspace, so we can
properly propagate the config file value). When we change this from the API, we
will have to iterate over all of them, changing the value accordingly.
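A sketch of what the API-side change amounts to, with illustrative names only:
#include <string>
#include <unordered_map>

// Illustrative only: the flag now lives on the column family (seeded from the
// keyspace / config file) instead of being a single knob on the storage service.
struct column_family {
    bool incremental_backups = false;
};

struct database_sketch {
    std::unordered_map<std::string, column_family> column_families;

    // The API handler has to walk every column family and update the flag.
    void set_incremental_backups(bool enabled) {
        for (auto& [name, cf] : column_families) {
            (void)name;
            cf.incremental_backups = enabled;
        }
    }
};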
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We will need to change some properties of the keyspace / cf. We need an accessor
that is not marked as const.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value, so the
default value will be used.
To support gathering the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.
The read statistic is sample based and should have no real impact on
performance; if an impact does show up, we can always change it in the
future to a lower sampling rate.
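To illustrate why sampling keeps the cost down, a sketch with hypothetical names (this is not the actual histogram type used by the patch):
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sampled recorder: only one read out of every sample_rate
// touches the histogram, which keeps the per-read overhead negligible.
struct sampled_latency_histogram {
    std::vector<uint64_t> buckets = std::vector<uint64_t>(32, 0);
    uint64_t counter = 0;
    uint64_t sample_rate = 128; // hypothetical rate; can be lowered later

    void maybe_record(uint64_t latency_us) {
        if (++counter % sample_rate != 0) {
            return;
        }
        std::size_t bucket = 0; // log2-style bucketing
        while ((latency_us >>= 1) && bucket + 1 < buckets.size()) {
            ++bucket;
        }
        ++buckets[bucket];
    }
};

struct column_family_sketch {
    sampled_latency_histogram read_latency;

    // query() can no longer be const: recording the read latency mutates
    // the histogram statistics.
    void query(uint64_t observed_latency_us) {
        read_latency.maybe_record(observed_latency_us);
    }
};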
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
column_family
This patch adds a getter for the dirty_memory_region_group in the
database object and adds an occupancy method to column family that
returns the total occupancy of all the memtables in the column family.
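The occupancy part is conceptually just a sum over the cf's memtables; a sketch with stand-in types:
#include <cstddef>
#include <memory>
#include <vector>

// Simplified memtable stand-in; the real one reports the memory it holds.
struct memtable {
    std::size_t bytes_used = 0;
    std::size_t occupancy() const { return bytes_used; }
};

struct column_family_sketch {
    std::vector<std::shared_ptr<memtable>> memtables;

    // Total occupancy across every memtable held by this column family.
    std::size_t occupancy() const {
        std::size_t total = 0;
        for (const auto& mt : memtables) {
            total += mt->occupancy();
        }
        return total;
    }
};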
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
All database code was converted to use distributed<storage_proxy> when
storage_proxy was made distributed, but then new code was written to use
storage_proxy& again. Passing the distributed<> object is safer since it can be
passed between shards safely. There was a patch to fix one such case yesterday;
I found one more while converting.
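The difference, sketched with a minimal stand-in for the distributed<> template (signatures are illustrative, not the actual ones):
#include <vector>

struct storage_proxy {};

// Minimal stand-in for a distributed<> wrapper: one instance per shard,
// local() returning the calling shard's instance.
template <typename T>
struct distributed {
    std::vector<T> instances;
    T& local(unsigned shard = 0) { return instances.at(shard); }
};

// Risky to hold across shards: a plain reference is only meaningful on the
// shard that owns the object.
void handle_query_unsafe(storage_proxy& proxy);

// Safer: the distributed<> wrapper itself can be passed between shards, and
// each shard grabs its own instance via local() where it is actually used.
void handle_query(distributed<storage_proxy>& proxy) {
    storage_proxy& local_proxy = proxy.local();
    (void)local_proxy;
}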
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.
This patch establishes a hard limit of 50% of all memory. Above that,
new requests are not allowed to start. This allows the system some time
to clean up memory.
Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.
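In effect the admission check is this simple (a sketch; the names are hypothetical, not the actual ones):
#include <cstddef>

// Illustrative admission control: refuse to start new requests while dirty
// (not yet flushed) memory exceeds half of all memory, giving flushes and
// compaction time to reclaim it. The 50% figure is the hard limit above.
struct dirty_memory_limiter {
    std::size_t total_memory;
    std::size_t dirty_memory = 0;

    explicit dirty_memory_limiter(std::size_t total) : total_memory(total) {}

    bool may_admit_request() const {
        return dirty_memory < total_memory / 2;
    }
};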
Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as origin:
max(previous ID, wall clock time in ms) + shard info (for us)
* SSTables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they and the existing
sstables are inspected for their high-water marks, and the segments are then
replayed from those marks to recover mutations potentially lost in a crash
(see the sketch after this list)
* Note that a CPU count change is "handled" only insofar as shard matching is
done per the _previous_ run's shards, not the current ones.
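A heavily simplified sketch of that high-water-mark filtering (the real replayer is asynchronous and per-shard; the types below are stand-ins):
#include <cstdint>
#include <vector>

// Stand-in for the replay position stored in sstables: the segment the
// mutation came from and its offset within that segment.
struct replay_position {
    uint64_t segment_id = 0;
    uint32_t position = 0;
    bool operator<(const replay_position& o) const {
        return segment_id != o.segment_id ? segment_id < o.segment_id
                                          : position < o.position;
    }
};

struct logged_mutation {
    replay_position rp;
    std::vector<char> data;
};

// Replay only mutations written after the high-water mark recovered from the
// existing sstables, i.e. those potentially lost in a crash before a flush.
std::vector<logged_mutation>
mutations_to_replay(const std::vector<logged_mutation>& from_segments,
                    const replay_position& flushed_high_water_mark) {
    std::vector<logged_mutation> pending;
    for (const auto& m : from_segments) {
        if (flushed_high_water_mark < m.rp) {
            pending.push_back(m);
        }
    }
    return pending;
}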
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding, AFAIK,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
as in origin, partly because I am lazy, but also partly because our serial
format differs and we currently have no tools to do anything useful with it
* No replay filtering (origin allows a system property to designate a filter
file detailing which keyspace/cfs to replay), partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could not
really come up with a good one given the test infrastructure that exists (it
is tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), then kill -9 + restart.
This of course does not fully validate whether the resulting DB is 100% valid
compared to the one at the time of kill -9, but at least it verifies that
replay took place and that mutations were applied.
(Note that origin also lacks validity testing)"
This patch adds the get_non_system_keyspaces method that is found in origin,
and exposes the replication strategy via the get_replication_strategy
method.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Add to the API a function that returns the count of sstables in L0 if the
leveled compaction strategy is enabled, and 0 otherwise. Currently, we don't
support the leveled compaction strategy, so the function always returns zero.
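The intended behavior, as a sketch (illustrative names; today the leveled branch is simply never taken):
#include <cstdint>
#include <vector>

struct sstable_sketch {
    int level = 0;
};

// Once the leveled strategy is supported, count the sstables sitting in L0;
// with any other strategy the answer is defined to be zero.
int64_t sstables_in_l0(const std::vector<sstable_sketch>& ssts, bool leveled_enabled) {
    if (!leveled_enabled) {
        return 0;
    }
    int64_t count = 0;
    for (const auto& s : ssts) {
        if (s.level == 0) {
            ++count;
        }
    }
    return count;
}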
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses 2 tasks for
handling compactions.
The solution is to not duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
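The deduplication amounts to something like this (a synchronous sketch with illustrative names; the real queue is future-based):
#include <deque>
#include <string>
#include <unordered_set>

// Illustrative compaction queue: a cf is only enqueued if it isn't already
// pending, so two compaction tasks can never pick the same cf (and hence the
// same sstables) at once. Callers re-submit the cf later if more work remains.
struct compaction_queue {
    std::deque<std::string> pending;
    std::unordered_set<std::string> queued;

    bool submit(const std::string& cf) {
        if (!queued.insert(cf).second) {
            return false; // already queued; avoid a duplicate entry
        }
        pending.push_back(cf);
        return true;
    }

    std::string pop() {
        // precondition: !pending.empty()
        auto cf = pending.front();
        pending.pop_front();
        queued.erase(cf);
        return cf;
    }
};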
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
"This series exposes statistics from the row_cache in the cache_service API.
After this series the following methods will be available:
get_row_hits
get_row_requests
get_row_hit_rate
get_row_size
get_row_entries"
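The hit rate is simply derived from the other counters; a sketch with hypothetical names:
#include <cstdint>

// Illustrative counters behind the listed methods: hits and total requests,
// with the hit rate computed from the two.
struct row_cache_stats {
    uint64_t hits = 0;
    uint64_t requests = 0;

    double hit_rate() const {
        return requests ? static_cast<double>(hits) / requests : 0.0;
    }
};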
We need a way to remove a column family from the compaction manager,
because when dropping a column family we need to make sure that the
compaction manager doesn't hold a reference to it anymore.
The compaction manager queue is now a queue of column_family, allowing us
to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.
The test case for the compaction manager was also adapted and improved.
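In spirit, removal looks like this (a blocking sketch with stand-in types; the real code is future-based and does not block a shard):
#include <algorithm>
#include <deque>
#include <future>
#include <string>
#include <unordered_map>

// Illustrative compaction manager: dropping a cf removes any queued request
// for it and then waits for an ongoing compaction on that cf, if any, to
// finish before the caller may destroy the cf.
struct compaction_manager_sketch {
    std::deque<std::string> pending;                             // queued cfs
    std::unordered_map<std::string, std::future<void>> ongoing;  // running compactions

    void remove(const std::string& cf) {
        // Cancel queued requests pertaining to the dropped cf.
        pending.erase(std::remove(pending.begin(), pending.end(), cf), pending.end());
        // Wait for an ongoing compaction of this cf to terminate.
        auto it = ongoing.find(cf);
        if (it != ongoing.end()) {
            it->second.wait();
            ongoing.erase(it);
        }
    }
};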
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This exposes the row_cache in the column family; it will be used by the
API to get the row_cache statistics.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>