SSTables already have a priority argument wired to their read path. However,
most of our reads do not call that interface directly, but employ the services
of a mutation reader instead.
Some of those readers will be used to read through a mutation_source, and those
have to patched as well.
Right now, whenever we need to pass a class, we pass Seastar's default priority
class.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Compaction manager was initially created at utils because it was
more generic, and wasn't only intended for compaction.
It was more like a task handler based on futures, but now it's
only intended to manage compaction tasks, and thus should be
moved elsewhere. /sstables is where compaction code is located.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Cleanup is a procedure that will discard irrelevant keys from
all sstables of a column family, thus saving disk space.
Scylla will clean up a sstable by using compaction code, in
which this sstable will be the only input used.
Compaction manager was changed to become aware of cleanup, such
that it will be able to schedule cleanup requests and also know
how to handle them properly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The implementation is about storing generation of compacting sstables
in an unordered set per column family, so before strategy is called,
compaction manager will filter out compacting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction strategy is the responsible for both getting the
sstables selected for compaction and running compaction.
Moving the code that runs compaction from strategy to manager is a big
improvement, which will also make possible for the compaction manager
to keep track of which sstables are being compacted at a moment.
This change will also be needed for cleanup and concurrent compaction
on the same column family.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That code will be used by column family cleanup, so let's put
that code into a function. This change also improves the code
readability.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Replicates https://issues.apache.org/jira/browse/CASSANDRA-7910 :
"Prepare a statement with a wildcard in the select clause.
2. Alter the table - add a column
3. execute the prepared statement
Expected result - get all the columns including the new column
Actual result - get the columns except the new column"
There is one current schema for given column_family. Entries in
memtables and cache can be at any of the previous schemas, but they're
always upgraded to current schema on access.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.
Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
service::storage_service::clear_snapshot() was built around _db.local()
calls so it makes more sense to move its code into the 'database' class
instead of calling _db.local().bla_bla() all the time.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"When a node gain or regain responsibility for certain token ranges, streaming
will be performed, upon receiving of the stream data, the row cache
is invalidated for that range.
Refs #484."
The add interface of the estimated histogram is confusing as it is not
clear what units are used.
This patch removes the general add method and replace it with a add_nano
that adds nanoseconds or add that gets duration.
To be compatible with origin, nanoseconds vales are translated to
microseconds.
When analyzing a recent performance issue, I found helpful to keep track of
the amount of memtables that are currently in flight, as well as how much memory
they are consuming in the system.
Although those are memtable statistics, I am grouping them under the "cf_stats"
structure: being the column family a central piece of the puzzle, it is reasonable
to assume that a lot of metrics about it would be potentially welcome in the future.
Note that we don't want to reuse the "stats" structure in the column family: for once,
the fields not always map precisely (pending flushes, for instance, only tracks explicit
flushes), and also the stats structure is a lot more complex than we need.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Get initial tokens specified by the initial_token in scylla.conf.
E.g.,
--initial-token "-1112521204969569328,1117992399013959838"
--initial-token "1117992399013959838"
It can be multiple tokens split by comma.
"The snapshots API need to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implement get_snapshot_details, a
column family method, and wire that up through the storage_service."
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still for us something we should not recommend - unless the CF is totally
empty and not yet used, but we can do a much better job in the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible thing to
do, because dirty memory can grow much more than desired during this. Throughout out
this implementation, we will try to keep the time during which the writes are disabled
to its bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan all tables that we haven't seen before. Those are any tables found in the
CF datadir, that are higher than the highest generation number seen so far. We will link
them to new generation numbers that are sequential to the ones we have so far, and end up
with a new generation number that is returned to the next step
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally."
This series adds a histogrm to the column family for live scanned and
tombstone scaned.
It expose those histogram via the API instead of the stub implmentation,
currently exist.
The implementation update of the histogram will be added in a different
series.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service who is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.
All new tables will start their lifes as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special in this front.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).
Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245
However, we can already to slightly better in two ways:
Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.
It will also return the highest generation number found after the reshuffling
process. New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of this,
and will therefore be safe.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - specially with the rename.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Avoid using long for it, and let's use a fixed size instead. Let's do signed
instead of unsigned to avoid upsetting any code that we may have converted.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
From Pawel:
This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provied by key_readers, which work very similarly to
mutation_readers, but return only the decorated key of partitions in
range. In case of sstables key_readers is implemented to use partition
index.
Approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
solutions ways of dealing with that problem:
- cache partition index - that will also help in all other places where it
is neededed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such flag would be set
by mutation reader and cleared during eviction. It will also allow
newly created mutations from memtable to be moved to cache provided that
both their successors and predecessors are already there.
The key_reader part of this patchsets adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.
Fixes#185.
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size includes also the metadata about the snapshot - like the manifest.json
file.
Details follow:
In the original Cassandra code, total size is:
long sizeOnDisk = FileUtils.folderSize(snapshot);
folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that file_size() would give us the same as the length() method
for Java.
The other value, real (or true) size is:
long trueSize = getTrueAllocatedSizeIn(snapshot);
getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does, is add up the size of the files
within the tree who are "acceptable".
An acceptable file is a file which:
starts with the same prefix as we want (IOW, belongs to the same SSTable, we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.
What this tries to do, is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links, then
if a table goes away, it adds to this size. If it would be there anyway, it does
not.
We can do that in a lot simpler fashion: for each file, we will just look at
the original CF directory, and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and that simplifies the code tremendously
given that we don't have to neither list all files in the system - as Cassandra
does - or go check other shards for liveness information - as we would have to
do.
The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This allows for us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.
The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. First step would be to call it walker, not lister...
For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state,
that's part of making this generic.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Adapt our compaction code to start writing a new sstable if the
one being written reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here, is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch adds the read and write latency estimated histogram support
and add an estimatd histogram to the number of sstable that were used in
a read.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
From Pekka:
This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:
[penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
table_test (cql_tests.TestCQL) ... ok
----------------------------------------------------------------------
Ran 1 test in 23.841s
OK
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.
The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We will need to change some properties of the keyspace / cf. We need an acessor
that is not marked as const.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch contains the following changes, in the definition of the read
and write latency histogram it removes the mask value, so the the
default value will be used.
To support the gothering of the read latency histogram the query method
cannot be const as it modifies the histogram statistics.
The read statistic is sample based and it should have no real impact on
performance, if there will be an impact, we can always change it in the
future to a lower sampling rate.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>