maybe_flush_pi_block, which is called for each cell, assumes that
block_first_colname will be empty when the first cell is encountered
for each partition.
This didn't hold after writing partition which generated no index
entry, because block_first_colname was cleared only when there way any
data written into the promoted index. Fix by always clearing the name.
The effect was that the promoted index entry for the next partition
would be flushed sooner than necessary (still counting since the start
of the previous partition) and with offset pointing to the start of
the current partition. This will cause parsing error when such sstable
is read through promoted index entry because the offset is assumed to
point to a cell not to partition start.
Fixes#1567
Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f1c2481040)
The sanitizer of the debug build warns when a "bool" variable is read when
containing a value not 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.
This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit c2e4f5ba16)
Commit 0d8463aba5 broke some of the tests with an assertion
failure about local_is_initialized(). It turns out that there is more than
one level of local_is_initialized() we need to check... For some tests,
neither locals were initialized, but for others, one was and the other
wasn't, and the wrong one was tested.
With this patch, all unit tests except "flush_queue_test.cc" pass on my
machine. I doubt this test is relevant to the promoted index patches,
but I'll continue to investigate it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit bce020efbd)
This patch adds writing of promoted index to sstables.
The promoted index is basically a sample of columns and their positions
for large partitions: The promoted index appears in the sstable's index
file for partitions which are larger than 64 KB, and divides the partition
to 64 KB blocks (as in Cassandra, this interval is configurable through
the column_index_size_in_kb config parameter). Beyond modifying the index
file, having a promoted index may also modify the data file: Since each
of blocks may be read independently, we need to add in the beginning of
each block the list of range tombstones that are still open at that
position.
See also https://github.com/scylladb/scylla/wiki/SSTables-Index-FileFixes#959
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0d8463aba5)
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.
Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.
However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.
This patch changes that by making streamed_mutation produce
range_tombstone objects directly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
As Nadav notes we use the chunk length as the buffer size for the compressed
stream too.
Fix by using it only for the outer (uncompressed) stream; the inner
(compressed) stream uses the sstable buffer size, 128 kiB.
Fixes#1402.
Message-Id: <1467910556-5759-1-git-send-email-avi@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
The purpose of this patch is to split the actions of writing sstable and
sealing it. As long as the sstable is unsealed it is considered
incomplete and is going to be removed on reboot.
Such functionality is needed in order to defer visibility of sstables
created during streaming until the streaming is complete.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
We weren't updating max local deletion time for cells that contain
ttl, or for tombstone cells.
If there is a live cell with no ttl, then max local deletion time
is supposed to store maximum value, which means that the sstable
will not be fully expired later on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If read ahead is going to be enabled it is important to close
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable_writer encapsulates all logic related to writing sstable.
Previously introduced component_writer is used to write actual
mutations. sstable_writer is intended to be used with
consume_flattened_in_thread(). Its purpose is to be used by higher-level
consumer that needs to write possibly more than one sstable (sstable
compaction is an example of such consumer).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch rewrites do_write_components() so that it can use
consume_flattened_in_thread(). All components-writing code is moved to a
new consumer: component_writer.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes#1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
It's not working because it tries to overwrite existing statistics
file with exclusive flag.
It's fixed by writing new statistics into temporary file and
renaming it into place.
If Scylla failed in middle of rewrite, a temporary file is left
over. So boot code was adjusted to delete a temporary file created
by this rewrite procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Refresh will rewrite statistics of any migrated sstable with level
> 0. However, this operation is currently not working because O_EXCL
flag is used, meaning that create will fail.
It turns out that we don't actually need to change on-disk level of
a sstable by overwriting statistics file.
We can only set in-memory level of a sstable to 0. If Scylla reboots
before all migrated sstables are compacted, leveled strategy is smart
enough to detect sstables that overlap, and set their in-memory level
to 0.
Fixes#1124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"If we compact sstables A, B into a new sstable C we must either delete both
A and B, or none of them. This is because a tombstone in B may delete data
in A, and during compaction, both the tombstone and the data are removed.
If only B is deleted, then the data gets resurrected.
Non-atomic deletion occurs because the filesystem does not support atomic
deletion of multiple files; but the window for that is small and is not
addressed in this patchset. Another case is when A is shared across
multiple shards (as is the case when changing shard count, or migrating
from existing Cassandra sstables). This case is covered by this patchset.
Fixes #1181."
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:
insert data to sstable A
insert tombstone to sstable B
compact A+B -> C (removing both data and tombstone)
delete B only
read data from A
Since an sstable may be shared by multiple shard, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.
An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files. This issue is not addressed by this
patch.
"This patchset contains some fixes spotted during post-merged review
by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0,
but since we haven't yet even backported the main series, might as well backport
everything.
It also includes some unit tests to make sure that they will be kept working
in the future."
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.
Suggested by Nadav, and it is actually very easy to do, so do it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.
However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically have some components filled up leading the new reader to believe
the toc was populated successfully.
It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are cases in which a Summary file will not be present, and imported
SSTables will have just the Index and Data files. In earlier versions of
Cassandra, a Summary didn't exist, so one may not be generated when migrating.
In Issue #1170, we can see an example of tables generated by CQLSSTableWriter,
and they lack a Summary. Cassandra is robust against this and can cope
perfectly with the Summary not existing. I will argue that we should do the
same.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We do that by bailing immediately if we detect that the components
map is already populated. This allow us to call read_toc() earlier
if we need to - for instance, to inquire about the existence of the
Summary - without the need to re-read the components again later.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
for prepare_summary we can just pass the min interval as a parameter and
avoid having the schema do yet another hop. For sealing the summary, it
is completely unused and we can do away with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is done so we can use other consumers. An example of that, is regeneration
of the Summary from an existing Index.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because just creating an SSTable object does not generate any I/O,
get_sstable_key_range should be an instance method. The main advantage
of doing that is that we won't have to read the summary twice. The way
we're doing it currently, if happens to be a shard-relevant table we'll
call load() - which reads the summary again.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we read the Summary file twice. That actually happens
every time during normal boot (it doesn't during refresh). First during
get_sstable_key_range and then again during load().
Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After this change, user can query compression ratio on a per column
family basis with 'nodetool cfstats'.
look at 'nodetool cfstats' output:
./bin/nodetool cfstats ks.test5
Keyspace: ks
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: test5
SSTable count: 1
Space used (live): 4774
Space used (total): 4774
Space used by snapshots (total): 0
Off heap memory used (total): 131384
SSTable Compression Ratio: 0.833333
...
Fixes#636.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a1bee5a23fe63787df3e387a88f2d216ba4a4134.1459802771.git.raphaelsc@scylladb.com>
With big rows I see contention in XFS allocations which cause reactor
thread to sleep. Commitlog is a main offender, so enlarge extent to
commitlog segment size for big files (commitlog and sstable Data files).
Message-Id: <20160404110952.GP20957@scylladb.com>
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files, however, it doesn't consider that some components may be
missing if temporary TOC is present.
When creating a new sstable, the first thing we do is to write all
components into temporary TOC, so content of a temporary TOC isn't
reliable until it is renamed.
Solution is about implementing the following flow (described by Avi):
"Flow should be:
- remove all components in parallel
- forgive ENOENT, since the compoent may not have been written;
otherwise deletion error should be raised
- fsync the directory
- delete the temporary TOC
"
This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable that would
be marked for deletion. Afterwards, remove_by_toc_name() would try to
delete a component that doesn't exist because it looked at the content
of temporary TOC.
Fixes#1095.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
Note: "normal" remove_by_toc_name must now be prepared for and check
if the TOC of the sstable is already moved to temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>
The same shard may create an sstables::sstable object for the same SStable
that doesn't belong to it more than once and mark it
for deletion (e.g. in a 'nodetool refresh' flow).
In that case the destructor of sstables::sstable accounted
the deletion requests from the same shard more than once since it was a simple
counter incremented each time there was a deletion request while it should
account request from the same shard as a single request. This is because
the removal logic waited for all shards to agree on a removal of a specific
SStable by comparing the counter mentioned above to the total
number of shards and once they were equal the SStable files were actually removed.
This patch fixes this by replacing the counter by an std::unordered_set<unsigned>
that will store a shard ids of the shards requesting the deletion
of the sstable object and will compare the size() of this set
to smp::count in order to decide whether to actually delete the corresponding
SStable files.
Fixes#1004
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>