It's important that the reference to the sstable not be kept throughout
the compaction procedure, because that would defeat the goal of
releasing space during compaction.
The manager passes a callback to compaction, which calls it whenever
an sstable is replaced.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that we want to release the space of an exhausted
sstable, and that will only happen when all references to it are gone
*and* the backlog tracker takes the early replacement into account.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When compacting, we'll create all readers at once and will not select
again from incremental selector, meaning the selector will keep all
respective sstables in current_sstables, preventing compaction from
releasing space as it goes on.
The change is about refreshing sstable set's selector such that it
will not hold a reference to an exhausted sstable whatsoever.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By doing that, we'll be able to release an exhausted sstable from both
simultaneously.
That's achieved by sharing set containing input sstables with the incremental
reader selector and removing exhausted sstables from shared set when the
time has come.
A step towards reducing the disk requirement of compaction by making it
delete an sstable whose data is all in a new, sealed sstable. For that
to happen, all references must be gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction only replaces input sstables at the end of
compaction, meaning compaction must be finished for all the space of
those sstables to be released.
What we can do instead is to delete some input sstables earlier, under
some conditions:
1) The sstable's data should be committed to a new, sealed output
sstable, meaning it's exhausted.
2) An exhausted sstable mustn't overlap with a non-exhausted sstable,
because a tombstone in the exhausted one could have been purged and the
shadowed data in the non-exhausted one could be resurrected if the
system crashes.
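
The two conditions above can be sketched as a predicate. This is a
simplified illustration, not the actual compaction code: the
`input_sstable` struct, its integer token range and the
`can_release_early()` function are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an input sstable of a compaction.
struct input_sstable {
    int64_t first_token;   // inclusive lower bound of the token range
    int64_t last_token;    // inclusive upper bound of the token range
    bool exhausted;        // all of its data is in a sealed output sstable
};

// Two closed ranges overlap when neither ends before the other starts.
inline bool overlaps(const input_sstable& a, const input_sstable& b) {
    return a.first_token <= b.last_token && b.first_token <= a.last_token;
}

// Condition 1: the candidate is exhausted.
// Condition 2: it overlaps with no non-exhausted input sstable.
inline bool can_release_early(const input_sstable& candidate,
                              const std::vector<input_sstable>& inputs) {
    assert(candidate.first_token <= candidate.last_token);
    if (!candidate.exhausted) {
        return false;
    }
    for (const auto& other : inputs) {
        if (!other.exhausted && overlaps(candidate, other)) {
            return false;
        }
    }
    return true;
}
```

With inputs covering [0, 10] (exhausted), [5, 20] (not exhausted) and
[30, 40] (exhausted), only the last one qualifies for early release.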
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that compaction may remove a sstable from the set while the
incremental selector is alive, and for that to work, we need to invalidate
the iterators stored by the selector. We could have added a method to notify
it, but there will be a case where the one keeping the set cannot forward
the notification to the selector. So it's better for the selector to take
care of itself. A change counter approach is used, which allows the
selector to know when to invalidate the iterators.
After invalidation, the selector will move the iterator back into its
right place by looking for the lower bound of the current position.
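
A minimal sketch of the change counter idea, using a `std::set<int>` in
place of the real sstable set. `counted_set` and `incremental_selector`
are hypothetical toy classes, and the repositioning uses `upper_bound()`
of the last returned key, which plays the role of the lower bound of the
current position.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>

// Toy set that bumps a counter on every modification.
class counted_set {
    std::set<int> _items;
    uint64_t _change_count = 0;
public:
    void insert(int v) { _items.insert(v); ++_change_count; }
    void erase(int v) { _items.erase(v); ++_change_count; }
    uint64_t change_count() const { return _change_count; }
    const std::set<int>& items() const { return _items; }
};

// Toy selector that caches an iterator into the set and repositions it
// whenever the counter shows that the set changed behind its back.
class incremental_selector {
    const counted_set& _set;
    uint64_t _seen_count;
    std::set<int>::const_iterator _it;
    std::optional<int> _last;   // last key handed out, if any
public:
    explicit incremental_selector(const counted_set& s)
        : _set(s), _seen_count(s.change_count()), _it(s.items().begin()) {}

    // Returns the next key in order, or nullopt when the set is drained.
    std::optional<int> next() {
        if (_seen_count != _set.change_count()) {
            // The cached iterator may be invalid: find our place again.
            _it = _last ? _set.items().upper_bound(*_last)
                        : _set.items().begin();
            _seen_count = _set.change_count();
            // The selector must never move backwards.
            assert(!_last || _it == _set.items().end() || *_it > *_last);
        }
        if (_it == _set.items().end()) {
            return std::nullopt;
        }
        _last = *_it;
        return *_it++;
    }
};
```

The set owner does not need to notify the selector explicitly; the
selector notices the change on its next call.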
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use sstable generation instead to keep track of read sstables.
The motivation is that we'll not keep references to sstables, allowing
their space on disk to be released as soon as they get exhausted.
Generation is used because it guarantees uniqueness of the sstable.
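
A minimal sketch of tracking by generation, assuming a toy `sstable`
struct with an integer generation. The `read_tracker` class and its
member names are hypothetical and only illustrate that no shared
pointer is retained.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_set>

// Toy stand-ins for the real types.
struct sstable { int64_t generation; };
using shared_sstable = std::shared_ptr<sstable>;

// Remembers which sstables were read by generation only, so it never
// extends the lifetime of an sstable object.
class read_tracker {
    std::unordered_set<int64_t> _read_generations;
public:
    // Returns true if this sstable had not been seen before.
    bool mark_read(const shared_sstable& sst) {
        assert(sst);
        return _read_generations.insert(sst->generation).second;
    }
    bool was_read(int64_t generation) const {
        return _read_generations.count(generation) != 0;
    }
};
```

Once the last external reference is dropped, the sstable object can be
destroyed even though the tracker still knows it was read.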
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Older sstables must have an identifier for them to be associated
with their own run.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It identifies the run that a particular sstable belongs to.
Existing sstables will have a random UUID associated with them
in memory.
UUID is the correct choice because it allows sstables to be
exported without conflicts between identifiers generated
by different nodes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.
We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.
The patch also removes a superfluous ';'.
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
This series adds a generic test for schema changes that generates
various schema and data before and after an ALTER TABLE operation. It is
then used to check correctness of mutation::upgrade() and sstable
readers and led to the discovery of #3924 and #3925.
Fixes #3925.
* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
schema_builder: make member function names less confusing
converting_mutation_partition_applier: fix collection type changes
converting_mutation_partition_applier: do not emit empty collections
sstable: use format() instead of sprint()
tests/random-utils: make functions and variables inline
tests: add models for schemas and data
tests: generate schema changes
tests/mutation: add test for schema changes
tests/sstable: add test for schema changes
for_each_schema_change() is used for testing reading an sstable that was
written with a different schema. Because of #3924, for now the mc format
is not verified this way.
This patch adds the for_each_schema_change() function, which generates
schemas and data before and after some modification to the schema (e.g.
adding a column, changing its type). It can be used to test schema
upgrades.
This patch introduces a model of Scylla schemas and data, implemented
using simple standard library primitives. It can be used for testing the
actual schemas, mutation_partitions, etc. used by the schema by
comparing the results of various actions.
The initial use case for this model was to test schema changes, but
there is no reason why in the future it cannot be extended to test other
things as well.
packer 1.3.2 no longer supports the enhanced_networking directive, so we
need to use the new directives ("sriov_support" and "ena_support") to
build with the new version.
packer provides an automatic configuration file fixing tool, so the new
scylla.json is generated by the following command:
./packer/packer fix scylla.json
Fixes #3938
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181123053719.32451-1-syuu@scylladb.com>
appending_hash is used for computing hashes that become part of the
binary interface. They cannot change between Scylla versions and the
same data needs to always result in the same hash.
At the moment, appending_hash<bytes_ostream> doesn't fulfil those
requirements since it leaks information about how the underlying buffer
is fragmented. Fortunately, it has no users so it doesn't cause any
compatibility issues.
Moreover, bytes_ostream is usually used as an output of some
serialisation routine (e.g. frozen_mutation_fragment or CQL response).
Those serialisation formats do not guarantee that there is a single
representation of a given data and therefore are not fit to be hashed by
appending_hash. Removing appending_hash<bytes_ostream> may help
prevent such incorrect uses.
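
To illustrate what leaking fragmentation means, here is a toy sketch.
It does not reproduce the removed specialization: `toy_hasher` (an
FNV-1a hasher), `fragmented_buffer`, `hash_per_fragment()` and
`hash_contents()` are hypothetical names, and the assumption is that
fragment sizes get fed into the hasher.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy FNV-1a hasher, standing in for a real hasher.
class toy_hasher {
    uint64_t _state = 14695981039346656037ull;
public:
    void update(const char* data, size_t size) {
        assert(data != nullptr || size == 0);
        for (size_t i = 0; i < size; ++i) {
            _state ^= static_cast<unsigned char>(data[i]);
            _state *= 1099511628211ull;
        }
    }
    // Feeds a size in a fixed (little-endian) byte order.
    void update_size(uint64_t v) {
        for (int i = 0; i < 8; ++i) {
            const char byte = static_cast<char>((v >> (8 * i)) & 0xff);
            update(&byte, 1);
        }
    }
    uint64_t finalize() const { return _state; }
};

using fragmented_buffer = std::vector<std::string>;

// Fragmentation-dependent: every fragment boundary is fed to the hasher.
inline uint64_t hash_per_fragment(const fragmented_buffer& buf) {
    toy_hasher h;
    for (const auto& frag : buf) {
        h.update_size(frag.size());
        h.update(frag.data(), frag.size());
    }
    return h.finalize();
}

// Fragmentation-independent: only the total size and the bytes are fed.
inline uint64_t hash_contents(const fragmented_buffer& buf) {
    toy_hasher h;
    uint64_t total = 0;
    for (const auto& frag : buf) {
        total += frag.size();
    }
    h.update_size(total);
    for (const auto& frag : buf) {
        h.update(frag.data(), frag.size());
    }
    return h.finalize();
}
```

The same bytes split as {"hello", "world"} and as {"hel", "loworld"}
always hash identically with hash_contents(), while hash_per_fragment()
feeds different input for each split and so will generally disagree.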
Message-Id: <20181122163823.12759-1-pdziepak@scylladb.com>
* seastar b924495...1fbb633 (3):
> rpc: Reduce code duplication
> tests: perf: Make do_not_optimize() take the argument by const&
> doc: Fix import paths in the tutorial
The format message was using the new style formatting markers ("{}")
which are understood by format() but not by sprint() (the latter is
basically deprecated).
This patch changes the behaviour of the schema upgrade code so that if
all cells and the tombstone of a collection are removed during the upgrade
the collection is not emitted (as opposed to emitting an empty one).
Both behaviours are valid, but the new one makes it more consistent with
how atomic cells are upgraded and how schema upgrades work for sstable
readers.
ALTER TABLE allows changing the type of a collection to a compatible
one. This includes changes from a fixed-sized type to a variable-sized
one. If that happens the atomic_cells representing collection elements
need to be rewritten so that the value size is included. The logic for
rewriting atomic cells already exists (for those that are not
collection members) and is reused in this patch.
Fixes #3925.
Right now, schema_builder member functions have names that very poorly
convey the actions that are performed by them. This is made even worse
by some overloads which drastically change the semantics. For example:
schema_builder()
.with_column("v1", /* ... */)
.without_column("v1", removal_timestamp);
This creates a column "v1" and records that there was a column
with that name that was removed at 'removal_timestamp'.
schema_builder()
.with_column("v1")
.without_column(utf8_type->decompose("v1"));
This adds column "v1" and then immediately removes it.
In order to clean up this mess the names were changed so that:
* with_/without_ functions only add information to the schema (e.g.
info that a column was removed, but without removing a column of that
name if one exists)
* functions whose names start with a verb actually perform that action,
e.g. the new remove_column() removes the column (and adds information
that it used to exist) as in the second example.
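
The naming rule can be shown on a toy model. This is not the real
schema_builder: `toy_schema_builder`, its string-based column types,
the timestamp parameter of `remove_column()` and the `has_column()` and
`was_dropped()` accessors are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy model of the naming convention only.
class toy_schema_builder {
    std::map<std::string, std::string> _columns;   // name -> type
    std::map<std::string, long> _dropped;          // name -> removal timestamp
public:
    // with_: only adds information (here, a column definition).
    toy_schema_builder& with_column(std::string name, std::string type) {
        _columns[std::move(name)] = std::move(type);
        return *this;
    }
    // without_: only records that a column of this name was removed once.
    // It does not touch an existing column of the same name.
    toy_schema_builder& without_column(const std::string& name, long removal_timestamp) {
        _dropped[name] = removal_timestamp;
        return *this;
    }
    // Verb: actually removes the column and records that it used to exist.
    toy_schema_builder& remove_column(const std::string& name, long removal_timestamp) {
        const auto erased = _columns.erase(name);
        assert(erased == 1);   // in this toy, removing an unknown column is a bug
        (void)erased;
        return without_column(name, removal_timestamp);
    }
    bool has_column(const std::string& name) const { return _columns.count(name) != 0; }
    bool was_dropped(const std::string& name) const { return _dropped.count(name) != 0; }
};
```

In this model, with_column("v1", ...) followed by without_column("v1",
ts) leaves "v1" in place, while remove_column("v1", ts) drops it.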
Currently, if the hints directory contains unexpected directories, Scylla
fails to start with an unhandled std::invalid_argument exception. Make the
manager ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
We scan the hints directory in two places: to search for files to replay
and to search for directories to remove after resharding. The code that
translates a directory name to a shard is duplicated. It is simple now,
so not a big issue, but in case it grows it is better to have it in one
place.
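
A minimal sketch of such a shared helper, assuming that shard
directories are named with the decimal shard number.
`dir_name_to_shard()` and `is_stale_shard_dir()` are hypothetical names,
and the staleness rule (shard number at or above the current shard
count) is an assumption, not taken from the patch.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Translates a directory name to a shard number. Returns nullopt for
// anything that is not a plain decimal number, instead of throwing.
inline std::optional<unsigned> dir_name_to_shard(std::string_view name) {
    unsigned shard = 0;
    const char* first = name.data();
    const char* last = first + name.size();
    const auto [ptr, ec] = std::from_chars(first, last, shard);
    if (ec != std::errc() || ptr != last) {
        return std::nullopt;   // empty, non-numeric, trailing junk or overflow
    }
    return shard;
}

// Possible caller for the resharding scan: a directory is considered
// stale when it names a shard that no longer exists.
inline bool is_stale_shard_dir(std::string_view name, unsigned shard_count) {
    assert(shard_count > 0);
    const auto shard = dir_name_to_shard(name);
    return shard && *shard >= shard_count;
}
```

Returning an optional lets both scans skip unexpected entries such as
"lost+found" without special-casing exceptions.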
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
"
Tested with perf_fast_forward from:
github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1
Using the following command line:
build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
--data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
--datasets small-part
The average reported flush throughput was (stdev for the averages is around 4k):
- for mc before the series: 367848 frag/s
- for lc before the series: 463458 frag/s (= mc.before +25%)
- for mc after the series: 429276 frag/s (= mc.before +16%)
- for lc after the series: 466495 frag/s (= mc.before +26%)
Refs #3874.
"
* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
sstables: mc: Avoid serialization of promoted index when empty
sstables: mc: Avoid double serialization of rows
tests: sstable 3.x: Do not compare Statistics component
utils: Introduce memory_data_sink
schema: Optimize column count getters
sstables: checksummed_file_data_sink_impl: Bypass output_stream
The old code was serializing the row twice. Once to get the size of
its block on disk, which is needed to write the block length, and then
to actually write the block.
This patch avoids this by serializing once into a temporary buffer and
then appending that buffer to the data file writer.
I measured about 10% improvement in memtable flush throughput with
this for the small-part dataset in perf_fast_forward.
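
The serialize-once pattern can be sketched as follows. The row layout,
the fixed 4-byte big-endian length prefix and the names `row`,
`serialize_row()` and `write_block()` are simplifying assumptions; the
real writer and on-disk encoding differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy row and toy serialization format.
struct row {
    std::string key;
    std::string value;
};

inline void serialize_row(const row& r, std::string& out) {
    out += r.key;
    out.push_back('\0');
    out += r.value;
}

// Writes a length-prefixed block while serializing the row only once:
// first into a reusable scratch buffer, whose size gives the block
// length, and then by appending that buffer to the data file.
inline void write_block(const row& r, std::string& data_file, std::string& scratch) {
    scratch.clear();
    serialize_row(r, scratch);
    assert(scratch.size() <= UINT32_MAX);
    const uint32_t len = static_cast<uint32_t>(scratch.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        data_file.push_back(static_cast<char>((len >> shift) & 0xff));
    }
    data_file += scratch;
}
```

Reusing the scratch buffer across rows keeps the extra cost down to one
memory copy per row instead of a second serialization pass.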
The Statistics component recorded in the test was generated using a
buggy version of Scylla, and is not correct. Exposed by fixing the bug
in the way statistics are generated.
Rather than comparing binary content, we should have explicit checks
for statistics.
"
Enables sstable compression with LZ4 by default, which was the
long-time behavior until a regression turned off compression by
default.
Fixes #3926
"
* 'restore-default-compression/v2' of https://github.com/duarten/scylla:
tests/cql_query_test: Assert default compression options
compress: Restore lz4 as default compressor
tests: Be explicit about absence of compression
Indicate the default scylla directories, rather than Cassandra's.
Provide links to Scylla documentation where possible,
update links to Cassandra documentation otherwise.
Clean up a few typos.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181119141912.28830-1-bhalevy@scylladb.com>
This avoids a difference between little and big endian systems. We
now also calculate a full murmur hash for tokens with less than 8
bytes, however in practice the token size is always 8.
Message-Id: <20181120214733.43800-1-mike.munday@ibm.com>
The boost multiprecision library that I am compiling against seems
to be missing an overload for the cast to a string. The easy
workaround seems to be to call str() directly instead.
This also fixes #3922.
Message-Id: <20181120215709.43939-1-mike.munday@ibm.com>
Fixes a regression introduced in
74758c87cd, where tables started to be
created without compression by default (before they were created with
lz4 by default).
Fixes #3926
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
The build environment may not have ninja-build installed before
install-dependencies.sh runs, so do it after running the script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181031110737.17755-1-syuu@scylladb.com>
* seastar a44cedf...d59fcef (10):
> dns: Set tcp output stream buffer size to zero explicitly
> tests: add libc-ares to travis dependencies
> tests: add dns_test to test suite
> build: drop bundled c-ares package
> prometheus: replace the instance label with an optional one
> build: Refactor C++ dialect detection
> build: add libatomic to install-depenencies.sh
> core: use std::underlying_type for open_flags
> core: introduce open_flags::operator&
> core: Fix build for `gnu++14`
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.
One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.
This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
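
The reordering can be sketched without Seastar. This is an
illustration of the strong exception guarantee only: `gate` is a toy
struct, `phased_barrier_sketch` and its `advance()` member are
hypothetical, and the old gate is returned to the caller rather than
awaited.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Toy stand-in for a gate; only records which phase it belongs to.
struct gate { int phase; };

class phased_barrier_sketch {
    std::unique_ptr<gate> _gate = std::make_unique<gate>(gate{0});
public:
    // make_gate(phase) allocates the next gate and may throw, for
    // example std::bad_alloc. It runs before _gate is touched, so a
    // failure leaves the barrier exactly as it was.
    template <typename MakeGate>
    std::unique_ptr<gate> advance(MakeGate make_gate) {
        std::unique_ptr<gate> new_gate = make_gate(_gate->phase + 1);
        assert(new_gate);
        // Swapping unique_ptrs cannot throw.
        return std::exchange(_gate, std::move(new_gate));
    }
    int phase() const { return _gate->phase; }
};
```

After a failed call, the barrier still holds its original gate, so a
later attempt can be retried safely.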
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>