Introduced in 2a437ab427.
regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter duringthat small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.
One effect of this is storing values of some columns under a different
column.
This patch replaces all column_family::schema() accesses with accesses
to the _schema memeber which is obtained once per compaction and is
the same schema which readers use.
Fixes#4304.
Tests:
- manual tests with hard-coded schema change injection to reproduce the bug
- build/dev/scylla boot
- tests/sstable_mutation_test
Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
"
Currently we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.
There still remains a problem that large-enough promoted index can cause OOM.
Refs #4217
Tests:
- unit (release)
- scylla-bench write
Branches: 3.0
"
* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
sstables: mc: writer: Avoid large allocations for maintaining promoted index
sstables: mc: writer: Avoid double-serialization of the promoted index
Scan the table's pending_delete sub-directory if it exists.
Remove any temporary pending_delete log files to roll back the respective
delete_atomically operation.
Replay completed pending_delete log files to roll forward the respective
delete_atomically operation, and finally delete the log files.
Cleanup of temporary sstable directories and pending_delete
sstables are done in a preliminary scan phase when populating the column family
so that we won't attempt to load the to-be-deleted sstables.
Fixes#4082
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate recovery of a delete_atomically operation that crashed mid
way, add a replayable log file holding the committed sstables to delete.
It will be used by populate_column_family to replay the atomic deletion.
1. Write the toc names of sstables to be deleted into a temporary file.
2. Once flushed and closed, rename the temp log file into the final name
and flush the pending_delete directory.
3. delete the sstables.
4. Remove the pending_delete log file
and flush the pending_delete directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
component_basename returns just the basename for the component filename
without the leading sstdir path.
To be used for delete_atomically's pending_delete log file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
1. We would like to be able to call maybe_delete_large_partitions_entry
from the sstable destructor path in the future so the sstable might go away
while the large data entries are being deleted.
2. We would like the caller to handle any exception on this path,
especially in the prepatation part, before calling delete_large_partitions_entry().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be called by delete_atomically,
rather that passing a vector to delete_sstables.
This way, no need to build `sstables_to_delete_atomically` vector
To be replaced in the future with a sstable method once we
provide the large_data_handler upon construction.
Handle exceptions from remove_by_toc_name or maybe_delete_large_partitions_entry
by merely logging an error. There is nothing else we can do at this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to call delete_sstables which works on a list of sstable
(by toc name).
Also, add FIXME comment about not calling
large_data_handler.maybe_delete_large_partitions_entry
on this path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
checksummed_file_writer does not override allocate_buffer(), so it inherits
data_source_impl's default allocate_buffer, which does not care about alignment.
The buffer is then passed to the real file_data_sink_impl, and thence to the file
itself, which cannot complete the write since it is not properly aligned.
This doesn't fail in release mode, since the Seastar allocator will supply a
properly aligned buffer even if not asked to do so. The ASAN allocator usually
does supply an aligned buffer, but not always, which causes the test to fail.
Fix by forwarding the allocate_buffer() function to the underlying data_source.
Fixes#4262.
Branches: branch-3.0
Message-Id: <20190221184115.6695-1-avi@scylladb.com>
This patch removes the log message about "compaction_manager - Asked to stop"
at the very end of Scylla runs. This log message is confusing because it
only has the "asked to stop" part, without finally a "stopped", and may
lead a user to incorrectly fear that the shutdown hung - when it in fact
finished just fine.
The database object holds a compaction_manager and stop()s it when the
database is stop()ed - and that is the very last thing our shutdown does.
However, much earlier, as the *first* shutdown operation (i.e., the last
at_exit() in main.cc), we already stop() the compaction manager.
The second stop() call does nothing, but unfortunately prints the log
message just before checking if it has anything to stop. So this patch
just moves the log message to after the check.
Fixes#4238.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217142657.19963-1-nyh@scylladb.com>
Currently, we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.
There still remains a problem that large-enough promoted index can cause OOM.
Refs #4217
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be an order we can stop the existing services in
cql_test_env that makes this possible.
This patch then adds another step when shutting down a database: first
stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
There is no value if having a default value for encoding_stats parameter
of write_components(). If anything it weakens the tests by encouraging
not using the real encoding stats which is not what the actual sstable
write path in Scylla does.
This patch removes the default value and makes most of the tests provide
real encoding statistics. The ones that do not are those that have no
easy way of obtaining those (and those stats are not that important for
the test itself) or there is a reason for not using those
(sstable_3_x_test::test_sstable_write_large_row uses row size thresholds
based on size with default-constructed encoding_stats).
Message-Id: <20190212124356.14878-1-pdziepak@scylladb.com>
Each counter cell has a header with an entry for each local and global
shards. The detection of remote shards is done by checking if there are
any counter shards that do not have an entry in the header. This is done
by computing the number of counter shards in a cell and comparing it to
the number of header entries. However, the computation was wrong and
included the size taken by the header itself. As a result, if the header
was as big or larger than a single counter shard Scylla incorrectly
complained about remote shards.
column_stats is a per-partition tracker, while metadata_collector is the
global one. The statistics gathered by column_stats are merged into the
metadata_collector. In order to ensure that we get proper default values
in case no value of particular kind (e.g. no TTLs) was seen they need to
be set on the global tracker, not the per-partition one.
sstable class is responsible for much more things that it should. In
particular, it takes care of both writing and reading sstables. The
problem that it causes is that it is very easy to confuse those two.
This is what has happened in get_encoding_stats_for_compaction().
Originally, it was using _c_stats as a source of the statistics, which
is used only during the write and per-partition. Needless to say, the
returned encoding_stats were bogus.
The correct source of those statistics is get_stats_metadata().
We still allow the delete of rows from system.large_partition to run
in parallel with the sstable deletion, but now we return a future that
waits for both.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190205001526.68774-1-espindola@scylladb.com>
"
This is a first step in fixing #3988.
"
* 'espindola/large-row-warn-only-v4' of https://github.com/espindola/scylla:
Rename large_partition_handler
Print a warning if a row is too large
Remove defaut parameter value
Rename _threshold_bytes to _partition_threshold_bytes
keys: add schema-aware printing for clustering_key_prefix
Stop calling .remove_suffix on empty string_view.
ck_bview can be empty because this function can be
called for a half open range tombstone.
It is impossible to write such range tombstones to LA/KA SSTables
so we should throw a proper exception instead of allowing
an undefined behaviour.
Refs #4113
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3738916953e4b10812aed95e645c739b4c29462.1548777086.git.piotr@scylladb.com>
Fully expired sstable is not added to compacting set, meaning it's not actually
compacted, but it's kept in a list of sstables which incremental compaction
uses to check if any sstable can be replaced.
Incremental compaction was unconditionally removing expired sstable from compacting
set, which led to segfault because end iterator was given.
The fix is about changing sstable_set::erase() behavior to follow standard one
for erase functions which will works if the target element is not present.
Fixes#4085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190130163100.5824-1-raphaelsc@scylladb.com>
Committer: Avi Kivity <avi@scylladb.com>
Branch: next
Switch to the the CMake-ified Seastar
This change allows Scylla to be compiled against the `master` branch of
Seastar.
The necessary changes:
- Add `-Wno-error` to prevent a Seastar warning from terminating the
build
- The new Seastar build system generates the pkg-config files (for
example, `seastar.pc`) at configure time, so we don't need to invoke
Ninja to generate them
- The `-march` argument is no longer inherited from Seastar (correctly),
so it needs to be provided independently
- Define `SEASTAR_TESTING_MAIN` so that the definition of an entry
point is included for all unit test compilation units
- Independently link Scylla against Seastar's compiled copy of fmt in
its build directory
- All test files use the (now public) Seastar testing headers
- Add some missing Seastar headers to source files
[avi: regenerate frozen toolchain, adjust seastar submoule]
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <02141f2e1ecff5cbcd56b32768356c3bf62750c4.1548820547.git.jhaberku@scylladb.com>
1. fs::canonical required that the path will exist.
and there is no need for fs::canonical here.
2. fs::path::extension will return the leading dot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When seastar/core/shared_ptr_incomplete.hh is included in a header
then it causes problems with all declarations of shared_ptr<T> with
incomplete type T that end up in the same compilation unit.
The problem happens when we have a compilation unit that includes
two headers a.hh and b.hh such that a.hh includes
seastar/core/shared_ptr_incomplete.hh and b.hh declares
shared_ptr<T> with incomplete type T. On the same time this
compilation unit does not use declared shared_ptr<T> so it should
compile and work but it does not because shared_ptr_incomplete.hh
is included and it forces instantiation of:
template <typename T>
T*
lw_shared_ptr_accessors<T,
void_t<decltype(lw_shared_ptr_deleter<T>{})>>::to_value(lw_shared_ptr_counter_base*
counter) {
return static_cast<T*>(counter);
}
for each declared shared_ptr<T> with incomplete type T. Even the once
that are never used.
Following commit "Decouple database.hh from types/user.hh"
moves user_types_metadata type out of database.hh and instead
declares shared_ptr<user_types_metadata> in database.hh where
user_types_metadata is incomplete. Without this commit
the compilation of the following one fails with:
In file included from ./sstables/sstables.hh:34,
from ./db/size_estimates_virtual_reader.hh:38,
from db/system_keyspace.cc:77:
seastar/include/seastar/core/shared_ptr_incomplete.hh: In
instantiation of ‘static T*
seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::to_value(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’:
seastar/include/seastar/core/shared_ptr.hh:243:51: required from
‘static void seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::dispose(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
user_types_metadata]’
./database.hh:1004:7: required from ‘static void
seastar::internal::lw_shared_ptr_accessors_no_esft<T>::dispose(seastar::lw_shared_ptr_counter_base*)
[with T = keyspace_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
keyspace_metadata]’
./db/size_estimates_virtual_reader.hh:233:67: required from here
seastar/include/seastar/core/shared_ptr_incomplete.hh:38:12: error:
invalid static_cast from type ‘seastar::lw_shared_ptr_counter_base*’
to type ‘user_types_metadata*’
return static_cast<T*>(counter);
^~~~~~~~~~~~~~~~~~~~~~~~
[131/415] CXX build/release/distributed_loader.o
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
deletion_time struct as int32_t deletion_time that cannot hold long
time values. Cap local_deletion_time to max_local_deletion_time and
log a warning about that,
This corresponds to Cassandra's MAX_DELETION_TIME.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
max local_deletion_time_tracker in stats is int32_t so just track the limit
of (max int32_t - 1) if time_point is greater than the limit.
This corresponds to Cassandra's MAX_DELETION_TIME.
Refs #3353
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mc format only writes delta local_deletion_time of tombstones.
Conventional deletion_time is written only for the partition header.
Restructure the code to pass a tombstone to write_delta_deletion_time
rather than struct deletion_time to prepare for using 64-bit deletion times.
The tombstone uses gc_clock::time_point while struct
deletion_time is limited to int32_t local_deletion_time.
Note that for "live" tombstones we encode <api::missing_timestamp,
no_deletion_time> as was previously evaluated by to_deletion_time().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fix runtime error: signed integer overflow
introduced by 2dc3776407
Delta-encoded values may wrap around if the encoded value is
less than the base value. This could happen in two places:
In the mc-format serialization header itself, where the base values are implicit
Cassandra epoch time, and in the sstables data files, where the base values
are taken from the encoding_stats (later written to the serialization_header).
In these cases, when the calculation is done using signed integer/long we may see
"runtime error: signed integer overflow" messages in debug mode
(with -fsanitize=undefined / -fsanitize=signed-integer-overflow).
Overflow here is expected and harmless since we do not gurantee that
neither the base values in the serialization header are greater than
or equal to Cassandra's epoch now that the delta-encoded values are
always greater than or equal to the respective base values in
the serialization header.
To prevent these warnings, the subtraction/addition should be done with unsigned
(two's complement) arithmetic and the result converted to the signed type.
Note that to keep the code simple where possible, when also rely on implicit
conversion of signed integers to unsigned when either one of added value is unsigned
and the other is signed.
Fixes: #4098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190120142950.15776-1-bhalevy@scylladb.com>
When compacting several sstables, get and merge their encoding_stats
for encoding the result.
Introduce sstable::get_encoding_stats_for_compaction to return encoding_stats
based on the sstable's column stats.
Use encoding_stats_collector to keep track of the minimum encoding_stats
values of all input sstables.
Fixes#3971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given tables, sstables are simply moved
from /upload dir to datadir without any changes.
Tests: unit (release)
"
* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
all: rename view_update_from_staging_generator
distributed_loader: fix indentation
service: add generating view updates from uploaded sstables
init: pass view update generator to storage service
sstables: treat sstables in upload dir as needing view build
sstables,table: rename is_staging to requires_view_building
distributed_loader: use proper directory for opening SSTable
db,view: make throttling optional for view_update_generator
In some cases, sstables put in the upload dir should have view updates
generated from them. In order to avoid moving them across directories
(which then involves handling failure paths), upload dir will also be
treated as a valid directory where staging sstables reside.
Regular sstables that are not needed for view updates will be
immediately moved from upload/ dir as before.