Presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be alloc-dealloc mismatch.
That is the case with LeveledCompactionStrategy, which uses
incremental_selector.
Fix by invoking the presence check in the standard allocator context.
Fixes#4063.
Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 32f711ce56)
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.
Refs #4051.
(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
test_fast_forwarding_across_partitions_to_empty_range uses an uninitialized
string to populate an sstable, but this can be invalid utf-8 so that sstable
cannot be sstabledumped.
Make it valid by using make_random_string().
Fixes#4040.
Message-Id: <20190107193240.14409-1-avi@scylladb.com>
(cherry picked from commit d8adbeda11)
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. But moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
(cherry picked from commit da090a5458)
The workload in #3844 has these characteristics:
- very small data set size (a few gigabytes per shard)
- large working set size (all the data, enough for high cache miss rate)
- high overwrite rate (so a compaction results in 12X data reduction)
As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.
While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.
We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctioations will be
limited to 5%.
Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).
Fixes#3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>
(cherry picked from commit b0980ba7c6)
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.
By making sure that flushes are this point running alone we improve the
user experience by making sure the startup is as fast as it can be.
There is a similar problem at the drain level, which is also fixed in this
series.
Fixes#3958
* git@github.com:glommer/scylla.git faster-restart
compaction_manager: delay initialization of the compaction manager.
drain: stop compactions early
(cherry picked from commit 3e70ae1d06)
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.
Fix by unregistering the querier from the semaphore before destroying
the entry.
Refs: #4018
Refs: #4031
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
(cherry picked from commit e5a0ea390a)
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.
Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.
Fixes#4018.
Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
(cherry picked from commit 918d255168)
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.
Fixes#4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
db/view/view_update_from_staging_generator: Break semaphore on stop()
db/view/view_update_from_staging_generator: Restore formatting
db/view/view_update_from_staging_generator: Avoid creating more than one fiber
(cherry picked from commit 96172b7bca)
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.
Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
(cherry picked from commit bab7e6877b)
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.
The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.
When memtable is destroyed without being moved to cache there is no
problem because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.
Introduced in f3da043 (in >= 3.0-rc1)
Fixes#4030.
Tests:
- mvcc_test (debug)
"
* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
tests: mvcc: Introduce mvcc_container::migrate()
tests: mvcc: Make mvcc_partition move-constructible
tests: mvcc: Introduce mvcc_container::make_not_evictable()
tests: mvcc: Allow constructing mvcc_container without a cache_tracker
mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
mvcc: partition_snapshot: Introduce migrate()
mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner
(cherry picked from commit 8e2f6d0513)
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur in a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.
Refs #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
(cherry picked from commit 66e45469b2)
rpc::source cannot be abandoned until EOS is reached, but current code
does not obey it if error code is received, it throws exception instead that
aborts the reading loop. Fix it by moving exception throwing out of the
loop.
Fixes: #4025
Message-Id: <20181227135051.GC29458@scylladb.com>
(cherry picked from commit 37b4043677)
This patch is for branch 3.0's build_ami.sh.
It checks out the latest master branch of scylla-jmx, which not only
sounds wrong, it also doesn't work: the latest master of scylla-jmx
can only build a "relocatable package" but branch 3.0 doesn't work with
those.
This patch needs to be applied only in branch 3.0.
It should probably be made more general, though... build_ami.sh should
have been able to figure out what is the *current* branch, and if it is
branch-3.0 or next-3.0, check out branch-3.0 of the other repositories.
But I'm not sure how to do this correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217214610.4498-1-nyh@scylladb.com>
When the next pending fragments are after the start of the new range,
we know there is no need to skip.
Caught by perf_fast_forward --datasets large-part-ds3 \
--run-tests=large-partition-slicing
Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7afe2bad51)
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
(cherry picked from commit e6d26a528f)
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.
Also reduces dependency on sstable writer code by removing writer bits from
sstales.hh.
The ka/la format writers are still left in sstables.cc, they could be also extracted.
"
* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
sstables: Make variadic write() not picked on substitution error
sstables: Extract MC format writer to mc/writer.cc
sstables: Extract maybe_add_summary_entry() out of components_writer
sstables: Publish functions used by writers in writer.hh
sstables: Move common write functions to writer.hh
sstables: Extract sstable_writer_impl to a header
sstables: Do not include writer.hh from sstables.hh
sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
sstables: types: Extract sstable_enabled_features::all()
sstables: Move components_writer to .cc
tests: sstable_datafile_test: Avoid dependency on components_writer
(cherry picked from commit b023e8b45d)
To be used by sstable_writer for stats collection.
Note that this patch is factored out so it can be verified with no
other change in functionality.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6853c1677d)
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ad5f1e4fbb)
"
This series contains several optimizations of the MC format sstable writer, mainly:
- Avoiding output_stream when serializing into memory (e.g. a row)
- Faster serialization of primitive types when serializing into memory
I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:
- 10% for a row with a single cell of 8 bytes
- 10% for a row with a single cell of 100 bytes
- 9% for a row with a single cell of 1000 bytes
- 13% for a row with 6 cells of 100 bytes
"
* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
bytes_ostream: Optimize writing of fixed-size types
sstables: mc: Write temporary data to bytes_ostream rather than file_writer
sstables: mc: Avoid double-serialization of a range tombstone marker
sstables: file_writer: Generalize bytes& writer to accept bytes_view
sstables: Templetize write() functions on the writer
sstables: Turn m_format_write_helpers.cc into an impl header
sstables: De-futurize file_writer
bytes_ostream: Implement clear()
bytes_ostream: Make initial chunk size configurable
(cherry picked from commit e3f53542c9)
Currently if something throws while streaming in mutation sending loop
sink is not closed. Also when close() is running the code does not hold
onto sink object. close() is async, so sink should be kept alive until
it completes. The patch uses do_with() to hold onto sink while close is
running and run close() on error path too.
Fixes#4004.
Message-Id: <20181220155931.GL3075@scylladb.com>
(cherry picked from commit 393269d34b)
The newer version of node_exporter comes with important bug fixes, that
is especially important for I3.metal is not supported with the older
version of node_exporter.
The dashboards can now support both the new and the old version of
node_exporter.
Fixes#3927
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181210085251.23312-1-amnon@scylladb.com>
(cherry picked from commit 09c2b8b48a)
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on a
n1-standard-8 instance run cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWoFixes#2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...
(cherry picked from commit b66f59aa3d)
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.
Fixes#4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>
(cherry picked from commit dd51c659f7)
In some cases which I've yet to understand, build fails without libatomic.
We need to add it to the mock build machine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181218154757.25236-1-nyh@scylladb.com>
build_rpm.sh uses "mock" to build an entire Scylla build environment,
which easily spans more than 15 gigabytes. mock, by defaults, puts this
build directory in a subdirectory of /var/lib/mock. There is no reason
why temporary build products need to be in the root directory: Some machines
(like mine) don't have that much free space in the root directory making it
impossible to use this script on such machines. and it's too easy
to leave temporary files there without noticing.
With this patch, the mock directories are put in build/mock/ instead of
/var/lib/mock.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217195952.15154-1-nyh@scylladb.com>
"
Before the reader was just ignoring such columns but this creates a risk of data loss.
Refs #2598
"
* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
sstables: Add test_sstable_reader_on_unknown_column
sstables: Exception on sstable's column not present in schema
sstables: store column name in column_translation::column_info
sstables: Make test_dropped_column_handling test dropped columns
(cherry picked from commit b0cb69ec25)
"
Previously we were checking for schema incompatibility between current schema and sstable
serialization header before reading any data. This isn't the best approach because
data in sstable may be already irrelevant due to column drop for example.
This patchset moves the check after actual data is read and verified that it has
a timestamp new enough to classify it as nonobsolete.
Fixes#3924
"
* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
sstables: Enable test_schema_change for MC format
sstables3: Throw error on schema mismatch only for live cells
sstables: Pass column_info to consume_*_column
sstables: Add schema_mismatch to column_info
sstables: Store column data type in column_info
sstables: Remove code duplication in column_translation
(cherry picked from commit 62ea153629)
This series adds a generic test for schema changes that generates
various schema and data before and after an ALTER TABLE operation. It is
then used to check correctness of mutation::upgrade() and sstable
readers and lead to the discovery of #3924 and #3925.
Fixes#3925.
* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
schema_builder: make member function names less confusing
converting_mutation_partition_applier: fix collection type changes
converting_mutation_partition_applier: do not emit empty collections
sstable: use format() instead of sprint()
tests/random-utils: make functions and variables inline
tests: add models for schemas and data
tests: generate schema changes
tests/mutation: add test for schema changes
tests/sstable: add test for schema changes
(cherry picked from commit 564b328b2e)
"
Compression is not deterministic so instead of binary comparing the sstable files we just read data back
and make sure everything that was written down is still present.
Tests: unit(release)
"
* 'haaawk/binary-compare-of-compressed-sstables/v3' of github.com:scylladb/seastar-dev:
sstables: Remove compressed parameter from get_write_test_path
sstables: Remove unused sstable test files
sstables: Ensure compare_sstables isn't used for compressed files
sstables: Don't binary compare compressed sstables
sstables: Remove debug printout from test_write_many_partitions
(cherry picked from commit 1ff6b8fb96)