"
The problem happens under the following circumstances:
- we have a partially populated partition in cache, with a gap in the middle
- a read with no clustering restrictions trying to populate that gap
- eviction of the entry for the lower bound of the gap concurrent with population
The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.
Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.
The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.
Fixes#3608.
"
* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
row_cache: Fix violation of continuity on concurrent eviction and population
position_in_partition: Introduce is_before_all_clustered_rows()
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.
* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:
Duarte Nunes (3):
db/system_keyspace: Add function to remove view build status of a
shard
db/view: Don't have shard 0 clear other shard's status on drop
db/view: Restrict writes to the distributed system keyspace to shard 0
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.
* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:
Duarte Nunes (3):
db/view/build_progress_virtual_reader: Use correct schema to adjust ck
db/view/build_progress_virtual_reader: Fix full ck detection
db/view/build_progress_virtual_reader: Also adjust end RT bound
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.
The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. If we fail, the partition which is being applied may be
incorrectly left with the clustering range since the begging of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds,
original continuity will be reflected in the sum. The problem will persist if
it doesn't eventually succeed, when we're really out of memory.
The user-perceivable effect of this would be temporary loss of writes in the
clustering range which was marked as continuous but shouldn't. Introduced in
2.2-rc1.
The second issue (#3585) is that the code which inserts partitions in memtable
and cache will leak the entry if boost::intrusive_set::insert() throws. This
will also cause SIGSEGV when cache tries to evict from such a leaked entry.
"
* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
managed_bytes: Mark read_linearize() as an allocation point
tests: Relax expectation about continuity after failed merging
tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
tests: Switch to seastar's allocation failure injector
mutation_partition: Introduce set_continuity()
clustering_interval_set: Introduce contained_in()
clustering_interval_set: Introduce add() overload accepting another interval set
mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
mutation_partition: Preserve continuity in case row merging with no tracker throws
memtable, cache: Fix exception safety of partition entry insertions
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise we're populating since _last_row, and
should consult it.
The fix introduces a new flag, set when starting to populte, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.
Fixes#3608
Currently we check that the sum of continuities is exactly the same as
expected on failure. Relax this to require that continuity is not
broader, since in some bad_alloc scenarios, or preemption, we will
have to mark some ranges as discontinuous.
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range since
the begging up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.
Merging is retried by allocating_section, and there will be no problem
if it eventually suceeds, original continity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.
To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.
Fixes#3583
Example:
p: row{key=A, cont=0} row{key=C, cont=1}
this: row{key=C, cont=0}
When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.
This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.
When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.
Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.
Fixes#3585.
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.
Improve the situation by deserializing the partition key for them.
"
* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
large_partition_handler: output friendly partition key
keys: schema-aware printing of a partition_key
* seastar aac6cf1...6b97e00 (5):
> Merge "changes to fix travis CI builds" from Kefu
> tls.cc: Make "close" timeout delay exception proof
> core/sharded: mark foreign_ptr::get_owner_shard() const
> core/memory: Expose counter of large allocations
> tests: add test for multi-fragmented net::packet
Fixes#3461.
Ref scylladb/seastar#474.
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.
Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.
Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
Use abstract_type::to_string() to prettify partition key components.
Manually tested by setting --compaction-large-partition-warning-threshold-mb
to zero and inspecting the output for compound and non-compound partition
keys.
Add a with_schema() helper to decorate a partition key with its
schema for pretty-printing purposes, and matching operator<<.
This is useful to print partition keys where the operator, who
may not be familiar with the encoding, may see them.
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.
Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.
Fixes#3605.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory. Still the manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory
Despite the restrictions, one example of usage of this is recovery. If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.
Tests: unit (release)
"
* 'multi-data-file-v2' of github.com:glommer/scylla:
database: change ident
database: support multiple data directories
database: allow resharing to specify a directory
database: support multiple directories in get_snapshot_details
database: move get_snapshot_info into a seastar::thread
snapshots: always create the snapshot directory
sstables: pass sstable dir with entry descriptor
database: make nodetool listsnapshots print correct information
sstables: correctly create descriptors for snapshots
"
This work is on top of Gleb's rpc streaming which is merged recently.
What this series does is to replace scylla streaming service's data plane to
use the new rpc streaming instead of the old rpc verb to send the mutations for
scylla streaming. Other parts of scylla streaming, the control plane, are not
changed.
In my test, to bootstrap a new node to the existing one node cluster, smp 2,
scylla stores data on ramdisk to minimize disk io impact.
I saw x2 improvment in streaming bandwidth.
Before:
[shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds
After:
[shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds
Tests: dtest update_cluster_layout_tests.py
Fixes: #3591
"
* tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev:
streaming: Add rpc streaming support
storage_service: Introduce STREAM_WITH_RPC_STREAM feature
streaming: Add estimate_partitions to send_info
messaging_service: Add streaming with rpc streaming support
messaging_service: Add streaming_domain
database: Add add_sstable_and_update_cache
database: Add make_streaming_sstable_for_write
Assign a scheduling_group for each RPC service. Assignement is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.
The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
This patch addresses several issues.
1. The class no longer uses placement-new trick for move-assignment.
It was incorrect to use because the class contains const refererences
and re-initializing the same region of memory would result in undefined
behaviour on accessing these members.
2. Use boost::iterator_range for tracking the current range of
cr_ranges. It is easier to deal with and avoids possible bugs like
assigning only one of two iterators
Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.
It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
Preparation for adding rpc streaming in scylla streaming.
- register_stream_mutation_fragments is used to register the rpc
streaming verb
- make_sink_and_source_for_stream_mutation_fragments is used to get the
sink and source object for the sender
- make_sink_for_stream_mutation_fragments is used to get a sink object
for the receiver
Since we can write mutations to sstable directly in streaming, we need
to add those sstables to the system so it can be seen by the query.
Also we need to update the cache so the query refects the latest data.
This will be used to create sstable for streaming receiver to write the
mutations received from network to sstable file instead of writing to
memtable.
Since some AMIs using consistent network device naming, primary NIC
ifname is not 'eth0'.
But we hardcoded NIC name as 'eth0' on scylla_ec2_check, we need to add
--nic option to specify custom NIC ifname.
Fixes#3584
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.
Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.
Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.
For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.
Fix this by using a temporary container to hold those entries to be
reloaded.
Spotted when reading the code.
Also use if constexpr and fix the comment in the function containing
the changes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
"
If there is a lot of partitions in the index page, index_list may grow large
and require large contiguous blocks of memory, because it's based on
std::vector. That puts pressure on the memory allocator, and if memory is
fragmented, may not be possible to satisfy without a lot of eviction. Switch
to chunked_vector to avoid this.
Refs #3597
"
* 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla:
sstables: Switch index_list to chunked_vector to avoid large allocations
utils: chunked_vector: Do not require T to be default-constructible for clear()
utils: chunked_vector: Implement front()