Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
(cherry picked from commit ab72132cb1)
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.
Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>
(cherry picked from commit 162d9aa85d)
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
(cherry picked from commit 054854839a)
This backport includes changes from the "loading_shared_values and size limited and evicting prepared statements cache" series,
"missing bits from loading_shared_values series" series and the "cql_transport::cql_server: fix the distributed prepared statements cache population"
patch (the last one is squashed inside the "cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache"
patch in order to avoid the bisect breakage).
* 'branch-2-0-backport-prepared-cache-fixes-v1' of https://github.com/vladzcloudius/scylla:
tests: loading_cache_test: initial commit
utils + cql3: use a functor class instead of std::function
cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
transport::server::process_prepare() don't ignore errors on other shards
cql3: prepared statements cache on top of loading_cache
utils::loading_cache: make the size limitation more strict
utils::loading_cache: added static_asserts for checking the callbacks signatures
utils::loading_cache: add a bunch of standard synchronous methods
utils::loading_cache: add the ability to create a cache that would not reload the values
utils::loading_cache: add the ability to work with not-copy-constructable values
utils::loading_cache: add EntrySize template parameter
utils::loading_cache: rework on top of utils::loading_shared_values
sstables::shared_index_list: use utils::loading_shared_values
utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
- Transition the prepared statements caches for both CQL and Trhift to the cql3::prepared_statements_cache class.
- Add the corresponding metrics to the query_processor:
- Evictions count.
- Current entries count.
- Current memory footprint.
Fixes#2474
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120"
* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
lsa: Expose max_zone_segments for tests
lsa: Expose tracker::non_lsa_used_space()
lsa: Fix memory leak on zone reclaim
(cherry picked from commit 4ad212dc01)
end_bound was not updated in one of the cases in which end and
end_kind was changed, as a result later merging decision using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.
Fixes#3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dfe48bbbc7)
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.
Introduced in 2.0
Fixes #3053."
* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add reproducer for issue #3053
tests: mvcc: Add test for partition_snapshot::range_tombstones()
mvcc: Optimize partition_snapshot::range_tombstones() for single version case
mvcc: Fix partition_snapshot::range_tombstones()
tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
(cherry picked from commit 051cbbc9af)
(cherry picked from commit be5127388d)
[tgrabiec: dropped mvcc_test change, because the file does not exist here]
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.
* git@github.com:duarten/scylla.git rt-compact-fixes/v1:
compound_compact: Allow rvalues in size()
sstables/sstables: Convert non-compound clustering element to compound
tests/sstable_mutation_test: Verify we can write/read non-correct RTs
service/storage_service: Export non-compound RT feature
(cherry picked from commit e9cce59b85)
(cherry picked from commit 740fcc73b8)
Undid changes to size_estimates_virtual_reader
Changed test from memtable::make_flat_reader to memtable::make_reader
These tests now require having the storage service initialize, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
(cherry picked from commit 922f095f22)
(cherry picked from commit 8567723a7b)
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.
We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.
Fixes#2995Fixes#2986Fixes#2979Fixes#2992Fixes#2993
* git@github.com:duarten/scylla.git promoted-index-serialization/v3:
sstables/sstables: Unify column name writers
sstables/sstables: Don't write index entry for a missing row maker
sstables/sstables: Reuse write_range_tombstone() for row tombstones
sstables/sstables: Lift index writing for row tombstones
sstables/sstables: Leverage index code upon range tombstone consume
sstables/sstables: Move out tombstone check in write_range_tombstone()
sstables/sstables: A schema with static columns is always compound
sstables/sstables: Lift column name writing logic
sstables/sstables: Use schema-aware write_column_name() for
collections
sstables/sstables: Use schema-aware write_column_name() for row marker
sstables/sstables: Use schema-aware write_column_name() for static row
sstables/sstables: Writing promoted index entry leverages
column_name_writer
sstables/sstables: Add supported feature list to sstables
sstables/sstables: Don't use incorrectly serialized promoted index
cql3/single_column_primary_key_restrictions: Implement is_inclusive()
cql3/delete_statement: Constrain range deletions for non-compound
schemas
tests/cql_query_test: Verify range deletion constraints
sstables/sstables: Correctly deserialize range tombstones
service/storage_service: Add feature for correct non-compound RTs
tests/sstable_*: Start the storage service for some cases
sstables/sstable_writer: Prepare to control range tombstone
serialization
sstables/sstables: Correctly serialize range tombstones
tests/sstable_assertions: Fix monotonicity check for promoted indexes
tests/sstable_assertions: Assert a promoted index is empty
tests/sstable_mutation_test: Verify promoted index serializes
correctly
tests/sstable_mutation_test: Verify promoted index repeats tombstones
tests/sstable_mutation_test: Ensure range tombstone serializes
correctly
tests/sstable_datafile_test: Add test for incorrect promoted index
tests/sstable_datafile_test: Verify reading of incorrect range
tombstones
sstables/sstable: Rename schema-oblivious write_column_name() function
sstables/sstables: No promoted index without clustering keys
tests/sstable_mutation_test: Verify promoted index is not generated
sstables/sstables: Optimize column name writing and indexing
compound_compat: Don't assume compoundness
(cherry picked from commit bd1efbc25c)
Also added sstables::make_sstable() to preserve source compatibility in tests.
Fixes#2938.
* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
tests: range_tombstone_list: Add test for exception safety of apply()
tests: Introduce range_tombstone_list assertions
cache: Make range tombstone merging exception-safe
range_tombstone_list: Introduce apply_monotonically()
range_tombstone_list: Make reverter::erase() exception-safe
range_tombstone_list: Fix memory leaks in case of bad_alloc
mutation_partition: Fix abort in case range tombstone copying fails
managed_bytes: Declare copy constructor as allocation point
Integrate with allocation failure injection framework
(cherry picked from commit 5a4b46f555)
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at cetrtain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>
(cherry picked from commit e970630272)
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.
Fixes #2834."
* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
tests: mvcc: Add test for partition_snapshot_row_cursor
tests: row_cache: Add test for concurrent population
tests: row_cache: Make populate_range() accept partition_range
tests: Add simple_schema::make_ckey_range()
cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
mvcc: Keep track of all iterators in partition_snapshot_row_cursor
mvcc: Make partition_snapshot_row_cursor printable
(cherry picked from commit af1976bc30)
[tgrabiec: resolved conflicts]
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:
storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d
The situation should heal after whole cluster is upgraded.
Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.
Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which cause all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.
The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."
Fixes#2802.
* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
tests: cql_test_env: Enable all features in tests
schema_tables: Make make_scylla_tables_mutation() visible
migration_manager: Disable pulls during rolling upgrade from 1.7
storage_service: Introduce SCHEMA_TABLES_V3 feature
schema_tables: Don't alter tables which differ only in version
schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
(cherry picked from commit 8378fe190a)
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.
The solution implemented in this patch is to allow any order of counter
shards and automaticly merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.
A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.
Fixes #2752."
* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
tests/counter: verify counter_id ordering
counter: check that utils::UUID uses int64_t
mutation_partition_serializer: use old counter ordering if necessary
mutation_partition_view: do not expect counter shards to be sorted
sstables: write counter shards in the order expected by the cluster
tests/sstables: add storage_service_for_tests to counter write test
tests/sstables: add test for reading wrong-order counter cells
sstables: do not expect counter shards to be sorted
storage_service: introduce CORRECT_COUNTER_ORDER feature
tests/counter: test 1.7.4 compatible shard ordering
counters: add helper for retrieving shards in 1.7.4 order
tests/counter: add tests for 1.7.4 counter shard order
counters: add counter id comparator compatible with Scylla 1.7.4
tests/counter: verify order of counter shards
tests/counter: add test for sorting and deduplicating shards
counters: add function for sorting and deduplicating counter cells
counters: add counter_id::operator>
(cherry picked from commit 31706ba989)
"Fixes #2734."
* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
tests: Introduce clustering_ranges_walker_test
tests: simple_schema: Add missing include
sstables: reader: Make clustering_ranges_walker work with empty range set
clustering_ranges_walker: Make adjacency more accurate
(cherry picked from commit 5224ab9c92)
Ref #2733.
* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Add more tests for fast forwarding across partitions
sstables: Fix abort in mutation reader for certain skip pattern
sstables: Fix reader returning partition past the query range in some cases
sstables: Introduce data_consume_context::eof()
(cherry picked from commit 4e67bc9573)
"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.
Fixes#2333"
* 'pi-cell-name/v1' of github.com:duarten/scylla:
tests/sstable_mutation_test: Test promoted index blocks are monotonic
sstables: Consider eoc when flushing pi block
sstables: Extract out converting bound_kind to eoc
(cherry picked from commit db7329b1cb)
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.
Refs #2582.
Some numbers:
before - before cache changes were merged
(555621b537)
cache - at the commit that introduced the partial cache
(9b21a9bfb6)
after - recent master + this series
(based on e988121dbb)
Differences are shown relative to "before".
Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
before cache after
1636840 1013688 1234606
-38% -25%
Large partitions, range [0, 500000], reading from cache
before cache after
2012615 3076812 3035423
+53% +51%
Testing scanning small partitions with skips.
reading small partitions (skip 0)
before cache after
227060 165261 200639
-27% -11%
skipping small partitions (skip 1)
before cache after
29813 27312 38210
-8% +28%
Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
before cache after
195282 149695 180497
-23% -8%
* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
sstables: make sure that fill_buffer() actually fills buffer
mutation_merger: improve handling of non-deferring fill_buffer()s
partition_snapshot_row_cursor: avoid apply() in single-version cases
sstables: introduce decorated_key_view
ring_position_comparator: accept sstables::decorated_key_view
sstable: keep a pre-computed token in summary_entry
sstables: cache token in index entries
index_reader: advance_and_check_if_present() use index_comparator
ring_position_comparator: drop unused overloads
cache_streamed_mutation: avoid moving clustering_row
streamed_mutation: introduce consume_mutation_fragments_until()
cache_streamed_mutation: use consumer based read_context reader
rows_entry: make position() inlineable
mutation_fragment: make destructor always_inline
keys: introduce compound_wrapper::from_exploded_view()
sstables: avoid copying key components
compound_compat: explode: reserve some elements in a vector
cache: short-circut static row logic if there are no static columns
cache: use equality comparators instead of tri_compare
sstables: avoid indirect calls to abstract_type::is_multi_cell()
(cherry picked from commit e9fc0b0491)
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
(cherry picked from commit 1ea507d6ae)
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.
Message-Id: <20170724080128.8824-1-avi@scylladb.com>
(cherry picked from commit c21bb5ae05)
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."
* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
schema: Revert back to the 1.7 layout of static compact tables in memory
schema: Use v3 column layout when converting to/from schema mutations
schema: Encapsulate column layout translations in the v3_columns class
(cherry picked from commit 1daf1bc4bb)
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
(cherry picked from commit d9c64ef737)
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.
Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
(cherry picked from commit 9116dd91cb)
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with different table_schema_version that the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.
To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.
This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.
Fixes #2549."
* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
tests: Add test for concurrent column addition
legacy_schema_migrator: Set digest to one compatible with the old nodes
schema_tables: Persist table_schema_version
schema_tables: Introduce system_schema.scylla_tables
schema_tables: Simplify read_table_mutations()
schema_tables: Resurrect v2 read_table_mutations()
system_keyspace: Forward-declare legacy schemas
legacy_schema_migrator: Take storage_proxy as dependency
(cherry picked from commit a397889c81)
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
row_cache: Switch _stats.hits/misses to row granularity
row_cache: Rename num_entries() to partitions() for clarity
row_cache: Track mispopulations also at row level
row_cache: Track row insertions
row_cache: Track row hits and misses
row_cache: Make mispopulation counter also apply for continuity information
row_cache: Add partition_ prefix to current counters
misc_services: Switch to using reads_with[_no]_misses counters
row_cache: Add metrics for operations on underlying reader
row_cache: Add reader-related metrics
row_cache: Remove dead code
(cherry picked from commit b1a0e37fcb)
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range belongs to a specific shard."
* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
repair: Use selective_token_range_sharder
tests: Add test_selective_token_range_sharder
dht: Add selective_token_range_sharder
(cherry picked from commit 66e56511d6)
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving so better use a non-wrapping range as intended.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
(cherry picked from commit d95f908586)
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.
The 10MB threshold for partition size in cache is lifted.
Known issues:
- There is no partial eviction yet, whole partitions are still evicted,
and partition snapshots held by active reads are not evictable at all
- Information about range continuity is not recorded if that
would require inserting a dummy entry, or if previous entry
doesn't belong to the latest snapshot
- Cache update after memtable flush happening concurrently with reads
may inhibit that reads' ability to populate cache (new issue)
- Cache update from flushed memtables has partition granularity,
so may cause latency problems with large partition
- Schema is still tracked per-partition, so after schema changes
reads may induce high latency due to whole partition needing
to be converted atomically
- Range tombstones are repeated in the stream for every range between
cache entries they cover (new issue)
- Populating scans for both small and large partitions (perf_fast_forward)
experienced a 40% reduction of throughput, CPU bound
How was this tested:
- test.py --mode release
- row_cache_stress_test -c1 -m1G
- perf_fast_forward, passes except for the test case checking range continuity population
which would require inserting a dummy entry (mentioned above)
- perf_simple_query (-c1 -m1G --duration 32):
before: 90k [ops/s] stdev: 4k [ops/s]
after: 94k [ops/s] stdev: 2k [ops/s]"
* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
tests: Introduce row_cache_stress_test
utils: Add helpers for dealing with nonwrapping_range<int>
tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
tests: simple_schema: Accept value by reference
tests: simple_schema: Make add_row() accept optional timestamp
tests: simple_schema: Make new_timestamp() public
tests: simple_schema: Introduce make_ckeys()
tests: simple_schema: Introduce get_value(const clustered_row&) helper
tests: simple_schema: Fix comment
tests: simple_schema: Add missing include
row_cache: Introduce evict()
tests: Add cache_streamed_mutation_test
tests: mutation_assertions: Allow expecting fragments
mutation_fragment: Implement equality check
tests: row_cache: Add test for population of random partitions
tests: row_cache: Add test for partition tombstone population
tests: row_cache: Test reading randomly populated partition
tests: row_cache: Add test_single_partition_update()
tests: row_cache: Add test_scan_with_partial_partitions
...
Runs readers, updates and eviction concurrently and verifies the
following property of reads:
- reads see all past writes
- reads see no partial writes within a single partition