Commit Graph

16172 Commits

Author SHA1 Message Date
Vladimir Krivopalov
d4e0fa96e3 tests: Read rows only index
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
5561c713d9 sstables: Do not seek through the promoted index for static row positions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
917528c427 sstables: Read promoted index stored in SSTables 3.x ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
86d14f8166 sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
This is a pre-requisite for parsing promoted index blocks written in
SSTables 'mc' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
79c2f0095c utils: Add overloaded_functor helper.
The overloaded_functor class template can be used to encompass multiple
lambdas accepting different types into a single callable object that can
be used with any of those types.

One application is visitors for std::variant where different handling is
required for different types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
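
The overloaded-lambda idea described in the commit above can be sketched as follows. This is a minimal illustration, not the code from the commit: the `overloaded` template and the `describe` function are hypothetical names, and the real `overloaded_functor` may differ in detail.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical stand-in for the helper described above: it inherits from
// every lambda and exposes all of their call operators as one overload set.
template <typename... Fs>
struct overloaded : Fs... {
    using Fs::operator()...;
};

// Deduction guide so that overloaded{lambda1, lambda2} works in C++17.
template <typename... Fs>
overloaded(Fs...) -> overloaded<Fs...>;

// Dispatches on the alternative currently held by the variant.
inline std::string describe(const std::variant<int, std::string>& v) {
    return std::visit(overloaded{
        [](int i) { return "int: " + std::to_string(i); },
        [](const std::string& s) { return "string: " + s; },
    }, v);
}
```

Calling `describe(42)` selects the `int` lambda, while `describe(std::string("ab"))` selects the string one, so each alternative is handled without a hand-written visitor class.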
Vladimir Krivopalov
593d8faf7d position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
This facilitates position_in_partition creation when parsing range tombstones bounds from SSTables files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
997ebaaa14 sstables: Support reading signed vints in continuous_data_consumer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
540dfcc9bf sstables: Factor out the code building a vector of fixed clustering values lengths.
This code will be re-used in promoted_index_blocks_parser to parse
clustering key prefixes from SSTables 3.x format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
741d5f3b5d sstables: Remove unused includes from index_entry.hh
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
b29b948872 tests: Add test for reading SSTables 3.x index file with empty promoted index.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
054eb2df66 tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
The previous name of the file is rather confusing, as we have several
sstable_assertions classes throughout tests but this header only
contains a class for index reader assertions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
f50ffa267f sstables: Support parsing index entries from SSTables 3.x format.
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Piotr Jastrzebski
d0f8c71e28 sstables: move bound_kind_m to header
and add helper methods.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-20 13:50:17 -07:00
Duarte Nunes
6bd087facb Merge 'Make indexed queries with pk restrictions non-filtering' from Piotr
"
Queries that use secondary index and have a full partition key restriction
or full primary key restriction should not require filtering - it's
sufficient to add these restrictions to the index query.
This also adds secondary index tests to cover this case.

Tests: unit (release)
"

* 'si_and_pk_restrictions_2' of https://github.com/psarna/scylla:
  tests: add index + partition key test
  cql3: make index+primary key restrictions filtering-independent
  cql3: use primary key restrictions in filtering index queries
  cql3: add is_all_eq to primary key restrictions
  cql3: add explicit conversion between key restrictions
  cql3: add apply_to() method to single column restriction
  cql3: make primary key restrictions' values unambiguous
2018-07-19 16:54:43 +01:00
Tomasz Grabiec
d5534d6a77 Merge "Improve categorization of messaging verbs into connections" from Avi
Now that verb categorizations also affect scheduling, getting them
correct is more important. The first three patches in this series
improve the infrastructure a little, and the fourth fixes some
categorization errors wrt. repair/streaming verbs.

* https://github.com/avikivity/scylla msg-idx-sanity/v1:
  messaging: choose connection index via a look-up table
  messaging: convert do_get_rpc_client_idx into a switch
  messaging: remove default when computing rpc client index
  messaging: categorize more streaming/repair verbs as streaming
2018-07-19 15:03:15 +02:00
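
As a rough sketch of choosing a connection index via a look-up table, the mapping can be a constant array indexed by verb. The verbs, connection classes and index values below are hypothetical and chosen only for illustration; they are not taken from the series.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical verbs and connection classes, for illustration only.
enum class verb : std::size_t {
    mutation, gossip_echo, stream_mutation, repair_checksum, count
};
enum class conn_idx : unsigned { statement = 0, gossip = 1, streaming = 2 };

// One entry per verb, in declaration order. Adding a verb without
// extending the table changes the array size and is easy to spot.
constexpr std::array<conn_idx, static_cast<std::size_t>(verb::count)> verb_to_conn = {{
    conn_idx::statement,  // verb::mutation
    conn_idx::gossip,     // verb::gossip_echo
    conn_idx::streaming,  // verb::stream_mutation
    conn_idx::streaming,  // verb::repair_checksum
}};

// Constant-time lookup of the connection class for a verb.
constexpr conn_idx connection_for(verb v) {
    return verb_to_conn[static_cast<std::size_t>(v)];
}
```

Keeping the categorization in one table makes a miscategorized verb visible at a glance, which matters once the categories also drive scheduling.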
Tomasz Grabiec
ef4fb1f91d sstables: mp_row_consumer_m: Add trace-level logging
Very useful for debugging. The old mp_row_consumer_k_l had this.

Message-Id: <1532000326-28649-1-git-send-email-tgrabiec@scylladb.com>
2018-07-19 14:58:00 +03:00
Asias He
1f06ee3960 range_streamer: Limit nr of nodes to stream in parallel
For example, to bootstrap a 50th node in a cluster

 [shard 0] range_streamer - Bootstrap with
 [127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
 127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
 127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
 127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
 127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
 127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
 127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
 127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
 127.0.0.46]
 for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49

the new node will get data from 49 existing nodes.

Currently, it will stream from all the 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel
which can overwhelm the bootstrap node, i.e., 49 nodes sending, 1 node
receiving.

To fix this, limit the nr of nodes to stream in parallel. We should have
a better control over the memory usage and parallelism. But for now,
limit the nr of nodes to a maximum of 16 as a starter. With this limit,
each shard can work with as many as 16 remote nodes in parallel, I think
this has enough parallelism for streaming in terms of performance.

This change affects the bootstrap/decommission/removenode node
operations, and does not affect repair.

Refs #2782

Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
2018-07-19 11:44:05 +03:00
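
One simple way to cap the number of nodes streamed from in parallel is to process the source nodes in batches. The sketch below assumes batching; the commit may instead use a different mechanism, and `split_into_batches` is a hypothetical helper, not a function from range_streamer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits the source nodes into batches of at most
// max_parallel entries, so that only one batch is streamed from at a time.
inline std::vector<std::vector<std::string>>
split_into_batches(const std::vector<std::string>& nodes,
                   std::size_t max_parallel = 16) {
    std::vector<std::vector<std::string>> batches;
    if (max_parallel == 0) {
        return batches;  // nothing sensible to do with a zero limit
    }
    for (std::size_t i = 0; i < nodes.size(); i += max_parallel) {
        const std::size_t end = std::min(i + max_parallel, nodes.size());
        batches.emplace_back(nodes.begin() + static_cast<std::ptrdiff_t>(i),
                             nodes.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

With the 49 source nodes from the log above and a limit of 16, this yields three batches of 16 nodes and one batch with the remaining node.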
Avi Kivity
31d4d37161 Merge "Reduce continuous memory usage in gossip" from Asias
"
Use chunked_vector instead of vector. It won't have compatibility issues
because chunked_vector and vector have the same on-wire format.

Refs #278
"

* 'asias/gossip_memory_v2' of github.com:scylladb/seastar-dev:
  gossip: Reduce continuous memory usage
  to_string: Add std::list and utils::chunked_vector support
  serializer: Add chunked_vector support
2018-07-19 09:12:09 +03:00
Tomasz Grabiec
9a0548397c tests: row_cache: Add test for eviction from invalidated partitions
Message-Id: <1531933216-28026-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 21:06:36 +03:00
Piotr Sarna
82c049692b tests: add index + partition key test
Tests covering querying both index and partition keys are added
- it's checked that such queries do not require filtering.
2018-07-18 18:45:08 +02:00
Piotr Sarna
0c85bdcdc2 cql3: make index+primary key restrictions filtering-independent
If full partition key (or full primary key) is used in an indexed
query, it should not require filtering, because queries like that
can be efficiently narrowed down with stricter index restrictions.
2018-07-18 18:45:08 +02:00
Piotr Sarna
2542630a18 cql3: use primary key restrictions in filtering index queries
If both an index and a partition key are used in a query, it should not
require filtering, because the indexed query can be narrowed down
with partition key information. This commit appends partition key
restrictions to the index query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
27590816f0 cql3: add is_all_eq to primary key restrictions
is_all_eq is later needed to decide if restrictions can be used
in an indexed query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
20a349777e cql3: add explicit conversion between key restrictions
Partition and clustering key restrictions sometimes need to be converted
and this commit provides a way to do that.
2018-07-18 18:45:08 +02:00
Piotr Sarna
f1357defd6 cql3: add apply_to() method to single column restriction
This method allows copying single column restriction,
possibly with a new column definition.
2018-07-18 18:44:38 +02:00
Tomasz Grabiec
dc453d4f5d tests: flat_mutation_reader: Use fluent assertions for better error messages
Message-Id: <1531908313-29810-2-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
604c8baed8 tests: flat_mutation_reader_assertions: Introduce produces(mutation_fragment)
Message-Id: <1531908313-29810-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
c46813717c tests: sstables: Check that reading large index pages does not cause large allocations
Reproducer of #3597.

Message-Id: <1531914040-5427-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 14:56:41 +03:00
Piotr Sarna
30f9924ad5 cql3: make primary key restrictions' values unambiguous
using directive must be used to disambiguate the overridden method.
2018-07-18 13:28:37 +02:00
Avi Kivity
31151cadd4 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise, we're populating from _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()
2018-07-18 10:11:34 +03:00
Asias He
506eed325a dht: Fix typo in boot_strapper.cc
Eror -> Error

Message-Id: <ab1050c526f6e70c3a365595376acde7706d86e9.1531877929.git.asias@scylladb.com>
2018-07-18 10:00:27 +03:00
Tomasz Grabiec
894961006b Merge "db/view/view_builder: Fixes to bookkeeping" from Duarte
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.

* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:

Duarte Nunes (3):
  db/system_keyspace: Add function to remove view build status of a
    shard
  db/view: Don't have shard 0 clear other shard's status on drop
  db/view: Restrict writes to the distributed system keyspace to shard 0
2018-07-17 18:01:28 +02:00
Tomasz Grabiec
25d09e51ac Merge "db/view/build_progress_virtual_reader: Fixes to clustering key adjusts" from Duarte
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.

* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:

Duarte Nunes (3):
  db/view/build_progress_virtual_reader: Use correct schema to adjust ck
  db/view/build_progress_virtual_reader: Fix full ck detection
  db/view/build_progress_virtual_reader: Also adjust end RT bound
2018-07-17 18:00:30 +02:00
Avi Kivity
9ffa6b9ad6 Merge "Fix leaks and corruption of continuity in cache in case of bad_alloc from key linearization" from Tomasz
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.

The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. If we fail, the partition which is being applied may be
incorrectly left with the clustering range from the beginning of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds,
original continuity will be reflected in the sum. The problem will persist if
it doesn't eventually succeed, when we're really out of memory.

The user-perceivable effect of this would be temporary loss of writes in the
clustering range which was marked as continuous but shouldn't have been. Introduced in
2.2-rc1.

The second issue (#3585) is that the code which inserts partitions in memtable
and cache will leak the entry if boost::intrusive::set::insert() throws. This
will also cause SIGSEGV when cache tries to evict from such a leaked entry.
"

* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
  managed_bytes: Mark read_linearize() as an allocation point
  tests: Relax expectation about continuity after failed merging
  tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
  tests: Switch to seastar's allocation failure injector
  mutation_partition: Introduce set_continuity()
  clustering_interval_set: Introduce contained_in()
  clustering_interval_set: Introduce add() overload accepting another interval set
  mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
  mutation_partition: Preserve continuity in case row merging with no tracker throws
  memtable, cache: Fix exception safety of partition entry insertions
2018-07-17 18:19:37 +03:00
Tomasz Grabiec
477d7b439b row_cache: Fix violation of continuity on concurrent eviction and population
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise we're populating from _last_row, and
should consult it.

The fix introduces a new flag, set when starting to populate, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.

Fixes #3608
2018-07-17 16:43:21 +02:00
Tomasz Grabiec
8d47d21149 position_in_partition: Introduce is_before_all_clustered_rows()
2018-07-17 16:43:21 +02:00
Tomasz Grabiec
612b223819 managed_bytes: Mark read_linearize() as an allocation point
2018-07-17 16:39:43 +02:00
Tomasz Grabiec
be678a81ee tests: Relax expectation about continuity after failed merging
Currently we check that the sum of continuities is exactly the same as
expected on failure. Relax this to require that continuity is not
broader, since in some bad_alloc scenarios, or preemption, we will
have to mark some ranges as discontinuous.
2018-07-17 16:39:43 +02:00
Tomasz Grabiec
f366ac76e8 tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d9db79a85d tests: Switch to seastar's allocation failure injector
It catches more allocation sites.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
6b1fe6cbe5 mutation_partition: Introduce set_continuity()
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
ac772cbd81 clustering_interval_set: Introduce contained_in()
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d24ebe8565 clustering_interval_set: Introduce add() overload accepting another interval set
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
c6c54021a8 mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range from
the beginning up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.

Merging is retried by allocating_section, and there will be no problem
if it eventually succeeds, original continuity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.

To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.

Fixes #3583
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
de5c52f422 mutation_partition: Preserve continuity in case row merging with no tracker throws
Example:

 p:      row{key=A, cont=0} row{key=C, cont=1}
 this:                      row{key=C, cont=0}

When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.

This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
2018-07-17 16:30:01 +02:00
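
The stale-continuity example in the commit above can be reproduced with a simplified model. Everything below is a hypothetical sketch: rows are reduced to a key and a continuity flag, and `continuous_intervals` is an illustrative function, not part of mutation_partition.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Simplified, hypothetical model: a partition is an ordered map from key to
// a continuity flag. cont=true on a row means that the open interval between
// the preceding row (or -inf when there is none) and this row is continuous.
using partition = std::map<char, bool>;

// std::nullopt as the lower bound stands for -inf.
using interval = std::pair<std::optional<char>, char>;

// Lists the intervals that the rows of p mark as continuous.
inline std::vector<interval> continuous_intervals(const partition& p) {
    std::vector<interval> out;
    std::optional<char> prev;  // no preceding row yet, i.e. -inf
    for (const auto& [key, cont] : p) {
        if (cont) {
            out.emplace_back(prev, key);
        }
        prev = key;
    }
    return out;
}
```

For `p = {A: cont=0, C: cont=1}` the function reports only (A, C). Once the row for A has been moved out of `p`, the same flag on C is interpreted as (-inf, C), which is the stale continuity the commit guards against.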
Tomasz Grabiec
567da3e063 memtable, cache: Fix exception safety of partition entry insertions
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.

When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.

Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.

Fixes #3585.
2018-07-17 16:30:01 +02:00
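
The ownership pattern described in the commit above can be sketched as below. This is an illustration under stated assumptions: `std::unique_ptr` stands in for `alloc_strategy_unique_ptr<>`, a `std::set` of raw pointers stands in for the intrusive set, and `entry`, `throwing_less`, `fail_comparisons` and `insert_entry` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <set>

// Hypothetical entry type that counts live instances so leaks are visible.
struct entry {
    static inline int live = 0;
    int key;
    explicit entry(int k) : key(k) { ++live; }
    ~entry() { --live; }
};

// When set, comparisons throw, standing in for a key comparison that has
// to allocate (linearize the key) and fails.
inline bool fail_comparisons = false;

struct throwing_less {
    bool operator()(const entry* a, const entry* b) const {
        if (fail_comparisons) {
            throw std::bad_alloc();
        }
        return a->key < b->key;
    }
};

using entry_index = std::set<entry*, throwing_less>;

// The unique_ptr owns the entry until the insert has succeeded, so a
// throwing insert (or a duplicate key) deallocates it instead of leaking.
inline bool insert_entry(entry_index& idx, int key) {
    auto e = std::make_unique<entry>(key);
    auto [it, inserted] = idx.insert(e.get());  // may throw
    if (inserted) {
        e.release();  // the index now refers to the entry
    }
    return inserted;
}
```

If the comparator throws during `insert_entry`, the exception propagates to the caller and the live-instance count is unchanged, because the `unique_ptr` destroys the entry during unwinding.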
Tomasz Grabiec
c82c0be0be tests: mutation_diff: Ignore differences in memory addresses
Differences in memory addresses are not necessarily differences in
values.

Refs #3571

Message-Id: <1531824919-12737-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 16:32:04 +03:00
Amos Kong
0fcdab8538 scylla_setup: nic setup dialog is only for interactive mode
The current code raises the dialog even in non-interactive mode, when options
are passed to scylla_setup. This blocked the automated artifact-test.

Fixes #3549

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <58f90e1e2837f31d9333d7e9fb68ce05208323da.1531824972.git.amos@scylladb.com>
2018-07-17 16:31:18 +03:00
Paweł Dziepak
422d1eaeb9 Merge "Improve usability of pkeys in system.large_partitions table" from Avi
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.

Improve the situation by deserializing the partition key for them.
"

* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
  large_partition_handler: output friendly partition key
  keys: schema-aware printing of a partition_key
2018-07-17 13:51:22 +01:00
Avi Kivity
002ac87aac Update seastar submodule
* seastar aac6cf1...6b97e00 (5):
  > Merge "changes to fix travis CI builds" from Kefu
  > tls.cc: Make "close" timeout delay exception proof
  > core/sharded: mark foreign_ptr::get_owner_shard() const
  > core/memory: Expose counter of large allocations
  > tests: add test for multi-fragmented net::packet

Fixes #3461.
Ref scylladb/seastar#474.
2018-07-17 15:43:01 +03:00