19730 Commits

Author SHA1 Message Date
Tomasz Grabiec
2fc144e1a8 tests: memtable_snapshot_source: Allow changing the schema 2019-10-03 22:03:29 +02:00
Tomasz Grabiec
22dde90dba tests: simple_schema: Prepare for schema altering
Currently, methods of simple_schema assume that table's schema doesn't
change. Accessors like get_value() assume that rows were generated
using simple_schema::_s. Because of that, the column_definition& for
the "v" column is cached in the instance. That column_definition&
cannot be used to access objects created with a different schema
version. To allow using simple_schema after schema changes,
column_definition& caching is now tagged with the table schema version
of origin. Methods which access schema-dependent objects, like
get_value(), are now accepting schema& corresponding to the objects.

Also, it's now possible to tell simple_schema to use a different
schema version in its generator methods.
2019-10-03 22:03:29 +02:00
Tomasz Grabiec
e6afc89735 row_cache: Record upgraded schema in memtable entries during update
Cache update may defer in the middle of moving of partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.

That is undefined behavior in general, which may include:

 - read failures due to bad_alloc, if fixed-size cells are interpreted
   as variable-sized cells, and we misinterpret a value for a huge
   size

 - wrong read results

 - node crash

This doesn't result in permanent corruption; restarting the node
should help.

The more rows there are in a partition, the more likely this is to
happen. It's unlikely to happen with single-row partitions.

Introduced in 70c7277.

Fixes #5128.
2019-10-03 22:03:29 +02:00
Tomasz Grabiec
ea461a3884 memtable: Extract memtable_entry::upgrade_schema() 2019-10-03 22:03:29 +02:00
Tomasz Grabiec
90d6c0b9a2 row_cache, mvcc: Prevent locked snapshots from being evicted
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.

This can happen only in OOM situations where the whole cache gets evicted.

Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.

Introduced in 70c7277.

Fixes #5134.
2019-10-03 22:03:29 +02:00
Tomasz Grabiec
57a93513bd row_cache: Make evict() not use invalidate_unwrapped()
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.

evict() is used only in tests.
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
c88a4e8f47 mvcc: Introduce partition_snapshot::touch() 2019-10-03 22:03:28 +02:00
Tomasz Grabiec
25e2f87a37 row_cache, mvcc: Do not upgrade schema of entries which are being updated
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.

This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.

Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.

The problem is fixed by locking updated entries and not upgrading
schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.

Fixes #5135
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
0675088818 row_cache: Use the correct schema version to populate the partition entry
The sstable reader which populates the partition entry in the cache is
using the schema of the partition entry snapshot, which will be the
schema of the cache at the time the partition was entered. If there
was a schema change after the cache reader entered the partition but
before it created the sstable reader, the cache populating reader will
interpret sstable fragments using the wrong schema version. That is
more likely if partitions have many rows, and the front of the
partition is populated. With single-row partitions that's unlikely to
happen.

That is undefined behavior in general, which may include:

 - read failures due to bad_alloc, if fixed-size cells are
   interpreted as variable-sized cells, and we misinterpret
   a value for a huge size

 - wrong read results

 - node crash

This doesn't result in permanent corruption; restarting the node
should help.

Fixes #5127.
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
10992a8846 delegating_reader: Optimize fill_buffer()
Use move_buffer_content_to() which is faster than fill_buffer_from()
because it doesn't involve popping and pushing the fragments across
buffers. We save on size estimation costs.
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
aad1307b14 row_cache, memtable: Use upgrade_schema() 2019-10-03 13:28:33 +02:00
Tomasz Grabiec
3177732b35 flat_mutation_reader: Introduce upgrade_schema() 2019-10-03 13:28:33 +02:00
Asias He
a9b95f5f01 repair: Fix tracker::start and tracker::done in case of error
The operation after gate.enter() in tracker::start() can fail and throw.
In that case we should call gate.leave() to avoid unbalanced enter and
leave calls. tracker::done() has a similar issue.

Fix it by removing the gate enter and leave logic in tracker start and
done. A helper tracker::run() is introduced to take care of the gate and
repair status.

In addition, the error log is improved. It now logs exceptions on all
shards in the summary. e.g.,

[shard 0] repair - repair id 1 failed: std::runtime_error
({shard 0: std::runtime_error (error0), shard 1: std::runtime_error (error1)})

Fixes #5074
2019-10-03 13:33:02 +03:00
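
The gate balancing described in a9b95f5f01 can be illustrated with a minimal sketch in standard C++. This is not the repair tracker code: counting_gate and run_in_gate() are hypothetical stand-ins, and the only point shown is that a run()-style helper can leave the gate on every exit path, including when the operation throws.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Stand-in for a gate: it only counts balanced enter()/leave() calls.
class counting_gate {
    int _entered = 0;
public:
    void enter() { ++_entered; }
    void leave() { assert(_entered > 0); --_entered; }
    int entered() const { return _entered; }
};

// Hypothetical run() helper: enters the gate, runs the operation, and
// leaves the gate on every exit path, including when func throws.
template <typename Func>
auto run_in_gate(counting_gate& g, Func&& func) {
    g.enter();
    struct leave_on_exit {
        counting_gate& g;
        ~leave_on_exit() { g.leave(); }
    } guard{g};
    return std::forward<Func>(func)();
}
```

With this shape, callers never pair enter() and leave() by hand, so an exception thrown between the two cannot leave the gate unbalanced.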
Botond Dénes
00b432b61d querier_cache: correctly account entries evicted on insertion in the population
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.

Fixes: #5123

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
2019-10-03 11:49:44 +03:00
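
The accounting imbalance described in 00b432b61d can be reproduced with a toy model. This is a sketch under stated assumptions, not the querier_cache implementation: toy_cache, its size limit, and the count_every_insert switch are all hypothetical, and "evicted immediately on insert" is modelled as dropping the new entry when the cache is over its limit.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy cache, not the real querier_cache: an entry that does not fit is
// evicted right after being inserted.
class toy_cache {
    std::size_t _max_entries;
    bool _count_every_insert;       // true models the fix, false the old accounting
    std::deque<int> _entries;
    std::uint64_t _population = 0;  // unsigned, so an extra decrement wraps around

    void evict_newest() {
        _entries.pop_back();
        --_population;              // eviction always decrements the stat
    }

public:
    toy_cache(std::size_t max_entries, bool count_every_insert)
        : _max_entries(max_entries), _count_every_insert(count_every_insert) {}

    void insert(int entry) {
        _entries.push_back(entry);
        const bool over_limit = _entries.size() > _max_entries;
        // Old accounting: skip the increment for entries evicted immediately.
        // Fixed accounting: always increment, whatever happens next.
        if (_count_every_insert || !over_limit) {
            ++_population;
        }
        if (over_limit) {
            evict_newest();
        }
    }

    std::uint64_t population() const { return _population; }
    std::size_t size() const { return _entries.size(); }
};
```

In the old mode, each insert into a full cache decrements the stat without a matching increment, so the counter drifts below the real entry count and eventually wraps around.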
Konstantin Osipov
e8c13efb41 lwt: move mutation hashers to mutation.hh
Prepare mutation hashers for reuse in CAS implementation.
Message-Id: <20190930202409.40561-2-kostja@scylladb.com>
2019-10-01 19:49:31 +02:00
Konstantin Osipov
6cde985946 lwt: remove code that no longer serves as a reference
Remove ifdef'ed Java code, since LWT implementation
is based on the current state of the origin.
Message-Id: <20190930201022.40240-2-kostja@scylladb.com>
2019-10-01 19:46:15 +02:00
Konstantin Osipov
4d214b624b lwt: ensure enum_set::of is constexpr.
This allows using it to initialize const static members.
Message-Id: <20190930200530.40063-2-kostja@scylladb.com>
2019-10-01 19:45:56 +02:00
Tomasz Grabiec
3b9bf9d448 Merge "storage_proxy: replace variadic futures with structs" from Avi
Seastar variadic futures are deprecated, so replace with structs to
avoid nasty deprecation warnings.
2019-10-01 19:32:55 +02:00
Avi Kivity
162730862d storage_proxy: remove variadic future from query_partition_key_range_concurrent()
Seastar variadic futures are deprecated, so replace with a nice struct.
2019-09-30 21:33:44 +03:00
Avi Kivity
968b34a2b4 storage_proxy: remove variadic future from digest_read_resolver
Seastar variadic futures are deprecated, so replace with a nice
struct.
2019-09-30 21:32:17 +03:00
Nadav Har'El
c9aae13fae docs/alternator/getting-started.md: fix indentation in example code
The example Python code had wrong indentation, and wouldn't actually
work if naively copy-pasted. Noticed by Noam Hasson.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190929091440.28042-1-nyh@scylladb.com>
2019-09-30 13:03:29 +03:00
Avi Kivity
c6b66d197b Merge "Couple of preparatory patches for lwt" from Gleb
"
This is a collection of assorted patches that will be needed for LWT.
Most of them are trivial, but one touches a lot of files, so it has a
good chance of causing a rebase headache (I already had to rebase it on
top of Alternator). Let's push them earlier instead of carrying them in
the lwt branch.
"

* 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev:
  lwt: make _last_timestamp_micros static
  lwt: Add client_state::get_timestamp_for_paxos() function
  lwt: Pass client_state reference all the way to storage_proxy::query
  exceptions: Add a constructor for unavailable_exception that allows providing a custom message
  serializer: Add std::variant support
  lwt: Add missing functions to utils/UUID_gen.hh
2019-09-29 13:02:26 +03:00
Avi Kivity
9e990725d9 Merge "Simplify and explain from_varint_to_integer #5031" from Rafael
"
This is the second version of the patch series. The previous one was just the second patch; this one adds more tests and another patch to make it easier to test that the new code has the same behavior as the old one.
"

* 'espindola/overflow-is-intentional' of https://github.com/espindola/scylla:
  types: Simplify and explain from_varint_to_integer
  Add more cast tests
2019-09-29 11:27:55 +03:00
Tomasz Grabiec
b0e0f29b06 db: read: Filter-out sstables using its first and last keys
Affects single-partition reads only.

Refs #5113

When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.

For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.

The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.

At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain the given partition. Then we proceed to sstable indexes to narrow
down the data file range.

Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).

This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.

It's also faster to drop sstables based on the keys than the bloom
filter.
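
The effect of consulting the key range in addition to the bloom filter can be sketched as follows. This is a toy model, not the replica read path: sstable_info, its integer keys, and the bloom_ineffective flag (a filter that answers "maybe" for every key) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy sstable descriptor. Keys are plain ints here, which is a simplification.
struct sstable_info {
    int first_key;
    int last_key;
    bool bloom_ineffective;  // models a filter that says "maybe" for everything
    std::set<int> keys;

    bool bloom_may_contain(int key) const {
        return bloom_ineffective || keys.count(key) > 0;
    }
    bool range_may_contain(int key) const {
        return first_key <= key && key <= last_key;
    }
};

// Counts the sstables a single-partition read would still have to open.
std::size_t count_candidates(const std::vector<sstable_info>& sstables,
                             int key, bool check_key_range) {
    std::size_t n = 0;
    for (const auto& sst : sstables) {
        if (check_key_range && !sst.range_may_contain(key)) {
            continue;  // dropped with certainty, no bloom filter lookup needed
        }
        if (sst.bloom_may_contain(key)) {
            ++n;
        }
    }
    return n;
}
```

For three non-overlapping sstables with ineffective filters, the range check narrows the candidates from three to one, which is the case the commit message attributes to sstables produced by repair or streaming.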

Tests:
  - unit (dev)
  - manual using cqlsh

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
2019-09-28 19:42:57 +03:00
Tomasz Grabiec
b93cc21a94 sstables: Fix partition key count estimation for a range
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.

The first reason is that it underestimated the number of sstable index
pages covered by the range, by one. In the extreme case, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.

A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.

Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:

   (downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
     / _components->summary.header.sampling_level

Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.
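
The arithmetic in the paragraph above can be checked with a few constants. The figures are the ones quoted in the message (22-byte summary entry, 1:20000 ratio, about 300 bytes per partition, multiplier of 128); the constant and function names are made up for this sketch.

```cpp
#include <cstdint>

// Figures quoted in the commit message (cassandra-stress, default settings).
constexpr std::uint64_t summary_entry_bytes = 22;
constexpr std::uint64_t data_bytes_per_summary_byte = 20'000;
constexpr std::uint64_t partition_bytes = 300;
constexpr std::uint64_t old_keys_per_page = 128;

// Data file bytes covered by one index page.
constexpr std::uint64_t data_bytes_per_page() {
    return summary_entry_bytes * data_bytes_per_summary_byte;
}

// Partitions that fit into the data covered by one index page (rounded down).
constexpr std::uint64_t partitions_per_page() {
    return data_bytes_per_page() / partition_bytes;
}

// How many times the old multiplier of 128 falls short of that figure.
constexpr double underestimation_factor() {
    return static_cast<double>(partitions_per_page()) / old_keys_per_page;
}
```

The factor comes out between 11 and 12, which matches the "order of magnitude" stated in the message.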

Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.

The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
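
The difference between counting sampled keys inside the range and counting pages that may contain keys in the range can be sketched as below. This is an illustration only, not sstable::estimated_keys_for_range(): keys are plain ints, page_first_keys is a hypothetical sorted list holding the first key of each index page, and callers are assumed to pass lo <= hi.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Old approach as described in the message: count only the sampled keys
// that fall inside [lo, hi].
std::size_t samples_in_range(const std::vector<int>& page_first_keys, int lo, int hi) {
    auto b = std::lower_bound(page_first_keys.begin(), page_first_keys.end(), lo);
    auto e = std::upper_bound(page_first_keys.begin(), page_first_keys.end(), hi);
    return static_cast<std::size_t>(e - b);
}

// Fixed approach: count the pages that may contain a key in [lo, hi].
// The page holding lo is the last one whose first key is <= lo.
std::size_t pages_overlapping(const std::vector<int>& page_first_keys, int lo, int hi) {
    auto first = std::upper_bound(page_first_keys.begin(), page_first_keys.end(), lo);
    if (first != page_first_keys.begin()) {
        --first;
    }
    auto last = std::upper_bound(page_first_keys.begin(), page_first_keys.end(), hi);
    return static_cast<std::size_t>(last - first);
}

// Estimate = overlapping pages * average key count per index page.
std::size_t estimate_keys(std::size_t pages, std::size_t total_keys, std::size_t total_pages) {
    return total_pages == 0 ? 0 : pages * (total_keys / total_pages);
}
```

With four pages and a total estimate of 10112 keys, one page corresponds to 2528 keys, the smallest non-zero value in the "After" output quoted in this message. Detecting ranges that lie entirely to the right of the sstable would also need the last key, which this sketch does not model.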

Fixes #5112.
Refs #4994.

The output of test_key_count_estimation:

Before:

count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1

After:

count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0

Tests:
  - unit (dev)

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
2019-09-28 19:36:43 +03:00
Piotr Sarna
10f90d0e25 types: remove deprecated comment
The comment does not apply anymore, as this definition is no longer
in database.hh.
Message-Id: <a0b6ff851e1e3bcb5fcd402fbf363be7af0219af.1569580556.git.sarna@scylladb.com>
2019-09-27 19:32:17 +02:00
Dejan Mircevski
9a89e0c5ec dbuild: Update README on interactive mode
`dbuild` was recently (24c732057) updated to run in interactive mode
when given no arguments; we can now update the README to mention that.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-09-27 16:33:27 +02:00
Dejan Mircevski
f8638d8ae1 alternator: Add build byproducts to .gitignore
Add .pytest_cache and expressions.tokens to the top-level .gitignore.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-09-27 16:18:45 +02:00
Dejan Mircevski
332ffa77ea alternator: Actually use BEGINS_WITH in its tests
For some reason, BEGINS_WITH tests used EQ as comparison operator.

Tests: pytest test_expected.py

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2019-09-26 22:41:34 +03:00
Tomasz Grabiec
5b0e48f25b Merge "toppartitions: don't transport schema_ptr across shards" from Avi
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copied
or destroyed, they can cause leaks or premature frees of the schema
on its original shard, since reference count operations are not atomic.

Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.

Fixes #5104 (direct bug)
Fixes #5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes #4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)

Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test
2019-09-26 17:09:54 +02:00
Avi Kivity
36b4d55b28 tests: add test for toppartitions cross-shard schema_ptr copy 2019-09-26 17:40:46 +03:00
Avi Kivity
670f398a8a toppartitions: do not copy schema_ptr:s in item keys across shards
Copying schema_ptrs across shards results in memory corruption since
lw_shared_ptr does not use atomic operations for reference counts.
Prevent that by converting schema_ptr:s to global_schema_ptr:s before
shipping them across shards in the map operation, and converting them
back to local schema_ptr:s in the reduce operation.
2019-09-26 17:26:40 +03:00
Avi Kivity
f015bd69b7 toppartitions: compare schemas using schema::id(), not pointer to schema
This allows keys from different stages in the schema's life to compare equal.
This is safe since the partition key cannot change, unlike the rest of the schema.

More importantly, it will allow us to compare keys made local after a pass through
global_schema_ptr, which does not guarantee that the schema_ptr conversion will be
the same even when starting with the same global_schema_ptr.
2019-09-26 17:15:46 +03:00
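
The comparison change in f015bd69b7 can be illustrated with a small sketch. toy_schema, item_key and the two comparison functions are hypothetical; only the idea of comparing by schema id rather than by schema pointer comes from the commit message.

```cpp
#include <cstdint>
#include <memory>

// Toy schema: the id stays the same across alterations, the version changes.
struct toy_schema {
    std::uint64_t id;
    std::uint64_t version;
};
using toy_schema_ptr = std::shared_ptr<const toy_schema>;

struct item_key {
    toy_schema_ptr schema;
    int partition_key;
};

// Pointer comparison: two schema objects for the same table never match.
bool equal_by_pointer(const item_key& a, const item_key& b) {
    return a.schema == b.schema && a.partition_key == b.partition_key;
}

// Id comparison: keys match as long as they belong to the same table.
bool equal_by_id(const item_key& a, const item_key& b) {
    return a.schema->id == b.schema->id && a.partition_key == b.partition_key;
}
```

Two keys that carry different schema objects for the same table compare unequal by pointer but equal by id, which is what keys made local again after crossing shards need.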
Avi Kivity
ea4976a128 schema_registry: mark global_schema_ptr move constructor noexcept
Throwing move constructors are a pain, so we should try to make
them noexcept. Currently, global_schema_ptr's move constructor
throws an exception if used illegally (moving from a different shard);
this patch changes it to an assert, on the grounds that this error
is impossible to recover from.

The direct motivation for the patch is the desire to store objects
containing a global_schema_ptr in a chunked_vector, to move lists
of partition keys across shards for the toppartitions functionality.
chunked_vector currently requires noexcept move constructors for its
value_type.
2019-09-26 16:56:59 +03:00
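
The trade-off in ea4976a128 (assert instead of throw, so the move constructor can be noexcept) can be sketched in standard C++. shard_bound_handle and the current_shard variable are hypothetical stand-ins, not global_schema_ptr or the Seastar shard id, and std::vector takes the place of chunked_vector.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for "the shard this code is running on".
inline unsigned current_shard = 0;

class shard_bound_handle {
    unsigned _owner_shard;
    int _payload;
public:
    explicit shard_bound_handle(int payload)
        : _owner_shard(current_shard), _payload(payload) {}

    // Moving on the wrong shard is a programming error that cannot be
    // recovered from, so it is asserted rather than thrown. That lets the
    // constructor be declared noexcept.
    shard_bound_handle(shard_bound_handle&& other) noexcept
        : _owner_shard(other._owner_shard)
        , _payload(std::exchange(other._payload, 0)) {
        assert(_owner_shard == current_shard);
    }

    int payload() const { return _payload; }
};

// A container can require this property of its value_type at compile time.
static_assert(std::is_nothrow_move_constructible_v<shard_bound_handle>,
              "value_type must have a noexcept move constructor");
```

A container that relocates its elements on growth can then move them without having to handle an exception halfway through.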
Avi Kivity
ba64ec78cf messaging_service: use rpc::tuple instead of variadic futures for rpc
Since variadic future<> is deprecated, switch to rpc::tuple for multiple
return values in rpc calls. This is more or less mechanical translation.
2019-09-26 12:09:31 +02:00
Tomasz Grabiec
9183e28f2c Merge "Recreate dependent user types" from Rafael
When a user type changes we were not recreating other user types that
use it. This patch series fixes that and makes it clear which code is
responsible for it.

In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.

At runtime a user type has a direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when an
entry in system.types changes.

Fixes #5049
2019-09-26 12:06:32 +02:00
Gleb Natapov
e0b303b432 lwt: make _last_timestamp_micros static
If each client_state has its own copy of the variable two clients may
generate timestamps that clash and needlessly create contention. Making
the variable shared between all client_state on the same shard will make
sure this will not happen to two clients on the same shard. It may still
happen for two clients on two different shards or two different nodes.
2019-09-26 11:44:00 +03:00
Gleb Natapov
622d21f740 lwt: Add client_state::get_timestamp_for_paxos() function
Paxos needs a unique timestamp that is greater than some other
timestamp, so that the next round has a better chance of succeeding.
Add a function that returns such a timestamp.
2019-09-26 11:44:00 +03:00
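
The two timestamp commits above (e0b303b432 and 622d21f740) can be illustrated together. This is a sketch under assumptions: toy_client_state is not the real client_state, the function signatures are invented, and the current time is passed in as a parameter to keep the example deterministic. The member name _last_timestamp_micros and the function name get_timestamp_for_paxos() are taken from the commit titles.

```cpp
#include <algorithm>
#include <cstdint>

class toy_client_state {
    // Static: one counter shared by every instance, so two clients in the
    // same process (on a real node: the same shard) never get equal values.
    static inline std::int64_t _last_timestamp_micros = 0;
public:
    // Returns a unique, monotonically increasing timestamp.
    std::int64_t get_timestamp(std::int64_t now_micros) {
        _last_timestamp_micros = std::max(now_micros, _last_timestamp_micros + 1);
        return _last_timestamp_micros;
    }

    // Same, but the result is also strictly greater than min_timestamp_micros.
    std::int64_t get_timestamp_for_paxos(std::int64_t now_micros,
                                         std::int64_t min_timestamp_micros) {
        _last_timestamp_micros = std::max({now_micros,
                                           _last_timestamp_micros + 1,
                                           min_timestamp_micros + 1});
        return _last_timestamp_micros;
    }
};
```

The counter is deliberately not atomic: the sketch assumes a single thread of execution, so clashes between separate threads, shards or nodes are still possible, as the e0b303b432 message notes.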
Gleb Natapov
e72a105b5e lwt: Pass client_state reference all the way to storage_proxy::query
client_state holds a state to generate monotonically increasing unique
timestamp. Queries with a SERIAL consistency level need it to generate
a paxos round.
2019-09-26 11:44:00 +03:00
Gleb Natapov
556f65e8a1 exceptions: Add a constructor for unavailable_exception that allows providing a custom message 2019-09-26 11:44:00 +03:00
Gleb Natapov
209414b4eb serializer: Add std::variant support 2019-09-26 11:44:00 +03:00
Gleb Natapov
f9209e27d4 lwt: Add missing functions to utils/UUID_gen.hh
Some lwt related code is missing in our UUID implementation. Add it.
2019-09-26 11:44:00 +03:00
Rafael Ávila de Espíndola
5af8b1e4a3 types: recreate dependent user types.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.

At runtime a user type has a direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when an
entry in system.types changes.

Fixes #5049

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola
4c3209c549 types: Don't include dependent user types in update.
The way schema changes propagate is by editing the system tables and
comparing the before and after state.

When a user type A uses another user type B and we modify B, the
representation of A in the system table doesn't change, so this code
was not producing any changes on the diff that the receiving side
uses.

Deleting it makes it clear that it is the receiver's responsibility to
handle dependent user types.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola
34eddafdb0 types: Don't modify the type list in db::cql_type_parser::raw_builder
With this patch db::cql_type_parser::raw_builder creates a local copy
of the list of existing types and uses that internally. By doing that
build() should have no observable behavior other than returning the
new types.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola
d6b2e3b23b types: pass a reference to prepare_internal
We were never passing a null pointer and never saving a copy of the
lw_shared_ptr. Passing a reference is more flexible as not all callers
are required to hold the user_types_metadata in a lw_shared_ptr.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-09-25 15:40:30 -07:00
Avi Kivity
03260dd910 Update seastar submodule
* seastar b56a8c5045...c21a7557f9 (3):
  > net: socket::{set,get}_reuseaddr() should not be virtual
  > iotune: print verbose message in case of shutdown errors
  > iotune: close test file on shutdown

Fixes #4946.
2019-09-25 16:08:32 +03:00
Tomasz Grabiec
06b9818e98 Merge "storage_proxy: tolerate view_update_write_response_handler id not found on shutdown" from Benny
1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand.
2. Look up the view_update_write_response_handler id before calling timeout_cb and tolerate it not being found.
   Just log a warning if this happens.

Fixes #5032
2019-09-25 14:49:42 +02:00
Avi Kivity
83bc59a89f Merge "mvcc: Fix incorrect schema version being used to copy the mutation when applying (#5099)" from Tomasz
"
Currently affects only counter tables.

Introduced in 27014a2.

mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.

We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.

This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.

Fixes #5095.
"

* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
  mvcc: Fix incorrect schema version being used to copy the mutation when applying
  mutation_partition: Track and validate schema version in debug builds
  tests: Use the correct schema to access mutation_partition
2019-09-25 15:30:22 +03:00
Tomasz Grabiec
11440ff792 mvcc: Fix incorrect schema version being used to copy the mutation when applying
Currently affects only counter tables.

Introduced in 27014a2.

mutation_partition(s, mp) is incorrect, because it uses s to interpret
mp, while it should use mp_schema.

We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during alter when we receive the
mutation from a node which hasn't processed the schema change yet.

This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.

Fixes #5095.
2019-09-25 11:28:07 +02:00
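
The failure mode described in 11440ff792, where data is interpreted with a schema other than the one it was written with, can be shown with a toy positional row format. This is not how mutation_partition stores cells: toy_schema, toy_row and interpret() are hypothetical and only illustrate why values can end up attributed to a different column.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using toy_schema = std::vector<std::string>;  // column names, in storage order
using toy_row = std::vector<std::string>;     // cell values, stored by position

// Attaches column names to positional values using the given schema.
// The result is only meaningful if `s` is the schema the row was written with.
std::map<std::string, std::string> interpret(const toy_schema& s, const toy_row& r) {
    std::map<std::string, std::string> cells;
    for (std::size_t i = 0; i < r.size() && i < s.size(); ++i) {
        cells[s[i]] = r[i];
    }
    return cells;
}
```

A row written under a schema with columns {a, c} and read with a newer schema {a, b, c} has the value of c reported as the value of b, which is the kind of cross-column corruption the message warns about.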