We saw a failure in the dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception is thrown
The problem is that, after receiving the DOWN event, the Python
Cassandra driver calls Cluster.on_down, which checks whether this client
has any connections to the node being shut down. If there are any
connections, the Cluster.on_down handler exits early, so the session
to the node being shut down is not removed.
If we shut down the CQL server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.
Refs #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
On at least one system, using the container's /tmp as provided by docker
results in spurious EINVALs during aio:
INFO 2018-12-27 09:54:08,997 [shard 0] gossip - Feature ROW_LEVEL_REPAIR is enabled
unknown location(0): fatal error: in "test_write_many_range_tombstones": storage_io_error: Storage I/O error: 22: Invalid argument
seastar/tests/test-utils.cc(40): last checkpoint
The setup is overlayfs over xfs.
To avoid this problem, pass through the host's /tmp to the container.
Using --tmpfs would be better, but it's not possible to guess a good size
as the amount of temporary space needed depends on build concurrency.
Message-Id: <20181227101345.11794-1-avi@scylladb.com>
Image fedora-29-20181219 was broken due to the following chain of events:
- we install gnutls, which currently is at version 3.6.5
- gnutls 3.6.5 introduced a dependency on nettle 3.4.1
- the gnutls rpm does not include a version requirement on nettle,
so an already-installed nettle will not be upgraded when gnutls is
installed
- the fedora:29 image which we use as a baseline has nettle installed
- docker does not pull the latest tag in FROM statements during
"docker build"
- my build machine already had a fedora:29 image, with nettle 3.4
installed (the repository's image has 3.4.1, but docker doesn't
automatically pull if an image with the required tag exists)
As a result, the image ended up having gnutls 3.6.5 and nettle 3.4, which
are incompatible.
To fix, update all packages after installation to attempt to have a
self-consistent package set even if dependencies are not perfect, and regenerate
the image.
Message-Id: <20181226135711.24074-1-avi@scylladb.com>
The '-t' flag to 'docker run' passes the tty from the caller environment
to the container, which is nice for interactive jobs, but fails if there
is no tty, such as in a continuous integration environment.
Given that, the '-i' flag doesn't make sense either as there isn't any
input to pass.
Remove both, and replace with --sig-proxy=true which allows SIGTERM to
terminate the container instead of leaving it alive. This reduces the
chances of the build stopping but leaving random containers around.
Message-Id: <20181222105837.22547-1-avi@scylladb.com>
The function pending_collection is only called when
cdef->is_multi_cell() is true, so the throw is dead.
This patch converts it to an assert.
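For illustration, a minimal sketch of the kind of change described, with invented surrounding types (not the actual Scylla code):
    #include <cassert>

    struct column_definition {
        bool multi_cell = true;
        bool is_multi_cell() const { return multi_cell; }
    };

    struct collection_accumulator {
        // Every caller checks cdef->is_multi_cell() first, so the old throw was
        // unreachable; an assert documents and enforces the invariant instead.
        void pending_collection(const column_definition* cdef) {
            assert(cdef->is_multi_cell());
            // ... accumulate the pending collection mutation ...
        }
    };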
Message-Id: <20181207022119.38387-1-espindola@scylladb.com>
"
=== How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges into sub-ranges which contain around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice:
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few megabytes of rows.
We read all the rows within the small range into memory, find the
mismatches and send the mismatched rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data:
We need to find the mismatched rows between nodes and only send the delta.
This is known as the set reconciliation problem, which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
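For illustration only, the same reconciliation expressed with std::set_difference on toy string sets (real repair works on row hashes, across nodes):
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <string>

    int main() {
        std::set<std::string> set1{"row1", "row2", "row3"};   // Node1
        std::set<std::string> set2{"row2", "row3"};           // Node2
        std::set<std::string> set3{"row1", "row2", "row4"};   // Node3

        // Rows Node1 is missing from each peer (setX - set1).
        std::set<std::string> from2, from3;
        std::set_difference(set2.begin(), set2.end(), set1.begin(), set1.end(),
                            std::inserter(from2, from2.end()));   // {}
        std::set_difference(set3.begin(), set3.end(), set1.begin(), set1.end(),
                            std::inserter(from3, from3.end()));   // {row4}

        // After fetching, Node1 holds the union; each peer then gets union - its own set.
        std::set<std::string> all = set1;
        all.insert(from2.begin(), from2.end());
        all.insert(from3.begin(), from3.end());

        std::set<std::string> to2, to3;
        std::set_difference(all.begin(), all.end(), set2.begin(), set2.end(),
                            std::inserter(to2, to2.end()));       // {row1, row4}
        std::set_difference(all.begin(), all.end(), set3.begin(), set3.end(),
                            std::inserter(to3, to3.end()));       // {row3}

        std::cout << "Node1 sends " << to2.size() << " rows to Node2 and "
                  << to3.size() << " rows to Node3\n";
    }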
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position;
};
Each node reads rows from disk into row buffers until the size is larger than N
bytes, and returns the repair_sync_boundary of the last mutation_fragment it
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
the data is synced; go to Step A. If not, request the full hashes from the peers.
At this point, the repair master knows exactly which rows are missing. Request the
missing rows from the peer nodes.
Now the local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since the local node also knows what the peer nodes own, it sends the missing rows to
the peer nodes.
=== What the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows into each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx = 162 GiB, rx = 90 GiB
new: tx = 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, the repair master needs to receive 2 million rows and
send 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
The repair master sends its own 1 million rows plus the 1 million rows
received from repair slave 1 to repair slave 2, and sends its own 1
million rows plus the 1 million rows received from repair slave 2 to
repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
We had an issue building the offline installer on RHEL because of repository
differences.
This fix enables building the offline installer on both CentOS and RHEL.
It also introduces --releasever <ver>, to build the offline installer for a
specific minor version of CentOS and RHEL.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181212032129.29515-1-syuu@scylladb.com>
Sometimes one wants to just compile all the source files in the
project, for example because one just moved code or files around and
there is no need to link and run anything, just to check that everything
still compiles.
Since linking takes up a considerable amount of time it is worthwhile to
have a specific target that caters for such needs.
This patch introduces a ${mode}-objects target for each mode (e.g.
release-objects) that only runs the compilation step for each source
file but does not link anything.
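For example, assuming the project's usual ninja-driven build, compiling everything in release mode without linking becomes:
    $ ninja release-objects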
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <eaad329bf22dfaa3deff43344f3e65916e2c8aaf.1545045775.git.bdenes@scylladb.com>
In one test the types in the schema don't match the types in the
statistics file. In another a column is missing.
The patch also updates the exceptions to have more human readable
messages.
Tests: unit (release)
Part of issue #3960.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181219233046.74229-1-espindola@scylladb.com>
Validate an ASCII string by ORing all bytes together and checking that bit 7
(the most significant bit) of the result is 0.
Compared with the original std::any_of(), which checks the string byte
by byte, this new approach validates the input 8 bytes at a time in two
independent streams. Performance is much higher for normal cases,
though slightly slower when the string is very short. See the table below.
Speed(MB/s) of ascii string validation
+---------------+-------------+---------+
| String length | std::any_of | u64 x 2 |
+---------------+-------------+---------+
| 9 bytes | 1691 | 1635 |
+---------------+-------------+---------+
| 31 bytes | 2923 | 3181 |
+---------------+-------------+---------+
| 129 bytes | 3377 | 15110 |
+---------------+-------------+---------+
| 1039 bytes | 3357 | 31815 |
+---------------+-------------+---------+
| 16385 bytes | 3448 | 47983 |
+---------------+-------------+---------+
| 1048576 bytes | 3394 | 31391 |
+---------------+-------------+---------+
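A minimal sketch of the technique, assuming 16 bytes per iteration split across two 64-bit accumulators (the submitted implementation may differ in details):
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    bool is_ascii(const char* data, size_t len) {
        uint64_t acc1 = 0, acc2 = 0;
        size_t i = 0;
        // Two independent accumulators let the CPU overlap the 8-byte loads.
        for (; i + 16 <= len; i += 16) {
            uint64_t a, b;
            std::memcpy(&a, data + i, 8);
            std::memcpy(&b, data + i + 8, 8);
            acc1 |= a;
            acc2 |= b;
        }
        uint8_t tail = 0;
        for (; i < len; ++i) {
            tail |= static_cast<uint8_t>(data[i]);
        }
        // The string is ASCII iff no byte has its most significant bit (bit 7) set.
        return (((acc1 | acc2) & 0x8080808080808080ull) | (tail & 0x80)) == 0;
    }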
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544669646-31881-1-git-send-email-yibo.cai@arm.com>
"
Some files use db/config.hh just to access extensions. Reduce dependencies
on this global and volatile file by providing another path to access extensions.
Tests: unit(release)
"
* tag 'unconfig-2/v1' of https://github.com/avikivity/scylla:
hints: reduce dependencies on db/config.hh
commitlog: reduce dependencies on db/config.hh
cql3: reduce dependencies on db/config.hh
database: provide accessor to db::extensions
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
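As a hypothetical sketch of what such an accessor can look like (the member and its wiring are assumed here, not taken from the patch):
    namespace db { class extensions; }

    class database {
        const db::extensions& _extensions;  // assumed to be supplied at construction
    public:
        explicit database(const db::extensions& ext) : _extensions(ext) {}
        // Callers that only need extensions no longer have to go through get_config().
        const db::extensions& extensions() const { return _extensions; }
    };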
When the next pending fragments are after the start of the new range,
we know there is no need to skip.
Caught by perf_fast_forward --datasets large-part-ds3 \
--run-tests=large-partition-slicing
Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>
Currently, if something throws in the mutation sending loop while streaming,
the sink is not closed. Also, while close() is running, the code does not hold
onto the sink object. close() is asynchronous, so the sink should be kept alive
until it completes. The patch uses do_with() to hold onto the sink while close()
is running, and runs close() on the error path too.
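A simplified, self-contained sketch of the pattern using Seastar primitives (names, types and helpers here are stand-ins, not the actual streaming code):
    #include <utility>
    #include <seastar/core/do_with.hh>
    #include <seastar/core/future-util.hh>

    // do_with() keeps the sink (and the data) alive until the returned future
    // resolves, and close() now runs on the error path as well as on success.
    template <typename Sink, typename Mutations>
    seastar::future<> send_mutations(Sink sink, Mutations muts) {
        return seastar::do_with(std::move(sink), std::move(muts),
                [] (Sink& sink, Mutations& muts) {
            return seastar::do_for_each(muts, [&sink] (auto& m) {
                return sink(m);          // sending may fail and propagate an exception
            }).finally([&sink] {
                return sink.close();     // runs on both the success and the error path
            });
        });
    }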
Fixes #4004.
Message-Id: <20181220155931.GL3075@scylladb.com>
Mutation readers allow fast-forwarding the ranges from which the data is
being read. The main user of this feature is cache which, when reading
from the underlying reader, may want to skip some data it already has.
Unsurprisingly, this adds more complexity to the implementation of the
readers and more edge cases the developers need to take care of.
While most of the readers were at least to some extent tested in this
area, those tests usually were quite isolated (e.g. one test doing
inter-partition fast-forwarding, another doing intra-partition
fast-forwarding) and as a consequence didn't cover many corner cases.
This patch adds a generic test for fast-forwarding and slicing that
covers more complicated scenarios when those operations are combined.
Needless to say, this did uncover some problems, but fortunately none of
them is user-visible.
Fixes #3963.
Fixes #3997.
Tests: unit(release, debug)
* https://github.com/pdziepak/scylla.git test-fast-forwarding/v4.1:
tests/flat_mutation_reader_assertions: accumulate received tombstones
tests/flat_mutation_reader_assertions: add more test messages
tests/flat_mutation_reader_assertions: relax has_monotonic_positions() check
tests/mutation_readers: do not ignore streamed_mutation::forwarding
Revert "mutation_source_test: add option to skip intra-partition fast-forwarding tests"
memtable: it is not a single partition read if partition fast-forwaring is enabled
sstables: add more tracing in mp_row_consumer_m
row_cache: use make_forwardable() to implement streamed_mutation::forwarding
row_cache: read is not single-partition if inter-partition forwarding is enabled
row_cache: drop support for streamed_mutation::forwarding::yes entirely
sstables/mp_row_consumer: position_range end bound is exclusive
mutation_fragment_filter: handle streamed_mutation::forwarding::yes properly
tests/mutation_reader: reduce sleeping time
tests/memtable: fix partition_range use-after-free
tests/mutation: fix partition range use-after-free
flat_mutation_reader_from_mutations: add overload that accepts a slice and partition range
flat_mutation_reader_from_mutations: fix empty range case
flat_mutation_reader_from_mutations: destroy all remaining mutations
tests/mutation_source: drop dropped column handling test
tests/mutation_source: add test for complex fast_forwarding and slicing
While we already had tests that verified inter- and intra-partition
fast-forwarding as well as slicing, they had quite limited scope and
didn't combine those operations. The new test is meant to extensively
test these cases.
Schema changes are now covered by the for_each_schema_change() function.
Having some additional tests in run_mutation_source_tests() is
problematic when it is used to test intermediate mutation readers,
because schema changes may be irrelevant for them, which makes the test
a waste of time (might be a problem in debug mode) and requires those
intermediate readers to use a more complex underlying reader that supports
schema changes (again, a problem in a very slow debug mode).
If the reader is fast-forwarded to another partition range, mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.
It is very bad taste to sleep anywhere in the code. The test should be
fixed to explicitly test various orderings between concurrent
operations, but before that happens let's at least reduce how much
those sleeps slow it down by changing them from milliseconds to
microseconds.
Implementing intra-partition fast-forwarding adds more complexity to
already very-much-not-trivial cache readers and isn't really critical in
any way since it is not used outside of the tests. Let's use the generic
adapter instead of natively implementing it.
A single-partition reader is less expensive than one that accepts any
range of partitions, but it doesn't support fast-forwarding to another
partition range properly and therefore cannot be used if that option is
enabled.
This reverts commit b36733971b. That commit made
run_mutation_reader_tests() support mutation_sources that do not implement
streamed_mutation::forwarding::yes. This is wrong since mutation_sources
are not allowed to ignore or otherwise not support that mode. Moreover,
there is absolutely no reason for them to do so since there is a
make_forwardable() adapter that can make any mutation_reader a
forwardable one (at the cost of performance, but that's not always
important).
It is wrong to silently ignore streamed_mutation::forwarding option
which completely changes how the reader is supposed to operate. The best
solution is to use the make_forwardable() adapter which changes a
non-forwardable reader into a forwardable one.
Since 41ede08a1d "mutation_reader: Allow
range tombstones with same position in the fragment stream" mutation
readers emit fragments in non-decreasing order (as opposed to strictly
increasing), so has_monotonic_positions() needs to be updated to allow
that.
The current data model employed by mutation readers doesn't have a unique
representation of range tombstones. This complicates testing by making
multiple ways of emitting range tombstones and rows equally valid.
This patch adds an option to verify mutation readers by checking whether
tombstones they emit properly affect the clustered rows regardless of how
exactly the tombstones are emitted. The interface of
flat_mutation_reader_assertions is extended by adding
may_produce_tombstones() that accepts any number of tombstones and
accumulates them. Then, produces_row_with_key() accepts an additional
argument which is the expected timestamp of the range tombstone that
affects that row.
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
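To make the mechanism concrete, here is a purely illustrative sketch of one possible delay function d(x); the actual function and its constants are defined in the series itself, not here:
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>

    // Map the view-update backlog fraction at a base replica to an added write delay.
    std::chrono::microseconds view_write_delay(size_t backlog, size_t max_backlog) {
        double x = std::min(1.0, double(backlog) / double(max_backlog));
        // Small backlogs add almost no delay; a nearly full backlog approaches 1s.
        return std::chrono::microseconds(static_cast<int64_t>(x * x * x * 1'000'000));
    }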
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMe disks. A loader running on an
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes #2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...