Commit Graph

16751 Commits

Author SHA1 Message Date
Botond Dénes
c899191ad5 reader_concurrency_semaphore: use the correct types in the constructor
Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers don't make sense, this is why they
used unsigned types in the constructor. This restriction can backfire
however when someone intends to give these parameters the maximum
possible value, which, when interpreted as a signed value will be `-1`.
What's worse the caller might not even be aware of this unsigned->signed
conversion and be very suprised when they find out.
So to prevent surprises, expose the real type of these members, trusting
the clients of knowing what they are doing.

Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.

(cherry picked from commit e1d8237e6b)
2018-12-18 14:34:33 +02:00
Botond Dénes
a3563e5f7d reader_concurrency_semaphore: add consume_resources()
(cherry picked from commit dfd649a6b4)
2018-12-18 14:34:33 +02:00
Botond Dénes
78c5b09694 reader_concurrency_semaphore::inactive_read_handle: add operator bool()
(cherry picked from commit 21b44adbfe)
2018-12-18 14:34:33 +02:00
Avi Kivity
46efc08882 Update seastar submodule
* seastar 6700dc3...08f1258 (1):
  > reactor: disable nowait aio due to a kernel bug

Fixes #3996.
2018-12-17 17:00:14 +02:00
Avi Kivity
c95433c967 Update seastar submodule
* seastar 1651a2a...6700dc3 (3):
  > build: link against libatomic
  > core/semaphore: Allow combining semaphore_units()
  > core/shared_ptr: Allow releasing a lw_shared_ptr to a non-const object

Fixes #3996.
2018-12-17 15:53:01 +02:00
Piotr Sarna
df3b6fb4a8 cql3: refuse to create index on COMPACT STORAGE with ck
To follow C* compatibility, creating an index on COMPACT STORAGE
table should be disallowed not only on base primary keys,
but also when the base table contains clustering keys.
Message-Id: <ab40c39730aff2e164d11ee5159ff62b8ec9e8e8.1544698186.git.sarna@scylladb.com>

(cherry picked from commit 6743af5dbd)
2018-12-17 09:45:43 +02:00
Piotr Sarna
44ee43bb17 cql3: add refusing to create an index on static column
Secondary indexes on static columns are not yet supported,
so creating such index should return an appropriate error.

Fixes #3993
Message-Id: <700b0a71e80da52d2d5250edacc12626b55681fa.1544785127.git.sarna@scylladb.com>

(cherry picked from commit 63bd43e57e)
2018-12-17 09:44:52 +02:00
Asias He
aac363ca86 storage_service: Notify NEW_NODE only when a node is new node
This is a backport of CASSANDRA-11038.

Before this, a restarted node will be reported as new node with NEW_NODE
cql notification.

To fix, only send NEW_NODE notification when the node was not part of
the cluster

Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
(cherry picked from commit 71c1681f6c)
2018-12-16 13:59:19 +02:00
Duarte Nunes
9e6cc5b024 service/storage_proxy: Embed the expire timer in the response handler
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.

It will also make the timeout easily accessible inside the handler,
for future patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
(cherry picked from commit f8878238ed)
2018-12-13 13:24:09 +00:00
Duarte Nunes
13b72c7b92 Merge branch 'gossip: Send node UP event to cql client after cql server is up' from Asias
"
This is a backport of CASSANDRA-8236.

Before this patch, scylla sends the node UP event to cql client when it
sees a new node joins the cluster, i.e., when a new node's status
becomes NORMAL. The problem is, at this time, the cql server might not
be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.

To fix, a new application_sate::RPC_READY is introduced, new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.

The RPC_READY is a bad name but I think it is better to follow Cassandra.

Nodes with or without this patch are supposed to work together with no
problem.

Refs #3843
"

* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
  storage_service: Use cql_ready facility
  storage_service: Handle application_state::RPC_READY
  storage_service: Add notify_cql_change
  storage_service: Add debug log in notify_joined
  storage_service: Add extra check in notify_joined
  storage_service: Add notify_joined
  storage_service: Add debug log in notify_up
  storage_service: Add extra check in notify_up
  storage_service: Add notify_up
  storage_service: Make notify_left log debug level
  storage_service: Introduce notify_left
  storage_service: Add debug log in notify_down
  storage_service: Introduce notify_down
  storage_service: Add set_cql_ready
  gossip: Add gossiper::is_cql_ready
  gms: Add endpoint_state::is_cql_ready
  gms: Add application_state::RPC_READY
  gms: Introduce cql_ready in versioned_value

(cherry picked from commit a42b2895c2)
2018-12-13 12:06:59 +00:00
Avi Kivity
6b011fbe0a build: pass C compiler configuration in dist package build
Just like we allow customizing the C++ compiler, we should allow customizing
the C compiler.

Ref #3978
Message-Id: <20181211172821.30830-1-avi@scylladb.com>

(cherry picked from commit fa96e07e6b)
scylla-3.0.rc2
2018-12-12 14:41:38 +02:00
Tomasz Grabiec
9dd4e1b01f sstables: index_reader: Avoid schema copy in advance_to()
Introduced in 7e15e43.

Exposed by perf_fast_forward:

  running: large-partition-skips on dataset large-part-ds1
  Testing scanning large partition with skips.
  Reads whole range interleaving reads with skips according to read-skip pattern:
  read    skip      time (s)     frags     frag/s (...)
  1       0         5.268780   8000000    1518378

  1       1        31.695985   4000000     126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0a853b8866)
2018-12-12 14:38:49 +02:00
Nadav Har'El
e91c741ef5 secondary indexes: fail attempts to create a CUSTOM INDEX
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag, we should generate an error.

This patch also includes a regression test that "CREATE CUSTOM INDEX"
statements with valid syntax fail (before this patch, they succeeded).

Fixes #3977

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
(cherry picked from commit a0379209e6)
2018-12-12 00:32:35 +00:00
Nadav Har'El
b18e9e115d Fix typo in error message
Interestingly, this typo was copied from the original Cassandra source
code :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-1-nyh@scylladb.com>
(cherry picked from commit 36db4fba23)
2018-12-12 00:32:35 +00:00
Avi Kivity
0b86ab0d2a build: build libdeflate with user selected C compiler
If the user specified a C compiler, use it to build libdeflate.

Fixes #3978.
Message-Id: <20181211145604.14847-1-avi@scylladb.com>

(cherry picked from commit 34a31a807d)
2018-12-11 19:24:24 +02:00
Duarte Nunes
97cd9108d6 db/system_distributed_keyspace: Create the schema with min_timestamp
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.

However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we save some schema
mismatches.

This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.

Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.

Fixes #3976

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
(cherry picked from commit 89ae3fbf11)
2018-12-11 14:53:30 +00:00
Hagit Segev
f81fe96b0b release: prepare for 3.0-rc2 2018-12-11 12:32:34 +02:00
Avi Kivity
91ce3a7957 sstables: fix overflow in clustering key blocks header bit access
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.

Fix by using a 64-bit mask.

Found by Fedora 29's ubsan.

Fixes #3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>

(cherry picked from commit 7c7da0b462)
2018-12-10 14:10:27 +02:00
Takuya ASADA
af7e58f4c5 dist/offline_installer/redhat: fix missing dependencies
Offline installer with Scylla 3.0 causes dependency error on CentOS, added
missing packages.

Fixes #3969

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181207020711.23055-1-syuu@scylladb.com>
(cherry picked from commit a2d0ebf4d9)
2018-12-10 14:10:15 +02:00
Amos Kong
bd3373b511 scylla_setup: only ask for nic in interactive mode
Current scylla_setup still asks for nic even nic is already assigned in cmdline.

Fixes #3908

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <6b867e17a5583c495c771a37d5fa1e8366b1d61b.1542337635.git.amos@scylladb.com>
(cherry picked from commit 09a3b11c2f)
2018-12-09 19:26:34 +02:00
Gleb Natapov
4820130abe storage_proxy: fix crash during write timeout callback invocation
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.

Fixes #3972.

Message-Id: <20181206164043.GN25283@scylladb.com>
(cherry picked from commit 9fb79bf379)
2018-12-09 15:25:52 +02:00
Tomasz Grabiec
9b299241e5 Merge "Fixes for collecting stats in SST3 + more tests" from Vladimir
This patchset fixes several remaining issues found during thorough
testing of SSTables 3.x statistics and enriches ~30 unit tests with
statistics validation against Cassandra-generated golden copies.

* https://github.com/argenet/scylla/tree/projects/sstables-30/sst3-tests-statistics/v1:
  sstables: Enforce estimated_partitions in generate_summary() to be
    always positive.
  sstables: Don't enforce default max_local_deletion_time value for 'mc'
    files.
  sstables: Update TTL/local deletion stats for non-expiring and live
    liveness_info.
  sstables: Collect statistics when writing RT markers to SSTables 3.x.
  tests: Return sstable_assertions from validate_read() helper.
  tests: Introduce helper for validating stats metadata in SSTables 3.x
    tests.
  tests: Add stats metadata validation to test_write_static_row.
  tests: Add stats metadata validation to
    test_write_composite_partition_key.
  tests: Add stats metadata validation to
    test_write_composite_clustering_key.
  tests: Add stats metadata validation to test_write_wide_partitions.
  tests: Add stats metadata validation to write_ttled_row
  tests: Add stats metadata validation to write_ttled_column
  tests: Add stats metadata validation to write_deleted_column
  tests: Add stats metadata validation to write_deleted_row
  tests: Add stats metadata validation to write_collection_wide_update
  tests: Add stats metadata validation to
    write_collection_incremental_update
  tests: Add stats metadata validation to write_multiple_partitions
  tests: Add stats metadata validation to write_multiple_rows
  tests: Add stats metadata validation to
    write_missing_columns_large_set
  tests: Add stats metadata validation to write_different_types
  tests: Add stats metadata validation to write_empty_clustering_values
  tests: Add stats metadata validation to write_large_clustering_key
  tests: Add stats metadata validation to write_compact_table
  tests: Add stats metadata validation to write_user_defined_type_table
  tests: Add stats metadata validation to write_simple_range_tombstone
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_non_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_mixed_rows_and_range_tombstones
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones_with_rows
  tests: Add stats metadata validation to
    write_range_tombstone_same_start_with_row
  tests: Add stats metadata validation to
    write_range_tombstone_same_end_with_row
  tests: Add stats metadata validation to
    write_two_non_adjacent_range_tombstones
  tests: Delete unused (bogus) Statistics.db file from write_ SST3
    tests.

(cherry picked from commit bb24d378b2)
2018-12-08 14:08:46 +02:00
Avi Kivity
745a98e151 Merge "Fix deadlocking multishard readers" from Botond
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining readers, that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all they way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
  immediately after finishing the read-ahead.

This series fixes both of these.

Fixes: #3865
Tests: unit(release, debug)
"

* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: pause readers after reading ahead
  multishard_combining_reader: pause *all* EOS'd readers

(cherry picked from commit 21b4b2b9a1)
2018-12-08 14:08:46 +02:00
Avi Kivity
b9c99af18b Merge "Fix tombstone histogram when writing SSTables 3.x" from Vladimir
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.

Tests: unit {release}
"

* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
  tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
  tests: Run sstable_tombstone_histogram_test for all SSTables versions.
  tests: Run min_max_clustering_key_test on all SSTables versions.
  tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
  tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
  tests: Extend test checking tombstones histogram to cover all SSTables versions.
  sstables: Properly track row-level tombstones when writing SSTables 3.x.
  tests: Run min_max_clustering_key_test_2 for all SSTables versions.
  tests: Make reusable_sst() helper accept SSTables version parameter.

(cherry picked from commit f073ea5f87)
2018-12-08 14:08:44 +02:00
Asias He
cded9c7ac7 gossip: Fix race in real_mark_alive and shutdown msg
In dtest, we have

   self.check_rows_on_node(node1, 2000)
   self.check_rows_on_node(node2, 2000)

which introduce the following cluster operations:

1) Initially:

- node1 up
- node2 up

2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)

3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)

Since there is no guarantee the order of Operation A and Operation B, it
is possible node2 will mark node1 as status=shutdown and mark node1 is
UP.

In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"

As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.

   TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']

In the log we can see node1 marked as DOWN and UP almost at the same time on node2:

   INFO  2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
   INFO  2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown

Fixes #3940

Tests: dtest with 20 consecutive succesful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
(cherry picked from commit eeeb2da7bb)
2018-12-08 13:42:43 +02:00
Gleb Natapov
4acfc5ed8f hints: make hints manager more resilient to unexpected directory content
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>

(cherry picked from commit b4a8802edc)
2018-12-08 13:42:43 +02:00
Gleb Natapov
cb9199bc7f hints: add auxiliary function for scanning high level hints directory
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>

(cherry picked from commit 9433d02624)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
695ff5383f Merge "Correct the usage of row ttl and add write-read test" from Piotr
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.

* seastar-dev.git haaawk/sst3/write-read-test/v3:
  Fix use_row_ttl condition
  Add test_all_data_is_read_back

(cherry picked from commit b8c405c019)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
730e48bf60 configure.py: Always add a rule for building gen_crc_combine_table
Fixes a build failure when only the scylla binary was selected for
building like this:

  ./configure.py --with scylla

In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o

Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit edbef7400b)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
af6d4f40e1 utils/gz: Fix compilation on non-x86 archs
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.

Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 9a4c00beb7)
2018-12-08 13:42:43 +02:00
Avi Kivity
9d8507de09 Merge "Optimize checksum_combine() for CRC32" from Tomek
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.

This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.

Performance results using perf_checksum and second buffer of length 64 KiB:

zlib CRC32 combine:   38'851   ns
libdeflate CRC32:      4'797   ns
fast_crc32_combine():     11   ns

So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.

Tested on i7-5960X CPU @ 3.00GHz

Performance was also evaluated using sstable writer benchmark:

  perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
     --value-size=10000 --rows 1000000 --datasets small-part

It yielded 9% improvement in median frag/s (129'055 vs 117'977).

Refs #3874
"

* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
  tests: perf_checksum: Test fast_crc32_combine()
  tests: Rename libdeflate_test to checksum_utils_test
  tests: libdeflate: Add more tests for checksum_combine()
  tests: libdeflate: Check both libdeflate and default checksummers
  sstables: Use fast_crc_combine() in the default checksummer
  utils/gz: Add fast implementation of crc32_combine()
  utils/gz: Add pre-computed polynomials
  utils/gz: Import Barett reduction implementation from libdeflate
  utils: Extract clmul() from crc.hh

(cherry picked from commit b098b5b987)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
07c980845d utils/crc: Add clmul_u32() implementation
Needed for backporting dependent changes.

Extracted from:

      commit 79136e895f
      Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
      Date:   Thu Nov 1 03:26:16 2018 +0000

        utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
c52b8239d0 configure.py: Compile against Westmere on x86
Needed for backporting dependent changes.

Extracted from:

  commit 79136e895f
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Thu Nov 1 03:26:16 2018 +0000

    utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
5a07a4fac8 configure.py: Use armv8-a+crc+crypto ISA on aarch64
Needed for backporting dependent changes.

Extracted from:

  commit 1c48e3fbec
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Mon Oct 29 02:58:19 2018 +0000

    utils/crc: leverage arm64 crc extension
2018-12-08 13:42:43 +02:00
Avi Kivity
b9c046b17b Merge "Optimize checksum computation for the MC sstable format" from Tomek
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.

perf_checksum test was introduced to measure performance of various
checksumming operations.

Results for 514 B (relevant for writing with compression enabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            58414    16.711us     3.483ns    16.708us    16.725us
    crc_test.perf_adler_combine                165788278     6.059ns     0.031ns     6.027ns     7.519ns
    crc_test.perf_zlib_crc32_combine               59546    16.767us    26.191ns    16.741us    16.801us
    ---
    crc_test.perf_deflate_crc32_checksum        12705072    83.267ns     4.580ns    78.687ns    98.964ns
    crc_test.perf_adler_checksum                 3918014   206.701ns    23.469ns   183.231ns   258.859ns
    crc_test.perf_zlib_crc32_checksum            2329682   428.787ns     0.085ns   428.702ns   510.085ns

Results for 64 KB (relevant for writing with compression disabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            25364    38.393us    17.683ns    38.375us    38.545us
    crc_test.perf_adler_combine                169797143     5.842ns     0.009ns     5.833ns     6.901ns
    crc_test.perf_zlib_crc32_combine               26067    38.663us    95.094ns    38.546us    40.523us
    ---
    crc_test.perf_deflate_crc32_checksum          202821     4.937us    14.426ns     4.912us     5.093us
    crc_test.perf_adler_checksum                   44684    22.733us   206.263ns    22.492us    25.258us
    crc_test.perf_zlib_crc32_checksum              18839    53.049us    36.117ns    53.013us    53.274us

The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.

Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().

SStable write performance was evaluated by running:

  perf_fast_forward --populate --data-directory /tmp/perf-mc \
     --rows=10000000 -c1 -m4G --datasets small-part

Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.

Before:

 [1] MC,lz4: 330'903
 [2] LA,lz4: 450'157
 [3] MC,checksum: 419'716
 [4] LA,checksum: 459'559

After:

 [1'] MC,lz4: 446'917 ([1] + 35%)
 [2'] LA,lz4: 456'046 ([2] + 1.3%)
 [3'] MC,checksum: 462'894 ([3] + 10%)
 [4'] LA,checksum: 467'508 ([4] + 1.7%)

After this series, the performance of the MC format writer is similar to that
of the LA format before the series.

There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"

* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
  tests: perf: Introduce perf_checksum
  tests: Add test for libdeflate CRC32 implementation
  sstables: compress: Use libdeflate for crc32
  sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
  licenses: Add libdeflate license
  Integrate libdeflate with the build system
  Add libdeflate submodule
  sstables: Avoid checksum_combine() for the crc32 checksummer
  sstables: compress: Avoid unnecessary checksum_combine()
  sstables: checksum_utils: Add missing include

(cherry picked from commit 5e759b0c07)
2018-12-08 13:42:43 +02:00
Avi Kivity
979cb636b8 Update seastar submodule
* seastar e64281d...1651a2a (1):
  > tests: perf: Make do_not_optimize() take the argument by const&
2018-12-08 13:42:43 +02:00
Botond Dénes
59cf9d9070 querier: fix evict_one() and evict_all_for_table()
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`read_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries will trigger segfault when it tries to evict the associated
inactive reads.

Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).

Fixes: #3962

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
(cherry picked from commit 77dbc7d09a)
2018-12-06 11:38:44 +02:00
Duarte Nunes
c9ec9d4087 Merge seastar upstream
* seastar 880826e...e64281d (2):
  > core/semaphore: Change the access of semaphore_units main ctor
  > Merge "Add semaphore_units<>::split() function" from Duarte

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-05 20:25:17 +00:00
Gleb Natapov
2e8fefbc5a storage_proxy: store hint for CL=ANY if all nodes replied with failure
Current code assumes that request failed if all replicas replied with
failure, but this is not true for CL=ANY requests. Take it into account.

Fixed: #3565
(cherry picked from commit 17197fb005)
2018-12-05 20:14:58 +00:00
Gleb Natapov
6be0635029 storage_proxy: complete write request early if all replicas replied with success of failure
Currently if write request reaches CL and all replicas replied, but some
replied with failures, the request will wait for timeout to be retired.
Detect this case and retire request immediately instead.

Fixes #3566

(cherry picked from commit d1d04eae3c)
2018-12-05 20:14:57 +00:00
Gleb Natapov
04a544c0a2 storage_proxy: check that write failure response comes from recognized replica
Before accounting failure response we need to make sure it comes from a
replica that participates in the request.

(cherry picked from commit 76ab3d716b)
2018-12-05 20:14:57 +00:00
Gleb Natapov
028f9b95d1 storage_proxy: move code executed on write timeout into separate function
Currently the callback is in lambda, but we will want to call the code
not only during timer expiration.

(cherry picked from commit 7bc68aa0eb)
2018-12-05 20:14:57 +00:00
Avi Kivity
54258ca8eb Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.

Tests: unit(release)
"

* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
  db/hints/manager: Use frozen_mutation instead of mutation
  db/hints/manager: Use database::find_schema()
  db/commitlog/commitlog_entry: Allow moving the contained mutation
  service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
  service/storage_proxy: Build a shared_mutation from a frozen_mutation
  service/storage_proxy: Lift frozen_mutation_and_schema
  service/storage_proxy: Allow non-const ranges in mutate_prepare()

(cherry picked from commit 1891779e64)
2018-12-05 20:14:57 +00:00
Gleb Natapov
c9a030f1f0 storage_proxy: count number of timed out write attempts after CL is reached
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.

Message-Id: <20181009120900.GF22665@scylladb.com>
(cherry picked from commit 207b57a892)
2018-12-05 20:14:57 +00:00
Gleb Natapov
1c7daef554 storage_proxy: do not pass write_stats down to send_to_live_endpoints
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.

Message-Id: <20181009133017.GA14449@scylladb.com>
(cherry picked from commit 319ece8180)
2018-12-05 20:14:57 +00:00
Duarte Nunes
f8195a77b0 db/view/view_builder: Don't timeout waiting for view to be built
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.

This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.

Fixes #3920

Tests: unit release(view_build_test, view_schema_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
(cherry picked from commit 6fbf792777)
2018-12-05 19:20:36 +00:00
Duarte Nunes
5b724c80ab db/view: Don't copy keyspace name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181022104527.14555-1-duarte@scylladb.com>
(cherry picked from commit f3a5ec0fd9)
2018-12-05 19:19:26 +00:00
Nadav Har'El
4a7ae81b3f materialized views: update stats.write statistics in all cases
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.

In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.

Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 1d5f8d0015)
2018-12-05 19:19:26 +00:00
Piotr Sarna
3cf26a60a2 auth: add abort_source to waiting for schema agreement
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.

Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>

(cherry picked from commit 7b0a3fbf8a)
2018-12-04 14:33:05 +00:00
Tomasz Grabiec
2103d0d52b sstables: Write Statistics.db offset map entries in the same order as Cassandra
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.

Fixes #3955.

Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit aa19f98d18)
2018-12-04 14:30:19 +02:00