Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers don't make sense, this is why they
used unsigned types in the constructor. This restriction can backfire
however when someone intends to give these parameters the maximum
possible value, which, when interpreted as a signed value will be `-1`.
What's worse the caller might not even be aware of this unsigned->signed
conversion and be very suprised when they find out.
So to prevent surprises, expose the real type of these members, trusting
the clients of knowing what they are doing.
Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.
(cherry picked from commit e1d8237e6b)
* seastar 1651a2a...6700dc3 (3):
> build: link against libatomic
> core/semaphore: Allow combining semaphore_units()
> core/shared_ptr: Allow releasing a lw_shared_ptr to a non-const object
Fixes#3996.
This is a backport of CASSANDRA-11038.
Before this, a restarted node will be reported as new node with NEW_NODE
cql notification.
To fix, only send NEW_NODE notification when the node was not part of
the cluster
Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
(cherry picked from commit 71c1681f6c)
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.
It will also make the timeout easily accessible inside the handler,
for future patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
(cherry picked from commit f8878238ed)
"
This is a backport of CASSANDRA-8236.
Before this patch, scylla sends the node UP event to cql client when it
sees a new node joins the cluster, i.e., when a new node's status
becomes NORMAL. The problem is, at this time, the cql server might not
be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.
To fix, a new application_sate::RPC_READY is introduced, new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.
The RPC_READY is a bad name but I think it is better to follow Cassandra.
Nodes with or without this patch are supposed to work together with no
problem.
Refs #3843
"
* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
storage_service: Use cql_ready facility
storage_service: Handle application_state::RPC_READY
storage_service: Add notify_cql_change
storage_service: Add debug log in notify_joined
storage_service: Add extra check in notify_joined
storage_service: Add notify_joined
storage_service: Add debug log in notify_up
storage_service: Add extra check in notify_up
storage_service: Add notify_up
storage_service: Make notify_left log debug level
storage_service: Introduce notify_left
storage_service: Add debug log in notify_down
storage_service: Introduce notify_down
storage_service: Add set_cql_ready
gossip: Add gossiper::is_cql_ready
gms: Add endpoint_state::is_cql_ready
gms: Add application_state::RPC_READY
gms: Introduce cql_ready in versioned_value
(cherry picked from commit a42b2895c2)
Introduced in 7e15e43.
Exposed by perf_fast_forward:
running: large-partition-skips on dataset large-part-ds1
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read skip time (s) frags frag/s (...)
1 0 5.268780 8000000 1518378
1 1 31.695985 4000000 126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 0a853b8866)
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag, we should generate an error.
This patch also includes a regression test that "CREATE CUSTOM INDEX"
statements with valid syntax fail (before this patch, they succeeded).
Fixes#3977
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
(cherry picked from commit a0379209e6)
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.
However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we save some schema
mismatches.
This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.
Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.
Fixes#3976
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
(cherry picked from commit 89ae3fbf11)
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.
Fix by using a 64-bit mask.
Found by Fedora 29's ubsan.
Fixes#3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>
(cherry picked from commit 7c7da0b462)
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.
Fixes#3972.
Message-Id: <20181206164043.GN25283@scylladb.com>
(cherry picked from commit 9fb79bf379)
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining readers, that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all they way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
immediately after finishing the read-ahead.
This series fixes both of these.
Fixes: #3865
Tests: unit(release, debug)
"
* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
multishard_combining_reader: pause readers after reading ahead
multishard_combining_reader: pause *all* EOS'd readers
(cherry picked from commit 21b4b2b9a1)
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.
Tests: unit {release}
"
* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
tests: Run sstable_tombstone_histogram_test for all SSTables versions.
tests: Run min_max_clustering_key_test on all SSTables versions.
tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
tests: Extend test checking tombstones histogram to cover all SSTables versions.
sstables: Properly track row-level tombstones when writing SSTables 3.x.
tests: Run min_max_clustering_key_test_2 for all SSTables versions.
tests: Make reusable_sst() helper accept SSTables version parameter.
(cherry picked from commit f073ea5f87)
In dtest, we have
self.check_rows_on_node(node1, 2000)
self.check_rows_on_node(node2, 2000)
which introduce the following cluster operations:
1) Initially:
- node1 up
- node2 up
2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)
3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)
Since there is no guarantee the order of Operation A and Operation B, it
is possible node2 will mark node1 as status=shutdown and mark node1 is
UP.
In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"
As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.
TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']
In the log we can see node1 marked as DOWN and UP almost at the same time on node2:
INFO 2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
INFO 2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown
Fixes#3940
Tests: dtest with 20 consecutive succesful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
(cherry picked from commit eeeb2da7bb)
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
(cherry picked from commit b4a8802edc)
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
(cherry picked from commit 9433d02624)
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.
* seastar-dev.git haaawk/sst3/write-read-test/v3:
Fix use_row_ttl condition
Add test_all_data_is_read_back
(cherry picked from commit b8c405c019)
Fixes a build failure when only the scylla binary was selected for
building like this:
./configure.py --with scylla
In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o
Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit edbef7400b)
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.
Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 9a4c00beb7)
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
(cherry picked from commit b098b5b987)
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.
perf_checksum test was introduced to measure performance of various
checksumming operations.
Results for 514 B (relevant for writing with compression enabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 58414 16.711us 3.483ns 16.708us 16.725us
crc_test.perf_adler_combine 165788278 6.059ns 0.031ns 6.027ns 7.519ns
crc_test.perf_zlib_crc32_combine 59546 16.767us 26.191ns 16.741us 16.801us
---
crc_test.perf_deflate_crc32_checksum 12705072 83.267ns 4.580ns 78.687ns 98.964ns
crc_test.perf_adler_checksum 3918014 206.701ns 23.469ns 183.231ns 258.859ns
crc_test.perf_zlib_crc32_checksum 2329682 428.787ns 0.085ns 428.702ns 510.085ns
Results for 64 KB (relevant for writing with compression disabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 25364 38.393us 17.683ns 38.375us 38.545us
crc_test.perf_adler_combine 169797143 5.842ns 0.009ns 5.833ns 6.901ns
crc_test.perf_zlib_crc32_combine 26067 38.663us 95.094ns 38.546us 40.523us
---
crc_test.perf_deflate_crc32_checksum 202821 4.937us 14.426ns 4.912us 5.093us
crc_test.perf_adler_checksum 44684 22.733us 206.263ns 22.492us 25.258us
crc_test.perf_zlib_crc32_checksum 18839 53.049us 36.117ns 53.013us 53.274us
The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.
Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().
SStable write performance was evaluated by running:
perf_fast_forward --populate --data-directory /tmp/perf-mc \
--rows=10000000 -c1 -m4G --datasets small-part
Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.
Before:
[1] MC,lz4: 330'903
[2] LA,lz4: 450'157
[3] MC,checksum: 419'716
[4] LA,checksum: 459'559
After:
[1'] MC,lz4: 446'917 ([1] + 35%)
[2'] LA,lz4: 456'046 ([2] + 1.3%)
[3'] MC,checksum: 462'894 ([3] + 10%)
[4'] LA,checksum: 467'508 ([4] + 1.7%)
After this series, the performance of the MC format writer is similar to that
of the LA format before the series.
There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"
* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
tests: perf: Introduce perf_checksum
tests: Add test for libdeflate CRC32 implementation
sstables: compress: Use libdeflate for crc32
sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
licenses: Add libdeflate license
Integrate libdeflate with the build system
Add libdeflate submodule
sstables: Avoid checksum_combine() for the crc32 checksummer
sstables: compress: Avoid unnecessary checksum_combine()
sstables: checksum_utils: Add missing include
(cherry picked from commit 5e759b0c07)
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`read_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries will trigger segfault when it tries to evict the associated
inactive reads.
Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).
Fixes: #3962
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
(cherry picked from commit 77dbc7d09a)
Current code assumes that request failed if all replicas replied with
failure, but this is not true for CL=ANY requests. Take it into account.
Fixed: #3565
(cherry picked from commit 17197fb005)
Currently if write request reaches CL and all replicas replied, but some
replied with failures, the request will wait for timeout to be retired.
Detect this case and retire request immediately instead.
Fixes#3566
(cherry picked from commit d1d04eae3c)
Before accounting failure response we need to make sure it comes from a
replica that participates in the request.
(cherry picked from commit 76ab3d716b)
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
(cherry picked from commit 1891779e64)
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.
Message-Id: <20181009120900.GF22665@scylladb.com>
(cherry picked from commit 207b57a892)
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.
Message-Id: <20181009133017.GA14449@scylladb.com>
(cherry picked from commit 319ece8180)
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.
This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.
Fixes#3920
Tests: unit release(view_build_test, view_schema_test)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
(cherry picked from commit 6fbf792777)
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.
In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.
Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 1d5f8d0015)
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.
Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>
(cherry picked from commit 7b0a3fbf8a)
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.
Fixes#3955.
Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit aa19f98d18)