Three nodes n1, n2, n3 in the cluster:
Shut down n1, n2, n3.
Start n1 and n2.
Start n3; we see the features being enabled using the system table while n1 and n2 are already up and running in the cluster.
INFO 2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
The problem is that we enable the features too early in the startup process.
We should enable the features only after gossip has settled.
Fixes #4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>
This brings the version check up-to-date with README.md and HACKING.md,
which were updated by commit fa2b03 ("Replace std::experimental types
with C++17 std version.") to say that minimum GCC 8.1.1 is required.
Tests: manually run configure.py with various `--compiler` values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190318130543.24982-1-dejan@scylladb.com>
Three nodes in the cluster: node1, node2, node3.
Shut down the whole cluster.
Start node1.
Start node2; node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't started yet, so node2 sees node3 with an
empty feature set. In get_supported_features(), an empty common feature set
is returned if any node's feature set is empty. To fix this, we should fall
back to the features saved in the system table.
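The fallback described above can be sketched as follows. This is a minimal Python sketch; the function and parameter names (common_supported_features, gossip_features, saved_features) are illustrative, not the real gossiper API:

```python
def common_supported_features(gossip_features, saved_features):
    # gossip_features: node -> feature set as seen in gossip
    # saved_features: node -> feature set persisted in the system table
    common = None
    for node, feats in gossip_features.items():
        if not feats:
            # Node hasn't gossiped its features yet: fall back to the
            # system-table copy instead of letting an empty set wipe
            # out the whole intersection.
            feats = saved_features.get(node, set())
        common = set(feats) if common is None else common & set(feats)
    return common if common is not None else set()

# node2 (127.0.0.2) is restarting and has not gossiped its features yet.
gossip = {"127.0.0.1": {"XXHASH", "ROLES"}, "127.0.0.2": set()}
saved = {"127.0.0.2": {"XXHASH", "ROLES"}}
print(sorted(common_supported_features(gossip, saved)))  # ['ROLES', 'XXHASH']
```

Without the fallback, the intersection with node2's empty set would yield empty remote common_features, which is the bug being fixed.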
Start node3; node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't inserted its own features into the gossip
endpoint_state_map yet. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix this, we should fall back to the
features stored in the system table for such a node.
Fixes #4225
Fixes #4341
* dev asias/fix_check_knows_remote_features.upstream.v4.1:
gossiper: Remove unused register_feature and unregister_feature
gossiper: Remove unused wait_for_feature_on_all_node and
wait_for_feature_on_node
gossiper: Log feature is enabled only if the feature is not enabled
previously
gossiper: Fix empty remote common_features in
check_knows_remote_features
Three nodes in the cluster: node1, node2, node3.
Shut down the whole cluster.
Start node1.
Start node2; node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't started yet, so node2 sees node3 with an
empty feature set. In get_supported_features(), an empty common feature set
is returned if any node's feature set is empty. To fix this, we should fall
back to the features saved in the system table.
Start node3; node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't inserted its own features into the gossip
endpoint_state_map yet. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix this, we should fall back to the
features stored in the system table for such a node.
Fixes #4225
We saw the log message "Feature FOO is enabled" more than once, as shown
below. It is better to log it only when the feature was not previously
enabled.
gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - InetAddress 127.0.0.2 is now UP, status = NORMAL
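The fix boils down to making the enable step idempotent with respect to logging. A minimal Python sketch (the FeatureRegistry class and its names are illustrative, not the real gossiper implementation):

```python
class FeatureRegistry:
    def __init__(self):
        self._enabled = set()
        self.log = []

    def enable(self, feature):
        if feature in self._enabled:
            return  # already enabled: skip the duplicate log line
        self._enabled.add(feature)
        self.log.append(f"Feature {feature} is enabled")

reg = FeatureRegistry()
for _ in range(2):  # two nodes coming UP re-announce the same features
    for f in ("COUNTERS", "XXHASH"):
        reg.enable(f)
print(reg.log)  # ['Feature COUNTERS is enabled', 'Feature XXHASH is enabled']
```

Each feature is now logged exactly once, regardless of how many nodes re-announce it.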
It worked accidentally only because bash expanded the pattern to the
matching files in the current directory; the wildcard is not actually
recognized by tar. We need to use the full file name instead.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-2-syuu@scylladb.com>
We get the following link errors when running reloc/build_reloc.sh in
dbuild; we need to enable DPDK on Seastar:
g++: error: /usr/lib64/librte_cfgfile.so: No such file or directory
g++: error: /usr/lib64/librte_cmdline.so: No such file or directory
g++: error: /usr/lib64/librte_ethdev.so: No such file or directory
g++: error: /usr/lib64/librte_hash.so: No such file or directory
g++: error: /usr/lib64/librte_kvargs.so: No such file or directory
g++: error: /usr/lib64/librte_mbuf.so: No such file or directory
g++: error: /usr/lib64/librte_eal.so: No such file or directory
g++: error: /usr/lib64/librte_mempool.so: No such file or directory
g++: error: /usr/lib64/librte_mempool_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_bnxt.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_e1000.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ena.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_enic.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_fm10k.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_qede.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_i40e.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ixgbe.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_nfp.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_sfc_efx.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_vmxnet3_uio.so: No such file or directory
g++: error: /usr/lib64/librte_ring.so: No such file or directory
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-1-syuu@scylladb.com>
Since we removed dist/common/bin/scyllatop, we are getting a build error
during the .deb package build (1bb65a0888).
To fix the error, we need to create a symlink for /usr/bin/scyllatop.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190316162105.28855-1-syuu@scylladb.com>
"
Refuse to accept SSTables that were created with a partitioner
different from the one used by the Scylla server.
Fixes #4331
"
* 'haaawk/4331/v4' of github.com:scylladb/seastar-dev:
sstables: Add test for sstable::validate_partitioner
sstables: Add sstable::validate_partitioner and use it
Make sure the exception is thrown when Scylla
tries to load an SSTable created with an incompatible partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The Scylla server can't read sstables that were created
with a different partitioner than the one being used by Scylla.
We should make sure that Scylla identifies such a mismatch
and refuses to use such SSTables.
We can use the partitioner information stored in the validation metadata
(the Statistics.db file) of each SSTable and compare it against the
partitioner used by Scylla.
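The check itself is a straightforward name comparison. A hedged Python sketch (validate_partitioner here is a stand-in; the real function operates on the parsed Statistics.db validation metadata):

```python
class InvalidPartitionerError(Exception):
    pass

def validate_partitioner(sstable_metadata, server_partitioner):
    # Compare the partitioner recorded when the sstable was written
    # against the partitioner this server is configured with.
    stored = sstable_metadata["partitioner"]
    if stored != server_partitioner:
        raise InvalidPartitionerError(
            f"sstable was created with {stored}, "
            f"but this server uses {server_partitioner}")

meta = {"partitioner": "org.apache.cassandra.dht.Murmur3Partitioner"}
validate_partitioner(meta, "org.apache.cassandra.dht.Murmur3Partitioner")  # accepted
try:
    validate_partitioner(meta, "org.apache.cassandra.dht.RandomPartitioner")
except InvalidPartitionerError as e:
    print("refused:", e)
```

Refusing at load time avoids silently misplacing data whose tokens were computed with a different partitioner.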
Fixes #4331
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
A previous version of the patch that introduced these calls had no
limit on how far behind the large data recording could get, and
maybe_record_large_cells returned null.
The final version switched to a semaphore, but unfortunately these
calls were not updated.
Tests: unit (dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190314195856.66387-1-espindola@scylladb.com>
In order to allow yielding when handling endpoint lifecycle changes,
notifiers now run in a threaded context.
Implementations which used this assumption before are supplemented
with assertions that they indeed run in seastar::async mode.
Fixes#4317
Message-Id: <45bbaf2d25dac314e4f322a91350705fad8b81ed.1552567666.git.sarna@scylladb.com>
* seastar e640314...463d24e (3):
> Merge 'Handle IOV_MAX limit in posix_file_impl' from Paweł
> core: remove unneeded 'exceptional future ignored' report
> tests/perf: support multiple iterations in a single test run
Issue #4234 asks for a large collection detector. While discussing the
issue, Benny pointed out that it is probably better to have a generic large
cell detector, as it is a natural progression from what we already
warn about (large partitions and large rows).
This patch series implements that. It is on top of
shutdown-order-patches-v7 which is currently on next.
With the changes to use a semaphore, this patch series might be getting
a bit big. Let me know if I should split it.
* https://github.com/espindola/scylla espindola/large-cells-on-top-of-shutdown-v5:
db: refactor large data deletion code
db: Rename (maybe_)?update_large_partitions
db: refactor a try_record helper
large_data_handler: assert it is not used after stop()
db: don't use _stopped directly
sstables: delete dead error handling code.
large_data_handler: Remove const from a few functions
large_data_handler: propagate a future out of stop()
large_data_handler: Run large data recording in parallel
Create a system.large_cells table
db: Record large cells
Add a test for large cells
This is analogous to the system.large_rows table, but holds individual
cells, so it also needs the column name.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With these changes, the futures returned by large_data_handler will not
normally wait for entries to be written to system.large_rows or
system.large_partitions.
We use a semaphore to bound how behind system.large_* table updates
can get.
This should avoid delaying sstable writes in the common case, which
becomes more relevant once we warn about large cells, since the default
threshold will be just 1MB.
Note that there is no ordering between the various maybe_record_* and
maybe_delete_large_data_entries requests. This means that we can end
up with a stale entry that is only removed once the TTL expires.
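The semaphore-bounding idea can be sketched in Python with a plain threading.Semaphore (the real code uses a Seastar semaphore; maybe_record_large_cell, background_write, and MAX_IN_FLIGHT are illustrative names):

```python
import threading
import time

MAX_IN_FLIGHT = 4  # bound on pending system.large_* updates
units = threading.Semaphore(MAX_IN_FLIGHT)
written = []
lock = threading.Lock()

def background_write(entry):
    try:
        time.sleep(0.01)  # simulate the system.large_cells insert
        with lock:
            written.append(entry)
    finally:
        units.release()

def maybe_record_large_cell(entry):
    # The caller only waits for a semaphore unit, never for the write
    # itself, so it blocks only once MAX_IN_FLIGHT updates are pending.
    units.acquire()
    threading.Thread(target=background_write, args=(entry,)).start()

for i in range(16):
    maybe_record_large_cell(i)
# Reacquiring all units waits for the tail of the background work.
for _ in range(MAX_IN_FLIGHT):
    units.acquire()
print(len(written))  # 16
```

The sstable write path thus stays fast in the common case, while the semaphore guarantees the background table updates can never fall arbitrarily far behind.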
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These will use a member semaphore variable in a followup patch, so they
cannot be const.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
maybe_delete_large_data_entries handles exceptions internally, so the
code this patch deletes would never run.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This should have been changed in the patch
db: stop the commit log after the tables during shutdown
But unfortunately I missed it then.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We had almost identical error handling for large_partitions and
large_rows. Refactor in preparation for large_cells.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This renames it to record_large_partitions, which matches
record_large_rows. It also changes the signature to be closer to
record_large_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The code for deleting entries from system.large_partitions was almost
a duplicate of the code for deleting entries from system.large_rows.
This patch unifies the two, which also improves the error message when
we fail to delete entries from system.large_partitions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There is no guarantee that rpc streaming makes progress within any given
time period. Remove the keep-alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.
The keep-alive timer was used to close the session in the following case:
n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver),
then n2 is killed with kill -9.
We needed this because we do not kill the session when gossip thinks a node
is down: the node being down might only be temporary, and it would be a
waste to drop the work that has already been done, especially
when the stream session takes a long time.
Since in range_streamer we do not stream all the data in a single stream
session (we stream 10% of the data at a time) and we have retry logic,
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes the code to close all stream sessions with a node
that gossip thinks is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with a long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.
This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.
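The per-endpoint bookkeeping described above can be sketched as follows. This is an illustrative Python sketch, not the real storage_proxy code; WriteHandler, start_view_update, and on_down are hypothetical names:

```python
from collections import defaultdict

class WriteHandler:
    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.done = False
        self.timed_out = False

    def timeout(self):
        # Trigger the artificial timeout; the caller then falls back to
        # storing hints and/or not progressing with view building.
        if not self.done:
            self.timed_out = True
            self.done = True

class StorageProxy:
    def __init__(self):
        # Per-endpoint registry of in-flight interruptible writes.
        self._pending = defaultdict(list)

    def start_view_update(self, endpoint):
        h = WriteHandler(endpoint)
        self._pending[endpoint].append(h)
        return h

    def on_down(self, endpoint):
        # Membership change: time out every pending write to this node
        # instead of waiting up to 5 minutes for each one.
        for h in self._pending.pop(endpoint, []):
            h.timeout()

proxy = StorageProxy()
h = proxy.start_view_update("10.0.0.2")
proxy.on_down("10.0.0.2")
print(h.timed_out)  # True
```

The same mechanism serves shutdown: draining the proxy times out all registered handlers at once.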
Fixes #3826
Fixes #3966
Fixes #4028
"
* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
service: remove unused stop_hints_manager
storage_proxy: add drain_on_shutdown implementation
main: register storage proxy as lifecycle subscriber
storage_proxy: add endpoint_lifecycle_subscriber interface
storage_proxy: register view update handlers for view write type
storage_proxy: add intrusive list of view write handlers
storage_proxy: add view_update_write_response_handler
Complex timestamp tests were ported from dtest and contained a potential
race: rows were updated with TTL 1 and then checked, in an eventually()
loop, for existence in both base and view replicas.
During this loop, however, the TTL of 1 second might have already passed
and the row could have been deleted from the base table.
This patch changes the mentioned TTL to 30 seconds, making the tests
extremely unlikely to be flaky.
Message-Id: <6b43fe31850babeaa43465eb771c0af45ee6e80d.1552041571.git.sarna@scylladb.com>
This series contains several improvements to perf_fast_forward that
either address some of the problems seen in the automated runs or help
understanding the results.
The main problem was that the small-partition-slicing test had a
preparation stage disproportionately long compared to the actual testing
phase. While the fragments-per-second results weren't affected by that, it
restricted the number of iterations of the test that we were able to run,
and the test, whose single iteration is short (and more prone to noise),
was executed only four times. This was solved by sharing the preparation
stage across all iterations, thus enabling the test to be run many times
and improving the stability of the results.
Another improvement is the ability to dump all test results and process
them to produce histograms. This lets us see what the distribution of
particular statistics looks like and whether there are any complications.
Refs #4278.
* https://github.com/pdziepak/scylla.git more-perf_fast_forward/v1:
tests/perf_fast_forward: print number of iterations of each test
tests/perf_fast_forward: reuse keys in small partition slicing test
tests/perf_fast_forward: extract json result file writing logic
tests/perf_fast_forward: add an option to dump all results
tests/perf_fast_forward: add script for analysing full results
When the storage proxy is shutting down, all interruptible writes
can be timed out so that we do not wait for them. Instead, the mechanism
will fall back to storing hints and/or not progressing with view
building.