* seastar d40453b...8a76d06 (3):
> memory: be less strict about NUMA bindings
> reactor: let the resource code specify the default memory reserve
> resource: reserve even more memory when hwloc is compiled in
Commit 56df32ba56 (gossip: Mark node as dead even if already left)
missed a node liveness check. Fix it up.
Before: (mark a node down multiple times)
[Tue Dec 8 12:16:33 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:33 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:34 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:34 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:35 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:35 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
After: (mark a node down only one time)
[Tue Dec 8 12:28:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:28:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
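The missed guard can be sketched as follows (a simplified model; `gossiper`, `mark_dead`, `endpoint_state`, and the log vector are illustrative names, not Scylla's exact API):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, reduced model of the gossiper's dead-marking path.
struct endpoint_state {
    bool alive = true;
};

struct gossiper {
    std::map<std::string, endpoint_state> endpoints;
    std::vector<std::string> down_log;  // stands in for the "is now DOWN" log line

    void mark_dead(const std::string& addr) {
        auto& st = endpoints[addr];
        // The fix: skip endpoints that are already marked dead, so the
        // DOWN notification (and on_dead callbacks) fire only once.
        if (!st.alive) {
            return;
        }
        st.alive = false;
        down_log.push_back(addr + " is now DOWN");
    }
};
```

With the guard in place, repeated mark_dead calls for the same endpoint log DOWN only once, matching the "After" trace above.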
(cherry picked from commit 5a65d8bcdd)
If there is no snapshot directory for the specific column family,
get_snapshot_details should return an empty map.
This patch checks that the directory exists before trying to iterate over
it.
Fixes #619
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Scylla changes:
sstable.cc: Remove file_exists() function which conflicts with seastar's
Amnon Heiman (2):
reactor: Add file_exists method
Add a wrapper for file_exists
Avi Kivity (2):
Merge "Introduce shared_future" from Tomasz
> Merge "scripts: a few fixes in posix_net_conf.sh" from Vlad
Gleb Natapov (3):
rpc: not stop client in error state
> avoid allocation in parallel_for_each if there is nothing to do
memory: fix size_to_idx calculation
Nadav Har'El (1):
test: fix use-after-free in timertest
Paweł Dziepak (1):
memory: use size instead of old_size to shrink memory block
Tomasz Grabiec (7):
file: Mark move constructor as noexcept
core: future: Add static asserts about type's noexcept guarantees
core: future: Drop now redundant move_noexcept flag
core: future_state: Make state getters non-destructive for non-rvalue-refs
core: future: Make get_available_state() noexcept
core: Introduce shared_future
Make json_return_type movable
Vlad Zolotarov (8):
scripts: posix_net_conf.sh: ban NIC IRQs from being moved by irqbalance
scripts: posix_net_conf.sh: exclude CPU0 siblings from RPS
scripts: posix_net_conf.sh: Configure XPS
scripts: posix_net_conf.sh: Add a new mode for MQ NICs
scripts: posix_net_conf.sh: increase some backlog sizes
core: to_sstring(): cleanup
core: to_sstring_strintf(): always use %g(or %lg) format for floating point values
core: prevent explicit calls for to_sstring_sprintf()
In a recent discussion with the XFS developers, Dave Chinner recommended
us *not* to use discard, but rather issue fstrims explicitly. In machines
like Amazon's c3-class, the situation is made worse by the fact that discard
is not supported by the disk. Contrary to my intuition, adding the discard
mount option in such situation is *not* a nop and will just create load
for no reason.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Objects extending json_base are not movable, so we won't be able to
pass them via future<>, which asserts that types are nothrow move
constructible.
This problem only affects httpd::utils_json::histogram, which is used
in map-reduce. This patch changes the aggregation to work on the domain
value (utils::ihistogram) instead of json objects.
The config file expresses this number in MB, while total_memory() gives us
a quantity in bytes. This causes the commitlog not to flush until we reach
really sky-high numbers.
While we need this fix for the short term before we cook another release,
I will note that for the mid/long term, it would be really helpful to stop
representing memory amounts as integers, and use an explicit C++ type for
those. That would have prevented this bug.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Print a map in the form of [(]{ key0 : value0 }[, { keyN : valueN }]*[)]
The map is printed inside () brackets if it's frozen.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
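The described format can be sketched with a standalone helper (hypothetical; the real patch implements printing for Scylla's own map type):

```cpp
#include <map>
#include <sstream>
#include <string>

// Prints a map as { key0 : value0 }, { keyN : valueN }, wrapping the
// whole thing in () brackets when the map is frozen.
inline std::string print_map(const std::map<std::string, std::string>& m,
                             bool frozen) {
    std::ostringstream os;
    if (frozen) os << "(";
    bool first = true;
    for (auto& [k, v] : m) {
        if (!first) os << ", ";
        os << "{ " << k << " : " << v << " }";
        first = false;
    }
    if (frozen) os << ")";
    return os.str();
}
```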
In origin, there are two APIs to get the information about the current
running compactions. Both APIs do the string formatting.
This patch changes the API to have a single API, get_compaction, that
returns a list of summary objects.
The jmx would do the string formatting for the two APIs.
This change gives a better API experience, as it's better documented and
will make it easier to support future format changes in origin.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
That's what we're trying to standardize on.
This patch also fixes an issue with the current query::result::serialize()
not being const-qualified, because it modifies the
buffer. messaging_service did a const cast to work around this, which
is not safe.
This patch adds the implementation of get_version.
After this patch the following url will be available:
messaging_service/version?addr=127.0.0.1
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"This series allows the compaction manager to be used by the nodetool as a stub implementation.
It has two changes:
* Add to the compaction manager API a method that returns a compaction info
object
* Stub all the compaction methods so that they create an unimplemented
warning but do not fail; the API implementation will be reverted when the
work on compaction is completed."
This patch fixes the following cql_query_test failure.
cql_query_test: scylla/seastar/core/sharded.hh:439:
Service& seastar::sharded<Service>::local() [with Service =
gms::gossiper]: Assertion `local_is_initialized()' failed.
The problem is that in gossiper::stop() we call gossip::add_local_application_state(),
which will in turn call gms::get_local_gossiper(). In seastar::sharded::stop:
_instances[engine().cpu_id()].service = nullptr;
return inst->stop().then([this, inst] {
return _instances[engine().cpu_id()].freed.get_future();
});
We set the _instances to nullptr before we call the stop method, so
local_is_initialized asserts when we try to access get_local_gossiper
again.
To fix, we make the stopping of gossiper explicit. In the shutdown
procedure, we call stop_gossiping() explicitly.
This has two more advantages:
1) The api to stop gossip now calls stop_gossiping() instead of
sharing the seastar::sharded stop method.
2) We can now get rid of the _handler seastar::sharded helper.
The add interface of the estimated histogram is confusing, as it is not
clear what units are used.
This patch removes the general add method and replaces it with an add_nano
method that adds nanoseconds, and an add method that takes a duration.
To be compatible with origin, nanosecond values are translated to
microseconds.
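The clarified interface can be sketched like this (a reduced model with illustrative storage; the real estimated_histogram buckets its samples):

```cpp
#include <chrono>
#include <vector>

// Sketch: add() takes a typed duration, add_nano() takes a raw
// nanosecond count; both store microseconds for origin compatibility.
struct estimated_histogram {
    std::vector<long long> samples_us;  // stand-in for the real buckets

    void add_nano(long long nanos) {
        samples_us.push_back(nanos / 1000);  // nanoseconds -> microseconds
    }

    template <typename Rep, typename Period>
    void add(std::chrono::duration<Rep, Period> d) {
        add_nano(std::chrono::duration_cast<std::chrono::nanoseconds>(d).count());
    }
};
```

Callers that pass a std::chrono duration can no longer get the units wrong; only add_nano accepts a bare integer.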
This patch adds a started counter that is used to count the number of
operations that were started.
This counter serves two purposes: it is a better indication of when to
sample the data, and it indicates how many operations are pending.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the column family API that returns the snapshot size.
The changes in the swagger definition file follow origin so the same API will be used for the metric and the
column_family.
The implementation is based on the get_snapshot_details in the
column_family.
Fixes #425
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Backport: CASSANDRA-10330
ae4cd69 Print versions for gossip states in gossipinfo
For instance, the version for each state, which can be useful for
diagnosing the reason for any missing states. Also instead of just
omitting the TOKENS state, let's indicate whether the state was actually
present or not.
With
Node 1 (Seed node, Port 7000 is opened, 10.184.9.144)
Node 2 (Port 7000 is opened, 10.184.9.145)
Node 3 (Port 7000 is blocked by firewall)
On Node 3, we saw the following error, which was very confusing: Node 3
saw Node 1 and Node 2 but complained it cannot contact any seeds.
The message "Node 10.184.9.144 is now part of the cluster" and friends
are actually messages printed during the gossip shadow round where Node
3 connects to Node 1's port 7000 and Node 1 returns all info it knows to
Node 3, so that Node 3 knows Node 1 and Node 2 and we see the "Node
10.184.9.144/145 is now part of the cluster" message.
However, during the normal gossip round, Node 3 will not mark Node 1 and
Node 2 UP until the Seed node initiates a gossip round to Node 3, (note
port 7000 on node 3 is blocked in this case). So Node 3 will not mark
Node 1 and Node 2 UP and we see the "Unable to contact any seeds" error.
[shard 0] storage_service - Loading persisted ring state
[shard 0] gossip - Node 10.184.9.144 is now part of the cluster
[shard 0] gossip - inet_address 10.184.9.144 is now UP
[shard 0] gossip - Node 10.184.9.145 is now part of the cluster
[shard 0] gossip - inet_address 10.184.9.145 is now UP
[shard 0] storage_service - Starting up server gossip
scylla_run[12479]: Start gossiper service ...
[shard 0] storage_service - JOINING: waiting for ring information
[shard 0] storage_service - JOINING: schema complete, ready to bootstrap
[shard 0] storage_service - JOINING: waiting for pending range calculation
[shard 0] storage_service - JOINING: calculation complete, ready to bootstrap
[shard 0] storage_service - JOINING: getting bootstrap token
[shard 0] storage_service - JOINING: sleeping 5000 ms for pending range setup
scylla_run[12479]: Exiting on unhandled exception of type 'std::runtime_error': Unable to contact any seeds!
Backported: CASSANDRA-8336 and CASSANDRA-9871
84b2846 remove redundant state
b2c62bb Add shutdown gossip state to prevent timeouts during rolling restarts
8f9ca07 Cannot replace token does not exist - DN node removed as Fat Client
Fixes:
When X is shut down, X sends a SHUTDOWN message to both Y and Z, but for
some reason only Y receives the message and Z does not. If Z has a
higher gossip version for X than Y has for
X, Z will initiate a gossip with Y and Y will mark X alive again.
X ------> Y
\ /
\ /
Z
Fixes: #593
"Changes the parser/replayer to treat data corruption as non-fatal,
skipping as little as possible to get the most data out of a segment,
but keeping track of, and reporting, the amount corrupted.
Replayer handles this and reports any non-fatal errors on replay finish.
Also added tests for corruption cases.
This patch series contains a cleanup-patch for commitlog_tests that was
previously submitted, but got lost."
If something bad happens between write request handler creation and
request execution, the request handler has to be destroyed. Currently the
code tries to do that explicitly in all places where a request may be
abandoned, but it misses some (at least one). This patch replaces this
by introducing a unique_response_handler object that removes the handler
automatically if the request is not executed for some reason.
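The RAII idea can be sketched like this (a simplified model; the real unique_response_handler works against storage_proxy's handler registry):

```cpp
#include <functional>
#include <map>

// Simplified stand-in for the handler registry kept by the write path.
using handler_id = unsigned;
using handler_map = std::map<handler_id, std::function<void()>>;

// Removes the handler automatically on destruction unless the request
// was actually executed, in which case release() disarms the cleanup.
class unique_response_handler {
    handler_map& _handlers;
    handler_id _id;
    bool _armed = true;
public:
    unique_response_handler(handler_map& m, handler_id id)
        : _handlers(m), _id(id) {}
    unique_response_handler(const unique_response_handler&) = delete;
    handler_id release() {            // request was sent: keep the handler
        _armed = false;
        return _id;
    }
    ~unique_response_handler() {
        if (_armed) {
            _handlers.erase(_id);     // request abandoned: clean up
        }
    }
};
```

Every abandonment path (exceptions included) now cleans up through the destructor instead of relying on explicit erase calls.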
Rename antlr3-tool to antlr3 (same as distribution package), and use distribution version if it's available
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
"Before this change, populations could race with an update from a flushed
memtable, which might result in the cache being populated with older
data. Populations started before the flush do not consider the
memtable nor its sstable.
The fix employed here is to make update wait for populations which
were started before the flushed memtable's sstable was added to the
underlying data source. All populations started after that are
guaranteed to see the new data. The update() call will wait only for
current populating reads to complete; it will not wait for readers to
get advanced by the consumer, for instance."
To avoid a race where natural endpoint was updated to contain node A,
but A was not yet removed from pending endpoints.
This fixes the root cause of commit d9d8f87c1 (storage_proxy: filter out
natural endpoints from pending endpoints). This patch alone fixes #539,
but we still want commit d9d8f87c1 to be safe.
When other bootstrapping/leaving/moving nodes are found during
bootstrap, instead of throwing immediately, sleep and try again for one
minute, hoping other nodes will finish the operation soon.
Since we are retrying the shadow gossip round more than once, we need
to put the gossip state back to shadow round after each shadow round, to
make the shadow round work correctly.
This is useful when starting an empty cluster for testing. E.g,
$ scylla --listen-address 127.0.0.1
$ sleep 3
$ scylla --listen-address 127.0.0.2
$ sleep 3
$ scylla --listen-address 127.0.0.3
Without this patch, node 3 will hit the check.
TIME STATUS
-----------------------
Node 1:
32:00 Starts
32:00 In NORMAL status
Node 2:
32:03 Starts
32:04 In BOOT status
32:10 In NORMAL status
Node 3:
32:06 Starts
32:06 Found node 2 in BOOT status, hit the check, sleep and try again
32:11 Found node 2 in NORMAL status, can keep going now
32:12 In BOOT status
32:18 In NORMAL status
When other bootstrapping/leaving/moving nodes are found during
bootstrap, instead of throwing immediately, sleep and try again for one
minute, hoping other nodes will finish the operation soon.
This is useful when starting an empty cluster for testing. E.g,
$ scylla --listen-address 127.0.0.1
$ scylla --listen-address 127.0.0.2
$ scylla --listen-address 127.0.0.3
Without this patch, node 3 will hit the check.
TIME STATUS
-----------------------
Node 1:
25:19 Starts
25:20 In NORMAL status
Node 2:
25:19 Starts
25:23 In BOOT status
25:28 In NORMAL status
Node 3:
25:19 Starts
25:24 Found node 2 in BOOT status, hit the check, sleep and try again
25:29 Found node 2 in NORMAL status, can keep going now
25:29 In BOOT status
25:34 In NORMAL status
Before this change, populations could race with an update from a flushed
memtable, which might result in the cache being populated with older
data. Populations started before the flush do not consider the
memtable nor its sstable.
The fix employed here is to make update wait for populations which
were started before the flushed memtable's sstable was added to the
underlying data source. All populations started after that are
guaranteed to see the new data.
The text data type is no longer present in CQL binary protocol v3 and
later. We don't need it for encoding earlier versions either because
it's an alias for varchar which is present in all CQL binary protocol
versions.
Fixes #526.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
This patch plus pekka's previous commit 3c72ea9f96
"gms: Fix gossiper::handle_major_state_change() restart logic"
fix CASSANDRA-7816.
Backported from:
def4835 Add missing follow on fix for 7816 only applied to
cassandra-2.1 branch in 763130bdbde2f4cec2e8973bcd5203caf51cc89f
763130b Followup commit for 7816
2199a87 Fix duplicate up/down messages sent to native clients
Tested by:
pushed_notifications_test.py:TestPushedNotifications.restart_node_test
CQL 3.2.1 introduces a "TRUNCATE TABLE X" alias for "TRUNCATE X":
4e3555c1d9
Fix our CQL grammar to also support that.
Please note that we don't bump up advertised CQL version yet because our
cqlsh clients won't be able to connect by default until we upgrade them
to C* 2.1.10 or later.
Fixes #576
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
The FIXMEs are no longer valid; we load the schema on bootstrap and don't
support hot-plugging of column families via the file system (nor does
Cassandra).
Handling of missing tables matches Cassandra 2.1: log it
and continue, while queries propagate the error.
If a request comes after the natural endpoints were updated to contain node A,
but A was not yet removed from the pending endpoints, it will be in both, and
the write request logic cannot handle this properly. Filter nodes which are
already in the natural endpoints from the pending endpoints to fix this.
Fixes #539.
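A minimal sketch of the filtering (container and address types are illustrative stand-ins for Scylla's):

```cpp
#include <algorithm>
#include <string>
#include <vector>

using inet_address = std::string;  // stand-in for gms::inet_address

// Drop from 'pending' every endpoint that already appears in 'natural',
// so a node is never counted twice by the write handler.
std::vector<inet_address>
filter_pending(const std::vector<inet_address>& natural,
               std::vector<inet_address> pending) {
    pending.erase(
        std::remove_if(pending.begin(), pending.end(),
            [&](const inet_address& ep) {
                return std::find(natural.begin(), natural.end(), ep)
                       != natural.end();
            }),
        pending.end());
    return pending;
}
```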
boost::heap::binomial_heap allocates a helper object in push() and,
therefore, may throw an exception. This shouldn't happen during
compaction.
The solution is to reserve space for this helper object in the
segment_descriptor and use a custom allocator with
boost::heap::binomial_heap.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
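The reserved-space idea can be sketched like this (a simplified model; the real patch embeds the storage in segment_descriptor and plugs the allocator into boost::heap::binomial_heap):

```cpp
#include <cassert>
#include <cstddef>

// Each descriptor reserves storage for the one helper node that
// push() would otherwise heap-allocate (and could throw on).
struct preallocated_storage {
    alignas(std::max_align_t) unsigned char buf[64];
    bool in_use = false;
};

// Allocator that only ever hands out the reserved slot, so allocation
// can never throw during the push.
template <typename T>
struct reserved_allocator {
    using value_type = T;
    preallocated_storage* storage;

    explicit reserved_allocator(preallocated_storage* s) : storage(s) {}
    template <typename U>
    reserved_allocator(const reserved_allocator<U>& o) : storage(o.storage) {}

    T* allocate(std::size_t n) {
        assert(n == 1 && sizeof(T) <= sizeof(storage->buf) && !storage->in_use);
        storage->in_use = true;
        return reinterpret_cast<T*>(storage->buf);
    }
    void deallocate(T*, std::size_t) { storage->in_use = false; }
};
```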
The LSA memory reclaimer logic assumes that the amount of memory used by LSA
equals segments_in_use * segment_size. However, LSA is also responsible
for eviction of large objects, which do not affect the used segment count;
e.g. a region with no used segments may still use a lot of memory for
large objects. The solution is to switch from measuring memory in used
segments to a used bytes count that also includes large objects.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since this won't check disk types, it may re-initialize the RAID on EBS when the first block was lost.
But in such a condition, re-initializing the RAID is probably the only choice we can take, so this is fine.
Fixes #364.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
With this patch, start two nodes
node 1:
scylla --rpc-address 127.0.0.1 --broadcast-rpc-address 127.0.0.11
node 2:
scylla --rpc-address 127.0.0.2 --broadcast-rpc-address 127.0.0.12
On node 1:
cqlsh> SELECT rpc_address from system.peers;
rpc_address
-------------
127.0.0.12
which means clients should use this address to connect to node 2 for the cql and
thrift protocols.
It is the same as
-Dcassandra.consistent.rangemovement
in cassandra.
Use it as:
$ scylla --consistent-rangemovement 0
or
$ scylla --consistent-rangemovement 1
Messaging service closes the connection in the rpc call continuation on
closed_error, but the code runs for each outstanding rpc call on the
connection, so the first continuation may destroy a genuinely closed
connection, then the connection is reopened, and the next continuation that
handles the previous error kills a now perfectly healthy connection. Fix
this by closing the connection only in the error state.
From Avi:
Origin supports a notion of empty values for non-container types; these
are serialized as zero-length blobs. They are mostly useless and only
retained for compatibility.
The implementation here introduces a wrapper maybe_empty<T>, similar to
optional<T> but oriented towards usually-nonempty usage with implicit
conversion.
There is more work needed for full empty support: fixing up deserializers to
create empty values instead of nulls, and splitting up data_value into
data_value and a data_value_nonnull for the cases that require it.
(I chose maybe_empty<> rather than using optional<data_value> for nullable
data_value both because it requires fewer changes, and because
optional<data_value> introduces a lot of control flow when moving or copying,
which would be mostly useless in most cases).
This cleanup patch got lost in git-space some time ago. It is however sorely
needed...
* Use cleaner wrapper for creating temp dir + commit log, avoiding
having to clear and clean in every test, etc.
* Remove assertions based on file system checks, since these are not
valid due to both the async nature of the CL, and more to the point,
because of pre-allocation of files and file blocks. Use CL
counters/methods instead
* Fix some race conditions to ensure tests are safe(r)
* Speed up some tests
Discern fatal and non-fatal exceptions, and handle data corruption
by adding it to stats and reporting it, but continuing processing.
Note that "invalid_argument", i.e. attempting to replay origin/old
segments, is still considered fatal, as it is probably better to
signal this strongly to the user/admin.
The parser object now attempts to skip past/terminate parsing on corrupted
entries/chunks (as detected by invalid sizes/crcs). The amount of data
skipped is kept track of (as well as we can estimate - pre-allocation
makes it tricky), and at the end of parsing/reporting, iff errors
occurred, an exception detailing the failures is thrown (since
subscription has little mechanism to deal with this otherwise).
Thus a caller can decide how to deal with data corruption, but will be
given as many entries as possible.
An empty serialized representation means an empty value, not NULL.
Fix up the confusion by converting incorrect make_null() calls to a new
make_empty(), and removing make_null() in empty-capable types like
bytes_type.
Collections don't support empty serialized representations, so remove
the call there.
Parameter evaluation order is unspecified, so it's possible that the
move of 'schema' into the lambda captures would happen before the
construction of the mutation.
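The hazard can be reduced to a small example (names are made up); the fix is to force the ordering with a separate statement:

```cpp
#include <memory>
#include <string>
#include <utility>

struct schema { std::string name; };

// Buggy shape: both arguments read 's'; the compiler is free to evaluate
// the lambda capture (moving 's') before the first argument:
//
//   apply(make_mutation(*s), [s = std::move(s)] { /* ... */ });
//
// Safe shape: evaluate the dependent expression in its own statement,
// which is sequenced before the move.
std::pair<std::string, std::shared_ptr<schema>>
safe_order(std::shared_ptr<schema> s) {
    auto name = s->name;                     // guaranteed to run first
    return {std::move(name), std::move(s)};  // now the move is safe
}
```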
"To speed up boot, parallelism was introduced to our code that loads
sstables from a column family, a function was implemented to read
the minimum from a sstable to determine whether it belongs to the
current shard, and buffer size in read simple is dynamically chosen
based on the size of the file and dma alignment.
The latter is important because filter file can be considerably
large when the respective sstable (data file) is very large.
Before this patchset, scylla took about 5 minutes to boot with a
data directory of 660GB. After this patchset, scylla took about 20
seconds to boot with the same data directory."
Avi says:
"A small buffer size will hurt if we read a large file, but
a large buffer size won't hurt if we read a small file, since
we close it immediately."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
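The buffer-size heuristic can be sketched as follows (constants and names are illustrative, not the patchset's actual values):

```cpp
#include <algorithm>
#include <cstdint>

// Round x up to a multiple of 'alignment' (a power of two).
inline uint64_t align_up(uint64_t x, uint64_t alignment) {
    return (x + alignment - 1) & ~(alignment - 1);
}

// Read small files (e.g. filters of small sstables) in a single aligned
// I/O, but cap the buffer for large files, per Avi's observation above.
inline uint64_t choose_buffer_size(uint64_t file_size,
                                   uint64_t dma_alignment = 4096,
                                   uint64_t max_buffer = 128 * 1024) {
    return std::min(align_up(file_size, dma_alignment), max_buffer);
}
```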
Currently, we only determine if an sstable belongs to the current shard
after loading some of its components into memory. For example, the
filter may be considerably big, and its content is irrelevant for
deciding if an sstable should be included in a given shard.
Start using the functions previously introduced to optimize the
sstable loading process. add_sstable no longer checks if a sstable
is relevant to the current shard.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Boot may be slow because the function that loads sstables does so
serially instead of in parallel. In the callback supplied to
lister::scan_dir, let's push the future returned by probe_file
(the function that loads an sstable) into a vector of futures and wait
for all of them at the end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
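In plain C++ terms the change looks like this (the real code uses seastar futures and probe_file; std::async stands in here and the doubling lambda is a placeholder for the loader):

```cpp
#include <future>
#include <vector>

// Before: each probe_file() was awaited serially. After: push all the
// futures into a vector and wait for them at the end, so probing
// proceeds in parallel.
int load_all(const std::vector<int>& entries) {
    std::vector<std::future<int>> futures;
    for (int e : entries) {
        futures.push_back(std::async(std::launch::async,
                                     [e] { return e * 2; }));  // probe_file(e)
    }
    int loaded = 0;
    for (auto& f : futures) {
        loaded += f.get();  // "wait for all of them at the end"
    }
    return loaded;
}
```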
We cannot share some dependency package names between 14.04 and 15.10, so we need to add ifdefs.
Not tested on other versions of Ubuntu.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Origin supports (https://issues.apache.org/jira/browse/CASSANDRA-5648) "empty"
values even for non-container types such as int. Use maybe_empty<> to
encapsulate abstract_type::native_type, adding an empty flag if needed.
Similar to optional<>, with the following differences:
- decays back to the encapsulated type, with an emptiness check;
this reflects the expectation that the value will rarely be empty
- avoids conditionals during copy/move (and requires a default constructor),
again with the same expectation.
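A minimal sketch of the idea (an assumption-laden model, not Scylla's actual maybe_empty implementation):

```cpp
#include <utility>

// Like optional<T>, but biased toward the usually-nonempty case:
// decays back to T, with an explicit emptiness check available.
template <typename T>
class maybe_empty {
    T _value{};    // default-constructed when empty (hence the
    bool _empty;   // default-constructor requirement, unlike optional<>)
public:
    maybe_empty() : _empty(true) {}
    maybe_empty(T v) : _value(std::move(v)), _empty(false) {}  // implicit in
    bool empty() const { return _empty; }
    operator const T&() const {  // implicit out: decays to the wrapped type
        // a real implementation would check emptiness here
        return _value;
    }
};
```

Because copy/move just copy the flag and the (always-constructed) value, there are no conditionals on those paths, matching the second bullet above.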
When we start sending mutations for cf_id to a remote node, the remote node
might not have the cf_id anymore, due to dropping of the cf, for
instance.
We should not fail the streaming if this happens; since the cf does not
exist anymore, there is no point streaming it.
Fixes #566
When a new node joins a cluster, it starts a gossip round with a seed
node. However, within this round, the seed node will not tell the new
node anything it knows about other nodes in the cluster, because the
digest in the gossip SYN message contains only the new node itself and
no other nodes. The seed node picks randomly from the live nodes,
including the newly added node, in do_gossip_to_live_member to start a
gossip round. If the new node is "lucky", the seed node will talk to it very
soon and tell it all the information it knows about the cluster; thus the
new node will mark the seed node alive and think it has seen the seed
node. If there is a considerably large number of live nodes, it might take a
long time before the seed node picks the new node and talks to it.
In the bootstrap code, storage_service::bootstrap checks if we see any nodes
after a sleep of RING_DELAY milliseconds and throws "Unable to contact any
seeds!" if not, so the node will fail to bootstrap.
To help the seed node talk to the new node faster, we favor the new node in
do_gossip_to_live_member.
In origin, get_all_endpoint_states performs all the information
formatting and returns a string.
This is not a good API approach; this patch replaces the implementation
so the API returns an array of values and the JMX does the
formatting.
This is a better API and will make it simpler in the future to stay in
sync with origin output.
This patch is part of #508
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes #551.
Change the mountpoint to /var/lib/scylla and copy conf/ onto it.
Note: we need to replace conf/ with a symlink to /etc/scylla when the new rpm is uploaded to the yum repository.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Pekka Enberg <penberg@iki.fi>
If we get a partition with no row data, but statics, we should treat this as
a row (include it in the count), but also make sure we skip to the next partition
if our page ends here.
The "end partition" with zero rows but static data can also happen if we
happen to resume paging by giving a column range excluding all data. In this
case we should _not_ include it, since we have already provided the
data in question in the previous page.
Fixes #556
1) Should not reset to the input query state if run repeatedly
2) And if run repeatedly without input state, likewise keep
the internal one active
Fixes #560
"To keep compatibility with scylla-tools-java, it links /etc/scylla to /var/lib/scylla/conf.
The problem with this patchset is that I added SCYLLA_HOME and SCYLLA_CONF to /etc/sysconfig/scylla-server.
However, since the file is marked as a config file, it won't be automatically upgraded.
If the user doesn't upgrade the file manually, scylla-server is still able to run with /var/lib/scylla/conf because we have the symlink, but it never switches to /etc/scylla."
While objects above max_manage_object_size aren't stored in the
LSA segments, they are still considered to belong to the LSA
region and are evictable using that region's evictor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series adds the natural_endpoints API. It adds the implementation to the storage_service and to the storage_service API.
After this series the nodetool command getendpoints should work.
example:
$ bin/nodetool getendpoints keyspace1 standard1 0x5032394c323239385030
127.0.0.2"
This patch adds the API for timeout messages and dropped messages.
For dropped messages, origin has two APIs: one for messages and one for
commands.
Dropped messages return the number of messages per verb, so our API was
renamed to reflect that.
For dropped messages (command) we currently do not have this logic of
throwing messages away before sending, so the API will always return 0.
The total timeout API was removed and will be done on the jmx proxy
level.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
If listen_address is different from broadcast_address, we should use
broadcast_address for the seeds list. Check and ask the user to fix the
configuration, e.g.,
$ scylla --rpc-address 127.0.0.1 --listen-address 127.0.0.1 --broadcast-address 192.168.1.100 --seed-provider-parameters seeds=127.0.0.1
Use broadcast_address instead of listen_address for seeds list: seeds={127.0.0.1}, listen_address=127.0.0.1, broadcast_address=192.168.1.100
Exiting on unhandled exception of type 'std::runtime_error': Use broadcast_address for seeds list
The write handler keeps track of all endpoints that have not yet acked the
mutation verb. It uses the broadcast address as an endpoint id, but if the
local address is different from the broadcast address, acknowledgements from
local endpoints will come from a different address, so the socket address
cannot be used as an acknowledgement source. Origin solves this by sending
"from" in each message, which looks like overhead; solve this by providing
the endpoint's broadcast address in rpc client_info and using that instead.
The restart logic is wrong because C* had a bug in
bf599fb5b062cbcc652da78b7d699e7a01b949ad, which they fixed later, and we
translated the broken version. We must check if there is an existing
endpoint state and call the on_restart() hooks on that, not the newly
available endpoint state.
Spotted while inspecting the code.
Acked-by: Asias He <asias@scylladb.com>
From Avi:
Memtables do not use an allocating_section to guard against allocation
failure, and hence can fail an allocation. Reproducible by changing
perf_mutation to use an allocating type (bytes_type with a nontrivial
size) and making the loop longer.
Fix by using an allocating_section.
Recently, I have introduced cf_stats into the database, propagating all the way
back to the column family. The problem, however, is that some tests create a
column family config themselves instead of going through make_column_family.
That is ultimately ok if those tests are not expected to flush memtables. But
if they are, the cf_stats pointer will be null and we will crash. Although
there are many solutions to this, the one that is in tune with our current
practices is to have the test that requires it provide an empty cf_stats storage
area that can be written to. That's already how we handle the disk directory and
other things like compaction properties.
With this patch, test.py passes again.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch substitutes int64_t for uint32_t as the type for
commitlog_total_space_in_mb. Moving to 64 bits is not strictly needed, since even a
signed 32-bit type would allow us to easily handle 2TB. But since we store that
in the commitlog as a 64-bit value, let's match it.
Moving from unsigned to signed, however, allows us to represent negative
numbers. With that in place, we can change the semantics of the value
slightly, so as to allow a negative number to mean "all memory".
The reason behind this is that the default value, 8GB, is an artifact of the
JVM. We don't need that, and in a many-shards configuration, each shard flushes
the commitlog way too often, since 8GB / many_shards = small_number.
8GB also happens to be a popular heap size for C* in the JVM. For us, we would
like to equate that (at least) with the amount of memory. The problem is how to
do that without introducing new options or changing the semantics of existing
options too radically.
The proposed solution will allow us to still parse C* yaml files, since those
will always have positive numbers, while introducing our own defaults.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
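The resulting semantics can be sketched as follows (a hypothetical helper, not the patch's actual code):

```cpp
#include <cstdint>

// The yaml value is in MB and signed; a negative value means
// "use all of this shard's memory" (our own default), while C* yaml
// files always carry positive numbers and keep their old meaning.
inline int64_t commitlog_total_space_bytes(int64_t config_mb,
                                           int64_t shard_memory_bytes) {
    if (config_mb < 0) {
        return shard_memory_bytes;
    }
    return config_mb * 1024 * 1024;  // MB -> bytes
}
```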
The Debian package system has two types of packages, 'native' and 'non-native'.
A 'native' package is just for Debian; it contains the debian/ directory in the source tar.gz and doesn't have a debian.tar.gz.
A 'non-native' package has an orig.tar.gz, which is the upstream source tarball, plus a debian.tar.gz which contains the debian/ directory.
Scylla is 'native' now but should be 'non-native' since it is not just for Debian, so move debian/ to dist/ubuntu/, make the orig.tar.gz using git-archive-all, copy dist/ubuntu/debian/ to debian/, then generate the debian.tar.gz.
atomic_cell will soon become type-aware, so add helpers to class operation
that can supply the type, as it is available in operation::column.type.
(the type will be used in following patches)
schema_tables manages some boolean columns stored in system tables; it
dynamically creates them from C++ values. But as we lacked a bool->data_value
conversion, the C++ value was converted to an int32_type. Somehow this didn't
cause any problems, but with some pending patches I have, it does.
Add a bool->data_value converting constructor to fix this.
Since bytes is a very generic value that is returned from many calls,
it is easy to pass it by mistake to a function expecting a data_value,
and to get a wrong result. It is impossible for the data_value constructor
to know if the argument is a genuine bytes variable, a data_value of another
type, but serialized, or some other serialized data type.
To prevent misuse, make the data_value(bytes) constructor
(and the complementary data_value(optional<bytes>)) explicit.
When do_stop_native_transport exits, cserver is destroyed which can
happen before cserver->stop(). Fix by capturing cserver in
cserver->stop()'s continuation to extend its lifetime. The same for
thrift server.
scylla: scylla/seastar/core/sharded.hh:327: seastar::sharded<Service>::~sharded()
[with Service = transport::cql_server]: Assertion `_instances.empty()' failed.
When analyzing a recent performance issue, I found it helpful to keep track of
the number of memtables that are currently in flight, as well as how much memory
they are consuming in the system.
Although those are memtable statistics, I am grouping them under the "cf_stats"
structure: the column family being a central piece of the puzzle, it is reasonable
to assume that many metrics about it would be potentially welcome in the future.
Note that we don't want to reuse the "stats" structure in the column family: for one,
the fields don't always map precisely (pending flushes, for instance, only tracks explicit
flushes), and the stats structure is also a lot more complex than we need.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
* seastar 5c10d3e...20bf03b (5):
> do not re-throw exception to get to an exception pointer
> Adding timeout counter to the rpc
> configure.py: support for pkg-config before release 0.28
> future: don't forget to warn about ignored exception
> tutorial: continue network API section
Found by debug build
==10190==ERROR: AddressSanitizer: new-delete-type-mismatch on 0x602000084430 in thread T0:
object passed to delete has wrong type:
size of the allocated type: 16 bytes;
size of the deallocated type: 8 bytes.
#0 0x7fe244add512 in operator delete(void*, unsigned long) (/lib64/libasan.so.2+0x9a512)
#1 0x3c674fe in std::default_delete<dht::range_streamer::i_source_filter>::operator()(dht::range_streamer::i_source_filter*)
const /usr/include/c++/5.1.1/bits/unique_ptr.h:76
#2 0x3c60584 in std::unique_ptr<dht::range_streamer::i_source_filter, std::default_delete<dht::range_streamer::i_source_filter> >::~unique_ptr()
/usr/include/c++/5.1.1/bits/unique_ptr.h:236
#3 0x3c7ac22 in void __gnu_cxx::new_allocator<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >::destroy<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >(std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> >*) /usr/include/c++/5.1.1/ext/new_allocator.h:124
...
Fixes#549.
Being clinically absent-minded, I left aggregate query support (i.e. count(...))
out of the "paging" change set.
This adds repeated paged querying to do aggregate queries (similar to
origin). Uses "batched" paging.
Until the compaction manager API is ready, its failing command
causes problems with nodetool-related tests.
This patch stubs the compaction manager logic so it will not fail.
It will be replaced by an actual implementation when the equivalent
code in compaction is ready.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a compaction info object and an API that returns it.
It will be mapped to the JMX getCompactions that returns a map.
The use of an object is more RESTful and will be better documented in
the swagger definition file.
For compatibility reasons, compaction_strategy should accept both class
name strategy and the full class name that includes the package name.
In origin the resulting name depends on the configuration; we cannot mimic
that as we are using an enum for the type.
So currently the returned class name remains the class itself; we can
consider changing it in the future.
If the name is org.apache.cassandra.db.compaction.Name then it will be
compared as Name.
The error message was modified to report the name it was given.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes#545
"Slight file format change for commitlog segments, now including
a scylla "marker". Allows for fast-fail if trying to load an
Origin segment.
WARNING: This changes the file format, and there is no good way for me to
check if a CL is "old" scylla, or Origin (since "version" is the same). So
either "old" scylla files also fail, or we never fail (until later, and
worse). Thus, if upgrading from older to this patch ensure to
have cleaned out all commit logs first."
Fixes#355
"Implements query paging similar to origin. If driver sets a "page size" in
a query, and we cannot know that we will not exceed this limit in a single
query, the query is performed using a "pager" object, which, using modified
partition ranges and query limits, keeps track of returned rows to "page"
through the results.
Implementation structure sort of mimics the origin design, even though it
is maybe a little bit overkill for us (currently). On the other hand, it
does not really hurt.
This implementation is tested using the "paging_test" subset in dtest.
It passes all tests except:
* test_paging_using_secondary_indexes
* test_paging_using_secondary_indexes_with_static_cols
* test_failure_threshold_deletions
The first two because we don't have secondary indexes yet, the latter
because the test depends on "tombstone_failure_threshold" in origin.
Potential todo: Currently the pager object does not shortcut result
building fully when page limit is exceeded. Could save a little work
here, but probably not very significant."
Allows us to fail fast if someone tries to replay an Origin commit log.
WARNING: This changes the file format, and there is no good way for me to
check if a CL is "old" scylla, or Origin (since "version" is the same). So
either "old" scylla files also fail, or we never fail (until later, and
worse). Thus, if upgrading from older to this patch, likewise, ensure to
have cleaned out all commit logs first.
* Static query method to determine if paging might be required
(very conservative - almost all queries will be paged, methinks).
* Static factory method for pager
* Actual pager implementation
Pager object uses three variables to keep track of paging state:
1.) Last partition key - partition key of last partition processed
-> next partition to start process
2.) Last clustering key, i.e. row offset within last key partition,
i.e. how far we got last time
3.) Max remaining - max rows to process further, i.e. initial limit -
processed so far
Partition ranges are modified/removed so that we begin with "Last key",
if present. (Or end with, in the case of reversed processing)
A counting visitor then keeps count of rows to include in processing.
Basic interface for paging control objects.
We probably do not need virtual behaviour for paging, but on the other
hand it does not really cost much, and it keeps a nice symmetry with
origin.
Allows for having more than one clustering row range set, depending on
PK queried (although right now limited to one - which happens to be exactly
the number multiplexing paging needs... What a coincidence...)
Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.
Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
Note: serial format blob is different compared to origin, due to scylla's
different internal architecture. I.e. we query actual rows.
But drivers etc ignore the content of the blob, it is opaque.
Currently, there are multiple places we can close a session, this makes
the close code path hard to follow. Remove the call to maybe_completed
in follower_start_sent to simplify closing a bit.
- stream_session::follower_start_sent -> maybe_completed()
- stream_session::receive_task_completed -> maybe_completed()
- stream_session::transfer_task_completed -> maybe_completed()
- on receive of the COMPLETE_MESSAGE -> complete()
After running nodetool decommission for node 127.0.0.2, on node 127.0.0.1 I saw:
DEBUG [shard 0] gossip - failure_detector: Forcing conviction of 127.0.0.1
TRACE [shard 0] gossip - convict ep=127.0.0.1, phi=8, is_alive=1, is_dead_state=0
TRACE [shard 0] gossip - marking as down 127.0.0.1
INFO [shard 0] gossip - inet_address 127.0.0.1 is now DOWN
DEBUG [shard 0] storage_service - on_dead endpoint=127.0.0.1
This is wrong since the argument for send_gossip_shutdown should be the
node being shutdown instead of the live node.
Since the introduction of sets::element_discarder sets::discarder is
always given a set, never a single value.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently sets::discarder is used by both the set-difference and the
single-element-removal operations. To distinguish between them the discarder
checks whether the provided value is a set or something else; this won't
work, however, if a set of frozen sets is created.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
This reverts commit fff37d15cd.
Says Tomek (and the comment in the code):
"update_cache() must be called before unlinking the memtable because cache + memtable at any time is supposed to be authoritative source of data for contained partitions. If there is a cache hit in cache, sstables won't be checked. If we unlink the memtable before cache is updated, it's possible that a query will miss data which was in that unlinked memtable, if it hits in the cache (with an old value)."
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
nodetool decommission hangs forever due to a recursive lock.
decommission()
with api lock
shutdown_client_servers()
with api lock
stop_rpc_server()
with api lock
stop_native_transport()
Fix it by calling helpers for stop_rpc_server and stop_native_transport
without the lock.
std::set_difference requires the container to be sorted which is not
true here, use remove_if.
Do not use assert, use throw instead so that we can recover from this
error.
Currently the error code is attached to a future returned by when_all(), which
is never an exceptional one, but it may hold an exceptional future as its
first element. Move the error handling close to where the error it tries to
catch is generated instead.
Let's move the code that prints that a compaction succeeded only
after the code that catches exception on either read or write
fibers. Let's also get rid of done and use repeat instead in
the read fiber.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a write timeout and the last acknowledgement needed for CL happen simultaneously,
_ready can be set to be exceptional by the timeout handler, but since
removal of the response handler happens in a continuation it may be
reordered with last-ack processing, and there _ready will be set again,
which will trigger an assert. Fix it by removing the handler immediately;
no need to wait for a continuation. It makes the code simpler too.
get_cm_stats gets a pointer to a field in the stats object. It
should capture it by value, or a segmentation fault may occur when the
caller goes out of scope.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, we don't let the user know even what is the filename that failed.
That information should be included in the message.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This assert (in the write fiber) would fail if the read fiber failed,
because the variable done will not be set to true.
The use of assert is very bad, because it prevents scylla
from proceeding, which would otherwise be possible.
To solve it, let's trigger an exception if done is not true.
We do have code that will wait for both read and write fibers,
and catch exceptions, if any.
Closes#523.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since 4641dfff24, query_state keeps a
copy of client_state, not a reference. Therefore _cl is no longer
updated by queries using _qp. Fix by using the client_state from _qp.
Fixes#525.
All responses sent from the server have protocol version set to
connection::_version which is set to the version used by the client
in its first message. However, if the protocol version used by the
client is unsupported or invalid, the server should use the latest
version it recognizes.
This solves problem with version negotiation with Java driver. The
driver first sends a request in the latest version it recognizes, if
that fails it retries with the version that server has used in the error
message. If that fails as well it gives up. However, since Scylla always
responds with the same version that the client has used the negotiation
always fails if the client supports more protocol versions than the
server.
Refs #317.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Get initial tokens specified by the initial_token in scylla.conf.
E.g.,
--initial-token "-1112521204969569328,1117992399013959838"
--initial-token "1117992399013959838"
It can be multiple tokens split by comma.
"This series adds the missing functionality that the nodetool describering would work.
It imports the missing functionality from origin.
After this patch the API:
GET /storage_service/describe_ring/{keyspace}
will be available"
This patch changes the API to support describe ring instead of describe
ring jmx, which will be implemented in the jmx server.
The API will return a list of objects instead of a string.
An additional API was added as the equivalent of the jmx call with an
empty param.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the following methods implementation:
getRpcaddress
getRangeToAddressMap
getRangeToAddressMapInLocalDC
describeRing
getAllRanges
Those methods are used as part of the describe_ring method
implementation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The storage server uses the token_range in origin to return information
about the ring.
This imports the structures. The functionality in origin is redundant in
this case and was not imported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Use all the disks except the one for rootfs for RAID0 which stores
scylla data. If only one disk is available warn the user since currently
our AMI's rootfs is not XFS.
[fedora@ip-172-31-39-189 ~]$ cat WARN.TXT
WARN: Scylla is not using XFS to store data. Performance will suffer.
Tested on AWS with 1 disk, 2 disks, 7 disk case.
(cherry picked from commit 49d6cba471)
Mistakenly not included in the yum repository for the AMI patchset, but it's needed.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
(cherry picked from commit 8587c4d6b3)
The nodetool cleanup command is used in many of the tests; because the
API call is not implemented, it causes the tests to fail.
As a workaround until cleanup is implemented, the method
returns successfully.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Normally an API call that is not implemented should fail; there are
cases where, as a workaround, an API call is stubbed. In those cases a warning
is added to indicate that the API is not implemented.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch does the following:
It adds a getter for the completed response messages (i.e. the total
messages that were sent by the server).
It replaces the returned mapping for the statistics to use the key/value
notation that is used on the jmx side.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This adds the read repair statistics to the storage_proxy stats and
implements incrementing the counter values.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The API needs to get the stats from the rpc server, that is hidden from the
messaging service API.
This patch adds a foreach function that goes over all the server stats
without exposing the server implementation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"The main objective of the series is to introduce statistics about ongoing
read/writes and especially those that are done in the background (acknowledged,
but uncompleted), but it contains some cleanups as well."
Add statistics for ongoing reads and ongoing background reads. A read is
a background one if it was acknowledged but there is still work to do to
complete it.
"Commit 4cd9c4c0c5441cf55e280c6f2f2e5529426b9c98 introduced a minor
issue: a wrong snitch instance may be used when updating a Gossiper state
(if I/O CPU is different from CPU0).
In order to fix this issue a local snitch instance on CPU0 should be used,
just like a Gossiper local instance.
We have to move some interfaces to i_endpoint_snitch
from being private in a gossiping_property_file_snitch in order to be
able to access it using snitch_ptr handle."
Don't ignore yet another returned future in reload_configuration().
Since commit 5e8037b50a
storage_service::gossip_snitch_info() returns a future.
This patch takes this into account.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When we access a gossiper instance we use the _gossip_started
state of a snitch, which is set in the gossiper_starting() method.
The gossiper_starting() method, however, is invoked by the gossiper on CPU0
only; therefore the _gossip_started snitch state will be set only for the
instance on CPU0.
Therefore instead of synchronizing the _gossip_started state between
all shards we just have to make sure we check it on the right CPU,
which is CPU0.
This patch fixes this issue.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Align the handling of the prefer_local parameter read
from a snitch property file with the rest of the similar parameters (e.g. dc and rack):
they are read and their values are distributed (copied) across all shards'
instances.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Make reload_gossiper_state() be a virtual method
of a base class in order to allow calling it using a snitch_ptr
handle.
The base class already has a ton of virtual methods, so no harm is
done performance-wise. Using virtual methods instead of doing a
dynamic_cast results in much cleaner code, however.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Move the member and add an access method.
This is needed in order to be able to access this state using
snitch_ptr handle.
This also allows getting rid of the ec2_multi_region_snitch::_helper_added
member, since it duplicates _gossip_started semantics.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* seastar 9ae6407...258daf9 (6):
> rpc server: Add pending and sent messages to server
> scripts: posix_net_conf.sh: Use a generic logic for RPS configuring
> scripts: posix_net_conf.sh: allow passing a NIC name as a parameter
> doc: link to the tutorial
> tutorial: begin documenting the network API
> slab: remove bogus uintptr_t definition
"In 5e8037b50a (gossip: Futurize
add_local_application_state()) , we futurized add_local_application_state.
However, not all of the callers are futurized. Fix it up."
"- Fix snitch names from EC2XXX to Ec2XXX to align with configuration.
- Copy cassandra-rackdc.properties file to /var/lib/scylla/conf
- Set SCYLLA_HOME before booting process"
During testing build, the debugging statement at the end
of the function body (after return statements) causes compilation to
fail due to the flag -Werror=return-type:
service/storage_service.cc: In member function ‘future<> service::storage_service::clear_snapshot(sstring, std::vector<basic_sstring<char, unsigned int, 15u> >)’:
service/storage_service.cc:1358:1: error: control reaches end of non-void function [-Werror=return-type]
Which traces back to 21f84d77. Let's attach a then_wrapped()
clause to parallel_for_each() adding the debug message as
suggested by Avi.
CC: Glauber Costa <glommer@scylladb.com>
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
We are ignoring the future returned by seastar::async. Futurize it so the
caller can wait for the application state to be actually applied.
In addition, dropping the unused add_local_application_states function.
We use boost::any to convert between database values (stored in
serialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type instances about
the corresponding native type, we can eliminate the use of boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
"gossiping_property_file_snitch checks its property
file (cassandra-rackdc.properties) for changes every minute and
if there were changes it re-registers the helper and initiates
re-read of the new DC and Rack values in the corresponding places.
Therefore we need the ability to unregister/register the corresponding subscriber
at the same time when a subscriber list is possibly iterated by
some other asynchronous context on the current CPU.
The current gossiper implementation assumes that subscribers list may not be
changed from the context different from the one that iterates on their list.
So, this had to be fixed.
The locator::topology class was also missing an update_endpoint(ep)
interface, along with the corresponding token_metadata::update_topology(ep) wrapper.
Also there were some bugs in the gossiping_property_file::reload_configuration()
method."
In hindsight, it doesn't make much sense to print an
empty string, so let's only print stdout if it's
non-None and non-empty.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
These functions were empty and now they have the intended code:
- Register the reconnectable_snitch_helper if "prefer_local"
parameter was given the TRUE value.
- Set the application INTERNAL_IP state to listen_address().
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Invoke reload_gossiper_state() and gossip_snitch_info() on CPU0 since
gossiper is effectively running on CPU0 therefore all methods
modifying its state should be invoked on CPU0 as well.
- Don't invoke any method on external "distributed" objects unless their
corresponding per-shard service objects have already been initialized.
- Update a local Node info in a storage_service::token_metadata::topology
when reloading snitch configuration when DC and/or Rack info has changed.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Introduce a subscribers_list class that exposes 3 methods:
- push_back(s) - adds a new element s to the back of the list
- remove(s) - removes an element s from the list
- for_each(f) - invoke f on each element of the list
- make a subscriber_list store shared_ptr to a subscriber
to allow removing (currently it stores a naked pointer to the object).
subscribers_list allows push_back() and remove() to be called while
another thread (e.g. seastar::async()) is in the middle of for_each().
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Simplify subscribers_list::remove() method.
- load_broadcaster: inherit from enable_shared_from_this instead
of async_sharded_service.
* seastar 9d8913a...9ae6407 (2):
> core/memory.cc: Declare member min_free_pages from cpu_pages struct
> http: All http replies should have version set
It may happen that the user will migrate a table to Scylla whose
compaction strategy isn't supported yet, such as date-tiered.
Let's handle that by falling back to size-tiered compaction
strategy and printing a warning message.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Do not hold the api lock while streaming the data, since it might take a
long time and we need to allow other operations to proceed while we are in the
middle of rebuild.
remove() is the function used to remove every reference to a cf from
the compaction manager. This function works by removing cf from the
queue, and waiting for possible ongoing compaction on cf.
However, a cf may be re-queued by compaction manager task if there
is pending compaction by the end of compaction.
If cf is still referenced by the time remove() returns, we could end
up with a use-after-free. To fix that, a task shouldn't re-queue a
cf if it was asked to stop. The stat pending_tasks was also not
being updated when a cf was removed from the task queue.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The only place local_dc is checked during mutation sending is in
send_to_live_endpoints(), but the current code passes it there through several
function-call layers. Simplify the code by getting local_dc directly where it
is used.
Since 4641dfff24 "service: Copy client
state to query state" after executing a query client state needs to be
merged back. If that's not done client_state::_last_timestamp_micros
won't be advanced properly and mutations originating from the same
source may have exactly the same timestamp.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fix some PEP8 problems found in the tester code:
* Wrong spacing around operators
* Lines between class and function definitions
* Fixed some of the larger than 80 column statements
* Removed an unused import
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
_unreachable_endpoints is replicated to all cores. No need to query on
core 0.
This also fixes a bug in storage_proxy::truncate_blocking
which might access _unreachable_endpoints on non-zero cores.
The current code will try to print the output of a
subprocess.Popen().communicate() call even if that
call raised an exception and that output is None.
Let's fix this problem by only printing the output
if it's not None.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Since commit 5613979a85
broadcast address has to be set before it's used for the first
time.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* seastar 501e4cb...9d8913a (3):
> Add mutable to with_lock and do_with
> app-template: disable collectd by default
> reactor: use fdatasync() instead of fsync()
"Currently, CQL requests are processed on the same CPU core where the
connection lives in. This series adds infrastructure for migrating CQL
processing to other cores and implements a round-robin load balancing
algorithm that can be enabled with the "--load-balance=round-robin"
command line option. Load balancing is not enabled by default because we
need to first run performance tests to determine if the simple
round-robin algorithm is sufficient, or whether we need to implement more
sophisticated dynamic load balancing."
In preparation for processing queries on shards other than the one the
connection lives on, merge client state changes in process_request().
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
In preparation for processing CQL requests on a different core than the one
the connection lives on, copy client state to query state for
processing and merge back the results after we're done.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
In preparation for spreading request processing to multiple cores, make
sure CQL response is written out on the connection shard.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
cql_query_test didn't configure the broadcast address before
it was used for the first time.
Broadcast address is an essential Node's configuration.
There is an assert in utils::fb_utils::get_broadcast_address()
that ensures that broadcast address has been properly configured
before it's used for the first time and it is triggered without
this patch.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"Fixes for commitlog (debug) test failures related to shutdowns.
Note that most of the fixes here are only really related to the tests
failing, not really real scylla runs. However, at some point we'll
have real shutdown in scylla as well (not just hard exit), at which
point this becomes more relevant there as well.
Main issue was post-flush continuation chains for stats update
remaining unexecuted, due to task reordering, once the commitlog
object itself had been destroyed. This could have been handled by just
making the stats object a shared pointer, but in general it seems more
prudent to enforce having all tasks completed after shutdown.
* Change commitlog shutdown to use gate+wait for all outstanding ops
(flush, write, timer). Thus we can ensure everything is finished
when returning from "shutdown".
* Fix bug with "commitlog::clear" (test method) not doing the intended deed
* Most importantly, fix the tests themselves, cleaning up old crud, and
fixing invalid assumptions (CL behaviour changed quite a bit since tests
were created), and remove races.
Disclaimer: I've _never_ managed to reproduce the debug tests failing
like in jenkins locally (though I managed to provoke other failures),
but at least jenkins runs with this series have been clean. Knock knock."
Now that #475 is solved and read_indexes() guarantees to return disjoint
sets of keys, the sstable key reader can be simplified; namely, only two key
lookups are needed (the first and the last one) and there is no need for
range splitting.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series add the mighty EC2MultiRegionSnitch and some missing
multi-DC related functionality:
- Use the proper Broadcast Address: either the one from the
.yaml configuration (if present) or the one configured by some
scylla component (e.g. snitch).
- Introduce the ability to switch to internal IPs when connecting
to Nodes in the same data center.
- Store the known internal IPs in the system.peers table and
load them immediately during boot.
This series also contains some related fixes done on the way."
* Do close + fsync on all segments
* Make sure all pending cycle/sync ops are guarded with a gate, and
explicitly wait for this gate on shutdown to make sure we don't
leave hanging flushes in the task queue.
* Fix bug where "commitlog::clear" did not in fact shut down the CL,
due to "_shutdown" being already set.
Note: This is (at least currently) not an issue for anything else than tests,
since we don't shutdown the normal server "properly", i.e. the CL itself
will not go away, and hanging tasks are ok, as long as the sync-all is done
(which it was previously). But, to make tests predictable, and future-proof
the CL, this is better.
sstable level is set to zero by default, but it may be set to
a different value if a new sstable is the result of leveled
compaction. This is done outside write_components.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We were incorrectly setting s.header.min_index_interval to
BASE_SAMPLING_LEVEL, which luckily is the default value for
min index interval. BASE_SAMPLING_LEVEL was also used as
the min index interval when checking if the estimated
number of summary entries is greater than the limit.
To fix problems, get min index interval from schema and
use this value to check the limit.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In addition to what EC2Snitch does, this snitch registers
a reconnectable_snitch_helper that makes messaging_service
connect to internal IPs when it connects to nodes in the same
data center as the current node.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v4:
- Added dual license in newly added files.
New in v3:
- Returned the Apache license.
New in v2:
- Update the license to the latest version. ;)
Add utils::fb_utilities::set_broadcast_address().
Set it to either the broadcast_address or listen_address configuration value,
whichever is set. If neither of the two values is set, abort the
application.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Simplify the utils::fb_utilities::get_broadcast() logic.
reconnectable_snitch_helper implements i_endpoint_state_change_subscriber
and triggers a reconnect using the internal IP to nodes in the
same data center when one of the following events happens:
- on_join()
- on_change() - when INTERNAL_IP state is changed
- on_alive()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v4:
- Added dual license for newly added files.
New in v3:
- Fix reconnect() logic.
- Returned the Apache license.
- Check if the new local address is not already stored in the cache.
- Get rid of get_ep_addr().
New in v2:
- Update the license to the latest version. ;)
Added load_config() function that reads the AWS info and the property file
and distributes the read values to all shards.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This map will contain the (internal) IPs corresponding to specific nodes.
The mapping is also stored in the system.peers table.
So, instead of always connecting to the external IP, messaging_service::get_rpc_client()
will query _preferred_ip_cache, and only if there is no entry for a given
node will it connect to the external IP.
We call init_local_preferred_ip_cache() at the end of the system table init.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Improved the _preferred_ip_cache description.
- Code styling issues.
New in v3:
- Make get_internal_ip() public.
- get_rpc_client(): restore a get_preferred_ip() usage dropped
in v2 by mistake during rebase.
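The lookup described above can be sketched as follows (a minimal model, not Scylla's actual messaging_service; the member and method names mirror the text but the details are assumptions):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// The cache maps a node's external IP to its preferred (internal) IP.
// When no entry exists, the external IP itself is used for the connection.
struct preferred_ip_cache {
    std::unordered_map<std::string, std::string> _preferred_ip_cache;

    // Returns the address get_rpc_client() should connect to.
    std::string get_preferred_ip(const std::string& external_ip) const {
        auto it = _preferred_ip_cache.find(external_ip);
        return it != _preferred_ip_cache.end() ? it->second : external_ip;
    }
};
```

The fallback to the external IP keeps the behavior unchanged for nodes in other data centers, which never get an internal-IP entry.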
This function erases shard_info objects from all _clients maps.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Use remove_rpc_client_one() instead of direct map::erase().
- Ensure messaging_service::stop() blocks until all rpc_protocol::client::stop()
are over.
- Remove the async code from rpc_protocol_client_wrapper destructor - call
for stop() everywhere it's needed instead. Ensure that
rpc_protocol_client_wrapper is always "stopped" when its destructor is called.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v3:
- Code style fixes.
- Killed rpc_protocol_client_wrapper::_stopped.
- Killed rpc_protocol_client_wrapper::~rpc_protocol_client_wrapper().
- Use std::move() for saving the shared pointer before erasing the
entry from _clients in remove_rpc_client_one(), in order to avoid
extra ref count bumping. This makes the code cleaner. It would also
require fewer changes if we decide to increase the _clients size in
the future.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
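The std::move() pattern described above can be sketched like this (the names are hypothetical stand-ins, with a shared_ptr<int> playing the role of the rpc client wrapper):

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

using client_ptr = std::shared_ptr<int>;  // stand-in for the rpc client wrapper

// Moving the shared pointer out of the map before erasing the entry keeps
// the client alive for a subsequent stop() without bumping the ref count.
client_ptr remove_rpc_client_one(std::unordered_map<int, client_ptr>& clients, int id) {
    auto it = clients.find(id);
    if (it == clients.end()) {
        return nullptr;
    }
    auto c = std::move(it->second);  // no ref-count increment here
    clients.erase(it);
    return c;  // the caller can now call stop() on the last owner
}
```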
get_preferred_ips() returns all preferred_ip's stored in system.peers
table.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Get rid of extra std::move().
Scylla is not "daemon" (witch forks twice), but it can be "fork" (forks once) when we don't use "exec" to call startup scripts.
Fixes#495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
"This series adds two types of functionality to the storage_proxy, it adds the
API that returns the timeout constants from the config and it aligned the
metrics of the read, write and range to origin StorageProxy metrics."
read_indexes() will not work for a column family whose minimum
index interval differs from the sampling level, or whose sampling
level is lower than BASE_SAMPLING_LEVEL.
That's because the function was using the sampling level to determine
the interval between indexes that are stored by the index summary.
Instead, a method from downsampling will be used to calculate the
effective interval based on both the minimum_index_interval and
sampling_level parameters.
Fixes issue #474.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
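The effective-interval calculation described above can be sketched as below. The formula follows Cassandra's downsampling scheme and is an assumption here, not a quote of Scylla's code:

```cpp
#include <cassert>

// As the sampling level drops below BASE_SAMPLING_LEVEL, fewer summary
// entries are kept, so the effective spacing between stored indexes grows.
constexpr int BASE_SAMPLING_LEVEL = 128;

int effective_index_interval(int sampling_level, int min_index_interval) {
    return (BASE_SAMPLING_LEVEL * min_index_interval) / sampling_level;
}
```

At full sampling (sampling_level == BASE_SAMPLING_LEVEL) the effective interval equals the configured min index interval, which is why the two were easy to confuse.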
This patch adds the implementation of the read, write and range
estimated histograms and total latency.
After this patch the following url will be available:
/storage_proxy/metrics/read/estimated_histogram/
/storage_proxy/metrics/read
/storage_proxy/metrics/write/estimated_histogram/
/storage_proxy/metrics/write
/storage_proxy/metrics/range/estimated_histogram/
/storage_proxy/metrics/range
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This patch closes the gap between the storage_proxy read, write and range
metrics and the API.
Each of the metrics will have a histogram, an estimated histogram
and a total.
The patch contains the definitions for the following:
get_read_estimated_histogram
get_read_latency
get_write_estimated_histogram
get_write_latency
get_range_estimated_histogram
get_range_latency
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Needs merge with storage_proxy.
This patch exposes the configuration timeout values of the timers.
The timers return their values in seconds; the swagger definition
file was modified to reflect the change.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
We are not removing the range. The current and new nodes responsible for
the range are calculated, and we only need to stream data to
nodes = new nodes - current nodes. E.g.,
Assume we have node 1 and node 2 in the cluster, RF=2. If we remove node2:
Range (3c 25 fa 7e d2 2a 26 b4 , 81 2a a7 32 29 e5 3a 7c ],
current_replica_endpoints={127.0.0.1, 127.0.0.2} new_replica_endpoints={127.0.0.1}
Range (3c 25 fa 7e d2 2a 26 b4 , 81 2a a7 32 29 e5 3a 7c ] already in all replicas
no data will be streamed to node 1 since it already has it.
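The set difference described above can be sketched as follows (IP strings stand in for the real endpoint type):

```cpp
#include <cassert>
#include <set>
#include <string>

// Endpoints to stream to = new replicas minus current replicas.
// A node that is already a replica needs no streaming.
std::set<std::string> endpoints_to_stream(const std::set<std::string>& current,
                                          const std::set<std::string>& new_replicas) {
    std::set<std::string> result;
    for (const auto& ep : new_replicas) {
        if (!current.count(ep)) {
            result.insert(ep);
        }
    }
    return result;
}
```

With the example in the text, 127.0.0.1 is in both sets, so the difference is empty and nothing is streamed.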
This patch adds a definition and a stub for the compaction history. The
implementation should read from the compaction history table and return
an array of results.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
"This series adds an API to get and set the log level.
After this series it will be possible to use the following url:
GET/POST:
/system/logger
/system/logger/{name}"
"This series adds the functionality that is required for nodetool
describecluster
It uses the gossiper for get cluster name and get partitioner. The
describe_schema_versions functionality is missing and a workaround is used so
the command would work.
After this series an example for nodetool describecluster:
./bin/nodetool describecluster
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
127.0.0.1: [48c4e6c8-5d6a-3800-9a3a-517d3f7b2f26]"
"This series add code for computing mutation_partition difference.
For mutations A and B:
diffA = A.difference(B);
diffB = B.difference(A);
AB = A.apply(B);
diffA is the minimal mutation that when applied to B makes it equal
to AB and diffB is the minimal mutation that applied to A results in AB.
Fixes #430."
"The snapshots API need to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implement get_snapshot_details, a
column family method, and wire that up through the storage_service."
Fix for (mainly) test failures (use-after-free).
I.e. the test case test_commitlog_delete_when_over_disk_limit causes a
use-after-free because the test shuts down before a pending flush is done,
and the segment manager is already gone -> crash writing stats.
Now, we could make the stats a shared pointer, but we should never
allow an operation to outlive the segment_manager.
In normal op, we _almost_ guarantee this with the shutdown() call,
but technically, we could have a flush continuation trailing somewhere.
* Make sure we never delete segments from segment_manager until they are
fully flushed
* Make test disposal method "clear" be more defensive in flushing and
clearing out segments
"Tested with:
- start node 1
- insert value
- start node 2
- insert value
- decommission node2
I can see from the log that the data range belonging to node2 is streamed to
node1, cqlsh queries on node1 return all the data, and node2 is not in the
live node list from node1's view."
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still something we should not recommend - unless the CF is totally
empty and not yet used - but we can do a much better job on the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible thing to
do, because dirty memory can grow much more than desired during this. Throughout
this implementation, we try to keep the time during which writes are disabled
to a bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan all tables that we haven't seen before. Those are any tables found in the
CF datadir, that are higher than the highest generation number seen so far. We will link
them to new generation numbers that are sequential to the ones we have so far, and end up
with a new generation number that is returned to the next step
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally."
This series adds histograms to the column family for live cells scanned and
tombstones scanned.
It exposes those histograms via the API instead of the stub implementation
that currently exists.
The update of the histogram values will be added in a different
series.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This is the storage_service implementation of load_new_sstables, and this is
where most of the complication lives.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still something we should not recommend - unless the CF is
totally empty and not yet used - but we can do a much better job on the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible thing to
do, because dirty memory can grow much more than desired during this. Throughout
this implementation, we try to keep the time during which writes are disabled
to a bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan all tables that we haven't seen before. Those are any tables found in the
CF datadir, that are higher than the highest generation number seen so far. We will link
them to new generation numbers that are sequential to the ones we have so far, and end up
with a new generation number that is returned to the next step
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service who is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.
All new tables will start their lives as shared tables, and will be unshared
when possible: this all happens inside add_sstable and there isn't
really anything special on this front.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The current code assumes a particular dir/generation pair. We
will use it for a more generic case. This code could really use some
cleanup, by the way; we should do it later.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).
Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245
However, we can already do slightly better in two ways:
Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.
It will also return the highest generation number found after the reshuffling
process. New writers should start writing after that, so new tables
that are created will have a generation number higher than any of these,
and will therefore be safe.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
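The renumbering rule described above can be sketched as follows (a minimal model under assumed names; the real code links files on disk rather than rewriting integers):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Tables at or below the highest generation already seen are left alone;
// newer ones are relinked under sequential generations after it. Returns
// the highest generation assigned, after which new writers may safely start.
int64_t reshuffle(std::vector<int64_t>& found_generations, int64_t highest_seen) {
    std::sort(found_generations.begin(), found_generations.end());
    int64_t next = highest_seen;
    for (auto& gen : found_generations) {
        if (gen > highest_seen) {
            gen = ++next;  // stand-in for linking the table to the new generation
        }
    }
    return next;
}
```

Note that with no new tables the loop does nothing and the function returns highest_seen unchanged, matching the "completely safe" claim in the text.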
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - especially with the rename.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This will be used, for instance, when importing an SSTable.
We would like to force all new SSTables to sit at level 0 for
compaction purposes.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
In some situations (restoring a backup from load_new_sstables), we want to
change the SSTable generation number. This patch provides a procedure to
achieve that.
It does so by linking the old files to new ones, and then removing the old
ones.
The reason we link instead of removing, is that we want to make sure that in
case there is a crash in the middle, the old data is still accessible.
If the crash happens after the link is done but before we start removing the
old files, that is fine: we will end up with duplicated data that will
disappear after the next compaction.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
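The link-then-remove ordering described above can be sketched with std::filesystem (the filenames are illustrative; the real code operates on sstable component files):

```cpp
#include <filesystem>
#include <fstream>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hard-link every old file to its new name first, and only then remove the
// old names. A crash between the two phases leaves duplicated, but never
// lost, data.
void relink_then_remove(const std::vector<std::pair<fs::path, fs::path>>& files) {
    for (const auto& f : files) {
        fs::create_hard_link(f.first, f.second);  // phase 1: link everything
    }
    for (const auto& f : files) {
        fs::remove(f.first);                      // phase 2: drop old names
    }
}
```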
That is the way to generate groups of files for the SSTables, so we must do it.
Because the links were mostly used by processes like snapshots and backups,
where an external tool would (hopefully) verify the results, it was not that
serious.
But we now plan to use links to bring things into the main directory. It must
absolutely be done right.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
During some situations (restoring a snapshot for instance) we may want a file
to get a different generation. This patch changes the code in create_links
slightly, so that it is able to link not only to a different location, but to
files with a different name, possibly in the same location - that is equivalent
to a generation change.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This is done on behalf of load_new_sstables: we would like to know which
components are present in the file, but without triggering the read for the
rest of the metadata.
As noted by Avi, using this directly can leave the SSTable in an inconsistent
state. We will have to fix it later, since this is not the first offender.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
There is no reason, aside from testing, for a table to just change its
generation number.
There will be, however, when we support loading new sstables. The method
needs to be completely rewritten anyway, so let's make sure the tests are not
using it.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Avoid using long for it; let's use a fixed size instead. Let's use signed
rather than unsigned to avoid upsetting any code that we may have converted.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The change to use consistency_level::ONE in send_batchlog_mutation
sort of fixes #478, but is not 100% correct.
When doing async_remove_from_batchlog, the CL is actually supposed to
be ANY.
Also, we should _not_ remove the batch log mutation from any nodes
if the mutate fails, since having it there in case of failure is sort of
the whole point of it. I.e. async_remove_from_batchlog should not be
called from a "finally", but from a "then".
Refs #478
From Pawel:
This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated keys of the partitions in
the range. In the case of sstables, key_reader is implemented on top of the
partition index.
An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
- cache the partition index - that will also help in all other places where
it is needed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such flag would be set
by mutation reader and cleared during eviction. It will also allow
newly created mutations from memtable to be moved to cache provided that
both their successors and predecessors are already there.
The key_reader part of this patchset adds a lot of new code that probably
won't be used anywhere else, but the alternative would be to always
interleave reads from cache with reads from sstables, and that would be
heavier on the partition index, which isn't cached.
Fixes #185.
For CFStats, one of the things needed is the size used by the snapshots. Since
the bulk of the work is map-reducing the details and adding them together, we
will just call get_snapshot_details for the column family and selectively add
just what we need. No need for a separate method here.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
For each column family, the column family object can provide us with a map
between each snapshot it knows about and two sizes: the total size, and the
"real" (or live) size, which is how much extra space the snapshot is costing us.
This patch map-reduces all CFs to accumulate that system-wide, and then formats
it into a map of "snapshot_details". That is a more convenient format to be
consumed by our json generator.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size includes also the metadata about the snapshot - like the manifest.json
file.
Details follow:
In the original Cassandra code, total size is:
long sizeOnDisk = FileUtils.folderSize(snapshot);
folderSize recurses on directories, and adds file.length() for files. Again, my
understanding is that file_size() gives us the same as Java's length()
method.
The other value, real (or true) size is:
long trueSize = getTrueAllocatedSizeIn(snapshot);
getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does is add up the sizes of the
files within the tree that are "acceptable".
An acceptable file is a file which:
starts with the same prefix as we want (IOW, belongs to the same SSTable, we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.
What this tries to do, is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links, then
if a table goes away, it adds to this size. If it would be there anyway, it does
not.
We can do that in a lot simpler fashion: for each file, we will just look at
the original CF directory, and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and it simplifies the code tremendously,
given that we have to neither list all files in the system - as Cassandra
does - nor check other shards for liveness information - as we would
otherwise have to do.
The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
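The simplified check described above can be sketched like this, with maps and sets standing in for directory listings (names and signature are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// A snapshot file counts toward the "true" size only if it can no longer be
// found in the CF's main directory - i.e. the snapshot's hard link is the
// only remaining copy of the data.
uint64_t true_snapshot_size(const std::map<std::string, uint64_t>& snapshot_files,
                            const std::set<std::string>& cf_dir_files) {
    uint64_t size = 0;
    for (const auto& f : snapshot_files) {
        if (!cf_dir_files.count(f.first)) {
            size += f.second;
        }
    }
    return size;
}
```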
The migrator tells lsa how to move an object when it is compacted.
Currently it is a function pointer, which means we must know how to move
the object at compile time. Making it an object allows us to build the
migration function at runtime, making it suitable for runtime-defined types
(such as tuples and user-defined types).
In the future, we may also store the size there for fixed-size types,
reducing lsa overhead.
C++ variable templates would have made this patch smaller, but unfortunately
they are only supported on gcc 5+.
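The function-pointer-to-object change described above can be sketched as follows (a minimal shape under assumed names, not lsa's actual interface):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Instead of a compile-time function pointer, the migrator is an object with
// a virtual call, so the migration routine can be chosen at runtime
// (e.g. for tuples and user-defined types).
struct migrate_fn_type {
    virtual ~migrate_fn_type() = default;
    virtual void migrate(void* src, void* dst, size_t size) const = 0;
};

// A trivially-copyable type can be moved with a plain memcpy.
struct trivial_migrator final : migrate_fn_type {
    void migrate(void* src, void* dst, size_t size) const override {
        std::memcpy(dst, src, size);
    }
};
```

Storing a per-type size in such an object, as the text suggests for the future, would be a matter of adding another virtual accessor or a data member.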
This reader enables range queries on the row cache. An underlying key_reader
is used to obtain information about the partitions that belong to the
specified range, and if any of them isn't in the cache, an underlying
mutation reader is used to read the missing data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This mutation reader returns mutations from cache that are in a given
range. There may be other mutations in the system (e.g. in sstables)
that won't be returned, so this reader on its own cannot really satisfy
any query.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Combined key reader, just like its mutation equivalents, combines
output from multiple key_readers and provides a single sorted stream
of decorated keys.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
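The combining step can be sketched with sorted vectors standing in for the underlying key_readers (the real reader is asynchronous and would also deduplicate equal keys across streams):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// A min-heap repeatedly yields the smallest head among all input streams,
// producing one sorted stream of keys.
std::vector<std::string> combine_sorted(const std::vector<std::vector<std::string>>& readers) {
    using entry = std::pair<std::string, size_t>;  // (key, reader index)
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;
    std::vector<size_t> pos(readers.size(), 0);
    for (size_t i = 0; i < readers.size(); ++i) {
        if (!readers[i].empty()) {
            heap.push({readers[i][0], i});
        }
    }
    std::vector<std::string> out;
    while (!heap.empty()) {
        auto e = heap.top();
        heap.pop();
        out.push_back(e.first);
        size_t i = e.second;
        if (++pos[i] < readers[i].size()) {
            heap.push({readers[i][pos[i]], i});
        }
    }
    return out;
}
```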
key_readers provide an interface analogous to mutation_readers, but the
only data they return are decorated keys.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since mutation stores all its data externally and the object itself is
basically just a std::unique_ptr<>, there is no need for stdx::optional.
A smart pointer set to nullptr represents a disengaged mutation_opt.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
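The point can be illustrated with a toy type (this is not Scylla's mutation; the names are stand-ins): when the wrapped object is already just a unique_ptr, nullptr can encode "disengaged" with no extra flag.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct mutation_like {
    std::unique_ptr<int> _data;  // stand-in for the externally stored state
};

class mutation_opt {
    mutation_like _m;
public:
    mutation_opt() = default;                        // disengaged: nullptr
    explicit mutation_opt(int v) : _m{std::make_unique<int>(v)} {}
    explicit operator bool() const { return bool(_m._data); }
    mutation_like& operator*() { return _m; }
};
```

Unlike an optional<mutation_like>, this adds no discriminant on top of the pointer, so the wrapper is the same size as the pointer itself.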
"Fixes: #469
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
To do this, the flush_queue had its restrictions eased, and some introspection
methods added."
This patch provides a storage service API to delete a snapshot. Because all
keyspaces and CFs are visible in all shards, we can fetch the
list of keyspaces in the present shard and issue the filesystem operations in
that same shard.
That simplifies the code tremendously, and because there aren't any operations
we need to do prior to the fs ones (unlike the case of create snapshot), we
need no synchronization. Even easier.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This allows for us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.
The filesystem code is a bit ugly: it could be made better by making our file
lister more generic. The first step would be to call it a walker, not a lister...
For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state,
and that's part of making this generic.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
There are situations in which we would like to match more than one directory
type. One example of that, would be a recursive delete operation: we need to
delete the files inside directories and the directories themselves, but we
still don't want a "delete all" since finding anything other than a directory
or a file is an error, and we should treat it as such.
Since there aren't that many types, it should be OK performance-wise to just
use a list. I am using an unordered_set here just because it is easy enough,
but we could relax that later if needed. In any case, users of the
interface should not worry about that; the decision is abstracted away
into lister::dir_entry_types.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
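The interface decision above can be sketched as follows (the enum and the filter function are simplified stand-ins for the real lister):

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

enum class directory_entry_type { regular, directory };
// The caller passes the set of entry types it wants to match.
using dir_entry_types = std::unordered_set<directory_entry_type>;

using dir_entry = std::pair<std::string, directory_entry_type>;

// Keep only the entries whose type is in the wanted set, e.g. both files
// and directories for a recursive delete.
std::vector<dir_entry> filter_entries(const std::vector<dir_entry>& entries,
                                      const dir_entry_types& wanted) {
    std::vector<dir_entry> out;
    for (const auto& e : entries) {
        if (wanted.count(e.second)) {
            out.push_back(e);
        }
    }
    return out;
}
```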
This is certainly the right thing to do and seems to fix #403. However,
I didn't manage to convince myself why this would cause problems for
binomial_heap, given that binomial_heap::erase() calls siftup()
anyway:
void erase(handle_type handle)
{
    node_pointer n = handle.node_;
    siftup(n, force_inf());
    top_element = n;
    pop();
}

void increase(handle_type handle)
{
    node_pointer n = handle.node_;
    siftup(n, *this);
    update_top_element();
    sanity_check();
}
There was confusion between the snapshot key and the keyspace in the
snapshot details; this fixes it.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
"Those are fixes needed for the snapshotting process itself. I have bundled this
in the create_snapshot series before to avoid a rebase, but since I will have to
rewrite that to get rid of the snapshot manager (and go to the filesystem),
I am sending those out on their own."
This adds a workaround for getting the schema_version: it will return only
the schema version of the local node. This is a temporary workaround until
describe_schema_versions is implemented.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This adds the implementation for the get cluster name and get
partitioner name to the storage_service API.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
The API can ask any of the gossiper shards for the cluster name, so
the initialization needs to set it on all of them.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
As long as we guarantee that the execution order for the post ops are
upheld, we can allow insertion of multiple ops on the same key.
Implemented by adding a ref count to each position.
The restriction then becomes that an added key must either be larger
than any already existing key, _OR_ already exist. In the latter case,
we still know that we have not finished this position and signaled
"upwards".
test_setup::do_with_test_directory is missing. For some reason,
the test wasn't failing without it until now. Adding it is the
correct thing to do anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patchset introduces leveled compaction to Scylla.
We don't handle all corner cases yet, but we already have the strategy
and compaction working as expected. Test cases were written and I also
tested the stability with a load of cassandra-stress.
Leveled compaction may output more than one sstable because there is
a limit on the size of sstables: 160M by default.
Handling of partial compaction is still something to be
worked on.
Anyway, it will not be a big problem. Why? Suppose that a leveled
compaction will generate 2 sstables, and scylla is interrupted after
the first sstable is completely written but before the second one is
completely written. The next boot will delete the second sstable,
because it was partially written, but will not do anything with the
first one as it was completely written.
As a result, we will have two sstables with redundant data."
This patch adds the ability to set one or all log levels, get a log level,
and get all logger names.
After this patch the following url will be available:
GET/POST
/system/logger
/system/logger/{name}
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
The system API will include system-related commands; currently it holds
the logger-related API. It holds definitions for the following commands:
get_all_logger_names
set_all_logger_level
get_logger_level
set_logger_level
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This is a helper function that returns a log level name. It will be used
by the API to report the log levels.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
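A level-to-name helper like the one described can be sketched as follows (the enum values here are assumptions; Scylla's logger defines its own level set):

```cpp
#include <cassert>
#include <string>

enum class log_level { error, warn, info, debug, trace };

// Map a log level to the name the API reports.
std::string level_name(log_level l) {
    switch (l) {
    case log_level::error: return "error";
    case log_level::warn:  return "warn";
    case log_level::info:  return "info";
    case log_level::debug: return "debug";
    case log_level::trace: return "trace";
    }
    return "unknown";
}
```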
With the distribute-and-sync method we are using, if an exception happens in
the snapshot creation for any reason (think file permissions, etc.), it will
just hang the server, since our shard won't do the necessary work to
synchronize and note that we did our part (or tried to) in snapshot creation.
Make the then clause a finally, so that the sync part is always executed.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
create_links will fail on one of the shards if one of the SSTables happens to
be shared. It should be fine if the link already exists, so let's just ignore
that case.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Explicitly use up all the memory in the system as best as we can instead.
Still not super reliable, but it should have fewer side effects and work
better with pre-allocated segment files.
Make production_snitch_base constructor signature consistent with
the rest of production snitches.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
A non-empty default value of the configuration file name was preventing
db::config::get_conf_dir() from kicking in when the default snitch constructor
was used (which is the way it's always used from scylla).
Fixes issue #459
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When building the in-memory schema for a column family, we were
ignoring compaction strategy class because of a bug in the
existing code. Example: suppose that you create a column family
with leveled compaction strategy. This option would be ignored
and the default strategy (size-tiered) would be used instead.
Found this problem while working on leveled compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
"This series cleans up a few places in the snitches
code that has been noticed during the work on issues #464
and #459.
The last patch actually fixes the issue #464"
After the previous two patches, the CF directory where the SSTable will live is
guaranteed to always exist: system CFs' directories are touched at boot, while
newly created tables' directories are touched when the creation mutations are announced.
With that in place, there is no more need for the recursive touch in the
SSTables path.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
In Cassandra, when you create a new column family, a directory for it
immediately appears under the KS directory.
In the past, we have made a decision to delay that creation until the first
SSTable is created, which works well in general.
There is a problem, however, for backup restoration: the standard procedure to
call loadNewSSTables is to do that in an empty directory. But the directory
simply won't be there until we create the first SSTable: bummer!
In the current incarnation of the code in schema_tables.cc, there is already
some code that runs on CPU0 only. That is a perfect place for the directory
creation. So let's do it.
After this patch, a directory for the CF appears right after the CF creation.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Current code calls make_directory, which will fail if the directory already exists.
We didn't use this code path much before, but once we start creating CF directories
on CF creation - and not on SSTable creation, that will become our default method.
Use touch_directory instead
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Adapt our compaction code to start writing a new sstable if the
one being written reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Useful for leveled strategy which looks for overlapping sstables
by checking if token range overlaps.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
* seastar a2523ae...8207f2c (3):
> rwlock: provide lock / unlock semantics
> with_lock: run a function under a lock
> rwlock: add documentation to the rwlock module
Fixes spurious failures in test_commitlog_discard_completed_segments
* Do an explicit sync on all segments to prevent async flushes from keeping
segments alive.
* Use a counter instead of actual file counting to avoid racing with
pre-allocation of segments.
This adds an implementation for the stream manager metrics.
The following URLs will be available:
/stream_manager/metrics/outbound
/stream_manager/metrics/incoming/{peer}
/stream_manager/metrics/incoming
/stream_manager/metrics/outgoing/{peer}
/stream_manager/metrics/outgoing
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This adds the swagger definition file for the stream manager. The API is
based on the StreamManagerMBean and the StreamMetrics.
The following commands were added:
get_current_streams
get_current_streams_state
get_all_active_streams_outbound
get_total_incoming_bytes
get_all_total_incoming_bytes
get_total_outgoing_bytes
get_all_total_outgoing_bytes
The Fedora base image has changed, so we need to add the "hostname" utility,
which is used by the Docker-specific launch script, to our image.
Fixes Scylla startup.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
"Fixes the crashes in debug mode with the flush queue test, and
simplifies and cleans up the queue itself.
Aforementioned crashes happened due to reordering with the signalling
loop in the previous version. A completing task could race with a reordered
loop continuation over who would get to signal and remove an item.
Rewritten to use much simpler promise chaining instead (which also allows
the return value to propagate from the pre-op to the post-op), ensuring only
one actor modifies the queue entry."
The previous version looped on post execution and signalling of waiters.
This could "race" with an op just finishing if task reordering happened.
This version simplifies the code significantly (and raises the question of why
it was not written like this in the first place... shame on me) by simply
building a promise-dependency chain between the _previous_ queue item and the
next instead.
Also, the code now handles propagation of the return value from the "Func"
pre-op to the "Post" op, with exceptions automatically handled.
xfs doesn't like writes beyond eof (exactly at eof is fine), and due
to continuation reordering, we sometimes do that.
Fix by pre-truncating the segment to its maximum size.
Re-check file size overflow after each cycle() call (new buffer),
otherwise we could write more than allowed when storing a mutation
larger than the current buffer size (current pos + sizeof(mut) < max_size, but
after the cycle required by sizeof(mut) > buf_remain, the former might not be
true anymore).
"Adds a small utility queue and through this enforces memtable flush ordering
such that a flush may _run_ unchecked, however the "post" operation may
execute once all "lower numbered" (i.e. lower replay position) post-ops
have finished.
This means that:
a.) Callbacks to commitlog are now guaranteed to fulfill ordering criteria
b.) Calling column_family::flush() and waiting for the result will also
wait for any previously initiated flushes to finish. But not those
initiated _after_."
Small utility to order operation->post operation
so that the "post" step is guaranteed to only be run
when all "post"-ops for lower valued keys (T) have been completed
This is a generalized utility mainly to be testable.
Before:
$ nodetool info
ID : a5adfbbf-cfd8-4c88-ab6b-6a34ccc2857c
Gossip active : false
After:
$ nodetool info
ID : a5adfbbf-cfd8-4c88-ab6b-6a34ccc2857c
Gossip active : true
Fixes #354.
* seastar 78e3924...a2523ae (7):
> core: fix pipe unread
> Merge 'xfs-extents'
> Merge "separate-dma-alignment"
> output_stream: wait for stream to be taken out of poller in case final flush returns exception.
> reactor: Use more widely compatible xfs include
> readme: Add xfslibs-dev to Ubuntu deps
> pipe: add unread() operation
The first problem is the while loop around the code that processes prestate.
That's wrong because there may be a need to read more data before continuing
to process a prestate.
The second problem is that the code assumed a prestate would be processed
at once, and then unconditionally processed the current state.
Both problems are likely to happen when reading a large buffer because more
than one read may be required.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
I was mildly annoyed by seeing two warnings about the same directory not
being XFS, when the sstable directory and the commitlog directory are the
same one (I don't know if this is typical, but this is what I do in all
my tests...). So I wrote this trivial patch to make sure not to test the
same directory twice.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
"With this, a new node can stream data from existing nodes when it joins the cluster.
I tested with the following:
1) start node 1
2) insert data into node 1
3) start node 2
I can see from the logger that data is streamed correctly from node 1
to node 2."
Add code to actually stream data from other nodes during bootstrap.
One version returns only the ranges
std::vector<range<token>>
Another version returns a map
std::unordered_map<range<token>, std::unordered_set<inet_address>>
which is converted from
std::unordered_multimap<range<token>, inet_address>
They are needed by token_metadata::pending_endpoints_for,
storage_service::get_all_ranges_with_strict_sources_for and
storage_service::decommission.
Given the current token_metadata and the new token which will be
inserted into the ring after bootstrap, calculate the ranges this new
node will be responsible for.
This is needed by boot_strapper::bootstrap().
"This series adds EC2Snitch.
Since both GossipingPropertyFileSnitch and the EC2SnitchXXX snitch family
use the same property file, it was logical to share the corresponding
code. Most of this series does just that... "
While trying to debug an unrelated bug, I was annoyed by the fact that parsing
caching options keeps throwing exceptions all the time. Those exceptions have no
reason to happen: we try to convert the value to a number, and if we fail we
fall back to one of the two blessed strings.
We could just as easily test for those strings beforehand and avoid all of
that.
While we're at it, the exception message should show the value of "r", not "k".
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here, is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Checks the following:
- That EC2Snitch is able to receive the availability zone from EC2.
- That the resulting DC and RACK values are distributed among all
shards.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This snitch will read the EC2 availability zone and set the DC
and RACK as follows:
If availability zone is "us-east-1d", then
DC="us-east" and RACK="1d".
If cassandra-rackdc.properties contains "dc_suffix" field then
DC will be appended with its value.
For instance if dc_suffix=_1_cassandra, then in the example above
DC=us-east_1_cassandra
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This is a configuration file used by GossipingPropertyFileSnitch and
EC2SnitchXXX snitches family.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Move the property file parsing code into the production_snitch_base class.
- Make the parsing code more general:
  - Save the parsed keys in a hash table.
  - Check for only two types of errors:
    - Repeated keys.
    - Unsupported keys: keep a set of all supported keys and check each
      parsed key against it.
- Added the production_snitch_base.cc file.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This function returns the directory containing the configuration
files. It takes the environment variables into account as follows:
- If SCYLLA_CONF is defined, that is the directory.
- Else, if SCYLLA_HOME is defined, then $SCYLLA_HOME/conf is the directory.
- Otherwise "conf" is the directory, i.e. the configuration files are
looked up at ./conf.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Updated get_conf_dir() description.
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We still need to write a manifest when there are no files in the snapshot.
But because we never reach the touch_directory part of the sstables loop in
that case, nobody would have created jsondir.
Since now all the file handling is done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We currently have one optimization that returns early when there are no tables
to be snapshotted.
However, because of the way we are writing the manifest now, this will cause
the shard that happens to have tables to be waiting forever. So we should get
rid of it. All shards need to pass through the synchronization point.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
If we are hashing more than one CF, the snapshots themselves will all have the
same name. This will cause the files from one of them to spill into the other
when writing the manifest.
The proper hash key is the jsondir: that one is unique per manifest file.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch fixes an issue with the read latency estimated histogram
implementation and adds a call for the estimated number of sstables
histogram.
The latter is not yet implemented on the database side.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
This patch adds the read and write latency estimated histogram support,
and adds an estimated histogram of the number of sstables that were used
in a read.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Taking a time measurement of an operation can cause performance degradation,
so this patch adds sampling support to the estimated histogram.
It allows adding a sample together with a counter that holds the actual
total so far, so the sample is counted as multiple entries in the
estimated histogram.
The total count of the entries in the histogram will equal the _count
parameter.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
throw exceptions::invalid_request_exception("system keyspace is not user-modifiable");
}
// keyspace name
std::regex name_regex("\\w+");
if (!std::regex_match(name, name_regex)) {
    throw exceptions::invalid_request_exception(sprint("\"%s\" is not a valid keyspace name", _name.c_str()));
}
if (name.length() > schema::NAME_LENGTH) {
    throw exceptions::invalid_request_exception(sprint("Keyspace names shouldn't be more than %d characters long (got \"%s\")", schema::NAME_LENGTH, _name.c_str()));
"Total space used for commitlogs. If the used space goes above this value, Cassandra rounds up to the next nearest segment multiple and flushes memtables to disk for the oldest commitlog segments, removing those log segments. This reduces the amount of data to replay on startup, and prevents infrequently-updated tables from indefinitely keeping commitlog segments. A small total commitlog space tends to cause more flush activity on less-active tables.\n" \
"Log WARN on any batch size exceeding this value in kilobytes. Caution should be taken on increasing the size of this threshold as it can lead to node instability." \
"The IP address a node tells other nodes in the cluster to contact it by. It allows the public and private addresses to be different. For example, use the broadcast_address parameter in topologies where not all nodes have access to other nodes by their private IP addresses.\n" \
"If your Cassandra cluster is deployed across multiple Amazon EC2 regions and you use the EC2MultiRegionSnitch, set the broadcast_address to the public IP address of the node and the listen_address to the private IP." \
) \
val(initial_token,sstring,/* N/A */,Unused, \
val(initial_token,sstring,/* N/A */,Used, \
"Used in the single-node-per-token architecture, where a node owns exactly one contiguous range in the ring space. Setting this property overrides num_tokens.\n" \
"If you are not using vnodes, or have num_tokens set to 1 or unspecified (#num_tokens), you should always specify this parameter when setting up a production cluster for the first time and when adding capacity. For more information, see this parameter in the Cassandra 1.1 Node and Cluster Configuration documentation.\n" \
"This parameter can be used with num_tokens (vnodes ) in special cases such as Restoring from a snapshot." \
"RPC address to broadcast to drivers and other Cassandra nodes. This cannot be set to 0.0.0.0. If blank, it is set to the value of the rpc_address or rpc_interface. If rpc_address or rpc_interface is set to 0.0.0.0, this property must be set.\n" \
) \
val(rpc_port,uint16_t,9160,Used, \
val(api_address,sstring,"",Used,"Http Rest API address") \
val(api_ui_dir,sstring,"swagger-ui/dist/",Used,"The directory location of the API GUI") \
val(api_doc_dir,sstring,"api/api-doc/",Used,"The API definition file directory") \
val(load_balance,sstring,"none",Used,"CQL request load balancing: 'none' or 'round-robin'") \
val(consistent_rangemovement,bool,true,Used,"When set to true, range movements will be consistent. It means: 1) it will refuse to bootstrap a new node if other bootstrapping/leaving/moving nodes are detected. 2) data will be streamed to a new node only from the node which is no longer responsible for the token range. Same as -Dcassandra.consistent.rangemovement in cassandra") \
val(join_ring,bool,true,Used,"When set to true, a node will join the token ring. When set to false, a node will not join the token ring. User can use nodetool join to initiate ring joining later. Same as -Dcassandra.join_ring in cassandra.") \
val(load_ring_state,bool,true,Used,"When set to true, load tokens and host_ids previously saved. Same as -Dcassandra.load_ring_state in cassandra.") \
val(replace_node,sstring,"",Used,"The UUID of the node to replace. Same as -Dcassandra.replace_node in cassandra.") \
val(replace_token,sstring,"",Used,"The tokens of the node to replace. Same as -Dcassandra.replace_token in cassandra.") \
val(replace_address,sstring,"",Used,"The listen_address or broadcast_address of the dead node to replace. Same as -Dcassandra.replace_address.") \
val(replace_address_first_boot,sstring,"",Used,"Like the replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.") \
throw std::runtime_error("num_tokens must be >= 1");
}
// if (numTokens == 1)
//     logger.warn("Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");
if (num_tokens == 1) {
    logger.warn("Picking random token for a single vnode. You should probably add more vnodes; failing that, you should probably specify the token manually");
throw std::runtime_error(sprint("A node required to move the data consistently is down (%s). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false", source_ip));