"This patch set ensures we can correctly handle queries
where the minimum token is specified."
* 'min-token/v3' of github.com:duarten/scylla:
cql_query_test: Add test case for min/max token bounds
token_restriction: Deal with minimum tokens
partitioner: Parse token from bytes
This object, similarly to a global_schema_ptr, allows to dynamically
create the trace_state_ptr objects on different shards in a context
of the original tracing session.
This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.
Fixes#1678Fixes#1647
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This changes are for running scylla during setup.
It contains the following changes:
1. get the current version from the command line (as the syclla does not
run at this stage).
2. It support a mode parameter in the command line to indicate that we
running during the installation.
3. It accept an external uuid that will be used with all interaction
with the check_version server.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
Fixes#1720.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing sstable with lowest first key, when compacting
a level > 0. This resulted in performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
Fixes#1719.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.
Refs #1446.
Fixes#1684.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.
Fixes#1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
This patch fixes a bug where queries such as the following are not
handled properly:
"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"
Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.
Ref #1139
Ref #693Fixes#1717
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is parse a particular token and explicitly handle the
case when the minimum token is specified.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 5b7252d...9e1d5db (5):
> prometheus: prevent illegal prometheus names
> scollectd: raw_to_value should not use network order
> semaphore: Introduce get_units()
> core::scollectd: truncate the identifiers fields on a 63 characters boundary
> Merge "Fix ASAN errors in debug builds" from Tomasz
Before: the range is split only once, so it is split into 2 sub ranges
INFO 2016-09-29 15:52:43,625 [shard 0] repair - target_partitions=100, estimated_partitions=537, ranges.size=2,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8995307144058513359], (8995307144058513359, 8997061146192366917]}
After: the range is split mulitple times, resulting 16 sub ranges.
INFO 2016-09-29 15:55:07,934 [shard 0] repair - target_partitions=100, estimated_partitions=67, ranges.size=16,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8993772392191391496], (8993772392191391496, 8993991642458123191],
(8993991642458123191, 8994210892724854885], (8994210892724854885, 8994430142991586580],
(8994430142991586580, 8994649393258318274], (8994649393258318274, 8994868643525049969],
(8994868643525049969, 8995087893791781664], (8995087893791781664, 8995307144058513359],
(8995307144058513359, 8995526394325245053], (8995526394325245053, 8995745644591976748],
(8995745644591976748, 8995964894858708443], (8995964894858708443, 8996184145125440138],
(8996184145125440138, 8996403395392171832], (8996403395392171832, 8996622645658903527],
(8996622645658903527, 8996841895925635222], (8996841895925635222, 8997061146192366917]}
Without this patch, repair can do checksum with a range with a lot of
partitions, not the expected less than 100 partitions per checksum. This
can lead to unncessary data transfer since the checksum is too coarse.
For instacne, as above, if the checksum of 1 out of 537 partitions is
different, the whole 527 partitions will be synced.
Fixes#1613
Message-Id: <0775c20c485c105df5f10bd685048227f074c365.1475137029.git.asias@scylladb.com>
The snappy_compress() function expects the "compressed_length" parameter
to contain the actual output buffer length but now we're passing random
garbage from the stack.
Fixes#1711
Message-Id: <1475132127-316-1-git-send-email-penberg@scylladb.com>
* seastar 2b55789...5b7252d (3):
> Merge "rpc: serialize large messages into fragmented memory" from Gleb
> Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz
> test_runner: avoid nested optionals
Includes patch from Gleb to adapt to seastar changes.
"This series improves repair by
1) using less streaming sessions
2) reducing unnecessary streaming traffic
3) fixing a hang during shutdown
See commit log for "repair: Reduce stream_plan usage", "repair: Reduce
unnecessary streaming traffic" and "streaming: Fail streaming sessions
during shutdown" for details.
Tested with repair_additional_test.py."
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.
This counter is named for consistency after similar counters from transport/.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".
We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Also make sure to not listen on the same exact address twice in case
listen_address == broadcast_address. Scylla configuration code does not
allow such thing to be configured, but better to be safe.
Message-Id: <20160927102316.GO32178@scylladb.com>
Print when the node will be removed from gossip membership, e.g.,
INFO 2016-09-27 08:54:49,262 [shard 0] gossip - Node 127.0.0.3 will be
removed from gossip at [2016-09-30 08:54:48]: (expire = 1475196888294489339,
now = 1474937689262295270, diff = 259199 seconds)
The expire time which is used to decide when to remove a node from
gossip membership is gossiped around the cluster. We switched to steady
clock in the past. In order to have a consistent time_point in all the
nodes in the cluster, we have to use wall clock. Switch to use
system_clock for gossip.
Fixes#1704
"The prometheus project and its sub project does not have RPM/DEB packaging yet,
but it does have binaries for download.
This series adds an installation script that download install and run as a
service the node_exporter. For os that uses systemd it has a spec file ready
that will be package with the system. For ubuntu a service file will be created
when running the installer.
After this series running node_exporter_install a node_exporter will be running
as a service on the machine."
In patch ac619820 (streaming: Switch to use make_streaming_reade), we
switched to use make_streaming_reader for streaming. In repair, the
checksum phases also uses a mutation reader. For the same reasons (no
pollution to row cache, bounded new data after the reader is created),
switch repair checksum calculation to use the make_streaming_reader too.
Fixes#382Fixes#1682
Message-Id: <9e0ecda861bb0b6f690da5e2378b208159ffa41c.1474933195.git.asias@scylladb.com>
From Asias:
With this series, streaming and repair are improved:
- streaming, repair will not pollute the row cache on the sender side
any more. Currently, we are risking evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data from they
read from disk.
- less data will be sent becasue the reader will only return existing
data before the point of the reader is created, plus bounded amount
of writes which arrive later. This helps reducing the streaming time
in the case new data is being inserted all the time while streaming is
in progress. E.g., adding a new node while there is a lot of cql write
workload.
Fixes#382 and #1682
Using make_streaming_reader for streaming on the sender side, it has
the following advantages:
- streaming, repair will not pollute the row cache on the sender side
any more. Currently, we are risking evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data from they
read from disk.
- less data will be sent becasue the reader will only return existing
data before the point of the reader is created, plus bounded amount
of writes which arrive later. This helps reducing the streaming time
in the case new data is being inserted all the time while streaming is
in progress. E.g., adding a new node while there is a lot of cql write
workload.
Fixes#382Fixes#1682
The make_streaming_reader returns a combined mutation reader reads
mutations from sstables and memtable. The memtable reader handles
memtable flushing automatically so no special handling is needed here.
It will be used by streaming soon.
When there's no free disk, it asks to select disks from empty list:
"Please select disks from following list:
type 'done' to finish selection. selected:"
We should avoid to ask it, abort RAID setup instead.
Fixes#1673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474429218-28382-1-git-send-email-syuu@scylladb.com>
When using multiple physical network interfaces, set this to true to
listen on broadcast_address in addition to the listen_address, allowing
nodes to communicate in both interfaces. Ignore this property if the
network configuration automatically routes between the public and
private networks such as EC2.
Message-Id: <20160921094810.GA28654@scylladb.com>
If the remote peers have the same checksum, we can only fetch from
one of the peer node instead of all of them since they all have the same
data anyway. No need to fetch from all of them.
In addition to above optimization, if the local peer has no data, we can
skip sending the data back to the remote peer. Due to the fact that all
the remote peers have the same checksum and local peer has no data, so
each and every remote peer has all the data. There is no need to merge
the remote data with local data and send back the merged data back to
remote peers.
Refs: #1617
Right now, we are using one stream_plan for each range of a column
family. This generates tons of stream_plans and stream_sessions. Each
stream_plan can transfer multiple ranges and column families. We can
use a single stream_plan to stream datas for multiple ranges and column
families, so that 1) overhead of stream_plan/session negotiation is
reduced 2) it is much easier to debug/monitor few stream_sessions
Fixes#1685
"When a node is decommissioned, its gossip state will not be removed from gossip
immediately. It will only be removed 3 days later which helps nodes that were
down when the node was decommissioned to know decommission later when they are
up again.
This series improves the logging to reduce confusion when a node tries to
talking to a decommissioned node. In addition, we now do not try to talk to the
decommissioned in the unreachable_endpoints gossip round.
Fixes#1615"
* tag 'asias/loggging_decommissioned_nodes/v1' of github.com:cloudius-systems/seastar-dev:
gossip: Make two log items debug level
gossip: Print node status when node is UP or DOWN
gossip: Ignore the node which is decommissioned in gossip round
gossip: Print convict debug info only when the node is alive
gossip: Add more timing log in add_expire_time_for_endpoint
streaming: Print on_remove and on_restart log when peer exists
streaming: Introduce has_peer in stream_manager