When the max sstable size is increased, higher levels suffer from
starvation, because we decide to compact a given level only if the
following calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level(L)
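A minimal sketch of that trigger (identifiers are illustrative, not the
actual Scylla names):

    // Level-compaction trigger described above: a level is picked for
    // compaction only once it exceeds its target size by more than 0.1%.
    #include <cstdint>

    bool level_needs_compaction(uint64_t level_size_bytes,
                                uint64_t max_size_for_level_bytes) {
        double score = double(level_size_bytes) / double(max_size_for_level_bytes);
        return score > 1.001;
    }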
Fixes #1720.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a8ab4b8f37)
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing the sstable with the lowest first key when
compacting a level > 0. This resulted in a performance problem because, for
example, L1->L2 may develop a huge overlap over time.
The last compacted key will now be stored for each level to ensure a sort of
"round robin" selection of sstables for compaction at levels >= 1.
That's also done by C*, which was once affected by the same issue, as
described in https://issues.apache.org/jira/browse/CASSANDRA-6284.
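A minimal sketch of the selection scheme, assuming the level's sstables are
sorted by first key (names are illustrative):

    // Round-robin candidate selection for levels >= 1: resume from where
    // the previous compaction of this level left off, instead of always
    // taking the sstable with the lowest first key.
    #include <string>
    #include <vector>

    struct sstable_info {
        std::string first_key;  // first partition key in the sstable
    };

    const sstable_info* pick_compaction_candidate(
            const std::vector<sstable_info>& level,    // sorted by first_key
            const std::string& last_compacted_key) {   // remembered per level
        for (const auto& sst : level) {
            if (sst.first_key > last_compacted_key) {
                return &sst;  // next sstable past the last compacted key
            }
        }
        // Wrap around to the start of the level.
        return level.empty() ? nullptr : &level.front();
    }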
Fixes #1719.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a3bf7558f2)
There is a limit to the concurrency of sstable readers on each shard.
When this limit is exhausted (currently 100 readers), readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.
This timeout should not apply to readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.
The change sets no timeout for streaming readers at the replica level,
similarly to what we do for system table readers.
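A sketch of the distinction using Seastar-style semaphore units (modern
Seastar include paths assumed; function names are hypothetical and the
actual Scylla code differs):

    #include <seastar/core/semaphore.hh>
    using namespace seastar;

    // Per-shard cap on concurrent sstable readers.
    static thread_local semaphore reader_concurrency_sem{100};

    // CQL reads give up after read_request_timeout_in_ms: the coordinator
    // has stopped waiting, so executing the read would be wasted work.
    future<semaphore_units<>> wait_for_cql_reader(semaphore::duration timeout) {
        return get_units(reader_concurrency_sem, 1, semaphore::clock::now() + timeout);
    }

    // Streaming (and system table) readers wait with no timeout at all.
    future<semaphore_units<>> wait_for_streaming_reader() {
        return get_units(reader_concurrency_sem, 1);
    }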
Fixes #1741.
Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 2a5a90f391)
The CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for a
request's memory only while its frame is being read from the connection,
not during actual request execution. As a result, too many requests may
be allowed to execute and we may run out of memory.
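A minimal sketch of the idea using Seastar's with_semaphore, holding the
request's memory units for the full execution (names and the budget are
hypothetical; not the actual Scylla code):

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>
    using namespace seastar;

    static thread_local semaphore request_memory{16 << 20};  // e.g. 16 MB budget

    future<> execute_request() { return make_ready_future<>(); }  // placeholder

    future<> handle_request(size_t request_memory_estimate) {
        // The units are held until the inner future resolves, i.e. for the
        // whole request execution, not just while its frame is being read.
        return with_semaphore(request_memory, request_memory_estimate, [] {
            return execute_request();
        });
    }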
Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7e25b958ac)
This object, similarly to a global_schema_ptr, allows trace_state_ptr
objects to be created dynamically on different shards in the context
of the original tracing session.
This object creates a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
node.
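A simplified sketch of the idea (the real class and its members differ;
this only shows capturing a shard-agnostic session identity on the origin
shard and rebuilding state elsewhere; modern Seastar API assumed):

    #include <seastar/core/smp.hh>

    struct trace_session_id { /* session identity: id, flags, ... */ };

    class global_trace_state_ptr {
        unsigned _origin_shard;
        trace_session_id _session;  // shard-agnostic, safe to copy around
    public:
        explicit global_trace_state_ptr(trace_session_id s)
            : _origin_shard(seastar::this_shard_id()), _session(s) {}

        // On the origin shard the original trace state can be used directly;
        // on any other shard a secondary tracing session is created from the
        // captured identity, like we do for a remote node.
        bool on_origin_shard() const {
            return seastar::this_shard_id() == _origin_shard;
        }
        const trace_session_id& session() const { return _session; }
    };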
Fixes #1678
Fixes #1647
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
(cherry picked from commit 7e180c7bd3)
Paging code assumes that a clustering row range [a, a] contains only one
row, which may not be true. Another problem is that it tries to use the
range<> interface for dealing with clustering key ranges, which doesn't
work because of the lack of a correct comparator.
Refs #1446.
Fixes #1684.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit eb1fcf3ecc)
* seastar 5b7252d...9e1d5db (5):
> prometheus: prevent illegal prometheus names
> scollectd: raw_to_value should not use network order
> semaphore: Introduce get_units()
> core::scollectd: truncate the identifiers fields on a 63 characters boundary
> Merge "Fix ASAN errors in debug builds" from Tomasz
Before: the range is split only once, so it is split into 2 sub-ranges:
INFO 2016-09-29 15:52:43,625 [shard 0] repair - target_partitions=100, estimated_partitions=537, ranges.size=2,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8995307144058513359], (8995307144058513359, 8997061146192366917]}
After: the range is split multiple times, resulting in 16 sub-ranges:
INFO 2016-09-29 15:55:07,934 [shard 0] repair - target_partitions=100, estimated_partitions=67, ranges.size=16,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8993772392191391496], (8993772392191391496, 8993991642458123191],
(8993991642458123191, 8994210892724854885], (8994210892724854885, 8994430142991586580],
(8994430142991586580, 8994649393258318274], (8994649393258318274, 8994868643525049969],
(8994868643525049969, 8995087893791781664], (8995087893791781664, 8995307144058513359],
(8995307144058513359, 8995526394325245053], (8995526394325245053, 8995745644591976748],
(8995745644591976748, 8995964894858708443], (8995964894858708443, 8996184145125440138],
(8996184145125440138, 8996403395392171832], (8996403395392171832, 8996622645658903527],
(8996622645658903527, 8996841895925635222], (8996841895925635222, 8997061146192366917]}
Without this patch, repair can compute a checksum over a range with a lot
of partitions, not the expected fewer than 100 partitions per checksum.
This can lead to unnecessary data transfer since the checksum is too
coarse. For instance, as above, if the checksum of 1 out of 537 partitions
is different, all 537 partitions will be synced.
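A sketch of the splitting rule the numbers above illustrate: keep halving
a range until its estimated partition count falls under the target (the
real estimator, split factor, and bookkeeping differ):

    #include <cstdint>
    #include <utility>
    #include <vector>

    using token_range = std::pair<int64_t, int64_t>;  // (start, end], simplified

    void split_for_checksum(token_range r, uint64_t estimated_partitions,
                            uint64_t target_partitions,  // e.g. 100
                            std::vector<token_range>& out) {
        if (estimated_partitions <= target_partitions) {
            out.push_back(r);  // small enough for one checksum round
            return;
        }
        int64_t mid = r.first + (r.second - r.first) / 2;
        uint64_t half = estimated_partitions / 2;
        split_for_checksum({r.first, mid}, half, target_partitions, out);
        split_for_checksum({mid, r.second}, estimated_partitions - half,
                           target_partitions, out);
    }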
Fixes #1613
Message-Id: <0775c20c485c105df5f10bd685048227f074c365.1475137029.git.asias@scylladb.com>
The snappy_compress() function expects the "compressed_length" parameter
to contain the actual output buffer length, but currently we're passing
random garbage from the stack.
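For reference, correct usage of the snappy C API: *compressed_length is an
in/out parameter and must be initialized to the output buffer's capacity
before the call:

    #include <cstddef>
    #include <snappy-c.h>
    #include <vector>

    bool compress(const char* in, size_t in_len, std::vector<char>& out) {
        size_t out_len = snappy_max_compressed_length(in_len);
        out.resize(out_len);
        // Leaving out_len uninitialized here is exactly the bug being fixed:
        // snappy_compress() reads it as the available buffer size.
        if (snappy_compress(in, in_len, out.data(), &out_len) != SNAPPY_OK) {
            return false;
        }
        out.resize(out_len);  // on success it holds the real compressed size
        return true;
    }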
Fixes #1711
Message-Id: <1475132127-316-1-git-send-email-penberg@scylladb.com>
* seastar 2b55789...5b7252d (3):
> Merge "rpc: serialize large messages into fragmented memory" from Gleb
> Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz
> test_runner: avoid nested optionals
Includes patch from Gleb to adapt to seastar changes.
"This series improves repair by
1) using fewer streaming sessions
2) reducing unnecessary streaming traffic
3) fixing a hang during shutdown
See commit log for "repair: Reduce stream_plan usage", "repair: Reduce
unnecessary streaming traffic" and "streaming: Fail streaming sessions
during shutdown" for details.
Tested with repair_additional_test.py."
Also make sure not to listen on the exact same address twice in case
listen_address == broadcast_address. The Scylla configuration code does
not allow such a thing to be configured, but better to be safe.
Message-Id: <20160927102316.GO32178@scylladb.com>
Print when the node will be removed from gossip membership, e.g.,
INFO 2016-09-27 08:54:49,262 [shard 0] gossip - Node 127.0.0.3 will be
removed from gossip at [2016-09-30 08:54:48]: (expire = 1475196888294489339,
now = 1474937689262295270, diff = 259199 seconds)
The expire time, which is used to decide when to remove a node from
gossip membership, is gossiped around the cluster. We switched to the
steady clock in the past. In order to have a consistent time_point on
all the nodes in the cluster, we have to use the wall clock. Switch
gossip to use system_clock.
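A minimal sketch of why system_clock works here and steady_clock cannot
(illustrative only):

    #include <chrono>

    // The expire time is shipped to other nodes via gossip, so it must be a
    // wall-clock time_point: a steady_clock value is only meaningful on the
    // node that produced it.
    using expire_point = std::chrono::system_clock::time_point;

    expire_point make_expire_time() {
        // e.g. drop the node from gossip membership three days from now
        return std::chrono::system_clock::now() + std::chrono::hours(3 * 24);
    }

    bool expired(expire_point expire) {
        return std::chrono::system_clock::now() > expire;  // valid on any node
    }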
Fixes #1704
"The prometheus project and its sub project does not have RPM/DEB packaging yet,
but it does have binaries for download.
This series adds an installation script that download install and run as a
service the node_exporter. For os that uses systemd it has a spec file ready
that will be package with the system. For ubuntu a service file will be created
when running the installer.
After this series running node_exporter_install a node_exporter will be running
as a service on the machine."
In patch ac619820 (streaming: Switch to use make_streaming_reader), we
switched to make_streaming_reader for streaming. In repair, the
checksum phase also uses a mutation reader. For the same reasons (no
pollution of the row cache, bounded new data after the reader is created),
switch repair checksum calculation to use make_streaming_reader too.
Fixes #382
Fixes #1682
Message-Id: <9e0ecda861bb0b6f690da5e2378b208159ffa41c.1474933195.git.asias@scylladb.com>
From Asias:
With this series, streaming and repair are improved:
- streaming and repair will not pollute the row cache on the sender side
any more. Currently, we risk evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data that it
reads from disk.
- less data will be sent, because the reader will only return data that
existed before the point at which the reader was created, plus a bounded
amount of writes which arrive later. This helps reduce the streaming time
when new data is being inserted all the time while streaming is in
progress, e.g., adding a new node while there is a lot of CQL write
workload.
Fixes #382 and #1682
Using make_streaming_reader for streaming on the sender side has the
following advantages:
- streaming and repair will not pollute the row cache on the sender side
any more. Currently, we risk evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data that it
reads from disk.
- less data will be sent, because the reader will only return data that
existed before the point at which the reader was created, plus a bounded
amount of writes which arrive later. This helps reduce the streaming time
when new data is being inserted all the time while streaming is in
progress, e.g., adding a new node while there is a lot of CQL write
workload.
Fixes #382
Fixes #1682
make_streaming_reader returns a combined mutation reader that reads
mutations from sstables and memtables. The memtable reader handles
memtable flushing automatically, so no special handling is needed here.
It will be used by streaming soon.
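At its core a combined reader is a k-way merge over independently sorted
sources. A self-contained illustration of that shape (the real reader
merges mutations by token/key and combines entries for the same key,
rather than plain integers):

    #include <cstddef>
    #include <queue>
    #include <utility>
    #include <vector>

    // Merge several sorted sources (think: sstables plus a memtable) into
    // a single sorted stream, as the combined reader does for mutations.
    std::vector<int> combine_sorted(const std::vector<std::vector<int>>& src) {
        using entry = std::pair<int, size_t>;  // (value, source index)
        auto cmp = [](const entry& a, const entry& b) { return a.first > b.first; };
        std::priority_queue<entry, std::vector<entry>, decltype(cmp)> heap(cmp);
        std::vector<size_t> pos(src.size(), 0);
        for (size_t i = 0; i < src.size(); ++i) {
            if (!src[i].empty()) heap.push({src[i][0], i});
        }
        std::vector<int> out;
        while (!heap.empty()) {
            auto [v, i] = heap.top();
            heap.pop();
            out.push_back(v);
            if (++pos[i] < src[i].size()) heap.push({src[i][pos[i]], i});
        }
        return out;
    }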
When there is no free disk, it asks the user to select disks from an
empty list:
"Please select disks from following list:
type 'done' to finish selection. selected:"
We should avoid asking in that case and abort RAID setup instead.
Fixes #1673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474429218-28382-1-git-send-email-syuu@scylladb.com>
When using multiple physical network interfaces, set this to true to
listen on broadcast_address in addition to listen_address, allowing
nodes to communicate over both interfaces. Ignore this property if the
network configuration automatically routes between the public and
private networks, such as on EC2.
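A minimal sketch of the resulting bind logic, covering both this property
and the earlier note about never binding the same address twice (names are
illustrative):

    #include <set>
    #include <string>

    std::set<std::string> addresses_to_bind(const std::string& listen_address,
                                            const std::string& broadcast_address,
                                            bool listen_on_broadcast_address) {
        std::set<std::string> addrs{listen_address};
        if (listen_on_broadcast_address) {
            // std::set de-duplicates, so listen_address == broadcast_address
            // can never result in binding the exact same address twice.
            addrs.insert(broadcast_address);
        }
        return addrs;
    }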
Message-Id: <20160921094810.GA28654@scylladb.com>
If the remote peers have the same checksum, we can fetch from only one
of the peer nodes instead of all of them, since they all have the same
data anyway. There is no need to fetch from all of them.
In addition to the above optimization, if the local peer has no data, we
can skip sending the data back to the remote peers: since all the remote
peers have the same checksum and the local peer has no data, each and
every remote peer already has all the data. There is no need to merge
the remote data with local data and send the merged data back to the
remote peers.
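A sketch of both optimizations as a decision rule (illustrative names;
not the actual repair code):

    #include <string>
    #include <vector>

    using peer = std::string;  // peer node address, simplified

    struct repair_plan {
        std::vector<peer> fetch_from;  // peers to pull data from
        std::vector<peer> send_to;     // peers to push merged data back to
    };

    repair_plan plan_for_equal_checksums(const std::vector<peer>& peers,
                                         bool local_has_data) {
        repair_plan p;
        // All peers hold identical data, so fetching from one is enough.
        if (!peers.empty()) {
            p.fetch_from.push_back(peers.front());
        }
        // If we had no data locally, the merged result is exactly what every
        // remote peer already holds; there is nothing to send back.
        if (local_has_data) {
            p.send_to = peers;
        }
        return p;
    }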
Refs: #1617
Right now, we are using one stream_plan for each range of a column
family. This generates tons of stream_plans and stream_sessions. Each
stream_plan can transfer multiple ranges and column families, so we can
use a single stream_plan to stream data for multiple ranges and column
families. That way 1) the overhead of stream_plan/session negotiation is
reduced, and 2) it is much easier to debug/monitor a few stream_sessions.
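A sketch of the before/after shape (stream_plan here is a simplified
stand-in for the real Scylla class):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct token_range { int64_t start, end; };

    struct stream_plan {
        // column family -> all ranges to transfer for it
        std::map<std::string, std::vector<token_range>> requests;
        void request_ranges(const std::string& cf,
                            const std::vector<token_range>& r) {
            auto& v = requests[cf];
            v.insert(v.end(), r.begin(), r.end());
        }
    };

    // Before: one stream_plan per (column family, range) pair.
    // After: a single plan accumulates everything, so there is one
    // negotiation per peer instead of tons of plans and sessions.
    void build_plan(stream_plan& plan,
                    const std::map<std::string, std::vector<token_range>>& work) {
        for (const auto& [cf, ranges] : work) {
            plan.request_ranges(cf, ranges);
        }
    }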
Fixes #1685
"When a node is decommissioned, its gossip state will not be removed from gossip
immediately. It will only be removed 3 days later which helps nodes that were
down when the node was decommissioned to know decommission later when they are
up again.
This series improves the logging to reduce confusion when a node tries to
talking to a decommissioned node. In addition, we now do not try to talk to the
decommissioned in the unreachable_endpoints gossip round.
Fixes#1615"
* tag 'asias/loggging_decommissioned_nodes/v1' of github.com:cloudius-systems/seastar-dev:
gossip: Make two log items debug level
gossip: Print node status when node is UP or DOWN
gossip: Ignore the node which is decommissioned in gossip round
gossip: Print convict debug info only when the node is alive
gossip: Add more timing log in add_expire_time_for_endpoint
streaming: Print on_remove and on_restart log when peer exists
streaming: Introduce has_peer in stream_manager
It is duplicated by the "InetAddress x.x.x.x is now UP" message.
INFO 2016-09-23 10:35:15,512 [shard 0] gossip - Node 127.0.0.1 has restarted, now UP, status = NORMAL
INFO 2016-09-23 10:35:15,513 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
Make the log a bit cleaner.
For example:
gossip - InetAddress 127.0.0.4 is now UP, status = NORMAL
gossip - InetAddress 127.0.0.3 is now DOWN, status = LEFT
gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
We print the following messages even if there is no stream_session with
that peer. It is a bit confusing.
INFO 2016-09-23 08:26:37,254 [shard 0] stream_session - stream_manager:
Close all stream_session with peer = 127.0.0.1 in on_restart
INFO 2016-09-23 08:26:37,287 [shard 0] stream_session - stream_manager:
Close all stream_session with peer = 127.0.0.3 in on_remove
Print them only when a streaming session with the peer exists.
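A tiny sketch of the guard (simplified, hypothetical declarations;
has_peer is the helper introduced earlier in this series):

    #include <iostream>
    #include <string>

    struct stream_manager {
        bool has_peer(const std::string& peer) const;      // from this series
        void close_all_sessions(const std::string& peer);
    };

    void on_remove(stream_manager& sm, const std::string& peer) {
        if (!sm.has_peer(peer)) {
            return;  // no sessions with this peer: nothing to close or log
        }
        std::cout << "stream_manager: Close all stream_session with peer = "
                  << peer << " in on_remove\n";
        sm.close_all_sessions(peer);
    }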