When the cluster is under heavy load, exchanging a gossip message can
take longer than 1s. Make the timeout longer for now, until we can solve
the large gossip message delay itself.
Patch "Fix some timing/latency issues with sync" changed new_segment to
_not_ wait for flush to finish. This means that checking the actual files
on disk in the test case might race.
Luckily, we can more or less just check the segment list instead
(added fairly recently).
Refs #356
* Move sync time setting to sync initiate to help prevent double syncs
* Change add_mutation to only do explicit sync with wait if time elapsed
since last is 2x sync window
* Do not wait for sync when moving to new segment in alloc path
* Initialize _sync_time properly.
* Add some tracing log messages to help debug
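The add_mutation rule above can be sketched as a small predicate. This is
only an illustration in Python, not the project's code: the function name
and the millisecond parameters are hypothetical, and reading "is 2x sync
window" as "at least twice the window" is an assumption.

```python
def needs_explicit_sync(now_ms: int, last_sync_ms: int, sync_window_ms: int) -> bool:
    """Return True when the caller should sync explicitly and wait for it."""
    # Assumption: "2x sync window" means elapsed time >= twice the window.
    return now_ms - last_sync_ms >= 2 * sync_window_ms
```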
A race condition happens when two or more shards try to delete
the same partial sstable, so the problem doesn't affect scylla
when it boots with a single shard.
To fix this problem, shard 0 is made responsible for
deleting a partial sstable.
Fixes #359.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
From Avi:
We currently send out each cql transport response in its own packet, which
is very inefficient.
Use a poller to schedule responses to be flushed out, which allows multiple
responses to be sent out in one packet, reducing tcp stack overhead.
I see ~50% improvement with this on my desktop (single core).
* Remove the previous, accidental fix that got committed.
* Instead, just do not give RPs to replay mutations. This is the same as in
Origin, and just as correct (or more so), since we intend to flush the data
to sstables asap anyway.
Instead of flushing responses immediately, ask a reactor poller to flush
them for us. This lets several responses be flushed out together in
one packet.
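The batching idea can be sketched roughly as follows. This is a simplified
Python model, not the actual transport code: the class and method names are
hypothetical, and a real reactor poller would call poll() on every loop
iteration.

```python
class BatchingConnection:
    """Queues responses and lets a poller flush them together."""

    def __init__(self):
        self.pending = []   # responses waiting for the next poll
        self.packets = []   # what was actually written out, one entry per flush

    def send_response(self, response: bytes) -> None:
        # Do not flush here; just remember the response.
        self.pending.append(response)

    def poll(self) -> bool:
        """Flush everything queued so far as one write.

        Returns True if any work was done, False otherwise.
        """
        if not self.pending:
            return False
        self.packets.append(b"".join(self.pending))
        self.pending.clear()
        return True
```

With this shape, three responses queued between two polls go out as a
single write instead of three.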
We should ignore equal and less than operators for shard_id as well.
In a 3-node cluster where each node has 4 CPUs, on the first node:
Before:
[fedora@ip-172-30-0-99 ~]$ netstat -nt|grep 100\:7000
tcp 0 0 172.30.0.99:36998 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:36772 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:40125 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:60182 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:38013 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:51997 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:56532 172.30.0.100:7000 ESTABLISHED
After:
[fedora@ip-172-30-0-99 ~]$ netstat -nt|grep 100\:7000
tcp 0 0 172.30.0.99:45661 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:57395 172.30.0.100:7000 ESTABLISHED
tcp 0 0 172.30.0.99:37807 172.30.0.100:7000 ESTABLISHED
tcp 0 36 172.30.0.99:50567 172.30.0.100:7000 ESTABLISHED
Each shard of a node is supposed to have 1 connection to a peer node,
thus each node will have #cpu connections to a peer node.
With this patch, the cluster is much more stable than before on AWS. So
far, I see no timeout in the gossip syn message exchange.
Make the (apparently dead?) test routine stream_session::test (not in a
test class) use query_options::DEFAULT the way it is intended, instead of
copying it (semantically prohibited, but accidentally possible in code).
Fix the hard-coded version number in the RPM spec file by using the
SCYLLA-VERSION-GEN script.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
This adds version number generation in the build system. Version numbers
follow the format:
<version>-<release>
where release consists of:
<date>-<git-hash>
The version and release numbers are generated by the SCYLLA-VERSION-GEN
script and they are stored in SCYLLA-VERSION-FILE and
SCYLLA-RELEASE-FILE files so that other parts of the build system can
easily pick them up.
For builds that happen from release tarballs, for example,
SCYLLA-VERSION-GEN looks for a "version" file in the tree and just uses
that.
Basically, we're doing pretty much the same as Git is doing in its build
system.
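The scheme can be sketched as below. This is a Python illustration, not the
SCYLLA-VERSION-GEN script itself: the function names are hypothetical, and it
assumes the "version" file in a release tarball holds the complete version
string.

```python
from pathlib import Path


def compose_version(version: str, date: str, git_hash: str) -> str:
    """Build <version>-<release>, where release is <date>-<git-hash>."""
    release = f"{date}-{git_hash}"
    return f"{version}-{release}"


def resolve_version(tree: Path, version: str, date: str, git_hash: str) -> str:
    """Prefer a "version" file in the tree, else generate the string."""
    version_file = tree / "version"
    if version_file.is_file():
        # Assumption: the file content is used verbatim (minus whitespace).
        return version_file.read_text().strip()
    return compose_version(version, date, git_hash)
```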
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
If an sstable is irrelevant for a shard, delete it. The deletion will
only complete when all shards agree (either ignore the sstable or
delete it after compaction).
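A minimal sketch of the "all shards agree" rule, in Python for illustration
only. The class and its fields are hypothetical and do not mirror the real
implementation; each shard reports once, either because the sstable is
irrelevant to it or because it has compacted it away.

```python
class SharedSSTableDeletion:
    """Tracks which shards have agreed that an sstable may be deleted."""

    def __init__(self, shard_count: int):
        self.shard_count = shard_count
        self.agreed = set()
        self.deleted = False

    def shard_agrees(self, shard_id: int) -> bool:
        """Record one shard's agreement; delete once every shard agreed."""
        self.agreed.add(shard_id)
        if len(self.agreed) == self.shard_count:
            self.deleted = True  # stand-in for the actual file removal
        return self.deleted
```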
Until there is an API for the compaction manager, the API returns 0
for the total number of compactions.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
The following functions were added to column family:
is_auto_compaction_disabled
get_built_indexes
get_compression_metadata_off_heap_memory_used
get_compression_parameters
get_compression_ratio
get_read_latency_estimated_histogram
get_write_latency_estimated_histogram
Also added are the get and set compaction strategy methods and a stub
implementation for the compression parameter, crc check and sstable
count.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
The compaction strategy was modified to return its compaction type.
The type method calls the virtual impl type method. Each of the
implementations returns its type.
A name method was added to the compaction strategy that returns the name
according to the strategy type.
The static type method was also modified to receive a const reference to
the string.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
The get_load_map method should return a map between node addresses and
their load. In Origin the implementation is based on the load
broadcaster, which we currently do not have.
This workaround returns a map with a single entry: the current node's
address and its load.
In the event of a compaction failure, run_compaction would be called
more than one time for a request, which could result in an
underflow in the stats pending_compactions.
Let's fix that by only decreasing it if compaction succeeded.
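The fix can be sketched as follows, in Python for illustration only. The
names mirror the message (run_compaction, pending_compactions), but the
submit helper and the overall shape are assumptions, not the project's code.

```python
class CompactionStats:
    def __init__(self):
        self.pending_compactions = 0


def submit(stats: CompactionStats) -> None:
    """One request adds exactly one pending compaction."""
    stats.pending_compactions += 1


def run_compaction(stats: CompactionStats, compact) -> bool:
    """Run one attempt; decrease the counter only if it succeeded."""
    try:
        compact()
    except Exception:
        # The request stays pending and may be retried later.
        return False
    stats.pending_compactions -= 1
    return True
```

A request that fails once and then succeeds therefore decrements the
counter once, not twice.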
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
* seastar 49989ca...aa18f5c (11):
> stream: workaround native network stack drops
> build: fix sanitize=vptr auto-disable
> tests: test thread scheduling groups
> thread: scheduling groups
> thread: introduce thread_attributes
> reactor: make later() more fair
> reactor: introduce force_poll()
> core: move later() out of line
> test futurize
> fix futurize<void> for the case in which Func returns a future
> futures_test: silence exceptional future ignored messages
Fixes #187.
When populating a column family, we will now delete all components
of an sstable with a temporary TOC file. An sstable with a temporary
TOC file means that it was partially written, and can be safely
deleted because the respective data is either saved in the commit
log, or in the compacted sstables in case of the partial sstable
being the result of a compaction.
The deletion procedure is guarded against power failure by only deleting
the temporary TOC file after all other components were deleted.
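The ordering can be sketched like this. It is a Python illustration rather
than the real code; the function name and its parameters are hypothetical.
The point is that the temporary TOC file, which marks the sstable as
partial, is removed last, so an interrupted deletion still leaves the marker
behind for the next boot.

```python
from pathlib import Path
from typing import List


def delete_partial_sstable(components: List[Path], temporary_toc: Path) -> List[Path]:
    """Delete all components, removing the temporary TOC file last.

    Returns the paths in the order they were deleted.
    """
    deleted = []
    for component in components:
        if component == temporary_toc:
            continue  # keep the marker until everything else is gone
        component.unlink(missing_ok=True)
        deleted.append(component)
    temporary_toc.unlink(missing_ok=True)
    deleted.append(temporary_toc)
    return deleted
```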
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
When populating a cf, we should also check for an sstable with a
temporary TOC file, and act accordingly. For the time being,
we will only refuse to boot. Subsequent work is to gather all
files of an sstable with a temporary TOC file and delete them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>