Streaming ususally takes long time to complete. Abort it on false
positive idle detection can be very wasteful.
Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. The real idle session will be aborted eventually if other
mechanisms, e.g., streaming manager has gossip callback for on_remove
and on_restart event to abort, do not abort the session.
Fixes#2197
Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
Use the boost::intrusive containers in order to achieve a O(1) complexity
for both "LRU list" update and to minimize the memory overhead in the hash
table item to "LRU list" item connection:
- Make the timestamped_val be both a bi::list and a bi::unordered_set
item.
- Make a bi::unordered_set be a cache backend instead of the
std::unordered_map.
As a result dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is a total number of all cached items:
- Every time a value is read - move it to the front of the "LRU list"
(O(1)).
- When we need to remove k LRU items:
- Repeat k times:
- Take an element from the back of the "LRU list". (O(1)).
- Remove it from the bi::unordered_set and dispose. (O(1)).
We use an auto-unlink configuration for bi::list, therefore
disposing an item is going to auto unlink it from the list.
* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
utils::loading_cache: cleanup
utils/loading_cache.hh: use intrusive list to store the lru entry
utils::loading_cache: implement automatic rehashing
utils::loading_cache: make the underlying map to be an intrusive unordered_set
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
The case when both cells are dead was not handled properly, the diff
was always empty, whereas the cell with higher timestamp should win.
Caused test_mutation_diff_with_random_generator to fail.
The code was inserting an entry with the same key as its successor,
and only later adjusting the key of the old entry. This is violating
set's invariant of unique keys, and insertion may cause rebalancing. I
don't know if this violation actually causes problems currently, but
it's safer not to.
Fix by first updating the existing entry and then inserting the new
one.
Example of non-commutative case:
a = [1, 5]@t2
b = {[2, 3]@t1, [4, 5]@t1}
a + b = [1, 5]@t2
b + a = [1, 4)@t2, [4, 5]@t2
After this patch, both will yield [1, 5]@t2.
The patch also changes the logic to handle overlaps of tombstones with
equal timestamps to be handled symmetrically. They are now merged
instead of split on either of the boundary.
Refs #2158.
Later operation may depend on the result of previous operation. Same dependency
is present when reverting the operations.
Fixes assertion failure in update reverter.
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.
This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- Start the cache with 256 buckets - the minimum number of buckets.
- Limit the maximal number of buckets by 1M buckets.
- Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
between the minimum and the maximum values mentioned above.
- Grow and shrink the hash every "refresh" period if needed.
- Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
- Enable bi::unordered_set_base_hook's bi::store_hash option.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_set<Key, timestamped_val>.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.
Fixes#1859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."
* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
database: fix indentation of distributed_loader::open_sstable
database: reduce memory requirement to load sstables
sstables: loads components for a sstable in parallel
sstables: enable read ahead for read of in-memory components
sstables: make random_access_reader work with read ahead
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.
Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. Higher the number of sstables higher the memory overhead.
To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d
To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Read ahead 4 is used. Let's adjust it later if needed. File size is
used to prevent file_input_stream from issuing useless reads beyond
file size with read ahead enabled. We can switch to variant without
length once file_input_stream handles it properly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla crashes if read ahead is enabled by file_random_access_reader
because a call to seek() destroys the existing input stream without
closing it first.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The bootstrapping node will be a gossip only member, until the streaming
finishes and the node becomes NORMAL state. If during this time, the
bootstrapping node is overwhelmed with streaming, it is possible the node will
delay the update the gossip heartbeat. Be forgiving for the bootstrapping node
and do not remove it from gossip too fast. Otherwise, streaming rpc verbs will
not be resent becasue the node is not in gossip membership anymore.
Fixes#2150
Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>
We are currently suspecting that the bloom filter false positive ratio
is not being respected. While trying to debug that, I found out that we
have a more basic problem:
The numbers are all meaningless, because the stats are wrong. We are
accumulating by summing the ratios together. It's easy to see how this
doesn't work, if we look at an example where the ratio for some CFs is
zero:
SST1: false = 1, total = 2. ratio = 0.5
SST2: false = 0, total = 98 . ratio = 0.
The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed
ratio will be 0.5 + 0 = 0.5.
This patch will map reduce all the sstables together keeping both
numerator and denominator, yielding the right value at the end. To do
that, we'll reuse the existing ratio_holder class, which already does
exactly what we want.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170518222333.16307-1-glauber@scylladb.com>
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
* seastar 45b718b...f726938 (2):
> memory: add --mbind option to supress warning message when running Seastar apps on container
> Add support for Gentoo Linux irqbalance configuration detection.
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."
* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
tests/view_schema_test: Add more test cases
tests/cql_assertions: Add assertion for row set equality
single_column_relation: Correctly print IN relation
statement_restrictions: Allow filtering regular columns for views
statement_restrictions: Relax clustering restrictions for views
statement_restrictions: Relax partition restrictions for views
cql3/statements: Prevent setting default ttl on view
cql3/restrictions: Complete implementation of is_satisfied_by()
db/view: Re-implement clustering_prefix_matches()
db/view: Re-implement partition_key_matches()
db/view: Generate regular tombstone for base deletions
db/view: Consider cell liveness when generating updates
db/view: Don't generate view updates for static rows
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."
* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
utils::loading_cache: avoid the reads storm when the key is not in the cache
utils::loading_cache: cleanup
utils::loading_cache: align the constrains in the constructor with the parameters description
utils::loading_cache: refresh in the background
auth::auth: add operator<<() for a permission_cache key
auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
db::config: define a saner default value for permissions_validity_in_ms
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::lower_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using container's built in lower_bound() should it
exist. We're using intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.
However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:
./range.hh:618:14: error: decltype cannot resolve address of overloaded function
typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
^~~~~~~~
As a result, the overload which uses std::lower_bound() is used.
Spotted when running perf_fast_forward with wide partition limit in
cache lifted off. It's so slow that I timeouted waiting for the result
(> 16 min).
Fixes#2395.
Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
"These patches add support to setup and operate ScyllaDB on Gentoo Linux.
* scylla_setup and related scripts
* node_health_check
I have kept them as simple as possible and tested them to setup and operate
succesfully a three nodes cluster running on Gentoo Linux."
* 'gentoo_linux_support' of github.com:ultrabug/scylla:
scylla_setup: add gentoo linux installation detection
prometheus node_exporter install: add support for gentoo linux
raid setup: add support for gentoo linux
ntp setup: add support for gentoo linux
kernel check: add support for gentoo linux
cpuscaling setup: add support for gentoo linux
coredump setup: add support for gentoo linux
detect gentoo linux on selinux setup
add gentoo_variant detection and SYSCONFIG setup