SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.
Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. The higher the number of sstables, the higher the memory overhead.
To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each with ~300KB of metadata, the memory
usage peak was ~18GB. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d
To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.
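The effect of the limit can be approximated with a toy model. The sketch
below is not Scylla code: peak_memory_kb() and its parameters are
hypothetical, and the transient overhead per in-flight load is assumed to
equal the resident metadata size, which is what makes the unlimited case
match the ~18GB peak measured above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model (assumption, not Scylla code): each sstable keeps `resident_kb`
// of metadata once loaded and needs `transient_kb` extra while its load is
// in flight. Loads run in waves of at most `parallelism` sstables.
inline std::uint64_t peak_memory_kb(std::uint64_t sstables,
                                    std::uint64_t resident_kb,
                                    std::uint64_t transient_kb,
                                    std::uint64_t parallelism) {
    assert(parallelism > 0);
    std::uint64_t loaded = 0;
    std::uint64_t peak = 0;
    while (loaded < sstables) {
        std::uint64_t in_flight = std::min(parallelism, sstables - loaded);
        // Already loaded sstables hold only resident metadata; in-flight
        // loads hold resident metadata plus their transient buffers.
        std::uint64_t now = loaded * resident_kb
                          + in_flight * (resident_kb + transient_kb);
        peak = std::max(peak, now);
        loaded += in_flight;
    }
    return peak;
}
```

Under these assumptions, 30k sstables of 300KB each peak at 18,000,000KB
with unlimited parallelism and at 9,000,900KB with a limit of 3.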
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
A read ahead of 4 is used. Let's adjust it later if needed. The file size
is passed to prevent file_input_stream from issuing useless reads beyond
the end of the file when read ahead is enabled. We can switch to the
variant without a length once file_input_stream handles it properly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla crashes if read ahead is enabled by file_random_access_reader
because a call to seek() destroys the existing input stream without
closing it first.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The bootstrapping node will be a gossip-only member until streaming
finishes and the node reaches the NORMAL state. If, during this time, the
bootstrapping node is overwhelmed with streaming, it may delay updating its
gossip heartbeat. Be forgiving with the bootstrapping node and do not remove
it from gossip too fast. Otherwise, streaming rpc verbs will not be resent
because the node is no longer in the gossip membership.
Fixes #2150
Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>
We are currently suspecting that the bloom filter false positive ratio
is not being respected. While trying to debug that, I found out that we
have a more basic problem:
The numbers are all meaningless, because the stats are wrong. We are
accumulating by summing the ratios together. It's easy to see how this
doesn't work, if we look at an example where the ratio for some CFs is
zero:
SST1: false = 1, total = 2, ratio = 0.5
SST2: false = 0, total = 98, ratio = 0
The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed
ratio will be 0.5 + 0 = 0.5.
This patch will map reduce all the sstables together keeping both
numerator and denominator, yielding the right value at the end. To do
that, we'll reuse the existing ratio_holder class, which already does
exactly what we want.
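The difference between the two ways of accumulating can be sketched as
follows. The types here (filter_stats, ratio_acc) are hypothetical
stand-ins written for this illustration; they are not the actual
ratio_holder definition.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-sstable bloom filter counters.
struct filter_stats {
    std::uint64_t false_positives;
    std::uint64_t total;
};

// Hypothetical accumulator that keeps numerator and denominator separate
// (illustrative only, not the real ratio_holder).
struct ratio_acc {
    std::uint64_t num = 0;
    std::uint64_t den = 0;
    ratio_acc& operator+=(const filter_stats& s) {
        num += s.false_positives;
        den += s.total;
        return *this;
    }
    double ratio() const { return den ? double(num) / double(den) : 0.0; }
};

// Wrong: adds the per-sstable ratios together.
inline double sum_of_ratios(const std::vector<filter_stats>& v) {
    double r = 0;
    for (const auto& s : v) {
        if (s.total) {
            r += double(s.false_positives) / double(s.total);
        }
    }
    return r;
}

// Right: reduce the counters first, divide once at the end.
inline double combined_ratio(const std::vector<filter_stats>& v) {
    ratio_acc acc;
    for (const auto& s : v) {
        acc += s;
    }
    return acc.ratio();
}
```

With the two sstables from the example, sum_of_ratios() yields 0.5 while
combined_ratio() yields 0.01.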
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170518222333.16307-1-glauber@scylladb.com>
- introduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
* seastar 45b718b...f726938 (2):
> memory: add --mbind option to supress warning message when running Seastar apps on container
> Add support for Gentoo Linux irqbalance configuration detection.
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."
* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
tests/view_schema_test: Add more test cases
tests/cql_assertions: Add assertion for row set equality
single_column_relation: Correctly print IN relation
statement_restrictions: Allow filtering regular columns for views
statement_restrictions: Relax clustering restrictions for views
statement_restrictions: Relax partition restrictions for views
cql3/statements: Prevent setting default ttl on view
cql3/restrictions: Complete implementation of is_satisfied_by()
db/view: Re-implement clustering_prefix_matches()
db/view: Re-implement partition_key_matches()
db/view: Generate regular tombstone for base deletions
db/view: Consider cell liveness when generating updates
db/view: Don't generate view updates for static rows
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."
* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
utils::loading_cache: avoid the reads storm when the key is not in the cache
utils::loading_cache: cleanup
utils::loading_cache: align the constrains in the constructor with the parameters description
utils::loading_cache: refresh in the background
auth::auth: add operator<<() for a permission_cache key
auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
db::config: define a saner default value for permissions_validity_in_ms
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::upper_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using container's built in lower_bound() should it
exist. We're using intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.
However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:
./range.hh:618:14: error: decltype cannot resolve address of overloaded function
typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
^~~~~~~~
As a result, the overload which uses std::lower_bound() is used.
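The failure mode can be reproduced outside of range.hh. The sketch below
is illustrative only: the trait names, one_overload and find_lower() are
hypothetical and this is not the code from the patch. It contrasts
detection by taking the member's address, which silently fails for
overloaded members, with detection by forming a call expression, which
lets overload resolution pick a candidate.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <type_traits>
#include <utility>
#include <vector>

template <typename... Ts> struct make_void { typedef void type; };
template <typename... Ts> using void_type = typename make_void<Ts...>::type;

// Detection by address: &T::lower_bound is ambiguous when the member is
// overloaded, so this specialization is discarded for such containers.
template <typename T, typename = void>
struct has_lower_bound_by_address : std::false_type {};
template <typename T>
struct has_lower_bound_by_address<T, void_type<decltype(&T::lower_bound)>>
    : std::true_type {};

// Detection by call expression: works with overloaded members.
template <typename T, typename Key, typename = void>
struct has_lower_bound_by_call : std::false_type {};
template <typename T, typename Key>
struct has_lower_bound_by_call<
    T, Key,
    void_type<decltype(std::declval<T&>().lower_bound(std::declval<const Key&>()))>>
    : std::true_type {};

// Hypothetical type with a single, non-overloaded lower_bound().
struct one_overload {
    int lower_bound(int) const { return 0; }
};

template <typename C, typename K>
auto lower_bound_impl(C& c, const K& k, std::true_type) -> decltype(c.begin()) {
    return c.lower_bound(k);  // container's own search, O(log N) for trees
}
template <typename C, typename K>
auto lower_bound_impl(C& c, const K& k, std::false_type) -> decltype(c.begin()) {
    return std::lower_bound(c.begin(), c.end(), k);  // generic fallback
}

// Picks the container's lower_bound() when it exists.
template <typename C, typename K>
auto find_lower(C& c, const K& k) -> decltype(c.begin()) {
    return lower_bound_impl(c, k, has_lower_bound_by_call<C, K>{});
}
```

std::set has several lower_bound() overloads, so the address-based trait
reports false for it, while the call-based trait reports true.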
Spotted when running perf_fast_forward with the wide partition limit in
cache lifted. It was so slow that I gave up waiting for the result
(> 16 min).
Fixes #2395.
Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
"These patches add support to setup and operate ScyllaDB on Gentoo Linux.
* scylla_setup and related scripts
* node_health_check
I have kept them as simple as possible and tested them by successfully
setting up and operating a three-node cluster running on Gentoo Linux."
* 'gentoo_linux_support' of github.com:ultrabug/scylla:
scylla_setup: add gentoo linux installation detection
prometheus node_exporter install: add support for gentoo linux
raid setup: add support for gentoo linux
ntp setup: add support for gentoo linux
kernel check: add support for gentoo linux
cpuscaling setup: add support for gentoo linux
coredump setup: add support for gentoo linux
detect gentoo linux on selinux setup
add gentoo_variant detection and SYSCONFIG setup
According to the description of permissions_validity_in_ms, the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.
According to the permissions_update_interval_in_ms description, it must have a non-zero value if permissions_cache
is enabled.
The permissions_cache_max_entries description doesn't explicitly state it, but it makes no sense to allow it to be zero
if permissions_cache is enabled.
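These constraints could be checked roughly as follows. This is a minimal
sketch, not the constructor code: permissions_cache_config and
validate_permissions_cache() are hypothetical names, and throwing
std::invalid_argument is an assumption about how a bad value is reported.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical holder for the three settings discussed above.
struct permissions_cache_config {
    std::uint32_t validity_ms;
    std::uint32_t update_interval_ms;
    std::uint32_t max_entries;
};

// Returns true if the cache is enabled and false if it is disabled.
// Throws if the cache is enabled with an invalid parameter.
inline bool validate_permissions_cache(const permissions_cache_config& c) {
    if (c.validity_ms == 0) {
        return false;  // cache disabled, the other values are not used
    }
    if (c.update_interval_ms == 0) {
        throw std::invalid_argument(
            "permissions_update_interval_in_ms must be non-zero");
    }
    if (c.max_entries == 0) {
        throw std::invalid_argument(
            "permissions_cache_max_entries must be non-zero");
    }
    return true;
}
```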
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch changes the way a loading_cache works.
Before this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
   1) If a value was loaded more than _expiry time ago it is removed from the cache.
   2) If the cache is too big - the less recently loaded values are removed till the cache
      fits the requested size.
After this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
   1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache.
   2) If the cache is too big - the less recently read values are removed till the cache fits the requested size.
   3) The values that were loaded more than _refresh time ago are re-read in the background.
The new implementation reduces the number of foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).
It also ensures we do not reload values we no longer need.
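The new timer behaviour can be sketched as follows. This is a simplified,
single-threaded model and not the utils::loading_cache code:
toy_loading_cache and its members are hypothetical, time is a plain
counter that is assumed never to go backwards, the "background" reload
runs synchronously, and a reload is assumed not to count as a read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class toy_loading_cache {
    struct entry {
        int value;
        std::uint64_t loaded;     // time of the last load or reload
        std::uint64_t last_read;  // time of the last get()
    };
    typedef std::map<std::string, entry> map_type;
    map_type _entries;
    std::size_t _max_size;
    std::uint64_t _expiry;
    std::uint64_t _refresh;
    std::function<int(const std::string&)> _load;
public:
    toy_loading_cache(std::size_t max_size, std::uint64_t expiry,
                      std::uint64_t refresh,
                      std::function<int(const std::string&)> load)
        : _max_size(max_size), _expiry(expiry), _refresh(refresh),
          _load(std::move(load)) {}

    // A miss is loaded in the foreground; every call counts as a read.
    int get(const std::string& key, std::uint64_t now) {
        map_type::iterator it = _entries.find(key);
        if (it == _entries.end()) {
            it = _entries.emplace(key, entry{_load(key), now, now}).first;
        }
        it->second.last_read = now;
        return it->second.value;
    }

    void on_timer(std::uint64_t now) {
        // 1) Drop values not read for longer than _expiry.
        for (map_type::iterator it = _entries.begin(); it != _entries.end();) {
            if (now - it->second.last_read > _expiry) {
                it = _entries.erase(it);
            } else {
                ++it;
            }
        }
        // 2) Shrink to the requested size, least recently read first.
        while (_entries.size() > _max_size) {
            _entries.erase(std::min_element(_entries.begin(), _entries.end(),
                [](const map_type::value_type& a, const map_type::value_type& b) {
                    return a.second.last_read < b.second.last_read;
                }));
        }
        // 3) Reload the surviving values that are older than _refresh.
        for (map_type::iterator it = _entries.begin(); it != _entries.end(); ++it) {
            if (now - it->second.loaded > _refresh) {
                it->second.value = _load(it->first);
                it->second.loaded = now;
            }
        }
    }

    bool contains(const std::string& key) const { return _entries.count(key) != 0; }
};
```

In this model a value that keeps being read is loaded in the foreground
only once, while a value that is no longer read expires before it would
be reloaded again.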
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Our configuration already has default values for the permission cache parameters.
Therefore, if the user provides bad parameters, we'd rather fail the load and report
the bad parameters instead of trying to silently "fix" them.
In addition, the original code wasn't passing the parameters correctly: it swapped the "expiry" and "refresh" parameters in
the utils::loading_cache constructor.
On top of that, the original code was doing really strange things in the permission_cache::expiry(cfg) method.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms.
This may cause the values to be invalidated only because of some minor delays in the timer scheduling.
It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms.
This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling.
In addition, 2s seems to be too small a value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s.
This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data
in the foreground.
Setting it to 10s would allow a second read attempt before we enforce the foreground read.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"Notably:
- add validation of the results (e.g. fragment count, expectations about disk activity)
- add cache-specific tests"
* 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev:
tests: perf_fast_forward: Report cache stats
row_cache: Keep counters in a struct
tests: perf_fast_forward: Add cache-specific tests
tests: perf_fast_forward: Extract test_reading_all()
tests: perf_fast_forward: Add validation of the results
tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
tests: perf_fast_forward: Allow the test to be interrupted
tests: perf_fast_forward: Allow testing with cache enabled
row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Not 100% proper, but in line with how we still store the info.
Ensures (helps at least) to keep schema loaded from tables
and schema from builder comparable.
Fixes schema_changes_test error.
Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."
* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test forwarding in single-key readers
sstables: Remove unused code
sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
"This series adds private repository support to scylla-housekeeping"
* 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev:
scylla-housekeeping service: Support private repositories
scylla-housekeeping-upstart: Use repository id, when checking for version
scylla-housekeeping: support private repositories
make_pkeys() needs to be invoked with n equal to the number of keys
which the table was populated with. Otherwise the extra keys, which
are missing in the table, may be placed anywhere in the vector due to
ring order sorting, and break the assumption that the table contains
all keys from the array up to index n. This resulted in the test
reading slightly fewer fragments than the desired count would
imply.
Another problem is that we should not skip the fast_forward_to() call
for the initial range (workaround for a bug in sstable mutation
reader), otherwise we will read slightly fewer fragments than expected as well.