log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
Collect coordinator-side read statistics per CF and use them in the percentile
speculative read executor. Getting a percentile from an estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if the requested percentile changes).
Fixes #2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
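The caching rule above can be sketched as follows. This is only an illustration: cached_percentile and its compute callback are hypothetical names, and the real code reads from estimated_histogram rather than from a std::function.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper: caches the result of an expensive percentile
// computation and refreshes it at most once per second, or immediately
// when a different percentile is requested.
class cached_percentile {
public:
    using clock = std::chrono::steady_clock;
    using compute_fn = std::function<double(double)>;

    explicit cached_percentile(compute_fn compute) : _compute(std::move(compute)) {}

    // `now` is a parameter only so that the behaviour can be tested
    // without sleeping.
    double get(double percentile, clock::time_point now = clock::now()) {
        const bool stale = !_valid
            || percentile != _percentile
            || now - _last_update >= std::chrono::seconds(1);
        if (stale) {
            _value = _compute(percentile);
            _percentile = percentile;
            _last_update = now;
            _valid = true;
        }
        return _value;
    }

private:
    compute_fn _compute;
    double _percentile = 0;
    double _value = 0;
    clock::time_point _last_update{};
    bool _valid = false;
};
```

Repeated calls with the same percentile within one second return the cached value without invoking the callback again.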
Currently overflow values are stored in the wrong bucket (the last one
instead of the special "overflow" one) and the percentile() function throws
if there is an overflow value. The patch fixes the code to store overflow
values in the corresponding bucket and makes percentile() take them into
account instead of throwing.
Message-Id: <20170911131752.27369-2-gleb@scylladb.com>
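A minimal sketch of the intended behaviour, not the estimated_histogram code itself. The class simple_histogram is hypothetical, and returning the maximum int64_t when the percentile falls into the overflow bucket is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical simplified histogram: bucket i counts values <= offsets[i];
// one extra trailing bucket counts values larger than every offset.
class simple_histogram {
public:
    explicit simple_histogram(std::vector<int64_t> offsets)
        : _offsets(std::move(offsets)), _buckets(_offsets.size() + 1, 0) {}

    void add(int64_t v) {
        // lower_bound yields end() for values above the last offset, which
        // maps exactly onto the dedicated overflow bucket.
        auto it = std::lower_bound(_offsets.begin(), _offsets.end(), v);
        ++_buckets[it - _offsets.begin()];
    }

    uint64_t overflow_count() const { return _buckets.back(); }

    // Returns the upper bound of the bucket holding the requested percentile.
    // Assumption: when it lands in the overflow bucket, report the maximum
    // representable value rather than throwing.
    int64_t percentile(double p) const {
        assert(p >= 0.0 && p <= 1.0);
        uint64_t total = 0;
        for (auto c : _buckets) total += c;
        if (total == 0) return 0;
        auto rank = static_cast<uint64_t>(std::ceil(p * total));
        uint64_t seen = 0;
        for (size_t i = 0; i < _offsets.size(); ++i) {
            seen += _buckets[i];
            if (seen >= rank) return _offsets[i];
        }
        return std::numeric_limits<int64_t>::max();
    }

private:
    std::vector<int64_t> _offsets;
    std::vector<uint64_t> _buckets;
};
```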
Large deques require contiguous storage, which may not be available (or may
be expensive to obtain). Switch to new custom container instead, which allocates
less contiguous storage.
Allocation problems were observed with the summary and compression info. While
there is work to reduce compression info contiguous space use, this solves
all std::deque problems (and should not conflict with that work).
Fixes #2708
* tag '2708/v6' of https://github.com/avikivity/scylla:
sstables: switch std::deque to chunked_vector
tests: add test for chunked_vector
utils: add a new container type chunked_vector
We currently use std::deque<> when we need large random-access containers,
but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can
exceed our 256k fragmentation unit with large sstables. The new
container, which is a cross between deque and vector, needs much less
contiguous memory.
Like deque, we allocate chunks of contiguous items, but they are
128k in size instead of 512 bytes. The last chunk can be smaller to avoid
allocating 128k for a really small vector.
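The layout described above, as a rough sketch. tiny_chunked_vector is a hypothetical name and omits most of what a real container needs (iterators, erase, exception safety); it only illustrates the chunking and the short last chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, heavily simplified chunked vector: elements live in
// fixed-size chunks, so no single element allocation is asked to exceed
// one chunk; the only other allocation is the small array of chunk pointers.
template <typename T, size_t max_chunk_bytes = 128 * 1024>
class tiny_chunked_vector {
    static constexpr size_t items_per_chunk = max_chunk_bytes / sizeof(T);
    static_assert(items_per_chunk > 0, "T does not fit in a chunk");
public:
    void push_back(const T& v) {
        if (_size % items_per_chunk == 0) {
            // All existing chunks are full (or there are none): start a new one.
            _chunks.push_back(std::make_unique<std::vector<T>>());
        }
        auto& chunk = *_chunks.back();
        if (chunk.size() == chunk.capacity()) {
            // Grow geometrically, but never ask for more than one full chunk,
            // so a short last chunk stays small.
            const size_t limit = items_per_chunk;
            chunk.reserve(std::min(limit, std::max<size_t>(1, chunk.capacity() * 2)));
        }
        chunk.push_back(v);
        ++_size;
    }

    T& operator[](size_t i) {
        assert(i < _size);
        return (*_chunks[i / items_per_chunk])[i % items_per_chunk];
    }

    size_t size() const { return _size; }
    size_t chunk_count() const { return _chunks.size(); }

private:
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    size_t _size = 0;
};
```

Random access stays O(1): one division to find the chunk and one remainder to find the slot inside it.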
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.
Message-Id: <20170819135212.25230-1-avi@scylladb.com>
The timer is armed inside the section guarded by the _timer_reads_gate,
therefore it has to be canceled after the gate is closed.
Otherwise we may end up with an armed timer after the stop() method has
returned a ready future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
"This series reduces that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
stop the last bucket."
Fixes #2650.
* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
database: remove latency from the system table
estimated histogram: return a smaller histogram
1. assert() is not constexpr.
2. can't use static_assert(), because the constructor may be called in a non-constexpr
environment; moved to log_histogram
3. pow2_rank() uses count_leading_zeros() which is not constexpr; split
into constexpr and non-constexpr versions
4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank
Message-Id: <20170726105444.32698-1-avi@scylladb.com>
The current histogram contains 91 buckets, which is a very high
resolution with a high upper limit.
To reduce the traffic passed between scylla and prometheus, this patch
generates a smaller histogram.
It limits the number of buckets (16 by default), sets a lower limit for the
lowest bucket, and uses 2 as the bucket coefficient.
The highest empty buckets will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
estimated histogram
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.
We have to ensure that these operations are over before we can destroy
the loading_cache object.
Fixes #2624
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.
Fixes #2590
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstones and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.
With strategies other than date tiered, large sstables will be left untouched
for a long time even though all of their data is expired. Date tiered suffers
from it when mixing data with different TTLs because it only includes fully
expired sstables for compaction.
sstables keep a histogram as metadata, which allows us to easily estimate the
droppable data ratio from gc_before. sstables whose droppable data ratio is
above 20% (default value for the tombstone_threshold option) will be considered
candidates for the operation.
Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in the standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.
Fixes #2306."
* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
tests: more testing for tombstone compaction options
tests: basic tombstone compaction test for date tiered
compaction/dtcs: add support for tombstone compaction
tests: basic test of tombstone compaction with lcs
compaction/lcs: add support for tombstone compaction
tests: basic tombstone compaction test for size tiered
compaction/stcs: add support for tombstone compaction
tests: add test for estimation of droppable tombstone ratio
sstables: introduce function to estimate droppable tombstone ratio
compaction_manager: periodically submit cfs for compaction
streaming_histogram: fix coding style
tests: add streaming_histogram_test
streaming_histogram: implement sum
tests: add test for sstable with bad tombstone histogram
sstables: discard bad streaming histogram for future use
tests: add sstable tombstone histogram test
streaming_histogram: fix update
streaming_histogram: move it to utils
streaming_histogram: do not limit it to be used by sstables
sstables: update tombstone_histogram for cells with expiration time
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
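For illustration, an interpolating sum over ordered bins could look like the sketch below. estimate_sum is a hypothetical free function over a std::map; the interpolation follows the usual streaming-histogram approach and is an assumption about the shape of the real streaming_histogram::sum(), not a copy of it.

```cpp
#include <iterator>
#include <map>

// Hypothetical free function over an ordered bin map (point -> count).
// Estimates how many recorded points fall in [-inf, b] by interpolating
// inside the pair of neighbouring bins that surround b.
double estimate_sum(const std::map<double, unsigned long>& bins, double b) {
    double sum = 0;
    auto next = bins.upper_bound(b);        // first bin with key > b
    if (next == bins.end()) {
        // b is at or beyond the last bin: every point counts.
        for (const auto& kv : bins) sum += kv.second;
        return sum;
    }
    if (next == bins.begin()) {
        return 0;                           // b is below the first bin
    }
    auto cur = std::prev(next);             // last bin with key <= b
    const double weight = (b - cur->first) / (next->first - cur->first);
    // Interpolated count at b, then the trapezoid area between cur and b.
    const double mb = cur->second
        + (static_cast<double>(next->second) - static_cast<double>(cur->second)) * weight;
    sum += (cur->second + mb) * weight / 2;
    sum += cur->second / 2.0;               // half of the bin at cur lies to its left
    for (auto it = bins.begin(); it != cur; ++it) sum += it->second;
    return sum;
}
```

Under these assumptions, a droppable ratio could then be approximated as estimate_sum(bins, gc_before) divided by the total number of recorded points.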
This bug was introduced when converting the Java code. The return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so the histogram now uses a map
instead.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.
A coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
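The erase() pitfall described above, in isolation. take() is a hypothetical helper written for this example, not a function from the patch.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper showing the pitfall: std::map::erase(key) returns the
// number of erased elements (0 or 1), not the mapped value.
// To obtain the value, look it up first and erase through the iterator.
uint64_t take(std::map<double, uint64_t>& bins, double key) {
    auto it = bins.find(key);
    if (it == bins.end()) {
        return 0;                 // nothing stored under this key
    }
    uint64_t value = it->second;  // copy the value before the node goes away
    bins.erase(it);
    return value;
}
```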
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).
This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.
Introduced in 11b5076b3c.
Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
Use the boost::intrusive containers in order to achieve O(1) complexity
for "LRU list" updates and to minimize the memory overhead of linking a hash
table item to its "LRU list" item:
- Make the timestamped_val be both a bi::list and a bi::unordered_set
item.
- Make a bi::unordered_set be a cache backend instead of the
std::unordered_map.
As a result, dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is the total number of cached items:
- Every time a value is read - move it to the front of the "LRU list"
(O(1)).
- When we need to remove k LRU items:
- Repeat k times:
- Take an element from the back of the "LRU list". (O(1)).
- Remove it from the bi::unordered_set and dispose. (O(1)).
We use an auto-unlink configuration for bi::list, therefore
disposing an item is going to auto unlink it from the list.
* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
utils::loading_cache: cleanup
utils/loading_cache.hh: use intrusive list to store the lru entry
utils::loading_cache: implement automatic rehashing
utils::loading_cache: make the underlying map to be an intrusive unordered_set
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.
This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
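The move-to-head-on-read scheme can be sketched with standard containers. Note the swap: this example uses std::list and std::unordered_map rather than boost::intrusive, so it pays an extra allocation per entry, and lru_cache is a hypothetical class; only the complexity argument carries over.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical LRU sketch built on std containers (not boost::intrusive).
// The list is ordered from most recently read (head) to least recently
// read (tail).
class lru_cache {
    using lru_list = std::list<std::pair<std::string, int>>;
public:
    void put(const std::string& key, int value) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            it->second->second = value;
            touch(it->second);
            return;
        }
        _lru.emplace_front(key, value);
        _index[key] = _lru.begin();
    }

    // Returns nullptr on a miss; a hit moves the entry to the list head.
    const int* get(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) return nullptr;
        touch(it->second);
        return &it->second->second;
    }

    // Drops the k least recently read entries: k pops from the tail, O(k).
    void shrink(size_t k) {
        while (k-- > 0 && !_lru.empty()) {
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
    }

    size_t size() const { return _index.size(); }

private:
    void touch(lru_list::iterator it) {
        _lru.splice(_lru.begin(), _lru, it);   // O(1), iterators stay valid
    }

    lru_list _lru;
    std::unordered_map<std::string, lru_list::iterator> _index;
};
```

Because the list is always kept in read order, shrink() never has to sort anything.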
- Start the cache with 256 buckets - the minimum number of buckets.
- Limit the maximum number of buckets to 1M.
- Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
between the minimum and the maximum values mentioned above.
- Grow and shrink the hash every "refresh" period if needed.
- Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
- Enable bi::unordered_set_base_hook's bi::store_hash option.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
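One way to read the sizing rules above is the sketch below. next_bucket_count is a hypothetical helper, "1M" is taken to mean 2^20, and the current bucket count is assumed to already be a power of two within the limits.

```cpp
#include <cstddef>

// Hypothetical limits: "1M" is assumed to mean 2^20 buckets.
constexpr size_t min_buckets = 256;
constexpr size_t max_buckets = 1024 * 1024;

// Returns the bucket count to use for `entries` cached items, given the
// current count. Counts stay powers of two within [min_buckets, max_buckets]
// and the load factor is steered back into [0.25, 1.0] where the limits allow.
size_t next_bucket_count(size_t entries, size_t current) {
    size_t buckets = current;
    // Grow while the load factor exceeds 1.0.
    while (entries > buckets && buckets < max_buckets) {
        buckets *= 2;
    }
    // Shrink while the load factor is below 0.25.
    while (entries * 4 < buckets && buckets > min_buckets) {
        buckets /= 2;
    }
    return buckets;
}
```

Calling such a function once per "refresh" period keeps rehashing off the read path.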
Make the underlying map a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_map<Key, timestamped_val>.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- introduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
According to the description of permissions_validity_in_ms, the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.
According to the permissions_update_interval_in_ms description it must have a non-zero value if permissions_cache
is enabled.
permissions_cache_max_entries description doesn't explicitly state it but it makes no sense to allow it to be zero
if permissions_cache is enabled.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch changes the way a loading_cache works.
Before this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
   query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
   1) If a value was loaded more than _expiry time ago it is removed from the cache.
   2) If the cache is too big - the least recently loaded values are removed till the cache
      fits the requested size.
After this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
   query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
   1) If a value in the cache was last loaded or read more than _expiry time ago - it's removed from the cache.
   2) If the cache is too big - the least recently read values are removed till the cache fits the requested size.
   3) The values that were loaded more than _refresh time ago are re-read in the background.
The new implementation minimizes the number of foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).
It also ensures we do not reload values we no longer need.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
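The per-entry decision made by the timer after this patch might be sketched like this. All names are hypothetical, and it is an assumption of the example that last_read is initialized at the first load and is not touched by background reloads, which is what lets unused values expire.

```cpp
#include <chrono>

// Hypothetical per-entry bookkeeping for the periodic timer callback.
using clock_type = std::chrono::steady_clock;

struct entry_times {
    clock_type::time_point loaded;     // last (re)load from the backing store
    clock_type::time_point last_read;  // set at first load and on every read;
                                       // assumed untouched by background reloads
};

enum class timer_action { keep, reload_in_background, evict };

// Decides what the timer should do with one entry:
//  - evict it when it has not been read (or first loaded) within `expiry`;
//  - otherwise reload it in the background once it is older than `refresh`.
timer_action on_timer(const entry_times& e, clock_type::time_point now,
                      clock_type::duration expiry, clock_type::duration refresh) {
    if (now - e.last_read > expiry) {
        return timer_action::evict;
    }
    if (now - e.loaded > refresh) {
        return timer_action::reload_in_background;
    }
    return timer_action::keep;
}
```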
The standard serialization API (e.g. in data_value) includes the following methods:
size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;
Align the utils::UUID API with the pattern above.
The only addition is that the output iterator parameter of the second method above
becomes a template parameter, so that we can serialize into different output sinks.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
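The three-method pattern with a templated output iterator, sketched on a stand-in type. The uuid class below is hypothetical, std::string approximates bytes, and writing the most significant byte first is an assumption of the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for utils::UUID; `bytes` is approximated with
// std::string. Assumption: the most significant byte is written first.
class uuid {
public:
    uuid(uint64_t msb, uint64_t lsb) : _msb(msb), _lsb(lsb) {}

    size_t serialized_size() const { return 16; }

    // Templated on the output iterator so the same code can write into a
    // raw buffer, a std::string, or a std::back_inserter.
    template <typename OutputIterator>
    void serialize(OutputIterator& out) const {
        put(out, _msb);
        put(out, _lsb);
    }

    std::string serialize() const {
        std::string b(serialized_size(), '\0');
        auto it = b.begin();
        serialize(it);
        return b;
    }

private:
    template <typename OutputIterator>
    static void put(OutputIterator& out, uint64_t v) {
        for (int shift = 56; shift >= 0; shift -= 8) {
            *out++ = static_cast<char>((v >> shift) & 0xff);
        }
    }

    uint64_t _msb;
    uint64_t _lsb;
};
```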
Use the same templated implementation for all different serialize_XXX(...).
The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.
This patch also removes the unneeded specializations that accept signed integer
values since we were casting them to unsigned values anyway.
The std::ostream based specializations are also removed since they are not used
anywhere except in test-serialization.cc, and adapting an ostream to the iterator
is a one-liner.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
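A single templated serializer built on std::copy_n could look roughly like this. serialize_int and to_bytes are hypothetical names, and big-endian byte order is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical generic serializer: lays the value out in a small local
// buffer, then hands the bytes to std::copy_n.
// Assumption: big-endian (network) byte order on the wire.
template <typename T, typename OutputIterator>
void serialize_int(OutputIterator& out, T value) {
    static_assert(std::is_unsigned<T>::value, "cast signed values to unsigned first");
    char buf[sizeof(T)];
    for (size_t i = 0; i < sizeof(T); ++i) {
        buf[i] = static_cast<char>((value >> (8 * (sizeof(T) - 1 - i))) & 0xff);
    }
    out = std::copy_n(buf, sizeof(T), out);
}

// Example sink: appending to a std::vector<char> through std::back_inserter.
std::vector<char> to_bytes(uint32_t v) {
    std::vector<char> out;
    auto it = std::back_inserter(out);
    serialize_int(it, v);
    return out;
}
```

Any other sink only needs an output iterator; for a stream, std::ostreambuf_iterator<char> plays that role.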