Use the boost::intrusive containers in order to achieve a O(1) complexity
for both "LRU list" update and to minimize the memory overhead in the hash
table item to "LRU list" item connection:
- Make the timestamped_val be both a bi::list and a bi::unordered_set
item.
- Make a bi::unordered_set be a cache backend instead of the
std::unordered_map.
As a result dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is a total number of all cached items:
- Every time a value is read - move it to the front of the "LRU list"
(O(1)).
- When we need to remove k LRU items:
- Repeat k times:
- Take an element from the back of the "LRU list". (O(1)).
- Remove it from the bi::unordered_set and dispose. (O(1)).
We use an auto-unlink configuration for bi::list, therefore
disposing an item is going to auto unlink it from the list.
* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
utils::loading_cache: cleanup
utils/loading_cache.hh: use intrusive list to store the lru entry
utils::loading_cache: implement automatic rehashing
utils::loading_cache: make the underlying map to be an intrusive unordered_set
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.
This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- Start the cache with 256 buckets - the minimum number of buckets.
- Limit the maximal number of buckets by 1M buckets.
- Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
between the minimum and the maximum values mentioned above.
- Grow and shrink the hash every "refresh" period if needed.
- Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
- Enable bi::unordered_set_base_hook's bi::store_hash option.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_set<Key, timestamped_val>.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
According to description of permissions_validity_in_ms the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.
According to the permissions_update_interval_in_ms description it must have a non-zero value if permissions_cache
is enabled.
permissions_cache_max_entries description doesn't explicitly state it but it makes no sense to allow it to be zero
if permissions_cache is enabled.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch changes the way a loading_cache works.
Before this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
1) If a value was loaded more than _expiry time ago it is removed from the cache.
2) If the cache is too big - the less recently loaded values are removed till the cache
fits the requested size.
After this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache.
2) If the cache is too big - the less recently read values are removed till the cache fits the requested size.
3) The values that were loaded more than _refresh time ago are re-read in the background.
The new implementation allows to minimize the amount of the foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).
It also ensures we do not reload values we no longer need.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The standard serialization API (e.g. in data_value) includes the following methods:
size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;
Align the utils::UUID API with the pattern above.
The only addition is that we are going to make an output iterator parameter of a second method above
a template so that we may serialize into different output sources.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use the same templated implementation for all different serialize_XXX(...).
The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.
This patch also removes the not needed specializations that accept signed integer
values since we were casting them to unsigned value anyway.
The std::ostream based specifications are also removed since they are not used
anywhere except for a test-serialization.cc and adjusting the ostream to the iterator
is a single-liner.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
While blob_storage is marked as an unaligned type, the back references also
point to an unaligned type (a pointer to blob_storage), since a back
reference can live in a blob_storage. This triggers errors from zapcc/clang 4.
Fix by creating a type for the reference, which is marked as unaligned.
Message-Id: <20170502071404.507-1-avi@scylladb.com>
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.
Fixes#2335.
Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
This patch extracts out the relational operators in struct tombstone
to a class capable of generating them from a tri-compare function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler. A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."
* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
sstable_test: avoid passing negative non-type template arguments to unsigned parameters
UUID: add more comparison operators
sstable_datafile_test: avoid string_view user-defined literal conversion operator
mutation_source_test: avoid template function without template keyword
cql_query_test: define static variable
cql_query_test: add braces for single-item collection initializers
storage_service: don't use typeid(temporary)
logalloc: remove unused max_occupancy_for_compaction
storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
storage_proxy: drop unused member access from return value
storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
read_repair_decision: fix operator<<(std::ostream&, ...)
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it. This includes its size (for freeing) and
an 8-byte migrator (for migrating). Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).
This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used). It uses the following techniques:
- ULEB128-like encoding (actually more like ULEB64) so a live object's header
can typically be stored using 1 byte
- indirection, so that migrators can be encoded in a small index pointing
to a migrator table, rather than using an 8-byte pointer; this exploits
the fact that only a small number of types are stored in LSA
- moving the responsibility for determining an object's size to its
migrator, rather than storing it in the header; this exploits the fact
that the migrator stores type information, and object size is in fact
information about the type
The patch improves the results of memory_footprint_test as following:
Before:
- in cache: 976
- in memtable: 947
After:
mutation footprint:
- in cache: 880
- in memtable: 858
A reduction of about 10%. Further reductions are possible by reducing the
alignment of lsa objects.
logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.
Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.
Therefore, change log_histogram to support arbitrary minimum base.
bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
Idle-time compaction should not produce not-compactible segments
becuase that means we would have to evict a lot when we finally need
to reclaim some memory, so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.
Refs #1634.
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides. Replace with the right type.
Since the right type is private, add some friendship.
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.
However, this may unnecesairly hurt latency if the zone is large, but
only few segments are required to satisfy reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving a head room above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into accoutn actual
threshold used by the seastar allocator, so there could be no gap or
even could go below the minimum.
Fix that by ensuring there's always a gap above min_free_memory().
min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
"This replaces use of a generic forwarding wrapper in sstable reader with
specialized implentation. Forwarding doesn't yet utilize indexes in this
series, only integrates it with mp_row_consumer, which is a prerequisite.
It's still an optimization, since mp_row_consumer will not try to consume
past the range as it used to.
Sending early for easier consumption."
* tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev:
sstables: Remove use of forwarding wrapper
sstables: Implement sstable_streamed_mutation::fast_forward_to()
sstables: Extract and use clustering_ranges_walker
tests: sstables: Add test for handling of repeated tombstones
sstables: Extract writer parameters into config objects
tests: Move as_mutation_source() helper to header
tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions
streamed_mutation: Add streamed_mutation_returning() helper
tests: mutation_source_test: Add test case for forwarding to a full range
tests: simple_schema: Add fragment factories
tests: Extract simple_schema
sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
sstables: Drop default mp_row_consumer constructor
sstables: Swap order of values in "proceed" so that "no" is assigned 0
util/optimized_optional: Make printable
position_in_partition: Add is_static_row() in the view
range_tombstone_stream: Add reset()
range_tombstone_stream: Add get_next(position_in_partition_view)
sstables: streamed_mutation: Stop reading when end of slice reached
sstables: Switch is_in_range() to position_in_partition
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing static linkage of non-type
template arguments.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.
Performance was tested using:
perf-simple-query -c4 --counters --duration 60
The following results are medians.
before after diff
write 18640.41 33156.81 +77.9%
read 58002.32 62733.93 +8.2%"
* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
cell_locker: add metrics for lock acquisition
storage_proxy: count counter updates for which the node was a leader
storage_proxy: use counter-specific timeout for writes
storage_proxy: transform counter timeouts to mutation_write_timeout_exception
db: avoid allocations in do_apply_counter_update()
tests/counters: add test for apply reversability
counters: attempt to apply in place
atomic_cell: add COUNTER_IN_PLACE_REVERT flag
counters: add equality operators
counters: implement decrement operators for shard_iterator
counters: allow using both views and mutable_views
atomic_cell: introduce atomic_cell_mutable_view
managed_bytes: add cast to mutable_view
bytes: add bytes_mutable_view
utils: introduce mutable_view
db: add more tracing events for counter writes
db: propagate tracing state for counter writes
tests/cell_locker: add test for timing out lock acquisition
counter_cell_locker: allow setting timeouts
db: propagate timeout for counter writes
...
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.
This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split in N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.
There is some loss of precision as segments are now not totally
ordered, and precision decreases the more sparse a segment is. This
allows to reduce the cost of the computations needed when freeing
from a closed segment.
Performance results for perf_simple_query -c4 --duration 60
before after diff
read 43954.27 45246.10 +2.9%
write 48911.54 52807.76 +7.9%
Fixes#1442
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>