Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.
However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.
In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.
Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.
In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.
All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that lineatization_context is trivially
constructible and the map is cleared only if it was modified.
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.
prepended_input_stream suits for the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The exponential_backoff_retry instance is captured by move and is then
indirectly moved again as repeat_until_value() moves the lambda its
passed into its internal state. This caused problems as internal
lambdas store references to the instance and these references go stale
after the move.
To fix this keep hold of the existential_backoff_retry instance in an
enclosing do_with() to make it safe for internal lambdas to reference
it.
Indentation will be fixed by the next patch.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>
_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch adds the do_until_value static member function to
exponential_backoff_retry, which retries the specified function until
it returns an engaged optional.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Class optimized_optional was moved into seastar, and its usage
simplified so move_and_disengage() is replaced in favour of
std::exchange(_, { }).
* seastar adaca37...b0f5591 (9):
> Merge "core: Introduce cancellation mechanism" from Duarte
> Fix Seastar build that no longer builds with --enable-dpdk after the recent commit fd87ea2
> noncopyable_function: support function objects whose move constructors throw
> Adding new hardware options to new config format, using new config format for dpdk device
> Fix check for Boost version during pre-build configuration.
> variant_utils: add variant_visitor constructor for C++17 mode
> Merge "Allows json object to be stream to an" from Amnon
> Merge 'Default to C++17' from Avi
> Add const version of subscript operator to circular_buffer
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171228112126.18142-1-duarte@scylladb.com>
Add a widely used method that returns TRUE if a given address is a broadcast
address of the local node.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"Fixes #2866Fixes#2894
Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."
* 'calle/gossip' of github.com:scylladb/seastar-dev:
gossip: wait for stabilized gossip on bootstrap
gossiper: Prevent race condition in propagation
utils::to_string: Add printers for pairs+maps
utils::in: Add helper type for perfect forwarding initializer lists
The alignment of packed structs can be 1. The system¹ posix_memalign
function will return EINVAL when passed this alignment. This fix
forces the alignment to be at least sizeof(void*).
¹ The seastar implementation of posix_memalign does not appear to
have this limitation currently.
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endiannes.
For instance
process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.
process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:
rate(metric_sum[5m]) / rate(metric_count[5m])
One way of keeping track of the sum is adding the value we sampled,
every time we sample. However, the interface for the estimated histogram
has a method that allows to add a metric while allowing to adjust the
count for missing metrics (add_nano())
That makes acumulating a sum inaccurate--as we will have no values for
the points that were added. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count metrics, all with the
same value.
Long term, doing away with sampling may help us provide more accurate
results.
After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
Because of the small size optimization, not all copies will call the
allocator, so allocation failure injection may miss this site if the
value is not large enough. Make the testing more effective by marking
this place explicitly as an allocation point.
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.
Fixes #2855."
* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
gms/gossiper: Remove periodic replication of endpoint state map
gossiper: Check for features in the change listener
gms/gossiper: Replicate changes incrementally to other shards
gms/gossiper: Document validity of endpoint_state properties
storage_service: Update token_metadata after changing endpoint_state
gms/gossiper: Process endpoints in parallel
gms/gossiper: Serialize state changes and notifications for given node
utils/loading_shared_values: Allow Loader to return non-future result
gms/gossiper: Encapsulate lookup of endpoint_state
storage_service: Batch token metadata and endpoint state replication
utils/serialized_action: Introduce trigger_later()
gossiper: Add and improve logging
gms/gossiper: Don't fire change listeners when there is no change
gms/gossiper: Allow parallel apply_state_locally()
gms/gossiper: Avoid copies in endpoint_state::add_application_state()
gms/failure_detector: Ignore short update intervals
"Extracts the yaml/boost-po aspects of the "self-describing" db::config
into an abstract type.
db::config is then reimplemented in said type, removing some of the
slightly cumbersome entanglement with seastar opts (log).
Adds a main hook for additional configuration files (options + file)"
* 'calle/config' of github.com:scylladb/seastar-dev:
main/init: Add registerable configuration objects
db::config: Re-implement on utils/config_file.
utils::config_file: Abstract out config file to external type
Handling all the boost::commandline + YAML stuff.
This patch only provides an external version of these functions,
it does not modify the db::config object. That is for a follow-up
patch.
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."
Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>
* 'prometheus-histograms' of github.com:glommer/scylla:
storage_proxy: change reporting of estimated histograms
estimated_histogram: bring histogram closer to what prometheus expects.
Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that.
This patch changes the histogram format to be more in line with what
prometheus expect.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Allows registry to give back, for example, shared_ptr etc instead of
solely unique_ptr. If a registry is defined with seastar/std
shared/lw_shared/unique_ptr as "BaseType", the type will assume
this is the intended result type.
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).
It turned out that we already have the classes that do almost what I needed:
- utils::loading_cache
- sstables::shared_index_lists
Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.
This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).
Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).
PATCH4 optimizes the loading_cache for the long timer period use case.
But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.
This seems like a perfect match for a utils::loading_cache except for prepared statements don't need to be reloaded after
they are created.
Patches starting from PATCH5 are dealing with adding the utils::loading_cache the missing functionality (like making the "reloading"
conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.
This also fixes #2474."
* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
tests: loading_cache_test: initial commit
cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
cql3: prepared statements cache on top of loading_cache
utils::loading_cache: make the size limitation more strict
utils::loading_cache: added static_asserts for checking the callbacks signatures
utils::loading_cache: add a bunch of standard synchronous methods
utils::loading_cache: add the ability to create a cache that would not reload the values
utils::loading_cache: add the ability to work with not-copy-constructable values
utils::loading_cache: add EntrySize template parameter
utils::loading_cache: rework on top of utils::loading_shared_values
sstables::shared_index_list: use utils::loading_shared_values
utils: introduce loading_shared_values
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
Ensure that the size of the cache is never bigger than the "max_size".
Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period which is clearly an undesirable
behaviour.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.
In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>