'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides. Replace with the right type.
Since the right type is private, add some friendship.
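Roughly the shape of the fix, with hypothetical names (the actual function and types differ):

// Before (not legal C++ outside a lambda):
//   void on_segment_free(auto& desc) { ... }

class segment_pool {
    struct segment_descriptor {   // private: outsiders cannot name it
        unsigned free_space;
    };
    // Grant the free function access to the private nested type.
    friend void on_segment_free(segment_descriptor& desc);
};

void on_segment_free(segment_pool::segment_descriptor& desc) {
    desc.free_space = 0;  // the parameter's real type is now spelled out
}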
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.
However, this may unnecessarily hurt latency if the zone is large but
only a few segments are required to satisfy the reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving headroom above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into account the
actual threshold used by the seastar allocator, so there could be no
gap, or free memory could even go below the minimum.
Fix that by ensuring there's always a gap above min_free_memory().
min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
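A sketch of the intended check; min_free_memory() is the threshold
referenced above, while the surrounding names and calls are assumptions:

#include <seastar/core/memory.hh>

// 1 MiB gap above the allocator's minimum, per the reduced min_gap.
static constexpr size_t min_gap = 1 << 20;

bool can_allocate_more_memory(size_t size) {
    auto free = seastar::memory::stats().free_memory();
    // Keep headroom above the threshold at which seastar starts
    // reclaiming, so standard allocations don't immediately trigger
    // reclamation.
    return free >= seastar::memory::min_free_memory() + min_gap + size;
}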
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
"This replaces use of a generic forwarding wrapper in sstable reader with
specialized implementation. Forwarding doesn't yet utilize indexes in this
series, only integrates it with mp_row_consumer, which is a prerequisite.
It's still an optimization, since mp_row_consumer will not try to consume
past the range as it used to.
Sending early for easier consumption."
* tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev:
sstables: Remove use of forwarding wrapper
sstables: Implement sstable_streamed_mutation::fast_forward_to()
sstables: Extract and use clustering_ranges_walker
tests: sstables: Add test for handling of repeated tombstones
sstables: Extract writer parameters into config objects
tests: Move as_mutation_source() helper to header
tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions
streamed_mutation: Add streamed_mutation_returning() helper
tests: mutation_source_test: Add test case for forwarding to a full range
tests: simple_schema: Add fragment factories
tests: Extract simple_schema
sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
sstables: Drop default mp_row_consumer constructor
sstables: Swap order of values in "proceed" so that "no" is assigned 0
util/optimized_optional: Make printable
position_in_partition: Add is_static_row() in the view
range_tombstone_stream: Add reset()
range_tombstone_stream: Add get_next(position_in_partition_view)
sstables: streamed_mutation: Stop reading when end of slice reached
sstables: Switch is_in_range() to position_in_partition
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing internal (static) linkage
for non-type template arguments.
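A sketch of the workaround (hist_options and the template are
illustrative; only the variable name is the real one):

struct hist_options {
    unsigned sub_buckets;
};

template <const hist_options& Opts>
class log_histogram { /* ... */ };

// A namespace-scope const object has internal linkage by default; C++11
// permits such an object as a reference non-type template argument, but
// GCC 5 rejects it. Declaring it extern gives it external linkage.
extern const hist_options segment_descriptor_hist_options;
const hist_options segment_descriptor_hist_options{8};

using segment_descriptor_hist = log_histogram<segment_descriptor_hist_options>;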
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.
Performance was tested using:
perf-simple-query -c4 --counters --duration 60
The following results are medians.
          before     after     diff
write   18640.41  33156.81   +77.9%
read    58002.32  62733.93    +8.2%"
* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
cell_locker: add metrics for lock acquisition
storage_proxy: count counter updates for which the node was a leader
storage_proxy: use counter-specific timeout for writes
storage_proxy: transform counter timeouts to mutation_write_timeout_exception
db: avoid allocations in do_apply_counter_update()
tests/counters: add test for apply reversability
counters: attempt to apply in place
atomic_cell: add COUNTER_IN_PLACE_REVERT flag
counters: add equality operators
counters: implement decrement operators for shard_iterator
counters: allow using both views and mutable_views
atomic_cell: introduce atomic_cell_mutable_view
managed_bytes: add cast to mutable_view
bytes: add bytes_mutable_view
utils: introduce mutable_view
db: add more tracing events for counter writes
db: propagate tracing state for counter writes
tests/cell_locker: add test for timing out lock acquisition
counter_cell_locker: allow setting timeouts
db: propagate timeout for counter writes
...
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.
This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split into N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.
There is some loss of precision as segments are now not totally
ordered, and precision decreases the sparser a segment is. This allows
us to reduce the cost of the computations needed when freeing from a
closed segment.
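A sketch of the bucket mapping described above (the constants and names
are illustrative, not the actual implementation):

#include <cassert>
#include <cstdint>

constexpr uint64_t min_value = 1024;   // smaller values go to bucket 0
constexpr unsigned sub_buckets = 4;    // sub-buckets per power-of-two range
constexpr uint64_t max_value = uint64_t(1) << 30;

unsigned bucket_of(uint64_t v) {
    assert(v <= max_value);            // larger values are not admitted
    if (v < min_value) {
        return 0;
    }
    // Power-of-two range [min_value << r, min_value << (r + 1)) holding
    // v, split into sub_buckets equal slices.
    unsigned r = 63 - __builtin_clzll(v / min_value);
    uint64_t range_start = min_value << r;
    uint64_t slice = range_start / sub_buckets;
    return 1 + r * sub_buckets + (v - range_start) / slice;
}

All values falling into the same slice share a bucket, which is the loss
of precision mentioned above.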
Performance results for perf_simple_query -c4 --duration 60
          before     after    diff
read    43954.27  45246.10   +2.9%
write   48911.54  52807.76   +7.9%
Fixes #1442
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>
The metrics histogram is a struct that describes a histogram.
This patch adds a getter method that lets the estimated_histogram return
a metrics::histogram, which will allow registering it as a histogram
metric.
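A hedged sketch of such a getter; the estimated_histogram members and
the conversion details here are assumptions:

#include <seastar/core/metrics_types.hh>
#include <cstdint>
#include <vector>

struct estimated_histogram_sketch {
    std::vector<int64_t> bucket_offsets;   // upper bound of bucket i
    std::vector<int64_t> buckets;          // sample count of bucket i

    seastar::metrics::histogram get_histogram() const {
        seastar::metrics::histogram h;
        uint64_t cumulative = 0;
        for (size_t i = 0; i < buckets.size(); ++i) {
            cumulative += buckets[i];
            // metrics histogram buckets hold cumulative counts
            h.buckets.push_back({cumulative, double(bucket_offsets[i])});
            h.sample_count += buckets[i];
        }
        return h;
    }
};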
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of
scheduled tasks is at most 1; there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and
the server becomes unresponsive until all of the released requests are done,
which happens either when they start to block on dirty memory again or when
we run out of blocked requests. It may take a while to reach the pressure
condition after a memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
Before, the logic for releasing writes blocked on dirty worked like
this:
1) When region group size changes and it is not under pressure and
there are some requests blocked, then schedule request releasing
task
2) request releasing task, if no pressure, runs one request and if
there are still blocked requests, schedules next request
releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most 1; there is a
single thread of execution.
However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when they contain sparse segments). Compaction may start
many request releasing threads due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in
the system. When the number of scheduled tasks reaches 1000, polling
stops and the server becomes unresponsive until all of the released
requests are done, which happens either when they start to block on
dirty memory again or when we run out of blocked requests. It may take
a while to reach the pressure condition after a memtable flush if it
brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
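A sketch of the single-fiber design (names and the wait/notify mechanism
are hypothetical; headers per recent seastar):

#include <seastar/core/condition-variable.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <deque>

struct region_group_sketch {
    seastar::condition_variable relief;      // signalled when pressure lifts
    std::deque<seastar::promise<>> blocked;  // requests waiting on dirty
    bool under_pressure = false;

    // The one releasing fiber: sleep until pressure is lifted, then run
    // blocked requests until pressure occurs again or the queue drains.
    seastar::future<> release_fiber() {
        return seastar::keep_doing([this] {
            return relief.wait().then([this] {
                while (!under_pressure && !blocked.empty()) {
                    blocked.front().set_value();
                    blocked.pop_front();
                }
            });
        });
    }
};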
The logic for notification across the hierarchy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop rather than stopping
when enough is allocated to cause pressure:
while (!gr.under_pressure()) {
    region.allocate(...);
}
It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever the condition changes,
like we do for the soft pressure.
There is a comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem beneficial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes a treewide change to accommodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of the preparation for the change in the metrics layer, this
changes the way the collectd api uses the metrics values, using the
getters instead of accessing the members directly.
This will be important when the internal implementation changes
from union to variant.
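A toy illustration (hypothetical, not the actual metrics code) of why
the getters matter: callers going through an accessor are unaffected
when the representation changes.

#include <cstdint>
#include <variant>

class metric_value {
    // Previously a union; getter users never notice the switch.
    std::variant<double, uint64_t> _v;
public:
    explicit metric_value(double d) : _v(d) {}
    double d() const { return std::get<double>(_v); }
    uint64_t ui() const { return std::get<uint64_t>(_v); }
};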
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
This reverts commit d61002cc33.
Introduced a regression in row_cache_alloc_stress.
The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if the region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
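A sketch of the new stop condition (names hypothetical, region API
stubbed):

constexpr float occupancy_threshold = 0.85f;     // the same 85% as above

struct region_api {
    float sparsest_segment_occupancy() const;    // fraction in [0, 1]
    bool evict_one();                            // false when nothing left
};

void reclaim_from_evictable_sketch(region_api& r) {
    // Evict only until the sparsest segment is below the threshold,
    // rather than until the whole region's occupancy drops below it.
    while (r.sparsest_segment_occupancy() >= occupancy_threshold) {
        if (!r.evict_one()) {
            break;
        }
    }
    // The sparsest segment can now be compacted profitably.
}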
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:
Before:
avg = 1524.283148
stdev = 11021.021118
min = 12.934000
max = 144356.000000
sum = 257603.852000
samples = 169
After:
avg = 1317.362414
stdev = 1913.542802
min = 263.935000
max = 19244.600000
sum = 175209.201000
samples = 133
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
As the metrics migration progressed, some includes of scollectd.hh were
left behind.
Because of the nature of the scollectd implementation, those includes
bring a lot of code with them into the header files and eventually into
many source files.
This patch removes those includes and adds a missing include to
storage_proxy.cc.
The fact that the compiler didn't complain is an indication of the
problematic nature of those includes in the first place.
Before this patch, a change in metrics.hh would cause 169 files to
recompile; after this change, 17.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
We need 64-bit year and days representation to support the boundary
values of the CQL data type, which is implemented using the Joda Time
library's DateTime type.
Will allow us to calculate the hash once and use it on many filters
instead of calculating the hash for each filter separately.
Another change is to avoid precomputing all indexes during filter
operations, and to have a for_each_index() template instead, which
invokes a functor.
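A sketch of both ideas, assuming double hashing in the style used by
Cassandra-derived bloom filters (names hypothetical):

#include <cstdint>
#include <utility>

// Computed once per key, reusable across all filters.
using filter_hash = std::pair<uint64_t, uint64_t>;

// Instead of materializing an index array, invoke a functor per index.
template <typename Func>
void for_each_index(filter_hash h, unsigned hash_count, uint64_t num_bits,
                    Func&& f) {
    for (unsigned i = 0; i < hash_count; ++i) {
        f((h.first + i * h.second) % num_bits);
    }
}

An insert can set bits directly from the functor and a membership check
can record a miss, all without allocating an index array.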
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.
However, for some applications we would be interested in knowing about
memory build-up earlier, so we can start doing something about it before
we reach that condition.
This patch introduces soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.
There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.
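For illustration, how an application might consume the two conditions
(method names are assumptions):

struct reclaimer_state {
    bool over_hard_limit = false;   // threshold condition: writes block
    bool over_soft_limit = false;   // early warning: build-up detected
};

void start_reclaim(const reclaimer_state& r) {
    if (r.over_hard_limit) {
        // Must free memory now, e.g. flush the largest memtable.
    } else if (r.over_soft_limit) {
        // Pressure is building: begin flushing in the background.
    }
}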
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nanoseconds just so we
may need to convert it back if the histogram needs another unit.
This patch keeps everything as a duration until the last moment, and
then we'll convert when needed.
This was suggested by Amnon.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
We are tracking latencies in nanoseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.
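A minimal sketch of the idea (names hypothetical): the unit is a
template parameter defaulting to microseconds, and conversion happens
once, at the recording boundary.

#include <chrono>
#include <cstdint>
#include <vector>

template <typename Unit = std::chrono::microseconds>
struct unit_histogram {
    std::vector<int64_t> samples;

    void add(std::chrono::steady_clock::duration d) {
        // Convert here, into the histogram's own unit.
        samples.push_back(std::chrono::duration_cast<Unit>(d).count());
    }
};

A consumer that needs nanoseconds, like the estimated_histogram path,
would just instantiate unit_histogram<std::chrono::nanoseconds>.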
I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.
Fixes #1836
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.
With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.
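A sketch of the intended smoothing (hypothetical names): alpha derives
from the ratio tick_interval/interval, so inverting the ratio makes
alpha close to 1 and the average track each sample almost instantly.

#include <chrono>
#include <cmath>

struct moving_average_sketch {
    double alpha;
    double value = 0;

    moving_average_sketch(std::chrono::duration<double> interval,
                          std::chrono::duration<double> tick_interval)
        // A tick short relative to the averaging window => small alpha.
        : alpha(1.0 - std::exp(-tick_interval.count() / interval.count())) {}

    void tick(double sample) {
        value += alpha * (sample - value);  // move a fraction alpha per tick
    }
};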
This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.
Fixes #1837
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The moving_average constructor is defined like this:
moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)
But when it is time to initialize them, we do this:
... {tick_interval(), std::chrono::minutes(1)} ...
As can be seen, the interval and the tick interval are inverted. This
leads to the metrics being assigned bogus values.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.
This patch changes the LSA code so that it doesn't throw in case a zone
couldn't be created, but just returns a null pointer, which should be
more performant if the LSA memory cannot grow any more.
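The shape of the change, as a sketch with entirely hypothetical names
(the real code reserves memory through LSA's own facilities):

#include <cstddef>
#include <memory>
#include <new>

struct zone_sketch {
    void* base;     // ownership/cleanup elided in this sketch
    size_t size;
};

std::unique_ptr<zone_sketch> try_make_zone(size_t size) {
    // nothrow allocation: failure becomes a null result, not an exception.
    void* mem = ::operator new(size, std::nothrow);
    if (!mem) {
        return nullptr;  // caller stops growing LSA instead of unwinding
    }
    return std::unique_ptr<zone_sketch>(new zone_sketch{mem, size});
}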
Fixes #1394.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>