"Before, the logic for releasing writes blocked on dirty worked like this:
1) When the region group size changes, the group is not under pressure,
and there are some blocked requests, then schedule a request releasing task
2) The request releasing task, if there is no pressure, runs one request
and, if there are still blocked requests, schedules the next request
releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single releasing thread.
However, if requests themselves change the size of the group, then each
such change schedules yet another request releasing task, growing the task
queue by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may therefore start many request
releasing tasks due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The request releasing tasks may start to dominate other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and
the server becomes unresponsive until all of the released requests are done,
which happens either when they start to block on dirty memory again or when
we run out of blocked requests. It may take a while to reach the pressure
condition after a memtable flush if it brings virtual dirty much below the
threshold, which is currently the case for workloads with overwrites
producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
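A minimal sketch of the new scheme, assuming seastar's condition_variable
and keep_doing(); the blocked_request type and the member names are
illustrative, not the actual region_group API:

    #include <seastar/core/condition-variable.hh>
    #include <seastar/core/future-util.hh>
    #include <deque>

    struct blocked_request { void run(); }; // hypothetical request type

    struct region_group {
        seastar::condition_variable _relief;  // illustrative members
        std::deque<blocked_request> _blocked;
        bool under_pressure() const;

        // One fiber per group replaces the per-update tasks: however
        // often the group size changes, only this fiber releases
        // requests, so the task queue cannot grow without bound.
        seastar::future<> release_fiber() {
            return seastar::keep_doing([this] {
                // Sleep until notify_relief() reports lifted pressure.
                return _relief.wait([this] {
                    return !under_pressure() && !_blocked.empty();
                }).then([this] {
                    // Execute blocked requests until pressure reoccurs.
                    while (!_blocked.empty() && !under_pressure()) {
                        auto req = std::move(_blocked.front());
                        _blocked.pop_front();
                        req.run();
                    }
                });
            });
        }

        // Called when the pressure condition clears.
        void notify_relief() {
            _relief.signal();
        }
    };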
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
"This series introduces support for counters. The implementation of
counters more or less follows the design described on our wiki page [1].
Counter cells contain many shards; each replica can modify and announce
new versions of only the shards that it owns. Historically, there were
three types of shards: local, remote and global. These patches add
support only for the global ones.
[1] https://github.com/scylladb/scylla/wiki/Counters
Currently, counters are enabled only as an experimental feature, as there
are still several things that need to be done before they become production
ready. Namely, the performance is expected to be quite poor (especially
for writes), there is no proper tracing support, and timed out counter
requests may not be recognized and dropped early. There are also no
counter-related metrics.
However, apart from these problems there are no other missing parts of
the counter implementation, and counters are expected to work correctly.
Fixes #577."
* 'pdziepak/counters/v3-rebased' of github.com:cloudius-systems/seastar-dev: (38 commits)
perf_simple_query: add counter tables tests
thrift: add support for counter operations
cql3: allow counters in CREATE TABLE statements
cql3: selection: do not panic when seeing counters
storage_proxy: support counter updates
storage_proxy: add get_live_endpoints()
cql3: add counter increment and decrement operations
db: add operations for applying counter updates
counters: implement transforming counter deltas to shards
add infrastructure for locking counter cells
add fnv1a hasher
position_in_partition: add feed_hash()
position_in_partition: add functions for querying object type
types: make counter_type_impl report its cql3_type
transport: encode counters as long_type
mutation_partition: make for_each_cell() accessible outside source file
messaging_service: add COUNTER_MUTATION verb
storage_service: add COUNTERS feature
idl: add idl description of consistency level
schema: make is_counter() return correct value
...
The leader receives counter updates as deltas which have to be
transformed to counter shards. In order to do that, the current local
shard of the modified counter cell needs to be read, its logical clock
incremented, and its value modified by the specified delta.
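A minimal sketch of this transformation; the shard/counter_cell layout
and all names below are illustrative, not the actual scylla types:

    #include <cstdint>
    #include <map>

    struct shard { int64_t logical_clock = 0; int64_t value = 0; };
    using counter_cell = std::map<uint64_t, shard>; // replica id -> shard

    // On the leader: read the current local shard (if any), bump its
    // logical clock to mark a new version, and fold the requested
    // delta into the value. The resulting shard replaces the old one.
    void apply_delta(counter_cell& cell, uint64_t local_replica_id, int64_t delta) {
        shard& s = cell[local_replica_id]; // default-constructed if absent
        s.logical_clock += 1;
        s.value += delta;
    }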
The leader receives counter updates in the form of deltas which need to be
transformed to counter shards. In order to do that, the node needs to
read the current state of the modified counter cells. Since this is
essentially a read-modify-write operation, an appropriate locking
mechanism is needed.
The counter cell locker introduced in this patch uses a hashtable of
partition entries, each containing a hashtable of cell entries. Inside a
cell entry there is a semaphore used for synchronization. Once no longer
needed, cell entries and partition entries are removed.
In order to avoid deadlocks, cell entries are always locked in the same
order, which is the lexicographical order of (clustering key, column id)
pairs. Note that schema changes are not a problem, since they cannot
change the ordering of such pairs.
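A minimal sketch of the ordering rule, assuming a per-cell
seastar::semaphore; the key representation and container layout are
illustrative and do not show the real locker's hashtables:

    #include <seastar/core/future-util.hh>
    #include <seastar/core/semaphore.hh>
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <tuple>
    #include <vector>

    struct cell_entry {
        seastar::semaphore sem{1}; // one lock holder per cell
    };

    struct cell_ref {
        std::string clustering_key; // illustrative key representation
        uint32_t column_id;
        cell_entry* entry;
    };

    // Acquire the cell semaphores in lexicographical (clustering key,
    // column id) order. Since every update locks in this same order,
    // no two updates can wait on each other in a cycle.
    seastar::future<> lock_cells(std::vector<cell_ref> cells) {
        std::sort(cells.begin(), cells.end(), [] (const cell_ref& a, const cell_ref& b) {
            return std::tie(a.clustering_key, a.column_id)
                 < std::tie(b.clustering_key, b.column_id);
        });
        return seastar::do_with(std::move(cells), [] (std::vector<cell_ref>& cells) {
            return seastar::do_for_each(cells, [] (cell_ref& c) {
                return c.entry->sem.wait(1);
            });
        });
    }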
Support for deletion of counters is limited in that once deleted, they
cannot be used again (i.e. the tombstone always wins, regardless of the
timestamp). The logic responsible for merging two counter cells already
makes sure that tombstones are handled properly, but it is also
necessary to ensure that higher level tombstones always cover counters.
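A sketch of that rule with illustrative types; the point is that, unlike
for regular cells, the timestamp comparison is skipped entirely:

    #include <cstdint>
    #include <optional>

    // For regular cells the higher timestamp wins; for counters any
    // tombstone covers the cell, even a tombstone with a lower
    // timestamp than the cell's, so a deleted counter cannot be
    // resurrected by a later write.
    bool counter_cell_is_live(int64_t cell_timestamp,
                              std::optional<int64_t> tombstone_timestamp) {
        (void)cell_timestamp; // deliberately ignored for counters
        return !tombstone_timestamp.has_value();
    }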
Live counter cells are collections of shards, each one representing the
sum of all operations performed by a particular replica. This commit
introduces an in-memory representation of counters as well as basic
operations such as merge, difference and hashing.
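A minimal sketch of merging and reading a cell, restating the
illustrative replica-id-to-shard layout used above (not the actual
types):

    #include <cstdint>
    #include <map>

    struct shard { int64_t logical_clock = 0; int64_t value = 0; };
    using counter_cell = std::map<uint64_t, shard>; // replica id -> shard

    // Merging keeps, for each replica id, the shard with the higher
    // logical clock: that shard is the newer version announced by the
    // owning replica.
    counter_cell merge(const counter_cell& a, const counter_cell& b) {
        counter_cell result = a;
        for (const auto& [id, s] : b) {
            auto [it, inserted] = result.try_emplace(id, s);
            if (!inserted && s.logical_clock > it->second.logical_clock) {
                it->second = s;
            }
        }
        return result;
    }

    // The counter's current value is the sum over all shard values.
    int64_t counter_value(const counter_cell& cell) {
        int64_t sum = 0;
        for (const auto& [id, s] : cell) {
            sum += s.value;
        }
        return sum;
    }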
* seastar c1dbd89...f07f8ed (3):
> Merge "Introduce when_all_succeed()" from Paweł
> tests: adjust collectd test for metric API change
> Merge "DNS query support" from Calle
Before, the logic for releasing writes blocked on dirty worked like
this:
1) When the region group size changes, the group is not under
pressure, and there are some blocked requests, then schedule a
request releasing task
2) The request releasing task, if there is no pressure, runs one
request and, if there are still blocked requests, schedules the
next request releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most 1; there is a
single thread of execution.
However, if requests themselves change the size of the group, then
each such change schedules yet another request releasing task,
growing the task queue by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when they contain sparse segments). Compaction may
therefore start many request releasing tasks due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The request releasing tasks may start to dominate other processes in
the system. When the number of scheduled tasks reaches 1000, polling
stops and the server becomes unresponsive until all of the released
requests are done, which happens either when they start to block on
dirty memory again or when we run out of blocked requests. It may
take a while to reach the pressure condition after a memtable flush
if it brings virtual dirty much below the threshold, which is
currently the case for workloads with overwrites producing sparse
regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
The logic for notification across the hierarchy was replaced by
calling region_group::notify_relief() from region_group::update() on
the broadest relieved group.
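A minimal sketch of that notification path, with illustrative members;
the real hierarchy walk differs in detail:

    #include <sys/types.h>
    #include <cstddef>

    struct region_group {
        region_group* _parent = nullptr; // illustrative members
        size_t _total_memory = 0;
        bool under_pressure() const;
        void notify_relief();            // wakes the releasing fiber

        // Apply the size change up the hierarchy, remember the
        // broadest (outermost) group whose pressure condition just
        // cleared, and wake only that group's releasing fiber.
        void update(ssize_t delta) {
            region_group* broadest_relieved = nullptr;
            for (auto* rg = this; rg; rg = rg->_parent) {
                bool was_under_pressure = rg->under_pressure();
                rg->_total_memory += delta;
                if (was_under_pressure && !rg->under_pressure()) {
                    broadest_relieved = rg;
                }
            }
            if (broadest_relieved) {
                broadest_relieved->notify_relief();
            }
        }
    };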
Hard pressure was only signalled on a region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop, rather than stopping
when enough is allocated to cause pressure:
while (!gr.under_pressure()) {
    region.allocate(...);
}
It's cleaner if the pressure notification works not only when
run_when_memory_available() is used, but whenever the condition
changes, like we do for soft pressure.
There is a comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem beneficial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
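A minimal sketch of the resulting behavior, with illustrative threshold
members and hypothetical notification hooks:

    #include <sys/types.h>
    #include <cstddef>

    struct region_group {
        size_t _total_memory = 0;    // illustrative members
        size_t _soft_limit = 0;
        size_t _hard_limit = 0;
        void notify_soft_pressure(); // hypothetical notification hooks
        void notify_hard_pressure();

        // After the change, update() itself checks the hard threshold
        // and fires the notification, mirroring soft pressure, instead
        // of relying on a later run_when_memory_available() call.
        void update(ssize_t delta) {
            _total_memory += delta;
            if (_total_memory > _soft_limit) {
                notify_soft_pressure();
            }
            if (_total_memory > _hard_limit) {
                notify_hard_pressure();
            }
        }
    };

With pressure signalled from update(), under_pressure() becomes true as
soon as the threshold is crossed, so the allocation loop above terminates.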