Before, the logic for releasing writes blocked on dirty worked like
this:
1) When region group size changes and it is not under pressure and
there are some requests blocked, then schedule request releasing
task
2) request releasing task, if no pressure, runs one request and if
there are still blocked requests, schedules next request
releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The amount of scheduled tasks is at most 1, there is a
single thread of excution.
However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when contains sparse segments). Compaction may start
many request releasing threads due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency becuase timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
The logic for notification across hierachy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop rather than stopping
when engouh is allocated to cause pressure:
while (!gr.under_pressure()) {
region.allocate(...);
}
It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever conditino changes,
like we do for the soft pressure.
There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem benefitial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
This reverts commit d61002cc33.
Introduced a regression in row_cache_alloc_stress.
The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
parse_time() adds hourse, minutes, etc to a final value 'result'.
However, it is of type std::chrono::nanoseconds which means it is not
zeroed at initialization unless it is explicitly asked to do so.
Fixed debug mode failures in types_tyes and cql_query_test.
Message-Id: <20170125155239.1253-1-pdziepak@scylladb.com>
"This patchset properly implements range_tombstone_list::difference(),
which was very broken. We add unit tests for the function and ensure
we always randomly generate range_tombstones in other unit tests so
other problems aren't hidden."
This patch ensures the mutation_merger emits any deferred tombstones
that it still may be holding before closing the stream.
Together with the range_tombstone_list: Properly implement
difference() patch set, this fixes breakage of streamed_mutation_test
and row_cache_test.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170123195643.9876-1-duarte@scylladb.com>
* seastar 6d80c6a...397685c (4):
> Merge "add label to the io_queue" from Amnon
> rpc: Modify the shutdown code to wait and handle exceptions
> tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
> core: Add counter for polls
The difference method wasn't properly implemented. The version in this
patch correctly computes the difference and returns a range tombstone
list contains those range tombstones in "this" but absent from the
other, specified range tombstone list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 38aaa4a...6d80c6a (2):
> DPDK: Change the metrics registration with label support
> metric: Fix the error: could not convert {...} from <brace-enclosed initializer list> to struct metric_definition_impl
Add a counter field to RELEASE, just before the date, and fix it at zero.
This allows custom package builds to override it in a way that sorts before
the official packages.
Example:
Official release: 1.6.0-0.20160120.<githash>
Custom release 1: 1.6.0-1.avi.20160121.<githash>
Custom release 2: 1.6.0-2.avi.20160122.<githash>
The counter (0/1/2) ensures that the build number dominates over the date
when sorting.
Message-Id: <20170122102814.19649-1-avi@scylladb.com>
From Avi:
In many cases, batch statements are used to mutate a single partition, or
a number of partitions that is smaller than the number of statements within
the batch. We can detect this case and reduce the numbers of mutations
applied, and in some cases, convert a logged batch into an unlogged batch.
Ref #1689.
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().
With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
Batch statements are often used to insert multiple rows into the same
partition. Recognize this case and merge mutations to the same partition.
If the result is a single mutation, there is an additional win (already
present in the code), where a logged batch can be converted into an unlogged
batch.
Ref #1689.
Add a boolean to short circuit the read path on empty range
hoping for some speedup.
tested in read write with cs using:
cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost
Will do some additional benchmark.
Fixes#1056
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170118194451.16836-1-benoit@scylladb.com>
That's because a single shard is used to calculate generation for new
sstables in upload directory, and that will result in that single shard
sharing all the resources with other shards.
For refresh without upload dir, it currently works fine because we
reshuffle column family dir instead.
flush_upload_dir() is now a free function, takes a distributed database
object, and uses calculate_shard_from_sstable_generation() to decide
which shard will move sstable using its own generation namespace.
Fixes#2008.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimated the on-disk size of sstables when resharding
This fixes only a temporary miscount of the current load, since shared
sstables are eventually re-written, but a fixes a permanent miscount
of the total load.
Refs #1592
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:
Before:
avg = 1524.283148
stdev = 11021.021118
min = 12.934000
max = 144356.000000
sum = 257603.852000
samples = 169
After:
avg = 1317.362414
stdev = 1913.542802
min = 263.935000
max = 19244.600000
sum = 175209.201000
samples = 133
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
* seastar 240b0bf...ff098c8 (15):
> metrics::impl::shard(): check if reactor is initialized before using it
> reactor: introduce engine_is_ready()
> fix metric name
> Merge "Add label support to the metric layer" from Amnon
> core: Avoid memory leak when submission to syscall_work_queue fails
> core: Avoid memory leak when submission to smp_message_queue fails
> core: append_challenged_posix_file_impl: Make exception-safe
> Merge "Log backtrace in report_failed_future" from Tomasz
> install-dependencies.sh: add systemtap-sdt-dev to Ubuntu/Debian dependencies
> core: add fsqual.cc/.hh to core
> dpdk: Fix compile error with rte_pci.h
> fstream_test: fix spurious failures due to BOOST_REQUIRE_EQUAL thread-unsafety
> reactor: unregister metrics of queue on shard 0
> build: track system header changes too
> Prometheus: do not rely on collectd for the hostname
close() operation is like a destructor, it cannot fail. It just
reports errors, but close itself succeeds. So we should proceed with
the closing even if it fails.
Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>
As Tomek pointed out, previous code, regardless of version mismatch, of generating
comparator description string was not correct (as in: in sync with origin).
This modifies it to look at
1.) Actual clustring size
2.) Compound-ness
3.) Dense-ness
to determine whether we should generate a compound desc, and whether it
should contain a trailing utf8-desc type.
v2: Simplify non-dense base column addition and ensure it handles
thrift non-utf8 (as per comments from tomek)
Message-Id: <1484670171-18362-1-git-send-email-calle@scylladb.com>
As the metrics migration progressed, some include to scollectd.hh left
behind.
Because of the nature of the scollecd implementation those include
brings alot of code with them to the header files and eventually to many
source file.
This patch remove those include and add a missing include to
storage_proxy.cc.
The reason the compiler didn't complain is an indication to the
problematic nature of those include in the first place.
Before this patch, change in metrics.hh would cause 169 files to
compile, after this change 17.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
Fixes#2019
According to the Java driver and cassandra, all versions < 3
include the PK in the comparator descriptor string.
This broke for us when bumping the cassandra version 2.1 -> 2.2
Message-Id: <1484657580-14411-1-git-send-email-calle@scylladb.com>