"Before, the logic for releasing writes blocked on dirty worked like this:
1) When the region group size changes, the group is not under pressure,
and there are some blocked requests, then schedule a request releasing task
2) The request releasing task, if there is no pressure, runs one request
and, if there are still blocked requests, schedules the next request
releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most 1; there is a single releasing thread.
However, if requests themselves change the size of the group, then each
such change schedules yet another request releasing task, growing the task
queue by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may therefore start many request
releasing tasks due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The request releasing tasks may start to dominate other processes in the
system. When the number of scheduled tasks reaches 1000, polling stops and
the server becomes unresponsive until all of the released requests are done,
which happens either when they start to block on dirty memory again or when
we run out of blocked requests. It may take a while to reach the pressure
condition after a memtable flush if it brings virtual dirty much below the
threshold, which is currently the case for workloads with overwrites
producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
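A minimal sketch of the new scheme, assuming seastar's condition_variable
and keep_doing(); the blocked_request type and the member names are
illustrative, not the actual region_group API:

    #include <seastar/core/condition-variable.hh>
    #include <seastar/core/future-util.hh>
    #include <deque>

    struct blocked_request { void run(); }; // hypothetical request type

    struct region_group {
        seastar::condition_variable _relief;  // illustrative members
        std::deque<blocked_request> _blocked;
        bool under_pressure() const;

        // One fiber per group replaces the per-update tasks: however
        // often the group size changes, only this fiber releases
        // requests, so the task queue cannot grow without bound.
        seastar::future<> release_fiber() {
            return seastar::keep_doing([this] {
                // Sleep until notify_relief() reports lifted pressure.
                return _relief.wait([this] {
                    return !under_pressure() && !_blocked.empty();
                }).then([this] {
                    // Execute blocked requests until pressure reoccurs.
                    while (!_blocked.empty() && !under_pressure()) {
                        auto req = std::move(_blocked.front());
                        _blocked.pop_front();
                        req.run();
                    }
                });
            });
        }

        // Called when the pressure condition clears.
        void notify_relief() {
            _relief.signal();
        }
    };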
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
"This series introduces support for counters. The implementation of
counters more or less follows the design described on our wiki page [1].
Counter cells contain many shards; each replica can modify and announce
new versions of only the shards that it owns. Historically, there were
three types of shards: local, remote and global. These patches add
support only for the global ones.
[1] https://github.com/scylladb/scylla/wiki/Counters
Currently, counters are enabled only as an experimental feature, as there
are still several things that need to be done before they become production
ready. Namely, the performance is expected to be quite poor (especially
for writes), there is no proper tracing support, and timed out counter
requests may not be recognized and dropped early. There are also no
counter-related metrics.
However, apart from these problems there are no other missing parts of
the counter implementation, and counters are expected to work correctly.
Fixes #577."
* 'pdziepak/counters/v3-rebased' of github.com:cloudius-systems/seastar-dev: (38 commits)
perf_simple_query: add counter tables tests
thrift: add support for counter operations
cql3: allow counters in CREATE TABLE statements
cql3: selection: do not panic when seeing counters
storage_proxy: support counter updates
storage_proxy: add get_live_endpoints()
cql3: add counter increment and decrement operations
db: add operations for applying counter updates
counters: implement transforming counter deltas to shards
add infrastructure for locking counter cells
add fnv1a hasher
position_in_partition: add feed_hash()
position_in_partition: add functions for querying object type
types: make counter_type_impl report its cql3_type
transport: encode counters as long_type
mutation_partition: make for_each_cell() accessible outside source file
messaging_service: add COUNTER_MUTATION verb
storage_service: add COUNTERS feature
idl: add idl description of consistency level
schema: make is_counter() return correct value
...
The leader receives counter updates as deltas which have to be
transformed to counter shards. In order to do that, the current local
shard of the modified counter cell needs to be read, its logical clock
incremented, and its value modified by the specified delta.
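A minimal sketch of this transformation; the shard/counter_cell layout
and all names below are illustrative, not the actual scylla types:

    #include <cstdint>
    #include <map>

    struct shard { int64_t logical_clock = 0; int64_t value = 0; };
    using counter_cell = std::map<uint64_t, shard>; // replica id -> shard

    // On the leader: read the current local shard (if any), bump its
    // logical clock to mark a new version, and fold the requested
    // delta into the value. The resulting shard replaces the old one.
    void apply_delta(counter_cell& cell, uint64_t local_replica_id, int64_t delta) {
        shard& s = cell[local_replica_id]; // default-constructed if absent
        s.logical_clock += 1;
        s.value += delta;
    }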
The leader receives counter updates in the form of deltas which need to be
transformed to counter shards. In order to do that, the node needs to
read the current state of the modified counter cells. Since this is
essentially a read-modify-write operation, an appropriate locking
mechanism is needed.
The counter cell locker introduced in this patch uses a hashtable of
partition entries, each containing a hashtable of cell entries. Inside a
cell entry there is a semaphore used for synchronization. Once no longer
needed, cell entries and partition entries are removed.
In order to avoid deadlocks, cell entries are always locked in the same
order, which is the lexicographical order of (clustering key, column id)
pairs. Note that schema changes are not a problem, since they cannot
change the ordering of such pairs.
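A minimal sketch of the ordering rule, assuming a per-cell
seastar::semaphore; the key representation and container layout are
illustrative and do not show the real locker's hashtables:

    #include <seastar/core/future-util.hh>
    #include <seastar/core/semaphore.hh>
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <tuple>
    #include <vector>

    struct cell_entry {
        seastar::semaphore sem{1}; // one lock holder per cell
    };

    struct cell_ref {
        std::string clustering_key; // illustrative key representation
        uint32_t column_id;
        cell_entry* entry;
    };

    // Acquire the cell semaphores in lexicographical (clustering key,
    // column id) order. Since every update locks in this same order,
    // no two updates can wait on each other in a cycle.
    seastar::future<> lock_cells(std::vector<cell_ref> cells) {
        std::sort(cells.begin(), cells.end(), [] (const cell_ref& a, const cell_ref& b) {
            return std::tie(a.clustering_key, a.column_id)
                 < std::tie(b.clustering_key, b.column_id);
        });
        return seastar::do_with(std::move(cells), [] (std::vector<cell_ref>& cells) {
            return seastar::do_for_each(cells, [] (cell_ref& c) {
                return c.entry->sem.wait(1);
            });
        });
    }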
Support for deletion of counters is limited in that once deleted, they
cannot be used again (i.e. the tombstone always wins, regardless of the
timestamp). The logic responsible for merging two counter cells already
makes sure that tombstones are handled properly, but it is also
necessary to ensure that higher level tombstones always cover counters.
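A sketch of that rule with illustrative types; the point is that, unlike
for regular cells, the timestamp comparison is skipped entirely:

    #include <cstdint>
    #include <optional>

    // For regular cells the higher timestamp wins; for counters any
    // tombstone covers the cell, even a tombstone with a lower
    // timestamp than the cell's, so a deleted counter cannot be
    // resurrected by a later write.
    bool counter_cell_is_live(int64_t cell_timestamp,
                              std::optional<int64_t> tombstone_timestamp) {
        (void)cell_timestamp; // deliberately ignored for counters
        return !tombstone_timestamp.has_value();
    }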
Live counter cells are collections of shards, each one representing the
sum of all operations performed by a particular replica. This commit
introduces an in-memory representation of counters as well as basic
operations such as merge, difference and hashing.
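A minimal sketch of merging and reading a cell, restating the
illustrative replica-id-to-shard layout used above (not the actual
types):

    #include <cstdint>
    #include <map>

    struct shard { int64_t logical_clock = 0; int64_t value = 0; };
    using counter_cell = std::map<uint64_t, shard>; // replica id -> shard

    // Merging keeps, for each replica id, the shard with the higher
    // logical clock: that shard is the newer version announced by the
    // owning replica.
    counter_cell merge(const counter_cell& a, const counter_cell& b) {
        counter_cell result = a;
        for (const auto& [id, s] : b) {
            auto [it, inserted] = result.try_emplace(id, s);
            if (!inserted && s.logical_clock > it->second.logical_clock) {
                it->second = s;
            }
        }
        return result;
    }

    // The counter's current value is the sum over all shard values.
    int64_t counter_value(const counter_cell& cell) {
        int64_t sum = 0;
        for (const auto& [id, s] : cell) {
            sum += s.value;
        }
        return sum;
    }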
* seastar c1dbd89...f07f8ed (3):
> Merge "Introduce when_all_succeed()" from Paweł
> tests: adjust collectd test for metric API change
> Merge "DNS query support" from Calle
Before, the logic for releasing writes blocked on dirty worked like
this:
1) When the region group size changes, the group is not under
pressure, and there are some blocked requests, then schedule a
request releasing task
2) The request releasing task, if there is no pressure, runs one
request and, if there are still blocked requests, schedules the
next request releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most 1; there is a
single thread of execution.
However, if requests themselves change the size of the group, then
each such change schedules yet another request releasing task,
growing the task queue by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when they contain sparse segments). Compaction may
therefore start many request releasing tasks due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The request releasing tasks may start to dominate other processes in
the system. When the number of scheduled tasks reaches 1000, polling
stops and the server becomes unresponsive until all of the released
requests are done, which happens either when they start to block on
dirty memory again or when we run out of blocked requests. It may
take a while to reach the pressure condition after a memtable flush
if it brings virtual dirty much below the threshold, which is
currently the case for workloads with overwrites producing sparse
regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
The logic for notification across the hierarchy was replaced by
calling region_group::notify_relief() from region_group::update() on
the broadest relieved group.
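A minimal sketch of that notification path, with illustrative members;
the real hierarchy walk differs in detail:

    #include <sys/types.h>
    #include <cstddef>

    struct region_group {
        region_group* _parent = nullptr; // illustrative members
        size_t _total_memory = 0;
        bool under_pressure() const;
        void notify_relief();            // wakes the releasing fiber

        // Apply the size change up the hierarchy, remember the
        // broadest (outermost) group whose pressure condition just
        // cleared, and wake only that group's releasing fiber.
        void update(ssize_t delta) {
            region_group* broadest_relieved = nullptr;
            for (auto* rg = this; rg; rg = rg->_parent) {
                bool was_under_pressure = rg->under_pressure();
                rg->_total_memory += delta;
                if (was_under_pressure && !rg->under_pressure()) {
                    broadest_relieved = rg;
                }
            }
            if (broadest_relieved) {
                broadest_relieved->notify_relief();
            }
        }
    };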
Hard pressure was only signalled on a region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop, rather than stopping
when enough is allocated to cause pressure:
while (!gr.under_pressure()) {
    region.allocate(...);
}
It's cleaner if the pressure notification works not only when
run_when_memory_available() is used, but whenever the condition
changes, like we do for soft pressure.
There is a comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem beneficial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
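A minimal sketch of the resulting behavior, with illustrative threshold
members and hypothetical notification hooks:

    #include <sys/types.h>
    #include <cstddef>

    struct region_group {
        size_t _total_memory = 0;    // illustrative members
        size_t _soft_limit = 0;
        size_t _hard_limit = 0;
        void notify_soft_pressure(); // hypothetical notification hooks
        void notify_hard_pressure();

        // After the change, update() itself checks the hard threshold
        // and fires the notification, mirroring soft pressure, instead
        // of relying on a later run_when_memory_available() call.
        void update(ssize_t delta) {
            _total_memory += delta;
            if (_total_memory > _soft_limit) {
                notify_soft_pressure();
            }
            if (_total_memory > _hard_limit) {
                notify_hard_pressure();
            }
        }
    };

With pressure signalled from update(), under_pressure() becomes true as
soon as the threshold is crossed, so the allocation loop above terminates.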