scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 06:05:53 +00:00

Author	SHA1	Message	Date
Glauber Costa	4098831ebc	commitlog: wait for pending allocations to finish before closing gate. allocations may enter the gate, so it would be wise for us to wait for them. Fixes #1860 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com> (cherry picked from commit `21c1e2b48c`)	2016-11-20 20:00:32 +02:00
Avi Kivity	affc0d9138	Merge "get rid of memtable size parameter and rework flush logic" from Glauber "This patchset allows Scylla to determine the size of a memtable instead of relying on the user-provided memtable_cleanup_threshold. It does that by allowing the region_group to specify a soft limit which will trigger the allocation as early as it is reached. Given that, we'll keep the memtables in memory for as long as it takes to reach that limit, regardless of the individual size of any single one of them. That limit is set to 1/4 of dirty memory. That's the same as last submission, except this time I have run some experiments to gauge behavior of that versus 1/2 of dirty memory, which was a preferred theoretical value. After that is done, the flush logic is reworked to guarantee that flushes are not initiated if we already have one memtable under flush. That allows us to better take advantage of coalescing opportunities with new requests and prevents the pending memtable explosion that is ultimately responsible for Issue 1817. I have run mainly two workloads with this. The first one a local RF=1 workload with large partitions, sized 128kB and 100 threads. 
The results are: Before: op rate : 632 [WRITE:632] partition rate : 632 [WRITE:632] row rate : 632 [WRITE:632] latency mean : 157.8 [WRITE:157.8] latency median : 115.5 [WRITE:115.5] latency 95th percentile : 486.7 [WRITE:486.7] latency 99th percentile : 534.8 [WRITE:534.8] latency 99.9th percentile : 599.0 [WRITE:599.0] latency max : 722.6 [WRITE:722.6] Total partitions : 189667 [WRITE:189667] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END After: op rate : 951 [WRITE:951] partition rate : 951 [WRITE:951] row rate : 951 [WRITE:951] latency mean : 104.8 [WRITE:104.8] latency median : 102.5 [WRITE:102.5] latency 95th percentile : 155.8 [WRITE:155.8] latency 99th percentile : 177.8 [WRITE:177.8] latency 99.9th percentile : 686.4 [WRITE:686.4] latency max : 1081.4 [WRITE:1081.4] Total partitions : 285324 [WRITE:285324] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END The other workload was the workload described in #1817. And the result is that we now have a load that is very stable around 100k ops/s and hardly any timeouts, instead of the 1.4 baseline of wild variations around 100k ops/s and lots of timeouts, or the deep reduction of 1.5-rc1." * 'issue-1817-v4' of github.com:glommer/scylla: database: rework memtable flush logic get rid of max_memtable_size pass a region to dirty_memory_manager accounting API memtable: add a method to expose the region_group logalloc: allow region group reclaimer to specify a soft limit database: remove outdated comment database: uphold virtual dirty for system tables. (cherry picked from commit `5d067eebf2`)	2016-11-17 14:41:23 +02:00
Glauber Costa	a13c410749	commitlog: cycle based on total size, not on mutation size We calculate two sizes during the allocation: "size", which is the in-segment size of this mutation, and "s", which is that plus the overhead. cycle() must be called with the latter, not the former, as doing otherwise may lead to buffer overflows. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>	2016-10-21 18:57:41 +03:00
Glauber Costa	d9875784a1	commitlog: do not wait on pending operations for batch mode This was explicitly mentioned in my set as gone in one of the versions. Somehow it came back in the final version - sorry about that. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>	2016-10-21 17:27:16 +03:00
Glauber Costa	d5618c6ace	commitlog: add total_operations type for requests_blocked_memory Current tracker for pending allocations is a queue_size GAUGE. Add a total_operations version so we have more insight on what's going on. It will be called requests_blocked_memory for consistency with other subsystems that track similar things. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-20 09:25:38 -04:00
Glauber Costa	1578d7363a	commitlog: rework blocking logic The current incarnation of commitlog establishes a maximum amount of writes that can be in-flight, and blocks new requests after that limit is reached. That is obviously something we must do, but the current approach to it is problematic for two main reasons: 1) It forces the requests that trigger a write to wait on the current write to finish. That is excessive; ideally we would wait for one particular write to finish, not necessarily the current one. That is made worse by the fact that when a write is followed by a flush (happens when we move to a new segment), then we must wait for all writes in that segment to finish. 2) It casts concurrency in terms of writes instead of memory, which makes the aforementioned problem a lot worse: if we have very big buffers in flight and we must wait for them to finish, that can take a long time, often in the order of seconds, causing timeouts. The approach taken by this patch is to replace the _write_semaphore with a request_controller. This data structure will account the amount of memory used by the buffers and set a limit on it. New allocations will be held until we go below that limit, and will be released as soon as this happens. This guarantees that the latencies introduced by this mechanism are spread out a lot better among requests and will keep higher percentile latencies in check. To test this, I have run a workload that times out frequently. That workload uses 10 threads to write 100 partitions (to isolate from the effects of the memtable introduced latencies) in a loop and each partition is 2MB in size. 
After 10 minutes running this load, we are left with the following percentiles: latency mean : 51.9 [WRITE:51.9] latency median : 9.8 [WRITE:9.8] latency 95th percentile : 125.6 [WRITE:125.6] latency 99th percentile : 1184.0 [WRITE:1184.0] latency 99.9th percentile : 1991.2 [WRITE:1991.2] latency max : 2338.2 [WRITE:2338.2] After this patch: latency mean : 54.9 [WRITE:54.9] latency median : 43.5 [WRITE:43.5] latency 95th percentile : 126.9 [WRITE:126.9] latency 99th percentile : 253.9 [WRITE:253.9] latency 99.9th percentile : 364.6 [WRITE:364.6] latency max : 471.4 [WRITE:471.4] Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:56:36 -04:00
Glauber Costa	aec724bbda	commitlog: factor out code for checking mutation size In a subsequent patch, I'll use this code in a different place. To prepare for that, we move it out as a method. It also fits a lot better inside the segment manager, so move it there. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	a50996f376	commitlog: calculate segment-independent size of mutations Goal is to calculate a size that is less than or equal to the segment-dependent size. This was originally written by Tomasz, and featured in his submission "commitlog: Handle overload more gracefully" Extracted here so it sits clearly in a different patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	0b7c9fa17f	commitlog: remove _needed_size It is mostly an optimization, and while it makes sense in this context, it won't soon as we'll stop waiting for the current cycle specifically to finish. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	6214bdeb66	commitlog: move segment_manager constructor outside the class definition We'll do that so we can, in following patches, use static members from the segment. Those are not defined at this point. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	299877f432	commitlog: add a counter for pending allocations We track the amount of pending allocations but we don't really export it. It will be crucial when we stop tracking pending writes. This patch exports it through a method instead of the totals structure, so we can easily change it. Current code probing pending_allocations (the api code) is also converted to use the public method instead of the totals struct. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Duarte Nunes	c19c633299	size_estimates_recorder: Increase estimate accuracy This patch uses the estimated_keys_for_range() function to get better estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 17:52:16 +02:00
Avi Kivity	c94fb1bf12	build: reduce inclusions of messaging_service.hh Remove inclusions from header files (primary offender is fb_utilities.hh) and introduce new messaging_service_fwd.hh to reduce rebuilds when the messaging service changes. Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>	2016-10-05 11:46:49 +03:00
Gleb Natapov	32989d1e66	Merge seastar upstream * seastar 2b55789...5b7252d (3): > Merge "rpc: serialize large messages into fragmented memory" from Gleb > Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz > test_runner: avoid nested optionals Includes patch from Gleb to adapt to seastar changes.	2016-09-28 17:34:16 +03:00
Gleb Natapov	26ae8e8365	implement listen_on_broadcast_address option When using multiple physical network interfaces, set this to true to listen on broadcast_address in addition to the listen_address, allowing nodes to communicate in both interfaces. Ignore this property if the network configuration automatically routes between the public and private networks such as EC2. Message-Id: <20160921094810.GA28654@scylladb.com>	2016-09-26 08:49:54 +03:00
Nadav Har'El	fe1ba753ce	Avoid semaphore's default initial value The fact that Seastar's semaphore has a default initializer of 1 if not explicitly initialized is confusing and unexpected and recently led to two bugs. So ScyllaDB should not rely on this default behavior, and specify the initial value of each semaphore explicitly. In several cases in the ScyllaDB code, the explicit initialization was missing, and this patch adds it. In one case (rate_limiter) I even think the default of 1 was a bit strange, and 0 makes more sense. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1474530745-23951-1-git-send-email-nyh@scylladb.com>	2016-09-24 19:25:02 +03:00
Glauber Costa	ffc2131c51	decouple estimated_histogram from sstables There is nothing really that fundamentally ties the estimated histogram to sstables. This patch gets rid of the few incidental ties. They are: - the namespace name, which is now moved to utils. Users inside sstables/ now need to add a namespace prefix, while the ones outside have to change it to the right one - sstables::merge, which has a very non-descriptive name to begin with, is changed to a more descriptive name that can live inside utils/ - the disk_types.hh include has to be removed - but it had no reason to be here in the first place. Todo, is to actually move the file outside sstables/. That is done in a separate step for clarity. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Avi Kivity	98226a14ac	Merge "Exception propagation writers in commitlog batch" "While periodic mode is an all-bets-off crap-shoot as far as knowing if data actually reached disk or not, batch mode is supposed to be somewhat more reliable/deterministic. Thus, if we get an exception writing/flushing the current buffer, we should propagate exceptions to all execution paths involved in this buffer. Flush queue can now (optionally) propagate exceptions to all clients, and commit log uses this to ensure that commit log writers in batch mode all generate exceptions on disk errors. Also includes some rudimentary tests for flush queue mechanisms. Note: other main user, sstable flushing, is not affected, as default mode is still to keep exceptions to individual worker continuations, not waiters."	2016-08-08 15:33:26 +03:00
Duarte Nunes	e0a43a82c6	system_keyspace: Correctly deal with wrapped ranges This patch ensures we correctly deal with ranges that wrap around when querying the size_estimates system table. Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470412433-7767-1-git-send-email-duarte@scylladb.com>	2016-08-05 19:17:00 +03:00
Avi Kivity	b0a275945f	Merge "Remove compact columns" from Duarte "The compact column is a dense schema's single regular column. Its existence has been a source of bugs, so this patchset removes the column_kind::compact_column, as well as further references to compact columns from the code base. Fixes #1542"	2016-08-05 12:39:23 +03:00
Duarte Nunes	cb0516a76c	schema: Remove compact_column concept This is a confusing one, and can be replaced by the fact that dense schemas have a single regular column. Ref #1542 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-03 17:21:41 +00:00
Duarte Nunes	529c3a3ae6	column_kind: Drop compact_column A compact column is a dense schema's single regular column. The fact that it is a different column_kind has led to various bugs (#1535, derived by the schema being dense and the column being regular. Fixes #1542 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-03 17:21:37 +00:00
Calle Wilund	0f9e868839	commitlog: Use exception propagation in flush_queue (for batch) Fixes: #1490 While periodic mode is an all-bets-off crap-shoot as far as knowing if data actually reached disk or not, batch mode is supposed to be somewhat more reliable/deterministic. Thus, if we get an exception writing/flushing the current buffer, we should propagate exceptions to all execution paths involved in this buffer. Thus, adding a mutation to commit log in batch, will now, if an error is generated, result in an exception to the caller, which should be interpreted as "data might not have been persisted". The failing segment is then closed, and we happily hope things will get better in the next. Which they probably won't. Missing: registration of some sort of "error-handling policy", similar to origin, which can either kill transports or shut down process. (A reasonable guess is that disk errors in commit log are not gonna be recoverable).	2016-08-03 14:49:43 +00:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Amnon Heiman	bb4268a8a5	Add prometheus API This patch adds the prometheus API: it adds the proto library to the compilation, adds an optional configuration parameter to change the prometheus listening port, and starts the prometheus API in main. To disable the prometheus API, set its listening port to 0. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>	2016-08-03 15:55:18 +03:00
Avi Kivity	75ee8fc2a7	size_estimates_recorder: adjust indentation	2016-07-30 20:10:12 +03:00
Avi Kivity	64d0cf58ea	size_estimates_recorder: unwrap ranges before searching for sstables column_family::select_sstables() requires unwrapped ranges, so unwrap them. Fixes crash with Leveled Compaction Strategy. Fixes #1507. Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469563488-14869-1-git-send-email-avi@scylladb.com>	2016-07-27 10:06:21 +03:00
Duarte Nunes	ecfa04da77	system_keyspace: Add query_size_estimates() function The query_size_estimates() function queries the size_estimates system table for a given keyspace and table, filtering out the token ranges according to the specified tokens. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:58 +00:00
Duarte Nunes	d984cc30bf	size_estimates_recorder: Fix stop() This patch fixes stop() by checking if the current CPU instead of whether the service is active (which it won't be at the time stop() is called). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:58 +00:00
Duarte Nunes	e16f3f2969	system_keyspace: Avoid pointers in range_estimates This patch makes range_estimates a proper struct, where tokens are represented as dht::tokens rather than dht::ring_position*. We also pass other arguments to update_ and clear_size_estimates by copy, since one will already be required. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:35 +00:00
Tomasz Grabiec	5e8f0efc85	schema_tables: Fix hang during keyspace drop Fixes #1484. We drop tables as part of keyspace drop. Table drop starts with creating a snapshot on all shards. All shards must use the same snapshot timestamp which, among other things, is part of the snapshot name. The timestamp is generated using supplied timestamp generating function (joinpoint object). The joinpoint object will wait for all shards to arrive and then generate and return the timestamp. However, we drop tables in parallel, using the same joinpoint instance. So joinpoint may be contacted by snapshotting shards of tables A and B concurrently, generating timestamp t1 for some shards of table A and some shards of table B. Later the remaining shards of table A will get a different timestamp. As a result, different shards may use different snapshot names for the same table. The snapshot creation will never complete because the sealing fiber waits for all shards to signal it, on the same name. The fix is to give each table a separate joinpoint instance. Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>	2016-07-21 19:14:57 +03:00
Paweł Dziepak	8a386a51bd	Merge "Don't cache wide partitions" from Piotr "When reading a partition try to read it all but once more bytes are read than a given limit we decide that partition is wide and we don't cache it. Instead we retry the read with clustering key filtering applied."	2016-07-21 10:24:25 +01:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Pekka Enberg	aff8cf319d	db/config: Start Thrift server by default We have Thrift support now so start the server by default. Message-Id: <1469002000-26767-1-git-send-email-penberg@scylladb.com>	2016-07-20 09:25:44 +01:00
Tomasz Grabiec	a0832f08d2	schema_tables: Add more logging Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:00 +03:00
Vlad Zolotarov	b36b69c1d6	service::storage_proxy: remove a default value for a tracing::trace_state_ptr parameter Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Vlad Zolotarov	baa6496816	service::storage_proxy: READ instrumentation: store trace state object in abstract_read_executor Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Duarte Nunes	9ffdf4a5cd	db: Implement size_estimates_recorder This patch implements the size_estimates_recorder, which periodically writes estimations for all the non-system column families in the size_estimates system table. The size_estimates_recorder class corresponds to the one in Cassandra's SizeEstimatesRecorder.java. Estimation is carried out by shard 0. Since we're estimating based on data in shared sstables, having multiple shards doing this would skew the results. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-19 09:44:58 +00:00
Duarte Nunes	f8f61cf246	system_keyspace: Record and clear size estimates This patch implements functions that allow the size_estimates system table to be updated and cleared. The size_estimates table is updated per schema with a set of token ranges and the associated estimations of how many partitions there are and their mean size. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Gleb Natapov	9cc076c9f3	storage_proxy: preserve endpoint's order while filtering local nodes for query filter_for_query() gets sorted by preference list of endpoints and should preserve that order after filtering out non local endpoints for local query. partition() does not guarantee this while stable_partition() does, so use it instead. Fixes #1450. Message-Id: <20160713100909.GM10767@scylladb.com>	2016-07-13 13:17:28 +03:00
Glauber Costa	73a70e6d0a	config: Use Scylla in user visible options We have imported most of our data about config options from Cassandra. Due to that, many options that mention the database by name are still using "Cassandra". Especially for the user visible options, which is something that a user sees, we should really be using Scylla here. This patch was created by automatically replacing every occurrence of "Cassandra" with "Scylla" and then later on discarding the ones in which the change didn't make sense (such as Unused options and mentions to the Cassandra documentation) Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>	2016-07-12 09:18:17 +03:00
Gleb Natapov	726b79ea91	messaging_service: enable internode_compression option Use LZ4 for internode compression if enabled. Message-Id: <20160711141734.GZ18455@scylladb.com>	2016-07-11 18:30:21 +03:00
Calle Wilund	4ab03e98cf	commitlog: Ensure we don't end up in a loop when we must wait for alloc Continuation reordering could cause us to repeatedly see the segment-local flag var even though actual write/sync ops are done. Can cause wild recursion without actual delayed continuation -> SOE. Fix by also checking queue status, since this is the wait object. Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>	2016-07-11 14:12:38 +03:00
Calle Wilund	14b0fe23c5	commitlog: Ensure we don't end up in a loop when we must wait for alloc Continuation reordering could cause us to repeatedly see the segment-local flag var even though actual write/sync ops are done. Can cause wild recursion without actual delayed continuation -> SOE. Fix by also checking queue status, since this is the wait object.	2016-07-11 07:45:36 +00:00
Tomasz Grabiec	8c4b5e4283	db: Avoiding checking bloom filters during compaction Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322.	2016-07-10 09:54:20 +02:00
Asias He	f4389349e4	config: Enable partitioner option Enable --partitioner option so that user can choose partitioner other than the default Murmur3Partitioner. Currently, only Murmur3Partitioner and ByteOrderedPartitioner are supported. When non-supported partitioner is specified, error will be propagated to user.	2016-07-08 17:44:55 +08:00
Glauber Costa	7169b727ea	move system tables to its own region In the spirit of what we are doing for the read semaphore, this patch moves system writes to its own dirty memory manager. Not only will it make sure that system tables will not be serialized by its own semaphore, but it will also put system tables in its own region group. Moving system tables to its own region group has the advantage that system requests won't be waiting during throttle behind a potentially big queue of user requests, since requests are tended to in FIFO order within the same region group. However, system tables being more controlled and predictable, we can actually go a step further and give them some extra reservation so they may not necessarily block even if under pressure (up to 10 MB more). Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Avi Kivity	76cc6408cd	Merge "feature check for seed node" from Asias "This series implements feature check for seed node.	2016-07-05 19:01:01 +03:00
Asias He	6f69963ef9	system_keyspace: Simplify load_host_ids implementation - Use plain loop instead of do_for_each - Use row.get_as() instead of row.template get_as() Message-Id: <3e108d3a6258c0caaf569eb9c79532d9789ea411.1467703722.git.asias@scylladb.com>	2016-07-05 09:47:21 +02:00
Asias He	3f31be58b6	system_keyspace: Simplify load_tokens implementation - Use plain loop instead of do_for_each - Use row.get_as() instead of row.template get_as() Message-Id: <f959ace4f30078695d383c849ed4520169228f97.1467703722.git.asias@scylladb.com>	2016-07-05 09:47:21 +02:00

705 Commits