scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 17:10:35 +00:00

Author	SHA1	Message	Date
Avi Kivity	0591303b72	Merge "avoid excessive memory usage during resharding" from Raphael "Intended to reduce memory usage when resharding by sharing sstable components among shards. File descriptors are also shared from now on, meaning that a much smaller number of file descriptors will be used during resharding. Fixes #1951." branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla * 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla: db: avoid excessive memory usage during resharding checked_file_impl: add support to dup sstables: group sstable components that can be shared among shards sstables: rename sstable member	2017-01-09 20:43:50 +02:00
Raphael S. Carvalho	68dfcf5256	db: avoid excessive memory usage during resharding After resharding, sstables may be owned by all shards, which means that file descriptors and memory usage for metadata will increase by a factor equal to number of shards. That can easily lead to OOM. SSTable components are immutable, so they can be stored in one shard and shared with others that need it. We use the following formula to decide which shard will open the sstable and share it with the others: (generation % smp::count), which is the inverse of how we calculate generation for new sstables. So if no resharding is performed, everything is shard-local. With this approach, resource usage due to loaded sstables will be evenly distributed among shards. For this approach to work, we now only populate keyspaces from shard 0. It's now the sole responsible for iterating through column family dirs. In addition, most of population functions are now free and take distributed database object as parameter. Fixes #1951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-09 15:24:36 -02:00
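The shard-selection formula in the commit above can be illustrated with a minimal sketch. The function names and the explicit `shard_count` parameter (standing in for `smp::count`) are hypothetical, and the generation scheme in `new_generation` is an assumption chosen only so that the modulo is its inverse, as the message describes.

```cpp
#include <cassert>
#include <cstdint>

using shard_id = unsigned;

// Assumed scheme for new sstables: each shard hands out generations that are
// congruent to its own id modulo the shard count.
inline std::int64_t new_generation(std::int64_t local_counter, shard_id shard,
                                   unsigned shard_count) {
    assert(shard_count > 0 && shard < shard_count && local_counter >= 0);
    return local_counter * shard_count + shard;
}

// The formula quoted in the commit message: generation % smp::count.
// The result is the shard that opens the sstable and shares its components.
inline shard_id owner_shard(std::int64_t generation, unsigned shard_count) {
    assert(shard_count > 0 && generation >= 0);
    return static_cast<shard_id>(generation % shard_count);
}
```

Under this assumption, an sstable written by a shard maps back to that same shard as long as the shard count is unchanged, which matches the "everything is shard-local" remark in the message.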
Avi Kivity	85f4e16336	main: fix incorrect low memory warning A spurious division by smp::count warns that memory is low even when plenty is available. Fix by removing the division. Fix #2002. Message-Id: <20170108122216.27233-1-avi@scylladb.com> Tested-by: Benoît Canet <benoit@scylladb.com>	2017-01-08 15:14:36 +02:00
Gleb Natapov	9ed3346f98	main: fix error reporting about low memory Message-Id: <20170108112144.GT1829@scylladb.com>	2017-01-08 13:46:48 +02:00
Vlad Zolotarov	492295eb7f	init: move supervisor_notify() out of main.cc Transform the supervisor_notify() and related functions into the "supervisor" class and place this class implementation in a separate .cc file. This is going to fix the compilation breakage of tests introduced by a commit `8014adc2a1` init: serialize the creation of system_traces KS objects Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1483663955-20096-1-git-send-email-vladz@scylladb.com>	2017-01-06 10:10:55 +00:00
Vlad Zolotarov	8014adc2a1	init: serialize the creation of system_traces KS objects Serialize the creation of a system_traces KS objects when they do not exist - the initial cluster boot. Avoid creating them in parallel by different cluster Nodes in order to avoid issue #420. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1483552503-12873-3-git-send-email-vladz@scylladb.com>	2017-01-05 12:41:38 +01:00
Nadav Har'El	45f19f2633	main: better error message on failing to start Prometheus Previously, if the Prometheus port (by default, 0.0.0.0:9180) could not be opened, the following message appeared in the log about 10 seconds into the run, and Scylla crashed. ERROR 2017-01-01 19:31:04,066 [shard 0] seastar - Exiting on unhandled exception: std::system_error (error system:98, Address already in use) The puzzled user would have no idea which address was already in use, why, or why Scylla stopped. In this patch, before the above message we get the much more informative message: ERROR 2017-01-01 19:58:19,080 [shard 0] init - Could not start Prometheus API server on 0.0.0.0:9180: std::system_error (error system:98, Address already in use) We continue to print the original message - and exit - in this case, under the assumption that it's better not to run the database while improperly configured. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170102121304.2060-1-nyh@scylladb.com>	2017-01-04 14:58:26 +02:00
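The idea of the commit above, adding the service name and address in front of the low-level error text, might look roughly like the following. `describe_listen_failure` is a hypothetical helper, not a function from the Scylla source, and the exact text produced by `what()` depends on the platform.

```cpp
#include <string>
#include <system_error>

// Hypothetical helper: build a log line that names the service and the
// address before appending the underlying error description.
inline std::string describe_listen_failure(const std::string& service,
                                           const std::string& address,
                                           const std::system_error& e) {
    return "Could not start " + service + " on " + address + ": " + e.what();
}
```

A caller would log this message and then rethrow, so the process still exits as the commit message describes.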
Avi Kivity	339cc0c2fa	main: verify sufficient memory per shard Refuse to boot if we don't have at least 1 GiB per shard, unless in developer mode. The primary violator here is docker, but since it starts in developer mode, it won't get fixed. We need some extra logic for this case. Message-Id: <20161221090222.28677-1-avi@scylladb.com>	2016-12-27 12:05:52 +02:00
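The boot-time rule in the commit above can be sketched as a single predicate. The names `min_memory_per_shard` and `may_boot` are hypothetical, and the sketch assumes the caller already has a per-shard memory figure in bytes.

```cpp
#include <cstdint>

// 1 GiB, the threshold named in the commit message.
constexpr std::uint64_t min_memory_per_shard = std::uint64_t(1) << 30;

// Returns true when startup may proceed: either developer mode is on,
// or the per-shard memory meets the threshold.
inline bool may_boot(std::uint64_t memory_per_shard, bool developer_mode) {
    return developer_mode || memory_per_shard >= min_memory_per_shard;
}
```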
Amnon Heiman	70b2a1bfd4	Set the prometheus prefix to scylla This patch makes the prometheus prefix configurable and sets the default value to scylla. Fixes #1964 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>	2016-12-25 15:21:53 +02:00
Raphael S. Carvalho	27fb8ec512	db: avoid excessive disk usage during sstable resharding Shared sstables will now be resharded in the same order to guarantee that all shards owning a sstable will agree on its deletion nearly the same time, therefore, reducing disk space requirement. That's done by picking which column family to reshard in UUID order, and each individual column family will reshard its shared sstables in generation order. Fixes #1952. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>	2016-12-21 23:18:06 +02:00
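The ordering described in the commit above (column families in UUID order, then shared sstables in generation order) can be sketched as one sort over a flat list. The `shared_sstable` struct is hypothetical, and comparing UUIDs as strings is an assumption made for brevity; the real UUID ordering may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record for a shared sstable awaiting resharding.
struct shared_sstable {
    std::string cf_uuid;      // column family UUID, compared as text here
    std::int64_t generation;  // sstable generation
};

// Sort by column family first, then by generation, so every shard that
// sorts the same input walks it in the same order.
inline void sort_for_resharding(std::vector<shared_sstable>& v) {
    std::sort(v.begin(), v.end(),
              [](const shared_sstable& a, const shared_sstable& b) {
                  return std::tie(a.cf_uuid, a.generation) <
                         std::tie(b.cf_uuid, b.generation);
              });
}
```

Because the order is deterministic, shards sharing an sstable reach it at roughly the same point in their work, which is what lets them agree on its deletion sooner.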
Vlad Zolotarov	62cad0f5f5	tracing: don't start tracing until a Tracing service is fully initialized RPC messaging service is initialized before the Tracing service, so we should prevent creation of tracing spans before the service is fully initialized. We will use an already existing "_down" state and extend it in a way that !_down equals "started", where "started" is TRUE when the local service is fully initialized. We will also split the Tracing service initialization into two parts: 1) Initialize the sharded object. 2) Start the tracing service: - Create the I/O backend service. - Enable tracing. Fixes issue #1939 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>	2016-12-21 12:40:14 +02:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Tomasz Grabiec	f7197dabf8	commitlog: Fix replay to not delete dirty segments The problem is that replay will unlink any segments which were on disk at the time the replay starts. However, some of those segments may have been created by current node since the boot. If a segment is part of reserve for example, it will be unlinked by replay, but we will still use that segment to log mutations. Those mutations will not be visible to replay after a crash though. The fix is to record preexisting segments before any new segments will have a chance to be created and use that as the replay list. Introduced in `abe7358767`. dtest failure: commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>	2016-12-07 15:54:47 +02:00
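The fix described above amounts to taking a snapshot of the segment list early and consulting only that snapshot during replay. A minimal sketch, assuming segments are identified by file name; `replay_plan` and `may_unlink` are hypothetical names, not the commitlog API.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical holder for the set of segments that existed at boot,
// captured before any new segment can be created.
class replay_plan {
    std::set<std::string> _preexisting;
public:
    explicit replay_plan(std::set<std::string> on_disk_at_boot)
        : _preexisting(std::move(on_disk_at_boot)) {}

    // Only segments in the boot-time snapshot may be replayed and unlinked;
    // anything created afterwards (for example a reserve segment) is left alone.
    bool may_unlink(const std::string& segment) const {
        return _preexisting.count(segment) > 0;
    }
};
```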
Takuya ASADA	2976799ef2	main: fix startup failing on Ubuntu 15.10/16.04 Since Ubuntu 15.10/16.04 still uses Upstart to manage GUI session (not as init), when we directly launch Scylla on Ubuntu's GUI Terminal(not using systemctl or initctl), raise(SIGSTOP) mistakenly calls (Because GUI session has "UPSTART_JOB" environment variable, won't happen when running Scylla as systemd service). To avoid this, we need to verify UPSTART_JOB == "scylla-server". If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case. Fixes #1199 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>	2016-12-05 16:28:25 +02:00
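The environment check described above can be sketched as a small predicate. The function name is hypothetical; only the variable name `UPSTART_JOB` and the values `scylla-server` and `unity7` come from the commit message.

```cpp
#include <cstring>

// Returns true only when the process was started by the Upstart job named
// "scylla-server". A null pointer (variable unset) or any other job name,
// such as the GUI session's "unity7", yields false.
inline bool launched_by_scylla_upstart_job(const char* upstart_job) {
    return upstart_job != nullptr &&
           std::strcmp(upstart_job, "scylla-server") == 0;
}
```

A caller would pass `std::getenv("UPSTART_JOB")` and call `raise(SIGSTOP)` only when the predicate holds.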
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Avi Kivity	07d5a20bae	Wire up sharding ignore msb parameter to configuration We might have used a fancy map<sstring, any> to pass the parameters, but that's overkill for now.	2016-11-22 22:40:47 +02:00
Duarte Nunes	6a37d87c76	db: Delete size_estimates_recorder Now that access to the size_estimates system is virtualized, we no longer need the recorder. Fixes #1616 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Raphael S. Carvalho	9a9f0d3a0f	main: fix exception handling when initializing data or commitlog dirs Exception handling was broken because after io checker, storage_io_error exception is wrapped around system error exceptions. Also the message when handling exception wasn't precise enough for all cases. For example, lack of permission to write to existing data directory. Fixes #883. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>	2016-11-14 12:34:10 +02:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Takuya ASADA	587d375e19	main: exit with 1 when verify_seastar_io_scheduler() failed Since we are exiting Scylla process in engine().at_exit() using ::_exit(0), even verify_seastar_io_scheduler() throwing an exception, scylla always exit with 0. Systemd misunderstands scylla-server.service was shutdown successfully because of this, so we need to pass correct exit code to ::_exit() here. Fixes #1674 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>	2016-10-17 13:57:00 +03:00
Raphael S. Carvalho	76862d0d9c	main: start compaction procedure after commit log is replayed Commit log replay is a synchronous operation in bootstrap, so services will only be started after it's completed. By starting compaction before, less bandwidth will be available to both and consequently boot will be slowed down. Fix is simply about moving compaction, which is an asynchronous operation after commitlog replay is over. Fixes #1620. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>	2016-10-06 18:25:24 +03:00
Avi Kivity	c94fb1bf12	build: reduce inclusions of messaging_service.hh Remove inclusions from header files (primary offender is fb_utilities.hh) and introduce new messaging_service_fwd.hh to reduce rebuilds when the messaging service changes. Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>	2016-10-05 11:46:49 +03:00
Gleb Natapov	26ae8e8365	implement listen_on_broadcast_address option When using multiple physical network interfaces, set this to true to listen on broadcast_address in addition to the listen_address, allowing nodes to communicate in both interfaces. Ignore this property if the network configuration automatically routes between the public and private networks such as EC2. Message-Id: <20160921094810.GA28654@scylladb.com>	2016-09-26 08:49:54 +03:00
Pekka Enberg	f1d0401ed2	main: Use proper logger for API server messages We have a "startlog" that we can use to print out API server messages. Message-Id: <1474358312-26510-1-git-send-email-penberg@scylladb.com>	2016-09-20 11:09:59 +03:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Amnon Heiman	bb4268a8a5	Add prometheus API This patch adds the prometheus API it adds the proto library to the compilation, adds an optional configuration parameter to change the prometheus listening port and start the prometheus API in main. To disable the prometheus API, set its listening port to 0. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>	2016-08-03 15:55:18 +03:00
Duarte Nunes	9ffdf4a5cd	db: Implement size_estimates_recorder This patch implements the size_estimates_recorder, which periodically writes estimations for all the non-system column families in the size_estimates system table. The size_estimates_recorder class corresponds to the one in Cassandra's SizeEstimatesRecorder.java. Estimation is carried out by shard 0. Since we're estimating based on data in shared sstables, having multiple shards doing this would skew the results. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-19 09:44:58 +00:00
Paweł Dziepak	7e06499458	repair: convert hashing to streamed_mutations This patch makes hashing for repair calculate checksums in a way that doesn't require rebuilding whole mutation. Unfortunately, such checksums are incompatible with the old ones so the old way for computing checksums is preserved for compatibility reasons. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Gleb Natapov	726b79ea91	messaging_service: enable internode_compression option Use LZ4 for internode compression if enabled. Message-Id: <20160711141734.GZ18455@scylladb.com>	2016-07-11 18:30:21 +03:00
Raphael S. Carvalho	85cb2a6d35	database: trigger compaction on boot At the moment, we only trigger compaction after creating a new sstable as a result of memtable flush, or some other event such as changing compaction strategy of a column family. However, it's important to trigger compaction on boot too. That will happen after loading all column families. Fixes #1404. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <54d38a418157454eec97aaba6b8a6b6e51484db4.1467135349.git.raphaelsc@scylladb.com>	2016-06-29 13:47:42 +03:00
Avi Kivity	5b81448ed6	main: add scylla --version option Fixes #1384. Message-Id: <1466691517-29964-1-git-send-email-avi@scylladb.com>	2016-06-23 16:24:03 +02:00
Avi Kivity	5af22f6cb1	main: handle exceptions during startup If we don't, std::terminate() causes a core dump, even though an exception is sort-of-expected here and can be handled. Add an exception handler to fix. Fixes #1379. Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>	2016-06-23 09:25:33 +03:00
Nadav Har'El	3372052d48	Rewriting shared sstables only after all shards loaded sstables After commit `faa4581`, each shard only starts splitting its shared sstables after opening all sstables. This was important because compaction needs to be aware of all sstables. However, another bug remained: If one shard finishes loading its sstables and starts the splitting compactions, and in parallel a different shard is still opening sstables - the second shard might find a half-written sstable being written by the first shard, and abort on a malformed sstable. So in this patch we start the shared sstable rewrites - on all shards - only after all shards finished loading their sstables. Doing this is easy, because main.cc already contains a list of sequential steps where each uses invoke_on_all() to make sure the step completes on all shards before continuing to the next step. Fixes #1371 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>	2016-06-20 16:25:24 +03:00
Avi Kivity	85bb5ea064	Merge "Reduce LSA reclaim latency" from Tomasz "Reclaiming many segments was observed to cause up to multi-ms latency. With the new setting, the latency of reclamation cycle with full segments (worst case mode) is below 1ms. I saw no difference in throughput in a CQL write micro benchmark in neither of these workloads: - full segments, reclaim by random eviction - sparse segments (3% occupancy), reclaim by compaction and no eviction Fixes #1274."	2016-06-16 10:47:57 +03:00
Tomasz Grabiec	75f899cc93	lsa: Make reclamation step configurable via config	2016-06-14 15:13:15 +02:00
Vlad Zolotarov	d3960f0bbb	tracing: rearrange shut down tracing::tracing local instance is dereferenced from a cql_server::connection::process_request(), therefore tracing::tracing service may be stop()ed only after a CQL server service is down. On the other hand it may not be stopped before RPC service is down because a remote side may request a tracing for a specific command too. This patch splits the tracing::tracing stop() into two phases: 1) Flush all pending tracing records and stop the backend. 2) Stop the service. The first phase is called after CQL server is down and before RPC is down. The second phase is called after RPC is down. Fixes #1339 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>	2016-06-14 07:58:04 +03:00
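The two-phase shutdown order described above can be sketched by recording the steps in sequence. Everything here is a hypothetical stand-in: `tracing_sketch`, its `shutdown()` and `stop()` methods, and the log strings are illustrative only, and the real services are asynchronous.

```cpp
#include <string>
#include <vector>

// Hypothetical tracing service with its stop split into two phases.
struct tracing_sketch {
    std::vector<std::string>& log;
    // Phase 1: flush pending tracing records and stop the backend.
    void shutdown() { log.push_back("tracing: flush and stop backend"); }
    // Phase 2: stop the service itself.
    void stop() { log.push_back("tracing: stop service"); }
};

// Records the order from the commit message: phase 1 runs after the CQL
// server is down and before RPC is down; phase 2 runs after RPC is down.
inline std::vector<std::string> shutdown_sequence() {
    std::vector<std::string> log;
    tracing_sketch tracing{log};
    log.push_back("cql server: stop");
    tracing.shutdown();
    log.push_back("rpc: stop");
    tracing.stop();
    return log;
}
```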
Asias He	e6f63a50e1	main: Delay the messaging_service api registration Since messaging_service is fully initialized in storage_service::init_server which calls messaging_service::start_listen, we need to delay the messaging_service api registration after it.	2016-06-08 11:13:35 +08:00
Vlad Zolotarov	4b43b08ffc	main: start a tracing service Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Pekka Enberg	0255318bf3	Revert "Revert "main: change order between storage service and drain execution during exit"" This reverts commit `b3ed55be1d`. The issue is in the failing dtest, not this commit. Gleb writes: "The bug is in the test, not the patch. Test waits for repair session to end one way or the other when node is killed, but for nodetool to know if repair is completed it needs to poll for it. If node dies before nodetool managed to see repair completion it will stuck forever since jmx is alive, but does not provide answers any more. The patch changes timing, repair is completed much close to exit now, so problem appears, but it may happen even without the patch. The fix is for dtest to kill jmx as part of killing a node operation." Now that Lucas fixed the problem in scylla-ccm, revert the revert.	2016-06-01 08:48:50 +03:00
Pekka Enberg	b3ed55be1d	Revert "main: change order between storage service and drain execution during exit" This reverts commit `0ebd8b18b7`. The change breaks repair_additional_test.py:RepairAdditionalTest.repair_kill_1_test	2016-05-30 12:48:09 +03:00
Avi Kivity	b50cb3eca8	config: rename compact_on_idle compact_on_idle will lead users to thinking we're talking about sstable compaction, not log-structured-allocator compaction. Rename the variable to reduce the probability of confusion. Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>	2016-05-30 08:39:13 +03:00
Gleb Natapov	0ebd8b18b7	main: change order between storage service and drain execution during exit Even the comment says drain_on_shutdown should be called first, but for that it has to be registered last. Fixes #862 Message-Id: <1463579574-15789-2-git-send-email-gleb@scylladb.com>	2016-05-29 11:39:24 +03:00
Piotr Jastrzebski	136b8148d2	Use idle CPU to compact LSA memory Register an idle CPU handler that compacts a single segment every time there's nothing better to execute on CPU. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>	2016-05-26 12:43:53 +03:00
Pekka Enberg	94e7e61cd0	api: Register snitch API earlier Currently, we register snitch API in set_server_gossip_settle() which waits until a node has joined the cluster. This makes 'nodetool status' not properly show the status of a joining node. Fix the issue by registering snitch API earlier. Fixes #1269. Message-Id: <1463576381-15484-1-git-send-email-penberg@scylladb.com>	2016-05-20 14:24:14 +03:00
Raphael S. Carvalho	bf18025937	main: stop compaction manager earlier Avi says: "During shutdown, we prevent new compactions, but perhaps too late. Memtables are flushed and these can trigger compaction." To solve that, let's stop compaction manager at a very early step of shutdown. We will still try to stop compaction manager in database::stop() because user may ask for a shutdown before scylla was fully started. It's fine to stop compaction manager twice. Only the first call will actually stop the manager. Fixes #1238. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <c64ab11f3c91129c424259d317e48abc5bde6ff3.1462496694.git.raphaelsc@scylladb.com>	2016-05-06 07:41:29 +03:00
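The "safe to stop twice" property mentioned above can be sketched with a guard flag. `compaction_manager_sketch` is a hypothetical class, not the real compaction manager, and it omits the asynchronous work a real stop would wait for.

```cpp
// Hypothetical manager whose stop() is idempotent.
class compaction_manager_sketch {
    bool _stopped = false;
public:
    // Returns true if this call performed the stop, false if the manager
    // was already stopped and the call was a no-op.
    bool stop() {
        if (_stopped) {
            return false;
        }
        _stopped = true;
        return true;
    }
    bool stopped() const { return _stopped; }
};
```

With this shape, the early shutdown step and the later call from `database::stop()` can both invoke `stop()` without coordinating with each other.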
Takuya ASADA	2bfc8e8c12	main: add tcp_syncookies sanity check Check net.ipv4.tcp_syncookies, show error message when it set to 0. Fixes #1118 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460738415-3798-1-git-send-email-syuu@scylladb.com>	2016-04-21 14:55:26 +03:00
Avi Kivity	e43dbac836	main: cancel pending atomic deletions on shutdown A shared sstable must be compacted by all shards before it can be deleted. Since we're stopping, that's not going to happen. Cancel those pending deletions to let anyone waiting on them to continue.	2016-04-14 17:14:26 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Glauber Costa	e750a94300	sanity check Seastar's I/O queue configuration While Seastar in general can accept any parameter for its I/O queues, Scylla in particular shouldn't run with them disabled. Such will be the status when the max-io-requests parameter is not enabled. On top of that, we would like to have enough depth per I/O queue not to allow for shard-local parallelism. Therefore, we will require a minimum per-queue capacity of 4. In machines where the disk iodepth is not enough to allow for 4 concurrent requests per shard, one should reduce the number of I/O queues. For --max-io-requests, we will check the parameter itself. However, the --num-io-queues parameter is not mandatory, and given enough concurrent requests, Seastar's default configuration can very well just be doing the right thing. So for that, we will check the final result of each I/O queue. As it is the case with other checks of the sorts, this can be overridden by the --developer-mode switch. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>	2016-03-25 11:33:57 +03:00
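The checks described above can be sketched as one predicate. The function and parameter names are hypothetical; the sketch assumes the caller knows whether `--max-io-requests` was given and what capacity each I/O queue ended up with.

```cpp
// Minimum per-queue capacity named in the commit message.
constexpr unsigned min_io_queue_capacity = 4;

// Returns true when the I/O queue configuration is acceptable.
// Developer mode overrides both checks, as the commit message states.
inline bool io_config_acceptable(bool max_io_requests_set,
                                 unsigned per_queue_capacity,
                                 bool developer_mode) {
    if (developer_mode) {
        return true;
    }
    if (!max_io_requests_set) {
        return false;  // I/O queues would effectively be disabled
    }
    return per_queue_capacity >= min_io_queue_capacity;
}
```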
Gleb Natapov	48c83163b9	init: make more initialization threaded Since initialization now runs in a thread storage, messaging and gossiper services initialization code may take advantage of it too. Message-Id: <20160323094732.GF2282@scylladb.com>	2016-03-23 11:53:11 +02:00


203 Commits