scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Author	SHA1	Message	Date
Amnon Heiman	ff3d83bc2f	node_exporter_install script update version to 0.14 Fixes #2097 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170612125724.7287-1-amnon@scylladb.com>	2017-06-18 12:25:58 +03:00
Calle Wilund	3464422051	commitlog_test: Fix reader test dropping rp handles Test wants data in live segments to read from, so should not just drop the handles returned from allocate. Message-Id: <1497344532-2616-1-git-send-email-calle@scylladb.com>	2017-06-16 22:45:46 +01:00
Etienne Kruger	be0a947596	tests: perf_simple_query: Add delete perf test Add a performance test for deletion in addition to the existing update and query tests. The deletion performance test is executed using the '--delete' argument to perf_simple_query. Fixes #2417. Signed-off-by: Etienne Kruger <el@loadavg.io> Message-Id: <20170615232500.26987-1-el@loadavg.io>	2017-06-16 14:51:00 +01:00
Avi Kivity	2c57ab84b2	mutation_reader: fix typo in forwarding_tag The typo went unnoticed since the compiler picked up the global scope's forwarding_tag. The bug made streamed_mutation::forwarding and mutation_reader::forwarding the same type, but fortunately there were no type mixups due to this.	2017-06-15 20:13:01 +03:00
Avi Kivity	9cf6db3de5	Merge	2017-06-15 19:11:07 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not help if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Avi Kivity	da24bd7c34	Merge "Balance read requests according to CF's cache hit ratio" from Gleb "During read query with CL<ALL not all replicas are contacted. It is possible for some replicas to cache less data for some CF's (for instance because of node restart), so the replica choice may have a big impact on request's completion latency and on amount of work it generates in a cluster. This patch series keep track of per CF cached hit ratio and uses this information to choose best replicas for a request. Nodes with lower hit ratios are still contacted in order to populate their cache, but less frequently." * 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev: storage_proxy: load balance read requests according to cache hit rates choose extra replica for speculation in filter_for_query() consistency_level: drop filter_for_query_dc_local function database: reset node's hit rate information on connection drop messaging_service: connection drop notifier Store cluster wide cache hit statistics in CF messaging_service: return cache hit ratio as part of data read Distribute cache temperature over gossiper. periodically calculate avg cache hit rate between all shards database: introduce cache_temperature class Rename load_broadcaster.cc to misc_services.cc storage_proxy: use db::count_local_endpoints function instead open code it	2017-06-15 14:33:08 +03:00
Avi Kivity	7dffe7f933	Merge "parallel repair and more memory usage fix" from Asias "This series reduces repair memory usage and improves repair speed." * tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev: repair: Repair on all shards repair: Allow one stream plan in flight	2017-06-15 14:00:19 +03:00
Duarte Nunes	5736468a71	mutation_partition_serializer: Assume range tombstone support Range tombstones were introduced in version 1.3 and there exists no direct upgrade from 1.2 to vnext, so we can retire the code enforcing backwards compatibility. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170614211654.82501-1-duarte@scylladb.com>	2017-06-15 09:54:05 +03:00
Avi Kivity	e11f1c9cc3	tests: fix partitioner_test build on gcc 5	2017-06-14 17:22:01 +03:00
Gleb Natapov	4fdfa2dbb7	gdb: Fix "scylla heapprof" command Message-Id: <20170612084241.GF21915@scylladb.com>	2017-06-14 15:41:39 +02:00
Gleb Natapov	c7a59ab7ff	do not calculate serialized size of commitlog_entry_writer before final format is known Currently commitlog_entry_writer constructor calculates serialized size before it is known if a schema should be included into the entry. The result is never used since it is recalculated when schema information is supplied. The patch removes the needless calculation. Message-Id: <20170614114607.GA21915@scylladb.com>	2017-06-14 14:53:07 +03:00
Gleb Natapov	a032078410	intern also tuple and user defined types Currently each time a UDT or tuple is parsed a new object is created. If those objects are used to create a container type repeatedly it will cause a memory leak since container types are interned, but lookup in the cache is done using a pointer to the contained type (which will always be different for UDTs and tuples). This patch also interns UDTs and tuples, so each time the same type is parsed the same pointer is returned. Refs #2469 Fixes #2487 Message-Id: <20170612142942.GO21915@scylladb.com>	2017-06-14 14:41:17 +03:00
Asias He	47345078ec	repair: Repair on all shards Currently, shard zero is the coordinator of the repair. All the work of checksumming the local node and sending the repair checksum rpc verb is done on shard zero only. This leaves the other shards underutilized. With this patch, we split the ranges that need to be repaired into at least smp::count ranges, so sizeof(ranges) / smp::count will be assigned to each shard. For example, if we have 8 shards and 256 ranges, each shard will repair 32 ranges. Each shard will repair its 32 ranges sequentially. There will be at most 8 (smp::count) ranges of repair in parallel.	2017-06-14 17:52:49 +08:00
Asias He	54831a344c	repair: Allow one stream plan in flight In "repair: Use more stream_plan" (commit `2043ffc064`), we switched to streaming while doing checksum instead of streaming only after the checksum phase is completed. We take a parallelism_semaphore before we do checksum, if there are more than sub_ranges_to_stream (1024) ranges, we start a stream_plan and wait for the streaming to complete (still under the parallelism_semaphore). So at most parallelism_semaphore (100) stream_plans can be in parallel. The parallelism_semaphore limits the parallelism of both checksum and the streaming plan. However, it is not necessary to have the same parallelism for both checksum and streaming, because 1) a streaming operation itself runs in parallel (handling ranges on all shards in parallel, sending mutations in parallel), and 2) having more stream plans (in the worst case 100) means we can write to 100 memtables at the same time and flush 100 memtables to disk at the same time, which can take a lot of memory. With this patch, we only allow one stream plan in flight.	2017-06-14 17:52:36 +08:00
Calle Wilund	525730e135	database: Fix assert in truncate to handle empty memtables+sstables If we do two truncates in a row, the second will have neither memtable nor sstable data. Thus we will not write/remove sstables, and thus get no resulting truncation replay position. Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>	2017-06-14 11:21:21 +02:00
Gleb Natapov	87094849fa	storage_proxy: load balance read requests according to cache hit rates This patch makes storage proxy choose replicas to read from based on their cache hit rates. Replicas with higher cache hit rates will see more requests while replicas with lower hit rates will see fewer. Local node has a special bonus and will get more requests even if another node has slightly higher cache hit rate (same goes for local vs remote DC), but after the patch it is no longer guaranteed that a coordinator node will be chosen as a replica for the read (if the feature is enabled).	2017-06-13 09:57:14 +03:00
Gleb Natapov	bc8aa1b4ee	choose extra replica for speculation in filter_for_query() Currently storage proxy has to loop over remaining replicas to search for suitable extra replica, but doing it in filter_for_query() is extremely easy, so do it there instead.	2017-06-13 09:57:14 +03:00
Gleb Natapov	8437ea3b99	consistency_level: drop filter_for_query_dc_local function Merge filter_for_query_dc_local() functionality into filter_for_query(). This is more efficient since filter_for_query_dc_local() partitions endpoints into 'local' and 'remote' set but filter_for_query() already does it for CL=LOCAL so for such queries we needlessly do it twice.	2017-06-13 09:57:14 +03:00
Gleb Natapov	ca812a8ea0	database: reset node's hit rate information on connection drop Node may go down, so after it restarts cache hit rate info will be incorrect and it can be overwhelmed with traffic until new and up-to-date cache hit rate arrives. Solve this by dropping node's information on connection reset, it is more accurate than relying on gossip which may be slow and miss reboot of a node.	2017-06-13 09:57:14 +03:00
Gleb Natapov	23c51b3e57	messaging_service: connection drop notifier Allow registering callbacks that will be called when connection is going down.	2017-06-13 09:57:14 +03:00
Gleb Natapov	0e4d5bc2f3	Store cluster wide cache hit statistics in CF	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
Gleb Natapov	8ca1432b04	Distribute cache temperature over gossiper. When a node starts it does not have any information about cache temperature of other nodes in the cluster and it is hard (if not impossible) to make the right guess. During cluster startup all nodes have cold caches, so there is no point in redirecting reads to other nodes even though the local cache is cold, but if only that node restarted then other nodes have populated caches and reads should be redirected. The node will get up-to-date information about other nodes' caches, but only after receiving the first reply; until then it does not have the information to make the right decisions, which may cause unwanted spikes immediately after restart. Having cache temperature in gossiper helps to solve the problem.	2017-06-13 09:57:14 +03:00
Gleb Natapov	991ec4a16c	periodically calculate avg cache hit rate between all shards This patch adds new class cache_hitrate_calculator whose responsibility is to periodically calculate average cache hit rates between all shards for each CF.	2017-06-13 09:57:14 +03:00
Gleb Natapov	fab18c0c5a	database: introduce cache_temperature class The class will represent cache hit rate for a column family and is serializable for use with RPC.	2017-06-13 09:57:14 +03:00
Gleb Natapov	f59ecc2687	Rename load_broadcaster.cc to misc_services.cc load_broadcaster is very small class, move it into generic file so that we can put other small services there to save on compilation time.	2017-06-13 09:57:14 +03:00
Gleb Natapov	7bcf4c690f	storage_proxy: use db::count_local_endpoints function instead of open-coding it	2017-06-13 09:57:14 +03:00
Gleb Natapov	21197981a5	Fix use after free in nonwrapping_range::intersection end_bound() returns temporary object (end_bound_ref), so it cannot be taken by reference here and used later. Copy instead. Message-Id: <20170612132328.GJ21915@scylladb.com>	2017-06-12 15:34:36 +01:00
Tomasz Grabiec	20095d7ed6	gdb: Fix "scylla column_families" command Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>' in a type name, so this: std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>> should be this: std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> > Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:39:50 +03:00
Tomasz Grabiec	9e7a040f0c	gdb: Fix "scylla keyspaces" command The problem is that 'key' is a 'bytes' object now, which doesn't have __format__. Fixes the following error: Traceback (most recent call last): File "~/src/scylla/scylla-gdb.py", line 184, in invoke TypeError: non-empty format string passed to object.__format__ Error occurred in Python command: non-empty format string passed to object.__format__ Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:22:59 +03:00
Tomasz Grabiec	230683bdfa	gdb: Add missing seastar namespace qualifier Message-Id: <1497253433-374-1-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:22:53 +03:00
Asias He	2bcb368a13	repair: Fix range use after free Capture it by value. scylla: [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed) scylla: [shard 0] repair - Failed sync of range ==<runtime_exception (runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed) Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>	2017-06-12 11:00:57 +03:00
Avi Kivity	419ad9d6cb	Merge "repair memory usage fix" from Asias "This series switches repair to use more stream plans to stream the mismatched sub ranges and use a range generator to produce sub ranges. Test shows no huge memory is used for repair with large data set. In addition, we now have a progress reporter in the log how many ranges are processed. Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942] Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942] Fixes #2430." * tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev: repair: Remove unused sub_ranges_max repair: Reduce parallelism in repair_ranges repair: Tweak the log a bit repair: Use more stream_plan repair: iterator over subranges instead of list	2017-06-08 14:19:08 +03:00
Tomasz Grabiec	9b7f170121	gdb: Improve error message Message-Id: <1496849069-21750-1-git-send-email-tgrabiec@scylladb.com>	2017-06-07 18:26:31 +03:00
Tomasz Grabiec	0dfe1ad431	Merge "Relax replay position ordering requirement" from Calle From seastar-dev.git calle/concorde Normally, we require that all mutations applied to a column family have replay positions higher than all previously flushed. The main reason for this is to be able to determine when to drop a commit log segment, i.e. determine that all replay positions less than X are now in sstables. This patch series, small as it is, relaxes this by instead of just keeping track of high rp applied, keep a reference count to each segment per CF in memtables, and on flush, release this very count. The only case where we need to keep a water mark for RP is then for table truncation, for which we simply say that the highest RP applied to the column family is the lowest allowed henceforth, and use the old reordering logic for this instead. I.e. very rare. There is of course one (big?) downside to all this, and this is "normal" commit log replay on startup after crash/shutdown. Since we relax RP ordering, we cannot use RP:s in sstables as low marks for replay start, since it is now allowed to exist non-persisted mutations in commitlog with lower RP:s than previously flushed. I.e. we more or less always have to replay the full commit log. It is worth noting though that due to compaction and the non-propagation of RP marks to new sstables, we end up often doing this anyway, so it is hard to say how much of a regression this is.	2017-06-07 14:51:28 +02:00
Calle Wilund	18806989b6	database: remove hard rp ordering requirement, set low rp mark on truncate With commitlog keeping use-count per CF id, we can ease the ordering restriction on replay position. Previously we required that all added mutations have a position > previously flushed. However, if we accept that replay must now be all data, by keeping track instead per CF of highest RP ever entered, we can instead just set a low mark on truncation, since this is the only remaining hard RP divider.	2017-06-07 12:07:01 +00:00
Calle Wilund	d9b8c79eb9	commitlog_replayer: Ignore sstable replay positions With relaxed position ordering, we cannot use existing sstables as water mark for replay. We must replay everything above truncation marks.	2017-06-07 12:07:01 +00:00
Calle Wilund	2913241df1	memtable/commitlog: Change bookkeep to track individual segments Use per CF-id reference count instead, and use handles as result of add operations. These must either be explicitly released or stored (rp_set), or they will release the corresponding replay_position upon destruction. Note: this does _not_ remove the replay positioning ordering requirement for mutations. It just removes it as a means to track segment liveness.	2017-06-07 12:07:01 +00:00
Calle Wilund	0c598e5645	commitlog_test: Fix test_commitlog_delete_when_over_disk_limit Test should a.) Wait for the flush semaphore b.) Only compare segment sets between start and end, not start, end and in between. I.e. the test sort of assumed we started with < 2 (or so) segments. Not always the case (timing) Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>	2017-06-07 12:44:02 +03:00
Avi Kivity	07ff3f68e0	Merge seastar upstream * seastar b1f69cc...621b7ed (8): > net/api: Remove outdated comments > Merge "Fixes for Clang 5" from Paweł > Merge "Metrics: Safely transfer metadata between shared" from Amnon > posix: add missing #include > build: add cmake dependency > build: add -Wno-maybe-uninitialized > rpc: handle messages larger than memory limit (Fixes #2453) > doxygen: enable macro expansion	2017-06-07 11:04:56 +03:00
Takuya ASADA	7fe63c539a	dist/debian: install gdebi when it's not exist Since we started to use gdebi for install build-dep metapackage that generated by mk-build-dep, we need to install gdebi on build_deb.sh too. Fixes #2451 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496819209-30318-1-git-send-email-syuu@scylladb.com>	2017-06-07 10:24:22 +03:00
Asias He	3fdb8a3d3f	repair: Remove unused sub_ranges_max With the sub range iterator, it is not used anymore. Drop it.	2017-06-07 08:52:45 +08:00
Asias He	ca00c10b35	repair: Reduce parallelism in repair_ranges We currently repair all the ranges in parallel. 1) All the ranges will contend for parallelism_semaphore. Instead of processing multiple ranges in parallel and calculating the sub ranges (which take memory) for each range in parallel, we can handle the ranges one by one. We can still have enough parallelism because the checksums are calculated on all the shards. 2) If for some reason the repair failed, if we handle ranges one by one, we can log which ranges were repaired successfully. Next time, we can ignore them. If we start ranges in parallel, there is a high chance that no single range is completed because all the ranges are ongoing. Refs #1912	2017-06-07 08:50:57 +08:00
Asias He	3852665156	repair: Tweak the log a bit - Count n out m ranges the repair is running for (kind of progress report) - Make the 'Found differing range' log debug because it can be millions of such entries - Print the failed ranges	2017-06-07 08:50:57 +08:00
Asias He	2043ffc064	repair: Use more stream_plan In the very beginning, we used a stream_plan for each checksum range. Later, we changed to use a single stream_plan for all the checksum ranges. It pushes memory pressure to streaming, e.g., millions of ranges in a vector to send over RPC. To fix, we do checksum and streaming in parallel, and limit the number of checksum ranges stored in memory. Fixes #2430	2017-06-07 08:50:56 +08:00
Nadav Har'El	b3ff37e67f	repair: iterator over subranges instead of list When starting repair, we divided the large token ranges (vnodes) into small subranges of a desired length (around 100 partitions), and built a huge list of those subranges - to iterate over them later and compare checksums of those chunks. However, building this list up-front is completely unnecessary, and wastes a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was spent on this list. Instead, what we do in this patch is to find the next chunk in a DFS-like splitting algorithm, using only the token range midpoint() function (as before). The amount of memory needed for this is O(logN), instead of O(N) in the previous implementation. Refs #2430. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2017-06-07 08:50:56 +08:00
Raphael S. Carvalho	0ca1e5cca3	sstables: fix report of disk space used by bloom filter After change in boot, read_filter is called by distributed loader, so its update to _filter_file_size is lost. The load variant which receives foreign components that must do it. We were also not updating it for newly created sstables. Fixes #2449. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>	2017-06-06 18:20:28 +03:00
Takuya ASADA	a4c392c113	dist/debian: use gdebi instead of mk-build-deps -i At least on Debian8, mk-build-deps -i silently finishes with return code 0 even it fails to install dependencies. To prevent this, we should manually install the metapackage generated by mk-build-deps using gdebi. Fixes #2445 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>	2017-06-06 11:37:34 +03:00
Takuya ASADA	5608842e96	dist/debian/dep: install texlive from jessie-backports to prevent gdb build fail on jessie Installing openjdk-8-jre-headless from jessie-backports breaks texlive on jessie main repo. It causes 'Unmet build dependencies' error when building gdb package. To prevent this, force installing texlive from jessie-backports before starting to build gdb. Fixes #2444 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>	2017-06-06 11:37:33 +03:00
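
The subrange iteration described in commit `b3ff37e67f` ("repair: iterator over subranges instead of list") can be illustrated with a small sketch. This is not Scylla's code: it is a Python illustration that assumes integer tokens, assumes partitions are spread evenly over the token range, and uses hypothetical names (`iter_subranges`, `estimated_partitions`, `target`). Only the idea comes from the commit message: split by midpoint in a depth-first order and produce chunks lazily, so that the pending work is a stack of O(log N) entries rather than a list of N subranges.

```python
from typing import Iterator, List, Tuple

# A token range (start, end], modelled here with plain integers (assumption).
Range = Tuple[int, int]


def midpoint(start: int, end: int) -> int:
    """Midpoint of an integer token range (stand-in for the real midpoint())."""
    return start + (end - start) // 2


def iter_subranges(start: int, end: int,
                   estimated_partitions: float,
                   target: int = 100) -> Iterator[Range]:
    """Lazily yield subranges holding roughly `target` partitions each.

    Splitting is depth-first: only the not-yet-visited right halves along the
    current path are kept on the stack, so memory is O(log N) in the number
    of subranges instead of O(N) for a prebuilt list.
    """
    stack: List[Tuple[int, int, float]] = [(start, end, estimated_partitions)]
    while stack:
        lo, hi, est = stack.pop()
        # Small enough, or cannot be split any further: emit this chunk.
        if est <= target or hi - lo <= 1:
            yield (lo, hi)
            continue
        mid = midpoint(lo, hi)
        # Push the right half first so the left half is processed next,
        # which keeps the output in token order.
        stack.append((mid, hi, est / 2))
        stack.append((lo, mid, est / 2))


if __name__ == "__main__":
    for sub in iter_subranges(0, 1024, estimated_partitions=800):
        print(sub)
```

Because the generator yields one chunk at a time, a caller can checksum each subrange as it is produced without ever materializing the full list.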


12195 Commits