scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 12:06:44 +00:00

Author	SHA1	Message	Date
Paweł Dziepak	67aaaefde7	Merge "api: type-erase more of the column_family API" from Avi "Together with the already merged patch, we reduce the object file from 114MB to 81MB." * tag 'api-diet-1/v1' of https://github.com/avikivity/scylla: api: type-erase all-column_family map_reduce variant api: simplify 6-argument map_reduce_cf() variant	2018-04-05 11:07:17 +02:00
Botond Dénes	3c078d2554	forwardable reader: pass down timeout in fast_forward_to() The `const dht::partition_range&` overload to be more precise. The timeout wasn't passed to the underlying reader. Spotted during test debugging. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <39c02a55196d923bd0af8e6be6f0baa578cba070.1522915463.git.bdenes@scylladb.com>	2018-04-05 11:43:21 +03:00
Avi Kivity	1fa8682412	Merge seastar upstream * seastar 7328d17...33d8f74 (3): > memory: switch to buddy allocation > tls: Ensure we always pass through semaphores on shutdown > memory: replace placement-new in unions with member construction See scylladb/seastar#426.	2018-04-05 11:12:30 +03:00
Raphael S. Carvalho	30b6c9b4cd	database: make sure sstable is also forwarded to shard responsible for its generation After `f59f423f3c`, an sstable is loaded only at the shards that own it, so as to reduce the sstable load overhead. The problem is that an sstable may no longer be forwarded to a shard that needs to be aware of its existence, which would result in that sstable generation being reallocated for a write request. That would result in a failure as follows: "SSTable write failed due to existence of TOC file for generation..." This can be fixed by forwarding any sstable at load to all its owner shards and the shard responsible for its generation, which is determined as follows: s = generation % smp::count Fixes #3273. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>	2018-04-05 10:58:05 +03:00
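The forwarding rule in the commit above can be sketched in a few lines. This is only an illustration, not the code from the patch: `shards_to_notify`, `owners` and `smp_count` are hypothetical names, with `smp_count` standing in for `smp::count`.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical helper: the set of shards that must learn about an sstable
// at load time. `owners` stands in for the shards that own data in the
// sstable, `smp_count` for smp::count.
std::set<unsigned> shards_to_notify(const std::set<unsigned>& owners,
                                    std::int64_t generation,
                                    unsigned smp_count) {
    assert(smp_count > 0 && generation >= 0);
    std::set<unsigned> result = owners;
    // Rule quoted in the commit message: s = generation % smp::count.
    result.insert(static_cast<unsigned>(generation % smp_count));
    return result;
}
```

With four shards, an sstable of generation 13 owned by shards 0 and 2 would also be forwarded to shard 1, since 13 % 4 = 1. That extra shard is the one that would otherwise reallocate the generation.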
Tzach Livyatan	58e47fa0b3	docs/docker: Fix and add links to Scylla docs - Fix link for reporting a Scylla problem - Add a link to Best Practices for Running Scylla on Docker Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20180404065129.16776-1-tzach@scylladb.com>	2018-04-04 10:52:04 +03:00
Piotr Sarna	ae3265f905	cql_server: use handle_exception for failed accepts Follows up "cql_server: replace recursion in do_accepts with repeat". Failed accepts are now handled with handle_exception routine instead of generic then_wrapped. Message-Id: <db820a674100ae57f3acc7b49ebae57d0c2bdbb8.1522785444.git.sarna@scylladb.com>	2018-04-03 21:34:46 +01:00
Piotr Sarna	b298bb2f7a	cql_server: replace recursion in do_accepts with repeat Recursion in do_accepts function is now replaced with repeat utility. Fixes #2467 Message-Id: <07d6da60726fc3ecc06139309b9716180e8accf7.1522777060.git.sarna@scylladb.com>	2018-04-03 21:23:11 +03:00
Avi Kivity	9cef37e643	Merge "db/view: View building fixes" from Duarte " Fixes to the view building process, discovered from field experience. Tests: dtest(materialized_view_tests.py, smp=2) " * 'views/view-build-fixes/v1' of https://github.com/duarten/scylla: db/view: Start view building after schema agreement db/system_keyspace: scylla_views_builds_in_progress writes are user mem db/view: Require configuration option to enable view building	2018-04-03 17:42:21 +03:00
Duarte Nunes	b84bbfc51d	tests/view_schema_test: Test empty partition key entries are rejected Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180403122244.10626-2-duarte@scylladb.com>	2018-04-03 15:25:53 +03:00
Duarte Nunes	ec8960df45	db/view: Reject view entries with non-composite, empty partition key Empty partition keys are not supported on normal tables - they cannot be inserted or queried (surprisingly, the rules for composite partition keys are different: all components are then allowed to be empty). However, the (non-composite) partition key of a view could end up being empty if that column is: a base table regular column, a base table clustering key column, or a base table partition key column, part of a composite key. Fixes #3262 Refs CASSANDRA-14345 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180403122244.10626-1-duarte@scylladb.com>	2018-04-03 15:25:52 +03:00
Duarte Nunes	d4db043f03	db/view: Start view building after schema agreement If a base table or view has been dropped in one node, but another one hasn't yet learned about it, it starts the view build process immediately on boot, possibly calculating unneeded view updates and causing errors at the view replica, if that replica has already processed the schema changes. We should thus wait for schema agreement, even if the node is a seed. Fixes #3328 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-03 13:16:28 +01:00
Duarte Nunes	75bb66a50d	db/system_keyspace: scylla_views_builds_in_progress writes are user mem Treat writes to scylla_views_builds_in_progress as user memory, as the number of writes is dependent on the amount of user data on views (times the number of views, divided by the view building batch size). Fixes #3325 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-03 13:16:28 +01:00
Duarte Nunes	bf5045c7eb	db/view: Require configuration option to enable view building View building, enabled by default, can contain or expose issues that prevent the node from starting. In those cases, it is necessary to disable view building such that the node can be submitted to maintenance operations. Fixes #3329 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-03 13:16:28 +01:00
Avi Kivity	6c35db2c44	api: type-erase all-column_family map_reduce variant Encapsulate the map_reduce parameters in type-erased std::function, as well as the iterator-on-all-column-families logic. Reduces binary size by 18%.	2018-04-03 13:08:22 +03:00
Avi Kivity	0ade558999	api: simplify 6-argument map_reduce_cf() variant The 6-argument map_reduce_cf function is identical to the 5-argument version, except that it performs an extra cast (by calling the 6th argument's operator=()). Simplify the code by calling the 5-argument version from the 6-argument version. Reduces binary size by ~10%.	2018-04-03 12:22:14 +03:00
Duarte Nunes	11ece46f14	db/view: Remove leftover debug statement Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180402175238.5528-1-duarte@scylladb.com>	2018-04-03 09:41:33 +01:00
Avi Kivity	cadd983856	api: type-erase map_reduce_cf() map_reduce_cf() is called with varying template parameters which each have to be compiled separately. Unifying the internals to use types based on std::any reduced the object size by 15% (115MB->99MB) with presumably a commensurate decrease in compile time. A version that used "I" instead of "std::any" (and thus merged the internals only for callers that used the same result type) delivered a 10% decrease in object size. While std::any is less safe, in this case it is completely encapsulated. Message-Id: <20180402213732.432-1-avi@scylladb.com>	2018-04-03 09:31:04 +01:00
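The type-erasure idea described in the commit above can be illustrated with a small standalone sketch. It assumes nothing about the real `map_reduce_cf()` signature: `table_stub`, `map_reduce_erased` and `map_reduce_typed` are made-up names, and the loop is synchronous, whereas the real code works with futures across shards.

```cpp
#include <any>
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a column family; the real type is far richer.
struct table_stub {
    std::string name;
    long live_sstables;
};

// Non-template core: compiled once, whatever result type callers use.
std::any map_reduce_erased(const std::vector<table_stub>& tables,
                           std::any init,
                           const std::function<std::any(const table_stub&)>& mapper,
                           const std::function<std::any(std::any, std::any)>& reducer) {
    std::any acc = std::move(init);
    for (const auto& t : tables) {
        acc = reducer(std::move(acc), mapper(t));
    }
    return acc;
}

// Thin typed wrapper: the only part instantiated per result type.
// The std::any values never escape this function.
template <typename T, typename Mapper, typename Reducer>
T map_reduce_typed(const std::vector<table_stub>& tables, T init,
                   Mapper mapper, Reducer reducer) {
    std::any result = map_reduce_erased(
        tables, std::any(std::move(init)),
        [&](const table_stub& t) { return std::any(T(mapper(t))); },
        [&](std::any a, std::any b) {
            return std::any(T(reducer(std::any_cast<T>(std::move(a)),
                                      std::any_cast<T>(std::move(b)))));
        });
    return std::any_cast<T>(std::move(result));
}
```

The heavy part (`map_reduce_erased`) is compiled once, and each distinct caller only pays for the small wrapper. This is the trade the commit message describes: less static type safety inside, with the unsafety kept out of the public interface.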
Avi Kivity	ffcdcd6d16	tests: logalloc_test: relax test_large_allocation test_large_allocation attempts to allocate almost half of memory. With a buddy allocator, even if more than half of memory is free, and even if it is contiguous, it is unlikely to be available as a single allocation because the allocator inserts boundaries at powers-of-two addresses. Relax the test by allocating smaller chunks (but still the same amount, and still with challenging sizes); allocating half of memory contiguously is not a goal. Also use a vector instead of a deque, and reserve it, so we don't get intervening non-lsa allocations. I'm not sure there's a problem there but let's not depend on the allocation patterns. Message-Id: <20180401150828.13921-1-avi@scylladb.com>	2018-04-02 19:23:06 +01:00
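The effect of power-of-two boundaries can be seen in a simplified model. The function below is a hypothetical illustration, not Seastar's allocator. It assumes every block has a power-of-two size and is aligned to that size, and it asks how large a single block can be carved out of one contiguous free range.

```cpp
#include <cassert>
#include <cstddef>

// Largest block a simplified buddy allocator could hand out from the free
// range [free_begin, free_end). Blocks are power-of-two sized and aligned
// to their own size. Hypothetical model, not Seastar's implementation.
std::size_t largest_buddy_block(std::size_t free_begin, std::size_t free_end) {
    assert(free_begin <= free_end);
    std::size_t best = 0;
    for (std::size_t size = 1;
         size != 0 && size <= free_end - free_begin;
         size <<= 1) {
        // First address >= free_begin that is a multiple of size.
        std::size_t aligned = (free_begin + size - 1) / size * size;
        if (aligned + size <= free_end) {
            best = size;
        }
    }
    return best;
}
```

In a 1024-unit memory with units 300 to 999 free (700 contiguous units, well over half), the largest block in this model is 256 units: a 512-unit block would have to start at 512 and end at 1024, past the end of the free range.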
Avi Kivity	7ab52947dc	conf: define named_value<log_level> externally While building with -O1, I saw that the linker could not find the vtable for named_value<log_level>. Rather than fixing up the includes (and likely lengthening build time), fix by defining the class as an extern template, preventing it from being instantiated at the call site. Message-Id: <20180401150235.13451-1-avi@scylladb.com>	2018-04-02 19:23:06 +01:00
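The extern template technique mentioned in the commit above looks roughly like this. `named_value_stub` is a made-up stand-in for `named_value<log_level>`, and both halves appear in one file here only so the sketch is self-contained; in a real build the declaration would live in the header and the definition in exactly one source file.

```cpp
#include <cassert>
#include <string>

// Made-up stand-in for a config value class with a vtable.
template <typename T>
struct named_value_stub {
    virtual ~named_value_stub() = default;
    virtual std::string describe() const { return "named_value"; }
    T value{};
};

// Header side: explicit instantiation declaration. Users of the header
// do not instantiate the class (or emit its vtable) themselves.
extern template struct named_value_stub<int>;

// Exactly one source file: explicit instantiation definition, which emits
// the members and the vtable once for the whole program.
template struct named_value_stub<int>;
```

This keeps the emission of the class in a single known translation unit, so the linker has one place to find the vtable regardless of the optimization level.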
Avi Kivity	3964fd0be2	client_state: initialize _remote_addr for internal queries -O1 complains that client_state::_remote_addr is not initialized (and it is right). The call site is tracing, which likely won't be invoked for internal queries, but still. Message-Id: <20180401150410.13651-1-avi@scylladb.com>	2018-04-02 19:23:06 +01:00
Avi Kivity	2edf36f863	bytes: don't allocate NUL terminator Since bytes is used to encapsulate blobs, not strings, there's no need for a NUL terminator. It will never be passed to a function that expects a C string. Message-Id: <20180401151009.14108-1-avi@scylladb.com>	2018-04-02 19:23:06 +01:00
Duarte Nunes	abe8bbe7b5	Merge seastar upstream * seastar a66cc34...7328d17 (5): > sstring: add support for non-nul-terminated sstrings > core/sharded: Make async_sharded_service dtor virtual > reactor: pass naked pointer to submit_io > Merge http: "Add alias support to the API" from Amnon > systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-02 19:23:06 +01:00
Glauber Costa	ef84780c27	docker: default docker to overprovisioned mode. By default, overprovisioned is not enabled on docker unless it is explicitly set. I have come to believe that this is a mistake. If the user is running alone on the machine, and there are no other processes pinned anywhere - including interrupts - not running overprovisioned is the best choice. But everywhere else, it is not: even if a user runs 2 docker containers on the same machine and statically partitions CPUs with --smp (but without cpuset) the docker containers will pin themselves to the same sets of CPUs, as they are totally unaware of each other. It is also very common, especially in some virtualized environments, for interrupts not to be properly distributed - being particularly keen on being delivered on CPU0, a CPU which Scylla will pin by default. Lastly, environments like Kubernetes simply don't support pinning at the moment. This patch enables the overprovisioned flag if it is explicitly set - like we did before - but also by default unless --cpuset is set. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180331142131.842-1-glauber@scylladb.com>	2018-04-01 09:17:20 +03:00
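The new default from the commit above reduces to a one-line predicate. The sketch below is a hypothetical model of that decision, not the container's actual entrypoint code; `use_overprovisioned` and its parameters are invented names.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical model of the container start-up decision.
// `overprovisioned_flag` is true when the user asked for it explicitly;
// `cpuset` holds the --cpuset argument if one was given.
bool use_overprovisioned(bool overprovisioned_flag,
                         const std::optional<std::string>& cpuset) {
    // An explicit request wins, as before. Otherwise default to
    // overprovisioned unless the user pinned CPUs with --cpuset.
    return overprovisioned_flag || !cpuset.has_value();
}
```

Under this model, a container started with neither option runs overprovisioned, one started with only a cpuset does not, and an explicit flag always takes effect.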
Takuya ASADA	95129c4b12	dist/ami: point wiki page when variables.json Since there's no document for build_ami.sh on this repo, point to wiki page. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1521710239-9687-1-git-send-email-syuu@scylladb.com>	2018-03-29 18:54:42 +03:00
Glauber Costa	a9ef72537f	parse and ignore background writer controller Unused options are not exposed as command line options and will prevent Scylla from booting when present, although they can still be passed over YAML, for Cassandra compatibility. That has never been a problem, but we have been adding options to i3 (and others) that are now deprecated, but were previously marked as Used. Systems with those options may have issues upgrading. While this problem is common to all Unused options, the likelihood for any other unused option to appear in the command line is near zero, except for those two - since we put them there ourselves. There are two ways to handle this issue: 1) Mark them as Used, and just ignore them. 2) Add them explicitly to boost program options, and then ignore them. The second option is preferred here, because we can add them as hidden options in program_options, meaning they won't show up in the help. We can then just print a discreet message saying that those options are, from now on, ignored. v2: mark set as const (Botond) v3: rebase on top of master, indentation suggested by Duarte. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180329145517.8462-1-glauber@scylladb.com>	2018-03-29 17:57:30 +03:00
Avi Kivity	c9aa9f0d86	Revert "logalloc: capture current scheduling group for deferring function" This reverts commit `3b53f922a3`. It's broken in two ways: 1. concrete_allocating_function::allocate()'s caller, the region_group::start_releaser() loop, will delete the object as soon as it returns; however we scheduled some work depending on `this` in a separate continuation (via with_scheduling_group()) 2. the calling loop's termination condition depends on the work being done immediately, not later.	2018-03-29 16:08:12 +03:00
Vladimir Krivopalov	3a9cb54c76	Merge the pair of index_readers into just one tracking a range. Historically, we had two index_readers per sstable_mutation_reader, one for the lower bound and one for the upper bound. Most of the public members of the index_reader class were only called on either of those. With the changes introduced in #2981, the two readers are even more tied together as they now have a shared-per-pair list of index pages that needs proper cleanup and was protruding woefully into the caller code. This fix re-structures index_reader so that it now keeps track of both lower and upper bounds. The shared_index_lists structure is encapsulated within index_reader and becomes an internal detail rather than a liability. Fixes #3220. Tests: unit (debug, release) + Tested using cassandra-stress commands from #3189. perf_fast_forward results indicate there is no performance degradation caused by this fix. =========================== Baseline =================================== running: large-partition-skips Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 0.494458 1000000 2022418 1018 126960 27 0 0 0 0 0 0 0 97.6% 1 1 1.754717 500000 284946 997 127064 6 0 0 3 3 0 0 0 99.9% 1 8 0.551664 111112 201413 997 127064 6 0 0 3 3 0 0 0 99.7% 1 16 0.383888 58824 153232 1001 127080 10 0 0 5 5 0 0 0 99.5% 1 32 0.289073 30304 104832 997 127064 28 0 0 3 3 0 0 0 99.3% 1 64 0.236963 15385 64926 997 127064 122 0 0 3 3 0 0 0 99.2% 1 256 0.172901 3892 22510 997 127064 217 0 0 3 3 0 0 0 95.5% 1 1024 0.117570 976 8301 997 127064 235 0 0 3 3 0 0 0 49.0% 1 4096 0.085811 245 2855 664 27172 375 274 0 3 3 0 0 0 21.4% 64 1 0.512781 984616 1920149 1142 127064 139 0 0 3 3 0 0 0 98.7% 64 8 0.479232 888896 1854833 1001 127080 10 0 0 5 5 0 0 0 99.6% 64 16 0.451193 800000 1773078 997 127064 6 0 0 3 3 0 0 0 99.6% 64 32 0.408684 666688 1631305 997 127064 6 0 0 3 3 0 0 0 99.5% 64 64 0.351906 500032 1420924 997 127064 14 0 0 3 3 0 0 0 99.5% 64 256 0.227008 200000 881026 997 127064 211 0 0 3 3 0 0 0 99.1% 64 1024 0.125803 58880 468032 997 127064 290 0 0 3 3 0 0 0 65.1% 64 4096 0.098155 15424 157139 703 27856 401 267 0 3 3 0 0 0 25.8% running: large-partition-slicing Testing slicing of large partition: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000701 1 1427 9 296 6 4 0 3 3 0 0 0 12.4% 0 32 0.000698 32 45827 9 296 6 3 0 3 3 0 0 0 13.9% 0 256 0.000808 256 316920 10 328 6 3 0 3 3 0 0 0 24.9% 0 4096 0.004368 4096 937697 25 808 14 3 0 3 3 0 0 0 45.9% 500000 1 0.001196 1 836 13 412 9 4 0 3 3 0 0 0 22.7% 500000 32 0.001200 32 26664 13 412 9 4 0 3 3 0 0 0 22.2% 500000 256 0.001503 256 170338 14 444 10 4 0 3 3 0 0 0 25.3% 500000 4096 0.004351 4096 941465 30 956 20 4 0 3 3 0 0 0 50.7% running: large-partition-slicing-clustering-keys Testing slicing of large partition using clustering keys: offset read time 
(s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000625 1 1601 7 176 6 0 0 3 3 0 0 0 23.2% 0 32 0.000604 32 53016 7 176 6 0 0 3 3 0 0 0 24.7% 0 256 0.000695 256 368498 8 180 6 0 0 3 3 0 0 0 36.4% 0 4096 0.004083 4096 1003106 20 692 12 1 0 3 3 0 0 0 47.0% 500000 1 0.001198 1 835 12 516 9 3 0 3 3 0 0 0 22.8% 500000 32 0.000981 32 32631 12 388 9 3 0 3 3 0 0 0 29.2% 500000 256 0.001320 256 194011 13 384 10 3 0 3 3 0 0 0 29.0% 500000 4096 0.003944 4096 1038567 25 840 17 2 0 3 3 0 0 0 52.2% running: large-partition-slicing-single-key-reader Testing slicing of large partition, single-partition reader: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000849 1 1178 9 488 6 0 0 3 3 0 0 0 16.5% 0 32 0.000661 32 48415 9 296 6 0 0 3 3 0 0 0 22.2% 0 256 0.000756 256 338648 10 328 6 0 0 3 3 0 0 0 33.3% 0 4096 0.004147 4096 987610 22 840 12 1 0 3 3 0 0 0 47.9% 500000 1 0.001041 1 960 13 476 9 3 0 3 3 0 0 0 25.9% 500000 32 0.001020 32 31375 13 412 9 3 0 3 3 0 0 0 29.1% 500000 256 0.001265 256 202373 14 444 10 3 0 3 3 0 0 0 32.0% 500000 4096 0.004121 4096 994014 30 988 18 3 0 3 3 0 0 0 52.7% running: large-partition-select-few-rows Testing selecting few rows from a large partition: stride rows time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1000000 1 0.000668 1 1498 9 296 6 4 0 3 3 0 0 0 19.8% 500000 2 0.000976 2 2048 13 412 9 4 0 3 3 0 0 0 29.0% 250000 4 0.001408 4 2842 18 572 12 6 0 3 3 0 0 0 28.8% 125000 8 0.002004 8 3993 29 912 19 10 0 3 3 0 0 0 34.0% 62500 16 0.002883 16 5551 50 1584 32 18 0 3 3 0 0 0 41.9% 2 500000 1.053215 500000 474737 1138 127080 120 0 0 5 5 0 0 0 99.7% running: large-partition-forwarding Testing forwarding with clustering restriction in a large partition: pk-scan time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu yes 0.002717 2 736 24 2684 8 16 0 3 3 0 0 
0 19.7% no 0.001004 2 1992 13 412 8 2 0 3 3 0 0 0 30.2% running: small-partition-skips Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 1.466523 1000000 681885 1369 139732 33 1 0 0 0 0 0 0 99.7% -> 1 1 12.792183 500000 39086 6235 177736 5155 0 0 5123 7663 0 0 0 96.4% -> 1 8 3.451431 111112 32193 6235 177736 5155 0 0 5123 9673 0 0 0 84.8% -> 1 16 2.223815 58824 26452 6234 177704 5154 0 0 5122 9965 0 0 0 75.0% -> 1 32 1.512511 30304 20036 6233 177680 5155 1 0 5123 10090 0 0 0 61.8% -> 1 64 1.129465 15385 13621 6227 177464 5154 0 0 5122 10159 0 0 0 49.5% -> 1 256 0.733282 3892 5308 6211 175464 5178 24 0 5122 10220 0 0 0 33.8% -> 1 1024 0.397302 976 2457 5946 142152 5369 217 0 5120 10235 0 0 0 32.1% -> 1 4096 0.187746 245 1305 5499 81992 5296 142 0 5122 10240 0 0 0 46.8% -> 64 1 2.428488 984616 405444 7332 177736 5155 25 0 5123 5208 0 0 0 79.9% -> 64 8 2.262876 888896 392817 6235 177736 5155 0 0 5123 5654 0 0 0 78.1% -> 64 16 2.137544 800000 374261 6234 177732 5154 0 0 5122 6110 0 0 0 77.1% -> 64 32 1.862466 666688 357960 6235 177736 5155 0 0 5123 6844 0 0 0 73.7% -> 64 64 1.547757 500032 323069 6234 177728 5155 0 0 5123 7651 0 0 0 68.7% -> 64 256 0.914612 200000 218672 6233 177704 5154 0 0 5122 9202 0 0 0 55.5% -> 64 1024 0.475472 58880 123835 6229 177492 5154 5 0 5122 9930 0 0 0 45.4% -> 64 4096 0.271239 15424 56865 6158 169480 5257 114 0 5115 10142 0 0 0 44.1% running: small-partition-slicing Testing slicing small partitions: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.003209 1 312 3 260 2 7 0 1 1 0 0 0 15.5% 0 32 0.004205 32 7610 16 1428 10 0 0 5 5 0 0 0 15.7% 0 256 0.009830 256 26042 97 8572 62 0 0 31 31 0 0 0 18.7% 0 4096 0.015471 4096 264748 100 8704 64 0 0 32 32 0 0 0 48.4% 500000 1 0.003654 1 274 34 492 
33 0 0 32 64 0 0 0 28.7% 500000 32 0.004287 32 7464 40 1260 36 0 0 32 64 0 0 0 26.0% 500000 256 0.009598 256 26673 100 8748 64 4 0 32 64 0 0 0 20.6% 500000 4096 0.014151 4096 289449 119 7892 85 0 0 53 64 0 0 0 54.1% ======================== With the patch ================================ running: large-partition-skips Testing scanning large partition with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 0.468887 1000000 2132711 1018 126960 29 0 0 0 0 0 0 0 98.4% 1 1 1.735113 500000 288166 1001 127080 10 0 0 5 5 0 0 0 99.9% 1 8 0.535616 111112 207447 997 127064 6 0 0 3 3 0 0 0 99.6% 1 16 0.365487 58824 160947 1001 127080 15 0 0 5 5 0 0 0 99.5% 1 32 0.272208 30304 111326 997 127064 21 0 0 3 3 0 0 0 99.3% 1 64 0.224049 15385 68668 997 127064 208 0 0 3 3 0 0 0 99.1% 1 256 0.159247 3892 24440 997 127064 250 0 0 3 3 0 0 0 94.7% 1 1024 0.102107 976 9559 997 127064 292 0 0 3 3 0 0 0 53.6% 1 4096 0.084310 245 2906 664 27172 371 273 0 3 3 0 0 0 20.2% 64 1 0.508340 984616 1936923 1142 127064 129 0 0 3 3 0 0 0 98.1% 64 8 0.470369 888896 1889786 997 127064 6 0 0 3 3 0 0 0 99.6% 64 16 0.439917 800000 1818526 1001 127080 10 0 0 5 5 0 0 0 99.6% 64 32 0.397938 666688 1675358 997 127064 6 0 0 3 3 0 0 0 99.5% 64 64 0.344144 500032 1452972 997 127064 18 0 0 3 3 0 0 0 99.4% 64 256 0.219996 200000 909107 997 127064 251 0 0 3 3 0 0 0 99.1% 64 1024 0.124294 58880 473715 997 127064 284 1 0 3 3 0 0 0 62.2% 64 4096 0.097580 15424 158065 703 27856 400 267 0 3 3 0 0 0 25.3% running: large-partition-slicing Testing slicing of large partition: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000733 1 1365 9 296 6 4 0 3 3 0 0 0 19.3% 0 32 0.000705 32 45417 9 296 6 3 0 3 3 0 0 0 15.3% 0 256 0.000830 256 308364 10 328 6 3 0 3 3 0 0 0 26.7% 0 4096 0.004631 4096 884529 25 808 14 3 0 3 3 0 0 
0 48.1% 500000 1 0.001184 1 845 13 412 9 4 0 3 3 0 0 0 23.7% 500000 32 0.001199 32 26690 13 412 9 4 0 3 3 0 0 0 21.9% 500000 256 0.001530 256 167296 14 444 10 4 0 3 3 0 0 0 26.8% 500000 4096 0.004379 4096 935474 30 956 19 4 0 3 3 0 0 0 51.5% running: large-partition-slicing-clustering-keys Testing slicing of large partition using clustering keys: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000620 1 1614 7 176 6 0 0 3 3 0 0 0 27.4% 0 32 0.000625 32 51218 7 176 6 0 0 3 3 0 0 0 27.0% 0 256 0.000701 256 365148 8 180 6 0 0 3 3 0 0 0 35.2% 0 4096 0.004063 4096 1008130 20 692 12 1 0 3 3 0 0 0 47.6% 500000 1 0.001208 1 827 12 516 9 3 0 3 3 0 0 0 24.3% 500000 32 0.000973 32 32876 12 388 9 3 0 3 3 0 0 0 28.7% 500000 256 0.001315 256 194612 13 384 10 3 0 3 3 0 0 0 29.0% 500000 4096 0.003950 4096 1037068 25 840 17 2 0 3 3 0 0 0 52.7% running: large-partition-slicing-single-key-reader Testing slicing of large partition, single-partition reader: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.000844 1 1185 9 488 6 0 0 3 3 0 0 0 16.5% 0 32 0.000656 32 48753 9 296 6 0 0 3 3 0 0 0 23.1% 0 256 0.000751 256 341011 10 328 6 0 0 3 3 0 0 0 34.0% 0 4096 0.004173 4096 981632 22 840 12 1 0 3 3 0 0 0 47.0% 500000 1 0.001036 1 966 13 476 9 3 0 3 3 0 0 0 25.4% 500000 32 0.001014 32 31573 13 412 9 3 0 3 3 0 0 0 27.4% 500000 256 0.001280 256 200044 14 444 10 3 0 3 3 0 0 0 31.8% 500000 4096 0.004081 4096 1003746 30 988 18 3 0 3 3 0 0 0 51.6% running: large-partition-select-few-rows Testing selecting few rows from a large partition: stride rows time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1000000 1 0.000668 1 1498 9 296 6 3 0 3 3 0 0 0 21.7% 500000 2 0.000958 2 2088 13 412 9 4 0 3 3 0 0 0 27.7% 250000 4 0.001495 4 2676 18 572 12 6 0 3 3 0 0 0 25.8% 125000 8 0.002069 8 3866 29 912 19 10 0 3 3 0 0 0 30.8% 62500 16 
0.002856 16 5603 50 1584 32 18 0 3 3 0 0 0 41.7% 2 500000 1.063129 500000 470310 1138 127080 120 0 0 5 5 0 0 0 99.7% running: large-partition-forwarding Testing forwarding with clustering restriction in a large partition: pk-scan time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu yes 0.002567 2 779 24 2684 8 16 0 3 3 0 0 0 21.5% no 0.001013 2 1975 13 412 8 2 0 3 3 0 0 0 28.9% running: small-partition-skips Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 1.349959 1000000 740763 1369 139732 33 1 0 0 0 0 0 0 99.7% -> 1 1 12.640751 500000 39555 8144 191168 7064 0 0 7032 11481 0 0 0 96.2% -> 1 8 3.404269 111112 32639 6651 180660 5571 0 0 5539 10505 0 0 0 84.5% -> 1 16 2.175424 58824 27040 6434 179116 5354 0 0 5322 10365 0 0 0 74.3% -> 1 32 1.493365 30304 20292 6335 178404 5257 0 0 5225 10294 0 0 0 61.1% -> 1 64 1.112168 15385 13833 6256 177672 5183 0 0 5151 10217 0 0 0 48.7% -> 1 256 0.719282 3892 5411 6211 175464 5178 24 0 5122 10220 0 0 0 33.3% -> 1 1024 0.393236 976 2482 5946 142152 5369 217 0 5120 10235 0 0 0 30.7% -> 1 4096 0.185284 245 1322 5499 81992 5296 142 0 5122 10240 0 0 0 44.7% -> 64 1 2.356711 984616 417792 7361 177944 5184 21 0 5152 5266 0 0 0 79.1% -> 64 8 2.192331 888896 405457 6253 177868 5173 0 0 5141 5690 0 0 0 77.2% -> 64 16 2.029835 800000 394121 6245 177812 5165 0 0 5133 6132 0 0 0 75.7% -> 64 32 1.806448 666688 369060 6245 177808 5165 0 0 5133 6864 0 0 0 72.6% -> 64 64 1.508492 500032 331478 6242 177788 5163 0 0 5131 7667 0 0 0 67.7% -> 64 256 0.892881 200000 223994 6233 177704 5154 0 0 5122 9202 0 0 0 54.2% -> 64 1024 0.465715 58880 126429 6229 177492 5154 0 0 5122 9930 0 0 0 44.0% -> 64 4096 0.266582 15424 57858 6158 169480 5257 114 0 5115 10142 0 0 0 42.3% running: small-partition-slicing Testing slicing small 
partitions: offset read time (s) frags frag/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 0 1 0.003113 1 321 3 260 2 0 0 1 1 0 0 0 13.4% 0 32 0.004166 32 7682 16 1428 10 0 0 5 5 0 0 0 14.9% 0 256 0.009813 256 26088 97 8572 62 0 0 31 31 0 0 0 18.4% 0 4096 0.014798 4096 276794 100 8704 64 0 0 32 32 0 0 0 46.3% 500000 1 0.003700 1 270 34 492 33 0 0 32 64 0 0 0 28.4% 500000 32 0.004030 32 7940 40 1260 36 0 0 32 64 0 0 0 27.8% 500000 256 0.009514 256 26908 100 8748 64 0 0 32 64 0 0 0 20.2% 500000 4096 0.013368 4096 306413 119 7892 85 0 0 53 64 0 0 0 53.6% Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>	2018-03-29 15:23:31 +03:00
Asias He	f539e993d3	gossip: Relax generation max difference check start node 1 2 3; shutdown node2; shutdown node1 and node3; start node1 and node3; nodetool removenode node2; clean up all scylla data on node2; bootstrap node2 as a new node. I saw that node2 could not bootstrap and was stuck forever waiting for schema information to complete: On node1, node3 [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704 On node2 [shard 0] storage_service - JOINING: waiting for schema information to complete This is because in nodetool removenode operation, the generation of node1 was increased from 0 to 2. gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe(); gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe(); Each force_newer_generation_unsafe increases the generation by 1. Here is an example. Before nodetool removenode: ``` curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool { "addrs": "127.0.0.2", "generation": 0, "is_alive": false, "update_time": 1521778757334, "version": 0 }, ``` After nodetool removenode: ``` curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool { "addrs": "127.0.0.2", "application_state": [ { "application_state": 0, "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246", "version": 214 }, { "application_state": 6, "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a", "version": 153 } ], "generation": 2, "is_alive": false, "update_time": 1521779276246, "version": 0 }, ``` In gossiper::apply_state_locally, we have this check: ``` if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) { // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation); } ``` to skip the gossip update. To fix, we relax the generation max difference check to allow the generation of a removed node. After this patch, the removed node bootstraps successfully. Tests: dtest:update_cluster_layout_tests.py Fixes #3331 Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>	2018-03-29 12:09:49 +03:00
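To see why the check quoted in the commit above rejects the restarted node, the condition can be replayed with the numbers from the log. This is a sketch of the pre-patch condition only. The value of `max_generation_difference` (one year in seconds) is an assumption, since the commit message does not state it, and the relaxed check introduced by the patch is not shown because the message does not spell it out.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value (one year in seconds); not stated in the commit message.
constexpr std::int64_t max_generation_difference = 86400LL * 365;

// Mirrors the condition quoted in the commit message: returns true when
// the gossip update would be skipped as "invalid".
bool generation_rejected(std::int64_t local_generation,
                         std::int64_t remote_generation) {
    return local_generation != 0 &&
           remote_generation > local_generation + max_generation_difference;
}
```

With a local generation of 0 (the state before removenode) the check is bypassed entirely. Once the two `force_newer_generation_unsafe()` calls raise it to 2, a timestamp-based generation such as 1521779704 exceeds 2 plus the allowed difference, so the update is dropped.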
Glauber Costa	b092234f2b	sstables: print informative message earlier Just saw this today during a crash when creating Materialized Views. It is still unclear why this happened. But the message says: Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed. Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0. Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace: Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0 Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d I can't even guess which table caused the problem, let alone which SSTable. That's because those asserts are the very first thing we do. We can discuss whether or not assert is the right behaviour (usually we can't guarantee the state is sane if that is missing, so I don't see a problem) But it would be nice to see which SSTable we are processing before we assert. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180328160856.10717-1-glauber@scylladb.com>	2018-03-28 19:55:04 +03:00
Avi Kivity	4419e60207	Merge "Add a configuration API" from Amnon " The configuration API is part of scylla v2 configuration. It uses the new definition capabilities of the API to dynamically create the swagger definition for the configuration. This means that the swagger definition will contain an entry with a description and type for each config value. To get v2 of the swagger file: http://localhost:10000/v2 If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2 It takes longer to load because the file is much bigger now. " * 'amnon/config_api_v5' of github.com:scylladb/seastar-dev: Explanation about the API V2 API: add the config API as part of the v2 API. Defining the config api	2018-03-28 12:45:17 +03:00
Amnon Heiman	71a04b5d26	Explanation about the API V2 Currently it holds a general explanation about the V2 and specific entry about the config. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2018-03-28 12:42:04 +03:00
Amnon Heiman	94c2d82942	API: add the config API as part of the v2 API. After this patch, the API v2 will contain a config section with all the configuration parameters. A GET of http://localhost:10000/v2 will contain the config section. An example for getting a configuration parameter: curl http://localhost:10000/v2/config/listen_address Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2018-03-28 12:42:04 +03:00
Amnon Heiman	6d907e43e0	Defining the config api The config API is created dynamically from the config. This mean that the swagger definition file will contain the description and types based on the configuration. The config.json file is used by the code generator to define a path that is used to register the handler function. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2018-03-28 12:41:55 +03:00
Vladimir Krivopalov	b268ea951a	tests: perf_fast_forward: Sanitize JSON file names. Substitute various brackets and parentheses with alnum strings, remove whitespace, strip curly braces off single-range values. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <206adea8d05a1e64ce2627df1e4da3a845454906.1522171869.git.vladimir@scylladb.com>	2018-03-28 12:29:07 +03:00
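The three steps named in the commit above could look roughly like the Python sketch below. It is not the patch itself (which lives in the C++ test): the replacement tokens in `_BRACKET_NAMES` are invented for illustration, and "single-range value" is read here as a brace pair enclosing at most one bracketed range, which is an assumption.

```python
import re

# Hypothetical replacement table; the strings used by the real patch are not given.
_BRACKET_NAMES = {
    "[": "LSB", "]": "RSB",
    "(": "LP", ")": "RP",
    "{": "LCB", "}": "RCB",
}


def _strip_single(match: "re.Match[str]") -> str:
    """Drop the braces when they enclose at most one bracketed range (assumption)."""
    inner = match.group(1)
    openings = inner.count("[") + inner.count("(")
    return inner if openings <= 1 else match.group(0)


def sanitize_name(raw: str) -> str:
    # 1. Strip curly braces around single-range values.
    name = re.sub(r"\{([^{}]*)\}", _strip_single, raw)
    # 2. Remove all whitespace.
    name = re.sub(r"\s+", "", name)
    # 3. Replace remaining brackets and parentheses with alphanumeric tokens.
    return "".join(_BRACKET_NAMES.get(ch, ch) for ch in name)
```

The brace stripping runs first so that the range count is taken before brackets are rewritten into tokens.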
Tomasz Grabiec	52c61df930	Relax includes To avoid unnecessary recompilations. Message-Id: <1522168295-994-1-git-send-email-tgrabiec@scylladb.com>	2018-03-28 10:49:07 +03:00
Avi Kivity	4c3e82bd67	Merge "db/view: Populate views with existing base table data" from Duarte " This series introduces the view_builder class, a sharded service responsible for building all defined materialized views. This process entails walking over the existing data in a given base table, and using it to calculate and insert the respective entries for one or more views. The view_builder uses the migration_manager to subscribe to schema change events, and update its bookkeeping accordingly. We prefer this to having the database call into the view_builder, as that would create a cyclic dependency. We serialize changes to the views of a particular base table, such that schema changes do not interfere with the view building process. We employ a flat_mutation_reader for each base table for which we're building views. We consume from the reader associated with each base table until all its views are built. If the reader reaches the end and there are incomplete views, then a view was added while others were being built. In such cases, we restart the reader to the beginning of the current token, but not to the beginning of the token range, when the view is added. Then, when we exhaust the reader, we simply create a new one for the whole token range, and resume building the pending views. We aim to be resource-conscious. On a given shard, at any given moment, we consume at most from one reader. We also strive for fairness, in that each build step inserts entries for the views of a different base. Each build step reads and generates updates for batch_size rows. We lack a controller, which could potentially allow us to go faster (to execute multiple steps at the same time, or consume more rows per batch), and also which would apply backpressure, so we could, for example, delay executing a build step. Interaction with the system tables: - When we start building a view, we add an entry to the scylla_views_builds_in_progress system table. 
If the node restarts at this point, we'll consider these newly inserted views as having made no progress, and we'll treat them as new views; - When we finish a build step, we update the progress of the views that we built during this step by writing the next token to the scylla_views_builds_in_progress table. If the node restarts here, we'll start building the views at the token in the next_token column. - When we finish building a view, we mark it as completed in the built views system table, and remove it from the in-progress system table. Under failure, the following can happen: * When we fail to mark the view as built, we'll redo the last step upon node reboot; * When we fail to delete the in-progress record, upon reboot we'll remove this record. A view is marked as completed only when all shards have finished their share of the work, that is, if a view is not built, then all shards will still have an entry in the in-progress system table; - A view that a shard finished building, but not all other shards, remains in the in-progress system table, with first_token == next_token. Interaction with the distributed system tables: - When we start building a view, we mark the view build as being in-progress; - When we finish building a view, we mark the view as being built. Upon failure, we ensure that if the view is in the in-progress system table, then it may not have been written to this table. We don't load the built views from this table when starting. 
When starting, the following happens: * If the view is in the system.built_views table and not the in-progress system table, then it will be in this one; * If the view is in the system.built_views table and not in this one, it will still be in the in-progress system table - we detect this and mark it as built in this table too, keeping the invariant; * If the view is in this table but not in system.built_views, then it will also be in the in-progress system table - we don't detect this and will redo the missing step, for simplicity. View building is necessarily a sharded process. That means that on restart, if the number of shards has changed, we need to calculate the most conservative token range that has been built, and build the remainder. When building view updates, we consider that everything is new and nothing pre-existing is there (which means no tombstones will be sent out to the paired view replicas). Tests: unit (debug) dtest (materialized_view_test.py(smp=1, smp=2)) " * 'view-building/v4' of https://github.com/duarten/scylla: (22 commits) tests/view_build_test: Add tests for view building tests/cql_test_env: Move eventually() to this file tests/cql_assertions: Assert result set is not empty tests/cql_test_env: Start the view_builder db/view/view_builder: Allow synchronizing with the end of a build db/view/view_builder: Actually build views flat_mutation_reader: Make reader from mutation fragments db/view/view_builder: React to schema changes service/migration_listener: Add class for view notifications db/view: Introduce view_builder column_family: Add function to populate views column_family: Allow synchronizing with in-progress writes database: Compare view id instead of name in find_views() database: Add get_views() function db/view: Return a future when sending view updates service/storage_service: Allow querying the view build status db: Introduce system_distributed_keyspace tests: Add unit test for build_progress_virtual_reader db/system_keyspace: 
Add API for MV-related system tables db/system_keyspace: Add virtual reader for MV in-progress build status ...	2018-03-27 15:41:28 +03:00
Daniel Fiala	051ed12ad2	cql3/functions: Print function declaration with cql3 types, not with internal types. Signed-off-by: Daniel Fiala <daniel@scylladb.com> Message-Id: <20180327084953.20313-3-daniel@scylladb.com>	2018-03-27 13:33:29 +03:00
Duarte Nunes	9f5cfa76f7	tests/view_build_test: Add tests for view building This is a separate file from view_schema_test because that one is already becoming too long to run; also, having multiple test files means they can be executed in parallel. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	e5031f70ef	tests/cql_test_env: Move eventually() to this file Move eventually() from view_schema_test to cql_test_env. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	8528584056	tests/cql_assertions: Assert result set is not empty Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	a2c94e7925	tests/cql_test_env: Start the view_builder Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	a45fa8eaa2	db/view/view_builder: Allow synchronizing with the end of a build Intended for use by unit tests, this patch allows synchronizing with the end of a build for a particular view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	5f822e3928	db/view/view_builder: Actually build views This patch adds the missing view building code to the eponymous class. We consume from the reader associated with each base table until all its views are built. If the reader reaches the end and there are incomplete views, then a view was added while others were being built. In such cases, we restart the reader to the beginning of the current token, but not to the beginning of the token range, when the view is added. Then, when we exhaust the reader, we simply create a new one for the whole token range, and resume building the pending views. We aim to be resource-conscious. On a given shard, at any given moment, we consume at most from one reader. We also strive for fairness, in that each build step inserts entries for the views of a different base. Each build step reads and generates updates for batch_size rows. We lack a controller, which could potentially allow us to go faster (to execute multiple steps at the same time, or consume more rows per batch), and also which would apply backpressure, so we could, for example, delay executing a build step. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
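The fairness and batching rules in the commit above can be sketched in a few lines of Python. This is a simplified model, not the view_builder code: base tables are plain iterables of rows, the function name `run_build_steps` is hypothetical, and the restart logic for views added mid-build is left out. It shows only that one reader is consumed at a time, that each step handles up to `batch_size` rows, and that consecutive steps rotate between bases.

```python
from collections import deque
from itertools import islice
from typing import Dict, Iterable, List, Tuple


def run_build_steps(
    bases: Dict[str, Iterable[int]], batch_size: int
) -> List[Tuple[str, List[int]]]:
    """Return the (base, rows) pair handled by each build step, in order."""
    # One reader per base table; only the reader at the head is consumed.
    readers = deque((name, iter(rows)) for name, rows in bases.items())
    steps: List[Tuple[str, List[int]]] = []
    while readers:
        name, reader = readers.popleft()
        batch = list(islice(reader, batch_size))
        if not batch:
            continue  # reader exhausted: this base needs no more steps
        steps.append((name, batch))
        # Requeue at the tail so the next step serves a different base.
        readers.append((name, reader))
    return steps
```

For example, `run_build_steps({"a": [1, 2, 3], "b": [10]}, 2)` alternates between the two bases until both readers are exhausted.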
Duarte Nunes	1f3e3d3813	flat_mutation_reader: Make reader from mutation fragments Builds a reader from a set of ordered mutation fragments. This is useful for building a reader out of a subset of segments returned by a different reader. It is equivalent to building a mutation out of the set of mutation fragments, and calling make_flat_mutation_reader_from_mutations, except that it does not yet support fast-forwarding. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	a21efeffa0	db/view/view_builder: React to schema changes The view_builder now uses the migration_manager to subscribe to schema change events, and update its bookkeeping accordingly. We prefer this to having the database call into the view_builder, as that would create a cyclic dependency. We serialize changes to the views of a particular base table, such that schema changes do not interfere with the upcoming view building code. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	3ffa3b6b54	service/migration_listener: Add class for view notifications Add a convenience base class for view notifications, which provides a default implementation for all other types of notifications. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	901faabaa2	db/view: Introduce view_builder This patch introduces the view_builder class, a sharded service responsible for building all defined materialized views. This process entails walking over the existing data in a given base table, and using it to calculate and insert the respective entries for one or more views. This patch introduces only the bootstrap functionality, which is responsible for loading the data stored in the system tables and filling the in-memory data structures with the relevant information, to be used in subsequent patches for the actual view building. The interaction with the system tables is as follows. Interaction with the tables in system_keyspace: - When we start building a view, we add an entry to the scylla_views_builds_in_progress system table. If the node restarts at this point, we'll consider these newly inserted views as having made no progress, and we'll treat them as new views; - When we finish a build step, we update the progress of the views that we built during this step by writing the next token to the scylla_views_builds_in_progress table. If the node restarts here, we'll start building the views at the token in the next_token column. - When we finish building a view, we mark it as completed in the built views system table, and remove it from the in-progress system table. Under failure, the following can happen: * When we fail to mark the view as built, we'll redo the last step upon node reboot; * When we fail to delete the in-progress record, upon reboot we'll remove this record. A view is marked as completed only when all shards have finished their share of the work, that is, if a view is not built, then all shards will still have an entry in the in-progress system table; - A view that a shard finished building, but not all other shards, remains in the in-progress system table, with first_token == next_token. 
Interaction with the distributed system table (view_build_status): - When we start building a view, we mark the view build as being in-progress; - When we finish building a view, we mark the view as being built. Upon failure, we ensure that if the view is in the in-progress system table, then it may not have been written to this table. We don't load the built views from this table when starting. When starting, the following happens: * If the view is in the system.built_views table and not the in-progress system table, then it will be in view_build_status; * If the view is in the system.built_views table and not in this one, it will still be in the in-progress system table - we detect this and mark it as built in this table too, keeping the invariant; * If the view is in this table but not in system.built_views, then it will also be in the in-progress system table - we don't detect this and will redo the missing step, for simplicity. View building is necessarily a sharded process. That means that on restart, if the number of shards has changed, we need to calculate the most conservative token range that has been built, and build the remainder. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
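The per-shard restart rules for the local system tables described in the commit above can be condensed into a small decision function. The Python sketch below is a model under assumptions: the two tables are represented as a set and a dict, the function name and the returned action strings are hypothetical, and a missing progress value is modelled as `None`. Treating a view found in neither table as new is an inference. The distributed view_build_status table and the shard-count-change case are not covered.

```python
from typing import Dict, Optional, Set


def recovery_action(
    view: str,
    built: Set[str],
    in_progress: Dict[str, Optional[int]],
) -> str:
    """Decide what one shard does for `view` after a restart.

    built       -- views recorded in the built views system table
    in_progress -- view -> next_token from the in-progress system table
                   (None when no build step had completed yet)
    """
    if view in built:
        # Built, but deleting the in-progress record may have failed.
        if view in in_progress:
            return "remove_in_progress_record"
        return "nothing_to_do"
    if view in in_progress:
        next_token = in_progress[view]
        if next_token is None:
            # No progress was recorded: treat it as a new view.
            return "build_from_start"
        return f"resume_at:{next_token}"
    # Not recorded anywhere: assumed to be a new view.
    return "build_from_start"
```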
Duarte Nunes	f298f57137	column_family: Add function to populate views The populate_views() function takes a set of views to update, a token to select base table partitions, and the set of sstables to query. This lays the foundation for a view building mechanism to exist, which walks over a given base table, reads data token-by-token, calculates view updates (in a simplified way, compared to the existing functions that push view updates), and sends them to the paired view replicas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Duarte Nunes	67dd3e6e5d	column_family: Allow synchronizing with in-progress writes This patch adds a mechanism to class column_family through which we can synchronize with in-progress writes. This is useful for code that, after some modification, needs to ensure that new writes will see it before it can proceed. In particular, this will be used by the view building code, which needs to wait until the in-progress writes, which may have missed that there is now a view, are observable to the view building code. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
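One common way to get the behaviour described in the commit above is a phased barrier: writes register in the current phase, and a waiter opens a new phase and waits only for the old one to drain. The Python asyncio sketch below illustrates that idea. The class and method names are hypothetical, and the commit message does not say that the patch is implemented this way.

```python
import asyncio


class _Phase:
    def __init__(self) -> None:
        self.pending = 0          # writes started in this phase and not finished
        self.closed = False       # True once a waiter has moved to a new phase
        self.done = asyncio.Event()


class WritePhaser:
    """Lets a caller wait for the writes that started before the wait began."""

    def __init__(self) -> None:
        self._current = _Phase()

    def start_write(self) -> _Phase:
        phase = self._current
        phase.pending += 1
        return phase

    def finish_write(self, phase: _Phase) -> None:
        phase.pending -= 1
        if phase.closed and phase.pending == 0:
            phase.done.set()

    async def await_pending_writes(self) -> None:
        # Later writes land in a fresh phase and are not waited for.
        old, self._current = self._current, _Phase()
        old.closed = True
        if old.pending:
            await old.done.wait()


async def demo() -> None:
    phaser = WritePhaser()
    early = phaser.start_write()
    waiter = asyncio.ensure_future(phaser.await_pending_writes())
    await asyncio.sleep(0)           # let the waiter open a new phase
    late = phaser.start_write()      # does not delay the waiter
    phaser.finish_write(early)
    await waiter
    phaser.finish_write(late)


if __name__ == "__main__":
    asyncio.run(demo())
```

Writes that begin after the wait starts are expected to already see the modification, which is why they are excluded from the wait.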
Duarte Nunes	9640205f11	database: Compare view id instead of name in find_views() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00

14982 Commits