scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	f366ac76e8	tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	d9db79a85d	tests: Switch to seastar's allocation failure injector It catches more allocation sites.	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	6b1fe6cbe5	mutation_partition: Introduce set_continuity()	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	ac772cbd81	clustering_interval_set: Introduce contained_in()	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	d24ebe8565	clustering_interval_set: Introduce add() overload accepting another interval set	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	c6c54021a8	mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc When clustering keys are larger than 12.8 KiB they may get fragmented and key comparator will need to linearize them on comparison. This may cause lookups in the rows tree to fail with bad_alloc. Partition version merging (mutation_partition::apply_monotonically()) was not taking this into account. If we fail on lookup, the partition which is being applied may be incorrectly left with the clustering range from the beginning up to the current row marked as continuous, if the current row has the continuity flag set, because we've moved all of the preceding rows into the target, and the correct lower bound row is no longer there in the source. This may mark some discontinuous ranges as continuous. Merging is retried by allocating_section, and there will be no problem if it eventually succeeds, because the original continuity will be reflected in the sum. The problem will persist if it doesn't eventually succeed, when we're really out of memory. To protect against this, we could reset the continuity flag of the current row in the source when exiting on exception. Fixes #3583	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	de5c52f422	mutation_partition: Preserve continuity in case row merging with no tracker throws Example: p: row{key=A, cont=0} row{key=C, cont=1} this: row{key=C, cont=0} When we get to processing key=C, key=A was already moved to this, so p has stale continuity on key=C, which marks (-inf,C) as continuous, whereas it should mark only (A, C). That's not a problem if merging succeeds, but if exception happens at this point, we will violate the invariant which says that the sum of p and this should yield the same logical partition. It wouldn't because continuity of the sum is calculated as a set union, and (-inf, A) would be incorrectly turned into a continuous range. This is not a problem currently because continuity is always full when there is no tracker (memtables), so won't change anyway, and when there is a tracker (cache) we never merge but overwrite instead, so there is no memory allocation and thus no possibility for failure. But better be safe.	2018-07-17 16:30:01 +02:00
Tomasz Grabiec	567da3e063	memtable, cache: Fix exception safety of partition entry insertions boost::intrusive::set::insert() may throw if keys require linearization and that fails, in which case we will leak the entry. When this happens in cache, we will also violate the invariant for entry eviction, which assumes all tracked entries are linked, and cause a SEGFAULT. Use the non-throwing and faster insert_before() instead. Where we can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure that entry is deallocated on insert failure. Fixes #3585.	2018-07-17 16:30:01 +02:00
Paweł Dziepak	422d1eaeb9	Merge "Improve usability of pkeys in system.large_partitions table" from Avi " Partition keys are currently stored in serialized form in the system.large_partitions table. This is an obstacle to operators who usually can't deserialize partition keys in their heads. Improve the situation by deserializing the partition key for them. " * tag 'pkey-print/v1' of https://github.com/avikivity/scylla: large_partition_handler: output friendly partition key keys: schema-aware printing of a partition_key	2018-07-17 13:51:22 +01:00
Avi Kivity	002ac87aac	Update seastar submodule * seastar aac6cf1...6b97e00 (5): > Merge "changes to fix travis CI builds" from Kefu > tls.cc: Make "close" timeout delay exception proof > core/sharded: mark foreign_ptr::get_owner_shard() const > core/memory: Expose counter of large allocations > tests: add test for multi-fragmented net::packet Fixes #3461. Ref scylladb/seastar#474.	2018-07-17 15:43:01 +03:00
Tomasz Grabiec	3f509ee3a2	mutation_partition: Fix exception-safety of row copy constructor In case population of the vector throws, the vector object would not be destroyed. It's a managed object, so in addition to causing a leak, it would corrupt memory if later moved by the LSA, because it would try to fixup forward references to itself. Caused sporadic failures and crashes of row_cache_test, especially with allocation failure injector enabled. Introduced in `27014a23d7`. Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>	2018-07-17 13:21:21 +01:00
Avi Kivity	acb3163639	large_partition_handler: output friendly partition key Use abstract_type::to_string() to prettify partition key components. Manually tested by setting --compaction-large-partition-warning-threshold-mb to zero and inspecting the output for compound and non-compound partition keys.	2018-07-17 14:44:52 +03:00
Avi Kivity	bfd14b4123	keys: schema-aware printing of a partition_key Add a with_schema() helper to decorate a partition key with its schema for pretty-printing purposes, and matching operator<<. This is useful to print partition keys where the operator, who may not be familiar with the encoding, may see them.	2018-07-17 14:43:12 +03:00
Tomasz Grabiec	d94c7c07a3	lsa: Disable alloc failure injector inside the LSA sanitizer Message-Id: <1531814822-30259-1-git-send-email-tgrabiec@scylladb.com>	2018-07-17 11:27:56 +01:00
Botond Dénes	cc4acb6e26	storage_proxy: use the original row limits for the final results merging `query_partition_key_range()` does the final result merging and trimming (if necessary) to make sure we don't send more rows to the client than requested. This merging and trimming is done by a continuation attached to the `query_partition_key_range_concurrent()` which does the actual querying. The continuation captures by value the `row_limit` and `partition_limit` fields of the `query::read_command` object of the query. This has an unexpected consequence. The lambda object is constructed after the call to `query_partition_key_range_concurrent()` returns. If this call doesn't defer, any modifications made to the read command object by `query_partition_key_range_concurrent()` will be visible to the lambda. This is undesirable because `query_partition_key_range_concurrent()` updates the read command object directly as the vnodes are traversed, which in turn will result in the lambda doing the final trimming according to a decremented `row_limit`, which will cause the paging logic to declare the query as exhausted prematurely because the page will not be full. To avoid all this, make a copy of the relevant limit fields before `query_partition_key_range_concurrent()` is called and pass these copies to the continuation, thus ensuring that the final trimming will be done according to the original page limits. Spotted while investigating a dtest failure on my 1865/range-scans/v2 branch. On that branch the way range scans are executed on replicas is completely refactored. These changes apparently reduce the number of continuations in the read path to the point where an entire page can be filled without deferring, and thus cause the problem to surface. Fixes #3605. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>	2018-07-16 16:54:50 +03:00
Takuya ASADA	9479ff6b1e	dist/common/scripts/scylla_prepare: fix error when /etc/scylla/ami_disabled exists In this part a shell command had not been converted to python3, so it needs to be fixed. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180715075015.13071-1-syuu@scylladb.com>	2018-07-16 09:29:38 +03:00
Takuya ASADA	1511d92473	dist/redhat: drop scylla_lib.sh from .rpm Since we dropped scylla_lib.sh in `58e6ad22b2`, we need to remove it from the RPM spec file too. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180712155129.17056-1-syuu@scylladb.com>	2018-07-15 14:46:22 +03:00
Avi Kivity	ef9b36376c	Merge "database: support multiple data directories" from Glauber " While Cassandra supports multiple data directories, we have been historically supporting just one. The one-directory model suits us better because of the I/O Scheduler and so far we have seen very few requests -- if any, to support this. Still, the infrastructure needed to support multiple directories can be beneficial so I am trying to bring this in. For simplicity, we will treat the first directory in the list as the main directory. By being able to still associate one singular directory with a table, most of the code doesn't have to change and we don't have to worry about how to distribute data between the directories. In this design: - We scan all data directories for existing data. - resharding only happens within a particular data directory. - snapshot details are accumulated with data for all directories that host snapshots for the tables we are examining - snapshots are created with files in its own directories, but the manifest file goes to the main directory. For this one, note that in Cassandra the same thing happens, except that there is no "main" directory. Still the manifest file is still just in one of them. - SSTables are flushed into the main directory. - Compactions write data into the main directory Despite the restrictions, one example of usage of this is recovery. If we have network attached devices for instance, we can quickly attach a network device to an existing node and make the data immediately available as it is compacted back to main storage. Tests: unit (release) " * 'multi-data-file-v2' of github.com:glommer/scylla: database: change ident database: support multiple data directories database: allow resharing to specify a directory database: support multiple directories in get_snapshot_details database: move get_snapshot_info into a seastar::thread snapshots: always create the snapshot directory sstables: pass sstable dir with entry descriptor database: make nodetool listsnapshots print correct information sstables: correctly create descriptors for snapshots	2018-07-15 13:31:04 +03:00
Avi Kivity	8ee807321f	Merge "scylla streaming with rpc streaming" from Asias " This work is on top of Gleb's rpc streaming which was merged recently. What this series does is to replace scylla streaming service's data plane to use the new rpc streaming instead of the old rpc verb to send the mutations for scylla streaming. Other parts of scylla streaming, the control plane, are not changed. In my test, to bootstrap a new node to the existing one node cluster, smp 2, scylla stores data on ramdisk to minimize disk io impact. I saw a 2x improvement in streaming bandwidth. Before: [shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000] Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds After: [shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000] Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds Tests: dtest update_cluster_layout_tests.py Fixes: #3591 " * tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev: streaming: Add rpc streaming support storage_service: Introduce STREAM_WITH_RPC_STREAM feature streaming: Add estimate_partitions to send_info messaging_service: Add streaming with rpc streaming support messaging_service: Add streaming_domain database: Add add_sstable_and_update_cache database: Add make_streaming_sstable_for_write	2018-07-15 12:36:52 +03:00
Avi Kivity	8c993e0728	messaging: tag RPC services with scheduling groups Assign a scheduling_group for each RPC service. Assignment is done by connection (get_rpc_client_idx()) - all verbs on the same connection are assigned the same group. While this may seem arbitrary, it avoids priority inversion; if two verbs on the same connection have different scheduling groups, the verb with the low shares may cause a backlog and stall the connection, including following requests from verbs that ought to have higher shares. The scheduling_group parameters are encapsulated in different classes as they are passed around to avoid adding dependencies. Message-Id: <20180708140433.6426-1-avi@scylladb.com>	2018-07-13 13:57:08 +02:00
Vladimir Krivopalov	cf7b42619d	clustering_ranges_walker: Improve class consistency and readability. This patch addresses several issues. 1. The class no longer uses placement-new trick for move-assignment. It was incorrect to use because the class contains const references and re-initializing the same region of memory would result in undefined behaviour on accessing these members. 2. Use boost::iterator_range for tracking the current range of cr_ranges. It is easier to deal with and avoids possible bugs like assigning only one of two iterators Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>	2018-07-13 11:23:33 +02:00
Asias He	deff5e7d60	streaming: Add rpc streaming support This patch changes scylla streaming to use the recently added rpc streaming feature provided by seastar to send mutation fragments for scylla streaming instead of the rpc verbs. It also changes the receiver to write to the sstable file directly, skipping writing to memtable.	2018-07-13 08:36:47 +08:00
Asias He	71e22fe981	storage_service: Introduce STREAM_WITH_RPC_STREAM feature With this feature, the node supports scylla streaming using the rpc streaming.	2018-07-13 08:36:47 +08:00
Asias He	faa6769cdb	streaming: Add estimate_partitions to send_info The sender needs to estimate the number of partitions to send, because the receiver needs this to prepare the sstables.	2018-07-13 08:36:46 +08:00
Asias He	ddfb4590ce	messaging_service: Add streaming with rpc streaming support Preparation for adding rpc streaming in scylla streaming. - register_stream_mutation_fragments is used to register the rpc streaming verb - make_sink_and_source_for_stream_mutation_fragments is used to get the sink and source object for the sender - make_sink_for_stream_mutation_fragments is used to get a sink object for the receiver	2018-07-13 08:36:46 +08:00
Asias He	671e1b08fe	messaging_service: Add streaming_domain The rpc streaming needs a streaming_domain id for the same logical server. Choose one for our messaging service.	2018-07-13 08:36:46 +08:00
Asias He	6540051f77	database: Add add_sstable_and_update_cache Since we can write mutations to sstable directly in streaming, we need to add those sstables to the system so they can be seen by the query. Also we need to update the cache so the query reflects the latest data.	2018-07-13 08:36:45 +08:00
Asias He	dfc2739625	database: Add make_streaming_sstable_for_write This will be used to create sstable for streaming receiver to write the mutations received from network to sstable file instead of writing to memtable.	2018-07-13 08:36:45 +08:00
Takuya ASADA	ee61660b76	dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2 Since some AMIs use consistent network device naming, the primary NIC ifname is not always 'eth0'. But we hardcoded the NIC name as 'eth0' in scylla_ec2_check, so we need to add a --nic option to specify a custom NIC ifname. Fixes #3584 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180712142446.15909-1-syuu@scylladb.com>	2018-07-12 18:22:28 +03:00
Tomasz Grabiec	b17f7257a9	sstables: index_reader: Reduce size of index_entry by indirecting promoted_index Reduces size of index_entry from 384 bytes to 64 bytes by using indirection for the optional promoted index instead of embedding it. Improves query time from 9ms to 4ms in a micro benchmark with a very large index page. Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>	2018-07-12 17:46:58 +03:00
Tomasz Grabiec	101dcdbb48	gdb: Fix scylla heapprof command Type of _frames was changed to static_vector<> Message-Id: <1531233685-20786-2-git-send-email-tgrabiec@scylladb.com>	2018-07-12 16:51:30 +03:00
Tomasz Grabiec	059133ffa8	gdb: Introduce iteration wrapper for static_vector Message-Id: <1531233685-20786-1-git-send-email-tgrabiec@scylladb.com>	2018-07-12 16:51:30 +03:00
Duarte Nunes	63b63b0461	utils/loading_cache: Avoid using invalidated iterators When periodically reloading the values in the loading_cache, we would iterate over the list of entries and call the load() function for those which need to be reloaded. For some concrete caches, load() can remove the entry from the LRU set, and can be executed inline from the parallel_for_each(). This means we could potentially keep iterating using an invalidated iterator. Fix this by using a temporary container to hold those entries to be reloaded. Spotted when reading the code. Also use if constexpr and fix the comment in the function containing the changes. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180712124143.13638-1-duarte@scylladb.com>	2018-07-12 13:59:09 +01:00
Botond Dénes	2e7bf9c6f9	loading_cache::reload(): obtain key before calling _load() The continuation attached to _load() needs the key of the loaded entry to check whether it was disposed during the load. However if _load() invalidates the entry the continuation's capture line will access invalid memory while trying to obtain the key. To avoid this save a copy of the key before calling _load() and pass it to both _load() and the continuation. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>	2018-07-12 13:42:42 +01:00
Avi Kivity	a4a2f743a8	Merge "Avoid large allocations when reading sstable index pages" from Tomasz " If there are a lot of partitions in the index page, index_list may grow large and require large contiguous blocks of memory, because it's based on std::vector. That puts pressure on the memory allocator, and if memory is fragmented, may not be possible to satisfy without a lot of eviction. Switch to chunked_vector to avoid this. Refs #3597 " * 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla: sstables: Switch index_list to chunked_vector to avoid large allocations utils: chunked_vector: Do not require T to be default-constructible for clear() utils: chunked_vector: Implement front()	2018-07-12 15:30:18 +03:00
Duarte Nunes	1fb3b924f4	utils/loading_cache: Remove superfluous continuation Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180712122031.13424-1-duarte@scylladb.com>	2018-07-12 15:22:35 +03:00
Takuya ASADA	8f80d23b07	dist/common/scripts/scylla_util.py: fix typo Fix typo, and rename get_mode_cpu_set() to get_mode_cpuset(), since the term 'cpuset' is written without '_' in other places. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180711141923.12675-1-syuu@scylladb.com>	2018-07-12 10:14:55 +03:00
Tomasz Grabiec	8c85b01ad3	gdb: Fix scylla lsa-segment on python 3 Referring to a function parameter via "global" no longer works on python 3. We should be using "nonlocal", which is absent on python 2 though. To make the script work on both, inline next(). Message-Id: <1531317984-29224-1-git-send-email-tgrabiec@scylladb.com>	2018-07-12 10:14:22 +03:00
Duarte Nunes	a7fdf4fc49	Merge 'ALLOW FILTERING for indexed queries' from Piotr " Previous series on ALLOW FILTERING introduced it for regular queries, but it's also possible to have an indexed query which requires filtering. This series contains minor fixes that allow treating indexed+filtered queries properly. The most important part is having a more selective approach to extracting values from restrictions in the read_posting_list() helper function. Before ALLOW FILTERING, restrictions contained only a single entry that matched the indexed column, but it's not the case with filtering (and it won't be the case with multiple indexing support). This series also comes with test cases for indexed+filtered queries. Tests: unit (release) " * 'allow_filtering_and_si_3' of https://github.com/psarna/scylla: tests: add filtering indexed queries tests cql3: use single restriction value in index creation cql3: add secondary index condition to need_filtering cql3: add value_for method cql3: add missing inline declarations to restrictions cql3: make index detection more specific index: add target_column getter to index	2018-07-12 00:17:36 +01:00
Piotr Sarna	fcfbc804e4	tests: add filtering indexed queries tests Tests covering ALLOW FILTERING usage while using secondary indexes as well are added to cql_query_test. Tests are based on Cassandra's test suite for filtering secondary indexes + some more simple cases.	2018-07-11 18:06:21 +02:00
Piotr Sarna	7d9715db27	cql3: use single restriction value in index creation ALLOW FILTERING support caused index-related restrictions to possibly have more values. In order to remain correct, only those restrictions which match the indexed columns should be used.	2018-07-11 18:06:21 +02:00
Piotr Sarna	1d75035672	cql3: add secondary index condition to need_filtering A query that restricts a partition key and an indexed column needs filtering (after reading an index) and it wasn't properly detected before.	2018-07-11 18:06:21 +02:00
Piotr Sarna	80ce9b72a1	cql3: add value_for method In order to extract value from a restriction for just one column, value_for(column_name, options) method is implemented. It's needed because once ALLOW FILTERING support was introduced, index-related restrictions may contain more than 1 value.	2018-07-11 18:06:21 +02:00
Piotr Sarna	c1ad28f28e	cql3: add missing inline declarations to restrictions In order to prevent future compilation errors, externally defined class methods from single column primary key restrictions are explicitly marked inline.	2018-07-11 18:06:21 +02:00
Piotr Sarna	02811d8996	cql3: make index detection more specific Conditions that detect if restrictions need an indexed query weren't specific enough to work properly with mixed index-filtering queries, because they would too eagerly assume that partition/clustering key restrictions have a backing index.	2018-07-11 18:06:21 +02:00
Piotr Sarna	372644c909	index: add target_column getter to index Target column for an index is later needed to find matching restrictions.	2018-07-11 18:06:21 +02:00
Tomasz Grabiec	3b2890e1db	sstables: Switch index_list to chunked_vector to avoid large allocations If there are a lot of partitions in the index page, index_list may grow large and require large contiguous blocks of memory. That puts pressure on the memory allocator, and if memory is fragmented, may not be possible to satisfy without a lot of eviction.	2018-07-11 16:55:20 +02:00
Tomasz Grabiec	b0f5df10d2	utils: chunked_vector: Do not require T to be default-constructible for clear() resize(), used by clear(), requires T to be default-constructible in case the vector is expanded. It's not actually needed for clearing, and there will be users which use clear() with non-default-constructible T, so implement clear() without using resize().	2018-07-11 16:55:20 +02:00
Tomasz Grabiec	03832dab97	utils: chunked_vector: Implement front() std::vector<> has it, so should this, for easy migration.	2018-07-11 16:55:20 +02:00
Piotr Sarna	dcdd8be59c	cql3: make index-related tests less timing dependent Indexes and materialized views take time to build, so checks that rely on that are now wrapped with 'eventually' blocks. Message-Id: <6d3def2bc49b76dda11d7a1c9974a8b3d221003f.1531312518.git.sarna@scylladb.com>	2018-07-11 15:45:52 +03:00


16119 Commits
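
The continuity example in commit de5c52f422 can be traced with a small model. This is only a sketch under simplifying assumptions, not Scylla code: a partition is modeled as a sorted list of (key, cont) pairs, cont=True marks the open range between the previous row and this row as continuous, and the continuity of the sum of two partitions is the set union of their ranges, as the commit message describes. The names continuous_ranges, merged_continuity and NEG_INF are hypothetical.

```python
# Simplified model of row continuity; all names here are illustrative
# and do not correspond to identifiers in the Scylla code base.
NEG_INF = "-inf"  # sentinel standing in for the open lower bound


def continuous_ranges(rows):
    """Return the set of (lower, upper) ranges marked continuous.

    rows is a list of (key, cont) pairs sorted by key. A row with
    cont=True marks the range between the previous row (or -inf)
    and itself as continuous.
    """
    ranges = set()
    prev = NEG_INF
    for key, cont in rows:
        if cont:
            ranges.add((prev, key))
        prev = key
    return ranges


def merged_continuity(p, this):
    """Continuity of the sum of two partitions, as a set union."""
    return continuous_ranges(p) | continuous_ranges(this)


# State before merging, as in the commit message.
p_before = [("A", False), ("C", True)]
this_before = [("C", False)]

# State if merging stops while processing key=C:
# key=A has already been moved from p into this.
p_partial = [("C", True)]
this_partial = [("A", False), ("C", False)]

# Mitigation described in c6c54021a8: reset the continuity flag
# of the current row in the source when exiting on exception.
p_reset = [("C", False)]

if __name__ == "__main__":
    print("before: ", merged_continuity(p_before, this_before))
    print("partial:", merged_continuity(p_partial, this_partial))
    print("reset:  ", merged_continuity(p_reset, this_partial))
```

In this model the interrupted state widens the continuous range from (A, C) to (-inf, C), which is the invariant violation the message describes. Resetting the flag leaves the sum with narrower continuity than the original, never broader.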