scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Avi Kivity	339cc0c2fa	main: verify sufficient memory per shard Refuse to boot if we don't have at least 1 GiB per shard, unless in developer mode. The primary violator here is docker, but since it starts in developer mode, it won't get fixed. We need some extra logic for this case. Message-Id: <20161221090222.28677-1-avi@scylladb.com>	2016-12-27 12:05:52 +02:00
Avi Kivity	868b4d110c	Merge "Fixes for intentional short reads" from Paweł "This patchset contains fixes for the changes introduced in "Query result size limiting". It also improves handling of short data reads. In order to minimise chances of digest mismatch during data queries, replicas that were asked just to return a digest also keep track of the size of the data (in the IDL representation) so that they would stop at the same point nodes doing full data queries would. Moreover, data queries are not affected by per-shard memory limit and the coordinator sends individual result size limits to replicas in order not to depend on hardcoded values. It is still possible to get digest mismatches if the IDL changes (e.g. a new field is added), but, hopefully, that won't be a serious problem." * 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev: query: introduce result_memory_accounter::foreign_state storage_proxy: fix short reads in parallel range queries storage_proxy: pass maximum result size to replicas mutation_partition: use result limiter for digest reads query: make result_memory_limiter constants available for linker result_memory_limiter: add accounter for digest reads idl: allow writers to use any output stream result_memory_limiter: split new_read() to new_{data, mutation}_read() idl: is_short_read() was added in 1.6 mutation_partition: honour allowed_short_read for static rows storage_proxy: fix _is_short_read computation storage_proxy: disallow short reads if got no live rows storage_proxy: don't stop after result with no live rows	2016-12-26 10:42:49 +02:00
Avi Kivity	1d9ee358f1	Revert "Merge "Reduce the size of mutation_partition" from Piotr" This reverts commit `aa392810ff`, reversing changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses boost::intrusive::detail directly, which it must not, and doesn't compile on all boost versions as a consequence.	2016-12-25 16:07:48 +02:00
Avi Kivity	59d389bd46	Merge seastar upstream * seastar 0b98024...f32e4c2 (11): > Merge "Moving the reactor counters to the metric layer" from Amnon > metrics: Metrics function should take variable as a refernce > Revert "Merge ""Moving the reactor counters to the metric layer from Amnon" > Merge ""Moving the reactor counters to the metric layer from Amnon > Revert "fstream: Auto-close data_sink and data_source" > rpc: Avoid resource unit leaks on failure > fstream: Auto-close data_sink and data_source > http: Move metrics registration to the metrics layer > output_stream: add batching to zero copy interface > Revert "slab: Move the metrics registration to the metrics layer" > slab: Move the metrics registration to the metrics layer	2016-12-25 15:50:09 +02:00
Amnon Heiman	70b2a1bfd4	Set the prometheus prefix to scylla This patch make the prometheus prefix configurable and set the default value to scylla. Fixes #1964 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>	2016-12-25 15:21:53 +02:00
Avi Kivity	b99a0fc076	licenses: clarify that licenses in this directory do not cover entire work	2016-12-25 12:59:38 +02:00
Avi Kivity	aa392810ff	Merge "Reduce the size of mutation_partition" from Piotr "Reduce the size of mutation_partition by implementing intrusive set using bi::rbtree_algorithms directly and using tree nodes optimized for size. This will reduce the size of mutation_partition by: 24 bytes + <number of cql rows> * 8 bytes This should have a positive impact on performance because mutation_partitions are stored both in memtable and cache. Fixes #742." * 'haaawk/742' of github.com:cloudius-systems/seastar-dev: intrusive_set: rename size() to calculate_size() Make intrusive_set_external_comparator::_value_traits static Implement intrusive set using rbtree_algorithms mutation_partition: make apply_reversibly_intrusive_set nongeneric mutation_partition: take schema in find_row and clustered_row mutation_partition: Extract intrusive set logic to a class. mutation_partition: Replace value_comp with key_comp calls	2016-12-25 12:56:10 +02:00
Benoît Canet	a24ff47c63	scylla_setup: Use blkid or ls to list potentials block devices blkid does not list root raw device. Revert to lsblk while taking care of having a fallback path in case the -p option is not supported. Fixes #1963. Suggested-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20161225100204.13297-1-benoit@scylladb.com>	2016-12-25 12:03:40 +02:00
Takuya ASADA	f3e45bc9ef	dist/redhat: don't try to adduser when user is already exists Currently we get "failed adding user 'scylla'" on .rpm installation when user is already exists, we can skip it to prevent error. Fixes #1958 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>	2016-12-25 11:37:25 +02:00
Piotr Jastrzebski	345ed5b6ff	intrusive_set: rename size() to calculate_size() This hopefully will make it more apparent that the time complexity of this method is O(N) not O(1). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	151fa3aaf0	Make intrusive_set_external_comparator::_value_traits static _value_traits can be shared among all instances and there's no need to store it in every single one. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	671affc36c	Implement intrusive set using rbtree_algorithms This new implementation takes less memory because it does not store comparator. It also uses tree nodes optimized for size. This means that instead of storing an enum field |color| they embed this information inside pointer to parent. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	b0f712a4e8	mutation_partition: make apply_reversibly_intrusive_set nongeneric apply_reversibly_intrusive_set is used only in one place and always with rows_type. There's no need for it to be generic. This will allow changing intrusive set implementation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	2af6ff68d9	mutation_partition: take schema in find_row and clustered_row This will allow intrusive set implementation that does not store schema. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	b3b924dec9	mutation_partition: Extract intrusive set logic to a class. It will make it easier to change the implementation of the intrusive set. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	ac7481f4b2	mutation_partition: Replace value_comp with key_comp calls This will reduce the size of bi::set API being used. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Tomasz Grabiec	f2a63270d1	sstables: Fix double close on index and data files when writing fails file output streams take the responsibility of closing the file, they will close the file as part of closing the stream. During sstable writing we create sstable object and keep file references there as well. Sstable object also has responsibility for closing the files, and does so from sstable::~sstable(). Double close was supposed to be avoided by a construct like this: writer.close().get(); _file = {}; However if close() failed, which can happen when write-ahead failed, _file would not be cleared, and both the writer and sstable would close the file. This will result in a crash in append_challenged_posix_file_impl::close(), which is not prepared to be closed twice. Another problem is that if exception happened before we reached that construct, we still should close the writer. Currently we don't, so there's no double close on the file, but that's a bug which needs to be fixed and once that's fixed double close on _file will be even more likely. The fix employed here is to not keep files inside sstable object when writing. As soon as the writer is constructed, it's the only owner of the file. Fixes #1764. Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>	2016-12-23 11:44:43 +02:00
Raphael S. Carvalho	fd80499b3d	database: make column_family::add_sstable() private again Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <38226308bee2970a91b0e35370d6a646b85ecfe9.1482459877.git.raphaelsc@scylladb.com>	2016-12-23 11:42:16 +02:00
Paweł Dziepak	e6d27ac529	query: introduce result_memory_accounter::foreign_state Range queries used to be performed sequentially and the shard performing part of the read was reading state of the merger's memory accounter directly. Now, they may be performed in parallel so it is safer to just pass relevant data by value to the interested shards so that they are not reading something that another shard is modifying at the same time. Since query is done in parallel there is a chance of overread. However, the parallelism is high only in sparsely populated tables and that's when the overread is less serious problem.	2016-12-22 17:16:24 +01:00
Paweł Dziepak	49d675223e	storage_proxy: fix short reads in parallel range queries Since `a1cafed370` "storage_proxy: handle range scans of sparsely populated tables" nonsingular range queries may be performed in parallel on multiple shards. The consequence of this is that results may be added to the merger out of order. This requires more complex logic for handling short reads. As soon as mutation_result_merger gets a short read it starts to discard all subsequently received results that are known to contain partitions with larger keys. Then when the final result is being prepared the merger may need to combine and sort results whose ordering is not known. If at least one of these results is a short one all partitions with larger keys are removed. Due to request being performed in parallel it is possible that even though there was a short read the merger has got enough live data to satisfy specified limits. If this has happened the short read flag is not set on the final result.	2016-12-22 17:16:24 +01:00
Paweł Dziepak	1a52569f7d	storage_proxy: pass maximum result size to replicas We may want to change the default individual result size limit in the future. If it is provided by the coordinator and not hardcoded in the replicas this can be done without causing data query digest mismatches or wasteful mutation query results.	2016-12-22 17:16:23 +01:00
Paweł Dziepak	40176ca2f8	mutation_partition: use result limiter for digest reads Even if we are performing a digest query we should do proper result memory accounting so that the result ends exactly in the same place that it would if it was a data query. This is to avoid digest mismatches between replicas.	2016-12-22 17:16:23 +01:00
Avi Kivity	8686a59ea5	dht: use nonwrapping_ranges in ring_position_range_sharder It was the observation that ring_position_range_sharder doesn't support wrapping ranges that started the nonwrapping_range madness, but that class still has some leftover wrapping ranges. Close the circle by removing them. Message-Id: <20161123153113.8944-1-avi@scylladb.com>	2016-12-22 14:40:30 +01:00
Takuya ASADA	7c3b98806d	dist/common/scripts/scylla_setup: improve the message of disk selection prompt Not to confuse users, describe we only list up unmounted disks. Fixes #1841 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>	2016-12-22 15:36:46 +02:00
Paweł Dziepak	a7d694654a	query: make result_memory_limiter constants available for linker	2016-12-22 13:35:04 +01:00
Paweł Dziepak	a0523df8d6	result_memory_limiter: add accounter for digest reads Digest reads differ from data reads in a way that they do not really consume any memory. We still want them to stop in the same place that data reads would, but the per-shard semaphore shouldn't be updated by them.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	38ee69dee0	idl: allow writers to use any output stream Original IDL generated code was hardcoded to always use bytes_ostream. This patch makes the output stream a template parameter so that any valid output stream can be used. Unfortunately, making IDL writers generic requires updates in the code that uses them, this is fixed in C++17 which would be able to deduce the parameter in most cases.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	aa083d3d85	result_memory_limiter: split new_read() to new_{data, mutation}_read() For data queries it is very important that all replicas get limited in the same place (this includes replicas returning only digest). That's why they shouldn't be affected by per-shard result memory limit. Moreover, we should make sure that individual memory limits are the same, making the coordinator provide it for replicas which allow to safely change it in the future. Mutation queries are not as sensitive but it is still beneficial to make sure that all replicas use the same individual limit.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	b8e29cc99c	idl: is_short_read() was added in 1.6	2016-12-22 13:35:04 +01:00
Paweł Dziepak	1c7cade559	mutation_partition: honour allowed_short_read for static rows	2016-12-22 13:35:04 +01:00
Paweł Dziepak	a7a454c388	storage_proxy: fix _is_short_read computation	2016-12-22 13:35:04 +01:00
Paweł Dziepak	8c1e4a707c	storage_proxy: disallow short reads if got no live rows If after reconciliation the coordinator ends up with no live rows and short reads are allowed a retry may not make any progress if replicas end their reads in the same place. The solution is to disallow short reads on retries which are caused by final result having no live rows.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	6db262446f	storage_proxy: don't stop after result with no live rows mutation_result_merger merges results from different shards and stops as soon as a shard returned a short read or memory usage on the merging shard is too high. However, it should never stop unless at least one live row is in the merged result.	2016-12-22 13:35:04 +01:00
Avi Kivity	74ecd7072a	Merge "Reduce overhead of get_max_purgeable_timestamp() during compaction" from Tomasz * 'tgrabiec/calculate-hash-once-compaction' of github.com:cloudius-systems/seastar-dev: sstables: Calculate key hash only once during compaction tests: sstables: Add more test cases to tombstone_purge_test db: Expose column_family::add_sstable tests: sstables: Ensure timestamps are increasing tests: sstables: Simplify tombstone_purge_test	2016-12-22 14:33:30 +02:00
Tomasz Grabiec	045b9fd7c1	sstables: Calculate key hash only once during compaction Improves compaction performance.	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	fb8765bef9	tests: sstables: Add more test cases to tombstone_purge_test	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	c7ff2a2bb0	db: Expose column_family::add_sstable Needed by compaction tests.	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	d841cab02c	tests: sstables: Ensure timestamps are increasing	2016-12-22 13:24:45 +01:00
Tomasz Grabiec	21ade8e4a4	tests: sstables: Simplify tombstone_purge_test - moved to seastar thread - extracted sstable creation and validation logic - reduced code duplication - switched to mutation_reader assertions - used result of compact_sstable() to locate the new sstable - rather than setting gc timestamp in the past, bump the clock before compacting	2016-12-22 13:24:41 +01:00
Tomasz Grabiec	bc6486b304	Use gc_clock instead of db_clock where possible Some code paths were obtaining db_clock timestamp to only convert it to gc_clock later. Avoid this. In the future we could make gc_clock cheaper cause it has low precision. Message-Id: <1482401190-2035-1-git-send-email-tgrabiec@scylladb.com>	2016-12-22 13:27:55 +02:00
Raphael S. Carvalho	c26090a6b2	sstables/compress: fix error message for snappy uncompression Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <898ad07db705355bdbf780afdb3aa982b8ca3823.1482364125.git.raphaelsc@scylladb.com>	2016-12-22 09:08:34 +01:00
Raphael S. Carvalho	27fb8ec512	db: avoid excessive disk usage during sstable resharding Shared sstables will now be resharded in the same order to guarantee that all shards owning a sstable will agree on its deletion nearly the same time, therefore, reducing disk space requirement. That's done by picking which column family to reshard in UUID order, and each individual column family will reshard its shared sstables in generation order. Fixes #1952. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>	2016-12-21 23:18:06 +02:00
Tomasz Grabiec	d87d50dc64	db: Use microsecond precision for server-side timestamps Currently server-side timestamps use a clock with millisecond precision. Timestamps have microsecond resolution, with lower bits used to serialize mutations originating from given client. Timestamps for column drops always use just the millisecond base. A column drop which is executed after an insert may thus be given lower timestamp than the insert, even when the two are serialized on the client side over same connection. Use microsecond precision to reduce chances of that event. This is supposed to fix sporadic failures of schema_test.py:TestSchema.drop_column_queries_test dtest. Message-Id: <1482343119-27698-1-git-send-email-tgrabiec@scylladb.com>	2016-12-21 18:03:22 +00:00
Avi Kivity	875635554d	Merge "Reduce overhead of partition presence checker during cache update" from Tomasz Refs #1943. * 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev: db: Compute key hash once in partition_presence_checker bloom_filter: Allow checking presence using pre-hashed key db: Use incremental selector in partition_presence_checker	2016-12-21 14:24:54 +02:00
Takuya ASADA	d356c21512	configure.py: don't allow to run multiple 'ninja -C seastar' on same time Scylla's build.ninja allows to run multiple 'ninja -C seastar' on same time, it breaks DPDK build after upgraded to DPDK-16.10: https://gist.github.com/syuu1228/4bd1170630b7e5f15653281b4728e521 To prevent it, we need to limit number of seastar build only one in same time. Note: it doesn't mean disabling parallel build on Seastar. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1482250560-20289-1-git-send-email-syuu@scylladb.com>	2016-12-21 12:42:52 +02:00
Vlad Zolotarov	62cad0f5f5	tracing: don't start tracing until a Tracing service is fully initialized RPC messaging service is initialized before the Tracing service, so we should prevent creation of tracing spans before the service is fully initialized. We will use an already existing "_down" state and extend it in a way that !_down equals "started", where "started" is TRUE when the local service is fully initialized. We will also split the Tracing service initialization into two parts: 1) Initialize the sharded object. 2) Start the tracing service: - Create the I/O backend service. - Enable tracing. Fixes issue #1939 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>	2016-12-21 12:40:14 +02:00
Gleb Natapov	0a2dd39c75	messaging_service: move MUTATION_DONE messages to separate connection If a node gets more MUTATION requests than it can handle via RPC it will stop reading from this RPC connection, but this will prevent it from getting MUTATION_DONE responses for requests it coordinates because currently MUTATION and MUTATION_DONE messages share the same connection. To solve this problem this patch moves MUTATION_DONE messages to a separate connection. Fixes: #1843 Message-Id: <20161201155942.GC11581@scylladb.com>	2016-12-21 11:10:15 +02:00
Piotr Jastrzebski	3e502de153	mutation_partition: don't use unique_ptr to manage LSA objects Unique_ptr won't destruct them correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>	2016-12-21 09:40:15 +01:00
Raphael S. Carvalho	e28537b56f	sstables: fix calculation of memory footprint for summary size of keys weren't taken into account, so value reported via collectd is much smaller than actual footprint. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>	2016-12-20 18:28:47 +00:00
Paweł Dziepak	d0e61fd092	test.py: remove '.cc' from view_schema_test	2016-12-20 18:26:52 +00:00
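
The ordering problem described in commit d87d50dc64 (microsecond precision for server-side timestamps) can be illustrated with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, and the way the low-order part of an insert timestamp is assigned (a simple per-connection sequence number) is an assumption made for illustration, not taken from the Scylla source. It shows why a column drop stamped with only the millisecond base can sort before an insert that preceded it, and why reading the clock with microsecond precision makes that less likely.

```python
# Hypothetical model of the timestamps discussed in commit d87d50dc64.
# All names here are illustrative assumptions, not Scylla APIs.

def insert_timestamp(clock_ms: int, seq: int) -> int:
    """Microsecond-resolution timestamp built from a millisecond clock.

    The low-order part (seq) orders mutations coming from one client
    within the same millisecond.
    """
    return clock_ms * 1000 + seq


def drop_timestamp_ms_base(clock_ms: int) -> int:
    """Behaviour before the commit: only the millisecond base is used."""
    return clock_ms * 1000


def drop_timestamp_us(clock_us: int) -> int:
    """Behaviour after the commit: the clock is read in microseconds."""
    return clock_us


# An insert, then a column drop 400 microseconds later in the same millisecond.
now_ms = 1_482_343_119_000
insert_ts = insert_timestamp(now_ms, seq=1)
old_drop_ts = drop_timestamp_ms_base(now_ms)
new_drop_ts = drop_timestamp_us(now_ms * 1000 + 400)

print(old_drop_ts < insert_ts)  # the later drop sorts before the insert
print(new_drop_ts > insert_ts)  # the drop now sorts after the insert
```

As the commit message says, microsecond precision only reduces the chance of the inversion; in this model a drop issued within the same microsecond as the insert could still receive a timestamp that is not greater.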

11057 Commits