scylladb

Author	SHA1	Message	Date
Piotr Jastrzebski	6cd4b6b09c	Remove sstable_range_wrapping_reader The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Paweł Dziepak	dca93bea23	db: convert make_streaming_reader() to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	37640f223b	db: drop single-range make_streaming_reader()	2017-11-13 16:49:52 +00:00
Glauber Costa	a6b2226562	dirty_memory_manager: block if we hit the real dirty limit Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode using all the box's memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). A next step is to add a controller that will increase the priority of the tasks involved in releasing real dirty memory if we get dangerously close to the threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Avi Kivity	d6cd44a725	Revert "Merge 'Single key sstable reader optimization' from Botond" This reverts commit `5e9cd128ad`, reversing changes made to `1f4e6759a7`. Tomek found some serious issues.	2017-10-19 12:47:21 +03:00
Botond Dénes	dfe312ca3a	Add counters for the single-key reader optimization Add two counters, one to determine how many of the reads fall into the optimization, and a second one to determine its effectiveness. The first one is single_key_reader_optimization_hit_rate. It contains the rate of reads that the optimization applies to out of all the reads that go into the single_key_sstable_reader. The second one, single_key_reader_optimization_extra_read_proportion is a histogram of the efficiency of the optimization. It contains the proportion of extra data-sources read. It's a number between 0 and 1, where 0 is the best case (only one data-source was read) and 1 is the worst case (all data-sources were read eventually). This is the same number that is used for the threshold option (see previous patch). Each of the histogram's buckets covers a chunk of 0.1 from the [0, 1] range. Note that single_key_parallel_scan_threshold effectively provides an upper bound for the proportion as the optimization is turned off as soon as it goes above that number. The counters are disabled if single_key_parallel_scan_threshold is set to 0 disabling the optimization entirely.	2017-10-18 17:24:03 +03:00
Botond Dénes	08502f2d48	Add single_key_parallel_scan_threshold option This option regulates when exactly the single-key optimization is considered ineffective and turned off. The threshold is the proportion of the extra data source candidates that can be read before the optimization is considered ineffective and disabled. The proportion is calculated as follows: (read_data_sources - 1) / (total_data_sources - 1) We subtract 1 from the read_data_sources and total_data_sources to effectively measure the rate of extra data sources we read. This makes sure that the proportion is meaningful even if e.g. we only have a total of 2 data-sources and we read only 1 (best case). Whenever this number goes above the threshold the optimization is disabled. The threshold is a number between 0 and 1, 0 forces the optimization off and 1 forces it on. Increase the threshold to favor throughput over latency for single-row reads, decrease the threshold to improve latency at the expense of throughput. If the threshold is > 0 (it's not force disabled) and the optimization is disabled due to a read crossing the threshold, we will issue "probing" reads (every 100th read) to determine if the optimization is worth re-enabling. Probing reads are allowed to run through the optimization path and if they go below the threshold the optimization is re-enabled.	2017-10-18 17:24:03 +03:00
Botond Dénes	3c1fa3ecc1	single_key_sstable_reader: optimize single-row queries For single-row queries that only query atomic cells one can put a lower bound on the timestamps which may affect the query results and thus rule out entire data sources. This allows the query to read only those sstables that actually contribute to the result. To do this we incrementally move through the sstables overlapping with the query range, checking after each read mutation whether we already have a value for all required cells and whether the lower-bound of their timestamps is higher than the upper-bound of the timestamps of all the remaining data-sources. When this condition is met we terminate the read.	2017-10-18 17:24:03 +03:00
Botond Dénes	5fc44c4307	single_key_sstable_reader: move reading code into its own method	2017-10-18 17:24:03 +03:00
Paweł Dziepak	c28e31eac4	database: fix build (auto shards&)	2017-10-18 13:10:00 +01:00
Duarte Nunes	446e5f53db	database: Avoid superfluous shards_for_this_sstable vector copies Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171018112643.40411-1-duarte@scylladb.com>	2017-10-18 15:00:52 +03:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Tomasz Grabiec	6d5a0f8a98	db: Add debug-level logging related to streaming Message-Id: <1505896395-30203-1-git-send-email-tgrabiec@scylladb.com>	2017-10-16 18:49:10 +01:00
Raphael S. Carvalho	16dd0d15fc	sstables: make get_shards_for_this_sstable return const ref Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>	2017-10-12 11:58:23 +02:00
Duarte Nunes	bb89b97cbb	cache_hit_rate: Avoid copies in get_hit_rate() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	ceebbe14cc	gossiper: Avoid endpoint_state copies gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This patch adds a similar function which returns a pointer instead, and changes the call sites where using the pointer-returning variant is deemed safe (the pointer neither escapes the function, nor crosses any defer point). Fixes #764 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:48:02 +01:00
Duarte Nunes	a011eb72c2	Merge branch 'CQL secondary index backing views' from Pekka "This patch series adds backing materialized view for secondary indices. When a new index is created with the 'CREATE INDEX' statement, a backing materialized view is created automatically. For example, assuming the following table: CREATE TABLE ks1.users ( userid uuid, email text, PRIMARY KEY (userid) ); When the following index is created: CREATE INDEX user_email ON ks1.users (email); The following materialized view is also created: cqlsh> DESCRIBE ks1.users; <snip> CREATE MATERIALIZED VIEW ks1.user_email_index AS SELECT email, userid FROM ks1.users WHERE email IS NOT NULL PRIMARY KEY (email, userid) WITH CLUSTERING ORDER BY (userid ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE'; CQL queries will use the backing materialized view as part of queries on indexed columns to fetch the primary keys." 
* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev: schema_tables: Create backing view for indices database: Kill obsolete secondary index manager stub cql3: Wire up secondary index manager cql3/restrictions: Add term_slice::is_supported_by() function index: Add secondary_index_manager::create_view_for_index() index: Add target_parser::parse() helper cql3/statements: Add index_target::from_sstring() helper index: Add secondary_index_manager::get_dependent_indices() index: Add secondary_index_manager::reload() index: Add secondary_index_manager::list_indexes() index: Add index class index: Pass column_family to secondary_index_manager constructor database: Make secondary index manager per-column family	2017-10-05 12:08:14 +01:00
Pekka Enberg	4045e1ec09	schema_tables: Create backing view for indices This patch wires calls to secondary index manager reload() in merge_tables_and_views() and changes make_update_indices_mutations() to also create mutations for the backing materialized view. After this patch, "CREATE INDEX" CQL statement also creates a materialized view.	2017-10-05 10:07:44 +03:00
Botond Dénes	fea6214a0a	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics. Use labels to distinguish between system, user and streaming reads related metrics.	2017-10-03 12:44:17 +03:00
Botond Dénes	47e07b787e	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-10-03 12:44:12 +03:00
Avi Kivity	78eae8bf48	Revert "Merge "Make restricting_mutation_reader more accurate" from Botond" This reverts commit `c6e5dcc556`, reversing changes made to `19b21a0ab2`. Fails to build, plus author has more changes.	2017-10-03 11:58:59 +03:00
Avi Kivity	c6e5dcc556	Merge "Make restricting_mutation_reader more accurate" from Botond "Currently restricting_mutation_reader restricts mutation_readers on a count basis. This is inaccurate on multiple levels. The reader might be a combined_mutation_reader, which might be composed of multiple individual readers, whose number might change during the lifetime of the reader. The memory consumption of the readers can vary and may change during the lifetime of the reader as well. To remedy this, make the restriction memory-consumption based. The restricting semaphore is now configured with the amount of memory (bytes) that its readers are allowed to consume in total. New readers consume 128k units up-front to account for read-ahead buffers, and then consume additional units for any buffer (returned from input_stream<>::read()) they keep around. Like before, readers already allowed to read will not be blocked, instead new readers will be blocked on their first read if all the units are consumed." Fixes #2692. * 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla: Update reader restriction related metrics Add restricted_reader_test unit test restricted_mutation_reader: restrict based-on memory consumption mutation_reader.hh: Move restricted_reader related code	2017-10-03 11:15:34 +03:00
Raphael S. Carvalho	63eb9f61c0	db: use correct dirty memory manager for system column families Dirty memory manager for non-system column families was being used when applying mutations to system cfs. That previously led to deadlock when updating history. Basically, write disable waits on compaction, and compaction waits on a write that would release dirty memory for updating compaction history. Only using the correct dirty manager wouldn't solve this problem if write is disabled for system cf, but the problem is completely solved in addition to previous change which updates history outside the sstable lock. Refs #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>	2017-09-26 19:51:31 +02:00
Raphael S. Carvalho	e34c1db642	db: update compaction history outside the sstable write lock The reason to do that is because compaction can deadlock if refresh disables write which waits for compaction, and compaction in turn waits for dirty memory[1] that would be released by memtable write. Dirty memory manager for non-system cfs was being used for system cfs, which was useful for exposing this problem. [1]: when updating compaction history. Fixes #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>	2017-09-26 19:51:12 +02:00
Botond Dénes	43dba8f173	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics.	2017-09-20 11:16:21 +03:00
Botond Dénes	33e97e7457	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-09-20 11:14:35 +03:00
Botond Dénes	96c6d54a5c	incremental_reader_selector: Remove unnecessary check for duplicated next_token The next_token will never be the same as the current _selector_position, unless they are both maximum_token, which is already handled. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <9c54ae07a18d201185027c9b533bcb5256bead8a.1505826102.git.bdenes@scylladb.com>	2017-09-19 16:42:02 +03:00
Botond Dénes	8cb953b58b	incremental_reader_selector: don't create readers unconditionally on ff When fast-forwarding check that the new position is past the selector before attempting to create new readers. Also don't clear the set of already created readers and don't overwrite the selector position. Fixes #2807 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <514f69005eb29c2a3359f098d40abf588900b76f.1505811064.git.bdenes@scylladb.com>	2017-09-19 11:27:47 +02:00
Glauber Costa	51829f528d	sstables: move write_monitor to its own header Soon I am about to introduce a read monitor, and pairing infrastructure to manage it. Having it all live in sstables.hh forces including it every time, even in places that don't really need it. Move to its own header. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-09-15 14:09:07 -04:00
Glauber Costa	eb93d5f8ad	database: pass a monitor as a parameter to memtable writer Right now we pass a permit to the memtable writer and that permit is used inside write_memtable_to_sstable to compose a write_monitor. We would like to extend the write_monitor to include other things, that right now are not available as parameters to write_memtable_to_sstable - and which are possibly too specialized to be. The solution for that is to pass the write_monitor instead of the permit to the writer. Conceptually, that also makes sense because the write_monitor is something the sstable writer is aware of. Permits, on the other hand, are a database concept that is alien to the sstable writer. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170915032836.21154-1-glauber@scylladb.com>	2017-09-15 12:26:56 +02:00
Gleb Natapov	31e803a36c	storage_proxy: wire up percentile speculative read properly Collect coordinator side read statistic per CF and use them in percentile speculative read executor. Getting percentile from estimated_histogram object is rather expensive, so cache it and recalculate only once per second (or if requested percentile changes). Fixes #2757 Message-Id: <20170911131752.27369-3-gleb@scylladb.com>	2017-09-14 10:31:26 +03:00
Raphael S. Carvalho	ef18b1162b	sstables/compaction_manager: rename and better explain reshard function submit doesn't properly describe the function and also improve explanation of the relationship between function itself and its job parameter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170912032034.23043-1-raphaelsc@scylladb.com>	2017-09-12 12:25:17 +03:00
Tomasz Grabiec	3f527e028d	Merge "Reduce dependencies on sstables.hh" from Avi This patchset reduces includes of sstables.hh, reducing compile time by both reducing the amount of code compiled, and the amount of needless recompiles caused by false dependencies. It does so by replacing lw_shared_ptr<sstable>, which requires a complete class, with a new custom type shared_sstable, which allows an incomplete sstable class definition. * https://github.com/avikivity/scylla deps2/v2.1 database: change truncate() to flush while compaction is disabled database: make run_with_compaction_disabled() a non-template database: add indirection to compaction_manager instance database: remove dependency on compaction.hh and compaction_manager.hh size_estimates_virtual_reader.hh: add missing include system_keyspace: add missing include main: add missing include storage_service: add missing include repair: add missing include compaction.hh: add missing include and forward declaration compaction_manager: add missing include shared_index_lists.hh: add missing include perf_fast_forward: add missing include sstable_mutation_test: add missing include sstables: extract version and format enum into a separate header file database.hh: add missing forward declaration for foreign_sstable_open_info cql_test_env: add forward declaration database: make column_family::disable_sstable_write() out-of-line sstables: introduce make_sstable() as a shortcut for make_lw_shared<sstable> treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable> sstables: use support for lw_shared_ptr with incomplete type for shared_sstable sstables: reduce dependencies streaming: remove unneeded includes	2017-09-12 09:56:46 +02:00
Tomasz Grabiec	ee1e7732a6	database: Create tables with continuous cache When table is created, it doesn't contain any data, so we can mark the whole data range as continuous in cache. This way reads will immediately hit, and flushes will populate. If sstables are later attached, the attaching process is supposed to invalidate affected ranges (and it does). Fixes #2536. Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>	2017-09-12 10:53:07 +03:00
Avi Kivity	f7023501d6	treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable> Since shared_sstable is going to be its own type soon, we can't use the old alias.	2017-09-12 10:43:05 +03:00
Avi Kivity	88b91c84a1	database: make column_family::disable_sstable_write() out-of-line Reduces dependencies.	2017-09-12 10:43:05 +03:00
Avi Kivity	9b540eccb0	database: remove dependency on compaction.hh and compaction_manager.hh	2017-09-11 20:09:45 +03:00
Avi Kivity	f9c8c1ddc2	database: add indirection to compaction_manager instance Allows making it forward-declared later on, reducing dependencies.	2017-09-11 20:09:45 +03:00
Avi Kivity	9d0aaa941a	database: make run_with_compaction_disabled() a non-template Allows reducing dependencies down the line, and un-templating non-performance-critical functions is a good thing.	2017-09-11 20:09:45 +03:00
Avi Kivity	6b5514a3df	database: change truncate() to flush while compaction is disabled In preparation to make run_with_compaction_disabled() a non-template, we want to remove any non-copyable captures (so the function can be an std::function, which requires copyability). Move the flush within the compaction disabled region. This changes the behavior, but it shouldn't matter.	2017-09-11 20:09:45 +03:00
Paweł Dziepak	e401d2d50b	db: reject non-Scylla counter sstables in flush_upload_dir Scylla already refuses to load counter sstables that do not have Scylla component. However, if this happens because of 'nodetool refresh' command the existing protection will trigger after sstables have been moved to the data directory. This is too late, so an additional check is added when the upload directory is scanned.	2017-09-06 12:04:26 +01:00
Paweł Dziepak	6a5e8bace1	db: disallow loading non-Scylla counter sstables Scylla does not support local and remote counter shards. This means that it is unsafe to directly load sstables that may contain them.	2017-09-06 12:03:58 +01:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk mutation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves no-failure thing by wiping the cache (which is noexcept) in case failure is detected. There are some setup steps however which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and making the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modification of the underlying mutation source when invoked. This is in-line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guarantees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. 
It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bf75b882ae	database: Add non-throwing try_trigger_compaction()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	116d4ae02b	database: Make add_sstable() have strong exception guarantees If insert() fails, we left the database with stats updated, but sstable not being attached.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	df787afe6a	database: Supply presence checker in sstable snapshots	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	ab8632b225	database: Add missing serialization of sstable set update and cache invalidation Commit `e3ad676433` missed a few places. It is required to serialize sstable list update and cache synchronization in order to preserve partition update isolation. Fixes #2746.	2017-09-04 10:04:29 +02:00
Glauber Costa	e642aee3f7	database: wait for asynchronous operations to end before closing CF This was part of "add gate for generic async operations to column family" but somehow didn't make it into the final patch. Add the missing piece. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170830164205.4497-1-glauber@scylladb.com>	2017-08-31 11:16:30 +03:00
Tzach Livyatan	12fb975282	Fix typos in metrics description Fixes #2658 Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170803121732.19640-1-tzach@scylladb.com>	2017-08-28 10:48:28 +03:00
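
Note on commit 08502f2d48: the proportion formula quoted in its message, (read_data_sources - 1) / (total_data_sources - 1), can be illustrated with a minimal Python sketch. This is not Scylla code. The function names are hypothetical, and returning 0.0 when only one data source exists is an assumption made here to avoid dividing by zero; the commit message does not say how that case is handled.

```python
def extra_read_proportion(read_data_sources: int, total_data_sources: int) -> float:
    """Proportion of *extra* data sources read, per the formula in the commit message."""
    if total_data_sources <= 1:
        # Assumption (not from the commit): with a single data source
        # there is nothing extra to read, so report the best case.
        return 0.0
    return (read_data_sources - 1) / (total_data_sources - 1)


def optimization_stays_enabled(read_data_sources: int,
                               total_data_sources: int,
                               threshold: float) -> bool:
    """Hypothetical check: a threshold of 0 forces the optimization off;
    otherwise it is disabled once the proportion goes above the threshold."""
    if threshold <= 0:
        return False
    return extra_read_proportion(read_data_sources, total_data_sources) <= threshold
```

With 5 overlapping data sources, reading 3 of them gives a proportion of 0.5, which stays within a threshold of 0.5; reading 4 gives 0.75 and would cross it. The probing reads described in the commit (every 100th read) are left out of this sketch.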

923 Commits