scylladb

Author	SHA1	Message	Date
Nadav Har'El	2ea1922a4d	Materialized views: serialize read-modify-update of base table Before this patch, our Materialized Views implementation can produce incorrect results when given concurrent updates of the same base-table row. Such concurrent updates may result, in certain cases, in two different rows added to the view table, instead of just one with the latest data. In this patch we we add locking which serializes the two conflicting updates, and solves this problem. The locking for a single base-table column_family is implemented by the row_locker class introduced in a previous patch. A long comment in the code of this patch explains in more detail why this locking is needed, when, and what types of locks are needed: We sometimes need to lock a single clustering row, sometimes an entire partition, sometimes an exclusive lock and sometimes a shared lock. Fixes #3168 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-01-30 16:21:43 +02:00
Botond Dénes	b7d902a9e9	database: remove unused concurrency config members Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b257c7e9d403c55aaec34fc48863c18f9c9ae11a.1517314398.git.bdenes@scylladb.com>	2018-01-30 14:21:25 +02:00
Glauber Costa	0c00667206	streaming big: keep write_monitor alive until the end of flush After the new compaction controller code, the monitor has to be kept alive until the sstable is added to the SSTable set. This is correctly handled for all the writers, except the streaming big. That flusher is a big confusing, as it builds an sstable list first and only later adds the elements in the list to the sstable set. The monitors are destroyed at the end of phase 1, so we will SIGSEGV later when calling add_sstable(). The fix for this is to make sure the lifetime of the monitors are tied to the lifetime of the sstables being handled big the big streaming flush process. Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test Fixes #3131 Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180118202230.17107-1-glauber@scylladb.com>	2018-01-21 14:09:43 +02:00
Piotr Jastrzebski	4c74b8c7e7	Migrate materalized views to flat_mutation_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-18 07:32:35 +01:00
Glauber Costa	08a0c3714c	allow request-specific read timeouts in storage proxy reads Timeouts are a global property. However, for tables in keyspaces like the system keyspace, we don't want to uphold that timeout--in fact, we wan't no timeout there at all. We already apply such configuration for requests waiting in the queued sstable queue: system keyspace requests won't be removed. However, the storage proxy will insert its own timeouts in those requests, causing them to fail. This patch changes the storage proxy read layer so that the timeout is applied based on the column family configuration, which is in turn inherited from the keyspace configuration. This matches our usual way of passing db parameters down. In terms of implementation, we can either move the timeout inside the abstract read executor or keep it external. The former is a bit cleaner, the the latter has the nice property that all executors generated will share the exact same timeout point. In this patch, we chose the latter. We are also careful to propagate the timeout information to the replica. So even if we are talking about the local replica, when we add the request to the concurrency queue, we will do it in accordance with the timeout specified by the storage proxy layer. After this patch, Scylla is able to start just fine with very low timeouts--since read timeouts in the system keyspace are now ignored. Fixes #2462 Implementation notes, and general comments about open discussion in 2462: * Because we are not bypassing the timeout, just setting it high enough, I consider the concerns about the batchlog moot: if we fail for any other reason that will be propagated. Last case, because the timeout is per-CF, we could do what we do for the dirty memory manager and move the batchlog alone to use a different timeout setting. * Storage proxy likes specifying its timeouts as a time_point, whereas when we get low enough as to deal with the read_concurrency_config, we are talking about deltas. So at some point we need to convert time_points to durations. We do that in the database query functions. v2: - use per-request instead of per-table timeouts. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	3c9eeea4cf	restricted_mutation_reader: don't pass timeouts through the config structure This patch enables passing a timeout to the restricted_mutation_reader through the read path interface -- using fill_buffer and friends. This will serve as a basis for having per-timeout requests. The config structure still has a timeout, but that is so far only used to actually pass the value to the query interface. Once that starts coming from the storage proxy layer (next patch) we will remove. The query callers are patched so that we pass the timeout down. We patch the callers in database.cc, but leave the streaming ones alone. That can be safely done because the default for the query path is now no_timeout, and that is what the streaming code wants. So there is no need to complicate the interface to allow for passing a timeout that we intend to disable. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	80c4a211d8	consolidate timeout_clock At the moment, various different subsystems use their different ideas of what a timeout_clock is. This makes it a bit harder to pass timeouts between them because although most are actually a lowres_clock, that is not guaranteed to be the case. As a matter of fact, the timeout for restricted reads is expressed as nanoseconds, which is not a valid duration in the lowres_clock. As a first step towards fixing this, we'll consolidate all of the existing timeout_clocks in one, now called db::timeout_clock. Other things that tend to be expressed in terms of that clock--like the fact that the maximum time_point means no timeout and a semaphore that wait()s with that resolution are also moved to the common header. In the upcoming patch we will fix the restricted reader timeouts to be expressed in terms of the new timeout_clock. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Glauber Costa	40c428dc19	database: delete unused function no in-tree users. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Glauber Costa	4f1b875784	database: add a controller for I/O on memtable flushes. The algorithm and principle of operation is the same as the CPU controller. It is, however, always enabled and we will operate on I/O shares. I/O-bound workloads are expected to hit the maximum once virtual dirty fills up and stay there while the load is steady. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	244c564aac	compaction: adjust shares for compactions Compactions can be a heavy disk user and the I/O scheduler can always guarantee that it uses its fair share of disk. Such fair share can, however, be a lot more than what compaction indeed need. This patch draws on the controllers infrastructure to adjust the I/O shares that the compaction class will get so that compaction bandwidth is dynamically adjusted. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	1671d9c433	factor out some of the controller code The control algorithm we are using for memtables have proven itself quite successful. We will very likely use the same for other processes, like compactions. Make the code a bit more generic, so that a new controller has to only set the desired parameters Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:56:54 -05:00
Glauber Costa	3bd6bceaf0	sstables: add read_monitor_generator Passing the read monitor down to the sstable readers is tricky. The point of interest - like compaction - are usually very far from the interfaces that register the monitor, like read_rows. Between the two, there is usually a mutation_reader, which is and ought to be totally unaware of the read monitor: technically, a mutation_reader may not even know it is backed by sstables. The solution is to create a read_monitor_generator, that can be passed from the upper layers, like compaction, to the layers that are actually making the decision of which sstables to create readers for. Note that we don't need an equivalent piece of infrastructure for writes, because writes don't happen through hidden layers and have all the information they need to initialize their monitors. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Paweł Dziepak	24026a0c7d	db: fully convert incremental_reader_selector to flat readers	2017-12-13 12:01:03 +00:00
Paweł Dziepak	73b3d02cc0	db: make make_range_sstable_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	8b3c3fc832	db: make column_family::make_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	e12959616c	db: make column_family::make_sstable_reader() return a flat reader	2017-12-13 12:01:03 +00:00
Raphael S. Carvalho	f1b65a115a	db: introduce make_range_sstable_reader introduce reader variant that will allow its caller to read a range in a given table without any filter applied. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:26 -02:00
Raphael S. Carvalho	d1b146baa6	rename make_range_sstable_reader to make_local_shard_sstable_reader Tomek says: "I think that the least surprising behavior for a function named like this is to read the sstables unfiltered (it just reads them), and the filtering should be indicated specially in the name or by accepting a parameter." Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:25 -02:00
Raphael S. Carvalho	3d725d6823	db: extract sstable reader creation from incremental_reader_selector step closer to divorcing incremental_selector from sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 01:53:16 -02:00
Paweł Dziepak	dca93bea23	db: convert make_streaming_reader() to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	37640f223b	db: drop single-range make_streaming_reader()	2017-11-13 16:49:52 +00:00
Avi Kivity	d6cd44a725	Revert "Merge 'Single key sstable reader optimization' from Botond" This reverts commit `5e9cd128ad`, reversing changes made to `1f4e6759a7`. Tomek found some serious issues.	2017-10-19 12:47:21 +03:00
Botond Dénes	dfe312ca3a	Add counters for the single-key reader optimization Add two counters, one to determine how many of the reads fall into the optimization, and a second one to determine it's effectiveness. The first one is single_key_reader_optimization_hit_rate. It contains the rate of reads that the optimization applies to out of all the reads that go into the single_key_sstable_reader. The second one, single_key_reader_optimization_extra_read_proportion is a histogram of the efficiency of the optimization. It contains the proportion of extra data-sources read. It's a number between 0 and 1, where 0 is the best case (only one data-source was read) and 1 is the worst case (all data-sources were read eventually). This is the same number that is used for the threshold option (see previous patch). Each of the histogram's buckets cover a chunk of 0.1 from the [0, 1] range. Note that single_key_parallel_scan_threshold effectively provides an upper bound for the proportion as the optimization is turned off as soon as it goes above that number. The counters are disabled if single_key_parallel_scan_threshold is set to 0 disabling the optimization entirely.	2017-10-18 17:24:03 +03:00
Botond Dénes	08502f2d48	Add single_key_parallel_scan_threshold option This option regulates when exactly the single-key optimization is considered ineffective and turned off. The threshold is the proportion of the extra data source candidates that can be read before the optimization is considered ineffective and disabled. The proportion is calculated as follows: (read_data_sources - 1) / (total_data_sources - 1) We substract 1 from the read_data_sources and total_data_sources to effectively measure the rate of extra data sources we read. This makes sure that the proportion is meaningful even if e.g. we have only have a total of 2 data-sources and we read only 1 (best case). Whenever this number goes above the threshold the optimization is disabled. The threshold is number between 0 and 1, 0 forces the optimization off and 1 forces it on. Increase the treshold to favor throughput over latency for single-row reads, decrease the treshold to improve latency at the expense of throughput. If the threshold is > 0 (it's not force disabled) and the optimization is disabled due to a read crossing the threshold, we will issue "probing" reads (every 100th read) to determine if the optimization is worth re-enabling. Probing reads are allowed to run through the optimization path and if they go below the threshold the optimization is re-enabled.	2017-10-18 17:24:03 +03:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Raphael S. Carvalho	16dd0d15fc	sstables: make get_shards_for_this_sstable return const ref Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>	2017-10-12 11:58:23 +02:00
Botond Dénes	a43901f842	row_consumer: de-virtualize io_priority() and resource_tracker() Fixes #2830 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>	2017-10-06 18:50:12 +01:00
Duarte Nunes	a011eb72c2	Merge branch 'CQL secondary index backing views' from Pekka "This patch series adds backing materialized view for secondary indices. When a new index is created with the 'CREATE INDEX' statement, a backing materialized view is created automatically. For example, assuming the following table: CREATE TABLE ks1.users ( userid uuid, email text, PRIMARY KEY (userid) ); When the following index is created: CREATE INDEX user_email ON ks1.users (email); The following materialized view is also created: cqlsh> DESCRIBE ks1.users; <snip> CREATE MATERIALIZED VIEW ks1.user_email_index AS SELECT email, userid FROM ks1.users WHERE email IS NOT NULL PRIMARY KEY (email, userid) WITH CLUSTERING ORDER BY (userid ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE'; CQL queries will use the backing materialized view as part of queries on indexed columns to fetch the primary keys." * 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev: schema_tables: Create backing view for indices database: Kill obsolete secondary index manager stub cql3: Wire up secondary index manager cql3/restrictions: Add term_slice::is_supported_by() function index: Add secondary_index_manager::create_view_for_index() index: Add target_parser::parse() helper cql3/statements: Add index_target::from_sstring() helper index: Add secondary_index_manager::get_dependent_indices() index: Add secondary_index_manager::reload() index: Add secondary_index_manager::list_indexes() index: Add index class index: Pass column_family to secondary_index_manager constructor database: Make secondary index manager per-column family	2017-10-05 12:08:14 +01:00
Pekka Enberg	5d30ad5e1a	database: Kill obsolete secondary index manager stub	2017-10-05 10:07:44 +03:00
Botond Dénes	fea6214a0a	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics. Use labels to distinguish between system, user and streaming reads related metrics.	2017-10-03 12:44:17 +03:00
Botond Dénes	47e07b787e	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emmited by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-10-03 12:44:12 +03:00
Avi Kivity	78eae8bf48	Revert "Merge "Make restricting_mutation_reader more accurate" from Botond" This reverts commit `c6e5dcc556`, reversing changes made to `19b21a0ab2`. Failes to build, plus author has more changes.	2017-10-03 11:58:59 +03:00
Avi Kivity	c6e5dcc556	Merge "Make restricting_mutation_reader more accurate" from Botond "Currently restricting_mutation_reader restricts mutation_readears on a count basis. This is inaccurate on multiple levels. The reader might be a combined_mutation_reader, which might be composed of multiple individual readers, whose number might change during the lifetime of the reader. The memory consumption of the readers can vary and may change during the lifetime of the reader as well. To remedy this, make the restriction memory-consumption based. The restricting semaphore is now configured with the amound of memory (bytes) that its readers are allowed to consume in total. New readers consume 128k units up-front to account for read-ahead buffers, and then consume additional units for any buffer (returned from input_stream<>::read()) they keep around. Like before, readers already allowed to read will not be blocked, instead new readers will be blocked on their first read if all the units all consumed." Fixes #2692. * 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla: Update reader restriction related metrics Add restricted_reader_test unit test restricted_mutation_reader: restrict based-on memory consumption mutation_reader.hh: Move restricted_reader related code	2017-10-03 11:15:34 +03:00
Raphael S. Carvalho	63eb9f61c0	db: use correct dirty memory manager for system column families Dirty memory manager for non-system column families was being used when applying mutations to system cfs. That previously lead to deadlock when updating history. Basically, write disable waits on compaction, and compaction waits on a write that would release dirty memory for updating compaction history. Only using the correct dirty manager wouldn't solve this problem if write is disabled for system cf, but the problem is completely solved in addition to previous change which updates history outside the sstable lock. Refs #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>	2017-09-26 19:51:31 +02:00
Botond Dénes	43dba8f173	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics.	2017-09-20 11:16:21 +03:00
Botond Dénes	33e97e7457	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emmited by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-09-20 11:14:35 +03:00
Gleb Natapov	31e803a36c	storage_proxy: wire up percentile speculative read properly Collect coordinator side read statistic per CF and use them in percentile speculative read executor. Getting percentile from estimated_histogram object is rather expensive, so cache it and recalculate only once per second (or if requested percentile changes). Fixes #2757 Message-Id: <20170911131752.27369-3-gleb@scylladb.com>	2017-09-14 10:31:26 +03:00
Avi Kivity	f7023501d6	treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable> Since shared_sstable is going to be its own type soon, we can't use the old alias.	2017-09-12 10:43:05 +03:00
Avi Kivity	88b91c84a1	database: make column_family::disable_sstable_write() out-of-line Reduces dependencies.	2017-09-12 10:43:05 +03:00
Avi Kivity	02e5bf1c20	database.hh: add missing forward declaration for foreign_sstable_open_info Supplied by an incidental include now, but it will be gone soon.	2017-09-12 10:43:05 +03:00
Avi Kivity	c4bafd912c	sstables: extract version and format enum into a separate header file This allows removing a dependency on sstables.hh later on.	2017-09-12 10:43:05 +03:00
Avi Kivity	9b540eccb0	database: remove dependency on compaction.hh and compaction_manager.hh	2017-09-11 20:09:45 +03:00
Avi Kivity	f9c8c1ddc2	database: add indirection to compaction_manager instance Allows making it forward-declared later on, reducing dependencies.	2017-09-11 20:09:45 +03:00
Avi Kivity	9d0aaa941a	database: make run_with_compaction_disabled() a non-template Allows reducing dependencies down the line, and un-templating non-performance-critical functions is a good thing.	2017-09-11 20:09:45 +03:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk muation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves no-failure thing by wiping the cache (which is noexcept) in case failure is detect. There are some setup steps hoever which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and making the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modiciation of the underlying mutation source when invoked. This is in-line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guaratnees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bf75b882ae	database: Add non-throwing try_trigger_compaction()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	116d4ae02b	database: Make add_sstable() have strong exception guarantees If insert() fails, we left the database with stats updated, but sstable not being attached.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Glauber Costa	83323e155e	database: add gate for generic async operations to column family run_with_compaction_disabled(), which is called by truncate, has a pretty large defer point in remove(). When the code gets to finally execute, we can't guarantee that the column family will still be alive. That is true in particular if we issued a drop table command following truncate: by the time truncate gets to resume, the CF will be gone. Before the column family is dropped, it will always call its stop() method, which means we have an opportunity to do some waiting there. We already wait for flushes and current compactions to end. Traditionally, we have been solving similar problems by adding a gate that will catch asynchronous operations and making sure that potentially asynchronous operations will enter the gate before executing. Let's do the same thing here. We will close() the gate during stop(). Fixes #2726 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-08-24 13:12:57 -04:00
Pekka Enberg	981e320d54	database: Make secondary index manager per-column family Make the secondary index manager per-column family like in Apache Cassandra to keep CQL front-end similar between the two codebases.	2017-08-24 14:00:02 +03:00

1 2 3 4 5 ...

598 Commits