scylladb

Author	SHA1	Message	Date
Avi Kivity	cb549c767a	database: rename column_family to table The name "column_family" is both awkward and obsolete. Rename to the modern and accurate "table". An alias is kept to avoid huge code churn. To prevent a One Definition Rule violation, a preexisting "table" type is moved to a new namespace row_cache_stress_test. Tests: unit (release) Message-Id: <20180624065238.26481-1-avi@scylladb.com>	2018-06-24 14:54:46 +03:00
Glauber Costa	290d553c3a	compaction_strategy: allow the user to tell us if min_threshold has to be strict Now that we have the controller, we would like to take min_threshold as a hint. If there is nothing to compact, we can ignore that and start compacting less than min_threshold SSTables so that the backlog keeps reducing. But there are cases in which we don't want min_threshold to be a hint and we want to enforce it strictly. For instance, if write amplification is more of a concern than space amplification. This patch adds a YAML option that allows the user to tell us that. We will default to false, meaning min_threshold is not strictly enforced. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-06-15 13:42:43 -04:00
Gleb Natapov	f41575a156	Provide available memory size to database object during creation	2018-06-11 15:34:13 +03:00
Avi Kivity	2582f53b44	Merge "database and API: Add column_family::get_sstables_by_key" from Amnon " This series is for nodetool getsstables. This patch is based on: `8daaf9833a` With some minor adjustments because of the code change in sstables. The idea is to allow searching for all the sstables that contain a given key. After this patch, if there is a table t1 in keyspace k1 and it has a key called aa, curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa" will return the list of sstable file names that contain that key. " * 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev: Add the API implementation to get_sstables_by_key api: column_family.json make the get_sstables_for_key doc clearer column_family: Add the get_sstables_by_partition_key method sstable test: add has_partition_key test sstable: Add has_partition_key method keys_test: add a test for nodetool_style string keys: Add from_nodetool_style_string factory method	2018-06-10 16:53:56 +03:00
Amnon Heiman	acb0a738eb	column_family: Add the get_sstables_by_partition_key method The get_sstables_by_partition_key method is used by the API to return the set of sstable names that hold a given partition key. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2018-06-10 16:13:01 +03:00
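The `k1%3At1` path segment in the curl example above is the percent-encoded form of `k1:t1`. As a minimal sketch (the helper name `sstables_by_key_url` is hypothetical; the host, port, and path are copied from the commit message), the URL can be assembled with Python's standard library:

```python
from urllib.parse import quote, urlencode

# Base URL taken from the curl example in the commit message.
BASE = "http://localhost:10000/column_family/sstables/by_key"

def sstables_by_key_url(keyspace, table, key):
    """Hypothetical helper: build the by_key URL for keyspace:table and key."""
    # ':' must be percent-encoded inside the path segment, hence safe="".
    name = quote(f"{keyspace}:{table}", safe="")
    return f"{BASE}/{name}?{urlencode({'key': key})}"

url = sstables_by_key_url("k1", "t1", "aa")
print(url)
```

The sketch only builds the URL; it does not contact a node.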
Asias He	6496cdf0fb	db: Get rid of the streaming memtable delayed flush In `455d5a5` (streaming memtables: coalesce incoming writes), we introduced the delayed flush to coalesce incoming streaming mutations from different stream_plan. However, most of the time there will be one stream plan at a time, the next stream plan won't start until the previous one is finished. So, the current coalescing does not really work. The delayed flush adds 2s of delay for each stream session. If we have lots of tables to stream, we will waste a lot of time. We stream a keyspace in around 10 stream plans, i.e., 10% of ranges at a time. If we have 5000 tables, even if the tables are almost empty, the delay will waste 5000 * 10 * 2 s = 100000 s, about 27 hours. To stream a keyspace with 4 tables, each table has 1000 rows. Before: [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds After: [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds Fixes #3436 Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>	2018-06-06 10:16:02 +03:00
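The delay estimate in the commit above is plain arithmetic. The following sketch reproduces it using only the figures quoted in the message (5000 tables, about 10 stream plans per keyspace, 2 s of delay per stream session); the variable and function names are illustrative, not taken from the Scylla source:

```python
# Figures quoted in the commit message.
TABLES = 5000
STREAM_PLANS_PER_KEYSPACE = 10
DELAY_PER_SESSION_S = 2

def wasted_seconds(tables, plans, delay_s):
    """Total time spent only on waiting for delayed flushes."""
    return tables * plans * delay_s

total_s = wasted_seconds(TABLES, STREAM_PLANS_PER_KEYSPACE, DELAY_PER_SESSION_S)
total_h = total_s / 3600
print(f"{total_s} s is about {total_h:.1f} h")
```

This gives 100000 seconds, which is where the "27 hours" figure comes from.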
Piotr Sarna	f8237dd664	database: do not truncate already removed views This commit clears table's views before truncating it in drop_column_family function. The only case when views are not empty during drop is when they're backing secondary indexes of a base table and they are all atomically dropped in the same go as the base table itself. This change will prevent trying to truncate views that were already dropped, which used to result in no_such_column_family error. References #3202	2018-05-22 21:10:51 +02:00
Duarte Nunes	a3bbd52e2e	Merge 'Add materialized view metrics' from Piotr " This series introduces materialized view statistics, as stated in issue #3385: - updates pushed - updates failed - row lock stats It also addresses issue #3416 by decoupling user write stats from view update stats. " * 'materialized_view_metrics_9' of https://github.com/psarna/scylla: view: adapt view_stats to act as write stats storage_proxy: decouple write_stats from stats db: add row locking metrics view: add view metrics	2018-05-22 18:41:51 +01:00
Piotr Sarna	9246bb36bc	db: add row locking metrics This commit adds statistics to row_locker class. Metrics are independently counted for all lock types: row<->partition and exclusive<->shared. Metrics gathered: - total acquisitions - operations that wait on the lock - histogram of the time spent on waiting on this type of lock References #3385 References #3416	2018-05-22 16:52:58 +02:00
Piotr Sarna	49bebcfa25	view: add view metrics This commit introduces view statistics: - updates pushed to local/remote replicas - updates failed to be pushed to local/remote replicas Metrics are kept on per-table basis, i.e. updates_pushed_remote shows the number of total updates (mutations) pushed to all paired mv replicas that this particular table has. Every single update is taken into consideration, so if view update requires removing a row from one view and adding a row to another, it will be counted as 2 updates. References #3385 References #3416	2018-05-22 16:52:58 +02:00
Glauber Costa	d758a416f8	backlog_controller: move compaction controller to the compaction manager There was recently an attempt to add minimum shares to major compactions which ended up being harder than it should be due to all the plumbing necessary to call the compaction controller from inside the compaction manager-- since it is currently a database object. We had this problem again when trying to return fixed shares in case of an exception. Taking a step back, all of those problems stem from the fact that the compaction controller really shouldn't be a part of the database: as it deals with compactions and its consequences it is a lot more natural to have it inside the compaction manager to begin with. Once we do that, all the aforementioned problems go away. So let's move there where it belongs. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-05-22 09:24:19 -04:00
Duarte Nunes	c053275a48	db/view/row_locking: Add timeout when waiting for the lock This ensures we respect the write timeout set by the client when applying base writes, in case a write takes too long to acquire the row lock for the read-before-write phase of a materialized view update. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180507132755.8751-1-duarte@scylladb.com>	2018-05-07 18:22:39 +01:00
Duarte Nunes	4b3562c3f5	db/view: Limit number of pending view updates This patch adds a simple and naive mechanism to ensure a base replica doesn't overwhelm a potentially overloaded view replica by sending too many concurrent view updates. We add a semaphore to limit to 100 the number of outstanding view updates. We limit globally per shard, and not per destination view replica. We also limit statically. Refs #2538 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180426134457.21290-2-duarte@scylladb.com>	2018-05-07 11:25:27 +03:00
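The limiting mechanism described in the commit above (one semaphore capping outstanding view updates at 100, shared across all destinations) can be illustrated with a small sketch. This is not Scylla's implementation, which is C++ on Seastar; it is a Python `asyncio` analogue, and `send_view_update`, `push_updates`, and the `stats` dictionary are hypothetical names used only for illustration:

```python
import asyncio

MAX_PENDING_VIEW_UPDATES = 100  # static limit named in the commit message

async def send_view_update(sem, stats, update_id):
    # Hypothetical stand-in for sending one update to a view replica.
    async with sem:  # waits here once 100 updates are already in flight
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # placeholder for the network round trip
        stats["in_flight"] -= 1
    return update_id

async def push_updates(n):
    # One semaphore for everything: the limit is global, not per destination.
    sem = asyncio.Semaphore(MAX_PENDING_VIEW_UPDATES)
    stats = {"in_flight": 0, "peak": 0}
    tasks = (send_view_update(sem, stats, i) for i in range(n))
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]

results, peak = asyncio.run(push_updates(1000))
print(len(results), peak)
```

All 1000 updates complete, but the number in flight at any moment never exceeds the limit.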
Piotr Sarna	fe02c3d0e2	database, sstables, tests: add large_partition_handler This commit makes database, sstables and tests aware of which large_partition_handler they use. Proper large_partition_handler is retrievable from config information and is based on existing compaction_large_partition_warning_threshold_mb entry. Right now CQL TABLE variant of large_partition_handler is used in the database. Tests use a NOP version of large_partition_handler, which does not depend on CQL queries at all.	2018-05-04 14:38:13 +02:00
Duarte Nunes	f298f57137	column_family: Add function to populate views The populate_views() function takes a set of views to update, a token to select base table partitions, and the set of sstables to query. This lays the foundation for a view building mechanism to exist, which walks over a given base table, reads data token-by-token, calculates view updates (in a simplified way, compared to the existing functions that push view updates), and sends them to the paired view replicas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Duarte Nunes	67dd3e6e5d	column_family: Allow synchronizing with in-progress writes This patch adds a mechanism to class column_family through which we can synchronize with in-progress writes. This is useful for code that, after some modification, needs to ensure that new writes will see it before it can proceed. In particular, this will be used by the view building code, which needs to wait until the in-progress writes, which may have missed that there is now a view, are observable to the view building code. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Duarte Nunes	9b9ba525f7	database: Add get_views() function Returns all the schemas that are views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Glauber Costa	9188059427	database: group statements in their own scheduling group When we introduced the CPU scheduler, we also introduced a group for commitlog - but never used it. There is also doubtful value in separating reads from writes, since they are often part of the same workload. To accommodate that, let's rename the query group to "statement" (query is not incorrect, just confusing), and move the write path, currently ungrouped, inside it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-20 16:58:36 -04:00
Botond Dénes	c0009750c3	Add unit test for resource based cache eviction Specifically for the reader-permit based eviction. This test lives in a separate executable as it uses with_cql_test_env() and thus needs a main() of its own.	2018-03-13 16:20:50 +02:00
Botond Dénes	d5bcadcfda	Time-based cache eviction Cached queriers should not sit in the cache indefinitely, otherwise abandoned reads would cause excess and unnecessary resource usage. Attach an expiry timer to each cache entry which evicts it after the TTL passes.	2018-03-13 10:34:34 +02:00
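The idea of time-based eviction can be sketched as follows. The commit attaches an expiry timer to each entry; this sketch swaps in a simpler technique, a lazy sweep on lookup against an injectable clock, so that it stays deterministic. The class `ExpiringCache` and its methods are hypothetical and do not mirror the querier_cache interface:

```python
import time

class ExpiringCache:
    """Hypothetical TTL cache: entries are dropped once their TTL has passed."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock      # injectable so the example is deterministic
        self._entries = {}       # key -> (value, expiry time)

    def insert(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def evict_expired(self):
        """Remove every entry whose expiry time has been reached."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._entries.items() if exp <= now]
        for k in expired:
            del self._entries[k]
        return len(expired)

    def lookup(self, key):
        self.evict_expired()     # lazy sweep instead of a per-entry timer
        entry = self._entries.get(key)
        return entry[0] if entry else None

# Usage with a fake clock: the entry is gone once the TTL has elapsed.
now = [0.0]
cache = ExpiringCache(ttl_s=10, clock=lambda: now[0])
cache.insert("q1", "saved querier")
before = cache.lookup("q1")
now[0] = 11.0
after = cache.lookup("q1")
print(before, after)
```

A per-entry timer, as in the commit, frees resources as soon as the TTL passes, whereas the lazy sweep here only frees them on the next lookup.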
Botond Dénes	ff808d9ce6	Save and restore queriers in mutation_query() and data_query() Use the querier_cache (represented by the passed-in querier_cache_context) object to lookup saved queriers at the start of the page and save them at the end of it if it is likely that there will be more page requests.	2018-03-13 10:34:34 +02:00
Botond Dénes	1259031af3	Use the reader_concurrency_semaphore to limit reader concurrency	2018-03-08 14:12:12 +02:00
Botond Dénes	d5bb8a47fc	mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh In preparation to reader_concurrency_semaphore being added to the file. The reader_resource_tracker is really only a helper class for reader_concurrency_semaphore so the latter is better suited to provide the name of the file.	2018-03-08 10:29:16 +02:00
Raphael S. Carvalho	aa75684ee7	sstables: Warn when an extra-large partition is written Based on https://issues.apache.org/jira/browse/CASSANDRA-9643 With the compaction_large_partition_warning_threshold_mb option set to 1, an example output follows: WARN 2018-02-22 19:52:11,029 [shard 0] sstable - Writing large row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445} (1276758 bytes) Fixes #2209. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>	2018-03-07 15:49:46 +00:00
Duarte Nunes	76e6423910	database: Truncate views when truncating the base table Fixes #3200 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180211124218.41373-1-duarte@scylladb.com>	2018-02-27 15:54:43 +02:00
Avi Kivity	432268f582	Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael "The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables; regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. The manager was needed for orchestrating deletion of shared sstables across shards. It brings extra complexity that's no longer needed, and it was also overloading shard 0, but the latter could have been fixed. Tests: - unit: release mode - dtest: resharding_test.py" * 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla: Remove SSTable's atomic deletion manager Stop using SSTable's atomic deletion manager database: split column_family::rebuild_sstable_list	2018-02-08 19:10:16 +02:00
Duarte Nunes	456b678e0b	database.hh: Fix data query stage argument type Fixes a merge gone wrong. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180208163338.25238-1-duarte@scylladb.com>	2018-02-08 16:35:10 +00:00
Avi Kivity	404172652e	Merge "Use xxHash for digest instead of MD5" from Duarte "This series changes digest calculation to use a faster algorithm (xxHash) and to also cache calculated cell hashes that can be kept in memory to speed up subsequent digest requests. The MD5 hash function has proved to be slow for large cell values: size = 256; elapsed = 4us size = 512; elapsed = 8us size = 1024; elapsed = 14us size = 2048; elapsed = 21us size = 4096; elapsed = 33us size = 8192; elapsed = 51us size = 16384; elapsed = 86us size = 32768; elapsed = 150us size = 65536; elapsed = 278us size = 131072; elapsed = 531us size = 262144; elapsed = 1032us size = 524288; elapsed = 2026us size = 1048576; elapsed = 4004us size = 2097152; elapsed = 7943us size = 4194304; elapsed = 15800us size = 8388608; elapsed = 31731us size = 16777216; elapsed = 64681us size = 33554432; elapsed = 130752us size = 67108864; elapsed = 263154us The xxHash is a non-cryptographic, 64bit (there's work in progress on the 128 version) hash that can be used to replace MD5. It performs much better: size = 256; elapsed = 2us size = 512; elapsed = 1us size = 1024; elapsed = 1us size = 2048; elapsed = 2us size = 4096; elapsed = 2us size = 8192; elapsed = 3us size = 16384; elapsed = 5us size = 32768; elapsed = 8us size = 65536; elapsed = 14us size = 131072; elapsed = 28us size = 262144; elapsed = 59us size = 524288; elapsed = 116us size = 1048576; elapsed = 226us size = 2097152; elapsed = 456us size = 4194304; elapsed = 935us size = 8388608; elapsed = 1848us size = 16777216; elapsed = 4723us size = 33554432; elapsed = 10507us size = 67108864; elapsed = 21622us Performance was tested using a 3 node cluster with 1 cpu and 8GB, and with the following cassandra-stress loaders. Measurements are for the read workload. 
sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 32699 [READ:32699] partition rate : 32699 [READ:32699] row rate : 32699 [READ:32699] latency mean : 3.0 [READ:3.0] latency median : 3.0 [READ:3.0] latency 95th percentile : 3.9 [READ:3.9] latency 99th percentile : 4.5 [READ:4.5] latency 99.9th percentile : 6.6 [READ:6.6] latency max : 24.0 [READ:24.0] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:05 END md5: Results: op rate : 25241 [READ:25241] partition rate : 25241 [READ:25241] row rate : 25241 [READ:25241] latency mean : 3.9 [READ:3.9] latency median : 3.9 [READ:3.9] latency 95th percentile : 5.1 [READ:5.1] latency 99th percentile : 5.8 [READ:5.8] latency 99.9th percentile : 8.0 [READ:8.0] latency max : 24.8 [READ:24.8] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:06:36 END This translates into a 21% improvement for this workload. 
Bigger cell values were also tested: sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 19964 [READ:19964] partition rate : 19964 [READ:19964] row rate : 19964 [READ:19964] latency mean : 4.9 [READ:4.9] latency median : 4.6 [READ:4.6] latency 95th percentile : 7.2 [READ:7.2] latency 99th percentile : 11.5 [READ:11.5] latency 99.9th percentile : 13.6 [READ:13.6] latency max : 29.2 [READ:29.2] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:08:20 END md5: Results: op rate : 12773 [READ:12773] partition rate : 12773 [READ:12773] row rate : 12773 [READ:12773] latency mean : 7.7 [READ:7.7] latency median : 7.3 [READ:7.3] latency 95th percentile : 10.2 [READ:10.2] latency 99th percentile : 16.8 [READ:16.8] latency 99.9th percentile : 19.2 [READ:19.2] latency max : 71.5 [READ:71.5] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:13:02 END This translates into a 37% improvement for this workload. Fixes #2884 Tests: unit-tests (release), dtests (smp=2) Note: dtests are kinda broken in master (> 30 failures), so take the tests tag with a grain of himalayan salt." 
* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits) tests/row_cache_test: Test hash caching tests/memtable_test: Test hash caching tests/mutation_test: Use xxHash instead of MD5 for some tests tests/mutation_test: Test xx_hasher alongside md5_hasher schema: Remove unneeded include service/storage_proxy: Enable hash caching service/storage_service: Add and use xxhash feature message/messaging_service: Specify algorithm when requesting digest storage_proxy: Extract decision about digest algorithm to use cache_flat_mutation_reader: Pre-calculate cell hash partition_snapshot_reader: Pre-calculate cell hash query::partition_slice: Add option to specify when digest is requested row: Use cached hash for hash calculation mutation_partition: Replace hash_row_slice with appending_hash mutation_partition: Allow caching cell hashes mutation_partition: Force vector_storage internal storage size test.py: Increase memory for row_cache_stress_test atomic_cell_hash: Add specialization for atomic_cell_or_collection query-result: Use digester instead of md5_hasher range_tombstone: Replace feed_hash() member function with appending_hash ...	2018-02-08 18:24:58 +02:00
Raphael S. Carvalho	b78881c0e9	database: split column_family::rebuild_sstable_list The motivation is that resharding will not want the code that is specific to regular compaction after atomic deletion is removed. Resharding will eventually only need to replace old tables with new ones, and it will be in charge of deletion of old tables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:18:18 -02:00
Glauber Costa	4272279bbb	controllers: unify the I/O and CPU controllers We have had so far an I/O controller, for compactions and memtables, and a CPU controller, for memtables only -- since the scheduling was still quota-based. Now that the CPU scheduler is fully functional, it is time to do away with the differences and integrate them both into one. We now have a memtable controller and a compaction controller, and they control both CPU and I/O. In the future, we may want to control processes that don't do one of them, like cache updates. If that ever happens, we'll try to make controlling one of them optional. But for now, since the I/O and CPU controllers for our main two processes would look exactly the same we should integrate them. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:30 -05:00
Avi Kivity	ce94e6deb7	database: place data_query execution stage into scheduling_group Because execution stages defer and batch processing of the function they run, they escape their fiber's context and therefore the scheduling group. Fix (for data_query) by initializing the execution_stage with the query scheduling_group. To do that we have to move the execution stage into the database object, so it has access to the scheduling group during initialization.	2018-02-07 17:19:29 -05:00
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough so that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again 3) As per Tomek's suggestion, now the submission of compactions themselves is run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00
Duarte Nunes	6b4b429883	query-result: Introduce class result_options Introduce class result_options to carry result options through the request pipeline, which at this point mean the result type and the digest algorithm. This class allows us to encapsulate the concrete digest algorithm to use. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 00:22:50 +00:00
Nadav Har'El	2ea1922a4d	Materialized views: serialize read-modify-update of base table Before this patch, our Materialized Views implementation can produce incorrect results when given concurrent updates of the same base-table row. Such concurrent updates may result, in certain cases, in two different rows added to the view table, instead of just one with the latest data. In this patch we add locking which serializes the two conflicting updates, and solves this problem. The locking for a single base-table column_family is implemented by the row_locker class introduced in a previous patch. A long comment in the code of this patch explains in more detail why this locking is needed, when, and what types of locks are needed: We sometimes need to lock a single clustering row, sometimes an entire partition, sometimes an exclusive lock and sometimes a shared lock. Fixes #3168 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-01-30 16:21:43 +02:00
Botond Dénes	b7d902a9e9	database: remove unused concurrency config members Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b257c7e9d403c55aaec34fc48863c18f9c9ae11a.1517314398.git.bdenes@scylladb.com>	2018-01-30 14:21:25 +02:00
Glauber Costa	0c00667206	streaming big: keep write_monitor alive until the end of flush After the new compaction controller code, the monitor has to be kept alive until the sstable is added to the SSTable set. This is correctly handled for all the writers, except the streaming big. That flusher is a bit confusing, as it builds an sstable list first and only later adds the elements in the list to the sstable set. The monitors are destroyed at the end of phase 1, so we will SIGSEGV later when calling add_sstable(). The fix for this is to make sure the lifetime of the monitors is tied to the lifetime of the sstables being handled by the big streaming flush process. Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test Fixes #3131 Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180118202230.17107-1-glauber@scylladb.com>	2018-01-21 14:09:43 +02:00
Piotr Jastrzebski	4c74b8c7e7	Migrate materialized views to flat_mutation_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-18 07:32:35 +01:00
Glauber Costa	08a0c3714c	allow request-specific read timeouts in storage proxy reads Timeouts are a global property. However, for tables in keyspaces like the system keyspace, we don't want to uphold that timeout--in fact, we want no timeout there at all. We already apply such configuration for requests waiting in the queued sstable queue: system keyspace requests won't be removed. However, the storage proxy will insert its own timeouts in those requests, causing them to fail. This patch changes the storage proxy read layer so that the timeout is applied based on the column family configuration, which is in turn inherited from the keyspace configuration. This matches our usual way of passing db parameters down. In terms of implementation, we can either move the timeout inside the abstract read executor or keep it external. The former is a bit cleaner, but the latter has the nice property that all executors generated will share the exact same timeout point. In this patch, we chose the latter. We are also careful to propagate the timeout information to the replica. So even if we are talking about the local replica, when we add the request to the concurrency queue, we will do it in accordance with the timeout specified by the storage proxy layer. After this patch, Scylla is able to start just fine with very low timeouts--since read timeouts in the system keyspace are now ignored. Fixes #2462 Implementation notes, and general comments about open discussion in 2462: * Because we are not bypassing the timeout, just setting it high enough, I consider the concerns about the batchlog moot: if we fail for any other reason that will be propagated. Last case, because the timeout is per-CF, we could do what we do for the dirty memory manager and move the batchlog alone to use a different timeout setting. * Storage proxy likes specifying its timeouts as a time_point, whereas when we get low enough as to deal with the read_concurrency_config, we are talking about deltas. 
So at some point we need to convert time_points to durations. We do that in the database query functions. v2: - use per-request instead of per-table timeouts. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	3c9eeea4cf	restricted_mutation_reader: don't pass timeouts through the config structure This patch enables passing a timeout to the restricted_mutation_reader through the read path interface -- using fill_buffer and friends. This will serve as a basis for having per-request timeouts. The config structure still has a timeout, but that is so far only used to actually pass the value to the query interface. Once that starts coming from the storage proxy layer (next patch) we will remove it. The query callers are patched so that we pass the timeout down. We patch the callers in database.cc, but leave the streaming ones alone. That can be safely done because the default for the query path is now no_timeout, and that is what the streaming code wants. So there is no need to complicate the interface to allow for passing a timeout that we intend to disable. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:21 -05:00
Glauber Costa	80c4a211d8	consolidate timeout_clock At the moment, various different subsystems use their different ideas of what a timeout_clock is. This makes it a bit harder to pass timeouts between them because although most are actually a lowres_clock, that is not guaranteed to be the case. As a matter of fact, the timeout for restricted reads is expressed as nanoseconds, which is not a valid duration in the lowres_clock. As a first step towards fixing this, we'll consolidate all of the existing timeout_clocks in one, now called db::timeout_clock. Other things that tend to be expressed in terms of that clock--like the fact that the maximum time_point means no timeout and a semaphore that wait()s with that resolution are also moved to the common header. In the upcoming patch we will fix the restricted reader timeouts to be expressed in terms of the new timeout_clock. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Glauber Costa	40c428dc19	database: delete unused function no in-tree users. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Glauber Costa	4f1b875784	database: add a controller for I/O on memtable flushes. The algorithm and principle of operation are the same as for the CPU controller. It is, however, always enabled and we will operate on I/O shares. I/O-bound workloads are expected to hit the maximum once virtual dirty fills up and stay there while the load is steady. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	244c564aac	compaction: adjust shares for compactions Compactions can be a heavy disk user and the I/O scheduler can always guarantee that it uses its fair share of disk. Such fair share can, however, be a lot more than what compaction actually needs. This patch draws on the controllers infrastructure to adjust the I/O shares that the compaction class will get so that compaction bandwidth is dynamically adjusted. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:58:57 -05:00
Glauber Costa	1671d9c433	factor out some of the controller code The control algorithm we are using for memtables has proven itself quite successful. We will very likely use the same for other processes, like compactions. Make the code a bit more generic, so that a new controller only has to set the desired parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-03 19:56:54 -05:00
Glauber Costa	3bd6bceaf0	sstables: add read_monitor_generator Passing the read monitor down to the sstable readers is tricky. The points of interest, like compaction, are usually very far from the interfaces that register the monitor, like read_rows. Between the two, there is usually a mutation_reader, which is and ought to be totally unaware of the read monitor: technically, a mutation_reader may not even know it is backed by sstables. The solution is to create a read_monitor_generator, that can be passed from the upper layers, like compaction, to the layers that are actually making the decision of which sstables to create readers for. Note that we don't need an equivalent piece of infrastructure for writes, because writes don't happen through hidden layers and have all the information they need to initialize their monitors. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Paweł Dziepak	24026a0c7d	db: fully convert incremental_reader_selector to flat readers	2017-12-13 12:01:03 +00:00
Paweł Dziepak	73b3d02cc0	db: make make_range_sstable_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	8b3c3fc832	db: make column_family::make_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	e12959616c	db: make column_family::make_sstable_reader() return a flat reader	2017-12-13 12:01:03 +00:00


632 Commits