Commit Graph

800 Commits

Author SHA1 Message Date
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition, and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.
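The last_end rule described above can be sketched as follows. This is a minimal illustration with invented names (`forwarding`, `choose_last_end`), not the actual sstable::data_consume_rows() signature:

```cpp
#include <cstdint>

// Hypothetical sketch of the rule in the paragraph above, not the real
// Scylla API: mutation_reader::forwarding decides how far the disk
// stream may extend.
enum class forwarding { no, yes };

// With forwarding::no, fast_forward_to() is forbidden, so nothing past
// the range's end can ever be needed: open the stream only up to it.
// With forwarding::yes, keep the pre-patch behavior and open up to EOF.
uint64_t choose_last_end(uint64_t range_end, uint64_t file_end, forwarding fwd) {
    return fwd == forwarding::no ? range_end : file_end;
}
```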

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
help if the overall range is always very large, in which case avoiding
over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Avi Kivity
9cf6db3de5 Merge 2017-06-15 19:11:07 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition, and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
help if the overall range is always very large, in which case avoiding
over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Avi Kivity
da24bd7c34 Merge "Balance read requests according to CF's cache hit ratio" from Gleb
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.

This patch series keeps track of per-CF cache hit ratios and uses this
information to choose the best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their caches, but
less frequently."

* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: load balance read requests according to cache hit rates
  choose extra replica for speculation in filter_for_query()
  consistency_level: drop filter_for_query_dc_local function
  database: reset node's hit rate information on connection drop
  messaging_service: connection drop notifier
  Store cluster wide cache hit statistics in CF
  messaging_service: return cache hit ratio as part of data read
  Distribute cache temperature over gossiper.
  periodically calculate avg cache hit rate between all shards
  database: introduce cache_temperature class
  Rename load_broadcaster.cc to misc_services.cc
  storage_proxy: use db::count_local_endpoints function instead open code it
2017-06-15 14:33:08 +03:00
Calle Wilund
525730e135 database: Fix assert in truncate to handle empty memtables+sstables
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and will
get no resulting truncation replay position.
Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>
2017-06-14 11:21:21 +02:00
Gleb Natapov
ca812a8ea0 database: reset node's hit rate information on connection drop
A node may go down, and after it restarts its cache hit rate info will
be stale, so it can be overwhelmed with traffic until new and up-to-date
cache hit rate information arrives. Solve this by dropping the node's
information on connection reset; that is more accurate than relying on
gossip, which may be slow and miss a reboot of the node.
2017-06-13 09:57:14 +03:00
Gleb Natapov
0e4d5bc2f3 Store cluster wide cache hit statistics in CF 2017-06-13 09:57:14 +03:00
Gleb Natapov
69c5526301 messaging_service: return cache hit ratio as part of data read 2017-06-13 09:57:14 +03:00
Gleb Natapov
991ec4a16c periodically calculate avg cache hit rate between all shards
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
2017-06-13 09:57:14 +03:00
Calle Wilund
18806989b6 database: remove hard rp ordering requirement, set low rp mark on
truncate

With commitlog keeping a use-count per CF id, we can ease the ordering
restriction on replay positions. Previously we required that all
added mutations have a position > previously flushed. However, if
we accept that replay must now cover all data, then by keeping track
per CF of the highest RP ever entered, we can instead just set a
low mark on truncation, since this is the only remaining hard
RP divider.
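The low-mark idea can be sketched like this (simplified and hypothetical: real replay positions carry more fields, and `should_replay` is an invented name):

```cpp
#include <cstdint>

// Sketch of the low-mark check described above, not the real commitlog
// code: after truncate, record the highest RP ever entered for the CF
// as its low mark, and have replay discard anything at or below it.
struct replay_position {
    uint64_t segment;
    uint32_t pos;
    bool operator<=(const replay_position& o) const {
        return segment < o.segment || (segment == o.segment && pos <= o.pos);
    }
};

bool should_replay(const replay_position& rp, const replay_position& low_mark) {
    return !(rp <= low_mark);
}
```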
2017-06-07 12:07:01 +00:00
Calle Wilund
2913241df1 memtable/commitlog: Change bookkeeping to track individual segments
Use a per-CF-id reference count instead, and use handles as the result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.

Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
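The handle mechanism can be sketched as an RAII wrapper. This is a minimal illustration of the idea only (`segment_tracker` and `demo` are invented; the real rp_handle works against the commitlog's segment bookkeeping):

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Sketch of the handle idea described above, not the real commitlog
// API: each add bumps a per-segment use count; the handle drops it on
// destruction unless explicitly released (or stored, e.g. in rp_set).
struct segment_tracker {
    std::map<uint64_t, int> use_count; // segment id -> live references
};

class rp_handle {
    segment_tracker* _t = nullptr;
    uint64_t _segment = 0;
public:
    rp_handle(segment_tracker& t, uint64_t segment) : _t(&t), _segment(segment) {
        ++_t->use_count[_segment];
    }
    rp_handle(rp_handle&& o) noexcept : _t(o._t), _segment(o._segment) { o._t = nullptr; }
    rp_handle(const rp_handle&) = delete;
    void release() {
        if (_t) { --_t->use_count[_segment]; _t = nullptr; }
    }
    ~rp_handle() { release(); }
};

// Demonstration helper: take and drop a handle, returning the count
// observed while the handle was held and the count after destruction.
std::pair<int, int> demo() {
    segment_tracker t;
    int held;
    { rp_handle h(t, 7); held = t.use_count[7]; }
    return {held, t.use_count[7]};
}
```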
2017-06-07 12:07:01 +00:00
Raphael S. Carvalho
3b5ad23532 db: fix computation of live disk usage stat after compaction
sstable::data_size() is used by rebuild_statistics(), but it only
returns the uncompressed data size, while the stat expects the actual
disk space used by all components.
Boot uses add_sstable(), which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
2017-05-28 10:38:32 +03:00
Tomasz Grabiec
de70d942a9 memtable: Decouple from sstable
We can make the dependency more abstract by using mutation_source
instead of an sstable.

Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:21 +03:00
Raphael S. Carvalho
b7e1575ad4 db: remove partial sstable created by memtable flush which failed
Partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
2017-05-25 11:50:02 +03:00
Avi Kivity
fd0e1eb1e2 Merge "Fixes for mutation algebra" from Tomasz
"Enforces commutativity of addition:

 m1 + m2 == m2 + m1

and consistency of difference and addition with equality:

 m1 + (m2 - m1) == m1 + m2"

* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
  mutation: Make compare_*_for_merge() consistent with equals()
  tests: mutation: Improve assertion failure message
  tests: Use default equality in test_mutation_diff_with_random_generator
  mutation: Make counter cell difference consistent with apply
  tests: range_tombstone_list_test: Improve error message
  tests: range_tombstone_list: Check adjacent range merging
  range_tombstone_list: Merge adjacent range tombstones in apply()
  tests: mutation: Check commutativity of mutation addition
  range_tombstone_list: Avoid violating set invariant
  range_tombstone_list: Make tombstone merging commutative
  range_tombstone_list: Add erase() operation to the reverter
  range_tombstone_list: Make all undo operations ordered relative to each other
  utils: Extract to_boost_visitor() to a separate header
  allocating_strategy: Introduce alloc_strategy_unique_ptr<>
2017-05-23 15:20:38 +03:00
Tomasz Grabiec
804f46f684 mutation: Make compare_*_for_merge() consistent with equals()
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
2017-05-23 13:35:03 +02:00
Raphael S. Carvalho
4b4a1883aa refresh: do not use default priority for loading new sstables
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.

Fixes #1859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
2017-05-22 19:03:17 +03:00
Raphael S. Carvalho
28206993a4 database: fix indentation of distributed_loader::open_sstable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:52 -03:00
Raphael S. Carvalho
a4e414cb3b database: reduce memory requirement to load sstables
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple(), which uses a 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.

Because sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. The higher the number of sstables, the higher the memory overhead.

To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each with ~300kb of metadata, the memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d

To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.
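A back-of-envelope for why the cap helps, using the 128k buffer figure from the commit (the two-buffer-per-load factor and the function names are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative arithmetic only: each in-flight load may transiently
// pin roughly two 128k buffers (read_simple plus read_maybe_eof, per
// the analysis above), so capping parallelism bounds the peak
// transient memory by the cap rather than by the sstable count.
constexpr std::size_t per_load_overhead = 2 * 128 * 1024;

std::size_t peak_transient_bytes(std::size_t n_sstables, std::size_t parallelism) {
    return std::min(n_sstables, parallelism) * per_load_overhead;
}
```

With the suggested limit of 3, the transient overhead stays constant no matter how many sstables a boot has to load.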

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:51 -03:00
Duarte Nunes
983af595e9 database: Read existing base mutations
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Avi Kivity
f5dae826ce Merge "Migrate schema tables to v3 format" from Calle
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrival.

Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if it encounters such a feature in use.

Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.

Tested against dtests. Note that patches for dtest + ccm
will follow."

* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
  legacy_schema_migrator: Actually truncate legacy schema tables on finish
  database: Extract "remove" from "drop_columnfamily"
  v3 schema test fixes
  thrift: Update CQL mapping of static CFs
  schema_tables: Use v3 schema tables and formats
  type_parser: Origin expects empty string -> bytes_type
  cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
  types_test: v3 style schemas enforce explicit "frozen" in tuples/ut:s
  cql3_type: v3 to_string
  cql_types: Introduce cql3_type::empty and associate with empty data_type
  schema: rename column accessors to be in line with origin
  schema: Add "is_static_compact_table"
  schema_builder: Add helper to generate unique column names akin origin
  schema: Add utility functions for static columns
  schema: Use heterogeneous comparator for columns bounds
  cql3_type_parser: Resolve from cql3 names/expressions
  cql3_type: Add "prepare_interal" and "references_user_type"
  cql3::cql3_type: Add prepare_internal path using only "local" holders
  cql3_type: Add virtual destructor.
  database/main: encapsulate system CF dir touching
  ...
2017-05-17 11:25:52 +03:00
Asias He
0abfe39d8f database: Log compaction strategy setting on shard 0 only
The compaction strategy is per node not per shard. Do not duplicate the
same log on all shards.

Message-Id: <1494835519.git.asias@scylladb.com>
2017-05-17 11:17:41 +03:00
Gleb Natapov
c7ad3b9959 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Removing a lot of
sstables in parallel may start a lot of threads, each of which takes
128k for its stack. There is not much benefit in running deletion in
parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
2017-05-16 15:06:10 +03:00
Calle Wilund
3514123677 database: Extract "remove" from "drop_columnfamily" 2017-05-10 16:44:48 +00:00
Calle Wilund
6c8b5fc09d schema_tables: Use v3 schema tables and formats
Switches system/schema_* for system_schema/*, and updates schema/schema_builder
and their uses to hold/expect v3 style info (i.e. types & dropped).
2017-05-10 16:44:48 +00:00
Calle Wilund
48ddcbb77b database/main: encapsulate system CF dir touching 2017-05-10 16:44:47 +00:00
Calle Wilund
2e1c23f2f2 database: Relax rp ordering check to allow non-commitlog mutations
Allow replay to come post certain operations. Such as schema migration
2017-05-09 13:48:55 +00:00
Calle Wilund
27fdc5cfef schema_tables/system_tables: Add v3 tables to "ALL" and handle in init
I.e. deal with more than one keyspace in system_keyspace::make
2017-05-09 13:48:55 +00:00
Calle Wilund
4378dca6e1 schema_tables: Hide/abstract schema keyspace name 2017-05-09 13:48:55 +00:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cgl.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Pekka Enberg
f26b8d7afb database: Add find_indexed_table() to database class 2017-05-04 14:59:12 +03:00
Pekka Enberg
930fa79aff database: Add get_available_index_name() to database class 2017-05-04 14:59:11 +03:00
Pekka Enberg
c6e7d4484a database: Make existing_index_names() per-keyspace operation 2017-05-04 14:59:11 +03:00
Pekka Enberg
8c729f0f5f database: Rewrite existing_index_names() to use new index metadata 2017-05-04 14:59:11 +03:00
Paweł Dziepak
24f4dcf9e4 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
2017-04-30 19:17:22 +03:00
Raphael S. Carvalho
8bae413bcf database: fix format msg for sprint
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224920.16607-1-raphaelsc@scylladb.com>
2017-04-26 17:18:58 +03:00
Raphael S. Carvalho
662fe77c11 database: kill column_family::start_rewrite
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:33 -03:00
Raphael S. Carvalho
43ac19eb52 database: wire up new resharding algorithm
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:31 -03:00
Raphael S. Carvalho
cf45333588 database: implement new sstable resharding algorithm
NOTE: it's not wired yet.

Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.

Another benefit is that resharding will no longer make the number of
sstables grow considerably. A full-sized leveled sstable is usually
160MB, so after resharding one, we could have N files of 160MB/N.
Now, leveled strategy will help resharding: N adjacent sstables of the
same level will be resharded together, so we'll end up with N files of
(N*160MB)/N = 160MB each.
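The size arithmetic above can be written out explicitly (sizes in MB; 160MB is the full-sized leveled sstable figure from the commit, and `output_file_mb` is an invented name):

```cpp
#include <cstddef>

// The arithmetic from the paragraph above: resharding `inputs`
// full-sized same-level sstables across `shards` shards produces
// `shards` output files, each of this size in MB.
constexpr std::size_t full_sstable_mb = 160;

std::size_t output_file_mb(std::size_t inputs, std::size_t shards) {
    return inputs * full_sstable_mb / shards;
}
```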

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:30 -03:00
Raphael S. Carvalho
6513252e91 database: introduce function to replace new sstables by their ancestors
When resharding, we're working with sstables from all shards. So let's
say we're done with resharding sstable A, which belongs to shards 0 and
1, and sstable B, which belongs to shards 1 and 2. SSTables were
generated for shards 0, 1, and 2, so shards 0, 1, and 2 need to load the
new sstables and remove the ancestors. Shard 1, for example, will remove
sstables A and B (ancestors) and add the new one. That's where this new
function comes in: we'll forward new sstables to their target shards
using foreign sstable open info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:27 -03:00
Raphael S. Carvalho
c44a2319e6 prevent regular compaction from choosing shared sstables
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:26 -03:00
Avi Kivity
68f0df12ee Merge "Optimize reads with clustering restrictions" from Tomasz
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.

Some highlights:

One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.

Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
from the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after determining the key the reader will be put behind
other readers, which will exhaust the query limit first.

Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."

* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
  sstables: Remove unused code
  sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
  sstables: mutation_reader: Use index_reader for single-partition reads
  sstables: mutation_reader: Add trace-level logging
  sstables: mutation_reader: Move partition reading code to sstable_data_source
  sstables: mutation_reader: Move definitions out of the class body
  sstables: Move binary_search() to a header
  database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
  sstables: index_reader: Introduce advance_to_next_partition()
  sstables: index_reader: Introduce advance_and_check_if_present()
  sstables: index_reader: Introduce advance_past()
  sstables: index_reader: Make copyable
  sstables: index_reader: Optimize advancing to extreme positions
  sstables: index_reader: Keep two last pages alive
  dht: ring_position_view: Add key getter
  dht: ring_position_view: Add constructor and factory from ring_position_view
  sstables: mutation_reader: Advance to next partition using index in some cases
  sstables: index_reader: Expose access to partition key and tombstone
  sstables: index_reader: Introduce promoted_index_view
  sstables: mutation_reader: Move _index_in_current to sstable_data_source
  ...
2017-04-20 13:58:37 +03:00
Tomasz Grabiec
4742008b70 sstables: mutation_reader: Use index_reader for single-partition reads
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().

perf_fast_forward, rows: 1000000, value size: 100

Before:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.002182         2        916      3        152       2       0        0        1        1  88.1%

After:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.000758         2       2639      3        152       2       0        0        1        1  48.6%

This is also a cleanup, a step towards converting all code to use the
index_reader.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
bedd0ab6f9 database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating 2017-04-20 10:54:38 +02:00
Raphael S. Carvalho
3286f7aaa6 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor or major compaction
at a given time. The only issue is the number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.

Fixes #1156.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
2017-04-19 15:44:21 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Raphael S. Carvalho
7deeffc953 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when a node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.
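A rough model of the disk-space argument above (the sizes are arbitrary illustrative numbers; the helper names are invented): in the worst case every sstable is rewritten almost unchanged, so parallel cleanup can hold old+new copies of everything at once, while serialized cleanup holds at most one extra rewritten sstable at a time.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Extra space needed if all sstables are rewritten concurrently:
// the sum of all their sizes (old and new copies coexist).
std::size_t parallel_extra(const std::vector<std::size_t>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), std::size_t{0});
}

// Extra space needed with serialized cleanup: only the largest
// single sstable's rewrite is ever in flight.
std::size_t serial_extra(const std::vector<std::size_t>& sizes) {
    return sizes.empty() ? 0 : *std::max_element(sizes.begin(), sizes.end());
}
```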

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
2017-03-19 12:33:03 +02:00
Duarte Nunes
876a514743 database: Upgrade mutation to current schema to push view updates
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that it matches the most
up-to-date views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 18:15:27 +01:00
Duarte Nunes
bfb8a3c172 materialized views: Replace db::view::view class
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.

To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.

The changes in the patch are thus:

1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.

All changes are structural and don't affect current behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:50:05 +01:00