scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 22:43:15 +00:00

Author	SHA1	Message	Date
Vlad Zolotarov	6839a50677	db::commitlog: entry_writer add a virtual destructor Add a virtual destructor for a base class commitlog::entry_writer. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1498511180-18391-1-git-send-email-vladz@scylladb.com>	2017-06-27 10:17:10 +03:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit that reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: 
simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	c4ae2206c7	messaging: respect inter_dc_tcp_nodelay configuration parameter We respect it partially (client side only) for now. Fixes #6. Message-Id: <20170623172048.23103-1-avi@scylladb.com>	2017-06-24 21:49:27 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Duarte Nunes	4ef25e8e38	db/schema_tables: Add note to make_update_view_mutations Document that a new view schema passed to make_update_view_mutations() might be based on base schema that hasn't yet been loaded. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170618200558.96036-1-duarte@scylladb.com>	2017-06-23 15:24:35 +02:00
Avi Kivity	f0b20be14d	Revert "system_keyspace: Make sure "system" is written to keyspaces (visible)" This reverts commit `89ef69c4b3`. Prevents nodes from joining the cluster.	2017-06-21 16:58:04 +03:00
Calle Wilund	89ef69c4b3	system_keyspace: Make sure "system" is written to keyspaces (visible) Fixes #2514 Bug in schema version 3 update: We failed to write "system" to the schema tables. Only visible on an empty instance of course. Message-Id: <1497966982-10044-1-git-send-email-calle@scylladb.com>	2017-06-20 20:59:47 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	58fd3dd006	Merge "cql3: Quote type name when needed" from Duarte "This patch set ensures we quote the name of a UDT when it contains characters that may cause parsing by the CQL parser to fail. Fixes #2491" * 'cql3-quote-type/v1' of https://github.com/duarten/scylla: cql3/util: Make maybe_quote() take argument by const reference cql3/cql3_type: Quote UDT name if needed schema: Lift maybe_quote() into cql3/util	2017-06-18 17:59:47 +03:00
Duarte Nunes	b2c5aca4cf	db/schema_tables: View mutations shouldn't always include base ones When making the schema mutations for a view update, we should only include the base table schema mutations (in case the target node doesn't contain them) when the view is being directly updated. When it is being updated as a side effect of updating the base table, then including the base schema mutations will hide the actual changes being performed on the base. Fixes #2500 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>	2017-06-18 16:29:59 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Duarte Nunes	4886b7ed5e	schema: Lift maybe_quote() into cql3/util It's a more natural place given its current and future usages. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-15 19:55:52 +00:00
Avi Kivity	9cf6db3de5	Merge	2017-06-15 19:11:07 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Avi Kivity	da24bd7c34	Merge "Balance read requests according to CF's cache hit ratio" from Gleb "During read query with CL<ALL not all replicas are contacted. It is possible for some replicas to cache less data for some CF's (for instance because of node restart), so the replica choice may have a big impact on request's completion latency and on amount of work it generates in a cluster. This patch series keep track of per CF cached hit ratio and uses this information to choose best replicas for a request. Nodes with lower hit ratios are still contacted in order to populate their cache, but less frequently." * 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev: storage_proxy: load balance read requests according to cache hit rates choose extra replica for speculation in filter_for_query() consistency_level: drop filter_for_query_dc_local function database: reset node's hit rate information on connection drop messaging_service: connection drop notifier Store cluster wide cache hit statistics in CF messaging_service: return cache hit ratio as part of data read Distribute cache temperature over gossiper. periodically calculate avg cache hit rate between all shards database: introduce cache_temperature class Rename load_broadcaster.cc to misc_services.cc storage_proxy: use db::count_local_endpoints function instead open code it	2017-06-15 14:33:08 +03:00
Gleb Natapov	c7a59ab7ff	do not calculate serialized size of commitlog_entry_writer before final format is known Currently commitlog_entry_writer constructor calculates serialized size before it is known if a schema should be included into the entry. The result is never used since it is recalculated when schema information is supplied. The patch removes needless calculation. Message-Id: <20170614114607.GA21915@scylladb.com>	2017-06-14 14:53:07 +03:00
Gleb Natapov	87094849fa	storage_proxy: load balance read requests according to cache hit rates This patch makes storage proxy choose replicas to read from based on their cache hit rates. Replicas with higher cache hit rates will see more requests while replicas with lower hit rates will see fewer. Local node has a special bonus and will get more requests even if another node has slightly higher cache hit rate (same goes for local vs remote DC), but after the patch it is no longer guaranteed that a coordinator node will be chosen as a replica for the read (if the feature is enabled).	2017-06-13 09:57:14 +03:00
Gleb Natapov	bc8aa1b4ee	choose extra replica for speculation in filter_for_query() Currently storage proxy has to loop over remaining replicas to search for suitable extra replica, but doing it in filter_for_query() is extremely easy, so do it there instead.	2017-06-13 09:57:14 +03:00
Gleb Natapov	8437ea3b99	consistency_level: drop filter_for_query_dc_local function Merge filter_for_query_dc_local() functionality into filter_for_query(). This is more efficient since filter_for_query_dc_local() partitions endpoints into 'local' and 'remote' set but filter_for_query() already does it for CL=LOCAL so for such queries we needlessly do it twice.	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
Calle Wilund	d9b8c79eb9	commitlog_replayer: Ignore sstable replay positions With relaxed position ordering, we cannot use existing sstables as water mark for replay. We must replay everything above truncation marks.	2017-06-07 12:07:01 +00:00
Calle Wilund	2913241df1	memtable/commitlog: Change bookkeep to track individual segments Use per CF-id reference count instead, and use handles as result of add operations. These must either be explicitly released or stored (rp_set), or they will release the corresponding replay_position upon destruction. Note: this does _not_ remove the replay positioning ordering requirement for mutations. It just removes it as a means to track segment liveness.	2017-06-07 12:07:01 +00:00
Calle Wilund	3512ed4596	storage_service/config: Add "native_transport_port_ssl" option Mimic origin behaviour, iff TLS encryption is enabled, and native_transport_port_ssl is set and different from native_transport_port, start both tls- and non-tls listeners. Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>	2017-05-29 15:53:56 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Avi Kivity	c8cb3d6ff5	Merge "Materialized views: bug fixes and unit tests" from Duarte "This series fixes bugs related to materialized views, most pertaining to column filtering in the where clause." * 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla: tests/view_schema_test: Add more test cases tests/cql_assertions: Add assertion for row set equality single_column_relation: Correctly print IN relation statement_restrictions: Allow filtering regular columns for views statement_restrictions: Relax clustering restrictions for views statement_restrictions: Relax partition restrictions for views cql3/statements: Prevent setting default ttl on view cql3/restrictions: Complete implementation of is_satisfied_by() db/view: Re-implement clustering_prefix_matches() db/view: Re-implement partition_key_matches() db/view: Generate regular tombstone for base deletions db/view: Consider cell liveness when generating updates db/view: Don't generate view updates for static rows	2017-05-20 13:52:56 +03:00
Paweł Dziepak	c560cf9d9d	Merge "fixes and improvements in the permissions cache implementation" from Vlad "There are numerous issues in the current implementation of permissions cache starting from the logical errors and bugs and ending with the suboptimal implementation described in the issue #2262." * 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev: utils::loading_cache: avoid the reads storm when the key is not in the cache utils::loading_cache: cleanup utils::loading_cache: align the constrains in the constructor with the parameters description utils::loading_cache: refresh in the background auth::auth: add operator<<() for a permission_cache key auth::auth::permissions_cache: use the values from the configuration - don't try to be smart db::config: define a saner default value for permissions_validity_in_ms	2017-05-18 13:33:05 +01:00
Vlad Zolotarov	ea1cfabe28	db::config: define a saner default value for permissions_validity_in_ms It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms. This may cause the values to be invalidated only because some minor delays in the timer scheduling. It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms. This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling. In addition, 2s seems to be a too small value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s. This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data in the foreground. Setting it to 10s would allow a second read attempt before we enforce the foreground read. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 12:03:56 -04:00
Calle Wilund	29b20d410a	schema_tables: Remove "class" attribute from strategy options Not 100% proper, but in line with how we still store the info. Ensures (helps at least) to keep schema loaded from tables and schema from builder comparable. Fixes schema_changes_test error. Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>	2017-05-17 17:56:11 +03:00
Duarte Nunes	983af595e9	database: Read existing base mutations When generating updates for a materialized view we need to read the existing base row, to be able to determine the primary key of the view row the new base update will supplant, in case the view includes a base non-primary key column in its own primary key. That old view row will be tombstoned or updated, if it exists, depending on the difference between the new base row and the existing one, if any. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	8a77bfe35b	db/view: Calculate clustering ranges for MV read-before-write query Introduce the calculate_affected_clustering_ranges() function to calculate the smallest subset of affected clustering ranges that we need to query for. The update_requires_read_before_write() function checks whether a view is potentially affected by the base update. The patch also cleans up the may_be_affected_by() function. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	ec681060a8	db/view: Replace entry if cells don't match If a base table regular column is part of the view's pk, and if that column changes, we should replace the entry, by deleting the row(s) with the old value and inserting a new one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	bad0edb23b	db/view: Re-implement clustering_prefix_matches() This patch implements clustering_prefix_matches() in terms of abstract_restriction::is_satisfied_by() instead of ranges, which supports filtering just a subset of the clustering columns. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	b0d1ea76a2	db/view: Re-implement partition_key_matches() This patch implements partition_key_matches() in terms of abstract_restriction::is_satisfied_by() instead of ranges, which supports filtering just a component of a compound partition key. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	38be85a21d	db/view: Generate regular tombstone for base deletions Instead of shadowable tombstones, which only apply to updates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	1fd8b8e723	db/view: Consider cell liveness when generating updates This patch ensures we take into account the liveness of the base's regular column in the view's pk when generating view updates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	c421da6825	db/view: Don't generate view updates for static rows Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	f41a5e554d	view_info: Store base regular col in the view's PK as column_id This patch stores the base_non_pk_column_in_view column as column_id, which is more convenient, and it also stores a two-level optional to encode both lazy initialization and the absence of such a column. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Calle Wilund	c8f92536c1	legacy_schema_migrator: Actually truncate legacy schema tables on finish	2017-05-10 16:44:48 +00:00
Calle Wilund	6c8b5fc09d	schema_tables: Use v3 schema tables and formats Switches system/schema_* for system_schema/*, updates schema/schema builder and uses to hold/expect v3 style info (i.e. types & dropped).	2017-05-10 16:44:48 +00:00
Calle Wilund	f9b83e299e	type_parser: Origin expects empty string -> bytes_type	2017-05-10 16:44:48 +00:00
Calle Wilund	0e6ae8dec2	schema: rename column accessors to be in line with origin More pointedly: Expose columns as is (currently all_columns_in_select_order), expose name->column mapping more appropriately named. Renaming like this is not strictly necessary, but there is a point to trying to keep nomenclature similar-ish with origin, esp. when select order column need to become filtered (spoiler alert).	2017-05-10 16:44:48 +00:00
Calle Wilund	b1c5447ab5	cql3_type_parser: Resolve from cql3 names/expressions Cassandra 3 uses cql names for column/field types, thus we need to parse these out-of-line, and resolve more akin to the cql parser. Also wrap building user types similarly to origin, using a "builder" wrapper, and usage graph resolving.	2017-05-10 16:44:47 +00:00
Calle Wilund	3964055d98	legacy_schema_migrator: Add schema table converter Initial. Does not actually write anything.	2017-05-10 16:44:47 +00:00
Calle Wilund	8066efb710	system_keyspace: Add getter/setter for built index status Even though we have none.	2017-05-09 13:48:55 +00:00
Calle Wilund	061ef16562	system_tables/schema_tables: Remove special format case of "execute_cql" Having a variadic parameter being used in implicit sprint is not very readable + makes it less intuitive when suddenly system keyspace becomes more than one -> multiple sprints in the chain -> more confusion or more execution paths. It's not that horrible with some spread out sprint:s	2017-05-09 13:48:55 +00:00
Calle Wilund	27fdc5cfef	schema_tables/system_tables: Add v3 tables to "ALL" and handle in init I.e. deal with more than one keyspace in system_keyspace::make	2017-05-09 13:48:55 +00:00
Calle Wilund	815aa8ba9f	schema_tables: Add schema definitions for v3 tables	2017-05-09 13:48:55 +00:00
Calle Wilund	4378dca6e1	schema_tables: Hide/abstract schema keyspace name	2017-05-09 13:48:55 +00:00
Calle Wilund	2fb36e3bf8	system_keyspace: Add query overloads with named keyspace	2017-05-09 13:48:55 +00:00
Calle Wilund	32909d4c84	system_keyspace: Add v3+legacy schema definitions	2017-05-09 13:48:55 +00:00

867 Commits