scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Paweł Dziepak	9d82a1ebfd	abstract_read_executor: make make_requests() exception safe Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>	2017-08-22 12:09:42 +02:00
Avi Kivity	e428805ba5	Merge "Optimize query result partition and row counts" from Duarte "Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This series optimizes it, in case it is needed, and also changes the result message to include the partition and row counts, avoiding the calculation altogether." * 'calculate-counts/v3' of github.com:duarten/scylla: query-result: Send row and partition count over the wire query::result: Optimize calculate_counts()	2017-08-17 13:41:21 +03:00
Duarte Nunes	ec75eac37d	ring_position_exponential_vector_sharder: Take ranges by rvalue Avoids some copies. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170814093310.29200-1-duarte@scylladb.com>	2017-08-14 12:55:43 +03:00
Duarte Nunes	d7bab684ea	query::result: Optimize calculate_counts() Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This patch makes it a bit faster. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 10:28:29 +02:00
Duarte Nunes	bcf21aacc2	storage_proxy: Directly call query_nonsingular_mutations_locally Instead of duplicating the branch. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811001559.25788-1-duarte@scylladb.com>	2017-08-11 09:06:01 +03:00
Duarte Nunes	a3ee99554b	service/storage_proxy: Remove out of date comment Now that we don't go directly to reconciliation for range queries, the result isn't required to have the row and partition counts calculated (we no longer transform a reconciled_result to a query::result). Furthermore, this line was causing a lot of dtests to fail on account of them not expecting an error line in the logs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170810225351.12610-1-duarte@scylladb.com>	2017-08-11 09:04:23 +03:00
Asias He	49360992d9	storage_service: Use the new range_streamer interface for removenode So that removenode operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	6b8dc85f12	storage_service: Use the new range_streamer interface for decommission So that decommission operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	24584b8509	storage_service: Use the new range_streamer interface for rebuild So that rebuild operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:47 +08:00
Gleb Natapov	d2a2a6d471	storage_proxy: make range_slice_read_executor go through digest matching state Currently scanning reads go to the reconciliation stage directly, which requires asking for mutation data from all peers. This patch makes it try matching digests first, like a single partition read. The change requires internode protocol changes since currently it is not possible to ask for multi partition data/digest over RPC. It means that the capability has to be guarded by a new gossip feature flag, which the patch also adds.	2017-08-03 11:37:03 +03:00
Gleb Natapov	3b7d8c8767	storage_proxy: add capability to read data/digest for non singular ranges Currently only mutation_data read supports non singular ranges. This patch extends data/digest reads to support them too.	2017-08-03 10:35:09 +03:00
Gleb Natapov	c619ef258b	storage_proxy: remove redundant parameter from never_speculating_read_executor constructor never_speculating_read_executor always waits for all targets so block_for parameter is always equal to targets.size(). No need to pass it explicitly.	2017-08-03 10:08:44 +03:00
Tomasz Grabiec	e09220dbff	migration_manager: Log schema pulls	2017-07-27 20:08:25 +02:00
Tomasz Grabiec	350d98d4e1	migration_manager: Prevent pull requests from accumulating If schema merging completes at lower rate than incoming pull requests, then merge processes will accumulate and needlessly request and hold schema mutations. In rare cases, when there are constant schema changes, they may even overflow memory. This was seen in dtest: concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test Allowing only one active and one queued pull request per remote endpoint is enough.	2017-07-27 20:08:25 +02:00
Vlad Zolotarov	e98adb13d5	service::storage_service: initialize auth and tracing after we joined the ring Initialize the system_auth and system_traces keyspaces and their tables after the Node joins the token ring because as a part of system_auth initialization there are going to be issued SELECT and possibly INSERT CQL statements. This patch effectively reverts the `d3b8b67` patch and brings the initialization order to how it was before that patch. Fixes #2273 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>	2017-07-27 10:54:36 +02:00
Vlad Zolotarov	9086c643a6	service::storage_proxy: add a trace points pair in the SELECT replica flow Add two trace points: at the beginning and at the end of the replica flow on the replica shard. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>	2017-07-20 16:44:25 +02:00
Calle Wilund	247c36e048	system_schema: Fix remaining places not handling two system keyspaces Some places remained where code looked directly at system_keyspace::NAME to determine iff a ks is considered special/system/protected, including schema digest calculation. Export "is_system_keyspace" and use accordingly. Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>	2017-07-19 16:18:45 +03:00
Duarte Nunes	b8235f2e88	storage_proxy: Preserve replica order across mutations In storage_proxy we arrange the mutations sent by the replicas in a vector of vectors, such that each row corresponds to a partition key and each column contains the mutation, possibly empty, as sent by a particular replica. There is reconciliation-related code that assumes that all the mutations sent by a particular replica can be found in a single column, but that isn't guaranteed by the way we initially arrange the mutations. This patch fixes this and enforces the expected order. Fixes #2531 Fixes #2593 Signed-off-by: Gleb Natapov <gleb@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170713162014.15343-1-duarte@scylladb.com>	2017-07-14 12:11:22 +03:00
Gleb Natapov	f88723e739	storage_proxy: pass pending_endpoints by reference instead of by value This makes lifetime of dead_endpoints object more clear and move() also has its price. Message-Id: <20170710084549.GX2324@scylladb.com>	2017-07-11 16:52:21 +03:00
Tomasz Grabiec	07ed512060	migration_manager: Give empty response to schema pulls from incompatible nodes The old nodes which are still using v2 schema tables will fail to apply our response, with error messages complaining about not being able to locate schema of certain versions (new schema tables). This change inhibits such errors by responding with an empty mutation list.	2017-07-07 19:09:57 +02:00
Tomasz Grabiec	5f613d0527	migration_manager: Don't pull schema from incompatible nodes Currently it results in scary error messages in logs about not being able to find schema of given version. It's benign, but may scare users. In the future incompatibilities could result in more subtle errors. Better to inhibit it completely.	2017-07-07 19:08:59 +02:00
Tomasz Grabiec	18a9e1762c	service: Advertise schema tables format version through gossip Will be needed to inhibit schema exchange on per-peer basis.	2017-07-07 19:07:59 +02:00
Tomasz Grabiec	ae4b24db06	misc_services: Switch to using reads_with[_no]_misses counters They better approximate the intended meaning than hits/misses, which according to Gleb is whether a read did any I/O or not.	2017-07-04 13:55:06 +02:00
Piotr Jastrzebski	05b56fcfb0	mutation_partition: Add support for specifying continuity This will allow expressing lack of information about certain ranges of rows (including the static row), which will be used in cache to determine if information in cache is complete or not. Continuity is represented internally using flags on row entries. The key range between two consecutive entries is continuous iff rows_entry::continuous() is true for the later entry. The range starting after the last entry is assumed to be continuous. The range corresponding to the key of the entry is continuous iff rows_entry::dummy() is false. [tgrabiec: - based on the following commits: 4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry 773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry - documented that partition tombstone is always complete - require specifying the partition tombstone when creating an incomplete entry - replaced rows_entry(dummy_tag, ...) constructor with more general rows_entry(position_in_partition, ...) - documented continuity semantics on mutation_partition - fixed _static_row_cached being lost by mutation_partition copy constructors - fixed conversion to streamed_mutation to ignore dummy entries - fixed mutation_partition serializer to drop dummy entries - documented semantics of continuity on mutation_partition level - dropped assumptions that dummy entries can be only at the last position - changed equality to ignore continuity completely, rather than partially (it was not ignoring dummy entries, but ignoring continuity flag) - added printout of continuity information in mutation_partition - fixed handling of empty entries in apply_reversibly() with regards to continuity; we no longer can remove empty entries before merging, since that may affect continuity of the right-hand mutation. Added _erased flag. 
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key - fixed partition_builder to not ignore continuity - renamed dummy_tag_t to dummy_tag. _t suffix is reserved. - standardized all APIs on is_dummy and is_continuous bool_class:es - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics - dropped unused remove_dummy_entry() - simplified and inlined cache_entry::add_dummy_entry() - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous ]	2017-06-24 18:06:11 +02:00
Gleb Natapov	9b8499df0e	cache_hitrate_calculator: filter cfs based on replication strategy instead of a name The code filters CFs by name to not include system keyspace, but v3 schema added yet another system namespace. Better filter according to replication strategy to accommodate for schema v4 adding even more system keyspaces. Fixes: #2516 Message-Id: <20170621073816.GB3944@scylladb.com>	2017-06-22 11:26:34 +03:00
Gleb Natapov	72a4554dd9	storage_proxy: Fix compilation on older (1.55) boost Boost 1.55 (ubuntu 14) fails to compile because an iterator produced by boost::adaptors::transformed() when std::ref to lambda is passed to it does not match the iterator concept. It cannot be default constructed because std::reference_wrapper is not default constructible. boost::range::min_element() never actually default constructs it, but the concept is checked anyway. The patch fixes it by providing an explicit functor that is default constructible. Message-Id: <20170618131836.GD3944@scylladb.com>	2017-06-18 16:54:41 +03:00
Duarte Nunes	b2c5aca4cf	db/schema_tables: View mutations shouldn't always include base ones When making the schema mutations for a view update, we should only include the base table schema mutations (in case the target node doesn't contain them) when the view is being directly updated. When it is being updated as a side effect of updating the base table, then including the base schema mutations will hide the actual changes being performed on the base. Fixes #2500 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>	2017-06-18 16:29:59 +03:00
Gleb Natapov	87094849fa	storage_proxy: load balance read requests according to cache hit rates This patch makes storage proxy choose replicas to read from based on their cache hit rates. Replicas with higher cache hit rates will see more requests while replicas with lower hit rates will see fewer. Local node has a special bonus and will get more requests even if another node has slightly higher cache hit rate (same goes for local vs remote DC), but after the patch it is no longer guaranteed that a coordinator node will be chosen as a replica for the read (if the feature is enabled).	2017-06-13 09:57:14 +03:00
Gleb Natapov	bc8aa1b4ee	choose extra replica for speculation in filter_for_query() Currently storage proxy has to loop over remaining replicas to search for suitable extra replica, but doing it in filter_for_query() is extremely easy, so do it there instead.	2017-06-13 09:57:14 +03:00
Gleb Natapov	0e4d5bc2f3	Store cluster wide cache hit statistics in CF	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
Gleb Natapov	8ca1432b04	Distribute cache temperature over gossiper. When a node starts it does not have any information about cache temperature of other nodes in the cluster and it is hard (if not impossible) to make the right guess. During cluster startup all nodes have cold caches, so there is no point in redirecting reads to other nodes even though the local cache is cold, but if only that node restarted then other nodes have populated caches and reads should be redirected. The node will get up-to-date information about other nodes' caches, but only after receiving the first reply; until then it does not have the information to make the right decisions, which may cause unwanted spikes immediately after restart. Having cache temperature in gossiper helps to solve the problem.	2017-06-13 09:57:14 +03:00
Gleb Natapov	991ec4a16c	periodically calculate avg cache hit rate between all shards This patch adds new class cache_hitrate_calculator whose responsibility is to periodically calculate average cache hit rates between all shards for each CF.	2017-06-13 09:57:14 +03:00
Gleb Natapov	f59ecc2687	Rename load_broadcaster.cc to misc_services.cc load_broadcaster is very small class, move it into generic file so that we can put other small services there to save on compilation time.	2017-06-13 09:57:14 +03:00
Gleb Natapov	7bcf4c690f	storage_proxy: use db::count_local_endpoints function instead of open-coding it	2017-06-13 09:57:14 +03:00
Calle Wilund	3512ed4596	storage_service/config: Add "native_transport_port_ssl" option Mimic origin behaviour, iff TLS encryption is enabled, and native_transport_port_ssl is set and different from native_transport_port, start both tls- and non-tls listeners. Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>	2017-05-29 15:53:56 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Avi Kivity	1a99ebaa65	storage_proxy: switch to the exponential sharder for nonsingular queries Nonsingular queries used exponential expansion of the token space to avoid spending too much cpu time on near-empty tables, but the generation of the search space was itself exponential. Switch to the exponential sharder which has linear cost.	2017-05-17 13:50:30 +03:00
Avi Kivity	f5dae826ce	Merge "Migrate schema tables to v3 format" from Calle "Defines origin v3-format for system/schema tables, and use them for schema storage/retrival. Includes a legacy_schema_migrator implementation/port from origin. Note that since we don't support features like triggers, functions and aggregates, it will bail if encountering such a feature used. Note also that this patch set does not convert the "hints" and "backlog" tables, even though these have changed in v3 as well. That will be a separate patch set. Tested against dtests. Note that patches for dtest + ccm will follow." * 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits) legacy_schema_migrator: Actually truncate legacy schema tables on finish database: Extract "remove" from "drop_columnfamily" v3 schema test fixes thrift: Update CQL mapping of static CFs schema_tables: Use v3 schema tables and formats type_parser: Origin expects empty string -> bytes_type cf_prop_defs: Add crc_check_chance as recognized (even if we don't use) types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s cql3_type: v3 to_string cql_types: Introduce cql3_type::empty and associate with empty data_type schema: rename column accessors to be in line with origin schema: Add "is_static_compact_table" schema_builder: Add helper to generate unique column names akin origin schema: Add utility functions for static columns schema: Use heterogeneous comparator for columns bounds cql3_type_parser: Resolve from cql3 names/expressions cql3_type: Add "prepare_interal" and "references_user_type" cql3::cql3_type: Add prepare_internal path using only "local" holders cql3_type: Add virtual destructor. database/main: encapsulate system CF dir touching ...	2017-05-17 11:25:52 +03:00
Vlad Zolotarov	a0737abdc5	cql_server::response: rework the tracing session ID insertion Insert the tracing session ID into the response body in the cql_server::response constructor. Fixes #2356 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 15:57:28 -04:00
Gleb Natapov	385645e8df	storage_proxy: Fix mutation logging Log mutation type only if mutation set is not empty. Message-Id: <20170510142406.GA30426@scylladb.com>	2017-05-11 15:49:52 +01:00
Vlad Zolotarov	a855e82eff	service::client_state: don't allow dropping the system_auth and system_traces objects Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables) but allow their modification (including tables). We need to be able to modify keyspaces in order to set/modify the replication strategy and its parameters. We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables have changed. Fixes #2346 Fixes #2338 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>	2017-05-11 13:03:30 +01:00
Paweł Dziepak	ba6b74e305	storage_service: counters are no longer experimental Message-Id: <20170510124552.23558-1-pdziepak@scylladb.com>	2017-05-10 17:18:32 +03:00
Gleb Natapov	ab92406585	storage_proxy: optimize reconcile logic for CL=ONE Regular single key query will never reconcile with CL=ONE since there will be no digest mismatch, but range queries do not have a digest stage, so they always go through the reconcile code. For CL=ONE there will be only one result though, so there is no need to run complicated reconciliation logic and the only result can be returned directly. Message-Id: <20170509100334.GQ28272@scylladb.com>	2017-05-10 17:09:34 +03:00
Calle Wilund	539b65fc90	client_state: Make "has_access" auth check schema ks name independent	2017-05-09 13:48:55 +00:00
Gleb Natapov	2d5a7c8058	storage_proxy: make read repair stats accessible through Prometheus Currently they can be read only through JMX. Message-Id: <20170509075546.GN28272@scylladb.com>	2017-05-09 11:23:38 +03:00
Avi Kivity	8c5c5d3004	Merge "CQL front-end for secondary indices" from Pekka "This patch series adds CQL front-end support for secondary indices. You can now execute CREATE INDEX and DROP INDEX statements, which will update the newly added "Indexes" system table. However, the indexes are not actually backed up by anything nor are they available for CQL queries. The feature is hidden behind a new cluster feature flag and enabled only with the "--experimental" flag." * 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits) schema: Kill index_type enum schema: Kill index_info class cql3/statements/create_index_statement: Use database::existing_index_names() in validation cql3/statements: Use secondary index manager in alter_table_statement class index: Add secondary_index_manager thrift/handler: Use index_metadata db/schema_tables: Index persistence schema: Add all_indices() to schema class schema: Remove add_default_index_names() from schema_builder class db/schema_tables: Add system table for indices cql3/Cgl.g: DROP INDEX cql3/statements: Add drop_index_statement class database: Add find_indexed_table() to database class cql3: Return change event from announce_migration() cql3/statements: Multiple index targets for CREATE INDEX cql3/statements: Use index_metadata in create_index_statement class cql3/statements: Use feature flag in create_index_statement class service/storage_service: Add feature flag for secondary indices database: Add get_available_index_name() to database class schema: Add get_default_index_name() to index_metadata class ...	2017-05-08 17:04:40 +03:00
Calle Wilund	2049303399	query_pagers: bugfix: must count pk only/pk + static rows as 1 Previously only counted clustered/regular Message-Id: <1494249013-4069-1-git-send-email-calle@scylladb.com>	2017-05-08 16:35:27 +03:00
Avi Kivity	9e67bd5aac	Merge "Add partial range deletion support" from Duarte "This series introduces partial support for range deletions. This allows deletion operations such as delete from cf where p=1 and c > 0 and c <= 3. This series only adds support for single-column range restrictions. We enforce that both range bounds be specified, because we can't represent infinite bounds in the current sstable format. Such bounds are represented as a prefix with no components, with the bound_kind informing whether they are a bottom or top bound. We're currently unable to serialize an infinite bound in such a way that it would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a composite with a (<length><value><EOC>)+ format. While we could technically represent the bottom bound, the top bound, if written as a single component with 0 bytes in size and some EOC, would always sort before other values. The same would happen if represented as an empty (no components) composite, because in Cassandra 2.2.x those always have EOC = NONE. This limitation should stay in place until we can properly represent range tombstones in the storage format." * 'range-deletions/v2' of https://github.com/duarten/scylla: mutation: Set cell using clustering_key_prefix mutation_partition: Harmonize apply_delete overloads prefix_compound_view_wrapper: Add is_full and is_empty functions tests/cql_query_test: Add range deletion tests cql3: Partially support ranged deletions single_column_primary_key_restrictions: Implement has_bound() modification_statement: Use statement_restrictions for where clause statement_restrictions: Expose primary key restrictions to_string: Add missing include	2017-05-07 19:27:09 +03:00
Avi Kivity	a592573491	Remove exception specifications C++17 removed exception specifications from the language, and gcc 7 warns about them even in C++14 mode. Remove them from the code base.	2017-05-05 17:02:31 +03:00
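
The continuity rules stated in commit 05b56fcfb0 (mutation_partition: Add support for specifying continuity) can be illustrated with a small sketch. This is not the Scylla implementation: `entry_flags`, `partition_rows` and `is_continuous_at` are hypothetical names, integer keys stand in for clustering positions, and the handling of keys before the first entry is an assumption, since the commit message only covers ranges between entries and after the last one.

```cpp
#include <map>

// Hypothetical stand-in for rows_entry, holding only the two flags
// named in the commit message.
struct entry_flags {
    bool continuous; // is the range between the previous entry and this one continuous?
    bool dummy;      // dummy entries do not cover their own key
};

// Hypothetical simplification: integer keys instead of clustering positions.
using partition_rows = std::map<int, entry_flags>;

// Returns true if information about `key` is complete under the rules
// quoted in the commit message.
bool is_continuous_at(const partition_rows& rows, int key) {
    auto it = rows.lower_bound(key); // first entry whose key is >= key
    if (it == rows.end()) {
        return true; // after the last entry: assumed continuous
    }
    if (it->first == key) {
        return !it->second.dummy; // the entry's own key: continuous iff not dummy
    }
    // Strictly before `it`: governed by the flag on the later entry.
    // Assumption: the same rule is applied before the first entry.
    return it->second.continuous;
}
```

With entries at keys 10, 20 and 30, where only the entry at 20 is marked continuous and the entry at 30 is a dummy, key 15 is covered, key 25 is not, key 30 itself is not, and anything above 30 is covered.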

1049 Commits