scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-03 06:35:51 +00:00

Author	SHA1	Message	Date
Nadav Har'El	515399ce17	materialized views: move hints to top-level directory While we keep ordinary hints in a directory parallel to the data directory, we decided to keep the materialized view hints in a subdirectory of the data directory, named "view_pending_updates". But during boot, we expect all subdirectories of data/ to be keyspace names, and when we notice this one, we print a warning: WARN: database - Skipping undefined keyspace: view_pending_updates This spurious warning annoyed users. But moreover, we could have bigger problems if the user actually tries to create a keyspace with that name. So in this patch, we move the view hints to a separate top-level directory, which defaults to /var/lib/scylla/view_hints, but as usual can be configured. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190107142257.16342-1-nyh@scylladb.com> (cherry picked from commit `da090a5458`)	2019-01-07 22:01:56 +02:00
Tomasz Grabiec	20c2745592	Merge "Improve times to start / stop the nodes" from Glauber If the compaction manager is started, compactions may start (this is regardless of whether or not we trigger them). The problem with that is that they start at a time in which we are flushing the commitlog and the initialization procedure waits for the commitlog to be fully flushed and the resulting memtables flushed before we move on. Because there are no incoming writes, the amount of shares in memtable flushes decrease as memory used decreases and that can cause the startup procedure to take a long time. We have recently started to bump the shares manually for manual flushes. While that guarantees that we will not drive the shares to zero, I will make the argument that we can do better by making sure that those things are, at this point, running alone: user experience is affected by startup times and the bump we give to user-triggered operations will only do so much. Even if we increase the shares a lot flushes will still be fighting for resources with compactions and startup will take longer than it could. By making sure that flushes are this point running alone we improve the user experience by making sure the startup is as fast as it can be. There is a similar problem at the drain level, which is also fixed in this series. Fixes #3958 * git@github.com:glommer/scylla.git faster-restart compaction_manager: delay initialization of the compaction manager. drain: stop compactions early (cherry picked from commit `3e70ae1d06`)	2019-01-03 14:56:16 +01:00
Duarte Nunes	5558fa8c44	service/storage_proxy: Protect against empty mutation when storing hint mutation_holder::get_mutation_for() can return nullptr's, so protect against those when storing a hint. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181221194853.98775-2-duarte@scylladb.com> (cherry picked from commit `e6a8883228`)	2018-12-23 12:27:27 +02:00
Duarte Nunes	f678eb52cd	service/storage_proxy: Protect against empty mutation in mutation_holder The per_destination_mutation holder can contain empty mutations, so make sure release_mutation() skips over those. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181221194853.98775-1-duarte@scylladb.com> (cherry picked from commit `6c4a34f378`)	2018-12-23 12:27:25 +02:00
Avi Kivity	dbe347811c	Merge "materialized views: Apply backpressure from view replicas" from Duarte " As the amount of pending view updates increases we know that there’s a mismatch between the rate at which the base receives writes and the rate at which the view retires them. We react by applying backpressure to decrease the rate of incoming base writes, allowing the slow view replicas to catch up. We want to delay the client’s next writes to a base replica and we use the base’s backlog of view updates to derive this delay. To validate this approach we tested a 3 node Scylla cluster on GCE, using n1-standard-4 instances with NVMEs. A loader running on a n1-standard-8 instance run cassandra-stress with 100 threads. With the delay function d(x) set to 1s, we see no base write timeouts. With the delay function as defined in the series, we see that backlogs stabilize at some (arbitrary) point, as predicted, but this stabilization co-exists with base write timeouts. However, the system overall behaves better than the current version, with the 100 view update limit, and also better than the version without such limit or any backpressure. More work is necessary to further stabilize the system. Namely, we want to keep delaying until we see the backlog is decreasing. This will require us to add more delay beyond the stabilization point, which in turn should minimize the base write timeouts, and will also minimize the amount of memory the backlog takes at each base replica. Design document: https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo Fixes #2538 " Reviewed-by: Nadav Har'El <nyh@scylladb.com> * 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits) service/storage_proxy: Release mutation as early as possible service/storage_proxy: Delay replica writes based on view update backlog service/storage_proxy: Get the backlog of a particular base replica service/storage_proxy: Add counters for delayed base writes main: Start and stop the view_update_backlog_broker service: Distribute a node's view update backlog service: Advertise view update backlog over gossip service/storage_proxy: Send view update backlog from replicas service/storage_proxy: Prepare to receive replica view update backlog service/storage_proxy: Expose local view update backlog tests/view_schema_test: Add simple test for db::view::node_update_backlog db/view: Introduce node_update_backlog class db/hints: Initialize current backlog database: Add counter for current view backlog database: Expose current memory view update backlog idl: Add db::view::update_backlog db/view: Add view_update_backlog database: Wait on view update semaphore for view building service/storage_proxy: Use near-infinite timeouts for view updates database: generate_and_propagate_view_updates no longer needs a timeout ... (cherry picked from commit `b66f59aa3d`)	2018-12-20 19:11:56 +02:00
Asias He	aac363ca86	storage_service: Notify NEW_NODE only when a node is new node This is a backport of CASSANDRA-11038. Before this, a restarted node will be reported as new node with NEW_NODE cql notification. To fix, only send NEW_NODE notification when the node was not part of the cluster Fixes: #3979 Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com> (cherry picked from commit `71c1681f6c`)	2018-12-16 13:59:19 +02:00
Duarte Nunes	9e6cc5b024	service/storage_proxy: Embed the expire timer in the response handler Embedding the expire timer for a write response in the abstract_write_response_handler simplifies the code as it allows removing the rh_entry type. It will also make the timeout easily accessible inside the handler, for future patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181213111818.39983-1-duarte@scylladb.com> (cherry picked from commit `f8878238ed`)	2018-12-13 13:24:09 +00:00
Duarte Nunes	13b72c7b92	Merge branch 'gossip: Send node UP event to cql client after cql server is up' from Asias " This is a backport of CASSANDRA-8236. Before this patch, scylla sends the node UP event to cql client when it sees a new node joins the cluster, i.e., when a new node's status becomes NORMAL. The problem is, at this time, the cql server might not be ready yet. Once the client receives the UP event, it tries to connect to the new node's cql port and fails. To fix, a new application_sate::RPC_READY is introduced, new node sets RPC_READY to false when it starts gossip in the very beginning and sets RPC_READY to true when the cql server is ready. The RPC_READY is a bad name but I think it is better to follow Cassandra. Nodes with or without this patch are supposed to work together with no problem. Refs #3843 " * 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev: storage_service: Use cql_ready facility storage_service: Handle application_state::RPC_READY storage_service: Add notify_cql_change storage_service: Add debug log in notify_joined storage_service: Add extra check in notify_joined storage_service: Add notify_joined storage_service: Add debug log in notify_up storage_service: Add extra check in notify_up storage_service: Add notify_up storage_service: Make notify_left log debug level storage_service: Introduce notify_left storage_service: Add debug log in notify_down storage_service: Introduce notify_down storage_service: Add set_cql_ready gossip: Add gossiper::is_cql_ready gms: Add endpoint_state::is_cql_ready gms: Add application_state::RPC_READY gms: Introduce cql_ready in versioned_value (cherry picked from commit `a42b2895c2`)	2018-12-13 12:06:59 +00:00
Duarte Nunes	97cd9108d6	db/system_distributed_keyspace: Create the schema with min_timestamp Different nodes can concurrently create the distributed system keyspace on boot, before the "if not exists" clause can take effect. However, the resulting schema mutations will be different since different nodes use different timestamps. This patch forces the timestamps to be the same across all nodes, so we save some schema mismatches. This fixes a bug exposed by `ca5dfdf`, whereby the initialization of the distributed system keyspace is done before waiting for schema agreement. While waiting for schema agreement in storage_service::join_token_ring(), the node still hasn't joined the ring and schemas can't be pulled from it, so nodes can deadlock. A similar situation can happen between a seed node and a non-seed node, where the seed node progresses to a different "wait for schema agreement" barrier, but still can't make progress because it can't pull the schema from the non-seed node still trying to join the ring. Finally, it is assumed that changes to the schema of the current distributed system keyspace tables will be protected by a cluster feature and a subsequent schema synchronization, such that all nodes will be at a point where schemas can be transferred around. Fixes #3976 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181211113407.20075-1-duarte@scylladb.com> (cherry picked from commit `89ae3fbf11`)	2018-12-11 14:53:30 +00:00
Gleb Natapov	4820130abe	storage_proxy: fix crash during write timeout callback invocation rh_entry address is captured inside timeout's callback lambda, so the structure should not be moved after it is created. Change the code to create rh_entry in-place instead of moving it into the map. Fixes #3972. Message-Id: <20181206164043.GN25283@scylladb.com> (cherry picked from commit `9fb79bf379`)	2018-12-09 15:25:52 +02:00
Gleb Natapov	2e8fefbc5a	storage_proxy: store hint for CL=ANY if all nodes replied with failure Current code assumes that request failed if all replicas replied with failure, but this is not true for CL=ANY requests. Take it into account. Fixed: #3565 (cherry picked from commit `17197fb005`)	2018-12-05 20:14:58 +00:00
Gleb Natapov	6be0635029	storage_proxy: complete write request early if all replicas replied with success of failure Currently if write request reaches CL and all replicas replied, but some replied with failures, the request will wait for timeout to be retired. Detect this case and retire request immediately instead. Fixes #3566 (cherry picked from commit `d1d04eae3c`)	2018-12-05 20:14:57 +00:00
Gleb Natapov	04a544c0a2	storage_proxy: check that write failure response comes from recognized replica Before accounting failure response we need to make sure it comes from a replica that participates in the request. (cherry picked from commit `76ab3d716b`)	2018-12-05 20:14:57 +00:00
Gleb Natapov	028f9b95d1	storage_proxy: move code executed on write timeout into separate function Currently the callback is in lambda, but we will want to call the code not only during timer expiration. (cherry picked from commit `7bc68aa0eb`)	2018-12-05 20:14:57 +00:00
Avi Kivity	54258ca8eb	Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte " This series changes hinted handoff to work with `frozen_mutation`s instead of naked `mutation`s. Instead of unfreezing a mutation from the commitlog entry and then freezing it again for sending, now we'll just keep the read, frozen mutation. Tests: unit(release) " * 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla: db/hints/manager: Use frozen_mutation instead of mutation db/hints/manager: Use database::find_schema() db/commitlog/commitlog_entry: Allow moving the contained mutation service/storage_proxy: send_to_endpoint overload accepting frozen_mutation service/storage_proxy: Build a shared_mutation from a frozen_mutation service/storage_proxy: Lift frozen_mutation_and_schema service/storage_proxy: Allow non-const ranges in mutate_prepare() (cherry picked from commit `1891779e64`)	2018-12-05 20:14:57 +00:00
Gleb Natapov	c9a030f1f0	storage_proxy: count number of timed out write attempts after CL is reached It is useful to have this counter to investigate the reason for read repairs. Non zero value means that writes were lost after CL is reached and RR is expected. Message-Id: <20181009120900.GF22665@scylladb.com> (cherry picked from commit `207b57a892`)	2018-12-05 20:14:57 +00:00
Gleb Natapov	1c7daef554	storage_proxy: do not pass write_stats down to send_to_live_endpoints write_stats is referenced from write handler which is available in send_to_live_endpoints already. No need to pass it down. Message-Id: <20181009133017.GA14449@scylladb.com> (cherry picked from commit `319ece8180`)	2018-12-05 20:14:57 +00:00
Duarte Nunes	b0a9c40ab1	service/storage_proxy: Consider target liveness in sent_to_endpoint() So we don't attempt to send mutations to unreachable endpoints and instead store a hint for them, we now check the endpoint status and populate dead_endpoints accordingly in storage_proxy::send_to_endpoint(). Fixes #3820 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181007100640.2182-1-duarte@scylladb.com> (cherry picked from commit `30d6ed8f92`)	2018-12-03 18:38:05 +00:00
Duarte Nunes	53924e5c7f	service/storage_proxy: Fix formatting of send_to_endpoint() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181006204756.32232-1-duarte@scylladb.com> (cherry picked from commit `a69d468101`)	2018-12-03 18:37:59 +00:00
Duarte Nunes	1953c5fa61	Merge 'Fix filtering with LIMIT' from Piotr " This series adds proper handling of filtering queries with LIMIT. Previously the limit was erroneously applied before filtering, which leads to truncated results. To avoid that, paged filtering queries now use an enhanced pager, which remembers how many rows dropped and uses that information to fetch for more pages if the limit is not yet reached. For unpaged filtering queries, paging is done internally as in case of aggregations to avoid returning keeping huge results in memory. Also, previously, all limited queries used the page size counted from max(page size, limit). It's not good for filtering, because with LIMIT 1 we would then query for rows one-by-one. To avoid that, filtered queries ask for the whole page and the results are truncated if need be afterwards. Tests: unit (release) " * 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla: tests: add filtering with LIMIT test tests: split filtering tests from cql_query_test cql3: add proper handling of filtering with LIMIT service/pager: use dropped_rows to adjust how many rows to read service/pager: virtualize max_rows_to_fetch function cql3: add counting dropped rows in filtering pager (cherry picked from commit `1afda28cf3`)	2018-12-02 12:07:46 +02:00
Nadav Har'El	9d458ffea9	Materialized Views and Secondary Index: no longer experimental After this patch, the Materialized Views and Secondary Index features are considered generally-available and no longer require passing an explicit "--experimental=on" flag to Scylla. The "--experimental=on" flag and the db::config::check_experimental() function remain unused, as we graduated the only two features which used this flag. However, we leave the support for experimental features in the code, to make it easier to add new experimental features in the future. Another reason to leave the command-line parameter behind is so existing scripts that still use it will not break. Fixes #3917 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20181115144456.25518-1-nyh@scylladb.com> (cherry picked from commit `78ed7d6d0c`)	2018-11-15 19:50:30 +02:00
Duarte Nunes	9776a048e7	Merge 'Generating view updates during streaming' from Piotr During streaming, there are cases when we should invoke the view write path. In particular, if we're streaming because of repair or if a view has not yet finished building and we're bootstrapping a new node. The design constraints are: 1) The streamed writes should be visible to new writes, but the sstable should not participate in compaction, or we would lose the ability to exclude the streamed writes on a restart; 2) The streamed writes must not be considered when generating view updates for them; 3) Resilient to node restarts; 4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges. We achieve this by writing the streamed writes to an sstable in a different folder, call it "staging". We achieve 1) by publishing the sstable to the column family sstable set, but excluding it from compactions. We do these steps upon boot, by looking at the staging directory, thus achieving 3). Fixes #3275 * 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits) tests: add materialized views test tests: add view update generator to cql test env main: add registering staging sstables read from disk database: add a check if loaded sstable is already staging database: add get_staging_sstable method streaming: stream tables with views through staging sstables streaming: add system distributed keyspace ref to streaming streaming: add view update generator reference to streaming main: add generating missed mv updates from staging sstables storage_service: move initializing sys_dist_ks before bootstrap db/view: add view_update_from_staging_generator service db/view: add view updating consumer table: add stream_view_replica_updates table: split push_view_replica_updates table: add as_mutation_source_excluding table: move push_view_replica_updates to table.cc database: add populating tables with staging sstables database: add creating /staging directory for sstables database: add sstable-excluding reader table: add move_sstable_from_staging_in_thread function ... (cherry picked from commit `a38f6078fb`)	2018-11-15 17:46:20 +02:00
Asias He	10cf97375e	streaming: Expose reason for streaming On receiving a mutation_fragment or a mutation triggered by a streaming operation, we pass an enum stream_reason to notify the receiver what the streaming is used for. So the receiver can decide further operation, e.g., send view updates, beyond applying the streaming data on disk. Fixes #3276 Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com> (cherry picked from commit `7f826d3343`)	2018-11-15 17:45:31 +02:00
Avi Kivity	6bf4a73d88	thrift: limit message size Limit message size according to the configuration, to avoid a huge message from allocating all of the server's memory. We also need to limit memory used in aggregate by thrift, but that is left to another patch. Fixes #3878. Message-Id: <20181024081042.13067-1-avi@scylladb.com> (cherry picked from commit `a9836ad758`)	2018-10-24 19:32:25 +03:00
Vlad Zolotarov	00dc400993	storage_proxy::query_result_local: create a single tracing span on a replica shard Every call of a tracing::global_trace_state_ptr object instead of a tracing::tracing_state_ptr or a call to tracing::global_trace_state_ptr::get() creates a new tracing session (span) object. This should never be done unless query handling moves to a different shard. Fixes #3862 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20181018003500.10030-1-vladz@scylladb.com> (cherry picked from commit `a87c11bad2`)	2018-10-24 09:45:14 +03:00
Duarte Nunes	28fd2044d2	Merge 'hinted handoff: add manager::state and split storing and replaying enablement' from Vlad " Refs #3828 (Probably fixes it) We found a few flaws in a way we enable hints replaying. First of all it was allowed before manager::start() is complete. Then, since manager::start() is called after messaging_service is initialized there was a time window when hints are rejected and this creates an issue for MV. Both issues above were found in the context of #3828. This series fixes them both. Tested {release}: dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test dtest: hintedhandoff_additional_test.py " * 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla: hinted handoff: enable storing hints before starting messaging_service db::hints::manager: add a "started" state db::hints::manager: introduce a _state (cherry picked from commit `3a53b3cebc`)	2018-10-24 09:43:03 +03:00
Piotr Sarna	aeb418af9e	service/pager: avoid dereferencing null partition key The pager::state() function returns a valid paging object even if the pager itself is exhausted. It may also not contain the partition key, so using it unconditionally was a bug - now, in case there is no partition key present, paging state will contain an empty partition key. Fixes #3829 Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com> (cherry picked from commit `b3685342a6`)	2018-10-15 12:47:25 +03:00
Duarte Nunes	60edaec757	Merge 'Fix issues with endpoint state replication to other shards' from Tomasz Fixes #3798 Fixes #3694 Tests: unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test) * tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla: gms/gossiper: Replicate enpoint states in add_saved_endpoint() gms/gossiper: Make reset_endpoint_state_map() have effect on all shards gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards gms/gossiper: Always override states from older generations (cherry picked from commit `48ebe6552c`)	2018-10-09 10:14:30 +03:00
Calle Wilund	2996b8154f	storage_proxy: Add missing re-throw in truncate_blocking Iff truncation times out, we want to log it, but the exception should not be swallowed, but re-thrown. Fixes #3796. Message-Id: <20181001112325.17809-1-calle@scylladb.com>	2018-10-01 19:07:04 +02:00
Piotr Sarna	c41e0ade6c	storage_proxy: make get_restricted_ranges public This function is useful for splitting ranges in indexed queries.	2018-09-27 15:29:28 +02:00
Piotr Sarna	b6d90b2869	pager: make state() defined for exhausted pagers If service::pager is exhausted, state() function used to return a nullptr instead of a pointer to a valid paging state and the documented return type in this case was 'unspecified'. Sometimes a paging state may be needed anyway, even if the pager is already exhausted - thus, state() return value becomes defined after this commit. Exhausted pagers will return a valid object to a state with _remaining field set to 0.	2018-09-27 15:29:28 +02:00
Piotr Sarna	336cc70438	pager: add setters for partition/clustering keys	2018-09-27 15:18:06 +02:00
Piotr Sarna	1d34ef38a8	cql3: make pagers use time_point instead of duration A standard way for passing a timeout parameter is specifying a time_point, while pagers used to take a duration in order to compute time points on the fly. This patch adds a timeout parameter, which is a time_point, to fetch_page().	2018-09-27 15:18:06 +02:00
Avi Kivity	337ee6153a	Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr " This patchset makes it possible to use SSTables 'mc' format, commonly referred to as 'SSTables 3.x', when running Scylla instance. Several bugs found on this way are fixed. Also, a configuration option is introduced to allow running Scylla either with 'mc' or 'la' format as default. Tests: unit {release} + tested Scylla with both 'la' and 'mc' formats to work fine: cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890] cqlsh> USE test; cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''}; cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8); <<flush>> cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8; <<flush>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6); cqlsh:test> SELECT * FROM cfsst3 ; pk \| ck \| rc ----+----+------ 2 \| 3 \| null 4 \| 6 \| null (2 rows) <<Scylla restart>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10); cqlsh:test> SELECT * from cfsst3 ; pk \| ck \| rc ----+----+------ 5 \| 7 \| null 8 \| 10 \| null 2 \| 3 \| null 4 \| 6 \| null 7 \| 9 \| null 6 \| 8 \| null (6 rows) " * 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla: database: Honour enable_sstables_mc_format configuration option. sstables: Support SSTables 'mc' format as a feature. db: Add configuration option for enabling SSTables 'mc' format. tests: Add test for reading a complex column with zero subcolumns (SST3). sstables: Fix parsing of complex columns with zero subcolumns. sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding. sstables: Use parser_type instead of abstract_type::parse_type in column_translation. bytes: Add helper for turning bytes_view into sstring_view. sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists. sstables: Fix string formatting for exception messages in m_format_read_helpers. sstables: Don't validate timestamps against the max value on parsing. sstables: Always store only min bases in serialization_header. sstables: Support 'mc' version parsing from filename. SST3: Make sure we call consume_partition_end	2018-09-26 11:10:07 +01:00
Vladimir Krivopalov	c98937e04c	sstables: Support SSTables 'mc' format as a feature. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-25 17:23:40 -07:00
Paweł Dziepak	a3746d3b05	paging: make may_need_paging() more conservative There is a bad interaction between may_need_paging() and query result size limiter. The former is trying to avoid the complexity of paged queries when the number of returned rows is going to be smaller than the page size. The latter uses the fact that paged queries need not return all requested rows to limit the size of a query results. Since may_need_paging() may turn a paged query into non-paged one as a side effect it disables the oversized result protection. This patch limits the cases when may_need_paging() disables paging to the situations when we know for sure that query result size limiter won't be needed, i.e.: the result is not going to contain more than one row. If the client knows for sure that the paging is not needed and the performance impact is worthwhile it can disable paging on its side. Otherwise, let's default to the safer behaviour. Fixes #3620. Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>	2018-09-25 17:01:04 +03:00
Tomasz Grabiec	82270c8699	storage_proxy: Fix misqualification of reads as foreground or background in some cases The foreground reads metric is derived from the number of live read executors minus the number of background reads. Background reads are counted down when their resolver times out. However, a read executor may still be around for a while, resulting in such reads being accounted as foreground. Usually, the gap in which this happens is short, because executor reference holders timeout quickly as well. It's not always the case though. For instance, local read executor doesn't time out quickly when the target shard has an overloaded CPU, and it takes a while before the request goes through all the queues, even if IO is not involved. Observed in #3628. Fixes #3734. Another problem is that all reads which received CL responses are accounted as background, until all replicas respond, but if such read needs reconciliation, it's still practically a foreground read and should be accounted as such. Found during code review. Fixes #3745. This patch fixes both issues by rearranging accounting to track foreground reads instead of background reads, and considering all reads as foreground until the resulting promise is resolved. Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>	2018-09-05 20:42:51 +03:00
Asias He	89b769a073	storage_service: Wait for range setup before announcing join status When a joining node announcing join status through gossip, other existing nodes will send writes to the joining node. At this time, it is possible the joining node hasn't learnt the tokens of other nodes that causes the error like below: token_metadata - sorted_tokens is empty in first_token_index! storage_proxy - Failed to apply mutation from 127.0.4.1#0: std::runtime_error (sorted_tokens is empty in first_token_index!) To fix, wait for the token range setup before announcing the join status. Fixes: #3382 Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>	2018-09-05 10:51:43 +03:00
Botond Dénes	cd49c23a66	query_pagers: generate query_uuid for range-scans as well And thus enable stateful range scans.	2018-09-03 10:31:44 +03:00
Botond Dénes	6486d6c8bd	storage_proxy: use preferred/last replicas	2018-09-03 10:31:44 +03:00
Botond Dénes	577a06ce1b	storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent	2018-09-03 10:31:44 +03:00
Botond Dénes	6e59cee244	db::consistency_level::filter_for_query() add preferred_endpoints To the second overload (the one without read-repair related params) too.	2018-09-03 10:31:44 +03:00
Botond Dénes	2f66bde26f	storage_proxy: use query_mutations_from_all_shards() for range scans	2018-09-03 10:31:44 +03:00
Paweł Dziepak	6f1c3e6945	Merge "Convert more execution_stages to inherit scheduling_groups" from Avi " Previous work (`71471bb322`) converted the CQL layer to inheriting execution stages, paving the way to multiple users sharing the front-end. This patchset does the same thing to the back-end, converting more execution stages to preserve the caller's scheduling_group. Since RPC now (`8c993e0728`) assigns the correct scheduling group within the replica, we can extend that work so a statement is executed with the same scheduling group all the way to sstable parsing, even if we cross nodes in the process. This improves performance isolation and paves the way to multi-user SLA guarantees. " * tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla: database: make database's mutation apply stage inherit its scheduling group from the caller database: make database::_mutation_query_stage inherit the scheduling group database: make database::_data_query_stage inheriting its caller's scheduling_group storage_proxy: make _mutate_stage inherit its caller's scheduling_group	2018-08-28 13:49:31 +01:00
Avi Kivity	5792a59c96	migration_manager: downgrade frightening "Can't send migration request" ERROR This error is transient, since as soon as the node is up we will be able to send the migration request. Downgrade it to a warning to reduce anxiety among people who actually read the logs (like QA). The message is also badly worded as no one can guess what a migration request is, but that is left to another patch. Fixes #3706. Message-Id: <20180821070200.18691-1-avi@scylladb.com>	2018-08-27 14:49:36 +02:00
Avi Kivity	908e497f3d	storage_proxy: make _mutate_stage inherit its caller's scheduling_group Right now, storage_proxy's mutate_stage violates isolation by running in a plain execution_stage without a scheduling_group. This means do_mutate() will run under the main scheduling_group, at least until we reach the database apply execution stage, which is correct. Fix by moving to an inheriting execution stage; this works because the messaging service will tell RPC to set the correct execution stage for us. We could explicitly specify statement_scheduling_group, but inheriting the scheduling group allows us to have multiple statment scheduling groups, later.	2018-08-24 19:04:49 +03:00
Gleb Natapov	7277ee2939	storage_proxy: do not fail read without speculation on connection error After `ac27d1c93b` if a read executor has just enough targets to achieve request's CL and a connection to one of them will be dropped during execution ReadFailed error will be returned immediately and client will not have a chance to issue speculative read (retry). The patch changes the code to not return ReadFailed error immediately, but wait for timeout instead and give a client chance to issue speculative read in case read executor does not have additional targets to send speculative reads to by itself. Fixes #3699. Message-Id: <20180819131646.GK2326@scylladb.com>	2018-08-20 10:12:31 +03:00
Duarte Nunes	a025bf6a7d	Merge seastar upstream Seastar introduced a "compat" namespace, which conflicts with Scylla's own "compat" namespaces. The merge thus includes changes to scope uses of Scylla's "compat" namespaces. * seastar 8ad870f...9bb1611 (5): > util/variant_utils: Ensure variant_cast behaves well with rvalues > util/std-compat: Fix infinite recursion > doc/tutorial: Undo namespace changes > util/variant_utils: Add cast_variant() > Add compatbility with C++17's library types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-08-14 13:07:09 +01:00
Piotr Sarna	8c18aaa511	cql3: pass query options to restrictions filter Query options may contain bound values needed for checking filtering restrictions. Previously, empty query_options{} were used, which caused prepared statements to fail. Fixes #3677	2018-08-09 17:44:45 +02:00
Amnon Heiman	80b1ef0f47	storage_service: Add nodes_status related metrics This patch adds a metric for a node own operation mode, the operation_mode metric represent the enum modes as gauge values according to: UNKNOWN = 0, STARTING = 1, JOINING = 2, NORMAL = 3, LEAVING = 4, DECOMMISSIONED = 5, DRAINING = 6, DRAINED = 7, MOVING = 8 Fixes: #3482 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20180806142706.23579-1-amnon@scylladb.com>	2018-08-06 18:19:56 +03:00

1 2 3 4 5 ...

1286 Commits