scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 19:21:01 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	1d9b88dceb	service/storage_service: Introduce cluster_schema_features()	2019-04-28 15:50:12 +02:00
Tomasz Grabiec	6e2c190b5f	schema_tables: Propagate storage_service& to merge_schema() We will need to calculate cluster schema features at the time we calculate the schema digest.	2019-04-28 12:33:10 +02:00
Piotr Sarna	037b517c85	service: initialize system distributed keyspace after schema agreement In order to avoid schema disagreements during upgrades (which may lead to deadlocks), system distributed keyspace initialization is moved right before starting the bootstrapping process, after the schema agreement checks already succeeded. Fixes #3976 Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>	2019-04-25 18:44:08 +02:00
Gleb Natapov	c6b3b9ff13	cache_hitrate_calculator: wait for ongoing calculation to complete during stop Currently stop returns ready future immediately. This is not a problem since calculation loop holds a shared pointer to the local service, so it will not be destroyed until calculation completes and global database object db, that is also used by the calculation, is never destroyed. But the latter is just a workaround for a shutdown sequence that cannot handle it and will be changed one day. Make cache hitrate calculation service ready for it. Message-Id: <20190422113538.GR21208@scylladb.com>	2019-04-22 14:44:42 +03:00
Gleb Natapov	306f5b99b5	cache_hitrate_calculator: fix use after free in non_system_filter lambda non_system_filter lambda is defined static which means it is initialized only once, so the 'this' that it will capture will belong to the shard where the function runs first. During service destruction the function may run on a different shard and access the other shard's service, which may already be freed. Fixed #4425 Message-Id: <20190421152139.GN21208@scylladb.com>	2019-04-21 18:22:31 +03:00
Piotr Jastrzebski	2c599122e1	Update supported features on format change Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:38:31 +02:00
Piotr Jastrzebski	9c7e3dd470	Add _unbounded_range_tombstones_feature This requires introduction of storage_service::get_known_features and using it with check_knows_remote_features. Otherwise a node joining the existing cluster won't be able to join because it does not support unbounded range tombstones yet. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	96ad8f7df9	Use _sstables_format to determine current format Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	7339e9de30	Add service::read_sstables_format Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:37:12 +02:00
Piotr Jastrzebski	9934740c39	Register feature listeners in storage_service Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 10:36:58 +02:00
Piotr Jastrzebski	081542cf00	storage_service: add _sstables_format field Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-04-12 09:33:40 +02:00
Rafael Ávila de Espíndola	6191fd7701	Avoid duplicated read_keyspace_mutation calls There were many calls to read_keyspace_mutation. One in each function that prepares a mutation for some other schema change. With this patch they are all moved to a single location. Tests: unit (dev, debug) Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20190328024440.26201-1-espindola@scylladb.com>	2019-04-07 09:26:56 +03:00
Vlad Zolotarov	0dc0a6025d	query_pager::fetch_page: cosmetics: fix code alignment Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20190401214030.5570-2-vladz@scylladb.com>	2019-04-02 11:53:10 +03:00
Avi Kivity	f259a4c3b4	Merge "Remove usage of static gossiper object in init.cc and storage_service" from Asias " This series removes the usage of the static gossiper object in init.cc and storage_service. Follow up series will remove more in other components. This is the effort to clean up the component dependencies and have better shutdown procedure. Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test, dtests. " * tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev: storage_service: Do not use the global gms::get_local_gossiper() storage_service: Pass gossiper object to storage_service gms: Remove i_failure_detector.hh gossip: Get rid of the gms::get_local_failure_detector static object dht: Do not use failure_detector::is_alive in failure_detector_source_filter tests: Fix stop snitch in gossip_test.cc gossiper: Do not use value_factory from storage_service object gossiper: Use cfg options from _cfg instead of get_local_storage_service gossiper: Pass db::config object to gossiper class init: Pass gossiper object to init_ms_fd_gossiper	2019-03-26 08:54:46 +02:00
Duarte Nunes	93a1c27b31	service/storage_proxy: Don't consider view hints for MV backpressure When a view replica becomes unavailable, updates to it are stored as hints at the paired base replica. This on-disk queue of pending view updates grows as long as there are view updates and the view replica remains unavailable. Currently, we take that relative queue size into account when calculating the delay for new base writes, in the context of the backpressure algorithm for materialized views. However, the way we're calculating that on-disk backlog is wrong, since we calculate it per-device and then feed it to all the hints managers for that device. This means that normal hints will show up as backlog for the view hints manager, which in turn introduces delays. This can make the view backpressure mechanism kick-in even if the cluster uses no materialized views. There's yet another way in which considering the view hints backlog is wrong: a view replica that is unavailable for some period of time can cause the backlog to grow to a point where all base writes are applied the maximum delay of 1 second. This turns a single-node failure into cluster unavailability. The fix to both issues is to simply not take this on-disk backlog into account for the backpressure algorithm. Fixes #4351 Fixes #4352 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190321170418.25953-1-duarte@scylladb.com>	2019-03-24 20:29:56 +02:00
Asias He	7447c92d63	storage_service: Do not use the global gms::get_local_gossiper() Use the gossiper object stored in _gossiper member from storage_service.	2019-03-22 09:11:26 +08:00
Asias He	b91452ed4c	storage_service: Pass gossiper object to storage_service Pass the gossiper object to storage_service class in order to avoid the usage of the static object returned from get_local_gossiper().	2019-03-22 09:11:26 +08:00
Asias He	af579a055b	gossip: Get rid of the gms::get_local_failure_detector static object Store the failure_detector object inside gossiper object. - No more the global object sharded<failure_detector> - No need to initialize sharded<failure_detector> manually which simplifies the code in tests/cql_test_env.cc and init.cc.	2019-03-22 09:08:51 +08:00
Asias He	2b6a4050c2	dht: Do not use failure_detector::is_alive in failure_detector_source_filter Switch failure_detector_source_filter to use get_local_gossiper::is_alive directly since we are going to remove the static gms::get_local_failure_detector object soon. Pass the nodes that are down to the filter directly, so that the range_streamer does not depend on gossiper at all.	2019-03-22 08:26:47 +08:00
Asias He	c0f744b407	storage_service: Wait for gossip to settle only if do_bind is set In commit `71bf757b2c`, we call wait_for_gossip_to_settle() which takes some time to complete in storage_service::prepare_to_join(). In tests/cql_query_test calls init_server with do_bind == false which in turn calls storage_service::prepare_to_join(). Since in the test, there is only one node, there is no point to wait for gossip to settle. To make the cql_query_test fast again, do not call wait_for_gossip_to_settle if do_bind is false. Before this patch, cql_query_test takes forever to complete. After it takes 10s. Tests: tests/cql_query_test Message-Id: <3ae509e0a011ae30eef3f383c6a107e194e0e243.1553147332.git.asias@scylladb.com>	2019-03-21 12:46:00 -03:00
Glauber Costa	34b640993f	storage proxy: add tracepoints about delays When we are tracing requests, we would like to know everything that happened to a query that can contribute to it having increased latencies. We insert some of those latencies explicitly due to throttling, but we do not log that into tracing. In the case of storage proxy, we do have a log message at trace level but that is rarely used: trace messages are too heavy of a hammer, there is no way to specify specific queries, etc. The correct place for that is CQL tracing. This patch moves that message to CQL tracing. We also add a matching tracepoint assuring us that no delay happened if that's the case. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20190320163350.15075-1-glauber@scylladb.com>	2019-03-21 12:45:52 -03:00
Avi Kivity	eddb98e8c6	Merge "sstables: mc: Write and read static compact tables the same way as Cassandra" from Tomasz " Static compact tables are tables with compact storage and no clustering columns. Before this patch, Scylla was writing rows of static compact tables as clustered rows instead of as static rows. That's because in our in-memory model such tables have regular rows and no static row. In Cassandra's schema (since 3.x), those tables have columns which are marked as static and there are no regular columns. This worked fine as long as Scylla was writing and reading those sstables. But when importing sstables from Cassandra, our reader was skipping the static row, since it's not present in our schema, and returning no rows as a result. Also, Cassandra, and Scylla tools, would have problems reading those sstables. Fix this by writing rows for such tables the same way as Cassandra does. In order to support rolling downgrade, we do that only when all nodes are upgraded. Fixes #4139. Tests: - unit (dev) " * tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla: tests: sstables: Test reading of static compact sstable generated by Cassandra tests: sstables: Add test for writing and reading of static compact tables sstables: mc: Write static compact tables the same way as Cassandra sstable: mc: writer: Set _static_row_written inside write_static_row() sstables: Add sstable::features() sstables: mc: writer: Prepare write_static_row() for working with any column_kind storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag sstables: mc: writer: Build indexed_columns together with serialization_header sstables: mc: writer: De-optimize make_serialization_header() sstable: mc: writer: Move attaching of mc-specific components out of generic code	2019-03-21 12:45:51 -03:00
Nadav Har'El	7c874057f5	materialized_views: propagate "view virtual columns" between nodes db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed to list the same schema tables - the former is the list of their names, and the latter is the list of their schemas. This code duplication makes it easy to forget to update one of them, and indeed recently the new "view_virtual_columns" was added to all_tables() but not to ALL. What this patch does is to make ALL a function instead of constant vector. The newly named all_table_names() function uses all_tables() so the list of schema tables only appears once. So that nobody worries about the performance impact, all_table_names() caches the list in a per-thread vector that is only prepared once per thread. Because after this patch all_table_names() has the "view_virtual_columns" that was previously missing, this patch also fixes #4339, which was about virtual columns in materialized views not being propagated to other nodes. Unfortunately, to test the fix for #4339 we need a test with multiple nodes, so we cannot test it here in a unit test, and will instead use the dtest framework, in a separate patch. Fixes #4339 Branches: 3.0 Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190320063437.32731-1-nyh@scylladb.com>	2019-03-20 09:14:59 -03:00
Nadav Har'El	ccf731a820	Materialized views: add metric for current flow-control delay The materialized views flow control mechanism works by adding a certain delay to each client request, designed to slow down the client to the rate at which we can complete the background view work. Until now we could observe this mechanism only indirectly, in whether or not it succeeded to keep the view backlog bounded; But we had no way to directly observe the delay that we decided to add. In fact, we had a bug where this delay was constantly zero, and we didn't even notice :-) So in this patch we add a new metric, scylla_storage_proxy_coordinator_last_mv_flow_control_delay The metric is a floating point number, in units of seconds. This metric is somewhat peculiar in that it always contains the last delay used for some request - unlike other metrics it doesn't measure the "current" value of something. Moreover, it can jump wildly because there is no guarantee that each request's delay will be identical (in particular, different requests may involve different base replicas which have different view backlogs, so decide on different delays). In the future we may want to supplement this metric with some sort of delay histogram. But even this simple metric is already useful to debug certain scenarios and understand if the materialized-views flow control is working or not. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190227133630.26328-1-nyh@scylladb.com>	2019-03-20 09:14:59 -03:00
Asias He	71bf757b2c	gossiper: Enable features only after gossip is settled n1, n2, n3 in the cluster, shutdown n1, n2, n3 start n1, n2 start n3, we saw features are enabled using the system table while n1 and n2 are already up and running in the cluster. INFO 2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH} INFO 2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH} INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH} The problem is we enable the features too early in the start up process. We should enable features after gossip is settled. Fixes #4289 Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>	2019-03-18 18:25:29 +01:00
Tomasz Grabiec	fefef7b9eb	storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag When enabled on all nodes, sstable writers will start to produce correct MC-format sstables for compact storage tables by writing rows into the static row (like C*) rather than into the regular row. We only do that when all nodes are upgraded to support rolling downgrade. After all nodes are upgraded, regular rolling downgrade will not be possible. Refs #4139	2019-03-18 11:18:33 +01:00
Asias He	1d59f26c11	gossiper: Fix empty remote common_features in check_knows_remote_features Three nodes in the cluster node1, node2, node3 Shutdown the whole cluster Start node1 Start node2, node2 sees empty remote common_features. gossip - Feature check passed. Local node 127.0.0.2 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {} The problem is node3 hasn't started yet, node1 sees node3 has empty features. In get_supported_features(), an empty common features will be returned if an empty features of a node is seen. To fix, we should fallback to use the features saved in system table. Start node3, node3 sees empty remote common_features. gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {} The problem is node3 hasn't inserted its own features into gossip endpoint_state_map. get_supported_features() returns the common features of all nodes in endpoint_state_map. To fix, we should fallback to use the features stored in the system table for such node in this case. Fixes #4225	2019-03-18 10:56:10 +01:00
Piotr Sarna	2e05d86cf3	service: reduce number of spawned threads when notifying Commit `9c544df217` introduced running up/down/join/leave notifications in threaded context, but spawned a thread for every notification, while it could be done once for all notifiees. Reported-by: Avi Kivity <avi@scylladb.com> Message-Id: <34815d5aa11902c4a052cff38f4c45c45ff919d8.1552897848.git.sarna@scylladb.com>	2019-03-18 10:45:47 +02:00
Piotr Sarna	9c544df217	service: run notifying code in threaded context In order to allow yielding when handling endpoint lifecycle changes, notifiers now run in threaded context. Implementations which used this assumption before are supplemented with assertions that they indeed run in seastar::async mode. Fixes #4317 Message-Id: <45bbaf2d25dac314e4f322a91350705fad8b81ed.1552567666.git.sarna@scylladb.com>	2019-03-14 12:56:53 +00:00
Piotr Sarna	aea4b7ea78	service: remove unused stop_hints_manager Stopping hints manager now occurs when draining storage proxy and it shouldn't be executed independently, so it's removed from external API.	2019-03-07 13:44:06 +01:00
Piotr Sarna	cc806909d7	storage_proxy: add drain_on_shutdown implementation When storage proxy is shutting down, all interruptible writes can be timed out in order not to wait for them. Instead, the mechanism will fall back to storing hints and/or not progressing with view building.	2019-03-07 13:44:05 +01:00
Piotr Sarna	c61d0ee8aa	main: register storage proxy as lifecycle subscriber In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes #3826 Fixes #4028	2019-03-07 12:10:40 +01:00
Piotr Sarna	92df1d5a6b	storage_proxy: add endpoint_lifecycle_subscriber interface Storage proxy is able to react to membership changes in order to cancel long-standing operations for an endpoint.	2019-03-07 12:10:40 +01:00
Piotr Sarna	f9ff97511f	storage_proxy: register view update handlers for view write type View update handlers have a specialized class, so all writes of type write_type::VIEW are now registered as such.	2019-03-07 12:10:40 +01:00
Piotr Sarna	75ec5fa876	storage_proxy: add intrusive list of view write handlers In order to be able to iterate over view update write response handlers, an intrusive list of them is added to storage proxy. This way iteration can be easily yielded without invalidating operators and all logic is moved to slow path.	2019-03-07 12:10:40 +01:00
Piotr Sarna	c2048a0758	storage_proxy: add view_update_write_response_handler View update write response handler inherits from a regular write response handler, but it's also possible to link it intrusively in order to be able to induce timeouts on them later.	2019-03-07 12:10:40 +01:00
Jesse Haber-Kucharsky	a139afc30c	auth: Reject logins from disallowed roles When the `LOGIN` option for a role is set to `false`, Scylla should not permit the role to log in. Fixes #4284 Tests: unit (debug)	2019-02-28 15:02:53 -05:00
Avi Kivity	88322086cb	Merge "Add fuzzer-type unit test for range scans" from Botond " This series adds a fuzzer-type unit test for range scans, which generates a semi-random dataset and executes semi-random range scans against it, validating the result. This test aims to cover a wide range of corner cases with the help of randomness. Data and queries against it are generated in such a way that various corner cases and their combinations are likely to be covered. The infrastructure under range-scans have gone under massive changes in the last year, growing in complexity and scope. The correctness of range scans is critical for the correct functioning of any Scylla cluster, and while the current unit tests served well in detecting any major problems (mostly while developing), they are too simplistic and can only be relied on to check the correctness of the basic functionality. This test aims to extend coverage drastically, testing cases that the author of the range-scan code or that of the existing unit tests didn't even think exists, by relying on some randomness. Fixes: #3954 (deprecates really) " * 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla: tests/multishard_mutation_query_test: add fuzzy test tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan() tests/test_table: add advanced `create_test_table()` overload tests/test_table: make `create_test_table()` customizable query: add trim_clustering_row_ranges_to() tests/test_table: add keyspace and table name params tests/test_table: s/create_test_cf/create_test_table/ tests: move create_test_cf() to tests/test_table.{hh,cc} tests/multishard_mutation_query_test: drop many partition test tests/multishard_mutation_query_test: drop range tombstone test	2019-02-27 17:26:53 +02:00
Nadav Har'El	da54d0fc7d	Materialized views: fix accidental zeroing of flow-control delay The materialized-views flow control carefully calculates an amount of microseconds to delay a client to slow it down to the desired rate - but then a typo (std::min instead of std::max) causes this delay to be zeroed, which in effect completely nullifies the flow control algorithm. Before this fix, experiments suggested that view flow control was not having any effect and view backlog not bounded at all. After this fix, we can see the flow control having its desired effect, and the view backlog converging. Fixes #4143. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190226161452.498-1-nyh@scylladb.com>	2019-02-26 18:22:18 +02:00
Tomasz Grabiec	b06aac4fdb	Merge "Fix temporary spurious schema version mismatch when nodes are restarted" from Asias Fixes: #4148 Fixes: #4258 Tests: resharding_test.py:reshardingtest_nodes4_with_sizetieredcompactionstrategy.resharding_by_smp_increase_test * seastar-dev.git asias/fix_schema_mismatch_when_nodes_restarts/v1: database: Add update_schema_version and announce_schema_version storage_service: Add application_state::SCHEMA when gossip starts	2019-02-26 12:55:52 +01:00
Avi Kivity	5f94bc902a	transport: add option to disable shard-aware drivers The shard-aware drivers can cause a huge amount of connections to be created when there are tens of thousands of clients. While normally the shard-aware drivers are beneficial, in those cases they can consume too much memory. Provide an option to disable shard awareness from the server (it is likely to be easier to do this on the server than to reprovision those thousands of clients). Tests: manual test with wireshark. Message-Id: <20190223173331.24424-1-avi@scylladb.com>	2019-02-26 12:44:11 +01:00
Asias He	459836079c	storage_service: Add application_state::SCHEMA when gossip starts In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw: 4 nodes in the tests n1, n2, n3, n4 are started n1 is stopped n1 is changed to use different shard config n1 is restarted ( 2019-01-27 04:56:00,377 ) The backtrace happened on n2 right after n1 restarts: 0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled 1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled 2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled 3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed) 4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status = 5 Segmentation fault on shard 0. 6 Backtrace: 7 0x00000000041c0782 8 0x00000000040d9a8c 9 0x00000000040d9d35 10 0x00000000040d9d83 11 /lib64/libpthread.so.0+0x00000000000121af 12 0x0000000001a8ac0e 13 0x00000000040ba39e 14 0x00000000040ba561 15 0x000000000418c247 16 0x0000000004265437 17 0x000000000054766e 18 /lib64/libc.so.6+0x0000000000020f29 19 0x00000000005b17d9 The theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time n1 has SCHEMA application_state, when n1 restarts, n2 gets new application state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty application_state for SCHEMA. We dereference the nullptr application_state and abort. In commit `da80f27f44`, we fixed the problem by checking the pointer before dereference. To prevent this from happening in the first place, we had better add application_state::SCHEMA when gossip starts. This way, peer nodes always see the application_state::SCHEMA when a node restarts. Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test Fixes #4148 Fixes #4258	2019-02-26 19:30:22 +08:00
Piotr Sarna	c743617236	cql3: unify max value for row limit and per-partition limit Limits are stored as uint32_t everywhere, but in some places int32_t was used, which created inconsistencies when comparing the value to std::numeric_limits<Type>::max(). In order to solve inconsistencies, the types are unified to uint32_t, and instead of explicitly calling numeric limit max, an already existing constant value query::max_rows is utilized. Fixes #4253 Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>	2019-02-21 13:56:02 +02:00
Duarte Nunes	6e83457b1b	Merge 'Add PER PARTITION LIMIT' from Piotr " This series introduces PER PARTITION LIMIT to CQL. Protocol and storage is already capable of applying per-partition limits, so for nonpaged queries the changes are superficial - a variable is parsed and passed down. For paged queries and filtering the situation is a little bit more complicated due to corner cases: results for one partition can be split over 2 or more pages, filtering may drop rows, etc. To solve these, another variable is added to paging state - the number of rows already returned from last served partition. Note that "last" partition may be stretched over any number of pages, not just the last one, which is a case especially when considering filtering. As a result, per-partition-limiting queries are not eligible for page generator optimization, because they may need to have their results locally filtered for extraneous rows (e.g. when the next page asks for per-partition limit 5, but we already received 4 rows from the last partition, so need just 1 more from last partition key, but 5 from all next ones). Tests: unit (dev) Fixes #2202 " * 'add_per_partition_limit_3' of https://github.com/psarna/scylla: tests: remove superficial ignore_order from filtering tests tests: add filtering with per partition key limit test tests: publish extract_paging_state and count_rows_fetched tests: fix order of parameters in with_rows_ignore_order cql3,grammar: add PER PARTITION LIMIT idl,service: add persistent last partition row count cql3: prevent page generator usage for per-partition limit cql3: add checking for previous partition count to filtering pager: add adjusting per-partition row limit cql3: obey per partition limit for filtering cql3: clean up unneeded limit variables cql3: obey per partition limit for select statement cql3: add get_per_partition_limit cql3: add per_partition_limit to CQL statement	2019-02-18 14:47:11 +00:00
Piotr Sarna	acf7bedad4	idl,service: add persistent last partition row count In order to process paged queries with per-partition limits properly, paging state needs to keep additional information: what was the row count of last partition returned in previous run. That's necessary because the end of previous page and the beginning of current one might consist of rows with the same partition key and we need to be able to trim the results to the number indicated by per-partition limit.	2019-02-18 11:06:44 +01:00
Piotr Sarna	1dadae212a	cql3: add checking for previous partition count to filtering Filtering now needs to take into account per partition limits as well, and for that it's essential to be able to compare partition keys and decide which rows should be dropped - if previous page(s) contained rows with the same partition key, these need to be taken into consideration too.	2019-02-18 11:06:43 +01:00
Piotr Sarna	82a3883575	pager: add adjusting per-partition row limit For filtering pagers, per partition limit should be set to page size every time a query is executed, because some rows may potentially get dropped from results.	2019-02-18 10:55:52 +01:00
Piotr Sarna	b965c3778f	cql3: obey per partition limit for filtering Filtering queries now take into account the limit of rows per single partition provided by the user.	2019-02-18 10:29:34 +01:00
Gleb Natapov	b01a659014	storage_proxy: remove old Cassandra code Part of the code is already implemented (counters and hinted-handoff). Part of the code will probably never be (triggers). And the rest is the code that estimates number of rows per range to determine query parallelism, but we implemented exponential growth algorithms instead. Message-Id: <20190214112226.GE19055@scylladb.com>	2019-02-18 10:34:55 +02:00
Avi Kivity	a1567b0997	Merge "replace get_restricted_ranges() function with generator interface" from Gleb " get_restricted_ranges() is inefficient since it calculates all vnodes that cover the requested key ranges in advance, but callers often use only the first one. Replace the function with generator interface that generates requested number of vnodes on demand. " * 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev: storage_proxy: limit amount of precalculated ranges by query_ranges_to_vnodes_generator storage_proxy: remove old get_restricted_ranges() interface cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface storage_proxy: introduce new query_ranges_to_vnode_generator interface	2019-02-18 10:33:54 +02:00
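Commit da54d0fc7d in the log above describes a flow-control delay that was zeroed because std::min was written where std::max was intended. The following is a minimal sketch of that class of bug, assuming the computed delay is clamped against zero. The function names and the exact clamping expression are hypothetical illustrations and are not taken from the Scylla source.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::microseconds;

// Hypothetical buggy clamp: std::min against zero can never return a
// positive value, so every positive delay collapses to zero.
microseconds clamp_delay_buggy(microseconds computed) {
    return std::min(computed, microseconds::zero());
}

// Hypothetical intended clamp: std::max against zero keeps positive
// delays and only raises negative ones to zero.
microseconds clamp_delay_fixed(microseconds computed) {
    return std::max(computed, microseconds::zero());
}

// Small usage example; call it from main() to check the behaviour.
void run_delay_examples() {
    assert(clamp_delay_buggy(microseconds(250)) == microseconds::zero());
    assert(clamp_delay_fixed(microseconds(250)) == microseconds(250));
    assert(clamp_delay_fixed(microseconds(-5)) == microseconds::zero());
}
```

Both variants compile and look plausible in review, which may be why, as the commit message notes, the zeroed delay went unnoticed until the backlog was observed directly.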

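Commits 6e83457b1b and acf7bedad4 in the log above explain that paged queries with PER PARTITION LIMIT need the paging state to remember how many rows of the last partition were already returned. A rough sketch of that idea follows, assuming rows arrive grouped by partition key. The `row`, `paging_state` and `trim_page` names are hypothetical and simplified; they do not mirror Scylla's actual pager types.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified row: an integer partition key and a payload.
struct row {
    int partition_key;
    int value;
};

// Hypothetical paging state carried from one page to the next.
struct paging_state {
    std::optional<int> last_partition_key;
    uint32_t rows_from_last_partition = 0;
};

// Keep at most per_partition_limit rows per partition, counting rows
// already returned for the last partition on earlier pages.
std::vector<row> trim_page(const std::vector<row>& page,
                           uint32_t per_partition_limit,
                           paging_state& state) {
    std::vector<row> kept;
    for (const row& r : page) {
        if (!state.last_partition_key || *state.last_partition_key != r.partition_key) {
            // A new partition starts: reset the per-partition counter.
            state.last_partition_key = r.partition_key;
            state.rows_from_last_partition = 0;
        }
        if (state.rows_from_last_partition < per_partition_limit) {
            kept.push_back(r);
            ++state.rows_from_last_partition;
        }
    }
    return kept;
}

// Usage example matching the case in the commit message: limit 5, four
// rows of partition 1 already returned, so the next page keeps one more
// row of partition 1 and five rows of partition 2.
void run_paging_examples() {
    paging_state state;
    std::vector<row> first_page{{1, 0}, {1, 1}, {1, 2}, {1, 3}};
    assert(trim_page(first_page, 5, state).size() == 4);
    std::vector<row> second_page{{1, 4}, {1, 5}, {1, 6}, {2, 0}, {2, 1},
                                 {2, 2}, {2, 3}, {2, 4}, {2, 5}};
    assert(trim_page(second_page, 5, state).size() == 6);
}
```

Without the counter in the paging state, the second page would have no way to know that partition 1 had already contributed four rows.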

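Commit 306f5b99b5 in the log above fixes a lambda that was declared static while capturing `this`. The sketch below illustrates the general C++ behaviour behind that bug, under the assumption that a plain struct can stand in for the per-shard service. The `service` type and its members are hypothetical and are not the cache_hitrate_calculator code.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-shard service instance.
struct service {
    int shard_id;

    // Buggy pattern: a static local is initialized only on the first call,
    // so the lambda keeps the `this` of whichever object called first.
    int buggy_shard() {
        static auto filter = [this] { return shard_id; };
        return filter();
    }

    // Fixed pattern: a non-static lambda captures the current `this`
    // on every call.
    int fixed_shard() {
        auto filter = [this] { return shard_id; };
        return filter();
    }
};

// Usage example. The objects are static so the pointer captured by the
// static lambda stays valid; in the real bug the captured object could
// already have been destroyed, which is a use after free.
void run_capture_examples() {
    static service first{0};
    static service second{1};
    assert(first.buggy_shard() == 0);
    assert(second.buggy_shard() == 0);  // still reads `first`
    assert(second.fixed_shard() == 1);
}
```

In this sketch the stale pointer only yields the wrong shard id; once the first object is freed, the same access becomes undefined behaviour.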
1412 Commits