scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 11:55:15 +00:00

Author	SHA1	Message	Date
Pekka Enberg	c1f6ce4251	Merge 'Fixes for the view_update_from_staging_generator' from Duarte "This series contains a couple of fixes to the view_update_from_staging_generator, the object responsible for generating view updates from sstables written through streaming. Fixes #4021" * 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla: db/view/view_update_from_staging_generator: Break semaphore on stop() db/view/view_update_from_staging_generator: Restore formatting db/view/view_update_from_staging_generator: Avoid creating more than one fiber (cherry picked from commit `96172b7bca`)	2018-12-29 20:22:54 +02:00
Avi Kivity	dbe347811c	Merge "materialized views: Apply backpressure from view replicas" from Duarte " As the amount of pending view updates increases we know that there’s a mismatch between the rate at which the base receives writes and the rate at which the view retires them. We react by applying backpressure to decrease the rate of incoming base writes, allowing the slow view replicas to catch up. We want to delay the client’s next writes to a base replica and we use the base’s backlog of view updates to derive this delay. To validate this approach we tested a 3 node Scylla cluster on GCE, using n1-standard-4 instances with NVMEs. A loader running on an n1-standard-8 instance ran cassandra-stress with 100 threads. With the delay function d(x) set to 1s, we see no base write timeouts. With the delay function as defined in the series, we see that backlogs stabilize at some (arbitrary) point, as predicted, but this stabilization co-exists with base write timeouts. However, the system overall behaves better than the current version, with the 100 view update limit, and also better than the version without such limit or any backpressure. More work is necessary to further stabilize the system. Namely, we want to keep delaying until we see the backlog is decreasing. This will require us to add more delay beyond the stabilization point, which in turn should minimize the base write timeouts, and will also minimize the amount of memory the backlog takes at each base replica. 
Design document: https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo Fixes #2538 " Reviewed-by: Nadav Har'El <nyh@scylladb.com> * 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits) service/storage_proxy: Release mutation as early as possible service/storage_proxy: Delay replica writes based on view update backlog service/storage_proxy: Get the backlog of a particular base replica service/storage_proxy: Add counters for delayed base writes main: Start and stop the view_update_backlog_broker service: Distribute a node's view update backlog service: Advertise view update backlog over gossip service/storage_proxy: Send view update backlog from replicas service/storage_proxy: Prepare to receive replica view update backlog service/storage_proxy: Expose local view update backlog tests/view_schema_test: Add simple test for db::view::node_update_backlog db/view: Introduce node_update_backlog class db/hints: Initialize current backlog database: Add counter for current view backlog database: Expose current memory view update backlog idl: Add db::view::update_backlog db/view: Add view_update_backlog database: Wait on view update semaphore for view building service/storage_proxy: Use near-infinite timeouts for view updates database: generate_and_propagate_view_updates no longer needs a timeout ... (cherry picked from commit `b66f59aa3d`)	2018-12-20 19:11:56 +02:00
Avi Kivity	8f2d24bb8f	config: remove "to be removed before release" notice mc sstable config The "enable_sstables_mc_format" config item help text wants to remove itself before release. Since scylla-3.0 did not get enough mc format mileage, we decided to leave it in, so the notice should be removed. Fixes #4003. Message-Id: <20181219082554.23923-1-avi@scylladb.com> (cherry picked from commit `dd51c659f7`)	2018-12-19 19:08:36 +02:00
Duarte Nunes	97cd9108d6	db/system_distributed_keyspace: Create the schema with min_timestamp Different nodes can concurrently create the distributed system keyspace on boot, before the "if not exists" clause can take effect. However, the resulting schema mutations will be different since different nodes use different timestamps. This patch forces the timestamps to be the same across all nodes, so we save some schema mismatches. This fixes a bug exposed by `ca5dfdf`, whereby the initialization of the distributed system keyspace is done before waiting for schema agreement. While waiting for schema agreement in storage_service::join_token_ring(), the node still hasn't joined the ring and schemas can't be pulled from it, so nodes can deadlock. A similar situation can happen between a seed node and a non-seed node, where the seed node progresses to a different "wait for schema agreement" barrier, but still can't make progress because it can't pull the schema from the non-seed node still trying to join the ring. Finally, it is assumed that changes to the schema of the current distributed system keyspace tables will be protected by a cluster feature and a subsequent schema synchronization, such that all nodes will be at a point where schemas can be transferred around. Fixes #3976 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181211113407.20075-1-duarte@scylladb.com> (cherry picked from commit `89ae3fbf11`)	2018-12-11 14:53:30 +00:00
Gleb Natapov	4acfc5ed8f	hints: make hints manager more resilient to unexpected directory content Currently if hints directory contains unexpected directories Scylla fails to start with unhandled std::invalid_argument exception. Make the manager ignore malformed files instead and try to proceed anyway. Message-Id: <20181121134618.29936-2-gleb@scylladb.com> (cherry picked from commit `b4a8802edc`)	2018-12-08 13:42:43 +02:00
Gleb Natapov	cb9199bc7f	hints: add auxiliary function for scanning high level hints directory We scan hints directory in two places: to search for files to replay and to search for directories to remove after resharding. The code that translates directory name to a shard is duplicated. It is simple now, so not a big issue, but in case it grows it is better to have it in one place. Message-Id: <20181121134618.29936-1-gleb@scylladb.com> (cherry picked from commit `9433d02624`)	2018-12-08 13:42:43 +02:00
Avi Kivity	54258ca8eb	Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte " This series changes hinted handoff to work with `frozen_mutation`s instead of naked `mutation`s. Instead of unfreezing a mutation from the commitlog entry and then freezing it again for sending, now we'll just keep the read, frozen mutation. Tests: unit(release) " * 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla: db/hints/manager: Use frozen_mutation instead of mutation db/hints/manager: Use database::find_schema() db/commitlog/commitlog_entry: Allow moving the contained mutation service/storage_proxy: send_to_endpoint overload accepting frozen_mutation service/storage_proxy: Build a shared_mutation from a frozen_mutation service/storage_proxy: Lift frozen_mutation_and_schema service/storage_proxy: Allow non-const ranges in mutate_prepare() (cherry picked from commit `1891779e64`)	2018-12-05 20:14:57 +00:00
Duarte Nunes	f8195a77b0	db/view/view_builder: Don't timeout waiting for view to be built Remove the timeout argument to db::view::view_builder::wait_until_built(), a test-only function to wait until a given materialized view has finished building. This change is motivated by the fact that some tests running on slow environments will timeout. Instead of incrementally increasing the timeout, remove it completely since tests are already run under an exterior timeout. Fixes #3920 Tests: unit release(view_build_test, view_schema_test) Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181115173902.19048-1-duarte@scylladb.com> (cherry picked from commit `6fbf792777`)	2018-12-05 19:20:36 +00:00
Duarte Nunes	5b724c80ab	db/view: Don't copy keyspace name Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181022104527.14555-1-duarte@scylladb.com> (cherry picked from commit `f3a5ec0fd9`)	2018-12-05 19:19:26 +00:00
Nadav Har'El	4a7ae81b3f	materialized views: update stats.write statistics in all cases mutate_MV usually calls send_to_endpoint() to push view update to remote view replicas. This function gets passed a statistics object, service::storage_proxy_stats::write_stats and, in particular, updates its "writes" statistic which counts the number of ongoing writes. In the case that the paired view replica happens to be the same node, we avoid calling send_to_endpoint() and call mutate_locally() instead. That function does not take a write_stats object, so the "writes" statistic doesn't get incremented for the duration of the write. So we should do this explicitly. Co-authored-by: Nadav Har'El <nyh@scylladb.com> Co-authored-by: Duarte Nunes <duarte@scylladb.com> (cherry picked from commit `1d5f8d0015`)	2018-12-05 19:19:26 +00:00
Duarte Nunes	9776a048e7	Merge 'Generating view updates during streaming' from Piotr During streaming, there are cases when we should invoke the view write path. In particular, if we're streaming because of repair or if a view has not yet finished building and we're bootstrapping a new node. The design constraints are: 1) The streamed writes should be visible to new writes, but the sstable should not participate in compaction, or we would lose the ability to exclude the streamed writes on a restart; 2) The streamed writes must not be considered when generating view updates for them; 3) Resilient to node restarts; 4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges. We achieve this by writing the streamed writes to an sstable in a different folder, call it "staging". We achieve 1) by publishing the sstable to the column family sstable set, but excluding it from compactions. We do these steps upon boot, by looking at the staging directory, thus achieving 3). 
Fixes #3275 * 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits) tests: add materialized views test tests: add view update generator to cql test env main: add registering staging sstables read from disk database: add a check if loaded sstable is already staging database: add get_staging_sstable method streaming: stream tables with views through staging sstables streaming: add system distributed keyspace ref to streaming streaming: add view update generator reference to streaming main: add generating missed mv updates from staging sstables storage_service: move initializing sys_dist_ks before bootstrap db/view: add view_update_from_staging_generator service db/view: add view updating consumer table: add stream_view_replica_updates table: split push_view_replica_updates table: add as_mutation_source_excluding table: move push_view_replica_updates to table.cc database: add populating tables with staging sstables database: add creating /staging directory for sstables database: add sstable-excluding reader table: add move_sstable_from_staging_in_thread function ... (cherry picked from commit `a38f6078fb`)	2018-11-15 17:46:20 +02:00
Vlad Zolotarov	c6de9ea39b	config: enable hinted handoff by default Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20181019180401.12400-1-vladz@scylladb.com> (cherry picked from commit `4d1bb719a4`)	2018-11-01 10:41:44 +02:00
Nadav Har'El	996b86b804	Materialized views: fix race condition in resharding while view building When a node reshards (i.e., restarts with a different number of CPUs), and is in the middle of building a view for a pre-existing table, the view building needs to find the right token from which to start building on all shards. We ran the same code on all shards, hoping they would all make the same decision on which token to continue. But in some cases, one shard might make the decision, start building, and make progress - all before a second shard goes to make the decision, which will now be different. This resulted, in some rare cases, in the new materialized view missing a few rows when the build was interrupted with a resharding. The fix is to add the missing synchronization: All shards should make the same decision on whether and how to reshard - and only then should start building the view. Fixes #3890 Fixes #3452 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20181028140549.21200-1-nyh@scylladb.com> (cherry picked from commit `b8337f8c9d`)	2018-10-29 09:52:25 +00:00
Avi Kivity	6bf4a73d88	thrift: limit message size Limit message size according to the configuration, to avoid a huge message from allocating all of the server's memory. We also need to limit memory used in aggregate by thrift, but that is left to another patch. Fixes #3878. Message-Id: <20181024081042.13067-1-avi@scylladb.com> (cherry picked from commit `a9836ad758`)	2018-10-24 19:32:25 +03:00
Avi Kivity	52be02558e	config: mark range_request_timeout_in_ms and request_timeout_in_ms as Used This makes them available in scylla --help. Fixes #3884. Message-Id: <20181023101150.29856-1-avi@scylladb.com> (cherry picked from commit `d9e0ea6bb0`)	2018-10-24 09:43:54 +03:00
Avi Kivity	a7cbfbe63f	Merge "hinted handoff: give a sender a low priority" from Vlad " Hinted handoff should not overpower regular flows like READs, WRITEs or background activities like memtable flushes or compactions. In order to achieve this put its sending in the STREAMING CPU scheduling group and its commitlog object into the STREAMING I/O scheduling group. Fixes #3817 " * 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla: db::hints::manager: use "streaming" I/O scheduling class for reads commitlog::read_log_file(): set the a read I/O priority class explicitly db::hints::manager: add hints sender to the "streaming" CPU scheduling group (cherry picked from commit `1533487ba8`)	2018-10-24 09:43:39 +03:00
Duarte Nunes	28fd2044d2	Merge 'hinted handoff: add manager::state and split storing and replaying enablement' from Vlad " Refs #3828 (Probably fixes it) We found a few flaws in a way we enable hints replaying. First of all it was allowed before manager::start() is complete. Then, since manager::start() is called after messaging_service is initialized there was a time window when hints are rejected and this creates an issue for MV. Both issues above were found in the context of #3828. This series fixes them both. Tested {release}: dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test dtest: hintedhandoff_additional_test.py " * 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla: hinted handoff: enable storing hints before starting messaging_service db::hints::manager: add a "started" state db::hints::manager: introduce a _state (cherry picked from commit `3a53b3cebc`)	2018-10-24 09:43:03 +03:00
Duarte Nunes	26c31f6798	Merge "db/hints: Expose current backlog" from Duarte " Hints are stored on disk by a hints::manager, ensuring they are eventually sent. A hints::resource_manager ensures the hints::managers it tracks don't consume more than their allocated resources by monitoring disk space and disabling new hints if needed. This series fixes some bugs related to the backlog calculation, but mainly exposes the backlog through a hints::manager so upper layers can apply flow control. Refs #2538 " * 'hh-manager-backlog/v3' of https://github.com/duarten/scylla: db/hints/manager: Expose current backlog db/hints/manager: Move decision about blocking hints to the manager db/hints/resource_manager: Correctly account resources in space_watchdog db/hints/resource_manager: Replace timer with seastar::thread db/hints/resource_manager: Ensure managers are correctly registered db/hints/resource_manager: Fix formatting db/hints: Disallow moving or copying the managers	2018-10-23 07:36:21 +00:00
Avi Kivity	337ee6153a	Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr " This patchset makes it possible to use SSTables 'mc' format, commonly referred to as 'SSTables 3.x', when running Scylla instance. Several bugs found on this way are fixed. Also, a configuration option is introduced to allow running Scylla either with 'mc' or 'la' format as default. Tests: unit {release} + tested Scylla with both 'la' and 'mc' formats to work fine: cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890] cqlsh> USE test; cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''}; cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8); <<flush>> cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8; <<flush>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6); cqlsh:test> SELECT * FROM cfsst3 ; pk \| ck \| rc ----+----+------ 2 \| 3 \| null 4 \| 6 \| null (2 rows) <<Scylla restart>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10); cqlsh:test> SELECT * from cfsst3 ; pk \| ck \| rc ----+----+------ 5 \| 7 \| null 8 \| 10 \| null 2 \| 3 \| null 4 \| 6 \| null 7 \| 9 \| null 6 \| 8 \| null (6 rows) " * 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla: database: Honour enable_sstables_mc_format configuration option. sstables: Support SSTables 'mc' format as a feature. db: Add configuration option for enabling SSTables 'mc' format. tests: Add test for reading a complex column with zero subcolumns (SST3). sstables: Fix parsing of complex columns with zero subcolumns. sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding. 
sstables: Use parser_type instead of abstract_type::parse_type in column_translation. bytes: Add helper for turning bytes_view into sstring_view. sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists. sstables: Fix string formatting for exception messages in m_format_read_helpers. sstables: Don't validate timestamps against the max value on parsing. sstables: Always store only min bases in serialization_header. sstables: Support 'mc' version parsing from filename. SST3: Make sure we call consume_partition_end	2018-09-26 11:10:07 +01:00
Vladimir Krivopalov	650b245657	db: Add configuration option for enabling SSTables 'mc' format. This flag will only be used for testing purposes until Scylla 3.0 release and will be removed once SSTables 'mc' testing is completed. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-25 17:23:40 -07:00
Avi Kivity	c6f651ead4	Merge "Use fragmented buffers in commitlog writes" from Paweł " This series changes commitlog write path so that it uses fragmented buffers and therefore avoids large allocations. This is done by first switching the code to use seastar memory_output_stream interface, which can handle fragmented buffer without any additional actions from the user code needed and then making it use buffers of fixed size 128 kB. Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table) " * tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla: commitlog: switch to fragmented buffers commitlog: drop buffer pools commitlog: drop recovery from bad alloc utils: drop data_output commitlog: use memory_output_stream serialization_visitors: add support for memory_output_stream utils: fragmented_temporary_buffer::view: add remove_prefix() utils: fragmented_temporary_buffer: add empty() and size_bytes() utils: fragmented_temporary_buffer: add get_ostream() idl: serializer: don't assume Iterator::value_type is bytes_view idl: serializer: create buffer view from streams utils: crc: accept FragmentRange	2018-09-25 12:43:06 +03:00
Botond Dénes	eb357a385d	flat_mutation_reader: make timeout opt-out rather than opt-in Currently timeout is opt-in, that is, all methods that even have it default it to `db::no_timeout`. This means that ensuring timeout is used where it should be is completely up to the author and the reviewers of the code. As humans are notoriously prone to mistakes this has resulted in a very inconsistent usage of timeout, many clients of `flat_mutation_reader` passing the timeout only to some members and only on certain call sites. This is small wonder considering that some core operations like `operator()()` only recently received a timeout parameter and others like `peek()` didn't even have one until this patch. Both of these methods call `fill_buffer()` which potentially talks to the lower layers and is supposed to propagate the timeout. All this makes the `flat_mutation_reader`'s timeout effectively useless. To make order in this chaos make the timeout parameter a mandatory one on all `flat_mutation_reader` methods that need it. This ensures that humans now get a reminder from the compiler when they forget to pass the timeout. Clients can still opt-out from passing a timeout by passing `db::no_timeout` (the previous default value) but this will be now explicit and developers should think before typing it. There were surprisingly few core call sites to fix up. Where a timeout was available nearby I propagated it to be able to pass it to the reader, where I couldn't I passed `db::no_timeout`. Authors of the latter kind of code (view, streaming and repair are some of the notable examples) should maybe consider propagating down a timeout if needed. In the test code (the vast majority of the changes) I just used `db::no_timeout` everywhere. Tests: unit(release, debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>	2018-09-20 11:31:24 +02:00
Paweł Dziepak	4469f76e7c	commitlog: switch to fragmented buffers So far commitlog was using contiguous buffers for storing the data that is about to be written to disk. It was able to coalesce small writes so that multiple small mutations would use the same buffer, but if a mutation was large the commitlog would attempt to allocate a single, appropriately large buffer. This excessively stresses the memory allocator and may cause memory fragmentation to become an issue. The solution is to use fixed-size buffers of 128 kB, which is the standard buffer size in Scylla and keep large values fragmented.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	7c1add6769	commitlog: drop buffer pools Buffer pools were added in `7191a130bb` "Commitlog: recycle buffers to reduce fragmentation." They introduce a lot of complexity and will become unnecessary once the code is switched to use fixed-size 128kB buffers.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	9fee8b8d76	commitlog: drop recovery from bad alloc If a node cannot allocate 128 kB it is already in a very bad shape, so there isn't much value in trying to recover by attempting smaller allocations and it just adds more complexity to the segment allocation. It actually may be better to let some requests fail and give the node a chance to recover rather than trying to use every last byte of free memory and end up with bad_alloc in a noexcept context.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	2e5b375309	utils: drop data_output	2018-09-18 17:22:59 +01:00
Paweł Dziepak	fe48aaae46	commitlog: use memory_output_stream memory_output_stream deals with all required pointer arithmetic and allows easy transition to fragmented buffers.	2018-09-18 17:22:59 +01:00
Tomasz Grabiec	cd201d1987	db/batchlog_manager: Do not return a value from timer callback Timer callbacks are std::function<void()>. Exposed by changing callback_t to noncopyable_function<>. Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>	2018-09-05 12:32:21 +03:00
Botond Dénes	6e59cee244	db::consistency_level::filter_for_query() add preferred_endpoints To the second overload (the one without read-repair related params) too.	2018-09-03 10:31:44 +03:00
Nadav Har'El	16a6f76873	materialized views: simplify do_delete_old_entry() In previous patches, we gave up on an old (and broken) attempt to track the timestamps of many unselected base-table columns through one row marker in the view table - and replaced them by "virtual cells", one per unselected cell. The do_delete_old_entry() function still contains old code which maintained that row marker, and is no longer needed. That old code is not only no longer needed, it also no longer did anything because all columns now appear in the view (as virtual columns) so the code ignored them when calculating the row marker. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180829131914.16042-1-nyh@scylladb.com>	2018-08-29 14:33:41 +01:00
Duarte Nunes	79d796e710	Merge 'Materialized Views: row liveness correction' from Nadav " When a view's partition key contains only columns from the base's partition key (and not an additional one), the liveness - existence or disappearance - of a view-table row is tied to the liveness of the base table row. And that, in turn, depends not only on selected columns (base-table columns SELECTed to also appear in the view) but also on unselected columns. This means that we may need to keep a view row alive even without data, just because some unselected column is alive in the base table. Before this patch set we tried to build a single "row marker" in the view column which tried to summarize the liveness information in all unselected columns. But this proved unworkable, as explained in issue #3362 and as will be demonstrated in unit tests at the end of this series. Because we can't replace several unselected cells by one row marker, what we do in this series is to add for each of the unselected cells a "virtual cell" which contains the cell's liveness information (timestamp, deletion, ttl) but not its value. For collections, we can't represent the entire collection by one virtual cell, and rather need a collection of virtual cells. Fixes #3362 " * 'virtual-cols-v3' of https://github.com/nyh/scylla: Materialized Views: test that virtual columns are not visible Materialized Views: unit test reproducing fixed issue #3362 Materialized Views: no need for elaborate row marker calculations Materialized Views: add unselected columns as virtual columns Materialized Views: fill virtual columns Do not allow selecting a virtual column schema: persist "view virtual" columns to a separate system table schema: add "view virtual" flag to schema's column_definition Add "empty" type name to CQL parser, but only for internal parsing	2018-08-29 14:32:38 +01:00
Tomasz Grabiec	10f6b125c8	database: Run system table flushes in the main scheduling group memtable flushes for system and regular region groups run under the memtable_scheduling_group, but the controller adjusts shares based on the occupancy of the regular region group. It can happen that regular is not under pressure, but system is. In this case the controller will incorrectly assign low shares to the memtable flush of system. This may result in high latency and low throughput for writes in the system group. I observed writes to the system keyspace timing out (on scylla-2.3-rc2) in the dtest: limits_test.py:TestLimits.max_cells_test, which went away after this. Fixes #3717. Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>	2018-08-23 15:07:05 +03:00
Nadav Har'El	6c00341383	Materialized Views: no need for elaborate row marker calculations Now that we have separate virtual cells to represent unselected columns in a materialized view, we no longer need the elaborate row-marker liveness calculations which aimed (but failed) to do the same thing. So that code can be removed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:45:41 +03:00
Nadav Har'El	30f721afab	Materialized Views: add unselected columns as virtual columns When a view's partition key contains only columns from the base's partition key (and not an additional one), the liveness (existence or disappearance) of a view-table row is tied to the liveness of the base table row - and that depends not only on selected columns (base-table columns SELECTed to also appear in the view) but also on unselected columns. This means that we may need to keep a view row alive even without data, just because some unselected column is alive in the base table. Before this patch we tried to build a single "row marker" in the view column which summarizes the liveness information in all unselected columns, but this proved unworkable, as explained in issue #3362 and as will be demonstrated in unit tests in a later patch. Because we can't replace several unselected cells by one row marker, what we do in this patch is to add for each unselected cell a "virtual cell" which contains the cell's liveness information (timestamp, deletion, ttl) but not its value. For collections, we can't represent the entire collection by one virtual cell, and rather need a collection of virtual cells. This patch just adds the virtual columns to the view schema. Code in the previous patch, when it notices the virtual columns in the view's schema, added the appropriate content into these columns. We may need to add virtual columns to a view when first created, but also when an unselected column is added to the base table with "ALTER TABLE", so both are supported in this patch. Fixes #3362. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:42:22 +03:00
Nadav Har'El	782baa44ef	Materialized Views: fill virtual columns The add_cells_to_view() function usually adds selected cells from the base table to the view mutation. For issue #3362, we sometimes want to also add unselected cells as "virtual" cells - truncated versions of the base-table cells just without the values. This patch contains the code to fill the virtual columns' data using the regular columns from the base table. This patch does not yet actually add any virtual columns to the schema, so until that is done (in the next patch), this patch will not yet cause any behavior change. This is important for bisectability. Refs #3362. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:38:27 +03:00
Nadav Har'El	36a657fc10	schema: persist "view virtual" columns to a separate system table In the previous patch, we added a "view virtual" flag on columns. In this patch we add persistence to this flag: I.e., writing it to the on-disk schema table and reading it back on startup. But the implementation is not as simple as adding a flag: In the on-disk system tables, we have a "columns" table listing all the columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED VIEW" works by reading this "columns" table, and listing all of the requested view's columns. Therefore, we cannot add "virtual columns" - which are columns not added by the user and not intended to be seen - to this list. We therefore need to create in this patch a separate list for virtual columns, in a new table "view_virtual_columns". This table is essentially identical to the existing "columns" table, just separate. We need to write each column to the appropriate table (columns with the view_virtual flag to "view_virtual_columns", columns without it to the old "columns"), read from both on startup, and remember to delete columns from both when a table is dropped. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:30:06 +03:00
Nadav Har'El	b4fc711903	Add "empty" type name to CQL parser, but only for internal parsing Even before this patch, Scylla supported the "empty" type (a column with no content) but only internally - i.e., in code but not in CQL syntax. The "empty" type was used in dense tables without regular columns, and a special optimization in db::cql_type_parser::parse() allowed this type name to be parsed when reading the schema tables, without allowing the "empty" type to be used by users in CQL statements. However, parse() only supported "empty" itself, and more complex types like list<empty> were not recognized by parse(). In the following patches, we plan to add virtual columns to materialized views, with types empty, list<empty> or map<something, empty>. We need all these types to work, and before this patch, they don't. But we want all of these types to only work internally - when Scylla's code creates these hidden columns; we do not want to add the "empty" type to CQL's syntax. This is what we do in this patch: The CQL parser's comparator_type rule now has a parameter, "internal", used to differentiate internal calls via db::cql_type_parser::parse() from calls from CQL query parsing. If a user tries something like: CREATE TABLE e (pk empty PRIMARY KEY); He will get the error: Invalid (reserved) user type name empty Note that here, as usual, unknown types are treated as "user types", and "empty" is not allowed as a user type name - we "reserve" it in case one day in the future we will want to allow users a direct syntax to create empty columns. We already have, following Cassandra, a bunch of other names reserved from being user type names, including "byte", "complex", and others (see _reserved_type_names()), and using "empty" as a type name will result in a similar error message. 
Just like all other type names, the name "empty" is not a reserved keyword in other senses: a user can create a table or a column with the name "empty", just like he can create one with the name "int". Refs #3362. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:12:27 +03:00
Duarte Nunes	a025bf6a7d	Merge seastar upstream Seastar introduced a "compat" namespace, which conflicts with Scylla's own "compat" namespaces. The merge thus includes changes to scope uses of Scylla's "compat" namespaces. * seastar 8ad870f...9bb1611 (5): > util/variant_utils: Ensure variant_cast behaves well with rvalues > util/std-compat: Fix infinite recursion > doc/tutorial: Undo namespace changes > util/variant_utils: Add cast_variant() > Add compatibility with C++17's library types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-08-14 13:07:09 +01:00
Avi Kivity	d6b0c4dda4	config: default murmur3_ignore_msb_bits to 12 even if not specified in scylla.yaml When murmur3_ignore_msb_bits was introduced in 1.7, we set its default to zero (to avoid resharding on upgrade) and set it to 12 in the scylla.yaml template (to make sure we get the right value for new clusters). Now, however, things have changed: - clusters installed before 1.7 are a small minority - they should have resharded long ago - resharding is much better these days - we have more migrations from Cassandra compared to old clusters To allow clusters that migrated using their cassandra.yaml, and to clean up the default scylla.yaml, make the default 12. Users upgrading from pre-1.7 clusters will need to update their scylla.yaml, or to reshard (which is a good idea anyway). Fixes #3670. Message-Id: <20180808063003.26046-1-avi@scylladb.com>	2018-08-08 13:46:06 +02:00
Rafi Einstein	c7f41c988f	Add a counter to count large partition warning in compaction Fixes #3562 Tests: dtest(compaction_test.py) Message-Id: <20180807190324.82014-1-rafie@scylladb.com>	2018-08-07 20:15:09 +01:00
Avi Kivity	620e950fc8	Merge "No infinite time-outs for internal distributed queries" from Jesse " This series replaces infinite time-outs in internal distributed (non-local) CQL queries with finite ones. The implementation of tracing, which also performs internal queries, already has finite time-outs, so it is unchanged. Fixes #3603. " * 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla: Use finite time-outs for internal auth. queries Use finite query time-outs for `system_distributed`	2018-08-01 11:23:42 +03:00
Asias He	4a0b561376	storage_service: Get rid of moving operation The moving operation changes a node's token to a new token. It is supported only when a node has one token. The legacy moving operation was useful in the early days, before vnodes were introduced, when a node had only one token. I don't think it is useful anymore. In the future, we might support adjusting the number of vnodes to rebalance the token range each node owns. Removing it simplifies the cluster operation logic and code. Fixes #3475 Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>	2018-08-01 11:18:17 +03:00
Jesse Haber-Kucharsky	ca44f4de3c	Use finite query time-outs for `system_distributed`	2018-07-31 11:38:15 -04:00
Nadav Har'El	25bd139508	cross-tree: clean up use of std::random_device() std::random_device() uses the relatively slow /dev/urandom, and we rarely if ever intend to use it directly - we normally want to use it to seed a faster random_engine (a pseudo-random number generator). In many places in the code, we first created a random_device variable, and then using it created a random_engine variable. However, this practice created the risk of a programmer accidentally using the random_device object, instead of the random_engine object, because both have the same API; this hurts performance. This risk materialized in just two places in the code, utils/uuid.cc and gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is not included in this patch, and the fix for gossiper.{cc,hh} is included here. To avoid risking the same mistake in the future, this patch switches across the code to an idiom where the random_device object is not named, so cannot be accidentally used. We use the following idiom: std::default_random_engine _engine{std::random_device{}()}; Here std::random_device{}() creates the random device (/dev/urandom) and pulls a random integer from it. It then uses this seed to create the random_engine (the pseudo-random number generator). The std::random_device{} object is temporary and unnamed, and cannot be unintentionally used directly. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180726154958.4405-1-nyh@scylladb.com>	2018-07-26 16:54:58 +01:00
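The idiom described in commit 25bd139508 can be sketched as follows. This is a minimal illustration, not code from the Scylla tree: the class name `seeded_picker` and its `pick()` method are hypothetical, and only the `_engine` initializer is taken from the commit message.

```cpp
#include <random>

// Hypothetical example class; only the _engine initializer below
// comes from the commit message.
class seeded_picker {
    // The random_device is an unnamed temporary: it is called once to
    // produce a seed and then destroyed, so no later code can draw
    // from the slow device by mistake.
    std::default_random_engine _engine{std::random_device{}()};
public:
    // Returns a pseudo-random integer in the closed range [lo, hi],
    // drawn from the fast engine rather than from the device.
    int pick(int lo, int hi) {
        std::uniform_int_distribution<int> dist(lo, hi);
        return dist(_engine);
    }
};
```

A caller would write something like `seeded_picker p; int v = p.pick(1, 6);`. Because the device has no name, the only generator reachable from the class is the engine.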
Tomasz Grabiec	894961006b	Merge "db/view/view_builder: Fixes to bookkeeping" from Duarte This series contains a couple of fixes to the bookkeeping of the view build process, which could cause data to be left behind in the system tables. * git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1: Duarte Nunes (3): db/system_keyspace: Add function to remove view build status of a shard db/view: Don't have shard 0 clear other shard's status on drop db/view: Restrict writes to the distributed system keyspace to shard 0	2018-07-17 18:01:28 +02:00
Tomasz Grabiec	25d09e51ac	Merge "db/view/build_progress_virtual_reader: Fixes to clustering key adjusts" from Duarte This series contains a couple of fixes to the adjusting of clustering keys in the build_progress_virtual_reader, some of which could potentially cause heap overflows when querying the legacy system table. * git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1: Duarte Nunes (3): db/view/build_progress_virtual_reader: Use correct schema to adjust ck db/view/build_progress_virtual_reader: Fix full ck detection db/view/build_progress_virtual_reader: Also adjust end RT bound	2018-07-17 18:00:30 +02:00
Avi Kivity	acb3163639	large_partition_handler: output friendly partition key Use abstract_type::to_string() to prettify partition key components. Manually tested by setting --compaction-large-partition-warning-threshold-mb to zero and inspecting the output for compound and non-compound partition keys.	2018-07-17 14:44:52 +03:00
Duarte Nunes	55caaec411	db/view/build_progress_virtual_reader: Also adjust end RT bound Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-07-11 23:28:31 +01:00
Duarte Nunes	eda6b88b0e	db/view/build_progress_virtual_reader: Fix full ck detection As an optimization, the virtual reader doesn't change the underlying key if it is not full, and hence doesn't include the extra clustering key. However, this detection is broken because it checked for 3 clustering columns, instead of 2. This patch fixes that by obtaining the clustering key size from the underlying schema instead of hardcoding the size. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-07-11 23:28:31 +01:00
Duarte Nunes	ff3a0d437a	db/view/build_progress_virtual_reader: Use correct schema to adjust ck The virtual reader adjusts clustering keys obtained from the underlying, scylla-specific schema, and potentially sheds the extra clustering key that's absent from the Cassandra-compatible schema. This patch ensures we use the correct schema to iterate over the key. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-07-11 23:28:31 +01:00

1151 Commits