scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-19 16:15:07 +00:00

Author	SHA1	Message	Date
Avi Kivity	1bcc5a1b5c	Merge "database: assign proper io priority for streaming view updates" from Piotr " Streamed view updates parasitized on writing io priority, which is reserved for user writes - it's now properly bound to streaming write priority. Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...} Tests: unit(dev) " Fixes #4615. * 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla: db,view: wrap view update generation in stream scheduling group database: assign proper io priority for streaming view updates (cherry picked from commit `2c7435418a`)	2019-08-22 16:21:42 +03:00
Kamil Braun	a690e20966	Fix infinite looping when performing a range query on system.size_estimates. Queries to system.size_estimates table which are not single partition queries caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer. This happened because multishard_combining_reader assumes that shards return rows belonging to separate partitions, which was not the case for size_estimates_mutation_reader. This commit fixes the issue and closes #4689.	2019-08-14 12:51:33 +02:00
Kamil Braun	7172009a0d	Fix segmentation fault when querying system.size_estimates for an empty keyspace.	2019-08-14 12:51:33 +02:00
Kamil Braun	cb688ef62e	Refactor size_estimates_virtual_reader Move the implementation of size_estimates_mutation_reader to a separate compilation unit to speed up compilation times and increase readability. Refactor tests to use seastar::thread.	2019-08-14 12:51:27 +02:00
Avi Kivity	094a2a4263	Merge "Catch unclosed partition sstable write #4794 " from Tomasz " Not emitting partition_end for a partition is incorrect. SStable writer assumes that it is emitted. If it's not, the sstable will not be written correctly. The partition index entry for the last partition will be left partially written, which will result in errors during reads. Also, statistics and sstable key ranges will not include the last partition. It's better to catch this problem at the time of writing, and not generate bad sstables. Another way of handling this would be to implicitly generate a partition_end, but I don't think that we should do this. We cannot trust the mutation stream when invariants are violated, we don't know if this was really the last partition which was supposed to be written. So it's safer to fail the write. Enabled for both mc and la/ka. Passing --abort-on-internal-error on the command line will switch to aborting instead of throwing an exception. The reason we don't abort by default is that it may bring the whole cluster down and cause unavailability, while it may not be necessary to do so. It's safer to fail just the affected operation, e.g. repair. However, failing the operation with an exception leaves little information for debugging the root cause. So the idea is that the user would enable aborts on only one of the nodes in the cluster to get a core dump and not bring the whole cluster down. " * 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla: sstables: writer: Validate that partition is closed when the input mutation stream ends config, exceptions: Add helper for handling internal errors utils: config_file: Introduce named_value::observe() (cherry picked from commit `95c0804731`) (cherry picked from commit `cf4c238b28`)	2019-08-08 16:47:26 +03:00
Gleb Natapov	d566466fca	batchlog_manager: fix array out of bound access endpoint_filter() function assumes that each bucket of std::unordered_multimap contains elements with the same key only, so its size can be used to know how many elements with a particular key are there. But this is not the case, elements with multiple keys may share a bucket. Fix it by counting keys in another way. Fixes #3229 Message-Id: <20190501133127.GE21208@scylladb.com> (cherry picked from commit `95c6d19f6c`)	2019-05-03 11:59:29 +03:00
Duarte Nunes	79cf277ea2	db/schema_tables: Diff tables using ID instead of name Currently we diff schemas based on table/view name, and if the names match, then we detect altered schemas by comparing the schema mutations. This fails to detect transitions which involve dropping and recreating a schema with the same name, if a node receives these notifications simultaneously (for example, if the node was temporarily down or partitioned). Note that because the ID is persisted and created when executing a create_table_statement, then even if a schema is re-created with the exact same structure as before, we will still consider it altered because the mutations will differ. This also stops schema pulling from working, since it relies on schema merging. The solution is to diff schemas using their ID, and not their name. Keyspaces and user types are also susceptible to this, but in their case it's fine: these are values with no identity, and are just metadata. Dropping and recreating a keyspace can be viewed as dropping all tables from the keyspace, altering it, and eventually adding new tables to the keyspace. Note that this solution doesn't apply to tables dropped and created with the same ID (using the `WITH ID = {}` syntax). For that, we would need to detect deltas instead of applying changes and then reading the new state to find differences. However, this solution is enough, because tables are usually created with ID = {} for very specific, peculiar reasons. The original motivation meant for the new table to be treated exactly as the old, so the current behavior is in fact the desired one. Tests: unit(release), dtests(schema_test, schema_management_test) Fixes #3797 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181001230932.47153-2-duarte@scylladb.com> (cherry picked from commit `40a30d4129`)	2019-04-17 18:01:48 +01:00
Duarte Nunes	03ada48b40	db/schema_tables: Drop tables before creating new ones Doing it by the inverse order doesn't support dropping and creating a schema with the same name. Refs #3797 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181001230932.47153-1-duarte@scylladb.com> (cherry picked from commit `e404f09a23`)	2019-04-17 18:01:48 +01:00
Tomasz Grabiec	69d0b1e15c	schema_tables: Serialize schema merges fairly All schema changes made to the node locally are serialized on a semaphore which lives on shard 0. For historical reasons, they don't queue but rather try to take the lock without blocking and retry on failure with a random delay from the range [0, 100 us]. Contenders which do not originate on shard 0 will have an extra disadvantage as each lock attempt will be longer by the across-shard round trip latency. If there is constant contention on shard 0, contenders originating from other shards may keep losing the race to take the lock. Schema merge executed on behalf of a DDL statement may originate on any shard. Same for the schema merge which is coming from a push notification. Schema merge executed as part of the background schema pull will originate on shard 0 only, where the application state change listeners run. So if there are constant schema pulls, DDL statements may take a long time to get through. The fix is to serialize merge requests fairly, by using the blocking semaphore::wait(), which is fair. We don't have to back-off any more, since submit_to() no longer has a global concurrency limit. Fixes #4436. Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `3fd82021b1`)	2019-04-16 10:19:45 +03:00
Tomasz Grabiec	d3d877b9db	Merge "db/view: Apply tracked tombstones for new updates" from Duarte When generating view updates for base mutations when no pre-existing data exists, we were forgetting to apply the tracked tombstones. Fixes #4321 Tests: unit(dev) * https://github.com/duarten/scylla materialized-views/4321/v1.1: db/view: Apply tracked tombstones for new updates tests/view_schema_test: Add reproducer for #4321 (cherry picked from commit `2b8bf0dbf8`)	2019-03-27 21:56:21 +00:00
Duarte Nunes	66a48746b8	service/storage_proxy: Don't consider view hints for MV backpressure When a view replica becomes unavailable, updates to it are stored as hints at the paired base replica. This on-disk queue of pending view updates grows as long as there are view updates and the view replica remains unavailable. Currently, we take that relative queue size into account when calculating the delay for new base writes, in the context of the backpressure algorithm for materialized views. However, the way we're calculating that on-disk backlog is wrong, since we calculate it per-device and then feed it to all the hints managers for that device. This means that normal hints will show up as backlog for the view hints manager, which in turn introduces delays. This can make the view backpressure mechanism kick in even if the cluster uses no materialized views. There's yet another way in which considering the view hints backlog is wrong: a view replica that is unavailable for some period of time can cause the backlog to grow to a point where all base writes are applied the maximum delay of 1 second. This turns a single-node failure into cluster unavailability. The fix to both issues is to simply not take this on-disk backlog into account for the backpressure algorithm. Fixes #4351 Fixes #4352 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190321170418.25953-1-duarte@scylladb.com> (cherry picked from commit `93a1c27b31`)	2019-03-25 15:02:06 +02:00
Avi Kivity	3869b5ab51	Merge "Fix commitlog chunks overwriting each other" from Paweł " This series fixes a problem in the commitlog cycle() function that confused in-memory and on-disk size of chunks it wrote to disk. The former was used to decide how much data needs to be actually written, and the latter was used to compute the offset of the next chunk. If two chunk writes happened concurrently, the one positioned earlier in the file could corrupt the header of the next one. Fixes #4231. Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table) " * tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla: commitlog: write the correct buffer size utils/fragmented_temporary_buffer_view: add remove suffix (cherry picked from commit `d95dec22d9`)	2019-03-04 17:58:46 +02:00
Nadav Har'El	82016c07f2	Materialized views: limit size of row batching during bulk view building The bulk materialized-view building processes (when adding a materialized view to a table with existing data) currently reads the base table in batches of 128 (view_builder::batch_size) rows. This is clearly better than reading entire partitions (which may be huge), but still, 128 rows may grow pretty large when we have rows with large strings or blobs, and there is no real reason to buffer 128 rows when they are large. Instead, when the rows we read so far exceed some size threshold (in this patch, 1MB), we can operate on them immediately instead of waiting for 128. As a side-effect, this patch also solves another bug: At worst case, all the base rows of one batch may be written into one output view partition, in one mutation. But there is a hard limit on the size of one mutation (commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the batch size to exceed this limit. By not batching further after 1MB, we avoid reaching this limit when individual rows do not reach it but 128 of them did. Fixes #4213. This patch also includes a unit test reproducing #4213, and demonstrating that it is now solved. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190214093424.7172-1-nyh@scylladb.com> (cherry picked from commit `fec562ec8f`)	2019-02-16 21:54:41 +02:00
Calle Wilund	92cf2934c6	tls: Use a default prio string disabling TLS1.0 forcing min 128bits Fixes #4010 Unless user sets this explicitly, we should try to explicitly avoid deprecated protocol versions. While gnutls should do this for connections initiated thusly, clients such as drivers etc might use obsolete versions. Message-Id: <20190107131513.30197-1-calle@scylladb.com> (cherry picked from commit `ba6a8ef35b`)	2019-02-05 19:45:13 +02:00
Calle Wilund	ed2fb65732	commitlog_replayer: Bugfix: finding truncation positions uses local var ref "uuid" was ref:ed in a continuation. Works 99.9% of the time because the continuation is not actually delayed (and assuming we begin the checks with non-truncated (system) cf:s it works). But if we do delay continuation, the resulting cf map will be borked. Fixes #4187. Message-Id: <20190204141831.3387-1-calle@scylladb.com> (cherry picked from commit `9cadbaa96f`)	2019-02-04 20:25:17 +02:00
Duarte Nunes	cf4b4d4878	Merge 'hinted handoff: cache cf mappings' from Vlad " Cache cf mappings when breaking in the middle of a segment sending so that the sender has them the next time it wants to send this segment for where it left off before. Also add the "discard" metric so that we can track hints that are being discarded in the send flow. " Fixes #4122 * 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla: hinted handoff: cache column family mappings for segments that were not sent out in full hinted handoff: add a "discarded" metric (cherry picked from commit `88c7c1e851`)	2019-01-23 17:14:29 +02:00
Nadav Har'El	515399ce17	materialized views: move hints to top-level directory While we keep ordinary hints in a directory parallel to the data directory, we decided to keep the materialized view hints in a subdirectory of the data directory, named "view_pending_updates". But during boot, we expect all subdirectories of data/ to be keyspace names, and when we notice this one, we print a warning: WARN: database - Skipping undefined keyspace: view_pending_updates This spurious warning annoyed users. But moreover, we could have bigger problems if the user actually tries to create a keyspace with that name. So in this patch, we move the view hints to a separate top-level directory, which defaults to /var/lib/scylla/view_hints, but as usual can be configured. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190107142257.16342-1-nyh@scylladb.com> (cherry picked from commit `da090a5458`)	2019-01-07 22:01:56 +02:00
Pekka Enberg	c1f6ce4251	Merge 'Fixes for the view_update_from_staging_generator' from Duarte "This series contains a couple of fixes to the view_update_from_staging_generator, the object responsible for generating view updates from sstables written through streaming. Fixes #4021" * 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla: db/view/view_update_from_staging_generator: Break semaphore on stop() db/view/view_update_from_staging_generator: Restore formatting db/view/view_update_from_staging_generator: Avoid creating more than one fiber (cherry picked from commit `96172b7bca`)	2018-12-29 20:22:54 +02:00
Avi Kivity	dbe347811c	Merge "materialized views: Apply backpressure from view replicas" from Duarte " As the amount of pending view updates increases we know that there’s a mismatch between the rate at which the base receives writes and the rate at which the view retires them. We react by applying backpressure to decrease the rate of incoming base writes, allowing the slow view replicas to catch up. We want to delay the client’s next writes to a base replica and we use the base’s backlog of view updates to derive this delay. To validate this approach we tested a 3 node Scylla cluster on GCE, using n1-standard-4 instances with NVMEs. A loader running on a n1-standard-8 instance run cassandra-stress with 100 threads. With the delay function d(x) set to 1s, we see no base write timeouts. With the delay function as defined in the series, we see that backlogs stabilize at some (arbitrary) point, as predicted, but this stabilization co-exists with base write timeouts. However, the system overall behaves better than the current version, with the 100 view update limit, and also better than the version without such limit or any backpressure. More work is necessary to further stabilize the system. Namely, we want to keep delaying until we see the backlog is decreasing. This will require us to add more delay beyond the stabilization point, which in turn should minimize the base write timeouts, and will also minimize the amount of memory the backlog takes at each base replica. 
Design document: https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo Fixes #2538 " Reviewed-by: Nadav Har'El <nyh@scylladb.com> * 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits) service/storage_proxy: Release mutation as early as possible service/storage_proxy: Delay replica writes based on view update backlog service/storage_proxy: Get the backlog of a particular base replica service/storage_proxy: Add counters for delayed base writes main: Start and stop the view_update_backlog_broker service: Distribute a node's view update backlog service: Advertise view update backlog over gossip service/storage_proxy: Send view update backlog from replicas service/storage_proxy: Prepare to receive replica view update backlog service/storage_proxy: Expose local view update backlog tests/view_schema_test: Add simple test for db::view::node_update_backlog db/view: Introduce node_update_backlog class db/hints: Initialize current backlog database: Add counter for current view backlog database: Expose current memory view update backlog idl: Add db::view::update_backlog db/view: Add view_update_backlog database: Wait on view update semaphore for view building service/storage_proxy: Use near-infinite timeouts for view updates database: generate_and_propagate_view_updates no longer needs a timeout ... (cherry picked from commit `b66f59aa3d`)	2018-12-20 19:11:56 +02:00
Avi Kivity	8f2d24bb8f	config: remove "to be removed before release" notice mc sstable config The "enable_sstables_mc_format" config item help text wants to remove itself before release. Since scylla-3.0 did not get enough mc format mileage, we decided to leave it in, so the notice should be removed. Fixes #4003. Message-Id: <20181219082554.23923-1-avi@scylladb.com> (cherry picked from commit `dd51c659f7`)	2018-12-19 19:08:36 +02:00
Duarte Nunes	97cd9108d6	db/system_distributed_keyspace: Create the schema with min_timestamp Different nodes can concurrently create the distributed system keyspace on boot, before the "if not exists" clause can take effect. However, the resulting schema mutations will be different since different nodes use different timestamps. This patch forces the timestamps to be the same across all nodes, so we save some schema mismatches. This fixes a bug exposed by `ca5dfdf`, whereby the initialization of the distributed system keyspace is done before waiting for schema agreement. While waiting for schema agreement in storage_service::join_token_ring(), the node still hasn't joined the ring and schemas can't be pulled from it, so nodes can deadlock. A similar situation can happen between a seed node and a non-seed node, where the seed node progresses to a different "wait for schema agreement" barrier, but still can't make progress because it can't pull the schema from the non-seed node still trying to join the ring. Finally, it is assumed that changes to the schema of the current distributed system keyspace tables will be protected by a cluster feature and a subsequent schema synchronization, such that all nodes will be at a point where schemas can be transferred around. Fixes #3976 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181211113407.20075-1-duarte@scylladb.com> (cherry picked from commit `89ae3fbf11`)	2018-12-11 14:53:30 +00:00
Gleb Natapov	4acfc5ed8f	hints: make hints manager more resilient to unexpected directory content Currently if hints directory contains unexpected directories Scylla fails to start with unhandled std::invalid_argument exception. Make the manager ignore malformed files instead and try to proceed anyway. Message-Id: <20181121134618.29936-2-gleb@scylladb.com> (cherry picked from commit `b4a8802edc`)	2018-12-08 13:42:43 +02:00
Gleb Natapov	cb9199bc7f	hints: add auxiliary function for scanning high level hints directory We scan hints directory in two places: to search for files to replay and to search for directories to remove after resharding. The code that translates directory name to a shard is duplicated. It is simple now, so not a big issue, but in case it grows it is better to have it in one place. Message-Id: <20181121134618.29936-1-gleb@scylladb.com> (cherry picked from commit `9433d02624`)	2018-12-08 13:42:43 +02:00
Avi Kivity	54258ca8eb	Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte " This series changes hinted handoff to work with `frozen_mutation`s instead of naked `mutation`s. Instead of unfreezing a mutation from the commitlog entry and then freezing it again for sending, now we'll just keep the read, frozen mutation. Tests: unit(release) " * 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla: db/hints/manager: Use frozen_mutation instead of mutation db/hints/manager: Use database::find_schema() db/commitlog/commitlog_entry: Allow moving the contained mutation service/storage_proxy: send_to_endpoint overload accepting frozen_mutation service/storage_proxy: Build a shared_mutation from a frozen_mutation service/storage_proxy: Lift frozen_mutation_and_schema service/storage_proxy: Allow non-const ranges in mutate_prepare() (cherry picked from commit `1891779e64`)	2018-12-05 20:14:57 +00:00
Duarte Nunes	f8195a77b0	db/view/view_builder: Don't timeout waiting for view to be built Remove the timeout argument to db::view::view_builder::wait_until_built(), a test-only function to wait until a given materialized view has finished building. This change is motivated by the fact that some tests running on slow environments will timeout. Instead of incrementally increasing the timeout, remove it completely since tests are already run under an exterior timeout. Fixes #3920 Tests: unit release(view_build_test, view_schema_test) Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181115173902.19048-1-duarte@scylladb.com> (cherry picked from commit `6fbf792777`)	2018-12-05 19:20:36 +00:00
Duarte Nunes	5b724c80ab	db/view: Don't copy keyspace name Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181022104527.14555-1-duarte@scylladb.com> (cherry picked from commit `f3a5ec0fd9`)	2018-12-05 19:19:26 +00:00
Nadav Har'El	4a7ae81b3f	materialized views: update stats.write statistics in all cases mutate_MV usually calls send_to_endpoint() to push view update to remote view replicas. This function gets passed a statistics object, service::storage_proxy_stats::write_stats and, in particular, updates its "writes" statistic which counts the number of ongoing writes. In the case that the paired view replica happens to be the same node, we avoid calling send_to_endpoint() and call mutate_locally() instead. That function does not take a write_stats object, so the "writes" statistic doesn't get incremented for the duration of the write. So we should do this explicitly. Co-authored-by: Nadav Har'El <nyh@scylladb.com> Co-authored-by: Duarte Nunes <duarte@scylladb.com> (cherry picked from commit `1d5f8d0015`)	2018-12-05 19:19:26 +00:00
Duarte Nunes	9776a048e7	Merge 'Generating view updates during streaming' from Piotr During streaming, there are cases when we should invoke the view write path. In particular, if we're streaming because of repair or if a view has not yet finished building and we're bootstrapping a new node. The design constraints are: 1) The streamed writes should be visible to new writes, but the sstable should not participate in compaction, or we would lose the ability to exclude the streamed writes on a restart; 2) The streamed writes must not be considered when generating view updates for them; 3) Resilient to node restarts; 4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges. We achieve this by writing the streamed writes to an sstable in a different folder, call it "staging". We achieve 1) by publishing the sstable to the column family sstable set, but excluding it from compactions. We do these steps upon boot, by looking at the staging directory, thus achieving 3). 
Fixes #3275 * 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits) tests: add materialized views test tests: add view update generator to cql test env main: add registering staging sstables read from disk database: add a check if loaded sstable is already staging database: add get_staging_sstable method streaming: stream tables with views through staging sstables streaming: add system distributed keyspace ref to streaming streaming: add view update generator reference to streaming main: add generating missed mv updates from staging sstables storage_service: move initializing sys_dist_ks before bootstrap db/view: add view_update_from_staging_generator service db/view: add view updating consumer table: add stream_view_replica_updates table: split push_view_replica_updates table: add as_mutation_source_excluding table: move push_view_replica_updates to table.cc database: add populating tables with staging sstables database: add creating /staging directory for sstables database: add sstable-excluding reader table: add move_sstable_from_staging_in_thread function ... (cherry picked from commit `a38f6078fb`)	2018-11-15 17:46:20 +02:00
Vlad Zolotarov	c6de9ea39b	config: enable hinted handoff by default Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20181019180401.12400-1-vladz@scylladb.com> (cherry picked from commit `4d1bb719a4`)	2018-11-01 10:41:44 +02:00
Nadav Har'El	996b86b804	Materialized views: fix race condition in resharding while view building When a node reshards (i.e., restarts with a different number of CPUs), and is in the middle of building a view for a pre-existing table, the view building needs to find the right token from which to start building on all shards. We ran the same code on all shards, hoping they would all make the same decision on which token to continue. But in some cases, one shard might make the decision, start building, and make progress - all before a second shard goes to make the decision, which will now be different. This resulted, in some rare cases, in the new materialized view missing a few rows when the build was interrupted with a resharding. The fix is to add the missing synchronization: All shards should make the same decision on whether and how to reshard - and only then should start building the view. Fixes #3890 Fixes #3452 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20181028140549.21200-1-nyh@scylladb.com> (cherry picked from commit `b8337f8c9d`)	2018-10-29 09:52:25 +00:00
Avi Kivity	6bf4a73d88	thrift: limit message size Limit message size according to the configuration, to avoid a huge message from allocating all of the server's memory. We also need to limit memory used in aggregate by thrift, but that is left to another patch. Fixes #3878. Message-Id: <20181024081042.13067-1-avi@scylladb.com> (cherry picked from commit `a9836ad758`)	2018-10-24 19:32:25 +03:00
Avi Kivity	52be02558e	config: mark range_request_timeout_in_ms and request_timeout_in_ms as Used This makes them available in scylla --help. Fixes #3884. Message-Id: <20181023101150.29856-1-avi@scylladb.com> (cherry picked from commit `d9e0ea6bb0`)	2018-10-24 09:43:54 +03:00
Avi Kivity	a7cbfbe63f	Merge "hinted handoff: give a sender a low priority" from Vlad " Hinted handoff should not overpower regular flows like READs, WRITEs or background activities like memtable flushes or compactions. In order to achieve this put its sending in the STREAMING CPU scheduling group and its commitlog object into the STREAMING I/O scheduling group. Fixes #3817 " * 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla: db::hints::manager: use "streaming" I/O scheduling class for reads commitlog::read_log_file(): set the a read I/O priority class explicitly db::hints::manager: add hints sender to the "streaming" CPU scheduling group (cherry picked from commit `1533487ba8`)	2018-10-24 09:43:39 +03:00
Duarte Nunes	28fd2044d2	Merge 'hinted handoff: add manager::state and split storing and replaying enablement' from Vlad " Refs #3828 (Probably fixes it) We found a few flaws in a way we enable hints replaying. First of all it was allowed before manager::start() is complete. Then, since manager::start() is called after messaging_service is initialized there was a time window when hints are rejected and this creates an issue for MV. Both issues above were found in the context of #3828. This series fixes them both. Tested {release}: dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test dtest: hintedhandoff_additional_test.py " * 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla: hinted handoff: enable storing hints before starting messaging_service db::hints::manager: add a "started" state db::hints::manager: introduce a _state (cherry picked from commit `3a53b3cebc`)	2018-10-24 09:43:03 +03:00
Duarte Nunes	26c31f6798	Merge "db/hints: Expose current backlog" from Duarte " Hints are stored on disk by a hints::manager, ensuring they are eventually sent. A hints::resource_manager ensures the hints::managers it tracks don't consume more than their allocated resources by monitoring disk space and disabling new hints if needed. This series fixes some bugs related to the backlog calculation, but mainly exposes the backlog through a hints::manager so upper layers can apply flow control. Refs #2538 " * 'hh-manager-backlog/v3' of https://github.com/duarten/scylla: db/hints/manager: Expose current backlog db/hints/manager: Move decision about blocking hints to the manager db/hints/resource_manager: Correctly account resources in space_watchdog db/hints/resource_manager: Replace timer with seastar::thread db/hints/resource_manager: Ensure managers are correctly registered db/hints/resource_manager: Fix formatting db/hints: Disallow moving or copying the managers	2018-10-23 07:36:21 +00:00
Avi Kivity	337ee6153a	Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr " This patchset makes it possible to use SSTables 'mc' format, commonly referred to as 'SSTables 3.x', when running Scylla instance. Several bugs found on this way are fixed. Also, a configuration option is introduced to allow running Scylla either with 'mc' or 'la' format as default. Tests: unit {release} + tested Scylla with both 'la' and 'mc' formats to work fine: cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890] cqlsh> USE test; cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''}; cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8); <<flush>> cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8; <<flush>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6); cqlsh:test> SELECT * FROM cfsst3 ; pk \| ck \| rc ----+----+------ 2 \| 3 \| null 4 \| 6 \| null (2 rows) <<Scylla restart>> cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9); cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10); cqlsh:test> SELECT * from cfsst3 ; pk \| ck \| rc ----+----+------ 5 \| 7 \| null 8 \| 10 \| null 2 \| 3 \| null 4 \| 6 \| null 7 \| 9 \| null 6 \| 8 \| null (6 rows) " * 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla: database: Honour enable_sstables_mc_format configuration option. sstables: Support SSTables 'mc' format as a feature. db: Add configuration option for enabling SSTables 'mc' format. tests: Add test for reading a complex column with zero subcolumns (SST3). sstables: Fix parsing of complex columns with zero subcolumns. sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding. 
sstables: Use parser_type instead of abstract_type::parse_type in column_translation. bytes: Add helper for turning bytes_view into sstring_view. sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists. sstables: Fix string formatting for exception messages in m_format_read_helpers. sstables: Don't validate timestamps against the max value on parsing. sstables: Always store only min bases in serialization_header. sstables: Support 'mc' version parsing from filename. SST3: Make sure we call consume_partition_end	2018-09-26 11:10:07 +01:00
Vladimir Krivopalov	650b245657	db: Add configuration option for enabling SSTables 'mc' format. This flag will only be used for testing purposes until the Scylla 3.0 release and will be removed once SSTables 'mc' testing is completed. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-09-25 17:23:40 -07:00
Avi Kivity	c6f651ead4	Merge "Use fragmented buffers in commitlog writes" from Paweł " This series changes commitlog write path so that it uses fragmented buffers and therefore avoids large allocations. This is done by first switching the code to use seastar memory_output_stream interface, which can handle fragmented buffer without any additional actions from the user code needed and then making it use buffers of fixed size 128 kB. Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table) " * tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla: commitlog: switch to fragmented buffers commitlog: drop buffer pools commitlog: drop recovery from bad alloc utils: drop data_output commitlog: use memory_output_stream serialization_visitors: add support for memory_output_stream utils: fragmented_temporary_buffer::view: add remove_prefix() utils: fragmented_temporary_buffer: add empty() and size_bytes() utils: fragmented_temporary_buffer: add get_ostream() idl: serializer: don't assume Iterator::value_type is bytes_view idl: serializer: create buffer view from streams utils: crc: accept FragmentRange	2018-09-25 12:43:06 +03:00
Botond Dénes	eb357a385d	flat_mutation_reader: make timeout opt-out rather than opt-in Currently timeout is opt-in, that is, all methods that even have it default it to `db::no_timeout`. This means that ensuring timeout is used where it should be is completely up to the author and the reviewers of the code. As humans are notoriously prone to mistakes this has resulted in a very inconsistent usage of timeout, many clients of `flat_mutation_reader` passing the timeout only to some members and only on certain call sites. This is small wonder considering that some core operations like `operator()()` only recently received a timeout parameter and others like `peek()` didn't even have one until this patch. Both of these methods call `fill_buffer()` which potentially talks to the lower layers and is supposed to propagate the timeout. All this makes the `flat_mutation_reader`'s timeout effectively useless. To bring order to this chaos, make the timeout parameter a mandatory one on all `flat_mutation_reader` methods that need it. This ensures that humans now get a reminder from the compiler when they forget to pass the timeout. Clients can still opt out of passing a timeout by passing `db::no_timeout` (the previous default value) but this will now be explicit and developers should think before typing it. There were surprisingly few core call sites to fix up. Where a timeout was available nearby I propagated it to be able to pass it to the reader, where I couldn't I passed `db::no_timeout`. Authors of the latter kind of code (view, streaming and repair are some of the notable examples) should maybe consider propagating down a timeout if needed. In the test code (the vast majority of the changes) I just used `db::no_timeout` everywhere. Tests: unit(release, debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>	2018-09-20 11:31:24 +02:00
Paweł Dziepak	4469f76e7c	commitlog: switch to fragmented buffers So far commitlog was using contiguous buffers for storing the data that is about to be written to disk. It was able to coalesce small writes so that multiple small mutations would use the same buffer, but if a mutation was large the commitlog would attempt to allocate a single, appropriately large buffer. This excessively stresses the memory allocator and may cause memory fragmentation to become an issue. The solution is to use fixed-size buffers of 128 kB, which is the standard buffer size in Scylla, and keep large values fragmented.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	7c1add6769	commitlog: drop buffer pools Buffer pools were added in `7191a130bb` "Commitlog: recycle buffers to reduce fragmentation." They introduce a lot of complexity and will become unnecessary once the code is switched to use fixed-size 128kB buffers.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	9fee8b8d76	commitlog: drop recovery from bad alloc If a node cannot allocate a 128 kB buffer, it is already in very bad shape, so there isn't much value in trying to recover by attempting smaller allocations and it just adds more complexity to the segment allocation. It actually may be better to let some requests fail and give the node a chance to recover rather than trying to use every last byte of free memory and end up with bad_alloc in a noexcept context.	2018-09-18 17:22:59 +01:00
Paweł Dziepak	2e5b375309	utils: drop data_output	2018-09-18 17:22:59 +01:00
Paweł Dziepak	fe48aaae46	commitlog: use memory_output_stream memory_output_stream deals with all required pointer arithmetic and allows easy transition to fragmented buffers.	2018-09-18 17:22:59 +01:00
Tomasz Grabiec	cd201d1987	db/batchlog_manager: Do not return a value from timer callback Timer callbacks are std::function<void()>. Exposed by changing callback_t to noncopyable_function<>. Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>	2018-09-05 12:32:21 +03:00
Botond Dénes	6e59cee244	db::consistency_level::filter_for_query() add preferred_endpoints To the second overload (the one without read-repair related params) too.	2018-09-03 10:31:44 +03:00
Nadav Har'El	16a6f76873	materialized views: simplify do_delete_old_entry() In previous patches, we gave up on an old (and broken) attempt to track the timestamps of many unselected base-table columns through one row marker in the view table - and replaced them by "virtual cells", one per unselected cell. The do_delete_old_entry() function still contains old code which maintained that row marker, and is no longer needed. That old code is not only no longer needed, it also no longer did anything because all columns now appear in the view (as virtual columns) so the code ignored them when calculating the row marker. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180829131914.16042-1-nyh@scylladb.com>	2018-08-29 14:33:41 +01:00
Duarte Nunes	79d796e710	Merge 'Materialized Views: row liveness correction' from Nadav " When a view's partition key contains only columns from the base's partition key (and not an additional one), the liveness - existence or disappearance - of a view-table row is tied to the liveness of the base table row. And that, in turn, depends not only on selected columns (base-table columns SELECTed to also appear in the view) but also on unselected columns. This means that we may need to keep a view row alive even without data, just because some unselected column is alive in the base table. Before this patch set we tried to build a single "row marker" in the view column which tried to summarize the liveness information in all unselected columns. But this proved unworkable, as explained in issue #3362 and as will be demonstrated in unit tests at the end of this series. Because we can't replace several unselected cells by one row marker, what we do in this series is to add for each of the unselected cells a "virtual cell" which contains the cell's liveness information (timestamp, deletion, ttl) but not its value. For collections, we can't represent the entire collection by one virtual cell, and rather need a collection of virtual cells. Fixes #3362 " * 'virtual-cols-v3' of https://github.com/nyh/scylla: Materialized Views: test that virtual columns are not visible Materialized Views: unit test reproducing fixed issue #3362 Materialized Views: no need for elaborate row marker calculations Materialized Views: add unselected columns as virtual columns Materialized Views: fill virtual columns Do not allow selecting a virtual column schema: persist "view virtual" columns to a separate system table schema: add "view virtual" flag to schema's column_definition Add "empty" type name to CQL parser, but only for internal parsing	2018-08-29 14:32:38 +01:00
Tomasz Grabiec	10f6b125c8	database: Run system table flushes in the main scheduling group memtable flushes for system and regular region groups run under the memtable_scheduling_group, but the controller adjusts shares based on the occupancy of the regular region group. It can happen that regular is not under pressure, but system is. In this case the controller will incorrectly assign low shares to the memtable flush of system. This may result in high latency and low throughput for writes in the system group. I observed writes to the system keyspace timing out (on scylla-2.3-rc2) in the dtest: limits_test.py:TestLimits.max_cells_test, which went away after this. Fixes #3717. Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>	2018-08-23 15:07:05 +03:00
Nadav Har'El	6c00341383	Materialized Views: no need for elaborate row marker calculations Now that we have separate virtual cells to represent unselected columns in a materialized view, we no longer need the elaborate row-marker liveness calculations which aimed (but failed) to do the same thing. So that code can be removed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:45:41 +03:00

1168 Commits