scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 01:20:39 +00:00

Author	SHA1	Message	Date
Avi Kivity	fea5067dfa	Merge "Limit non-paged query memory consumption" from Botond " Non-paged queries completely ignore the query result size limiter mechanism. They consume all the memory they want. With sufficiently large datasets this can easily lead to a handful or even a single unpaged query producing an OOM. This series continues the work started by `134d5a5f7`, by introducing a configurable pair of soft/hard limits (defaulting to 1MB/100MB) that is applied to otherwise unlimited queries, like reverse and unpaged ones. When an unlimited query reaches the soft limit a warning is logged. This should give users some heads-up to adjust their application. When the hard limit is reached the query is aborted. The idea is to not greet users with failing queries after an upgrade while at the same time protecting the database from the really bad queries. The hard limit should be decreased from time to time, gradually approaching the desired goal of 1MB. We don't want to limit internal queries, we trust ourselves to either use another form of memory usage control, or read only small datasets. So the limit is selected according to the query class. User reads use the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration items, while internal reads are not limited. The limit is obtained by the coordinator, who passes it down to replicas using the existing `max_result_size` parameter (which is now a special type containing the two limits), which is now passed on every verb, instead of once per connection. This ensures that all replicas work with the same limits. For normal paged queries `max_result_size` is set to the usual `query::result_memory_limiter::maximum_result_size`. For queries that can consume an unlimited amount of memory -- unpaged and reverse queries -- this is set to the value of the aforementioned `max_memory_for_unlimited_query_{soft,hard}_limit` configuration item, but only for user reads, internal reads are not limited. 
This has the side-effect that reverse reads now send entire partitions in a single page, but this is not that bad. The data was already read, and its size was below the limit, the replica might as well send it all. Fixes: #5870 " * 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits) test: database_test: add test for enforced max result limit mutation_partition: abort read when hard limit is exceeded for non-paged reads query-result.hh: move the definition of short_read to the top test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit test: set the allow_short_read slice option for paged queries partition_slice_builder: add with_option() result_memory_accounter: remove default constructor query_*(): use the coordinator specified memory limit for unlimited queries storage_proxy: use read_command::max_result_size to pass max result size around query: result_memory_limiter: use the new max_result_size type query: read_command: add max_result_size query: read_command: use tagged ints for limit ctor params query: read_command: add separate convenience constructor service: query_pager: set the allow_short_read flag result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit storage_proxy: add get_max_result_size() result_memory_limiter: add unlimited_result_size constant database: add get_statement_scheduling_group() database: query_mutations(): obtain the memory accounter inside query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field ...	2020-07-29 13:41:53 +03:00
Avi Kivity	22fe38732d	Update tools/jmx and tools/java submodules * tools/java a9480f3a87...aa7898d771 (4): > dist: debian: do not require root during package build > cassandra-stress: Add serial consistency options > dist: debian: fix detection of debuild > bin tools: Use non-default `cassandra.config` * tools/jmx c0d9d0f...626fd75 (1): > dist: debian: do not require root during package build Fixes #6655.	2020-07-29 12:55:18 +03:00
Botond Dénes	3804dfcc0c	test: database_test: add test for enforced max result limit Two tests are added: one that works on the low-level database API, and another one that works on the CQL API.	2020-07-29 08:32:34 +03:00
Botond Dénes	f7a4d19fb1	mutation_partition: abort read when hard limit is exceeded for non-paged reads If the read is not paged (short read is not allowed) abort the query if the hard memory limit is reached. On reaching the soft memory limit a warning is logged. This should allow users to adjust their application code while at the same time protecting the database from the really bad queries. The enforcement happens inside the memory accounter and doesn't require cooperation from the result builders. This ensures memory limit set for the query is respected for all kind of reads. Previously non-paged reads simply ignored the memory accounter requesting the read to stop and consumed all the memory they wanted.	2020-07-29 08:32:31 +03:00
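The soft/hard enforcement described in the commit above can be sketched as follows. This is a minimal illustration, not Scylla's `result_memory_accounter`: the class name `memory_accounter`, its `update()` method, and the two-field `max_result_size` struct are stand-ins invented for the example, and the paged branch assumes both limits carry the same page-size value.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for the soft/hard limit pair.
struct max_result_size {
    uint64_t soft_limit;
    uint64_t hard_limit;
};

enum class stop_iteration { no, yes };

// Illustrative accounter: enforcement lives here, so result builders
// do not need to cooperate.
class memory_accounter {
    max_result_size _limits;
    bool _short_read_allowed;  // true for paged reads
    uint64_t _used = 0;
    bool _warned = false;
public:
    memory_accounter(max_result_size limits, bool short_read_allowed)
        : _limits(limits), _short_read_allowed(short_read_allowed) {
        assert(limits.soft_limit <= limits.hard_limit);
    }

    // Account for `bytes` more of result data.
    stop_iteration update(uint64_t bytes) {
        _used += bytes;
        if (_short_read_allowed) {
            // Paged read: cut the page short once the limit is exceeded.
            return _used > _limits.hard_limit ? stop_iteration::yes : stop_iteration::no;
        }
        if (_used > _limits.hard_limit) {
            // Non-paged read: abort instead of consuming unbounded memory.
            throw std::runtime_error("hard memory limit exceeded by unpaged query");
        }
        if (_used > _limits.soft_limit && !_warned) {
            _warned = true;
            std::cerr << "warning: unpaged query exceeded soft memory limit\n";
        }
        return stop_iteration::no;
    }

    bool warned() const { return _warned; }
    uint64_t used() const { return _used; }
};
```

With limits of `{10, 100}` and short reads disallowed, accounting 15 bytes only logs the warning, while crossing 100 bytes throws and the read is abandoned.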
Rafael Ávila de Espíndola	c4cb3817cf	build: Use -fdata-sections and -ffunction-sections This is a 4.2% reduction in the scylla text size, from 38975956 to 37404404 bytes. When benchmarking perf_simple_query without --shuffle-sections, there is no performance difference. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200724032504.3004-1-espindola@scylladb.com>	2020-07-28 19:39:26 +03:00
Botond Dénes	02a7492d62	query-result.hh: move the definition of short_read to the top It will be used by `result_memory_{limiter,accounter}` soon.	2020-07-28 18:00:29 +03:00
Botond Dénes	43c0da4b63	test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit To an unlimited value, in order to avoid aborting any unpaged queries executed by tests, that would exceed the default result limit of 1MB/100MB.	2020-07-28 18:00:29 +03:00
Botond Dénes	648ce473ab	test: set the allow_short_read slice option for paged queries Some tests use the lower level methods directly and meant to use paging but didn't and nobody noticed. This was revealed by the enforcement of max result size (introduced in a later patch), which caused these tests to fail due to exceeding the max result size. This patch fixes this by setting the `allow_short_reads` slice option.	2020-07-28 18:00:29 +03:00
Botond Dénes	d27f8321d7	partition_slice_builder: add with_option()	2020-07-28 18:00:29 +03:00
Botond Dénes	6660a5df51	result_memory_accounter: remove default constructor If somebody wants to bypass proper memory accounting they should at the very least be forced to consider if that is indeed wise and think a second about the limit they want to apply.	2020-07-28 18:00:29 +03:00
Botond Dénes	9eab5bca27	query_*(): use the coordinator specified memory limit for unlimited queries It is important that all replicas participating in a read use the same memory limits to avoid artificial differences due to different amount of results. The coordinator now passes down its own memory limit for reads, in the form of max_result_size (or max_size). For unpaged or reverse queries this has to be used now instead of the locally set max_memory_unlimited_query configuration item. To avoid the replicas accidentally using the local limit contained in the `query_class_config` returned from `database::make_query_class_config()`, we refactor the latter into `database::get_reader_concurrency_semaphore()`. Most of its callers were only interested in the semaphore only anyway and those that were interested in the limit as well should get it from the coordinator instead, so this refactoring is a win-win.	2020-07-28 18:00:29 +03:00
Botond Dénes	159d37053d	storage_proxy: use read_command::max_result_size to pass max result size around Use the recently added `max_result_size` field of `query::read_command` to pass the max result size around, including passing it to remote nodes. This means that the max result size will be sent along each read, instead of once per connection. As we want to select the appropriate `max_result_size` based on the type of the query as well as based on the query class (user or internal) the previous method won't do anymore. If the remote doesn't fill this field, the old per-connection value is used.	2020-07-28 18:00:29 +03:00
Botond Dénes	fbbbc3e05c	query: result_memory_limiter: use the new max_result_size type	2020-07-28 18:00:29 +03:00
Botond Dénes	92a7b16cba	query: read_command: add max_result_size This field will replace max size which is currently passed once per established rpc connection via the CLIENT_ID verb and stored as an auxiliary value on the client_info. For now it is unused, but we update all sites creating a read command to pass the correct value to it. In the next patch we will phase out the old max size and use this field to pass max size on each verb instead.	2020-07-28 18:00:29 +03:00
Botond Dénes	8992bcd1f8	query: read_command: use tagged ints for limit ctor params The convenience constructor of read_command now has two integer parameters next to each other. In the next patch we intend to add another one. This is a recipe for disaster, so to avoid mistakes this patch converts these parameters to tagged integers. This makes sure callers pass what they meant to pass. As a matter of fact, while fixing up call sites, I already found several passing `query::max_partitions` to the `row_limit` parameter. No harm done yet, as `query::max_partitions` == `query::max_rows`, but this shows just how easy it is to mix up parameters with the same type.	2020-07-28 18:00:29 +03:00
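A minimal sketch of the tagged-integer idea from the commit above. The `tagged_integer` template, the tag structs, and the trimmed-down `read_command` are all hypothetical; Scylla's real helper and constructor signature may differ.

```cpp
#include <cstdint>
#include <type_traits>

// Wraps an integer in a distinct type per Tag, so two limits with the
// same underlying type cannot be swapped silently.
template <typename Tag, typename T = uint64_t>
class tagged_integer {
    T _value;
public:
    explicit constexpr tagged_integer(T value) : _value(value) {}
    constexpr T value() const { return _value; }
};

struct row_limit_tag {};
struct partition_limit_tag {};
using row_limit = tagged_integer<row_limit_tag>;
using partition_limit = tagged_integer<partition_limit_tag>;

// Hypothetical, trimmed-down read command with two adjacent limits.
struct read_command {
    uint64_t rows;
    uint64_t partitions;
    read_command(row_limit r, partition_limit p)
        : rows(r.value()), partitions(p.value()) {}
};

// Passing the limits in the wrong order, or as bare integers, is now a
// compile-time error rather than a latent bug.
static_assert(!std::is_constructible_v<read_command, partition_limit, row_limit>);
static_assert(!std::is_constructible_v<read_command, uint64_t, uint64_t>);
```

A call site then reads `read_command cmd(row_limit(100), partition_limit(5));`, which documents at the point of use which limit is which.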
Botond Dénes	2ca118b2d5	query: read_command: add separate convenience constructor query::read_command currently has a single constructor, which serves both as an idl constructor (order of parameters is fixed) and a convenience one (most parameters have default values). This makes it very error prone to add new parameters that everyone should fill. The new parameter has to be added last, with a default value, as the previous ones have a default value as well. This means the compiler's help cannot be enlisted to make sure all usages are updated. This patch adds a separate convenience constructor to be used by normal code. The idl constructor loses all default parameters. New parameters can be added at any position in the convenience constructor (to force users to fill in a meaningful value), while the removal of default parameters from the idl constructor means code cannot accidentally use it without noticing.	2020-07-28 18:00:29 +03:00
Botond Dénes	1615fe4c5e	service: query_pager: set the allow_short_read flag All callers should set this already before passing the slice to the pager, however not all actually do (e.g. `cql3::indexed_table_select_statement::read_posting_list()`). Instead of auditing each call site, just make sure this is set in the pager itself. If someone is creating a pager we can be sure they mean to use paging.	2020-07-28 18:00:29 +03:00
Botond Dénes	989142464c	result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit The use of the global `result_memory_limiter::maximum_result_size` is probably a leftover from before the `_maximum_result_size` member was introduced (`aa083d3d85`).	2020-07-28 18:00:29 +03:00
Botond Dénes	9eb6d704b2	storage_proxy: add get_max_result_size() Meant to be used by the coordinator node to obtain the max result size applicable to the query-class (determined based on the current scheduling group). For normal paged queries the previously used `query::result_memory_limiter::maximum_result_size` is used uniformly. For reverse and unpaged queries, a query class dependent value is used. For user reads, the value of the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration items is used, for other classes no limit is used (`query::result_memory_limiter::unlimited_result_size`).	2020-07-28 18:00:29 +03:00
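The selection rule described in the commit above could look roughly like this. It is a sketch under stated assumptions: the free function, the `query_class` enum, and the `limits_config` struct are invented for illustration, and the 1MB value for `maximum_result_size` is assumed from the defaults mentioned elsewhere in this log.

```cpp
#include <cstdint>
#include <limits>

struct max_result_size {
    uint64_t soft_limit;
    uint64_t hard_limit;
};

enum class query_class { user, internal };

// Assumed values, named after the constants mentioned in the log.
constexpr uint64_t maximum_result_size = 1 * 1024 * 1024;
constexpr uint64_t unlimited_result_size = std::numeric_limits<uint64_t>::max();

// Hypothetical holder for max_memory_for_unlimited_query_{soft,hard}_limit.
struct limits_config {
    uint64_t soft_limit;
    uint64_t hard_limit;
};

// `unlimited_query` is true for unpaged and reverse queries.
max_result_size get_max_result_size(bool unlimited_query, query_class cls,
                                    const limits_config& cfg) {
    if (!unlimited_query) {
        // Normal paged queries always use the regular page size limit.
        return {maximum_result_size, maximum_result_size};
    }
    if (cls == query_class::user) {
        // User reads get the configurable soft/hard pair.
        return {cfg.soft_limit, cfg.hard_limit};
    }
    // Internal reads are not limited.
    return {unlimited_result_size, unlimited_result_size};
}
```

Because the coordinator computes this once and sends it with each read, every replica applies the same pair regardless of its local configuration.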
Botond Dénes	c364c7c6a2	result_memory_limiter: add unlimited_result_size constant To be used as the max result size for internal queries.	2020-07-28 18:00:29 +03:00
Botond Dénes	a64d9b8883	database: add get_statement_scheduling_group()	2020-07-28 18:00:29 +03:00
Botond Dénes	d5cc932a0b	database: query_mutations(): obtain the memory accounter inside Instead of requesting callers to do it and pass it as a parameter. This is in line with data_query().	2020-07-28 18:00:29 +03:00
Botond Dénes	92ce39f014	query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field We want to switch from using a single limit to a dual soft/hard limit. As a first step we switch the limit field of `query_class_config` to use the recently introduced type for this. As this field has a single user at the moment -- reverse queries (and not a lot of propagation) -- we update it in this same patch to use the soft/hard limit: warn on reaching the soft limit and abort on the hard limit (the previous behaviour).	2020-07-28 18:00:29 +03:00
Botond Dénes	8aee7662a9	query: introduce max_result_size To be used to pass around the soft/hard limit configured via `max_memory_for_unlimited_query_{soft,hard}_limit` in the codebase.	2020-07-28 18:00:29 +03:00
Botond Dénes	517a941feb	query_class_config: move into the query namespace It belongs there, its name even starts with "query".	2020-07-28 18:00:29 +03:00
Botond Dénes	46d5b651eb	db/config: introduce max_memory_for_unlimited_query_soft_limit and max_memory_for_unlimited_query_hard_limit This pair of limits replace the old max_memory_for_unlimited_query one, which remains as an alias to the hard limit. The soft limit inherits the previous value of the limit (1MB), when this limit is reached a warning will be logged allowing the users to adjust their client codes without downtime. The hard limit starts out with a more permissive default of 100MB. When this is reached queries are aborted, the same behaviour as with the previous single limit. The idea is to allow clients a grace period for fixing their code, while at the same time protecting the database from the really bad queries.	2020-07-28 18:00:29 +03:00
Botond Dénes	9faaf46d4b	utils: config_src::add_command_line_options(): drop name and desc args Now that there are no ad-hoc aliases needing to overwrite the name and description parameter of this method, we can drop these and have each config item just use `name()` and `desc()` to access these.	2020-07-28 18:00:29 +03:00
Botond Dénes	dc23736d0c	db/config: replace ad-hoc aliases with alias mechanism We already use aliases for some configuration items, although these are created with an ad-hoc mechanism that only registers them on the command line. Replace this with the built-in alias mechanism introduced in the previous patch, which has the benefit of conflict resolution and also works with YAML.	2020-07-28 18:00:29 +03:00
Botond Dénes	003f5e9e54	utils: config: add alias support Allow configuration items to also have an alias, besides the name. This allows easy replacement of configuration items, with newer names, while still supporting the old name for backward compatibility. The alias mechanism takes care of registering both the name and the alias as command line arguments, as well as parsing them from YAML. The command line documentation of the alias will just refer to the name for documentation.	2020-07-28 17:59:51 +03:00
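One way to picture the alias mechanism from the commit above is a registry in which both the name and the alias resolve to the same item. The `config_registry` and `config_item` types below are hypothetical and far simpler than Scylla's `utils::config_file`; they only illustrate shared lookup and conflict detection.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical configuration item with an optional alias.
struct config_item {
    std::string name;
    std::string alias;
    std::string value;
};

class config_registry {
    // Both the name and the alias map to the same shared item.
    std::map<std::string, std::shared_ptr<config_item>> _by_key;

    void register_key(const std::string& key, std::shared_ptr<config_item> item) {
        if (!_by_key.emplace(key, std::move(item)).second) {
            throw std::invalid_argument("duplicate configuration key: " + key);
        }
    }
public:
    void add(std::string name, std::string alias, std::string default_value) {
        auto item = std::make_shared<config_item>(
            config_item{name, alias, std::move(default_value)});
        register_key(name, item);
        if (!alias.empty()) {
            register_key(alias, item);
        }
    }

    // Setting via either key updates the one underlying item.
    void set(const std::string& key, std::string value) {
        _by_key.at(key)->value = std::move(value);
    }

    const std::string& get(const std::string& key) const {
        return _by_key.at(key)->value;
    }
};
```

Registering `max_memory_for_unlimited_query_hard_limit` with the alias `max_memory_for_unlimited_query` would then let an old configuration file keep working, since a value set through the old name is visible through the new one.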
Raphael S. Carvalho	99b75d1f63	compaction: Improve compaction efficiency by killing the procedure that trims jobs This procedure consists of trimming SSTables off a compaction job until its weight[1] is smaller than one already taken by a running compaction. Min threshold is respected though, we only trim a job while its size is > min threshold. [1]: this value is a logarithmic function of the total size of the SSTables in a given job, and it's used to control the compaction parallelism. It's intended to improve the compaction efficiency by allowing more jobs to run in parallel, but it turns out that this can have the opposite effect because the write amplification can be significantly increased. Take STCS for example, the more similar-sized SSTables you compact together, the higher the compaction efficiency will be. With the trimming procedure, we're aiming at running smaller jobs, thinking that running more parallel compactions will provide us with better performance, but that's not true. Most of the efficiency comes from making informed decisions when selecting candidates for compaction. Similarly, this will also hurt TWCS, which does STCS in the current window, and a sort of major compaction when the current window closes. If the TWCS jobs are trimmed, we'll likely need another compaction to get to the desired state, recompacting the same data again. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200728143648.31349-1-raphaelsc@scylladb.com>	2020-07-28 17:44:00 +03:00
Takuya ASADA	d7de9518fe	scylla_setup: skip boot partition On GCE, /dev/sda14 is reported as an unused disk but it is the BIOS boot partition. It should not be used as a scylla data partition, and it cannot be used for that anyway since it is too small. It's better to exclude such partitions from the unused disk list. Fixes #6636	2020-07-28 12:19:55 +03:00
Asias He	e6f640441a	repair: Fix race between create_writer and wait_for_writer_done We saw scylla hit use after free in repair with the following procedure during tests: - n1 and n2 in the cluster - n2 ran decommission - n2 sent data to n1 using repair - n2 was forcibly killed - n1 tried to remove repair_meta for n1 - n1 hit use after free on repair_meta object This was what happened on n1: 1) data was received -> do_apply_rows was called -> yield before create_writer() was called 2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called with _writer_done[node_idx] not engaged 3) step 1 resumed, create_writer() was called and _repair_writer object was referenced 4) repair_meta::stop() finished, repair_meta object and its member _repair_writer were destroyed 5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object To fix, we should call wait_for_writer_done() only after all pending operations protected by repair_meta::_gate are done. This prevents wait_for_writer_done() from finishing while the writer is still in the process of being created. Fixes: #6853 Fixes: #6868 Backports: 4.0, 4.1, 4.2	2020-07-28 11:53:40 +03:00
Asias He	bdaf904864	storage_service: Improve log on removing pending replacing node The log "removing pending replacing node" is printed whenever a node jumps to normal status including a normal restart. For example, on node1, we saw the following when node2 restarts. [shard 0] storage_service - Node 127.0.0.2 state jump to normal [shard 0] storage_service - Remove node 127.0.0.2 from pending replacing endpoint This is confusing since no node is really being replaced. To fix, log only if a node is really removed from the pending replacing nodes. In addition, since do_remove_node will call del_replacing_endpoint, there is no need to call del_replacing_endpoint again in storage_service::handle_state_normal after do_remove_node. Fixes #6936	2020-07-28 11:51:22 +03:00
Piotr Sarna	ee35c4c3d6	db: handle errors when loading view build progress Currently, encountering an error when loading view build progress would result in view builder refusing to start - which also means that future views would not be built until the server restarts. A more user-friendly solution would be to log an error message, but continue to boot the view builder as if no views are currently in progress, which would at least allow future views to be built correctly. The test case is also amended, since now it expects the call to return that "no view builds are in progress" instead of an exception. Fixes #6934 Tests: unit(dev) Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>	2020-07-28 11:32:09 +03:00
Piotr Sarna	0dbcaa1fd9	test: add a case for disengaged optional values in system tables Following the patch which fixes incorrect access to disengaged optionals, a test case which used to reproduce the problem is added. Message-Id: <99174d47c1c55ed8730b4998d5e5e464990d36e3.1595834092.git.sarna@scylladb.com>	2020-07-28 10:06:42 +03:00
Piotr Sarna	43a3719fe4	cql3: fix potential segfault on disengaged optional In untyped_result_set::get_view, there exists a silent assumption that the underlying data, which is an optional, to always be engaged. In case the value happens to be disengaged it may lead to creating an incorrect bytes view from a disengaged optional. In order to make the code safer (since values parsed by this code often come from the network and can contain virtually anything) a segfault is replaced with an exception, by calling optional's value() function, which throws when called on disengaged optionals. Fixes #6915 Tests: unit(dev) Message-Id: <6e9e4ca67e0e17c17b718ab454c3130c867684e2.1595834092.git.sarna@scylladb.com>	2020-07-28 10:06:00 +03:00
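The difference the fix above relies on is that `std::optional::value()` throws `std::bad_optional_access` on a disengaged optional, whereas dereferencing with `operator*` is undefined behaviour. A minimal sketch, with `bytes`, `bytes_opt`, and this free-standing `get_view()` as simplified stand-ins for the real types in `untyped_result_set`:

```cpp
#include <optional>
#include <string>
#include <string_view>

// Simplified stand-ins for the real byte container types.
using bytes = std::string;
using bytes_opt = std::optional<bytes>;

// Returns a view over the stored bytes. value() throws
// std::bad_optional_access if `data` is disengaged, so a malformed
// input surfaces as an exception instead of an invalid view.
std::string_view get_view(const bytes_opt& data) {
    return std::string_view(data.value());
}
```

A caller can then handle `std::bad_optional_access` like any other error, which matters when the data being parsed arrives from the network.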
Raphael S. Carvalho	0d70efa58e	sstable: index_reader: Make sure streams are all properly closed on failure Turns out the fix `f591c9c710` wasn't enough to make sure all input streams are properly closed on failure. It only closes the main input stream that belongs to context, but it misses all the input streams that can be opened in the consumer for promote index reading. Consumer stores a list of indexes, where each of them has its own input stream. On failure, we need to make sure that every single one of them is properly closed before destroying the indexes as that could cause memory corruption due to read ahead. Fixes #6924. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>	2020-07-28 10:01:44 +03:00
Nadav Har'El	a7df8486b1	alternator test: add test for tracing In commit `8d27e1b`, we added tracing (see docs/tracing.md) support to Alternator requests. However, we never had a functional test that verifies this feature actually works as expected, and we recently noticed that for the GetItem and BatchGetItem requests, the trace doesn't really work (it returns an empty list of events). So this patch adds a test, test/alternator/test_tracing.py, which verifies that the tracing feature works for the PutItem, GetItem, DeleteItem, UpdateItem, BatchGetItem, BatchWriteItem, Query and Scan operations. This test is very peculiar. It needs to use out-of-band REST API requests to enable and disable tracing (of course, the test is skipped when running against AWS - this is a Scylla-only feature). It also needs to read CQL-only system tables and does this using Alternator's ".scylla.alternator" interface for system tables - which came through for us here beautifully and demonstrated their usefulness. I paid a lot of attention for this test to remain reasonably fast - this entire test now runs in a little less than one second. Achieving this while testing eight different requests was a bit of a challenge, because traces take time until they are visible in the trace table. This is the main reason why in this patch the tests for all eight request types are done in one test, instead of eight separate tests. Fixes #6891 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200727115401.1199024-1-nyh@scylladb.com>	2020-07-27 14:31:45 +02:00
Takuya ASADA	97fa17b17b	scylla_setup: remove square bracket from disk prompt selected list The selected list on the disk prompt looks like a list of alternatives; it's better to use single quotes. Fixes #6760	2020-07-27 14:50:31 +03:00
Avi Kivity	3f84d41880	Merge "messaging: make verb handler registering independent of current scheduling group" from Botond " `0c6bbc8` refactored `get_rpc_client_idx()` to select different clients for statement verbs depending on the current scheduling group. The goal was to allow statement verbs to be sent on different connections depending on the current scheduling group. The new connections use per-connection isolation. For backward compatibility the already existing connections fall-back to per-handler isolation used previously. The old statement connection, called the default statement connection, also used this. `get_rpc_client_idx()` was changed to select the default statement connection when the current scheduling group is the statement group, and a non-default connection otherwise. This inadvertently broke `scheduling_group_for_verb()` which also used this method to get the scheduling group to be used to isolate a verb at handle register time. This method needs the default client idx for each verb, but if verb registering is run under the system group it instead got the non-default one, resulting in the per-handler isolation not being set-up for the default statement connection, resulting in default statement verb handlers running in whatever scheduling group the process loop of the rpc is running in, which is the system scheduling group. This caused all sorts of problems, even beyond user queries running in the system group. Also as of `0c6bbc8` queries on the replicas are classified based on the scheduling group they are running on, so user reads also ended up using the system concurrency semaphore. In particular this caused severe problems with ranges scans, which in some cases ended up using different semaphores per page resulting in a crash. 
This could happen because when the page was read locally the code would run in the statement scheduling group, but when the request arrived from a remote coordinator via rpc, it was read in a system scheduling group. This caused a mismatch between the semaphore the saved reader was created with and the one the new page was read with. The result was that in some cases when looking up a paused reader from the wrong semaphore, a reader belonging to another read was returned, creating a disconnect between the lifecycle between readers and that of the slice and range they were referencing. This series fixes the underlying problem of the scheduling group influencing the verb handler registration, as well as adding some additional defenses if this semaphore mismatch ever happens in the future. Inactive read handles are now unique across all semaphores, meaning that it is not possible anymore that a handle succeeds in looking up a reader when used with the wrong semaphore. The range scan algorithm now also makes sure there is no semaphore mismatch between the one used for the current page and that of the saved reader from the previous page. I manually checked that each individual defense added is already preventing the crash from happening. Fixes: #6613 Fixes: #6907 Fixes: #6908 Tests: unit(dev), manual(run the crash reproducer, observe no crash) " * 'query-classification-regressions/v1' of https://github.com/denesb/scylla: multishard_mutation_query: use cached semaphore messaging: make verb handler registering independent of current scheduling group multishard_mutation_query: validate the semaphore of the looked-up reader reader_concurrency_semaphore: make inactive read handles unique across semaphores reader_concurrency_semaphore: add name() accessor reader_concurrency_semaphore: allow passing name to no-limit constructor	2020-07-27 13:56:52 +03:00
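The "handles unique across semaphores" defense described above can be sketched by having each handle remember which semaphore issued it. Everything here is illustrative: `reader_semaphore`, `inactive_read_handle`, and the use of a `std::string` in place of a paused reader are assumptions made for the example, not the actual `reader_concurrency_semaphore` interface.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

class reader_semaphore;

// A handle records its issuing semaphore, so equal ids from different
// semaphores can never be confused.
struct inactive_read_handle {
    const reader_semaphore* owner = nullptr;
    uint64_t id = 0;
};

class reader_semaphore {
    std::string _name;
    uint64_t _next_id = 1;
    std::map<uint64_t, std::string> _inactive;  // id -> paused reader (stand-in)
public:
    explicit reader_semaphore(std::string name) : _name(std::move(name)) {}
    reader_semaphore(const reader_semaphore&) = delete;
    reader_semaphore& operator=(const reader_semaphore&) = delete;

    const std::string& name() const { return _name; }

    inactive_read_handle register_inactive_read(std::string reader) {
        auto id = _next_id++;
        _inactive.emplace(id, std::move(reader));
        return {this, id};
    }

    // Looking up a handle issued by another semaphore yields nothing,
    // instead of returning an unrelated reader that shares the id.
    std::optional<std::string> unregister_inactive_read(inactive_read_handle h) {
        if (h.owner != this) {
            return std::nullopt;
        }
        auto it = _inactive.find(h.id);
        if (it == _inactive.end()) {
            return std::nullopt;
        }
        auto reader = std::move(it->second);
        _inactive.erase(it);
        return reader;
    }
};
```

In this sketch two semaphores both hand out id 1 for their first reader, which is exactly the situation in which a lookup on the wrong semaphore would previously have returned a reader belonging to another read.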
Nadav Har'El	9080709c56	docs: add paragraph to tracing.md Issue #6919 was caused by an incorrect assumption: I assumed that if we see the tracing session record, we can be sure that the event records for this session had already been written. In this patch we add a paragraph to the tracing documentation - docs/tracing.md, which explains that this assumption is in fact incorrect: 1. On a multi-node setup, replicas may continue to write tracing events after the coordinator "finished" (moved to background) the request and wrote the session record. 2. Even on a single-node setup, the writes of the session record and the individual events are asynchronous, and can happen in an unexpected order (which is what happened in issue #6919). Refs #6919. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200727102438.1194314-1-nyh@scylladb.com>	2020-07-27 13:38:57 +03:00
Takuya ASADA	0ffa0e8745	dist_util.py: use correct ID value to detect Amazon Linux 2 In `2d63acdd6a` we replaced 'ol' and 'amzn' with 'oracle' and 'amazon', but distro.id() actually returns 'amzn' for Amazon Linux 2, so we need to revert the change. Fixes #6882	2020-07-27 12:46:21 +03:00
Botond Dénes	eeeef0a0f1	multishard_mutation_query: use cached semaphore Instead of requesting the query class config from the database every time the semaphore is needed, use the cached one by calling `semaphore()`.	2020-07-27 12:17:22 +03:00
Nadav Har'El	65f75e3862	alternator test: enable test_get_records After issue #6864 was fixed, the test_streams.py::test_get_records test no longer fails, so its "xfail" marker can be removed. Refs #6864. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200722132518.1077882-1-nyh@scylladb.com>	2020-07-27 09:19:37 +02:00
Nadav Har'El	f488eaebaf	merge: db/view: view_update_generator: make staging reader evictable Merged patch set by Botond Dénes: The view update generation process creates two readers. One is used to read the staging sstables, the data which needs view updates to be generated for, and another reader for each processed mutation, which reads the current value (pre-image) of each row in said mutation. The staging reader is created first and is kept alive until all staging data is processed. The pre-image reader is created separately for each processed mutation. The staging reader is not restricted, meaning it does not wait for admission on the relevant reader concurrency semaphore, but it does register its resource usage on it. The pre-image reader however is restricted. This creates a situation, where the staging reader possibly consumes all resources from the semaphore, leaving none for the later created pre-image reader, which will not be able to start reading. This will block the view building process meaning that the staging reader will not be destroyed, causing a deadlock. This patch solves this by making the staging reader restricted and making it evictable. To prevent thrashing -- evicting the staging reader after reading only a really small partition -- we only make the staging reader evictable after we have read at least 1MB worth of data from it. 
test/boost: view_build_test: add test_view_update_generator_buffering test/boost: view_build_test: add test test_view_update_generator_deadlock reader_permit: reader_resources: add operator- and operator+ reader_concurrency_semaphore: add initial_resources() test: cql_test_env: allow overriding database_config mutation_reader: expose new_reader_base_cost db/view: view_updating_consumer: allow passing custom update pusher db/view: view_update_generator: make staging reader evictable db/view: view_updating_consumer: move implementation from table.cc to view.cc database: add make_restricted_range_sstable_reader() Signed-off-by: Botond Dénes <bdenes@scylladb.com> --- db/view/view_updating_consumer.hh \| 51 ++++++++++++++++++++++++++++--- db/view/view.cc \| 39 +++++++++++++++++------ db/view/view_update_generator.cc \| 19 +++++++++--- 3 files changed, 91 insertions(+), 18 deletions(-)	2020-07-27 09:19:37 +02:00
Botond Dénes	fe127a2155	sstables: clamp estimated_partitions to [1, +inf) in writers In some cases the estimated number of partitions can be 0, which, although a legitimate estimation result, breaks a lot of low-level sstable writer code, so some of it has assertions to ensure estimated partitions is > 0. To avoid hitting this assert all users of the sstable writers do the clamping, to ensure estimated partitions is at least 1. However leaving this to the callers is error prone, as #6913 has shown. As this clamping is standard practice, it is better to do it in the writers themselves, avoiding this problem altogether. This is exactly what this patch does. It also adds two unit tests, one that reproduces the crash in #6913, and another one that ensures all sstable writers are fine with estimated partitions being 0 now. Call sites previously doing the clamping are changed to not do it, as it is unnecessary now that the writer does it itself. Fixes #6913 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>	2020-07-27 09:19:37 +02:00
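The clamping described above amounts to a one-line lower bound applied inside the writer rather than at each call site. A minimal sketch, where the function name `clamp_estimated_partitions` is hypothetical:

```cpp
#include <algorithm>
#include <cstdint>

// Clamp an estimate to [1, +inf): a legitimate estimate of 0 becomes 1,
// so downstream code that requires a positive count keeps working.
uint64_t clamp_estimated_partitions(uint64_t estimated_partitions) {
    return std::max<uint64_t>(estimated_partitions, 1);
}
```

Doing this once in the writer means a new caller that forgets the clamp can no longer trip the assertion.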
Avi Kivity	91619d77a1	Merge "Simplify the lifetime management of write monitors" from Raphael " This makes sure that monitors are always owned by the same struct that owns the monitored writer, simplifying the lifetime management. This hopefully fixes some of the crashes we have observed around this area. " * 'espindola/use-compaction_writer-v6' of https://github.com/espindola/scylla: sstables: Rename _writer to _compaction_writer sstables: Move compaction_write_monitor to compaction_writer sstables: Add couple of writer() getters to garbage_collected_sstable_writer sstables: Move compaction_write_monitor earlier in the file	2020-07-27 09:19:37 +02:00
Dejan Mircevski	c11b2de84c	cql3: Fix tombstone-range check for TRUE A DELETE statement checks that the deletion range is symmetrically bounded. This check was broken for expression TRUE. Test the fix by setting initial_key_restrictions::expression to TRUE, since CQL doesn't currently allow WHERE TRUE. That change has been proposed anyway in feedback to #5763: https://github.com/scylladb/scylla/pull/5763#discussion_r443213343 Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-07-27 09:19:37 +02:00
Dejan Mircevski	ba74659f5a	cql/restrictions: Constrain to_sorted_vector As requested in #5763 feedback, enforce the function's assumptions with concept asserts. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-07-27 09:19:37 +02:00
Botond Dénes	0df4c2fd3b	messaging: make verb handler registering independent of current scheduling group `0c6bbc8` refactored `get_rpc_client_idx()` to select different clients for statement verbs depending on the current scheduling group. The goal was to allow statement verbs to be sent on different connections depending on the current scheduling group. The new connections use per-connection isolation. For backward compatibility the already existing connections fall-back to per-handler isolation used previously. The old statement connection, called the default statement connection, also used this. `get_rpc_client_idx()` was changed to select the default statement connection when the current scheduling group is the statement group, and a non-default connection otherwise. This inadvertently broke `scheduling_group_for_verb()` which also used this method to get the scheduling group to be used to isolate a verb at handle register time. This method needs the default client idx for each verb, but if verb registering is run under the system group it instead got the non-default one, resulting in the per-handler isolation not being set-up for the default statement connection, resulting in default statement verb handlers running in whatever scheduling group the process loop of the rpc is running in, which is the system scheduling group. This caused all sorts of problems, even beyond user queries running in the system group. Also as of `0c6bbc8` queries on the replicas are classified based on the scheduling group they are running on, so user reads also ended up using the system concurrency semaphore.	2020-07-27 10:11:21 +03:00


22952 Commits