scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 11:55:15 +00:00

Author	SHA1	Message	Date
Yaron Kaikov	36a4eba22e	scylla_io_setup: configure "aio-max-nr" before iotune On several instance types in AWS and Azure, we get the following failure during the scylla_io_setup process: ``` ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application ``` We have scylla_prepare:configure_io_slots() running before the scylla-server.service start, but scylla_io_setup takes place before that. 1) Let's move configure_io_slots() to scylla_util.py since both scylla_io_setup and scylla_prepare import functions from it 2) cleanup scylla_prepare since we don't need the same function twice 3) Let's use configure_io_slots() during scylla_io_setup to avoid such failure Fixes: #8587 Closes #8512 Refs: #8713 (cherry picked from commit `588a065304`)	2021-06-07 16:29:38 +03:00
Nadav Har'El	e9b1f10654	Update tools/java submodule with backported patches * tools/java 6ca351c221...aab793d9f5 (2): > nodetool: alternate way to specify table name which includes a dot > nodetool: do no treat table name with dot as a secondary index Fixes #6521 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-07 09:38:53 +03:00
Nadav Har'El	6057be3f42	alternator: fix equality check of nested document containing a set In issue #5021 we noticed that the equality check in Alternator's condition expressions needs to handle sets differently - we need to compare the set's elements ignoring their order. But the implementation we added to fix that issue was only correct when the entire attribute was a set... In the general case, an attribute can be a nested document, with only some inner set. The equality-checking function needs to traverse this nested document, and compare the sets inside it as appropriate. This is what we do in this patch. This patch also adds a new test comparing equality of a nested document with some inner sets. This test passes on DynamoDB, failed on Alternator before this patch, and passes with this patch. Refs #5021 Fixes #8514 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419184840.471858-1-nyh@scylladb.com> (cherry picked from commit `dae7528fe5`)	2021-06-07 09:10:08 +03:00
Nadav Har'El	673f823d8b	alternator: fix inequality check of two sets In issue #5021 we noted that Alternator's equality operator needs to be fixed for the case of comparing two sets, because the equality check needs to take into account the possibility of different element order. Unfortunately, we fixed only the equality check operator, but forgot there is also an inequality operator! So in this patch we fix the inequality operator, and also add a test for it that was previously missing. The implementation of the inequality operator is trivial - it's just the negation of the equality test. Our pre-existing tests verify that this is the correct implementation (e.g., if attribute x doesn't exist, then "x = 3" is false but "x <> 3" is true). Refs #5021 Fixes #8513 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419141450.464968-1-nyh@scylladb.com> (cherry picked from commit `50f3201ee2`)	2021-06-07 08:45:54 +03:00
Nadav Har'El	0082968bd8	alternator: fix equality check of two unset attributes When a condition expression (ConditionExpression, FilterExpression, etc.) checks for equality of two item attributes, i.e., "x = y", and when one of these attributes was missing we correctly returned false. However, we also need to return false when both attributes are missing in the item, because this is what DynamoDB does in this case. In other words an unset attribute is never equal to anything - not even to another unset attribute. This was not happening before this patch: When x and y were both missing attributes, Alternator incorrectly returned true for "x = y", and this patch fixes this case. It also fixes "x <> y" which should be true when both x and y are unset (but was false before this patch). The other comparison operators - <, <=, >, >=, BETWEEN, were all implemented correctly even before this patch. This patch also includes tests for all the two-unset-attribute cases of all the operators listed above. As usual, we check that these tests pass on both DynamoDB and Alternator to confirm our new behavior is the correct one - before this patch, two of the new tests failed on Alternator and passed on DynamoDB. Fixes #8511 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419123911.462579-1-nyh@scylladb.com> (cherry picked from commit `46448b0983`)	2021-06-06 16:28:27 +03:00
Takuya ASADA	542cd7aff1	scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem Currently, var-lib-scylla.mount may fail because it can start before the MDRAID volume is initialized. We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for the device to become available, but the systemd manual says it automatically configures the dependency for a mount unit when we specify the filesystem path by "absolute path of a device node". So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>. Fixes #8279 Closes #8681 (cherry picked from commit `3d307919c3`)	2021-05-24 17:24:07 +03:00
Raphael S. Carvalho	2b29568bf4	sstables/mp_row_consumer: Fix unbounded memory usage when consuming a large run of partition tombstones mp_row_consumer will not stop consuming large run of partition tombstones, until a live row is found which will allow the consumer to stop proceeding. So partition tombstones, from a large run, are all accumulated in memory, leading to OOM and stalls. The fix is about stopping the consumer if buffer is full, to allow the produced fragments to be consumed by sstable writer. Fixes #8071. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210514202640.346594-1-raphaelsc@scylladb.com> Upstream fix: `db4b9215dd` scylla-4.4.2	2021-05-20 21:26:07 +03:00
Hagit Segev	93457807b8	release: prepare for 4.4.2	2021-05-20 00:02:31 +03:00
Takuya ASADA	cee62ab41b	install.sh: apply correct file security context when copying files Currently, unified installer does not apply correct file security context while copying files, it causes permission error on scylla-server.service. We should apply default file security context while copying files, using '-Z' option on /usr/bin/install. Also, because install -Z requires normalized path to apply correct security context, use 'realpath -m <PATH>' on path variables on the script. Fixes #8589 Closes #8602 (cherry picked from commit `60c0b37a4c`)	2021-05-19 12:41:20 +03:00
Takuya ASADA	728a5e433f	install.sh: fix not such file or directory on nonroot Since we have added scylla-node-exporter, we needed to do 'install -d' for systemd directory and sysconfig directory before copying files. Fixes #8663 Closes #8664 (cherry picked from commit `6faa8b97ec`)	2021-05-19 12:41:20 +03:00
Avi Kivity	9a2d4a7cc7	Merge 'Fix type checking in index paging' from Piotr Sarna When recreating the paging state from an indexed query, a bunch of panic checks were introduced to make sure that the code is correct. However, one of the checks is too eager - namely, it throws an error if the base column type is not equal to the view column type. It usually works correctly, unless the base column type is a clustering key with DESC clustering order, in which case the type is actually "reversed". From the point of view of the paging state generation it's not important, because both types deserialize in the same way, so the check should be less strict and allow the base type to be reversed. Tests: unit(release), along with the additional test case introduced in this series; the test also passes on Cassandra Fixes #8666 Closes #8667 * github.com:scylladb/scylla: test: add a test case for paging with desc clustering order cql3: relax a type check for index paging (cherry picked from commit `593ad4de1e`)	2021-05-19 12:41:05 +03:00
Takuya ASADA	cc050fd499	dist/redhat: stop using systemd macros, call systemctl directly Fedora version of systemd macros does not work correctly on CentOS7, since CentOS7 does not support "file trigger" feature. To fix the issue we need to stop using systemd macros, call systemctl directly. See scylladb/scylla-jmx#94 Closes #8005 (cherry picked from commit `7b310c591e`)	2021-05-18 13:50:07 +03:00
Raphael S. Carvalho	61145af5d9	compaction_manager: Don't swallow exception in procedure used by reshape and resharding run_custom_job() was swallowing all exceptions, which is definitely wrong because failure in a resharding or reshape would be incorrectly interpreted as success, which means upper layer will continue as if everything is ok. For example, ignoring a failure in resharding could result in a shared sstable being left unresharded, so when that sstable reaches a table, scylla would abort as shared ssts are no longer accepted in the main sstable set. Let's allow the exception to be propagated, so failure will be communicated, and resharding and reshape will be all or nothing, as originally intended. Fixes #8657. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com> (cherry picked from commit `10ae77966c`)	2021-05-18 13:00:20 +03:00
Avi Kivity	11bd83e319	Update tools/jmx (rpm systemd macros) * tools/jmx c510a56...7a101a0 (1): > dist/redhat: stop using systemd macros, call systemctl directly Ref scylladb/jmx#94.	2021-05-13 18:24:52 +03:00
Raphael S. Carvalho	b58305d919	compaction_manager: Redefine weight for better control of parallel compactions Compaction manager allows compaction of different weights to proceed in parallel. For example, a small-sized compaction job can happen in parallel to a large-sized one, but similar-sized jobs are serialized. The problem is the current definition of weight, which is the log (base 4) of total size (size of all sstables) of a job. This is what we get with the current weight definition: weight=5 for sizes=[1K, 3K] weight=6 for sizes=[4K, 15K] weight=7 for sizes=[16K, 63K] weight=8 for sizes=[64K, 255K] weight=9 for sizes=[258K, 1019K] weight=10 for sizes=[1M, 3M] weight=11 for sizes=[4M, 15M] weight=12 for sizes=[16M, 63M] weight=13 for sizes=[64M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 12 Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5 jobs smaller than 1MB could proceed in parallel. High number of parallel compactions can be observed after repair, which potentially produces tons of small sstables of varying sizes. That causes compaction to use a significant amount of resources. To fix this problem, let's add a fixed tax to the size before taking the log, so that jobs smaller than 1M will all have the same weight. Look at what we get with the new weight definition: weight=10 for sizes=[1K, 2M] weight=11 for sizes=[3M, 14M] weight=12 for sizes=[15M, 62M] weight=13 for sizes=[63M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 7 Fixes #8124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com> (cherry picked from commit `81d773e5d8`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210512224405.68925-1-raphaelsc@scylladb.com>	2021-05-13 08:38:40 +03:00
Lauro Ramos Venancio	065111b42b	TWCS: initialize _highest_window_seen The timestamp_type is an int64_t. So, it has to be explicitly initialized before using it. This missing initialization prevented the major compaction from happening when a time window finishes, as described in #8569. Fixes #8569 Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com> Closes #8590 (cherry picked from commit `15f72f7c9e`)	2021-05-06 08:52:15 +03:00
Nadav Har'El	ebd2c9bab0	Update tools/java submodule Backport sstableloader fix in tools/java submodule. Fixes #8230. * tools/java a3e010ee4f...6ca351c221 (1): > sstableloader: Handle non-prepared batches with ":" in identifier names Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-03 10:08:54 +03:00
Avi Kivity	bf9e1f6d2e	Merge '[branch 4.4] Backport reader_permit: always forward resources to the semaphore ' from Botond Dénes This is a backport of `8aaa3a7` to branch-4.4. The main conflicts were around Benny's reader close series (`fa43d76`), but it also turned out that an additional patch (2f1d65c) also has to be backported to make sure admission on signaling resources doesn't deadlock. Refs: #8493 Closes #8571 * github.com:scylladb/scylla: test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units reader_concurrency_semaphore: add dump_diagnostics() reader_permit: always forward resources test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front reader_concurrency_semaphore: make admission conditions consistent	2021-04-30 22:02:46 +03:00
Botond Dénes	a710866235	test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress This unit test checks that the semaphore doesn't get into a deadlock when contended, in the presence of many memory-only reads (that don't wait for admission). This is tested by simulating the 3 kinds of reads we currently have in the system: * memory-only: reads that don't pass admission and only own memory. * admitted: reads that pass admission. * evictable: admitted reads that are furthermore evictable. The test creates and runs a large number of these reads in parallel, read kinds being selected randomly, then creates a watchdog which kills the test if no progress is being made. (cherry picked from commit `45d580f056`)	2021-04-30 11:03:09 +03:00
Botond Dénes	3c3fc18777	test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units This unit test passes a read through admission again-and-again, just like an evictable reader would be during its lifetime. When readmitted the read sometimes has to wait and sometimes not. This is to check that readmitting a previously admitted reader doesn't leak any units. (cherry picked from commit `cadc26de38`)	2021-04-30 11:03:09 +03:00
Botond Dénes	960f93383b	reader_concurrency_semaphore: add dump_diagnostics() Allow semaphore related tests to include a diagnostics printout in error messages to help determine why the test failed. (cherry picked from commit `d246e2df0a`)	2021-04-30 09:08:18 +03:00
Botond Dénes	1c0557c638	reader_permit: always forward resources This commit conceptually reverts `4c8ab10`. Said commit was meant to prevent the scenario where memory-only permits -- those that don't pass admission but still consume memory -- completely prevent the admission of reads, possibly even causing a deadlock because a permit might even block its own admission. The protection introduced by said commit however proved to be very problematic. It made the status of resources on the permit very hard to reason about and created loopholes via which permits could accumulate without tracking or they could even leak resources. Instead of continuing to patch this broken system, this commit does away with this "protection" based on the observation that deadlocks are now prevented anyway by the admission criteria introduced by `0fe75571d9`, which admits a read anyway when all the initial count resources are available (meaning no admitted reader is alive), regardless of availability of memory. The benefit of this revert is that the semaphore now knows about all the resources and is able to do its job better as it is not "lied to" about resources by the permits. Furthermore the status of a permit's resources is much simpler to reason about, there are no more loopholes in unexpected state transitions to swallow/leak resources. To prove that this revert is indeed safe, in the next commit we add robust tests that stress test admission on a highly contested semaphore. This patch also does away with the registered/admitted differentiation of permits, as this doesn't make much sense anymore, instead these two are unified into a single "active" state. One can always tell whether a permit was admitted or not from whether it owns count resources anyway. (cherry picked from commit `caaa8ef59a`)	2021-04-30 09:08:17 +03:00
Botond Dénes	f23052ae64	test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front The fuzzy test consumes a large chunk of resource from the semaphore up-front to simulate a contested semaphore. This isn't an accurate simulation, because no permit will have more than 1 unit in reality. Furthermore this can even cause a deadlock since `8aaa3a7` as now we rely on all count units being available to make forward progress when memory is scarce. This patch just cuts out this part of the test, we now have a dedicated unit test for checking a heavily contested semaphore, that does it properly, so no need to try to fix this clumsy attempt that is just making trouble at this point. Refs: #8493 Tests: release(multishard_mutation_query_test:fuzzy_test) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210429084458.40406-1-bdenes@scylladb.com> (cherry picked from commit `26ae9555d1`)	2021-04-30 08:57:12 +03:00
Botond Dénes	15a157611a	reader_concurrency_semaphore: make admission conditions consistent Currently there are two places where we check admission conditions: `do_wait_admission()` and `signal()`. Both use `has_available_units()` to check resource availability, but the former has some additional resource related conditions on top (in `may_proceed()`), which lead to the two paths working with slightly different conditions. To fix, push down all resource availability related checks to `has_available_units()` to ensure admission conditions are consistent across all paths. (cherry picked from commit `d90cd6402c`)	2021-04-30 08:57:12 +03:00
Eliran Sinvani	d0b82e1e68	Materialized views: fix possibly old views coming from other nodes Migration manager has a function to get a schema (for read or write), this function queries a peer node and retrieves the schema from it. One scenario where it can happen is if an old node queries an old, not fixed index. This makes a hole through which views that are only adjusted for reading can slip through. Here we plug the hole by fixing such views before they are registered. Closes #8509 (cherry picked from commit `480a12d7b3`) Fixes #8554.	2021-04-29 14:03:03 +03:00
Botond Dénes	840ca41393	database: clear inactive reads in stop() If any inactive read is left in the semaphore, it can block `database::stop()` from shutting down, as sstables pinned by these reads will prevent `sstables::sstables_manager::close()` from finishing. This causes a deadlock. It is not clear how inactive reads can be left in the semaphore, as all users are supposed to clean up after themselves. Post 4.4 releases don't have this problem anymore as the inactive read handle was made a RAII object, removing the associated inactive read when destroyed. In 4.4 and earlier release this wasn't so, so errors could be made. Normally this is not a big issue, as these orphaned inactive reads are just evicted when the resources they own are needed, but it does become a serious issue during shutdown. To prevent a deadlock, clear the inactive reads earlier, in `database::stop()` (currently they are cleared in the destructor). This is a simple and foolproof way of ensuring any leftover inactive reads don't cause problems. Fixes: #8561 Tests: unit(dev) Closes #8562	2021-04-28 19:32:46 +03:00
Takuya ASADA	07051f25f2	dist: increase fs.aio-max-nr value for other apps The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the amount Scylla uses; if other apps in the environment also try to use aio, the aio slots will run out. So increase the value by 65536 for other apps. Related #8133 Closes #8228 (cherry picked from commit `53c7600da8`)	2021-04-25 16:15:25 +03:00
Takuya ASADA	8437f71b1b	dist: tune fs.aio-max-nr based on the number of cpus Current aio-max-nr is set up statically to 1048576 in /etc/sysctl.d/99-scylla-aio.conf. This is sufficient for most use cases, but falls short on larger machines such as i3en.24xlarge on AWS that has 96 vCPUs. We need to tune the parameter based on the number of cpus, instead of static setting. Fixes #8133 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #8188 (cherry picked from commit `d0297c599a`)	2021-04-25 16:15:12 +03:00
Avi Kivity	9f32f5a60c	Update seastar submodule (io_queue request size) * seastar 37eb6022fc...61939b5b8a (1): > io_queue: Double max request size Fixes #8496	2021-04-25 12:35:34 +03:00
Avi Kivity	910bc2417a	Update seastar submodule (low bandwidth disks) * seastar a75171fc89...37eb6022fc (2): > io_queue: Honor disks with tiny request rate > io_queue: Shuffle fair_group creation Fixes #8378.	2021-04-21 14:02:15 +03:00
Piotr Jastrzebski	7790beb655	row_cache: remove redundant check in make_reader This check is always true because a dummy entry is added at the end of each cache entry. If that wasn't true, the check in else-if would be an UB. Refs #8435. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `cb3dbb1a4b`)	2021-04-20 13:53:23 +02:00
Piotr Jastrzebski	1379f141c2	cache_flat_mutation_reader: fix do_fill_buffer Make sure that when a partition does not exist in underlying, do_fill_buffer does not try to fast forward within this nonexistent partition. Test: unit(dev) Fixes #8435 Fixes #8411 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `1f644df09d`)	2021-04-20 13:53:17 +02:00
Piotr Jastrzebski	d14ec86e7d	read_context: add _partition_exists This new state stores the information whether current partition represented by _key is present in underlying. Refs #8435. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `ceab5f026d`)	2021-04-20 13:53:10 +02:00
Piotr Jastrzebski	bbada5b9e4	read_context: remove skip_first_fragment arg from create_underlying All callers pass false for its value so no need to keep it around. Refs #8435. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `b3b68dc662`)	2021-04-20 13:53:01 +02:00
Piotr Jastrzebski	d73ec88916	read_context: skip first fragment in ensure_underlying This was previously done in create_underlying but ensure_underlying is a better place because we will add more related logic to this consumption in the following patches. Refs #8435. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `088a02aafd`)	2021-04-20 13:52:46 +02:00
Kamil Braun	2efb458c7a	time_series_sstable_set: return partition start if some sstables were ck-filtered out When a particular partition exists in at least one sstable, the cache expects any single-partition query to this partition to return a `partition_start` fragment, even if the result is empty. In `time_series_sstable_set::create_single_key_sstable_reader` it could happen that all sstables containing data for the given query get filtered out and only sstables without the relevant partition are left, resulting in a reader which immediately returns end-of-stream (while it should return a `partition_start` and if not in forwarding mode, a `partition_end`). This commit fixes that. We do it by extending the reader queue (used by the clustering reader merger) with a `dummy_reader` which will be returned by the queue as the very first reader. This reader only emits a `partition_start` and, if not in forwarding mode, a `partition_end` fragment. Fixes #8447. Closes #8448. (cherry picked from commit `5c7ed7a83f`)	2021-04-20 13:52:34 +02:00
Kamil Braun	c05d8fcef1	clustering_order_reader_merger: handle empty readers The merger could return end-of-stream if some (but not all) of the underlying readers were empty (i.e. not even returning a `partition_start`). This could happen in places where it was used (`time_series_sstable_set::create_single_key_sstable_reader`) if we opened an sstable which did not have the queried partition but passed all the filters (specifically, the bloom filter returned a false positive for this sstable). The commit also extends the random tests for the merger to include empty readers and adds an explicit test case that catches this bug (in a limited scope: when we merge a single empty reader). It also modifies `test_twcs_single_key_reader_filtering` (regression test for #8432) because the time where the clustering key filter is invoked changes (some invocations move from the constructor of the merger to operator()). I checked manually that it still catches the bug when I reintroduce it. Fixes #8445. Closes #8446. (cherry picked from commit `7ffb0d826b`)	2021-04-20 13:52:13 +02:00
Kamil Braun	d29960da47	sstables: fix TWCS single key reader sstable filter The filter passed to `min_position_reader_queue`, which was used by `clustering_order_reader_merger`, would incorrectly include sstables as soon as they passed through the PK (bloom) filter, and would include sstables which didn't pass the PK filter (if they passed the CK filter). Fortunately this wouldn't cause incorrect data to be returned, but it would cause sstables to be opened unnecessarily (these sstables would immediately return eof), resulting in a performance drop. This commit fixes the filter and adds a regression test which uses statistics to check how many times the CK filter was invoked. Fixes #8432. Closes #8433. (cherry picked from commit `3687757115`)	2021-04-20 13:51:52 +02:00
Avi Kivity	e0d67ad6e4	Update seastar submodule (fair_queue fixes) * seastar 2c884a7449...a75171fc89 (2): > fair_queue: Preempted requests got re-queued too far > fair_queue: Improve requests preemption while in pending state Fixes #8296.	2021-04-14 15:40:45 +03:00
Hagit Segev	00da6b5e9e	release: prepare for 4.4.1 scylla-4.4.1	2021-04-07 00:28:45 +03:00
Gleb Natapov	4200e52444	storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication If a table that is not replicated to a certain DC (rf=0) is accessed with LOCAL_QUORUM on that DC the current code will crash since the 'targets' array will be empty and read executor does not handle it. Fix it by replying with empty result. Fixes #8354 Message-Id: <YGro+l2En3fF80CO@scylladb.com> (cherry picked from commit `cd24dfc7e5`) [avi: re-added virtual keyword when backporting, since 4.4 and below don't have `020da49c89`]	2021-04-06 19:34:49 +03:00
Nadav Har'El	f5e402ea7a	update tools/java submodule Backport for refs #8390. * tools/java 56470fda09...a3e010ee4f (1): > sstableloader: fix handling of rewritten partition Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-04-05 18:39:05 +03:00
Botond Dénes	05c6a40f05	result_memory_accounter: abort unpaged queries hitting the global limit The `result_memory_accounter` terminates a query if it reaches either the global or shard-local limit. This used to be so only for paged queries, unpaged ones could grow indefinitely (until the node OOM'd). This was changed in `fea5067` which enforces the local limit on unpaged queries as well, by aborting them. However a loophole remained in the code: `result_memory_accounter::check_and_update()` has another stop condition, besides `check_local_limit()`, it also checks the global limit. This stop condition was not updated to enforce itself on unpaged queries by aborting them, instead it silently terminated them, causing them to return less data than requested. This was masked by most queries reaching the local limit first. This patch fixes this by aborting unpaged mutation queries when they hit the global limit. Fixes: #8162 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226102202.51275-1-bdenes@scylladb.com> (cherry picked from commit `dd5a601aaa`)	2021-03-24 13:00:33 +02:00
Nadav Har'El	2b3bc9f174	Merge 'Fix reading whole requests during shedding' from Piotr Sarna When shedding requests (e.g. due to their size or number exceeding the limits), errors were returned right after parsing their headers, which resulted in their bodies lingering in the socket. The server always expects a correct request header when reading from the socket after the processing of a single request is finished, so shedding the requests should also take care of draining their bodies from the socket. Fixes #8193 Closes #8194 * github.com:scylladb/scylla: cql-pytest: add a shedding test transport: return error on correct stream during size shedding transport: return error on correct stream during shedding transport: skip the whole request if it is too large transport: skip the whole request during shedding (cherry picked from commit `0fea089b37`)	2021-03-24 12:49:57 +02:00
Piotr Sarna	4bfa605c38	Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani This is a reworked submission of #7686 which has been reverted. This series fixes some race conditions in MV/SI schema creation and load, we spotted some places where a schema without a base table reference can sneak into the registry. This can cause an unrecoverable error since write commands with those schemas can't be issued from other nodes. Most of those cases can occur in 2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small window after a view or base table altering. Fixes #7709 Closes #8091 * github.com:scylladb/scylla: database: Fix view schemas in place when loading global_schema_ptr: add support for view's base table materialized views: create view schemas with proper base table reference. materialized views: Extract fix legacy schema into its own logic (cherry picked from commit `d473bc9b06`)	2021-03-24 12:25:26 +02:00
Tomasz Grabiec	dbb550e1a7	sstable: writer: ka/la: Write row marker cell after row tombstone Row marker has a cell name which sorts after the row tombstone's start bound. The old code was writing the marker first, then the row tombstone, which is incorrect. This was harmless to our sstable reader, which recognized both as belonging to the current clustering row fragment, and collects both fine. However, if both atoms trigger creation of promoted index blocks, the writer will create a promoted index with entries which violate the cell name ordering. It's very unlikely to run into this in practice, since to trigger promoted index entries for both atoms, the clustering key would be so large so that the size of the marker cell exceeds the desired promoted index block size, which is 64KB by default (but user-controlled via column_index_size_in_kb option). 64KB is also the limit on clustering key size accepted by the system. This was caught by one of our unit tests: sstable_conforms_to_mutation_source_test ...which runs a battery of mutation reader tests with various desired promoted index block sizes, including the target size of 1 byte, which triggers an entry for every atom. The test started to fail for some random seeds after commit `ecb6abe` inside the test_streamed_mutation_forwarding_is_consistent_with_slicing test case, reporting a mutation mismatch in the following line: assert_that(sliced_m).is_equal_to(fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key())); It compares mutations read from the same sstable using different methods, slicing using clustering key restrictions, and fast forwarding. The reported mismatch was that fwd_m contained the row marker, but sliced_m did not. The sstable does contain the marker, so both reads should return it. After reverting the commit which introduced dynamic adjustments, the test passes, but both mutations are missing the marker, both are wrong!
They are wrong because the promoted index contains entries whose starting positions violate the ordering, so binary search gets confused and selects the row tombstone's position, which is emitted after the marker, thus skipping over the row marker. The explanation for why the test started to fail after dynamic adjustments is the following. The promoted index cursor works by incrementally parsing buffers fed by the file input stream. It first parses the whole block and then does a binary search within the parsed array. The entries which cursor touches during binary search depend on the size of the block read from the file. The commit which enabled dynamic adjustments causes the block size to be different for subsequent reads, which allows one of the reads to walk over the corrupted entries and read the correct data by selecting the entry corresponding to the row marker. Fixes #8324 Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com> (cherry picked from commit `9272e74e8c`)	2021-03-24 10:38:54 +02:00
Avi Kivity	dffbcabbb1	Merge "mutation_writer: explicitly close writers" from Benny " _consumer_fut is expected to return an exception on the abort path. Wait for it and drop any exception so it won't be abandoned as seen in #7904. A future<> close() method was added to return _consumer_fut. It is called both after abort() in the error path, and after consume_end_of_stream, on the success path. With that, consume_end_of_stream was made void as it doesn't return a future<> anymore. Fixes #7904 Test: unit(release) " * tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla: mutation_writer: bucket_writer: add close mutation_writer/feed_writers: refactor bucket/shard writers mutation_writer: update bucket/shard writers consume_end_of_stream (cherry picked from commit `f11a0700a8`) scylla-4.4.0	2021-03-21 18:09:45 +02:00
Avi Kivity	a715c27a7f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator (cherry picked from commit `c63e26e26f`)	2021-03-21 14:05:36 +02:00
Benny Halevy	6804332291	dist: scylla_util: prevent IndexError when no ephemeral_disks were found Currently we call firstNvmeSize before checking that we have enough (at least 1) ephemeral disks. When none are found, we hit the following error (see #7971): ``` File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in if idata.is_recommended_instance(): File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance diskSize = self.firstNvmeSize File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize firstDisk = ephemeral_disks[0] IndexError: list index out of range ``` This change reverses the order and first checks that we found enough disks before getting the first disk size. Fixes #7971 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8027 (cherry picked from commit `55e3df8a72`)	2021-03-21 12:19:24 +02:00
Nadav Har'El	a20991ad62	storage_service: correct missing exception in logging rebuild failure When failing to rebuild a node, we would print the error with the useless explanation "<no exception>". The problem was a typo in the logging command which used std::current_exception() - which wasn't relevant in that point - instead of "ep". Refs #8089 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210314113118.1690132-1-nyh@scylladb.com> (cherry picked from commit `d73934372d`)	2021-03-21 10:51:04 +02:00


24944 Commits