Commit Graph

24144 Commits

Author SHA1 Message Date
Takuya ASADA
e4b42e622e scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero
On the environment hard limit of coredump is set to zero, coredump test
script will fail since the system does not generate coredump.
To avoid such issue, set ulimit -c 0 before generating SEGV on the script.

Note that scylla-server.service can generate coredump even ulimit -c 0
because we set LimitCORE=infinity on its systemd unit file.

Fixes #8238

Closes #8245

(cherry picked from commit af8eae317b)
2021-06-13 18:27:13 +03:00
Avi Kivity
e625144d6e Update seastar submodule (nested exception logging)
* seastar b70b444924...5ef45afa4d (2):
  > utils/log.cc: fix nested_exception logging (again)
  > log: skip on unknown nested mixing instead of stopping the logging

Fixes #8327.
2021-06-13 18:24:52 +03:00
Yaron Kaikov
76ec7513f1 install.sh: Setup aio-max-nr upon installation
This is a follow up change to #8512.

Let's add aio conf file during scylla installation process and make sure
we also remove this file when uninstall Scylla

As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.

Closes #8650

Refs: #8713

(cherry picked from commit dd453ffe6a)
2021-06-13 14:15:24 +03:00
Yaron Kaikov
f36f7035c8 scylla_io_setup: configure "aio-max-nr" before iotune
On severl instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```

We have scylla_prepare:configure_io_slots() running before the
scylla-server.service start, but the scylla_io_setup is taking place
before

1) Let's move configure_io_slots() to scylla_util.py since both
   scylla_io_setup and scylla_prepare are import functions from it
2) cleanup scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
failure

Fixes: #8587

Closes #8512

Refs: #8713

(cherry picked from commit 588a065304)
2021-06-13 14:14:59 +03:00
Benny Halevy
709e934164 test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
(cherry picked from commit ca6f5cb0bc)
2021-06-10 19:37:37 +03:00
Dejan Mircevski
13428d56f6 cql3: Skip indexed column for CK restrictions
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns.  But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key.  We end up
with invalid clustering slice, which can cause problems downstream.

Fix this by skipping the indexed column when assembling the clustering
restrictions.

Tests: unit (dev)

Fixes #7888

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8320

(cherry picked from commit 0bd201d3ca)
2021-06-10 12:06:53 +03:00
Nadav Har'El
2c1f5e5225 Update tools/java submodule with backported patches
* tools/java 1489e7c539...7afe7018a5 (2):
  > nodetool: alternate way to specify table name which includes a dot
  > nodetool: do no treat table name with dot as a secondary index

Fixes #6521

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-07 09:49:55 +03:00
Nadav Har'El
9ae3edb102 alternator: fix equality check of nested document containing a set
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to tranverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.

This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.

Refs #5021
Fixes #8514

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
In the backport to 4.3, I had to disable (skip) the new test, because it
relies on nested attribute updates to build an unsorted set - and those
were not yet supported in branch 4.3.
2021-06-07 09:25:19 +03:00
Nadav Har'El
11851fa4d9 alternator: fix inequality check of two sets
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.

Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!

So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.

The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).

Refs #5021
Fixes #8513

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
2021-06-07 09:00:35 +03:00
Nadav Har'El
1a56e41f44 alternator: fix equality check of two unset attributes
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:

When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should to be true when both x and y are unset (but was false
before this patch).

The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.

This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.

Fixes #8511

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
2021-06-06 17:08:25 +03:00
Hagit Segev
d737d56a08 release: prepare for 4.3.4 scylla-4.3.4 2021-05-25 21:10:08 +03:00
Takuya ASADA
e1c993fc13 scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem
Currently, var-lib-scylla.mount may fails because it can start before
MDRAID volume initialized.
We may able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
device become available, but systemd manual says it automatically
configure dependency for mount unit when we specify filesystem path by
"absolute path of a device node".

So we need to replace What=UUID=<uuid> to What=/dev/disk/by-uuid/<uuid>.

Fixes #8279

Closes #8681

(cherry picked from commit 3d307919c3)
2021-05-24 17:24:18 +03:00
Raphael S. Carvalho
0a6e38bf18 sstables/mp_row_consumer: Fix unbounded memory usage when consuming a large run of partition tombstones
mp_row_consumer will not stop consuming large run of partition
tombstones, until a live row is found which will allow the consumer
to stop proceeding. So partition tombstones, from a large run, are
all accumulated in memory, leading to OOM and stalls.
The fix is about stopping the consumer if buffer is full, to allow
the produced fragments to be consumed by sstable writer.

Fixes #8071.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210514202640.346594-1-raphaelsc@scylladb.com>

Upstream fix: db4b9215dd

(cherry picked from commit 2b29568bf4)
2021-05-20 21:26:42 +03:00
Calle Wilund
55bca74e90 database: Verify iff we actually are writing memtables to disk in truncate
Fixes #7732

When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:

Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate

The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.

(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).

Added a check that before flushing checks if we actually have any
data, and if not, does not uphold the RP relation assert.

Closes #7799

(cherry picked from commit 71c5dc82df)
2021-05-19 12:23:46 +03:00
Takuya ASADA
162d466034 install.sh: apply correct file security context when copying files
Currently, unified installer does not apply correct file security context
while copying files, it causes permission error on scylla-server.service.
We should apply default file security context while copying files, using
'-Z' option on /usr/bin/install.

Also, because install -Z requires normalized path to apply correct security
context, use 'realpath -m <PATH>' on path variables on the script.

Fixes #8589

Closes #8602

(cherry picked from commit 60c0b37a4c)
2021-05-19 00:07:09 +03:00
Avi Kivity
3e6d8c3fa7 Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging

(cherry picked from commit 593ad4de1e)
2021-05-19 00:07:06 +03:00
Takuya ASADA
5d3ff1e8a1 dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005

(cherry picked from commit 7b310c591e)
2021-05-18 13:50:51 +03:00
Raphael S. Carvalho
5358eaf1d6 compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
2021-05-18 13:00:32 +03:00
Avi Kivity
e78b96ee49 Update tools/jmx submodule (rpm systemd macros)
* tools/jmx 0457674...5fcba13 (1):
  > dist/redhat: stop using systemd macros, call systemctl directly

Ref scylladb/scylla-jmx#94.
2021-05-13 18:26:07 +03:00
Lauro Ramos Venancio
add245a27e TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.

This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590

(cherry picked from commit 15f72f7c9e)
2021-05-06 08:52:31 +03:00
Avi Kivity
108f56c6ed Update tools/jmx submodule (nodetool cfstats deadlock)
* tools/jmx 47b355e...0457674 (1):
  > APIBuilder: Unlock RW-lock in remove()

Fixes #7991.
2021-05-03 16:52:41 +03:00
Nadav Har'El
d01ce491c0 Update tools/java submodule
Backport sstableloader fix in tools/java submodule.
Fixes #8230.

* tools/java 2bedecd3a7...1489e7c539 (1):
  > sstableloader: Handle non-prepared batches with ":" in identifier names

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-03 10:15:06 +03:00
Tomasz Grabiec
7b2f65191c thrift: Validate cell names when constructing clustering keys
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clusterin key. This may
result in undefined behavior down the stream.

It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:

   seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
   sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180

To prevent corrupting the state in this way, we should fail
early. This patch addds validation which will fail thrift requests
which attempt to create invalid clustering keys.

Fixes #7568.

Example error:

  Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600

Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 0c5d23d274)
2021-05-02 12:09:35 +03:00
Avi Kivity
add5ffa787 Merge '[branch 4.4] Backport reader_permit: always forward resources to the semaphore ' from Botond Dénes
This is a backport of 8aaa3a7 to branch-4.4. The main conflicts were around Benny's reader close series (fa43d76), but it also turned out that an additional patch (2f1d65c) also has to backported to make sure admission on signaling resources doesn't deadlock.

Refs: #8493

Closes #8571

* github.com:scylladb/scylla:
  test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
  test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
  reader_concurrency_semaphore: add dump_diagnostics()
  reader_permit: always forward resources
  test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
  reader_concurrency_semaphore: make admission conditions consistent

(cherry picked from commit bf9e1f6d2e)

[avi: convert coroutine in mutation_reader_test.cc to seastar thread]
2021-05-01 12:43:00 +03:00
Eliran Sinvani
32a1f2dcd9 Materialized views: fix possibly old views comming from other nodes
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node, queries an old not fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.

Here we plug the hole by fixing such views before they are registered.

Closes #8509

(cherry picked from commit 480a12d7b3)

Fixes #8554.
2021-04-29 14:03:41 +03:00
Botond Dénes
f2072665d1 database: clear inactive reads in stop()
If any inactive read is left in the semaphore, it can block
`database::stop()` from shutting down, as sstables pinned by these reads
will prevent `sstables::sstables_manager::close()` from finishing. This
causes a deadlock.
It is not clear how inactive reads can be left in the semaphore, as all
users are supposed to clean up after themselves. Post 4.4 releases don't
have this problem anymore as the inactive read handle was made a RAII
object, removing the associated inactive read when destroyed. In 4.4 and
earlier release this wasn't so, so errors could be made. Normally this
is not a big issue, as these orphaned inactive reads are just evicted
when the resources they own are needed, but it does become a serious
issue during shutdown. To prevent a deadlock, clear the inactive reads
earlier, in `database::stop()` (currently they are cleared in the
destructor). This is a simple and foolproof way of ensuring any
leftover inactive reads don't cause problems.

Fixes: #8561

Tests: unit(dev)

Closes #8562

(cherry picked from commit 840ca41393)
2021-04-28 22:36:41 +03:00
Takuya ASADA
beb2bcb8bd dist: increase fs.aio-max-nr value for other apps
Current fs.aio-max-nr value cpu_count() * 11026 is exact size of scylla
uses, if other apps on the environment also try to use aio, aio slot
will be run out.
So increase value +65536 for other apps.

Related #8133

Closes #8228

(cherry picked from commit 53c7600da8)
2021-04-25 16:16:31 +03:00
Takuya ASADA
8255b7984d dist: tune fs.aio-max-nr based on the number of cpus
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.

We need to tune the parameter based on the number of cpus, instead of
static setting.

Fixes #8133

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #8188

(cherry picked from commit d0297c599a)
2021-04-25 16:16:26 +03:00
Hagit Segev
28f5e0bd20 release: prepare for 4.3.3 scylla-4.3.3 2021-04-07 10:24:04 +03:00
Gleb Natapov
09f3bb93a3 storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.

Fixes #8354

Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)

[avi: re-added virtual keyword when backporting, since
      4.4 and below don't have 020da49c89]
2021-04-06 19:37:32 +03:00
Nadav Har'El
76642eb00d update submodule tools/java
* tools/java d49ae89b4b...2bedecd3a7 (1):
  > sstableloader: fix handling of rewritten partition

Backported refs #8390.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-04-05 18:33:39 +03:00
Pavel Emelyanov
a60f394d9a test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
(cherry picked from commit 096e452db9)
2021-04-04 18:11:08 +03:00
Avi Kivity
f2af68850c Merge 'Fix inconsistent mv si backport to 4.3' from Eliran Sinvani
This is a backport of the fix for #7709 to 4.3 version.

Closes #8375

* github.com:scylladb/scylla:
  Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
  storage_proxy: Add .local_db() getters
2021-04-04 17:01:26 +03:00
Takuya ASADA
c7781f8c9e node_exporter_install: fix bad owner of node_exporter
node_exporter files as installed with weird ownership, 3434:3434.
This is because upstream node_exporter tar.gz contain the owner
information, but we should overwrite it to valid one.

Fixes #6222

Closes #8379
2021-04-01 17:34:19 +03:00
Piotr Sarna
8f37924694 Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
This is a reworked submission of #7686 which has been reverted.  This series
fixes some race conditions in MV/SI schema creation and load, we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause to an unrecoverable error since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur on
2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small
window after a view or base table altering.

Fixes #7709

Closes #8091

* github.com:scylladb/scylla:
  database: Fix view schemas in place when loading
  global_schema_ptr: add support for view's base table
  materialized views: create view schemas with proper base table reference.
  materialized views: Extract fix legacy schema into its own logic

(cherry picked from commit d473bc9b06)
2021-03-30 08:08:14 +03:00
Pavel Emelyanov
8588eef807 storage_proxy: Add .local_db() getters
To facilitate the next patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4c7bc8a3d1)
2021-03-30 08:06:51 +03:00
Piotr Sarna
c50a2898cf transport: return error on correct stream during size shedding
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.

(cherry picked from commit 8635094144)
2021-03-25 09:22:36 +01:00
Piotr Sarna
44f7251809 transport: return error on correct stream during shedding
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.

(cherry picked from commit d6ea6937ee)
2021-03-25 09:22:36 +01:00
Piotr Sarna
fc070d3dc6 transport: skip the whole request if it is too large
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Fixes #8193

(cherry picked from commit 4a24d7dca0)
2021-03-25 09:22:35 +01:00
Piotr Sarna
901784e122 transport: skip the whole request during shedding
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Refs #8193

(cherry picked from commit 3eb7e768cb)
2021-03-25 09:22:29 +01:00
Botond Dénes
2ccda04d57 result_memory_accounter: abort unpaged queries hitting the global limit
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data then requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.

Fixes: #8162

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
(cherry picked from commit dd5a601aaa)
2021-03-24 13:00:46 +02:00
Amos Kong
e8facb1932 schema.cc/describe: fix invalid compaction options in schema
There is a typo in schema.cql of snapshot, lack of comma after
compaction strategy. It will fail to restore schema by the file.

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

map_as_cql_param() function has a `first` parameter to smartly add
comma, the compaction_strategy_options is always not the first.

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734

(cherry picked from commit 6b1659ee80)
2021-03-24 12:57:50 +02:00
Tomasz Grabiec
6f338e7656 sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>

(cherry picked from commit 9272e74e8c)
2021-03-24 10:39:13 +02:00
Avi Kivity
7bb9230cfa Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream

(cherry picked from commit f11a0700a8)
2021-03-21 18:11:52 +02:00
Benny Halevy
2898e98733 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the fist disk size.

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027

(cherry picked from commit 55e3df8a72)
2021-03-21 12:20:19 +02:00
Nadav Har'El
2796b0050d storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:23 +02:00
Nadav Har'El
6bc005643e alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 second for a response,
and if got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     dooesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some wierd errors caused by retrying an
operation which was already almost done.

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
(cherry picked from commit 0b2cf21932)
2021-03-19 00:09:17 +02:00
Raphael S. Carvalho
d591ff5422 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:19:58 +02:00
Raphael S. Carvalho
acb1c3eebf compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.

So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:29:20 +02:00
Dejan Mircevski
a04242ea62 cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker is bound to null.  Throw
invalid_request_exception instead.

This is a 4.3 backport of the #8265 fix.

Tests: unit (dev)

(cherry picked from commit 8db24fc03b)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8308
2021-03-18 10:39:19 +02:00