Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false, and an exception
from said method can currently lead to an unclean destruction of the writer
and reader. Fix by handling exceptions from consume_end_of_stream()
too.
Closes #10147
(cherry picked from commit 1963d1cc25)
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.
So this patch fixes the scyllasetup.py script to only pass this
parameter once.
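A minimal sketch of the intended behavior, not the actual scyllasetup.py code. The helper name `alternator_args` and its parameters are hypothetical; only the three option names come from the description above. The address option is emitted once, regardless of how many ports are requested:

```python
def alternator_args(address=None, port=None, https_port=None):
    """Build Alternator command-line options (illustrative helper only)."""
    args = []
    # The address is shared by the HTTP and HTTPS listeners, so emit it at
    # most once, and only if at least one of the two ports was requested.
    if address is not None and (port is not None or https_port is not None):
        args.append(f"--alternator-address {address}")
    if port is not None:
        args.append(f"--alternator-port {port}")
    if https_port is not None:
        args.append(f"--alternator-https-port {https_port}")
    return args
```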
Fixes #10016.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
We found that the monitor mode of mdadm does not work on RAID0, and that this
is not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.
Fixes #9540
----
This reverts 0d8f932 and introduces the correct fix.
Closes #9970
* github.com:scylladb/scylla:
scylla_raid_setup: use mdmonitor only when RAID level > 0
Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
(cherry picked from commit df22396a34)
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.
At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.
In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
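A rough sketch of the relaxed splitting rule, assuming the components are `key=value` tokens following the algorithm name. This is not Alternator's parser; `split_authorization` is a hypothetical helper and the credential values in the usage note are dummies:

```python
import re

def split_authorization(header):
    """Split an Authorization header into its algorithm and key=value parts.

    Components may be separated by commas, whitespace, or both.
    """
    tokens = [t for t in re.split(r"[,\s]+", header.strip()) if t]
    algorithm, fields = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    return algorithm, fields
```

With this rule, `"AWS4-HMAC-SHA256 Credential=AKID/20211101/us-east-1/dynamodb/aws4_request, SignedHeaders=host, Signature=abc"` and the same header with the spaces after the commas removed produce identical results.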
Fixes #9568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.
Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).
Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries need to differentiate between the two
reply formats, Alternator better return the right one.
This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.
Fixes #9554.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).
Fixes #9540
Closes #9782
(cherry picked from commit 0d8f932f0b)
This reverts commit 146f7b5421. It
causes a regression, and needs an additional fix. The bug is not
important enough to merit this complication.
Ref #9311.
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference and users keep these alive as long as the
reader is alive. The shard reader outliving the top-level reader means
that any background read-ahead that it has to wait on will potentially
have stale references to the range and the slice. This was seen in the
wild recently when the evictable reader wrapped by the shard reader hit
a use-after-free while wrapping up a background read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.
This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.
Fixes: #9719
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
(cherry picked from commit 417e853b9b)
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.
In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.
Fixes #9406
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
In order to avoid the race condition introduced in 9dce1e4, the
index_reader should be closed prior to its destruction.
Only 4.4 and earlier releases are exposed to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it, since this is most likely what all developers
who change the index reader in the future will assume.
Ref #9704 (because only 4.4 and earlier releases are vulnerable).
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Fixes #9704
(cherry picked from commit ddd7248b3b)
Closes #9717
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects. Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.
This is the most direct fix, because range and slice are calculated in
one place for all modification statements. It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
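The "impossible restriction" case can be illustrated with a small sketch. It models a clustering restriction as a pair of optional bounds over comparable values. The function name and parameters are hypothetical, and the NULL case is not modeled here:

```python
def is_empty_range(lower, upper, lower_inclusive, upper_inclusive):
    """Return True if no value can satisfy both bounds (None = unbounded)."""
    if lower is None or upper is None:
        return False
    if lower > upper:
        return True
    # Equal bounds admit a value only when both sides are inclusive.
    return lower == upper and not (lower_inclusive and upper_inclusive)
```

For `c>0 AND c<0`, both bounds are 0 and exclusive, so the range is empty and the statement would be rejected.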
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior. We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects. We add a TODO for rejecting such DELETEs later.
Fixes #7852.
Tests: unit (dev), cql-pytest against Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9286
(cherry picked from commit 1fdaeca7d0)
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.
This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:
1. Parse errors in fromJson()'s parameters are converted into a
function_execution_exception.
2. Any exception during the execute() of a native_scalar_function_for
function is converted into a function_execution_exception.
In particular, fromJson() uses a native_scalar_function_for.
Note, however, that when a function already takes care to produce
a specific Cassandra error, this error is passed through and not
converted to a function_execution_exception. An example is
blobAsText(), which can return an invalid_request error, so
it is left as such and not converted. This also happens in Cassandra.
All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.
Fixes #7911
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
(cherry picked from commit 702b1b97bf)
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.
This patch sorts the range tombstones based on start bound before
writing them to the index file.
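As an illustration only, with bounds modeled as plain integers and a hypothetical `RangeTombstone` tuple, ordering by start bound rather than end bound looks like this:

```python
from collections import namedtuple

# Hypothetical stand-in for a range tombstone; real bounds are clustering
# positions, not integers.
RangeTombstone = namedtuple("RangeTombstone", ["start", "end", "timestamp"])

def order_for_index_block(tombstones):
    """Return tombstones ordered by start bound (ties broken by end bound)."""
    return sorted(tombstones, key=lambda rt: (rt.start, rt.end))
```

A tombstone covering `[1, 10]` then precedes one covering `[5, 6]`, whereas ordering by end bound would put the narrower one first.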
The assumption is that writing an index block is rare so we can afford
sorting the tombstones at that point. Additionally this is a writer of
an old format and writing to it will be dropped in the next major
release so it should be rarely used already.
Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.
Test: unit(dev)
Fixes #9690
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
(cherry picked from commit ab425a11a8)
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
(cherry picked from commit 1e4da2dcce)
We need stopwaitsecs, just as we set TimeoutStopSec=900 on
scylla-server.service, to avoid a timeout on scylla-server shutdown.
Fixes #9485
Closes #9545
(cherry picked from commit c9499230c3)
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state is removed in the meantime, causing a use-after-free when
accessing it.
To fix, make a copy before we yield.
Fixes #8859
Closes #8862
(cherry picked from commit 7a32cab524)
The GKE metadata server does not provide the same metadata as GCE, so we
should not return True from is_gce() there.
So try to fetch machine-type from the metadata server, and return False if it
responds with 404 Not Found.
Fixes #9471
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #9582
(cherry picked from commit 9b4cf8c532)
There are two APIs for checking the repair status and they behave
differently in case the id is not found.
```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```
The correct status code is 400 as this is a parameter error and should
not be retried.
Returning status code 500 makes smarter HTTP clients retry the request
in the hope that the server recovers.
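To make the difference concrete, here is a sketch of the kind of retry policy such a client might apply. The function is hypothetical and not taken from any particular HTTP library:

```python
def should_retry(status_code):
    """Retry only server-side failures; a client error would fail again."""
    return 500 <= status_code <= 599
```

Under this policy an "unknown repair id" answered with 500 is retried pointlessly, while the same error answered with 400 is reported to the caller right away.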
After this patch:
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}
Fixes #9576
Closes #9578
(cherry picked from commit f5f5714aa6)
Fixes #9103
The compare overload was declared as "bool" even though it is a tri-cmp.
This causes us to never use the speed-up shortcut (lessen search set),
in turn meaning more overhead for collections.
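The loss of information from narrowing a three-way result to a boolean can be shown with a small sketch. This illustrates the general pitfall only and does not reproduce the affected Scylla code:

```python
def tri_compare(a, b):
    """Three-way comparison: negative, zero, or positive."""
    return (a > b) - (a < b)

# Narrowing to bool keeps only "equal vs. not equal" and loses the ordering,
# so a caller can no longer tell in which direction to narrow a search.
as_bool_less = bool(tri_compare(1, 2))
as_bool_greater = bool(tri_compare(2, 1))
```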
Closes #9104
(cherry picked from commit 59555fa363)
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
Fixes #8398.
Closes #8397
(cherry picked from commit f23a47e365)
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid).
Reset the uid/gid information so this doesn't happen.
Closes #9579
Fixes #9610.
(cherry picked from commit e1817b536f)
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.
The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.
So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
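The difference between the two operations can be sketched with a toy model in which a JSON object is a list of (key, value) members. The helper names are hypothetical and do not mirror the rjson signatures:

```python
def add_member(members, key, value):
    """Append a member without checking for an existing key."""
    members.append((key, value))

def replace_member(members, key, value):
    """Overwrite the member if the key exists, otherwise add it."""
    for i, (k, _) in enumerate(members):
        if k == key:
            members[i] = (key, value)
            return
    members.append((key, value))

item = [("p", "key"), ("a", "old")]
add_member(item, "a", "new")        # item now holds two values for "a"

fixed = [("p", "key"), ("a", "old")]
replace_member(fixed, "a", "new")   # a single, updated value for "a"
```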
This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.
In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().
Fixes #9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
When a tuple value is serialized, we go through every element type and
use it to serialize element values. But an element type can be
reversed, which is artificially different from the type of the value
being read. This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.
Fixes #7902
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8316
(cherry picked from commit 318f773d81)
Consider the following procedure:
- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to remove n3 from the
cluster
- n1 is down in the middle of removenode operation
Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
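For a sense of scale, the retry schedule can be worked out as follows, assuming one sleep precedes each of the 5 retries (the exact placement of sleeps relative to attempts is an assumption):

```python
def retry_sleep_intervals(first=60.0, factor=1.5, attempts=5):
    """Sleep intervals in seconds: each one is `factor` times the previous."""
    return [first * factor ** i for i in range(attempts)]

intervals = retry_sleep_intervals()   # [60.0, 90.0, 135.0, 202.5, 303.75]
total = sum(intervals)                # 791.25 seconds, about 13 minutes
```

So, excluding the time spent streaming, a failing restore_replica_count could keep retrying for roughly 13 minutes under these assumptions.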
This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues. We need a way to abort
such streaming attempts.
To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing nodes to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.
In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.
This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.
Fixes #8651
Closes #8655
(cherry picked from commit 0858619cba)
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.
Fixes #9524
Test: unit(dev)
DTest: wide_rows_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes only streaming from the node losing the range impossible
since no node is losing the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes #8503
(cherry picked from commit 3c36517598)
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes #9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.
This is fixed by using new conditions in cases where
we are using an index.
Fixes #8991.
Fixes #7708.
For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that the allocated chunks would normally be allocated
in 128KB chunks in the write() path.
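The arithmetic behind the change can be checked directly. The 16-byte header size is inferred from the 131088 = 128K + 16 figure above and is not read from the code:

```python
MAX_ALLOC_SIZE = 128 * 1024   # 131072 bytes
CHUNK_HEADER_SIZE = 16        # sizeof(chunk), inferred from 131088 = 128K + 16

# Before: payloads of up to 128K were allowed, plus the header on top.
old_worst_case_alloc = MAX_ALLOC_SIZE + CHUNK_HEADER_SIZE       # 131088

# After: the exposed maximum already leaves room for the header.
new_max_chunk_size = MAX_ALLOC_SIZE - CHUNK_HEADER_SIZE         # 131056
new_worst_case_alloc = new_max_chunk_size + CHUNK_HEADER_SIZE   # 131072
```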
Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
(cherry picked from commit ff5b42a0fa)
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader, its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f451)
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.
In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().
Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.
Fixes #8636
A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory.
This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in an invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.
Fixes #8521
Refs #8515
Tests:
* unit(release)
* YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```
Closes #8529
* github.com:scylladb/scylla:
test: add a test for rjson allocation
test: rename alternator_base64_test to alternator_unit_test
rjson: add a throwing allocator
(cherry picked from commit c36549b22e)
Fixes #8749
If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase.
v2:
* Make interface more set-like (tolerate non-existence in erase).
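A set-like erase that tolerates a missing element can be sketched as below. This is a generic illustration on a Python list with a hypothetical `erase` helper, not the memtable list code:

```python
def erase(memtables, memtable):
    """Remove memtable if present; return whether anything was removed."""
    try:
        memtables.remove(memtable)
    except ValueError:
        # Already gone (e.g. cleared concurrently): nothing to do.
        return False
    return True
```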
Closes #8904
(cherry picked from commit 373fa3fa07)
To build the Ubuntu AMI with CPU scaling configuration, we need a force
running mode for scylla_cpuscaling_setup, which runs the setup without
checking for scaling_governor support.
See scylladb/scylla-machine-image#204
Closes #9326
(cherry picked from commit f928dced0c)
On Ubuntu, scaling_governor becomes powersave after a reboot, even if we configured cpufrequtils.
This is because of ondemand.service, which unconditionally changes scaling_governor to ondemand or powersave.
cpufrequtils starts before ondemand.service, so scaling_governor is overwritten by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.
Fixes #9324
Closes #9325
(cherry picked from commit cd7fe9a998)
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
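The deduplication idea can be sketched as follows. The class and method names are hypothetical and the real compaction manager differs:

```python
class PendingTasks:
    """Pending-task list holding at most one entry per table."""

    def __init__(self):
        self._tables = []

    def submit(self, table):
        """Queue a task for table unless one is already pending."""
        if table in self._tables:
            return False
        self._tables.append(table)
        return True

    def __len__(self):
        return len(self._tables)
```

However often a task is submitted for the same table T, the list holds a single entry for it, so its size is bounded by the number of tables.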
Refs #9331.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
The Red Hat packages were missing two things: first, the metapackage
did not depend on the python3 package at all, and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency, which can cause some problems during upgrade.
Doing both of the things listed here is a bit of an overkill as either
one of them separately would solve the problem described in #XXXX
but both should be applied in order to express the correct concept.
Fixes #8829
Closes #8832
(cherry picked from commit 9bfb2754eb)
Fixes #8212
Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to the check routine, which if non-empty lists the tables to check.
The use case is "scrub", which calls with a limited set of tables
to snapshot.
Closes #8240
(cherry picked from commit f44420f2c9)
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
recreation
As they depend on each other, it is easier to add a test if they are in
a series.
Fixes: #8923
Fixes: #8893
Tests: unit(dev, mutation_reader_test:debug)
"
* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
test: mutation_reader_test: add more test for reader recreation
evictable_reader: relax partition key check on reader recreation
evictable_reader: always reset static row drop flag
(cherry picked from commit 4209dfd753)
On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even though CPU scaling is supported.
In contrast, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available in both environments, so we should switch to it.
Fixes #9191
Closes #9193
(cherry picked from commit e5bb88b69a)
Compaction manager can start tons of compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes #8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
(cherry picked from commit a7cdd846da)
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.
The checking code was correct, but in the wrong place: we need to first
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
cassandra_tests/validation/entities/secondary_index_test.py::
testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.
Fixes #8717.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
This series comes with a test.
Fixes #8620
Tests: unit(release)
Closes #8632
* github.com:scylladb/scylla:
cql-pytest: add regression tests for index creation
cql3: fail to create an index if there is a name conflict
database: check for conflicting table names for indexes
(cherry picked from commit cee4c075d2)
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.
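The wraparound in the example can be reproduced with a small model of the narrowing conversion. It assumes the usual two's-complement wrap when a wide value is converted to a 32-bit int, and `plus_int` is only a stand-in for std::plus<int>:

```python
import ctypes

def to_int32(value):
    """Narrow an integer to a 32-bit signed int with two's-complement wrap."""
    return ctypes.c_int32(value).value

def plus_int(a, b):
    # Stand-in for std::plus<int>: both operands are narrowed to int
    # before being added.
    return to_int32(to_int32(a) + to_int32(b))

wrapped = plus_int(2147483648, 1)   # -2147483647
expected = 2147483648 + 1           # 2147483649, what size_t addition yields
```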
Fixes #9090
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9074
(cherry picked from commit 90a607e844)
The repair parallelism is calculated from the amount of memory allocated to
repair and memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.
Be more conservative when calculating the parallelism to avoid repair
using too much memory.
Fixes #8641
Closes #8652
(cherry picked from commit b8749f51cb)
To avoid restarting scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.
Fixes #8921
Closes #8973
(cherry picked from commit def81807aa)
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.
Fixes #8966
Closes #8967
(cherry picked from commit f19ebe5709)
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory after
uninstalling Scylla, or may want to upgrade to the enterprise version.
Also, we mixed two types of files as ghost files, which should be handled differently:
1. automatically generated by the postinst scriptlet
2. generated by a user-invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
However, just dropping .mount from the %files section causes another
problem: rpm will remove these files during upgrade, instead of
uninstall (#8924).
To fix both problems, specify .mount files as "%ghost %config".
This keeps the files on both package upgrade and package removal.
See scylladb/scylla-enterprise#1780
Closes #8810
Closes #8924
Closes #8959
(cherry picked from commit f71f9786c7)
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite reckless as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
marked as such and seems to throw.
The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself has since been removed
by 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcept.
The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.
fixes: #8983
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.
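A minimal Python sketch of the difference between the two comparisons, assuming each sstable is reduced to a hypothetical `(first_key, last_key)` tuple with inclusive bounds; the function names are illustrative, not the ones used in the code base.

```python
def has_overlap_strict(sstables):
    """Buggy variant: '<' misses the case last key == next first key."""
    ordered = sorted(sstables)  # sorted by first key
    return any(nxt[0] < cur[1] for cur, nxt in zip(ordered, ordered[1:]))

def has_overlap(sstables):
    """Fixed variant: '<=' also catches a shared boundary key."""
    ordered = sorted(sstables)
    return any(nxt[0] <= cur[1] for cur, nxt in zip(ordered, ordered[1:]))

# Key 5 is the last key of the first sstable and the first key of the next.
touching = [(1, 5), (5, 9)]
disjoint = [(1, 4), (5, 9)]
print(has_overlap_strict(touching), has_overlap(touching), has_overlap(disjoint))
```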
Fixes #8531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
(cherry picked from commit 97bb15b2f2)
Backport of 6726fe79b6.
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Fixes #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
Closes #8834
* backport-6726fe7-4.4:
view: fix use-after-move when handling view update failures
db,view: explicitly move the mutation to its helper function
db,view: pass base token by value to mutate_MV
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Refs #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
(cherry picked from commit 8a049c9116)
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
(cherry picked from commit 7cdbb7951a)
The base token is passed across continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
(cherry picked from commit 88d4a66e90)
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
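The estimate above as a worked calculation, in a small Python sketch. The function name is illustrative, and the formula is only the rough one stated in this message (initial number of sstables divided by max_threshold).

```python
def reshape_write_amplification(initial_sstables, max_threshold=32):
    """Rough number of times each byte is rewritten by the old L0 reshape."""
    return initial_sstables / max_threshold

wa = reshape_write_amplification(256)
print(wa)
```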
Fixes #8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
(cherry picked from commit bcbb39999b)
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
Fixes #8117
(cherry picked from commit 8cc4f39472)
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
(cherry picked from commit cf28552357)
In environments where the hard limit for coredumps is set to zero, the coredump test
script will fail since the system does not generate a coredump.
To avoid this issue, set ulimit -c 0 before generating SEGV in the script.
Note that scylla-server.service can generate a coredump even with ulimit -c 0
because we set LimitCORE=infinity in its systemd unit file.
Fixes #8238
Closes #8245
(cherry picked from commit af8eae317b)
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes #8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
(cherry picked from commit ca6f5cb0bc)
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key. We end up
with invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
Tests: unit (dev)
Fixes #7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8320
(cherry picked from commit 0bd201d3ca)
This is a follow up change to #8512.
Let's add the aio conf file during the Scylla installation process and make sure
we also remove this file when uninstalling Scylla.
As per Avi Kivity's suggestion, let's set the aio value as a static
configuration, and make it large enough to work with 500 cpus.
Closes #8650
Refs: #8713
(cherry picked from commit dd453ffe6a)
On several instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```
We have scylla_prepare:configure_io_slots() running before
scylla-server.service starts, but scylla_io_setup runs
before that.
1) Let's move configure_io_slots() to scylla_util.py since both
scylla_io_setup and scylla_prepare import functions from it
2) Clean up scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
a failure
Fixes: #8587
Closes #8512
Refs: #8713
(cherry picked from commit 588a065304)
* tools/java 6ca351c221...aab793d9f5 (2):
> nodetool: alternate way to specify table name which includes a dot
> nodetool: do no treat table name with dot as a secondary index
Fixes #6521
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.
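A minimal Python sketch of such a recursive comparison over DynamoDB-style typed values (`{"SS": [...]}`, `{"L": [...]}`, `{"M": {...}}`, and so on). This is not Alternator's implementation: the function name is hypothetical, and as a simplification number sets are compared as strings rather than numerically.

```python
SET_TAGS = {"SS", "NS", "BS"}  # DynamoDB set type descriptors

def values_equal(a, b):
    """Compare two typed values; sets ignore element order, lists do not."""
    if len(a) != 1 or a.keys() != b.keys():
        return False
    (tag, va), = a.items()
    vb = b[tag]
    if tag in SET_TAGS:
        return len(va) == len(vb) and set(va) == set(vb)
    if tag == "L":
        return len(va) == len(vb) and all(
            values_equal(x, y) for x, y in zip(va, vb))
    if tag == "M":
        return va.keys() == vb.keys() and all(
            values_equal(va[k], vb[k]) for k in va)
    return va == vb  # scalars: S, N, B, BOOL, NULL

# Same nested document, but the inner set lists its elements in another order.
left = {"M": {"tags": {"SS": ["a", "b"]}, "path": {"L": [{"S": "x"}, {"S": "y"}]}}}
right = {"M": {"tags": {"SS": ["b", "a"]}, "path": {"L": [{"S": "x"}, {"S": "y"}]}}}
print(values_equal(left, right))
```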
Refs #5021
Fixes #8514
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.
Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!
So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.
The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).
Refs #5021
Fixes #8513
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:
When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).
The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.
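The rule for "x = y" and "x <> y" described above can be sketched in Python as follows. An item is modeled as a plain dict and the function names are hypothetical; this illustrates the semantics, not Alternator's code.

```python
def attr_equal(item, x, y):
    """'x = y': false if either attribute is unset, even if both are."""
    if x not in item or y not in item:
        return False
    return item[x] == item[y]

def attr_not_equal(item, x, y):
    """'x <> y': the plain negation of the equality test."""
    return not attr_equal(item, x, y)

item = {"a": 3, "b": 3}
print(attr_equal(item, "a", "b"))      # both set and equal
print(attr_equal(item, "x", "y"))      # both unset
print(attr_not_equal(item, "x", "y"))  # both unset
```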
Fixes #8511
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
the device to become available, but the systemd manual says it automatically
configures the dependency for a mount unit when we specify the filesystem path as an
"absolute path of a device node".
So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.
Fixes #8279
Closes #8681
(cherry picked from commit 3d307919c3)
mp_row_consumer will not stop consuming large run of partition
tombstones, until a live row is found which will allow the consumer
to stop proceeding. So partition tombstones, from a large run, are
all accumulated in memory, leading to OOM and stalls.
The fix is about stopping the consumer if buffer is full, to allow
the produced fragments to be consumed by sstable writer.
Fixes #8071.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210514202640.346594-1-raphaelsc@scylladb.com>
Upstream fix: db4b9215dd
Currently, the unified installer does not apply the correct file security context
while copying files, which causes a permission error on scylla-server.service.
We should apply the default file security context while copying files, using
the '-Z' option of /usr/bin/install.
Also, because install -Z requires a normalized path to apply the correct security
context, use 'realpath -m <PATH>' on path variables in the script.
Fixes #8589
Closes #8602
(cherry picked from commit 60c0b37a4c)
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.
Fixes #8663
Closes #8664
(cherry picked from commit 6faa8b97ec)
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes #8666
Closes #8667
* github.com:scylladb/scylla:
test: add a test case for paging with desc clustering order
cql3: relax a type check for index paging
(cherry picked from commit 593ad4de1e)
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros and call systemctl
directly.
See scylladb/scylla-jmx#94
Closes #8005
(cherry picked from commit 7b310c591e)
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
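A toy Python illustration of the difference between swallowing and propagating a job failure. The functions below are simplified stand-ins and do not mirror the real run_custom_job() signature.

```python
def run_custom_job_swallowing(job):
    """Old behavior (sketch): any failure is silently dropped."""
    try:
        job()
    except Exception:
        pass  # the caller cannot tell that the job failed

def run_custom_job(job):
    """New behavior (sketch): the exception reaches the caller."""
    job()

def failing_reshard():
    raise RuntimeError("resharding failed")

run_custom_job_swallowing(failing_reshard)  # looks like success
try:
    run_custom_job(failing_reshard)
    propagated = False
except RuntimeError:
    propagated = True
print(propagated)
```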
Fixes #8657.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.
The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.
This is what we get with the current weight definition:
weight=5 for sizes=[1K, 3K]
weight=6 for sizes=[4K, 15K]
weight=7 for sizes=[16K, 63K]
weight=8 for sizes=[64K, 255K]
weight=9 for sizes=[258K, 1019K]
weight=10 for sizes=[1M, 3M]
weight=11 for sizes=[4M, 15M]
weight=12 for sizes=[16M, 63M]
weight=13 for sizes=[64M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 12
Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.
To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.
Look at what we get with the new weight definition:
weight=10 for sizes=[1K, 2M]
weight=11 for sizes=[3M, 14M]
weight=12 for sizes=[15M, 62M]
weight=13 for sizes=[63M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 7
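Both tables can be approximated with a few lines of Python. Two assumptions here are inferred from the tables rather than stated in this message: the weight is the floor of log base 4 of the size in bytes, and the fixed tax is 1MB. The function names are illustrative.

```python
KB = 1024
MB = 1024 * KB

def floor_log4(n):
    """Exact floor(log4(n)) for a positive integer, without floating point."""
    return (n.bit_length() - 1) // 2

def old_weight(total_size):
    return floor_log4(total_size)

def new_weight(total_size, tax=MB):
    # Assumed tax of 1MB: every job below 1MB lands in the same weight.
    return floor_log4(total_size + tax)

small_jobs = [s * KB for s in range(1, 1024)]  # 1K .. 1023K
old_distinct = len({old_weight(s) for s in small_jobs})
new_distinct = len({new_weight(s) for s in small_jobs})
print(old_distinct, new_distinct)
```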
Fixes #8124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
(cherry picked from commit 81d773e5d8)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210512224405.68925-1-raphaelsc@scylladb.com>
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.
This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
Fixes #8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes #8590
(cherry picked from commit 15f72f7c9e)
This is a backport of 8aaa3a7 to branch-4.4. The main conflicts were around Benny's reader close series (fa43d76), but it also turned out that an additional patch (2f1d65c) also has to be backported to make sure admission on signaling resources doesn't deadlock.
Refs: #8493
Closes #8571
* github.com:scylladb/scylla:
test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
reader_concurrency_semaphore: add dump_diagnostics()
reader_permit: always forward resources
test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
reader_concurrency_semaphore: make admission conditions consistent
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kinds of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.
The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
(cherry picked from commit 45d580f056)
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
readmitting a previously admitted reader doesn't leak any units.
(cherry picked from commit cadc26de38)
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.
(cherry picked from commit d246e2df0a)
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resources by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
(cherry picked from commit caaa8ef59a)
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.
Refs: #8493
Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.
(cherry picked from commit d90cd6402c)
Migration manager has a function to get a schema (for read or write);
this function queries a peer node and retrieves the schema from it. One
scenario where this can happen is when an old node queries an old, not yet
fixed index.
This makes a hole through which views that are only adjusted for reading
can slip through.
Here we plug the hole by fixing such views before they are registered.
Closes #8509
(cherry picked from commit 480a12d7b3)
Fixes #8554.
If any inactive read is left in the semaphore, it can block
`database::stop()` from shutting down, as sstables pinned by these reads
will prevent `sstables::sstables_manager::close()` from finishing. This
causes a deadlock.
It is not clear how inactive reads can be left in the semaphore, as all
users are supposed to clean up after themselves. Post 4.4 releases don't
have this problem anymore as the inactive read handle was made a RAII
object, removing the associated inactive read when destroyed. In 4.4 and
earlier releases this wasn't so, so errors could be made. Normally this
is not a big issue, as these orphaned inactive reads are just evicted
when the resources they own are needed, but it does become a serious
issue during shutdown. To prevent a deadlock, clear the inactive reads
earlier, in `database::stop()` (currently they are cleared in the
destructor). This is a simple and foolproof way of ensuring any
leftover inactive reads don't cause problems.
Fixes: #8561
Tests: unit(dev)
Closes#8562
The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the number of
slots Scylla uses. If other apps in the environment also try to use aio, aio
slots will run out.
So increase the value by 65536 for other apps.
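The resulting formula as a small Python sketch. The constant and function names are illustrative, and the 96-CPU case is only a worked example; a real script would pass the machine's CPU count (for instance from `os.cpu_count()`).

```python
AIO_SLOTS_PER_CPU = 11026        # slots Scylla needs per CPU, per this message
RESERVED_FOR_OTHER_APPS = 65536  # extra headroom added by this change

def aio_max_nr(cpus):
    """Value to write to fs.aio-max-nr for the given number of CPUs."""
    return cpus * AIO_SLOTS_PER_CPU + RESERVED_FOR_OTHER_APPS

scylla_only = 96 * AIO_SLOTS_PER_CPU
with_headroom = aio_max_nr(96)
print(scylla_only, with_headroom)
```

For 96 CPUs, Scylla's own share already exceeds 1048576, which is why a fixed limit of that size is too small on such machines.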
Related #8133
Closes #8228
(cherry picked from commit 53c7600da8)
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.
We need to tune the parameter based on the number of cpus, instead of
static setting.
Fixes #8133
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #8188
(cherry picked from commit d0297c599a)
This check is always true because a dummy entry is added at the end of
each cache entry. If that wasn't true, the check in else-if would be
UB.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit cb3dbb1a4b)
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward within this nonexistent
partition.
Test: unit(dev)
Fixes #8435
Fixes #8411
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 1f644df09d)
This new state stores the information whether current partition
represented by _key is present in underlying.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit ceab5f026d)
All callers pass false for its value so no need to keep it around.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit b3b68dc662)
This was previously done in create_underlying but ensure_underlying is
a better place because we will add more related logic to this
consumption in the following patches.
Refs #8435.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 088a02aafd)
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes #8447.
Closes #8448.
(cherry picked from commit 5c7ed7a83f)
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).
The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).
It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.
Fixes #8445.
Closes #8446.
(cherry picked from commit 7ffb0d826b)
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
Fixes #8432.
Closes #8433.
(cherry picked from commit 3687757115)
* seastar 2c884a7449...a75171fc89 (2):
> fair_queue: Preempted requests got re-queued too far
> fair_queue: Improve requests preemption while in pending state
Fixes #8296.
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.
Fixes #8354
Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)
[avi: re-added virtual keyword when backporting, since
4.4 and below don't have 020da49c89]
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data than requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
Fixes: #8162
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
(cherry picked from commit dd5a601aaa)
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
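Why the body has to be drained can be seen in a toy Python model. The frame format below (a 4-byte big-endian length followed by the body) and the names `frame`, `serve` and `MAX_BODY` are simplifications made up for this sketch; it is not the CQL wire format or the server's code.

```python
import io
import struct

MAX_BODY = 8  # hypothetical size limit that triggers shedding

def frame(body):
    """Toy frame: 4-byte big-endian length header followed by the body."""
    return struct.pack(">I", len(body)) + body

def serve(stream, drain_on_shed):
    results = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break
        (length,) = struct.unpack(">I", header)
        if length > MAX_BODY:
            results.append("shed")
            if drain_on_shed:
                stream.read(length)  # skip the body of the shed request
            continue
        results.append(stream.read(length).decode())
    return results

# An oversized request followed by a valid one on the same connection.
wire = frame(b"x" * 10) + frame(b"ok")
print(serve(io.BytesIO(wire), drain_on_shed=True))
print(serve(io.BytesIO(wire), drain_on_shed=False))
```

Without draining, the leftover body bytes are parsed as headers and the valid request that follows is never seen.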
Fixes #8193
Closes #8194
* github.com:scylladb/scylla:
cql-pytest: add a shedding test
transport: return error on correct stream during size shedding
transport: return error on correct stream during shedding
transport: skip the whole request if it is too large
transport: skip the whole request during shedding
(cherry picked from commit 0fea089b37)
This is a reworked submission of #7686 which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load; we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause an unrecoverable error since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur in
two main, uncommon scenarios: in a mixed cluster (during an upgrade) and in a small
window after a view or base table is altered.
Fixes #7709
Closes #8091
* github.com:scylladb/scylla:
database: Fix view schemas in place when loading
global_schema_ptr: add support for view's base table
materialized views: create view schemas with proper base table reference.
materialized views: Extract fix legacy schema into its own logic
(cherry picked from commit d473bc9b06)
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods, slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!
They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
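A minimal sketch of why an ordering violation confuses the lookup. Plain
integers stand in for promoted index block start positions; the values and
the helper name find_block are illustrative and not taken from the sstable
code:

```python
from bisect import bisect_right

def find_block(starts, pos):
    """Return the start of the block that binary search selects for pos.

    The result is only meaningful when `starts` is sorted.
    """
    idx = bisect_right(starts, pos) - 1
    return starts[max(idx, 0)]

ordered = [10, 20, 30, 40]
misordered = [10, 30, 20, 40]   # two entries written in the wrong order

print(find_block(ordered, 35))     # 30: the block that covers position 35
print(find_block(misordered, 35))  # 20: the search lands on the wrong entry
```

Which entries the search probes depends on the length of the array, which
matches the observation that a different read block size changes the outcome.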
The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
(cherry picked from commit 9272e74e8c)
"
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
A future<> close() method was added to return
_consumer_fut. It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.
With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.
Fixes #7904
Test: unit(release)
"
* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: add close
mutation_writer/feed_writers: refactor bucket/shard writers
mutation_writer: update bucket/shard writers consume_end_of_stream
(cherry picked from commit f11a0700a8)
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #8048
* github.com:scylladb/scylla:
cdc: Limit size of topology description
cdc: Extract create_stream_ids from topology_description_generator
(cherry picked from commit c63e26e26f)
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks. When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```
This change reverses the order and first checks that we found
enough disks before getting the first disk size.
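The reordered check can be sketched as follows. The function signature and
the disk_sizes mapping are hypothetical and only illustrate the ordering; they
are not the actual scylla_util.py code:

```python
def first_nvme_size(ephemeral_disks, disk_sizes, min_disks=1):
    """Return the size of the first ephemeral disk, or None if too few exist.

    `disk_sizes` is a hypothetical mapping from device name to size.
    """
    # Check the count first: indexing an empty list raises IndexError.
    if len(ephemeral_disks) < min_disks:
        return None
    first_disk = ephemeral_disks[0]
    return disk_sizes[first_disk]

print(first_nvme_size([], {}))                          # None
print(first_nvme_size(["nvme0n1"], {"nvme0n1": 1900}))  # 1900
```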
Fixes #7971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #8027
(cherry picked from commit 55e3df8a72)
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".
Refs #8089
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
By default the boto3 library waits up to 60 seconds for a response,
and if it gets no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration we lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:
1. When the test machine and the build are extremely slow, and the
operation is long (usually, CreateTable or DeleteTable involving
multiple views), the 60 second timeout might not be enough.
2. If the timeout is reached, boto3 silently retries the same operation.
This retry may fail because the previous one really succeeded at
least partially! The symptom is tests which report an error when
creating a table which already exists, or deleting a table which
doesn't exist.
The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).
Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.
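The failure mode of silent retries can be reproduced with a small simulation.
This is a sketch that does not use boto3: FakeTableService and
create_with_retries are hypothetical stand-ins for the server and for the
client's retry loop:

```python
class FakeTableService:
    """Stand-in server whose first reply is lost to a client-side timeout."""

    def __init__(self):
        self.tables = set()
        self.drop_next_reply = True

    def create_table(self, name):
        if name in self.tables:
            raise RuntimeError(f"table {name} already exists")
        self.tables.add(name)            # the operation itself succeeds
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("no response within the timeout")


def create_with_retries(service, name, max_retries):
    """Call create_table, silently retrying on timeout up to max_retries."""
    for attempt in range(max_retries + 1):
        try:
            service.create_table(name)
            return "created"
        except TimeoutError:
            if attempt == max_retries:
                raise                    # no retries left: report the timeout


for retries in (3, 0):
    try:
        create_with_retries(FakeTableService(), "t", retries)
    except (RuntimeError, TimeoutError) as e:
        print(f"max_retries={retries}: {type(e).__name__}: {e}")
```

With retries, the caller sees a misleading "already exists" error; without
them, it sees the timeout that actually happened.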
Fixes #8135
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
(cherry picked from commit 0b2cf21932)
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.
Refs #8297.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.
So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.
Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.
This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.
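The effect of serializing rewrites with a semaphore can be sketched with
asyncio. This only illustrates the idea; it is not the compaction manager
code, and cleanup_all and its parameters are hypothetical:

```python
import asyncio

async def cleanup_all(tables, max_parallel=1):
    """Rewrite tables under a semaphore; return the peak number in flight."""
    sem = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def cleanup_one(table):
        nonlocal running, peak
        async with sem:                # at most max_parallel rewrites at once
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)     # placeholder for the actual rewrite
            running -= 1

    await asyncio.gather(*(cleanup_one(t) for t in tables))
    return peak

print(asyncio.run(cleanup_all(["t1", "t2", "t3"])))                  # 1
print(asyncio.run(cleanup_all(["t1", "t2", "t3"], max_parallel=3)))  # 3
```

With the semaphore at 1, disk space and memory are needed for one table at a
time rather than for the whole keyspace.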
Fixes #8247.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
Previously, we crashed when the IN marker is bound to null. Throw
invalid_request_exception instead.
This is a 4.4 backport of the #8265 fix.
Tests: unit (dev)
(cherry picked from commit 8db24fc03b)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8307
Fixes #8265.
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.
However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that
evictable size < chunk size < evictable size + dummy size
In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
(cherry picked from commit 096e452db9)
in all expressions' from Nadav Har'El.
This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator. The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.
The first patch in the series fixes #8043, a bug in some error cases in
conditions, which was discovered while working on this series, and is
conceptually separate from the rest of the series.
Closes #8066
* github.com:scylladb/scylla:
alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
alternator: fix bug in ReturnValues=UPDATED_NEW
alternator: implemented nested attribute paths in UpdateExpression
alternator: limit the depth of nested paths
alternator: prepare for UpdateItem nested attribute paths
alternator: overhaul ProjectionExpression hierarchy implementation
alternator: make parsed::path object printable
alternator-test: a few more ProjectionExpression conflict test cases
alternator-test: improve tests for nested attributes in UpdateExpression
alternator: support attribute paths in ConditionExpression, FilterExpression
alternator-test: improve tests for nested attributes in ConditionExpression
alternator: support attribute paths in ProjectionExpression
alternator: overhaul attrs_to_get handling
alternator-test: additional tests for attribute paths in ProjectionExpression
alternator-test: harden attribute-path tests for ProjectionExpression
alternator: fix ValidationException in FilterExpression - and more
(cherry picked from commit cbbb7f08a0)
So it can be modified while walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.
Fixes #8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
(cherry picked from commit baf5d05631)
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
(cherry picked from commit 1ed04affab)
Ref #8143.
Due to a regression introduced by 463d0ab, regular compaction can compact in parallel an sstable
being compacted by cleanup, scrub or upgrade.
This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.
That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.
Fixes #8155.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)
Includes fixup patch:
compaction_manager: Fix use-after-free in rewrite_sstables()
Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.
It is fine that some of the remote nodes timeout in shadow round. It is
not a must to talk to all nodes.
This patch fixes an issue we saw recently in our sct tests:
```
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...
ERR | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```
Fixes #8187
Closes #8213
(cherry picked from commit dc40184faa)
Refs: #8012
Fixes: #8210
With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r. data.
Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.
Closes #8209
* github.com:scylladb/scylla:
alternator::streams: Use better method for generation timestamp
system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
(cherry picked from commit e12e57c915)
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
---
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Closes #8116
* github.com:scylladb/scylla:
tests: add a simple CDC cql pytest
cdc: add config option to disable streams rewriting
cdc: rewrite streams to the new description table
cql3: query_processor: improve internal paged query API
cdc: introduce no_generation_data_exception exception type
docs: cdc: mention system.cdc_local table
cdc: coroutinize do_update_streams_description
sys_dist_ks: split CDC streams table partitions into clustered rows
cdc: use chunked_vector for streams in streams_version
cdc: remove `streams_version::expired` field
system_distributed_keyspace: use mutation API to insert CDC streams
storage_service: don't use `sys_dist_ks` before it is started
(cherry picked from commit f0950e023d)
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).
However, looking into /proc/mdstat of the AMI, it actually says no active md device is available:
ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>
We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel doc, the file may be available even when the array is STOPPED:
clear
No devices, no size, no level
Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html
So we should also check array_state != 'clear', not just array_state
existence.
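A minimal sketch of the stricter check. The function name and the sysfs_root
parameter are illustrative (the parameter exists only to make the sketch
testable); this is not the actual scylla_raid_setup code:

```python
import os

def md_device_in_use(name, sysfs_root="/sys/block"):
    """Return True only if mdX exists and its array_state is not 'clear'."""
    path = os.path.join(sysfs_root, name, "md", "array_state")
    try:
        with open(path) as f:
            state = f.read().strip()
    except FileNotFoundError:
        return False          # no array_state file: the device is free
    # 'clear' means no devices, no size, no level: the array is stopped.
    return state != "clear"

print(md_device_in_use("md0"))
```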
Fixes #8219
Closes #8220
(cherry picked from commit 2d9feaacea)
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it has to.
to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.
Fixes #6054.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
(cherry picked from commit 5206a97915)
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.
Fixes #8147.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
dh_installinit --name <service> forces installation of debian/*.service
and debian/*.default files that do not match the package name.
And if we have subpackages, the packager is responsible for renaming
debian/*.service to debian/<subpackage>.*service.
However, we are currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service,
so the packaging system tries to find the destination package for the .service,
does not find a subpackage name in it, and picks the first
subpackage ordered by name, scylla-conf.
To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.
Fixes #8163
Closes #8164
(cherry picked from commit aabc67e386)
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow the scylla user to read the files.
Note that perftune.yaml does not really need to be set to 0644 since perftune.py
runs as root, but we set it anyway to align its permissions with the other files.
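A minimal sketch of the approach, with a hypothetical helper name. The mode
passed to os.chmod() is applied as given and is not filtered by the process
umask, unlike the mode used when the file is created:

```python
import os

def write_readable_file(path, content):
    """Write a file and make it world-readable regardless of the umask."""
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, 0o644)   # explicit mode; chmod ignores the umask
```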
Fixes #8049
Closes #8119
(cherry picked from commit f3a82f4685)
* seastar 572536ef...74ae29bc (3):
> perftune.py: fix assignment after extend and add asserts
> scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
> scripts/perftune.py: remove repeated items after merging options from file
Fixes #7968.
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.
Commit 9124a70 introduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.
This could lead to the sstable write to crash, or generate a corrupted
sstable.
Fixes #7994
Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly
reports root partition is part of ephemeral disks, and RAID construction will
fail.
This prevents the error and reports correct free disks.
Fixes #8055
Closes #8040
(cherry picked from commit 32d4ec6b8a)
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.
When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.
This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.
Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.
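The idea of passing per-operand origin flags can be sketched as follows. This
is a simplified illustration and not the Alternator code: check_lt, _kind and
the treatment of mismatched types are assumptions made for the example:

```python
class ValidationException(Exception):
    pass

def _kind(v):
    """Classify a value as Number, String or Binary; None if unsupported."""
    if isinstance(v, bool):
        return None
    if isinstance(v, (int, float)):
        return "N"
    if isinstance(v, str):
        return "S"
    if isinstance(v, bytes):
        return "B"
    return None

def check_lt(v1, v2, v1_from_query, v2_from_query):
    k1, k2 = _kind(v1), _kind(v2)
    # A bad operand supplied by the query is reported to the client.
    if (v1_from_query and k1 is None) or (v2_from_query and k2 is None):
        raise ValidationException("unsupported operand type in query")
    # A bad or mismatched operand read from the item: no match, no error.
    if k1 is None or k2 is None or k1 != k2:
        return False
    return v1 < v2

print(check_lt(1, 2, False, True))      # True
print(check_lt([1], 2, False, True))    # False: the item is silently skipped
```

An old-syntax caller would always pass (False, True), while a new-syntax
caller decides each flag per operand.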
Fixes #8043
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes #8062
Closes #8063
(cherry picked from commit 2aa4631148)
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values.
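The pitfall comes from operator precedence: in Python, & binds tighter than
comparisons. The condition below is a hypothetical example and not the actual
scylla_io_setup expression:

```python
def in_range_bitwise(nr_disks):
    # Parsed as the chained comparison: nr_disks >= (1 & nr_disks) <= 2
    return nr_disks >= 1 & nr_disks <= 2

def in_range_logical(nr_disks):
    return nr_disks >= 1 and nr_disks <= 2

print(in_range_bitwise(4))  # True, although 4 is outside the range
print(in_range_logical(4))  # False
```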
Fixes #7341
Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Closes #8019
(cherry picked from commit 718976e794)
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".
Fixes #8054
Closes #8053
(cherry picked from commit 856fe12e13)
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.
The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.
What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
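The truncation can be sketched with the standard uuid module. The helper name
is illustrative and this is not the test's actual code; a version 1 UUID
stores time as 100-nanosecond intervals since 1582-10-15:

```python
import uuid

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_unix_ms(u):
    """Unix time stored in a version 1 UUID, truncated to whole milliseconds."""
    return (u.time - UUID_EPOCH_OFFSET) // 10_000

print(timeuuid_unix_ms(uuid.uuid1()))
```

The integer division drops the sub-millisecond part, so the result is
comparable with a millisecond-resolution CQL timestamp.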
Fixes #8060
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
(cherry picked from commit a03a8a89a9)
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mixed cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
(cherry picked from commit 3d001b5587)
A few methods in the column_family API were doing the aggregation wrong,
specifically, bloom filter disk size.
The issue is not always visible, it happens when there are multiple
filter files per shard.
Fixes #4513
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #8007
(cherry picked from commit 4498bb0a48)
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify --keep-umask
option.
Fixes #6243
Closes #6244
* github.com:scylladb/scylla:
dist/offline_redhat: fix umask error
dist/offline_installer/redhat: support cross build
(cherry picked from commit bb202db1ff)
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
Closes #7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() eariler
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
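The linearity property can be illustrated with Python's hashlib for md5 and
sha256 (xxhash is not in the standard library, so it is left out here). This
is a sketch of the property, not the hashers_test itself, and hash_fragments
is a hypothetical helper:

```python
import hashlib

def hash_fragments(algo, fragments):
    """Feed each fragment to the hasher in order and return the digest."""
    h = hashlib.new(algo)
    for frag in fragments:
        h.update(frag)
    return h.digest()

a, b = b"clustering", b"-key"
for algo in ("md5", "sha256"):
    same = hash_fragments(algo, [a, b]) == hash_fragments(algo, [a + b])
    print(algo, same)   # True for both: update() is linear in concatenation
```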
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.
partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.
This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
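A rough sketch of the accounting difference, under stated assumptions: the 16-byte FRAGMENT_OVERHEAD and both Python functions are hypothetical stand-ins, not the real managed_bytes layout or API. It only shows why counting the overhead once gives an allocator-independent lower bound.

```python
import math

FRAGMENT_OVERHEAD = 16  # hypothetical per-fragment header size, in bytes

def external_memory_usage(size: int, max_fragment: int) -> int:
    """Allocator-dependent estimate: one overhead per fragment."""
    fragments = max(1, math.ceil(size / max_fragment))
    return size + fragments * FRAGMENT_OVERHEAD

def minimal_external_memory_usage(size: int) -> int:
    """Allocator-independent lower bound: the overhead is counted once."""
    return size + FRAGMENT_OVERHEAD

size = 1000
in_lsa = external_memory_usage(size, max_fragment=1000)      # one fragment
in_standard = external_memory_usage(size, max_fragment=100)  # ten fragments
# The same data costs more where fragments are smaller ...
assert in_standard > in_lsa
# ... while the minimal estimate never exceeds either of them.
assert minimal_external_memory_usage(size) <= min(in_lsa, in_standard)
```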
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
Since we introduced the relocatable package and the offline installer, the scylla binary itself can run on almost any distribution.
However, the setup scripts are not designed to run on unsupported distributions, and they fail with errors in such environments.
This PR adds minimal support for running the offline installation on unsupported distributions; it was tested on SLES, Arch Linux and Gentoo.
Closes #7858
* github.com:scylladb/scylla:
dist: use sysconfig_parser to parse gentoo config file
dist: add package name translation
dist: support SLES/OpenSUSE
install.sh: add systemd existence check
install.sh: ignore error missing sysctl entries
dist: show warning on unsupported distributions
dist: drop Ubuntu 14.04 code
dist: move back is_amzn2() to scylla_util.py
dist: rename is_gentoo_variant() to is_gentoo()
dist: support Arch Linux
dist: make sysconfig directory detectable
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragments until the buffer is full[1] without yielding.
This is the code path:
flush_reader::fill_buffer() <---------|
flat_mutation_reader::consume_pausable() <--------|
partition_snapshot_flat_reader::fill_buffer() -|
[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261
This is fixed by breaking the loop in do_fill_buffer() if preemption is needed,
which allows do_until() to yield. When it resumes, it continues from
where it left off, until the buffer is full.
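The control flow can be sketched as follows. This is a simplified Python model, not the actual C++ reader: the Reader class, its fields and the need_preempt callback are assumptions made for illustration, and a generator's yield stands in for the point where do_until() lets the scheduler run other tasks.

```python
from collections import deque

class Reader:
    """Toy model of a reader whose inner fill loop honours preemption."""

    def __init__(self, source, buffer_size, need_preempt):
        self._source = iter(source)
        self._buffer_size = buffer_size
        self._need_preempt = need_preempt  # callable returning True when we should yield
        self.buffer = deque()
        self.end_of_stream = False

    def is_buffer_full(self):
        return len(self.buffer) >= self._buffer_size

    def do_fill_buffer(self):
        # Produce fragments until the buffer is full, the source is
        # exhausted, or preemption is requested.
        while not self.is_buffer_full():
            try:
                self.buffer.append(next(self._source))
            except StopIteration:
                self.end_of_stream = True
                return
            if self._need_preempt():
                return  # break out so that the outer loop can yield

    def fill_buffer(self):
        # Each 'yield' is a point where other tasks may run; on resume
        # the loop continues from where it left off.
        while not self.is_buffer_full() and not self.end_of_stream:
            self.do_fill_buffer()
            yield

reader = Reader(range(10), buffer_size=3, need_preempt=lambda: True)
yields = sum(1 for _ in reader.fill_buffer())
assert list(reader.buffer) == [0, 1, 2]
assert yields == 3  # one yield per fragment when preemption is always requested
```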
Fixes #7885.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
Implicit revert of 6322293263.
sshd was previously used by Scylla Manager 1.0.
The new version does not need it, so there is no point in
keeping it. It also confuses everyone.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Closes #7921
client_state::check_access() calls the global storage service
to get the features from it and check whether the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However, it was noticed that the check for the feature is redundant:
all the guarded if-s resolve to false when CDC is off,
so check_access() effectively works the same as it would with the
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
The offline installer can run on non-systemd distributions, but it won't
work there since we only have systemd units.
So check for systemd existence and print an error message.
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
so we should rename it to is_gentoo().
Currently, install.sh provides a way to customize the sysconfig directory,
but the sysconfig directory is hardcoded in the scripts.
Also, /etc/sysconfig seems to be the correct default value, but the current
code specifies /etc/default for non-redhat distributions.
Instead of hardcoding, generate a python script in install.sh
to save the specified sysconfig directory path in python code.
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
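The idea can be sketched in Python as below. The function write_column_families() and the dictionary fields are hypothetical and this is not the actual body_writer API: the point is only that the reply is written to the stream one element at a time, so no single string ever holds the whole reply.

```python
import io
import json

def write_column_families(out, column_families):
    """Write a JSON array to `out` chunk by chunk instead of building one big string."""
    out.write("[")
    for i, cf in enumerate(column_families):
        if i:
            out.write(",")
        # Only one element is serialized (and held in memory) at a time.
        out.write(json.dumps(cf))
    out.write("]")

# Usage: any object with a write() method can act as the output stream.
stream = io.StringIO()
write_column_families(stream, ({"ks": "ks1", "cf": f"t{i}"} for i in range(3)))
assert len(json.loads(stream.getvalue())) == 3
```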
Fixes #7916
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #7917
After these changes the generated code deserializes the stream into a chunked vector, instead of a contiguous one, so even if there are many fields in it, there won't be any big allocations.
I haven't run the scylla cluster test with it yet, but it passes the unit tests.
Closes #7919
* github.com:scylladb/scylla:
idl: change the type of mutation_partition_view::rows() to a chunked_vector
idl-compiler: allow fields of type utils::chunked_vector
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
results in an error (FunctionFailure), not silent truncation.
Scylla today does silently wrap around the number, so these tests
xfail. They pass on Cassandra.
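The difference between the two behaviours can be shown with a small sketch. It is plain Python, not the CQL implementation: from_json_int32(), wrap_int32() and the FunctionFailure class here are hypothetical stand-ins for the behaviour the tests expect and the behaviour they reject.

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

class FunctionFailure(Exception):
    """Stand-in for the clean error the tests expect."""

def from_json_int32(text: str) -> int:
    """Convert a JSON number to a 32-bit int, failing instead of wrapping."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as e:
        raise FunctionFailure(f"invalid JSON: {e}") from e
    if isinstance(value, bool) or not isinstance(value, int):
        raise FunctionFailure(f"expected an integer, got {value!r}")
    if not INT32_MIN <= value <= INT32_MAX:
        raise FunctionFailure(f"{value} is out of range for a 32-bit int")
    return value

def wrap_int32(value: int) -> int:
    """Silent two's-complement wrap-around, the behaviour the tests reject."""
    return (value + 2**31) % 2**32 - 2**31

assert from_json_int32("2147483647") == INT32_MAX
assert wrap_int32(2**31) == INT32_MIN  # the overflow silently changes the sign
```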
Refs #7914.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.
The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
This patch adds a reproducer test for issue #7912, which is about passing
a null parameter to the fromJson() function: this is supposed to be legal
(and return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.
There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.
Refs #7912.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
Related issue scylladb/sphinx-scylladb-theme#88
Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com
Frequently asked questions:
Should we change the links in the README/docs folder?
GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.
Do I need to add this new domain in the custom dns domain section on GitHub settings?
It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.
The DNS doesn't seem to have the right SSL certificates
GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.
Closes #7877
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid. But that yields wrong
results; timeuuids should be compared as timestamps.
Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish between the two. Then specialize the
aggregation utilities for timeuuid.
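Why the two orders can disagree is sketched below with Python's standard uuid module. The byte-wise comparison is only an illustration of a non-timestamp ordering and is not claimed to be exactly what our uuid comparator does; make_timeuuid() is a helper made up for this example. In a version-1 UUID the low 32 bits of the timestamp come first in the byte layout, so ordering by bytes and ordering by timestamp can give opposite answers.

```python
import uuid

def make_timeuuid(timestamp: int) -> uuid.UUID:
    """Build a version-1 UUID carrying the given 60-bit timestamp."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    # clock_seq_hi_variant=0x80 selects the RFC 4122 variant; node is 0.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0))

older = make_timeuuid(0x0FFFFFFFF)
newer = make_timeuuid(0x100000000)

# By timestamp, 'newer' is the maximum ...
assert max(older, newer, key=lambda u: u.time) == newer
# ... but a byte-wise comparison picks 'older', because time_low comes first.
assert max(older, newer, key=lambda u: u.bytes) == older
```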
Add a cql-pytest and change some unit tests, which relied on naive
uuid comparators.
Fixes #7729.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7910
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
Fixes #4617.
tests:
- mode(dev).
- manually tested it by forcing a flush of memtable spanning many windows
"
* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
test: Add test for TWCS interposer on memtable flush
table: Wire interposer consumer for memtable flush
table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
table: Allow sstable write permit to be shared across monitors
memtable: Track min timestamp
table: Extend cache update to operate a memtable split into multiple sstables
This patch adds a reproducer test for issue #7911, which is about a parse
error in a JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.
The issue is still open so the test xfails on Scylla (and passes on
Cassandra).
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
The option can only take integer values >= 0, since a negative
TTL is meaningless and is expected to fail the query when used
with the `USING TTL` clause.
It's better to fail early on a `CREATE TABLE` or `ALTER TABLE`
statement with a descriptive message than to catch the
error during the first LWT `INSERT` or `UPDATE` while trying
to insert into the system.paxos table with the desired TTL.
Tests: unit(dev)
Fixes: #7906
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
Unfortunately, snapshot checking still does not work in the presence of
log entry reordering. It is impossible to know exactly when the
snapshot will be taken, and if it is taken before all entries with an
index smaller than the snapshot idx are applied, the check will fail,
since it assumes that they have been applied.
This patch disables snapshot checking for the SUM state machine that is used
in the backpressure test.
Message-Id: <20201126122349.GE1655743@scylladb.com>
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for a std::vector, we change its type to a utils::chunked_vector.
Fixes #7918
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We have recently seen a suspected corrupt mutation fragment stream get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered by a validator. To prevent events like this in the future,
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.
Refs: #7623
Refs: #7640
Tests: unit(release)
* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
sstable_writer: add validation
test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
mutation_fragment_stream_validator: make it easier to validate concrete fragment types
flat_mutation_reader: extract fragment stream validator into its own header
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_consistency`), but they
do have a user-defined constructor which does the following thing:
this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;
This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.
Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.
Tests: unit(dev)
Fixes: #7850
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".
This bug still exists, so the test is marked xfail.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we tried to insert an item to such a table using an
unprepared statement, it failed with a wrong error ("invalid set literal"),
but if we tried to set up a prepared statement, the result was even worse -
an assertion failure and a crash.
Interestingly, neither of these problems happens without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.
All tests pass successfully when run against Cassandra.
The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.
In porting these tests, I uncovered four previously unknown bugs in Scylla:
Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.
These tests also provide two more reproducers for an already known bug:
Refs #7745: Length of map keys and set items are incorrectly limited to
64K in unprepared CQL.
Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.
As usual in these sort of tests, all 33 pass when running against Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.
The current validator from sstable::write_components() is removed, as it
covered only part of the write paths. Ad-hoc validations in the reader
implementations are removed as well, as they are now redundant.
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`). More flexible and better named.
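The pattern can be sketched in Python as follows. The classes are toy stand-ins for abstract_type, set_type_impl and reversed_type_impl, and is_set() is a helper invented for this example: a type check first strips the reversed wrapper, so reversed and non-reversed columns are handled the same way.

```python
class AbstractType:
    def without_reversed(self):
        """Return the type itself; overridden by the reversed wrapper."""
        return self

class SetType(AbstractType):
    def __init__(self, element_type):
        self.element_type = element_type

class ReversedType(AbstractType):
    """Wrapper used when the clustering order of a column is reversed."""
    def __init__(self, underlying):
        self.underlying = underlying

    def without_reversed(self):
        return self.underlying

def is_set(t: AbstractType) -> bool:
    # Checking the wrapper directly would miss reversed set columns.
    return isinstance(t.without_reversed(), SetType)

plain = SetType("int")
reversed_set = ReversedType(SetType("int"))
assert is_set(plain) and is_set(reversed_set)
assert not isinstance(reversed_set, SetType)  # the naive check fails here
```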
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7889
Currently, frozen mutations that contain partitions with out-of-order
or duplicate rows will trigger (if anything at all) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if
any), as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.
This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.
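A minimal sketch of the check, assuming rows arrive in ascending clustering key order. The PartitionBuilder class below is a simplified Python model and not the real mutation_partition code; keys are plain comparable values and the error text is made up.

```python
class PartitionBuilder:
    """Toy model: rows may only be appended in strictly ascending key order."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.rows = []  # list of (clustering_key, row) pairs

    def append_clustered_row(self, key):
        # Check as early as possible, when the supposedly empty row is created.
        if self.rows and key <= self.rows[-1][0]:
            kind = "duplicate" if key == self.rows[-1][0] else "out-of-order"
            # Include the partition context to make the error actionable.
            raise ValueError(
                f"{kind} row {key!r} in partition {self.partition_key!r}")
        row = {}
        self.rows.append((key, row))
        return row

builder = PartitionBuilder("pk1")
builder.append_clustered_row(1)
builder.append_clustered_row(2)
try:
    builder.append_clustered_row(2)
except ValueError as e:
    assert "duplicate" in str(e) and "pk1" in str(e)
```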
Tests: unit(release)
* botond/frozen-mutation-bad-row-diagnostics/v3:
frozen_mutation: add partition context to errors coming from deserializing
partition_builder: accept_row(): use append_clustering_row()
mutation_partition: add append_clustered_row()
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated in those two allocators
had fragments of different size, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
`linearization_context` instances.
* `do_linearize` function
* `linearization_context` class
Since there are no more public or private methods in `managed_bytes`
that linearize the value, except for the explicit `with_linearized()`,
which doesn't use any of the aforementioned parts, we can safely remove
these.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is no point in calling the wrapper since linearization code
is private in `managed_bytes` class and there is no one to call
`managed_bytes::data` because it was deleted recently.
This patch is a prerequisite for removing
`with_linearized_managed_bytes` function completely, alongside with
the corresponding parts of implementation in `managed_bytes`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `managed_bytes::data()` is deleted as well as other public
APIs of `managed_bytes` which would linearize stored values except
for explicit `with_linearized`, there is no point in
invoking the `with_linearized_managed_bytes` hack, which would trigger
automatic linearization under the hood of managed_bytes.
Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.
Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.
The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.
The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.
operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.
It conforms to the FragmentedView concept and is mainly used through that
interface.
It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
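What a fragmented, non-owning view looks like can be sketched in Python as below. The FragmentedView class is a toy model and not the real managed_bytes_view or the FragmentedView concept: it shows that indexing has to walk the fragment list, and that linearizing means copying all fragments into one contiguous buffer.

```python
class FragmentedView:
    """Toy model of a view over data stored as a list of fragments."""

    def __init__(self, fragments):
        # Keep references to the non-empty fragments; nothing is joined here.
        self.fragments = [f for f in fragments if f]

    def __len__(self):
        return sum(len(f) for f in self.fragments)

    def __getitem__(self, index):
        # Cost grows with the number of fragments, unlike indexing a
        # contiguous buffer. Negative indices are not supported in this sketch.
        if index < 0:
            raise IndexError(index)
        for fragment in self.fragments:
            if index < len(fragment):
                return fragment[index]
            index -= len(fragment)
        raise IndexError("index out of range")

    def linearize(self):
        """Copy all fragments into one contiguous bytes object."""
        return b"".join(self.fragments)

view = FragmentedView([b"ab", b"", b"cde"])
assert len(view) == 5
assert view[2] == ord("c")
assert view.linearize() == b"abcde"
```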
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
Unset values for key and value were not handled. Handle them in a
manner matching Cassandra.
This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.
This fixes testListWithUnsetValues, so re-enable it.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Since we haven't implemented parse errors in the redis protocol parser,
the reply message is broken when a parse error occurs.
Implement parse errors and reply with a correct error message.
Fixes #7861
Fixes #7114
Closes #7862
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl. Therefore, we have to check
for both reversed type and map type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass. However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl. Therefore, we have to check
for both reversed type and list type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass. However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl. Therefore, we have to check
for both reversed type and set type in some places.
To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type. Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This patch enables CQL SELECT statements that select collection columns
in queries where a clustering column is restricted by the
"IN" CQL operator. Such queries have been accepted by Cassandra since v4.0.
The internals already provide correct support for this feature;
this patch simply removes the relevant CQL query check.
Tests: cql-pytest (testInRestrictionWithCollection)
Fixes #7743
Fixes #4251
Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
* seastar d1b5d41b...a2fc9d72 (6):
> perftune.py: support passing multiple --nic options to tune multiple interfaces at once
> perftune.py recognize and sort IRQs for Mellanox NICs
> perftune.py: refactor getting of driver name into __get_driver_name()
Fixes #6266
> install-dependencies: support Manjaro
> append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
> future: do_until: handle exception in stop condition
"
The size_estimates_mutation_reader calls the global proxy
to get the database from it. The database is used to find the keyspaces
to work with. However, it's safe to keep a local database
reference on the reader itself.
tests: unit(debug)
"
* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
size_estimate_reader: Use local db reference not global
size_estimate_reader: Keep database reference on mutation reader
size_estimate_reader: Keep database reference on virtual_reader
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has an effect only for TWCS users, as TWCS is the only strategy that
implements this interposer consumer, which consists of
segregating data according to the window configuration.
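The segregation step can be sketched as follows. This is a simplified Python model, not the actual interposer consumer: segregate_by_window(), the (timestamp, payload) row shape and the window size in seconds are assumptions made for illustration. Each resulting bucket corresponds to one sstable spanning a single time window.

```python
from collections import defaultdict

def segregate_by_window(rows, window_size):
    """Group (timestamp, payload) pairs by the time window they fall into."""
    buckets = defaultdict(list)
    for timestamp, payload in rows:
        window_start = timestamp - timestamp % window_size
        buckets[window_start].append((timestamp, payload))
    return dict(buckets)

# A memtable whose data spans two 60-second windows ...
memtable = [(5, "a"), (65, "b"), (59, "c")]
# ... is split into one bucket (sstable) per window on flush.
assert segregate_by_window(memtable, window_size=60) == {
    0: [(5, "a"), (59, "c")],
    60: [(65, "b")],
}
```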
Fixes #4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In this patch, we port validation/entities/collection_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.
There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
In issue #7843 questions were raised about how much Scylla
supports the notion of Unicode Equivalence, a.k.a. Unicode normalization.
Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrate that this
is not, in fact, supported in Scylla or Cassandra:
1. If you use one representation as the key, then looking up the other one
will not find the row. Scylla (and Cassandra) do *not* consider
the two strings equivalent.
2. The LIKE operator (a Scylla-only extension) doesn't know that
the single-character ñ begins with an n, or that the two-character
ñ is just a single character.
This is despite the thinking on #7843 that by using ICU in the
implementation of LIKE, we somehow got support for this. We didn't.
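The two representations can be compared directly with Python's standard unicodedata module. This sketch illustrates Unicode equivalence itself, not the Scylla tests: without explicit normalization the two strings are simply different code point sequences.

```python
import unicodedata

composed = "\u00f1"     # single code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"  # "n" followed by COMBINING TILDE

# Without normalization the strings differ, in content and in length.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# A prefix match on "n" sees the two representations differently.
assert decomposed.startswith("n") and not composed.startswith("n")

# Only after normalizing both to the same form do they compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```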
Refs #7843
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
This patch adds a reproducer for issue #7856, which is about frozen sets
and how we can in Scylla (but not in Cassandra), insert one in the "wrong"
order, but only in very specific circumstances which this reproducer
demonstrates: The bug can only be reproduced in a nested frozen collection,
and using prepared statements.
Refs #7856
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
- the mystical `accept_predicate` is renamed to `accept_keyspace`
to be more self-descriptive
- a short comment is added to the original calculate_schema_digest
function header, mentioning that it computes schema digest
for non-system keyspaces
Refs #7854
Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
In scylla-jmx, we fixed a hardcoded sysconfdir in the EnvironmentFile path
by using realpath to convert the path. This patch uses
realpath in the scylla repo as well, to make it consistent with scylla-jmx.
Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7860
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.
To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.
Tests:
* unit(dev)
* manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)
Refs #7617
Closes #7854
* github.com:scylladb/scylla:
schema_change_test: skip distributed system tables in digest
schema_tables: allow custom predicates in schema digest calc
alternator: drop unneeded sstring creation
system_keyspace: migrate helper functions to string_view
database: migrate find_keyspace to string views
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.
This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
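The predicate idea can be sketched in Python (the actual change is in C++). This is only an illustration: `calculate_digest`, `skip_internal`, the keyspace list and the hashing scheme are assumptions made for the example, not Scylla's real signatures or digest algorithm.

```python
import hashlib

# Hypothetical list of internal keyspaces, for illustration only.
INTERNAL_KEYSPACES = {"system", "system_distributed", "system_traces"}

def calculate_digest(schema, accept_keyspace=lambda ks: True):
    """Hash the schema of every keyspace accepted by the predicate."""
    h = hashlib.sha256()
    for ks in sorted(schema):
        if not accept_keyspace(ks):
            continue  # keyspace is excluded from the digest
        for table in sorted(schema[ks]):
            h.update(f"{ks}.{table}:{schema[ks][table]}".encode())
    return h.hexdigest()

def skip_internal(ks):
    # Predicate used by the test: ignore internal keyspaces.
    return ks not in INTERNAL_KEYSPACES

before = {"ks": {"t": "pk int"}, "system_distributed": {"a": "x int"}}
# Same user schema, but a new internal distributed table was added.
after = {"ks": {"t": "pk int"}, "system_distributed": {"a": "x int", "b": "y int"}}
```

With the `skip_internal` predicate the two digests are equal, while the default predicate yields different digests, which is the property the test relies on.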
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speedup tests by not needing to start the gossiper.
The thing is, we always start the gossiper in our cql tests, so the flag only
introduces noise. And, of course, since we want to move schema to use raft,
being able to apply modifications only locally goes against the nature of
raft, so we had better get rid of the capability ASAP.
Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
When a node notices that it uses legacy SI tables, it converts them to the
new format, but it updates only the local schema. That only causes schema
discrepancy between nodes; the schema change should propagate
globally.
Fixes #7857.
Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.
It turns out that filtering_reader is adding significant overhead when regular
compactions are running. As no other compaction type needs to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable interposer consumer on behalf of cleanup,
or add support for the consumers to do the filtering themselves but that
would add lots of complexity.
Fixes #7748.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
This filter is used to discard data that doesn't belong to current
shard, but scylla will only make a sstable available to regular
compaction after it was resharded on either boot or refresh.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...
Fixes #7851
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key columns missing. The first test confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, that was already marked fixed).
However, the second test demonstrates that we still have a bug when
the key column appears in the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.
Refs #3665
Refs #7852
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
In a previous version of test_using_timeout.py, we had tables pre-filled
with some content labeled "everything". The current version of the tests
doesn't use it, so drop it completely.
One test, test_per_query_timeout_large_enough, still had code that did
    res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
    assert res == everything
This was a bug - it only works as expected if this test is run before
any other test is run, and will fail if we ever reorder or parallelize
these tests. So drop these two lines.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.
NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.
Closes #7453
* github.com:scylladb/scylla:
alternator: coroutinize untagging a resource
alternator: coroutinize tagging a resource
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Closes #7832
* github.com:scylladb/scylla:
scylla_setup: cleanup if judgments
scylla_setup: enable node_exporter for offline installation
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
Fixes #7758.
Closes #7842
* github.com:scylladb/scylla:
table: Fix potential reactor stall on LCS compaction completion
table: decouple preparation from execution when updating sstable set
table: change rebuild_sstable_list to return new sstable set
row_cache: allow external updater to decouple preparation from execution
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:
- generated "random" ranges are sequential and may only overlap on
their borders
- test uses the keys of the same prefix length
Enhance the generator part to produce a purely random sequence of ranges
with bound keys of arbitrary length. Just pay attention to generate the
"valid" individual ranges, whose start is not ahead of the end.
Also -- rename the test to reflect what it's doing and increase the
number of iterations.
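The generator described above might look roughly like the following Python sketch. It is a simplification under stated assumptions: keys are modeled as integer tuples compared lexicographically, and the helper names `random_key` and `random_range` are made up for the example; the real test works on clustering key prefixes and bound kinds.

```python
import random

def random_key(rng, max_len=4, max_component=10):
    # A key is a tuple of 1..max_len components, i.e. an arbitrary prefix length.
    return tuple(rng.randrange(max_component) for _ in range(rng.randint(1, max_len)))

def random_range(rng):
    a, b = random_key(rng), random_key(rng)
    # Order the bounds so that the start is never ahead of the end.
    return (a, b) if a <= b else (b, a)

rng = random.Random(42)  # fixed seed, so a failure can be reproduced
ranges = [random_range(rng) for _ in range(1000)]
```

Unlike sequential generation, ranges produced this way can nest and overlap arbitrarily, which is what exercises the de-overlapping logic.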
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.
Fixes #7758.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Row cache now allows the updater to first prepare the work, and then execute
the update atomically as the last step. Let's do that when rebuilding
the set: the new set is created in the preparation phase, and it
replaces the old one in the execution phase, satisfying the
atomicity requirement of row cache.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The procedure is changed to return the new set, so the caller will be
responsible for replacing the old set with the new one. This will allow our
future work where building the new set and enabling it will be decoupled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.
Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.
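A rough Python/asyncio analogy of the prepare/execute split follows. The `Cache` class and the `update`, `prepare` and `execute` names are hypothetical stand-ins chosen for the example; they do not mirror the row_cache API.

```python
import asyncio

class Cache:
    def __init__(self, sstables):
        self.sstables = list(sstables)

    async def update(self, prepare, execute):
        # Preparation may yield and may fail; nothing is modified yet.
        prepared = await prepare()
        # Execution is a plain synchronous call with no yield points,
        # so other tasks never observe a half-updated list.
        execute(self, prepared)

async def rebuild(cache, removed, added):
    async def prepare():
        new_list = []
        for s in cache.sstables:
            if s not in removed:
                new_list.append(s)
            await asyncio.sleep(0)  # give other tasks a chance to run
        new_list.extend(added)
        return new_list

    def execute(c, new_list):
        c.sstables = new_list  # single assignment replaces the old list

    await cache.update(prepare, execute)

cache = Cache(["a", "b", "c"])
asyncio.run(rebuild(cache, removed={"a", "b"}, added=["d"]))
```

If `prepare` raises, `execute` never runs and the old list stays in place, which is the exception-safety argument made above.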
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This is usually more than
enough. However, in extreme cases like noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).
So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached, there
is no code which loops for this amount of time for example.
Tested that this patch fixes #7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.
Fixes #7838.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
The MAX_LEVELS is the levels count, but sstable level (index) starts
from 0. So the maximum and valid level is MAX_LEVELS - 1.
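As a small illustration of the off-by-one, here is a sketch in Python. The value of `MAX_LEVELS` and the helper `is_valid_level` are assumptions made for the example, not the actual code.

```python
MAX_LEVELS = 9  # assumed value, for illustration only

def is_valid_level(level):
    # Levels are 0-based indexes, so the last valid one is MAX_LEVELS - 1.
    return 0 <= level <= MAX_LEVELS - 1
```

A check written as `level <= MAX_LEVELS` would wrongly accept one level past the end.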
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7833
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer we might hit use-after-free
as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
(inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
(inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
(inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
(inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
(inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
(inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
(inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
(inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
(inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
(inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
(inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
(inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
(inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
(inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
(inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
(inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
(inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
(inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```
What happens here is that:
    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
        : _out(std::move(out))
        , _compression_metadata(cm)
        , _offsets(_compression_metadata->offsets.get_writer())
        , _compression(lc)
        , _full_checksum(ChecksumType::init_checksum())
_compression_metadata points to a buffer held by the sstable object,
and _compression_metadata->offsets.get_writer returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.
Fixes #7821
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
When applying one mutation partition to another, if a dummy entry
from the source falls into a destination continuous range, it
can be just dropped. However, the current implementation still
inserts it and then instantly removes it.
Relax this code-flow by dropping the unwanted entry without
tossing it.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Signed-off-by: Amos Kong <amos@scylladb.com>
When a rows_entry is added to row_cache it's constructed from
clustering_row by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletable_row onboard
from which rows_entry can copy-construct its own.
This keeps the rows_entry and deletable_row sets of
constructors a bit shorter.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
"
This series does a lot of cleanups, dead code removal, and most
importantly fixes the following things in IDL compiler tool:
* The grammar now rejects invalid identifiers; previously,
in some cases, it was possible to write things like `std:vector`.
* Error reporting is improved significantly and failures are
now pointing to the place of failure much more accurately.
This is done by restricting rule backtracking on those rules
which don't need it.
"
* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
idl: move enum and class serializer code writers to the corresponding AST classes
idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
idl: minor fixes and code simplification
idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
idl: fix parsing of basic types and discard unneeded terminals
idl: remove unused functions
idl: improve error tracing in the grammar and tighten-up some grammar rules
idl: remove redundant `set_namespace` function
idl: remove unused `declare_class` function
idl: slightly change `str` and `repr` for AST types
idl: place directly executed init code into if __name__=="__main__"
For connection-less environments, we need to add the node_exporter binary
to the scylla-server package instead of downloading it from the internet.
Related #7765
Fixes #2190
Closes #7796
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:
In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.
These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.
This uses the `const char*` overload, which delegates to the
`std::string_view` ctor overload. Passing a `nullptr` pointer to the
`std::string_view` ctor is UB, which leads to segmentation faults in
the aforementioned large data reporting code.
What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
A regression test is provided for the issue (written in
`cql-pytest` framework).
Tests: test/cql-pytest/test_large_cells_rows.py
Fixes: #6780
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.
So Alternator's parser is correct to insist that the index be a literal
integer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
* seastar 2bd8c8d088...1f5e3d3419 (5):
> Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
> doc: add a coroutines section to the tutorial
> Merge "tests/perf: add random-seed config option" from Benny
> iotune: Print parameters affecting the measurement results
> cook: Add patch cmd for ragel build (signed char confusion on aarch64)
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.
This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.
Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"
* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
token_metadata: add clear_gently
token_metadata: shared_token_metadata: add mutate_token_metadata
token_metdata: futurize update_normal_tokens
abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
repair: replace_with_repair: convert to coroutine
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests. In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other stuff interact with collections.
As of now, 23 (!) of these 50 tests fail, and exposed six new issues
in Scylla which I carefully documented:
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.
Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks,
before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.
An important goal of this patch is that all 50 tests (except three
skipped tests because Python has client-side checking), pass when
run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: It was very easy to make mistakes while porting the tests,
and I did make many such mistakes. But running them against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.
Unfortunately, the new tests are significantly slower than what we've
been accustomed to in Alternator/CQL tests. The 50 tests create more than a
hundred tables, udfs, udts, and perform similar slow operations - they do not
reuse anything via fixtures. The total time for these 50 tests (in dev
build mode) is around 18 seconds. Just one test - testMapWithLargePartition -
is responsible for almost half (!) of that time - we should consider in
the future whether it's worth it or can be made smaller.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.
If the functor is successful, the mutated clone
is set back to the shared_token_metadata,
otherwise, the clone is destroyed.
With that, get rid of shared_token_metadata::clone
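The lock, clone, mutate, publish sequence can be sketched with Python's asyncio. This is only an analogy: `SharedTokenMetadata`, `mutate` and the dict used as token metadata are hypothetical, and `copy.deepcopy` merely stands in for clone_async.

```python
import asyncio
import copy

class SharedTokenMetadata:
    def __init__(self, tokens):
        self._tokens = tokens
        self._lock = asyncio.Lock()

    def get(self):
        return self._tokens

    async def mutate(self, func):
        async with self._lock:
            clone = copy.deepcopy(self._tokens)  # stands in for clone_async
            await func(clone)      # an exception here discards the clone
            self._tokens = clone   # publish the mutated copy only on success

async def demo():
    shared = SharedTokenMetadata({"t1": "node1"})

    async def add_token(tm):
        tm["t2"] = "node2"

    async def failing(tm):
        tm["t3"] = "node3"
        raise RuntimeError("mutation failed")

    await shared.mutate(add_token)
    try:
        await shared.mutate(failing)
    except RuntimeError:
        pass  # the shared copy is left untouched
    return shared.get()

result = asyncio.run(demo())
```

Because the functor only ever sees the clone, a failed mutation cannot leave the shared copy half-modified.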
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function's complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.
This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.
Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions. Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.
Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.
Fixes #6731
Closes #7809
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.
This helps when keys become fragmented.
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.
The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).
The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes #6418.
Closes #7437
* github.com:scylladb/scylla:
sstables: gather clustering key filtering statistics in TWCS single key reader
sstables: use time_series_sstable_set in time_window_compaction_strategy
sstable_set: new reader for TWCS single partition queries
mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
sstable_set: introduce min_position_reader_queue
sstable_set: introduce time_series_sstable_set
sstables: add min_position and max_position accessors
sstable_set: make create_single_key_sstable_reader a virtual method
clustering_order_reader_merger: fix the 0 readers case
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes https://github.com/scylladb/scylla/issues/6418.
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.
It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.
The implementation uses `clustering_order_reader_merger` under the hood.
The reader assumes that the schema does not have static columns and none
of the queried sstables contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.
The readers are opened lazily, at the moment of being returned.
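A minimal Python sketch of such a queue, built on a binary heap, is shown below. The class name, the `pop(bound)` interface, integer positions and the `open_reader` factory are all assumptions made for the example; the real code operates on position_in_partition and sstable metadata.

```python
import heapq

class MinPositionReaderQueue:
    def __init__(self, sstables, open_reader):
        # sstables: iterable of (min_position, name) pairs.
        # open_reader: hypothetical factory, called only when a reader is returned.
        self._heap = list(sstables)
        heapq.heapify(self._heap)
        self._open_reader = open_reader

    def pop(self, bound):
        """Open and return readers whose minimum position is <= bound."""
        readers = []
        while self._heap and self._heap[0][0] <= bound:
            _, name = heapq.heappop(self._heap)
            readers.append(self._open_reader(name))
        return readers

opened = []

def open_reader(name):
    opened.append(name)  # record which sstables were actually opened
    return f"reader({name})"

q = MinPositionReaderQueue([(30, "c"), (10, "a"), (20, "b")], open_reader)
first = q.pop(10)
```

After `q.pop(10)` only sstable `a` has been opened; `b` and `c` stay closed until the consumer advances far enough to need them.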
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
... of sstable_set_impl.
Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.
The existing implementation is used as the default one.
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception
This series fixes them both; it also happens to resolve some old TODOs from restriction_test.
Tests: unit (dev, debug)
Closes #7804
* github.com:scylladb/scylla:
cql3: Fix value_for when restriction is impossible
cql3: Fix range computation for p=1 AND p=1
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column. But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0). That violates an assert and crashes
Scylla.
This patch fixes value_for() by gracefully handling the
impossible-restriction case.
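As an illustration only (the real code is C++ and operates on restriction objects), the graceful handling can be pictured like this. The function name mirrors `value_for()`, but the signature and the use of `None` for "no possible value" are assumptions of this sketch.

```python
from typing import Hashable, Iterable, Optional


def value_for(equalities: Iterable[Hashable]) -> Optional[Hashable]:
    """Return the single value allowed by a conjunction of equalities.

    Returns None when the equalities conflict (e.g. c=1 AND c=0) instead of
    asserting that exactly one value exists.
    """
    distinct = set(equalities)
    if len(distinct) == 1:
        return next(iter(distinct))
    return None  # impossible restriction (or no restriction at all)
```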
Fixes #7772
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:
query 'select p from t where p=1 and p=1' failed:
std::bad_variant_access (std::get: wrong index for variant)
This patch removes that assumption and deals correctly with the
multiple-equalities case. As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Split `write`, `read` and `skip` serializer function writers to
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* Introduce `ns_qualified_name` and `template_params_str` functions
to simplify code a little bit in `handle_enum` and `handle_class`
functions.
* Previously each serializer had a separate namespace open-close
statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch all functions that are called from `add_visitors`
and this function itself declared the argument denoting the output
file as `hout`. Though, this was quite misleading since `hout`
is meant to be the header file with declarations, while `cout` is the
implementation file.
These functions write to the implementation file, hence `hout` should
be changed to `cout` to avoid confusion.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").
It was changed to use `ns_qualified_ident` and those places
which can accept numeric constants, are explicitly listing
it as an alternative, e.g. template parameter list.
Unfortunately, I had to make TemplateType to explicitly construct
`BasicType` instances from numeric constants in template arguments
list. This is exactly the way it was handled before, though.
But nonetheless, this should be addressed sometime later.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch replaces use of some handwritten rules to use their
alternatives already defined in `pyparsing.pyparsing_common`
class, i.e.: `number`, `identifier` productions.
Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.
Operator '-' is now used whenever possible to improve debugging
experience: it disables default backtracking for productions
so that compiler fails earlier and can now point more precisely
to a place in the input string where it failed instead of
backtracking to the top-level rule and reporting error there.
Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.
Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.
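To make the accepted and rejected inputs concrete, here is a small sketch using the standard `re` module rather than pyparsing. It is not the grammar from the patch; `IDENT`, `NS_QUALIFIED_IDENT` and `is_ns_qualified_ident` are names made up for this example.

```python
import re

# An identifier: a letter or underscore, then letters, digits, underscores.
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
# Namespace-qualified identifier: identifiers joined by "::".
NS_QUALIFIED_IDENT = re.compile(rf"{IDENT}(?:::{IDENT})*")


def is_ns_qualified_ident(text: str) -> bool:
    """True if the whole string is a (possibly qualified) identifier."""
    return NS_QUALIFIED_IDENT.fullmatch(text) is not None
```

Under these rules `std::vector` is accepted, while `std:vector` and `1ident` are rejected.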
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Surround string representation with angle brackets. This improves
readability when printing debug output.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since idl compiler is not intended to be used as a module to other
python build scripts, move initialization code under an if checking
that current module name is "__main__".
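The pattern in question is the usual Python entry-point guard. A minimal sketch, with a hypothetical `main()` that stands in for the real initialization code:

```python
import sys


def main(argv):
    # Hypothetical entry point; the real tool parses options and
    # generates sources here.
    print(f"compiling {len(argv)} input file(s)")
    return 0


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main(sys.argv[1:])
```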
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().
Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
"
This patch series consists of the following patches:
1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.
Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.
Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.
Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by the means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.
It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.
2 and 3. Cosmetic changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.
This improves readability of the idl-compiler code by a lot.
Only one non-functional whitespace change introduced.
4. This patch adds a very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.
struct my_struct {
std::vector<const raft::log_entry> entries;
};
It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
to match const-ness of fields.
* Defines a second serializer specialization for const type in
`.dist.hh` right next to non-const one.
This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:
const std::vector<raft::log_entry> entries;
const raft::term_t term;
None of the existing IDLs are affected by the change, so that
we can gradually improve on the feature and write the idl
unit-tests to increase test coverage with time.
5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.
6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"
* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
idl: add docstrings for AST classes
idl: add unit-test for `const` specifiers feature
idl: allow to parse `const` specifiers for template arguments
idl: fix a few typos in idl-compiler
idl: switch from `string.Template` to python f-strings and format string in idl-compiler
idl: Decouple idl-compiler data structures from grammar structure
feed_writer() eats exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.
Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.
The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})
Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
of the partition key
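The format can be mimicked in a few lines of Python. This is only a sketch of the output layout, not the tool's code; `format_index_entry` is a hypothetical helper and the sample key bytes are invented.

```python
def format_index_entry(pos: int, human_readable: str, raw_key: bytes) -> str:
    """Render one line as `$pos: $human_readable_value (pk{$raw_hex_value})`."""
    # Doubled braces produce literal "{" and "}" around the hex value.
    return f"{pos}: {human_readable} (pk{{{raw_key.hex()}}})"


print(format_index_entry(42, "1", b"\x00\x00\x00\x01"))
```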
For now the tool requires the types making up the partition key to be
specified on the command line, using the `--type|-t` command line
argument, using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalent (but not just).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
Fixes #7732
When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:
Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate
The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.
(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).
Added a check that, before flushing, tests whether we actually have any
data and, if not, skips the RP relation assert.
Closes #7799
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)
Fixes #7792
Closes #7798
* github.com:scylladb/scylla:
database: add flushes to waiting for pending operations
table: unify waiting for pending operations
database: add a phaser for flush operations
database: add waiting for pending streams on table drop
This patch introduces very limited support for declaring `const`
template parameters in data members.
It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.
Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.
Existing IDL files are not affected in any way.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.
A few places are now using format strings instead, though.
In case when multiline substitution variable is used, the template
string should be first re-indented and only after that the
formatting should be applied, or we can end up with screwed-up
indentation in the generated sources.
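The ordering issue can be shown with `textwrap.dedent` and `str.format`. This is a sketch of the pitfall, not code from the idl-compiler; the template and the `body` value are invented.

```python
import textwrap

# A multi-line value to substitute; its second line starts at column 0.
body = "int a;\nint b;"

template = """
    struct s {{
    {body}
    }};
"""

# Re-indent (dedent) the template first, then substitute.
good = textwrap.dedent(template).format(body=body)

# Substituting first lets the column-0 line of `body` cancel the dedent,
# so the template lines keep their leading spaces.
bad = textwrap.dedent(template.format(body=body))
```

In `good` every line starts at column 0. In `bad` the `struct s {` line is still indented by four spaces, because the substituted text removed the common margin before `dedent` ran.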
This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.
Tests: build(dev) and diff generated IDL sources by hand
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Instead of operating on the raw lists of tokens, transform them into
typed structures representation, which makes the code by many orders of
magnitude simpler to read, understand and extend.
This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.
Tested manually by checking that old generated sources are precisely
the same as the new generated sources.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
working on table T, which is already dropped
5. Segfault/memory corruption
To prevent such races, a phaser for pending flushes is introduced.
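The race and its fix can be modelled with asyncio. This is an analogy under stated assumptions, not Scylla's phaser: `Phaser`, `flush`, `drop` and `demo` are hypothetical names, and `asyncio.sleep(0)` stands in for a preemption point.

```python
import asyncio


class Phaser:
    """Counts in-flight operations and lets a caller wait until all finish."""

    def __init__(self):
        self._pending = 0
        self._idle = asyncio.Event()
        self._idle.set()

    def start(self):
        self._pending += 1
        self._idle.clear()

    def finish(self):
        self._pending -= 1
        if self._pending == 0:
            self._idle.set()

    async def wait_idle(self):
        await self._idle.wait()


async def flush(table, phaser, log):
    phaser.start()
    try:
        await asyncio.sleep(0)  # preempted in the middle of the flush
        assert not table["dropped"], "flush touched a dropped table"
        log.append("flushed")
    finally:
        phaser.finish()


async def drop(table, phaser, log):
    await phaser.wait_idle()  # wait for pending flushes before dropping
    table["dropped"] = True
    log.append("dropped")


async def demo():
    table, phaser, log = {"dropped": False}, Phaser(), []
    flush_task = asyncio.ensure_future(flush(table, phaser, log))
    await asyncio.sleep(0)  # let the flush start and get preempted
    await drop(table, phaser, log)
    await flush_task
    return log
```

Running `asyncio.run(demo())` shows the flush completing before the drop. Without the `wait_idle()` call in `drop`, the assertion inside `flush` would fire, which is the model's equivalent of the segfault.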
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
Download node_exporter in frozen image to prepare adding node_exporter
to relocatable package.
Related #2190
Closes #7765
[avi: updated toolchain, x86_64/aarch64/s390x]
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all run with
matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
The outcome of alternator tracing tests was that tracing probability
was always set to 0 after the test was finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with tracing probability set to some other value. In order to preserve
the previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
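One way to express this save-and-restore pattern is a context manager. This is a sketch, not the test suite's code: `get_probability` and `set_probability` are hypothetical callables standing in for whatever interface the cluster exposes.

```python
from contextlib import contextmanager


@contextmanager
def preserved_trace_probability(get_probability, set_probability, during_test):
    """Set a probability for the test and restore the previous value after."""
    previous = get_probability()      # remember what the cluster had
    set_probability(during_test)
    try:
        yield previous
    finally:
        set_probability(previous)     # restored even if the test fails
```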
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
Normally a file size should be aligned around block size, since
we never write to it any unaligned size. However, we're not
protected against partial writes.
Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
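Aligning up is the usual round-to-next-multiple computation. A minimal sketch, assuming a 4096-byte block size purely for the example:

```python
def align_up(size: int, alignment: int) -> int:
    """Round `size` up to the nearest multiple of `alignment` (alignment > 0)."""
    return (size + alignment - 1) // alignment * alignment


# A file left at 4097 bytes by a partial write is zero-filled up to 8192.
print(align_up(4097, 4096))
```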
Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.
So in this patch, after each update_table() we wait for the table
to return from UPDATING to ACTIVE status.
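A sketch of such a wait loop, assuming `client` is a boto3 DynamoDB client (or anything with a compatible `describe_table`). The helper name, timeout and poll interval are assumptions of this example, not the test suite's actual code.

```python
import time


def wait_for_active(client, table_name, timeout=60.0, poll_interval=1.0):
    """Poll DescribeTable until the table becomes ACTIVE."""
    deadline = time.monotonic() + timeout
    while True:
        response = client.describe_table(TableName=table_name)
        status = response["Table"]["TableStatus"]
        if status == "ACTIVE":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{table_name} still {status} after {timeout}s")
        time.sleep(poll_interval)
```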
The entire Alternator test suite now passes (or is skipped) on AWS,
so: Fixes #7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.
The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.
Refs #7778
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
preparing this header will result in an InvalidSignatureException
response, but DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we return InvalidSignatureException also for
a missing header.
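The intended error selection can be summarized in a short sketch. It is not Alternator's C++ code: `authorization_error` and the `signature_is_valid` callback are hypothetical, and header lookup is simplified to an exact-case dictionary key.

```python
from typing import Callable, Mapping, Optional


def authorization_error(headers: Mapping[str, str],
                        signature_is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the error type to report, or None if the request is accepted."""
    auth = headers.get("Authorization")
    if auth is None:
        # Header completely missing: the more specific error.
        return "MissingAuthenticationTokenException"
    if not signature_is_valid(auth):
        return "InvalidSignatureException"
    return None
```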
The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.
Refs #7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.
Examples:
```cql
SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:
```cql
SELECT * FROM t USING TIMEOUT ?;
```
```cql
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Tests: unit(dev)
Fixes #7777
Closes #7781
* github.com:scylladb/scylla:
test: add prepared statement tests to USING TIMEOUT suite
docs: add an entry about USING TIMEOUT
test: add a test suite for USING TIMEOUT
storage_proxy: start propagating local timeouts as timeouts
cql3: allow USING clause for SELECT statement
cql3: add TIMEOUT attribute to the parser
cql3: add per-query timeout to select statement
cql3: add per-query timeout to batch statement
cql3: add per-query timeout to modification statement
cql3: add timeout to cql attributes
First of all, select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.
Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.
The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can control better how it's used.
For cdc, if the token_metadata is going to be needed
in the future, it'd be better to get it from
db_context::_proxy.get_token_metadata_ptr().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
"
This series fixes use-after-free via token_metadata&
We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges
To fix that, get_token_metadata_ptr and hold on to it
across yielding.
Fixes #7790
Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"
* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
Provide the token_metadata& to get_new_source_ranges by the caller,
who keeps it valid throughout the call.
Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When yielding in clone_only_token_map or clone_after_all_left
the token_metadata got with get_token_metadata() may go away.
Use get_token_metadata_ptr() instead to hold on to it.
And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.
Fixes #7790
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The validate_column_family() helper uses the global proxy
reference to get database from. Fortunately, all the callers
of it can provide one via argument.
tests: unit(dev)
"
* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
validation: Remove get_local_storage_proxy call
client_state: Call validate_column_family() with database arg
client_state: Add database& arg to has_column_family_access
storage_proxy: Add .local_db() getters
validate: Mark database argument const
"
The initial intent was to remove call for global storage service from
secondary index manager's create_view_for_index(), but while fixing it
one of intermediate schema table's helper managed to benefit from it
by re-using the database reference flying by.
The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down the create_view_for_index().
tests: unit(dev)
"
* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
schema-tables: Add database argument to make_update_table_mutations
schema-tables: Factor out calls getting database instance
index-manager: Move feature evaluation one level up
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.
In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.
Fixes #7788
DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
* seastar 8b400c7b45...2de43eb6bf (3):
> core: show span free sizes correctly in diagnostics
> Merge "IO queues to share capacities" from Pavel E
> file: make_file_impl: determine blockdev using st_mode
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.
Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.
Closes #7767
* github.com:scylladb/scylla:
build: reinstate -Wunused-value warning for [[nodiscard]]
test: lib: don't ignore future in compare_readers()
test: mutation_test: check both ranges when comparing summaries
serialializer: silence unused value warning in variant deserializer
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).
Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.
Fixes #7696
Closes #7776
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hand.
The argument will be used by next patch to remove call for global
storage proxy instance from make_update_indices_mutations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The make_update_indices_mutations gets database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.
To suit both and to remove calls for global storage proxy
and service instances get the database once in the
function entrance. Next patch will clean this further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through global storage service
instance.
Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in caller and pass it in argument.
This will also help to get rid of the call for global
storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_next_partition uses global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call for global storage proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reader uses local database instance in its get_next_partition
method to find keyspaces to work with.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is used in validate_column_family. The last caller of it was removed by
previous patch, so we may kill the helper itself
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch brought the database reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A sequel to #7692.
This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.
Refs: #6138
Closes #7770
* github.com:scylladb/scylla:
types: add single-fragment optimization in validate()
utils: fragment_range: add with_simplified()
cql3: statements: select_statement: remove unnecessary use of with_linearized
cql3: maps: remove unnecessary use of with_linearized
cql3: lists: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: sets: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: attributes: remove unnecessary uses of with_linearized
types: validate lists without linearizing
types: validate tuples without linearizing
types: validate sets without linearizing
types: validate maps without linearizing
types: template abstract_type::validate on FragmentedView
types: validate_visitor: transition from FragmentRange to FragmentedView
utils: fragmented_temporary_buffer: add empty() to FragmentedView
utils: fragmented_temporary_buffer: don't add to null pointer
Manipulating fragmented views is costlier than manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.
Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
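The dispatch can be pictured with a Python analogue. The real helper is a C++ template whose exact signature is not shown here; the two-callback shape below, and representing a fragmented view as a list of `bytes`, are assumptions of this sketch.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_simplified(fragments: List[bytes],
                    on_contiguous: Callable[[bytes], T],
                    on_fragmented: Callable[[List[bytes]], T]) -> T:
    """Take the cheap contiguous path when there is at most one fragment."""
    if len(fragments) <= 1:
        return on_contiguous(fragments[0] if fragments else b"")
    return on_fragmented(fragments)
```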
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.
Fixes #7774.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
Currently removenode works like below:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn the node in REMOVING_TOKEN status
- Existing nodes sync data for the range it owns
- Existing nodes send notification to the coordinator
- The coordinator node waits for notification and announces the node in
REMOVED_TOKEN
Current problems:
- Existing nodes do not tell the coordinator if the data sync is ok or failed.
- The coordinator can not abort the removenode operation in case of error
- Failed removenode operation will make the node to be removed in
REMOVING_TOKEN forever.
- The removenode runs in best effort mode which may cause data
consistency issues.
It means if a node that owns the range after the removenode
operation is down during the operation, the removenode operation
will continue to succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example, five nodes in the cluster, RF = 3, for a range, n1, n2,
n3 are the old replicas, n2 is being removed, after the removenode
operation, the new replicas are n1, n5, n3. If n3 is down during the
removenode operation, only n1 will be used to sync data with the new
owner n5. This will break QUORUM read consistency if n1 happens to
miss some writes.
Improvements in this patch:
- This patch makes the removenode safe by default.
We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.
If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
- The coordinator can abort data sync on existing nodes
For example, if one of the nodes fails to sync data. It makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes
Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users had to modify the
config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
Fixes #7359
Closes #7626
Verify that the input types are iterators and their value types are compatible
with the compare function.
Because some of the inputs were not actually valid iterators, they are adjusted
too.
Closes #7631
* github.com:scylladb/scylla:
types: add constraint on lexicographical_tri_compare()
composite: make composite::iterator a real input_iterator
compound: make compount_type::iterator a real input_iterator
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).
We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.
Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(
Fixes #7763.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
Snitch name needs to be exchanged within cluster once, on shadow
round, so joining nodes cannot use wrong snitch. The snitch names
are compared on bootstrap and on normal node start.
If the cluster already uses mixed snitches, the upgrade to this
version will fail. In this case the customer needs to add a node with
the correct snitch for every node with the wrong snitch, then take
down the nodes with the wrong snitch and only then do the upgrade.
Fixes #6832
Closes #7739
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; the server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.
Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, and some
installations might want to change this default.
So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).
In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-committed test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.
Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
"
The multishard_mutation_query test is toooo slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.
tests: unit(dev)
"
* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
test: Make multishard_mutation_query test do less scans
configure: Add -DDEVEL to dev build flags
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The scylla_setup command suggestion does not show the argument of --io-setup,
because we mistakenly store a bool value for it (recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Also, --nic is currently ignored by the command suggestion; we need to print it just like the other options.
Related #7395
Closes #7724
* github.com:scylladb/scylla:
scylla_setup: print --swap-directory and --swap-size on command suggestion
scylla_setup: print --nic on command suggestion
scylla_setup: fix wrong command suggestion on --io-setup
A sequel to #7692.
This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.
The only user of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.
Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.
Refs: #6138
Closes #7771
* github.com:scylladb/scylla:
types: switch serialize_for_cql from bytes to bytes_ostream
types: switch serialize_for_cql_aux from bytes to bytes_ostream
types: serialize user types to bytes_ostream
types: serialize lists to bytes_ostream
types: serialize sets to bytes_ostream
types: serialize maps to bytes_ostream
utils: fragment_range: use range-based for loop instead of boost::for_each
types: add write_collection_value() overload for bytes_ostream and value_view
Increase accepted disk-to-RAM ratio to 105 to accommodate even 7.5GB of
RAM for one NVMe.
Log various reasons for not recommending the instance type.
Fixes #7587
Closes #7600
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.
Fixes #4089
Closes #7766
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
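As an illustration of why bytes keys need separate handling, here is a
minimal Python sketch of how key attributes are typed in DynamoDB JSON
("S" for strings, "N" for numbers, "B" for base-64 encoded bytes). It is not
Alternator's C++ code; encode_key_attr and last_evaluated_key are
hypothetical names used only for this example.

```python
import base64
from decimal import Decimal


def encode_key_attr(value):
    """Encode one key attribute as a typed DynamoDB JSON value."""
    if isinstance(value, (bytes, bytearray)):
        # Bytes cannot be copied into JSON as-is: they are base-64 encoded.
        return {"B": base64.b64encode(bytes(value)).decode("ascii")}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float, Decimal)):
        # Numbers travel as strings in DynamoDB JSON.
        return {"N": str(value)}
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def last_evaluated_key(key):
    """Build a LastEvaluatedKey-like dict from {attribute name: value}."""
    return {name: encode_key_attr(value) for name, value in key.items()}


print(last_evaluated_key({"p": b"\x00\xff", "c": "row1"}))
```

A function that handles only the "S" and "N" cases works for strings and
numbers and breaks exactly on the bytes case, which matches the symptom
described above.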
This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.
Fixes #7768
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
Up until now, the dependency versions of Scylla's debian packages were
unspecified. This was due to a technical difficulty in determining
the version of the depended-upon packages (such as scylla-python3
or scylla-jmx). Now that those packages are also built as part of
this repo, with a version identical to the server package
itself, we can make all of our packages depend on explicit versions.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested versions.
The expected change in behavior is that after this change an attempt
to install a metapackage with version which is not the latest will fail
with an explicit error hinting the user what other packages of the same
version should be explicitly included in the command line.
Fixes #5514
Closes #7727
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.
Fix by reinstating the warning, now that all instances of the warning
have been fixed.
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.
The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.
Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption proved untrue
during manual tests, so it's now fortified with an additional check.
Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:
(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
_data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
_start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}
Closes #7764
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.
This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.
As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.
Fixes #7706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.
But this patch actually tests more cases than those original dtests, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering are correct.
Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:
1. Refs #5545:
"WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
"WHERE c=1" on clustering key c should require ALLOW FILTERING, but
doesn't.
All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because it doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
There is a typo in the schema.cql of a snapshot: a comma is missing after the
compaction strategy class. Restoring the schema from this file will fail.
AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}
map_as_cql_param() function has a `first` parameter to smartly add a
comma, and the compaction_strategy_options are never the first.
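The role of the `first` parameter can be sketched in Python as follows. This
is a hypothetical analogue, not the C++ helper itself: the signature and the
compaction_clause wrapper are assumptions made for the illustration.

```python
def map_as_cql_param(options, first=True):
    """Render options as 'key': 'value' pairs.

    A ", " separator is emitted before every pair except the very first
    one; `first` tells the helper whether something was already printed.
    """
    out = []
    for key, value in options.items():
        if not first:
            out.append(", ")
        out.append(f"'{key}': '{value}'")
        first = False
    return "".join(out)


def compaction_clause(strategy, options, first=False):
    """The class entry is printed before the options, so they are not first."""
    return "{'class': '" + strategy + "'" + map_as_cql_param(options, first) + "}"


opts = {"max_compaction_threshold": "32"}
print(compaction_clause("SizeTieredCompactionStrategy", opts))
print(compaction_clause("SizeTieredCompactionStrategy", opts, first=True))
```

The second call passes first=True and loses the separator, reproducing the
malformed line quoted above.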
Fixes #7741
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7734
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on database decouples it from storage service and kills one
more global storage service reference.
tests: unit(dev)
"
* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
counters: Drop call to get_local_storage_service and related
counters: Use local id arg in transform_counter_update_to_shards
database: Have local id arg in transform_counter_updates_to_shards()
storage_service: Keep local host id to database
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla
1. Run the command ``make preview`` from the ``docs`` directory.
2. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages.
The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.
Closes #7752
* github.com:scylladb/scylla:
docs: fixed warnings
docs: added theme
The previous way of deleting records based on the whole
sstable data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.
Therefore we'd be better off just letting the
records expire using the 30-day TTL.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Only a few places in it need the uuid. And since it's only 16 bytes
it's possible to safely capture it by value in the called lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.
The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on database itself. Next patches will
make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Citing #6138:
> In the past few years we have converted most of our codebase to work in terms
> of fragmented buffers, instead of linearised ones, to help avoid large
> allocations that put large pressure on the memory allocator.
> One prominent component that still works exclusively in terms of linearised
> buffers is the types hierarchy, more specifically the de/serialization code
> to/from CQL format. Note that for most types, this is the same as our internal
> format, notable exceptions are non-frozen collections and user types.
>
> Most types are expected to contain reasonably small values, but texts, blobs
> and especially collections can get very large. Since the entire hierarchy
> shares a common interface we can either transition all or none to work with
> fragmented buffers.
This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.
Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
fragmented buffers (`bytes_view`, `fragmented_temporary_buffer::view`,
`ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
`FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.
The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.
Refs: #6138
Closes #7692
* github.com:scylladb/scylla:
types: collection: add an optimization for single-fragment buffers in deserialize
types: add an optimization for single-fragment buffers in deserialize
cql3: tuples: don't linearize in in_value::from_serialized
cql3: expr: expression: replace with_linearize with linearized
cql3: constants: remove unneeded uses of with_linearized
cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
cql3: lists: remove unneeded use of with_linearized
query-result-set: don't linearize in result_set_builder::deserialize
types: remove unneeded collection deserialization overloads
types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
cql3: sets: don't linearize in value::from_serialized
cql3: lists: don't linearize in value::from_serialized
cql3: maps: don't linearize in value::from_serialized
types: remove unused deserialize_aux
types: deserialize: don't linearize tuple elements
types: deserialize: don't linearize collection elements
types: switch deserialize from bytes_view to FragmentedView
types: deserialize tuple types from FragmentedView
types: deserialize set type from FragmentedView
types: deserialize map type from FragmentedView
types: deserialize list type from FragmentedView
types: add FragmentedView versions of read_collection_size and read_collection_value
types: deserialize varint type from FragmentedView
types: deserialize floating point types from FragmentedView
types: deserialize decimal type from FragmentedView
types: deserialize duration type from FragmentedView
types: deserialize IP address types from FragmentedView
types: deserialize uuid types from FragmentedView
types: deserialize timestamp type from FragmentedView
types: deserialize simple date type from FragmentedView
types: deserialize time type from FragmentedView
types: deserialize boolean type from FragmentedView
types: deserialize integer types from FragmentedView
types: deserialize string types from FragmentedView
types: remove unused read_simple_opt
types: implement read_simple* versions for FragmentedView
utils: fragmented_temporary_buffer: implement FragmentedView for view
utils: fragment_range: add single_fragmented_view
serializer: implement FragmentedView for buffer_view
utils: fragment_range: add linearized and with_linearized for FragmentedView
utils: fragment_range: add FragmentedView
utils: fragmented_temporary_buffer: fix view::remove_prefix
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words),
more state to keep updated (i.e. total view size in addition to current fragment
size) and more branches.
This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
The partition builder doesn't expect the looked-up row to exist. In fact,
a row that already exists is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
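The idea of delaying reader opening can be sketched in Python. This is a
simplified model under stated assumptions, not the actual
`clustering_order_reader_merger`: plain integer keys stand in for mutation
fragment positions, each reader is a (lower_bound, factory) pair, and
equal keys are emitted twice instead of being merged.

```python
import heapq


def merge_in_clustering_order(readers):
    """Merge sorted readers, opening each one as late as possible.

    `readers` is a list of (lower_bound, factory) pairs; factory() returns
    a sorted iterable whose keys are all >= lower_bound.
    """
    pending = sorted(readers, key=lambda r: r[0])  # queue of unopened readers
    nxt = 0    # index of the next unopened reader
    heap = []  # entries are (current key, reader index, iterator)
    while heap or nxt < len(pending):
        # Open every reader that may contain a key not greater than the
        # smallest key currently buffered (or any reader if none is open).
        while nxt < len(pending) and (not heap or pending[nxt][0] <= heap[0][0]):
            it = iter(pending[nxt][1]())
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, nxt, it))
            nxt += 1
        if not heap:
            continue
        key, idx, it = heapq.heappop(heap)
        yield key
        following = next(it, None)
        if following is not None:
            heapq.heappush(heap, (following, idx, it))


def key_range(lo, hi):
    """Hypothetical reader over the inclusive key range [lo, hi]."""
    return (lo, lambda: range(lo, hi + 1))


merged = list(merge_in_clustering_order([key_range(0, 31), key_range(30, 61)]))
print(merged[28:36])
```

The second reader is opened only when the merged stream reaches key 30,
which mirrors the delayed opening of mostly disjoint sstables described
above.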
A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations: 0
single run duration: 1.000s
number of runs: 10
test iterations median mad min max
clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns
clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes #7688
* github.com:scylladb/scylla:
tests: mutation_source_test for clustering_order_reader_merger
perf: microbenchmark for clustering_order_reader_merger
mutation_reader_test: test clustering_order_reader_merger in memory
test: generalize `random_subset` and move to header
mutation_reader: introduce clustering_order_reader_merger
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra have been
moving many CQL correctness tests there, and we can also benefit from their
test cases.
In this patch, we take the first step in a long journey:
1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
translated Cassandra tests will reside. The structure of this directory
will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
the Cassandra repository.
pytest conveniently looks for test files recursively, so when all the
cql-pytest are run, the cassandra_tests files will be run as well.
As usual, one can also run only a subset of all the tests, e.g.,
"test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
cassandra_tests subdirectory (and its subdirectories).
2. I translated into Python two of the smallest test files -
validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
three test functions.
The plan is to translate entire Java test files one by one, and to mirror
their original location in our own repository, so it will be easier
to remember what we already translated and what remains to be done.
3. I created a small library, porting.py, of functions which resemble the
common functions of the Java tests (CQLTester.java). These functions aim
to make porting the tests easier. Despite the resemblance, the ported code
is not 100% identical (of course) and some effort is still required in
this porting. As we continue this porting effort, we'll probably need
more of these functions, and can also continue to improve them to reduce
the porting effort.
Refs #7722.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable. These are accounted for in the sstable writer whenever the respective large data entry is encountered.
It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.
Fixes #7668
Test: unit(dev)
Dtest: wide_rows_test.py (in progress)
Closes #7669
* github.com:scylladb/scylla:
docs: sstable-scylla-format: add large_data_stats subcomponent
large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
large_data_handler: maybe_delete_large_data_entries: move out of line
sstables: load large_data_stats from scylla_metadata
sstables: store large_data_stats in scylla_metadata
sstables: writer: keep track of large data stats
large_data_handler: expose methods to get threshold
sstables: kl/writer: never record too many rows
large_data_handler: indicate recording of large data entries
large_data_handler: move constructor out of line
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code did not recognize this timeout correctly:
it assumed that Scylla had booted correctly, and then all the
subtests failed when they could not connect to Scylla.
This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.
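The intended behaviour can be sketched as follows. This is a minimal sketch,
not the code in test/cql-pytest/run.py: wait_until, cql_port_open and
boot_or_fail are hypothetical names, and the probe is passed in as a callable
so the waiting logic does not depend on a real server.

```python
import socket
import time


def wait_until(probe, timeout, interval=0.1):
    """Poll probe() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout, so the caller cannot
    mistake a timeout for a successful boot.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def cql_port_open(host, port):
    """Probe: can a TCP connection to the CQL port be established?"""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False


def boot_or_fail(probe, timeout):
    """Fail the whole run once, instead of failing every individual test."""
    if not wait_until(probe, timeout):
        raise RuntimeError(f"server did not respond within {timeout} seconds")
```

A caller would use something like
boot_or_fail(lambda: cql_port_open(host, port), 200), where host and port
are whatever the test run is configured with.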
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.
Fixes #7716.
Closes #7723
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.
Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the actual deletion of the large data entries
is done in the background, and we don't capture the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink and it will be released
as soon as maybe_delete_large_data_entries returns.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has no scylla_metadata_type::LargeDataStats, leave
sstable::_large_data_stats disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store the large data statistics in the scylla_metadata component.
These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator). This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.
Fixes #7659.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7665
* seastar 010fb0df1e...8b400c7b45 (6):
> append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
> Merge "Extend per task-queue timing statistics" from Pavel E
> tls_test: Create test certs at build time
> cook: upgrade hwloc version
> memory: rate-limit diagnostics messages
> util/log: add rate-limited version of writer version of log()
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clustering key. This may
result in undefined behavior downstream.
It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:
seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180
To prevent corrupting the state in this way, we should fail
early. This patch adds validation which will fail thrift requests
which attempt to create invalid clustering keys.
Fixes #7568.
Example error:
Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600
Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
This patch adds an option to scylla_setup to configure an rsyslog destination.
The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.
The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.
Fixes #7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #7634
* github.com:scylladb/scylla:
dist/common/scripts/scylla_setup: Optionally config rsyslog destination
Adding dist/common/scripts/scylla_rsyslog_setup utility
This patch adds an option to scylla_setup to configure an rsyslog
destination.
The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.
If the /etc/rsyslog.d/ directory exists (meaning the current system
runs rsyslog), scylla_setup asks whether to add an rsyslog configuration and,
if so, runs scylla_rsyslog_setup.
Fixes #7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The scylla_setup command suggestion does not show the argument of --io-setup,
because we mistakenly store a bool value for it (it is recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Related #7395
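The difference between a flag-style and a value-style option in a printed suggestion can be sketched as follows. This is a minimal illustration and not the scylla_setup code; `format_option` is a hypothetical helper.

```python
def format_option(name, value):
    """Render one option for a suggested command line (hypothetical helper)."""
    flag = "--" + name.replace("_", "-")
    if isinstance(value, bool):
        # Flag-style ('store_true') options carry no argument,
        # so a value stored as bool is printed without one.
        return flag if value else ""
    # Value-style options always print their argument.
    return f"{flag} {value}"
```

With a bool stored for io_setup the argument is lost from the suggestion; with the actual value it is printed as '--io-setup X'.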
* scylla-dev/snapshot_fixes_v1:
raft: ignore append_reply from a peer in SNAPSHOT state
raft: Ignore outdated snapshots
raft: set next_idx to correct value after snapshot transfer
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`;
an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
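The idea of delaying reader creation can be sketched as follows. This is a simplified illustration under stated assumptions, not the C++ implementation: positions are plain integers, each source is a hypothetical `(lower_bound, factory)` pair, and `lazy_merge` is a made-up name.

```python
import heapq

_END = object()  # sentinel marking an exhausted reader

def lazy_merge(sources):
    """Merge sorted sources into one non-decreasing stream.

    `sources` is a list of (lower_bound, factory) pairs; factory() returns a
    sorted iterable whose items are all >= lower_bound. A source is opened
    only when the merged stream is about to reach its lower bound.
    """
    pending = sorted(sources, key=lambda s: s[0])
    heap = []  # entries are (value, reader_id, iterator)
    opened = 0
    while heap or opened < len(pending):
        # Open every pending source that could produce a value not greater
        # than the smallest buffered one (or any source if nothing is buffered).
        while opened < len(pending) and (not heap or pending[opened][0] <= heap[0][0]):
            it = iter(pending[opened][1]())
            first = next(it, _END)
            if first is not _END:
                heapq.heappush(heap, (first, opened, it))
            opened += 1
        if not heap:
            continue
        value, reader_id, it = heapq.heappop(heap)
        yield value
        following = next(it, _END)
        if following is not _END:
            heapq.heappush(heap, (following, reader_id, it))
```

A source whose lower bound lies beyond everything buffered so far stays unopened, which mirrors the motivation given for mostly disjoint sstables.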
Fix #7680 by never using a secondary index for multi-column restrictions.
Modify expr::is_supported_by() to handle multi-column correctly.
Tests: unit (dev)
Closes #7699
* github.com:scylladb/scylla:
cql3/expr: Clarify multi-column doesn't use indexing
cql3: Don't use index for multi-column restrictions
test: Add eventually_require_rows
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.
The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra.
Closes #7719
* github.com:scylladb/scylla:
test/cql-pytest: add tests for validation of inserted strings
test/cql-pytest: add "scylla_only" fixture
test/cpy-pytest: enable experimental features
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.
Fixes #5859
Closes #7604
We have "Conflicts: kernel < 3.10.0-514" in the rpm package to make sure
the environment is running a newer kernel.
However, users may run a non-standard kernel which has a different package name,
like kernel-ml or kernel-uek.
In such environments the Conflicts tag does not work correctly:
even if the system is running a newer kernel, rpm only checks the "kernel" package
version number.
To avoid this issue, we need to drop the Conflicts tag.
Fixes #7675
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a string can be a string literal as
part of the statement, can be encoded as a blob (0x...), or
can be a binding parameter for a prepared statement, or returned
by user-defined functions - and these tests check all of them.
We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirm that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:
1. \xC0\x80 as another non-minimal representation of null. Note that other
non-minimal encodings are rejected by Cassandra, as expected.
2. Characters beyond the official Unicode range (or what Scylla considers
the end of the range).
3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
accepts them, and Scylla does not.
In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).
The ASCII tests reproduce issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.
These tests did not expose any bug in Scylla (other than the differences
with Cassandra mentioned above), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.
Refs #5421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
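The three kinds of byte sequences listed above can be illustrated with a strict UTF-8 decoder. The sketch below uses Python's built-in decoder as a stand-in; it is not Scylla's validator and not the cql-pytest tests themselves, and the specific byte sequences for items 2 and 3 are examples chosen for this illustration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if `data` is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# One example byte sequence for each category from the list above.
REJECTED_EXAMPLES = {
    "non-minimal encoding of null": b"\xc0\x80",
    "code point beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "UTF-16 surrogate (U+D800)": b"\xed\xa0\x80",
}

for label, raw in REJECTED_EXAMPLES.items():
    print(label, is_strict_utf8(raw))
```

A strict decoder rejects all three examples, which is the behavior the commit message attributes to Scylla.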
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream. The
user can drop the parentheses to avoid the rejection.
Fixes #7710
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7712
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.
To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convenience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).
With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.
Fixes #7674
Test: unit(dev)
"
* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
dht/i_partitioner: to_partition_ranges: support yielding
locator: extract can_yield to utils/maybe_yield.hh
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming back.
Otherwise, nodes on which the force_remove_endpoint api cmd was not run can
gossip the removed node's information to other nodes within 2 *
ring_delay (2 * 30 seconds by default).
For instance, in a 3-node cluster where node 3 is decommissioned, to remove
node 3 from gossip membership prior to the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.
$ curl -X POST --header "Accept: application/json" \
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json" \
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.
Fixes #2134
Closes #5436
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.
We recognize Scylla by looking at whether any system table exists with
"scylla" in its name - Scylla has several of those, and Cassandra
has none.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
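The detection rule can be sketched as a small predicate. This is an illustration only: `looks_like_scylla` is a hypothetical name, and the table names in the usage lines are sample inputs rather than a full listing. In a real fixture the names would come from the server's schema tables and the test would be skipped when the predicate is false.

```python
def looks_like_scylla(system_table_names):
    """Return True if any system table name contains 'scylla'."""
    return any("scylla" in name for name in system_table_names)

# Sample inputs (assumed for the example):
print(looks_like_scylla(["local", "peers", "scylla_local"]))
print(looks_like_scylla(["local", "peers"]))
```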
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentedView.
single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
context where we want to phase linearization (and by extension, bytes_view) out.
This patch introduces FragmentedView - a concept intended as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
Disk parsing expects the output of a recursive listing from the GCP
metadata REST call. The call used to list recursively by default,
but now it requires a boolean flag to run in recursive mode.
Fixes #7684
Closes #7685
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.
This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.
Increase to 1200, which is enough for 6 instances * 200 shards.
Fixes #7700.
Closes #7701
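The arithmetic behind the old and new limits can be checked with a short sketch. The per-shard instance counts are taken from the message above (two instances per shard is inferred from "128 is enough for 64 shards" with both rpc and cql encrypted); `max_shards` is a hypothetical helper.

```python
def max_shards(max_user_instances, instances_per_shard):
    """Number of shards that fit under fs.inotify.max_user_instances."""
    return max_user_instances // instances_per_shard

old_limit = max_shards(128, 2)    # default sysctl, rpc + cql encrypted
new_limit = max_shards(1200, 6)   # raised sysctl, 6 instances per shard
print(old_limit, new_limit)
```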
When we introduced dependencies.conf, we mistakenly added it to the rpm as %ghost,
but it should be a normal file, installed normally on package installation.
Fixes #7703
Closes #7704
Fixes #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Closes #7697
* github.com:scylladb/scylla:
redis::service: Shut down sharded<> subobject on startup exception
transport::controller: Shut down distributed object on startup exception
Refs #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Fixes #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
The downstream code expects a single-column restriction when using an
index. We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly. For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.
Fixes #7680
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
After snapshot is transferred progress::next_idx is set to its index,
but the code uses current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
Snapshot index cannot be used to check snapshot correctness since some
entries may not be commands and thus do not affect snapshot value. Let's
use applied entries count instead.
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to use label for different node
operations. We should use the same description for the same metric name.
Fixes #7681
Closes #7682
1. sstables: move `sstable_set` implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
3. sstables: move sstable reader creation functions to `sstable_set`
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
4. sstables: pass `ring_position` to `create_single_key_sstable_reader`
instead of `partition_range`.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
5. sstable_set: refactor `filter_sstable_for_reader_by_pk`
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
Split from #7437.
Closes #7655
* github.com:scylladb/scylla:
sstable_set: refactor filter_sstable_for_reader_by_pk
sstables: pass ring_position to create_single_key_sstable_reader
sstables: move sstable reader creation functions to `sstable_set`
mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
sstables: move sstable_set implementations to a separate module
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
For sstable versions greater than or equal to md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.
Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.
Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
Closes #7678
It is not legal to fast forward a reader before it enters a partition.
One must ensure that there even is a partition in the first place. For
this one must fetch a `partition_start` fragment.
Closes #7679
Fixes, features needed for testing, snapshot testing.
Free election after partitioning (replication test).
* https://github.com/alecco/scylla/tree/raft-ale-tests-05e:
raft: replication test: partitioning with leader
raft: replication test: run free election after partitioning
raft: expose fsm tick() to server for testing
raft: expose is_leader() for testing
raft: replication test: test take and load snapshot
raft: fix a bug in leader election
raft: fix default randomized timeout
raft: replication test: fix custom next leader
raft: replication test: custom next leader noop for same
raft: replication test: fix failure detector for disconnected
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
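The shape of such a filter factory can be sketched as follows. This is an illustration only, not the C++ code: positions are plain integers, `SSTable` is a hypothetical stand-in with first and last positions, and the real filter may consult more than a key range.

```python
from collections import namedtuple

# Hypothetical stand-in for an sstable: a name plus its first/last positions.
SSTable = namedtuple("SSTable", "name first last")

def make_pk_filter(pos):
    """Return a predicate telling whether an sstable may contain `pos`."""
    def may_contain(sst):
        return sst.first <= pos <= sst.last
    return may_contain

sstables = [SSTable("a", 0, 10), SSTable("b", 20, 30), SSTable("c", 5, 25)]
selected = [s.name for s in sstables if make_pk_filter(7)(s)]
print(selected)
```

Building the predicate once per position and applying it to many sstables is the design the message describes.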
instead of partition_range.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
Currently, each internal page fetched during aggregating
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.
Fix by establishing the timeout once when the query starts, and
not advancing it.
Test: manual (SELECT count(*) FROM a large table).
Fixes #1175.
Closes #7662
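The difference between the two timeout policies can be sketched with plain numbers. This is an illustration, not the query pager code; both function names are hypothetical and times are arbitrary units.

```python
def per_page_deadlines(page_starts, timeout):
    """Old behavior: each page gets a deadline relative to its own start."""
    return [start + timeout for start in page_starts]

def per_query_deadlines(page_starts, timeout):
    """Fixed behavior: one deadline, set when the query (first page) starts."""
    query_deadline = page_starts[0] + timeout
    return [query_deadline for _ in page_starts]

starts = [0, 8, 16]
print(per_page_deadlines(starts, 10))
print(per_query_deadlines(starts, 10))
```

Under the old policy the last page above may run until time 26 even though a client with a timeout of 10 gave up long before; under the fixed policy every page shares the deadline 10.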
The C and C++ sub-builds were placed in submodule_pool to
reduce concurrency, as they are memory intensive (well, at least
the C++ jobs are), and we choose build concurrency based on memory.
But the other submodules are not memory intensive, and certainly
the packaging jobs are not (and they are single-threaded too).
To allow these simple jobs to utilize multicores more efficiently,
remove them from submodule_pool so they can run in parallel.
Closes #7671
The unified package is quite large (1GB compressed), and it
is the last step in the build so its build time cannot be
parallelized with other tasks. Compress it with pigz to take
advantage of multiple cores and speed up the build a little.
Closes #7670
We initially implemented run() and out() functions because we couldn't use
subprocess.run() since we were on Python 3.4.
But since we moved to relocatable python3, we don't need to implement them ourselves.
The reason we kept using these functions is that we needed to set an environment variable to set PATH.
Since we recently moved this code to the python thunk, we are finally able to
drop run() and out() and switch to subprocess.run().
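Direct use of subprocess.run() for the two cases that such helpers typically cover (run a command and check it, run a command and capture its output) might look like the sketch below. The commands shown are placeholders chosen so the example is self-contained; they are not commands used by the Scylla scripts.

```python
import subprocess
import sys

# Capture output: stdout is collected and returned as text.
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
text = result.stdout.strip()

# Run and check: check=True raises CalledProcessError on a non-zero exit.
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
    failed_with = None
except subprocess.CalledProcessError as e:
    failed_with = e.returncode
```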
When partitioning without keeping the existing leader, run an election
without forcing a particular leader.
To force a leader after partitioning, a test can just set it with new_leader{X}.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Through configuration trigger automatic snapshotting.
For now, handle expected log index within the test's state machine and
pass it with snapshot_value (within the test file).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a server responds favourably to RequestVote RPC, it should
reset its election timer, otherwise it has very high chances of becoming
a candidate with an even newer term, despite successful elections.
A candidate with a term larger than the leader rejects AppendEntries
RPCs and can not become a leader itself (because of protection
against disruptive leaders), so is stuck in this state.
Range after election timeout should start at +1.
This matches existing update_current_term() code adding dist(1, 2*n).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Adjustments after changes due to free election in partitioning and changes in
the code.
Elapse previous leader after isolating it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
traces to a remote server.
It will override any existing file, so it is safe to run it multiple
times.
It takes an ip, or an ip and port, from the user for that configuration; if
no port is provided, the default port of Scylla-Monitoring promtail is
used.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Follow-up to https://github.com/scylladb/scylla/pull/6916.
- Fixes wrong usage of `resource_manager::prepare_per_device_limits`,
- Improves locking in `resource_manager` so that it is more safe to call its methods concurrently,
- Adds comments around `resource_manager::register_manager` so that it's more clear what this method does and why.
Closes #7660
* github.com:scylladb/scylla:
hints/resource_manager: add comments to register_manager
hints/resource_manager: fix indentation
hints/resource_manager: improve mutual exclusion
hints/resource_manager: correct prepare_per_device_limits usage
The "ninja dist-server-tar" command is a full replacement for the
"build_reloc.sh" script. Our release engineering infrastructure has been
switched to ninja, so let's remove "build_reloc.sh" as obsolete.
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_env to make sure CDC is usable in
all the tests that use cql_test_env. As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.
Tests: unit(dev)
Closes #7657
* github.com:scylladb/scylla:
cdc: Remove trailing whitespaces from cdc_tests
cdc: Remove mk_cdc_test_config from tests
config: Add add_cdc_extension function for testing
cdc: Add missing includes to cdc_extension.hh
The patch which introduces build-dependent testing
has a regression: it quietly filters out all tests
which are not part of ninja output. Since ninja
doesn't build any CQL tests (including CQL-pytest),
all such tests were quietly disabled.
Fix the regression by only doing the filtering
in unit and boost test suites.
test: dev (unit), dev + --build-raft
Message-Id: <20201119224008.185250-1-kostja@scylladb.com>
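The corrected filtering rule can be sketched as follows. This is an illustration and not the test.py code: `Case`, `BUILD_DEPENDENT_SUITES` and `select_tests` are hypothetical names, and the suite and test names in the usage lines are sample inputs.

```python
from collections import namedtuple

Case = namedtuple("Case", "suite name")

# Only these suites are filtered by what the build produced (per the fix above).
BUILD_DEPENDENT_SUITES = {"unit", "boost"}

def select_tests(tests, built):
    """Keep a test unless it belongs to a build-dependent suite and was not built."""
    return [t for t in tests
            if t.suite not in BUILD_DEPENDENT_SUITES or t.name in built]

tests = [Case("boost", "utf8_test"), Case("boost", "raft_test"), Case("cql-pytest", "run")]
print(select_tests(tests, built={"utf8_test"}))
```

Before the fix, the equivalent of the suite check was missing, so every test not present in the build output was dropped regardless of its suite.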
Some systems (at least, Centos 7, aarch64) block the membarrier()
syscall via seccomp. This causes Scylla or unit tests to burn cpu
instead of sleeping when there is nothing to do.
Fix by instructing podman/docker not to block any syscalls. I
tested this with podman, and it appears [1] to be supported on
docker.
[1] https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile
Closes #7661
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
"
The qctx is global object that references query processor and
database to let the rest of the code query system keyspace.
As the first step of de-globalizing it -- remove the database
reference from it. After this set the qctx remains a simple
wrapper over the query processor (which is already de-globalized)
and the query processor in turn is mostly needed only to parse
the query string into a prepared statement. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.
tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"
* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
query-context: Remove database from qctx
schema-tables: Use query processor referece in save_system(_keyspace)?_schema
system-keyspace: Rewrite force_blocking_flush
system-keyspace: Use cluster_name string in check_health
system-keyspace: Use db::config in setup_version
query-context: Kill global helpers
test: Use cql_test_env::evecute_cql instead of qctx version
code: Use qctx::evecute_cql methods, not global ones
system-keyspace: Do not call minimal_setup for the 2nd time
system-keyspace: Fix indentation after previous patch
system-keyspace: Do not do invoke_on_all by hands
system-keyspace: Remove dead code
The save_system_schema and save_system_keyspace_schema are both
called on start and can get the needed query processor reference
from arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor allows removing the
database reference from the global qctx object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The check_health needs global qctx to get db.config.cluster_name,
which is already available at the caller side.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the beginning of de-globalizing global qctx thing.
The setup_version() needs global qctx to get config from.
It's possible to get the config from the caller instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch, but for tests. Since cql_test_env
doesn't have qctx on board, the patch makes one step forward
and calls what is called by qctx::execute_cql.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are global db::execute_cql() helpers that just forward
the args into qctx::execute_cql(). The former are going away,
so patch all callers to use qctx themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The system_keyspace::minimal_setup is already called by main.cc by hand,
some steps before the regular ::setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_truncation_record needs to run cf.cache_truncation_record
on each shard's DB, so the invoke_on_all can be used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit causes start, stop and register_manager methods of the
resource_manager to be serialized with respect to each other using the
_operation_lock.
Those functions modify internal state, so it's best if they are
protected with a semaphore. Additionally, those functions are not going
to be used frequently, therefore it's perfectly fine to protect them in
such a coarse manner.
Now, space_watchdog has a dedicated lock for serializing its on_timer
logic with resource_manager::register_manager. The reason for separate
lock is that resource_manager::stop cannot use the same lock as the
space_watchdog - otherwise a situation could occur in which
space_watchdog waits for semaphore units held by
resource_manager::stop(), and resource_manager::stop() waits until the
space_watchdog stops its asynchronous event loop.
The resource_manager::prepare_per_device_limits function calculates disk
quota for registered hints managers, and creates an association map:
from a storage device id to those hints managers which store hints on
that device (_per_device_limits_map).
This function was used with an assumption that it is idempotent - which
is a wrong assumption. In resource_manager::register_manager, if the
resource_manager is already started, prepare_per_device_limits would be
called, and those hints managers which were previously added to the
_per_device_limits_map would be added again. This would cause the space
used by those managers to be calculated twice, which would artificially
lower the limit which we impose on the space hints are allowed to occupy
on disk.
This patch fixes this problem by changing the prepare_per_device_limits
function to operate on a hints manager passed by argument. Now, we make
sure that this function is called on each hints manager only once.
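The non-idempotence problem and the fix can be sketched with a toy model. This is an illustration only, not the C++ resource_manager: the class and method names are hypothetical, and managers are represented by plain strings.

```python
from collections import defaultdict

class ToyResourceManager:
    """Toy model of a device -> hints managers association map."""

    def __init__(self):
        self.managers = []
        self.per_device = defaultdict(list)

    def register_buggy(self, name, device):
        self.managers.append((name, device))
        # Re-walks *all* managers on every registration, so managers
        # added earlier are inserted into the map again.
        for n, d in self.managers:
            self.per_device[d].append(n)

    def register_fixed(self, name, device):
        self.managers.append((name, device))
        # Operates only on the manager passed by argument.
        self.per_device[device].append(name)

buggy, fixed = ToyResourceManager(), ToyResourceManager()
for rm_register in (buggy.register_buggy, fixed.register_fixed):
    rm_register("a", "dev0")
    rm_register("b", "dev0")
print(buggy.per_device["dev0"], fixed.per_device["dev0"])
```

In the buggy variant manager "a" appears twice for the device, which corresponds to its space being counted twice.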
Now that CDC is GA and enabled by default, there's no longer a need
for a specific config in CDC tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragments batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition)
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
The PR also improves some comments.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes #7656
* github.com:scylladb/scylla:
mutation_reader: `generalize combined_mutation_reader`
mutation_reader: fix description of mutation_fragment_merger
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
The delay is added to the timestamp to make sure that all the nodes
in the cluster manage to learn about it before the timestamp is in the past.
It is safe to not add the delay for the first node because we know it's the only node
in the cluster and no one else has to learn about the timestamp.
Fixes #7645
Tests: unit(dev)
Closes #7654
* github.com:scylladb/scylla:
cdc: Don't add delay to the timestamp of the first generation
cdc: Change for_testing to add_delay in make_new_cdc_generation
"
We've recently seen failures in this unit test as follows:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
```
This series fixes 2 issues in this test:
1. The core issue where std::out_of_range exception
is not handled in calculate_natural_endpoints().
2. A secondary issue where the static `snitch_inst` isn't
stopped when the first exception is hit, failing
the next time the snitch is started, as it wasn't
stopped properly.
Test: network_topology_strategy_test(release)
"
* tag 'nts_test-harden-calculate_natural_endpoints-v1' of github.com:bhalevy/scylla:
test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
test: network_topology_strategy_test: fixup indentation
test: network_topology_strategy_test: always stop_snitch after create_snitch
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
Fixes #7645
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The meaning of the parameter changes from defining whether the function
is called in testing environment to deciding whether a delay should be
added to a timestamp of a newly created CDC generation.
This is a preparation for an improvement in the following patch, which
adds the delay only on non-first nodes rather than on every node.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Asias He reports that git on Windows filesystem is unhappy about the
colon character (":") present in dist-check files:
$ git reset --hard origin/master
error: invalid path 'tools/testing/dist-check/docker.io/centos:7.sh'
fatal: Could not reset index file to revision 'origin/master'.
Rename the script to use a dash instead.
Closes #7648
The current tests use a hash state machine that checks for a specific order of
entry application. That order is not always guaranteed, though.
Backpressure may delay the submission of some entries, and when they are
released together they may be reordered in debug mode due to
SEASTAR_SHUFFLE_TASK_QUEUE. Introduce the ability for a test to choose the
state machine type, and implement a commutative state machine that does
not care about ordering.
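The difference between the two kinds of state machine can be sketched like this. The types below are illustrative assumptions, not the ones used in the raft replication test: the hash variant folds entries in an order-dependent way, while the commutative variant uses addition.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Order-sensitive: each entry is folded into a running hash, so applying
// the same entries in a different order gives a different state.
struct hash_state_machine {
    std::uint64_t state = 0;
    void apply(std::uint64_t entry) { state = state * 31 + entry; }
};

// Order-insensitive: addition is commutative and associative, so any
// permutation of the same entries ends in the same state.
struct commutative_state_machine {
    std::uint64_t state = 0;
    void apply(std::uint64_t entry) { state += entry; }
};

// Applies all entries to a fresh state machine and returns its final state.
template <typename StateMachine>
std::uint64_t apply_all(const std::vector<std::uint64_t>& entries) {
    StateMachine sm;
    for (auto e : entries) {
        sm.apply(e);
    }
    return sm.state;
}
```

A test that tolerates reordering would compare final states of the commutative variant only.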
To prevent the log from taking too much memory, introduce a mechanism that
limits the log to a certain size. If the size is reached, no new log
entries can be submitted until previous entries are committed and
snapshotted.
If the scylla_raid_setup script is called without the --raiddev argument,
try to use any of the /dev/md[0-9] devices instead of only
/dev/md0. This is needed because on Ubuntu 20.04
/dev/md0 is already used by the OS.
Closes #7628
gcc fails to compile current master like this:
In file included from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/service.hh:188:21: error: declaration of ‘const auth::resource& auth::command_desc::resource’ changes meaning of ‘resource’ [-fpermissive]
188 | const resource& resource; ///< Resource impacted by this command.
| ^~~~~~~~
In file included from ./auth/authenticator.hh:57,
from ./auth/service.hh:33,
from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/resource.hh:98:7: note: ‘resource’ declared here as ‘class auth::resource’
98 | class resource final {
| ^~~~~~~~
clang doesn't fail
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201118155905.14447-1-xemul@scylladb.com>
If a list of target endpoints for sending view updates contains
duplicates, it results in benign (but annoying) broken promise
errors happening due to duplicated write response handlers being
instantiated for a single endpoint.
In order to avoid such errors, target remote endpoints are deduplicated
from the list of pending endpoints.
A similar issue (#5459) solved the case for duplicated local endpoints,
but that didn't solve the general case.
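A minimal sketch of such deduplication, assuming endpoints are plain strings and that `deduplicate_pending` is a hypothetical helper (the real code works on Scylla's own address types):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

using endpoint = std::string;  // hypothetical stand-in for an address type

// Drops from `pending` every endpoint that is already a target, as well as
// repeated entries within `pending`, so that each endpoint ends up with at
// most one write response handler.
inline void deduplicate_pending(const std::vector<endpoint>& targets,
                                std::vector<endpoint>& pending) {
    const std::size_t original_size = pending.size();
    std::unordered_set<endpoint> seen(targets.begin(), targets.end());
    std::vector<endpoint> kept;
    for (auto& ep : pending) {
        // insert() reports false in .second when the endpoint was seen before.
        if (seen.insert(ep).second) {
            kept.push_back(std::move(ep));
        }
    }
    pending = std::move(kept);
    assert(pending.size() <= original_size);
}
```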
Fixes #7572
Closes #7641
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.
This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.
Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR
Closes #6916
* github.com:scylladb/scylla:
api: allow changing hinted handoff configuration
storage_proxy: fix wrong return type in swagger
hints_manager: implement change_host_filter
storage_proxy: always create hints manager
config: plug in hints::host_filter object into configuration
db/hints: introduce host_filter
hints/resource_manager: allow registering managers after start
hints: introduce db::hints::directory_initializer
directories.cc: prepare for use outside main.cc
Fixes #7064
Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.
Closes #7633
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example).
Prevent this early by asserting that available_memory is nonzero.
Closes #7612
std::iterator is deprecated since C++17 so define all the required iterator_traits directly and stop using std::iterator at all.
More context: https://www.fluentcpp.com/2018/05/08/std-iterator-deprecated
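As an illustration of the pattern, here is a small iterator that declares the five member types itself instead of inheriting from std::iterator. `counting_iterator` and `sum_range` are hypothetical names made up for this sketch and are not part of the series.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <type_traits>

// Hypothetical iterator that counts upward from a starting value.
// The five aliases below are what std::iterator_traits looks for; they
// replace the deprecated std::iterator base class.
class counting_iterator {
    int _value;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;

    explicit counting_iterator(int value = 0) : _value(value) {}

    reference operator*() const { return _value; }
    pointer operator->() const { return &_value; }
    counting_iterator& operator++() { ++_value; return *this; }
    counting_iterator operator++(int) { auto tmp = *this; ++_value; return tmp; }
    bool operator==(const counting_iterator& o) const { return _value == o._value; }
    bool operator!=(const counting_iterator& o) const { return !(*this == o); }
};

// Sums the half-open range [first, last) with a standard algorithm.
inline int sum_range(int first, int last) {
    assert(first <= last);
    return std::accumulate(counting_iterator(first), counting_iterator(last), 0);
}
```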
Tests: unit(dev)
Closes #7635
* github.com:scylladb/scylla:
log_heap: Remove std::iterator from hist_iterator
types: Remove std::iterator from tuple_deserializing_iterator
types: Remove std::iterator from listlike_partial_deserializing_iterator
sstables: remove std::iterator from const_iterator
token_metadata: Remove std::iterator from tokens_iterator
size_estimates_virtual_reader: Remove std::iterator
token_metadata: Remove std::iterator from tokens_iterator_impl
counters: Remove std::iterator from iterators
compound_compat: Remove std::iterator from iterators
compound: Remove std::iterator from iterator
clustering_interval_set: Remove std::iterator from position_range_iterator
cdc: Remove std::iterator from collection_iterator
cartesian_product: Remove std::iterator from iterator
bytes_ostream: Remove std::iterator from fragment_iterator
We saw this intermittent failure in testCalculateEndpoints:
```
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
```
It turns out that there are no endpoints associated with the dc passed
to has_sufficient_replicas in the `all_endpoints` map.
Handle this case by returning true.
The dc is still required to appear in `dc_replicas`,
so if it's not found there, fail the test gracefully.
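The lookup logic described above could look roughly like this. It is a simplified sketch: the map types and the `replication_factor` parameter are assumptions, and an exception stands in for failing the test gracefully.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>

using endpoint_set = std::unordered_set<std::string>;
using dc_map = std::unordered_map<std::string, endpoint_set>;

// Uses find() instead of at(), so a missing dc never throws std::out_of_range.
inline bool has_sufficient_replicas(const std::string& dc, const dc_map& dc_replicas,
                                    const dc_map& all_endpoints,
                                    std::size_t replication_factor) {
    auto replicas_it = dc_replicas.find(dc);
    if (replicas_it == dc_replicas.end()) {
        // The dc must be known here; report the problem explicitly.
        throw std::invalid_argument("dc not found in dc_replicas: " + dc);
    }
    auto endpoints_it = all_endpoints.find(dc);
    if (endpoints_it == all_endpoints.end()) {
        // No endpoints in this dc: nothing more can be replicated there.
        return true;
    }
    return replicas_it->second.size()
        >= std::min(endpoints_it->second.size(), replication_factor);
}
```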
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently stop_snitch is not called if the test fails on exception.
This causes a failure in create_snitch where snitch_inst fails to start
since it wasn't stopped earlier.
For example:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
Backtrace:
0x0000000002825e94
0x000000000282ffa9
0x00007fd065f971df
/lib64/libc.so.6+0x000000000003dbc4
/lib64/libc.so.6+0x00000000000268a3
/lib64/libc.so.6+0x0000000000026788
/lib64/libc.so.6+0x0000000000035fc5
0x0000000000b484cf
0x0000000002a7c69f
0x0000000002a7c62f
0x0000000000b47b9e
0x0000000002595da2
0x0000000002595913
0x0000000002a83a31
```
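One common way to guarantee the stop call is a scope guard. The sketch below is an assumption about the general technique, not the code of this patch; `deferred_action`, `fake_snitch` and `run_test_body` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Minimal scope guard: runs the stored action when it goes out of scope,
// whether the scope is left normally or through an exception.
class deferred_action {
    std::function<void()> _action;
public:
    explicit deferred_action(std::function<void()> action) : _action(std::move(action)) {}
    deferred_action(const deferred_action&) = delete;
    deferred_action& operator=(const deferred_action&) = delete;
    ~deferred_action() {
        if (_action) {
            _action();
        }
    }
};

// Hypothetical stand-in for the static snitch instance used by the test.
struct fake_snitch {
    bool started = false;
    void start() {
        if (started) {
            throw std::logic_error("snitch was not stopped properly");
        }
        started = true;
    }
    void stop() { started = false; }
};

// Starts the snitch, registers the stop, then optionally fails like the test did.
inline void run_test_body(fake_snitch& snitch, bool fail) {
    snitch.start();
    deferred_action stop_snitch([&snitch] { snitch.stop(); });
    if (fail) {
        throw std::out_of_range("_Map_base::at");
    }
}
```

With the guard in place, a failing test case leaves the snitch stopped, so the next test case can start it again.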
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.
This patch replaces all occurrences of disable_failure_guard with
scoped_critical_alloc_section.
Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
As requested in #7057, allow certain alterations of system_auth tables. Potentially destructive alterations are still rejected.
Tests: unit (dev)
Closes #7606
* github.com:scylladb/scylla:
auth: Permit ALTER options on system_auth tables
auth: Add command_desc
auth: Add tests for resource protections
And since now there is no danger of them filling the logs, the log-level
is promoted to info, so users can see the diagnostics messages by
default.
The rate-limit chosen is 1/30s.
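For illustration, a rate limit of one message per 30 seconds can be sketched as below. This is a standalone example with a hypothetical `log_rate_limiter` class; it does not reproduce the logger facility the patch relies on.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Allows at most one message per interval. Time is passed in explicitly
// so the behavior can be checked without sleeping.
class log_rate_limiter {
public:
    using clock = std::chrono::steady_clock;
private:
    clock::duration _interval;
    std::optional<clock::time_point> _last;  // time of the last allowed message
public:
    explicit log_rate_limiter(clock::duration interval) : _interval(interval) {
        assert(interval > clock::duration::zero());
    }

    // Returns true if a message may be logged at time `now`.
    bool check(clock::time_point now) {
        if (_last && now - *_last < _interval) {
            return false;  // still inside the quiet period: drop the message
        }
        _last = now;
        return true;
    }
};
```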
Refs: #7398
Tests: manual
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201117091253.238739-1-bdenes@scylladb.com>
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. Such return type is
expected by the JMX.
Implements a function which is responsible for changing hints manager
configuration while it is running.
It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.
The function is blocking and waits until all relevant ep managers are
started or stopped.
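The start-then-stop ordering can be sketched as follows. Everything here is a simplified assumption: `hints_manager_sketch` tracks running endpoint managers as a set of names, and the filter is a plain predicate.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <set>
#include <string>
#include <utility>

using endpoint = std::string;  // hypothetical stand-in for an endpoint address
using host_filter = std::function<bool(const endpoint&)>;

class hints_manager_sketch {
    std::set<endpoint> _known;    // endpoints that may need an endpoint manager
    std::set<endpoint> _running;  // endpoints whose manager is currently running
    host_filter _filter;
public:
    hints_manager_sketch(std::set<endpoint> known, host_filter filter)
        : _known(std::move(known)), _filter(std::move(filter)) {
        assert(_filter != nullptr);
        for (const auto& ep : _known) {
            if (_filter(ep)) {
                _running.insert(ep);
            }
        }
    }

    void change_host_filter(host_filter new_filter) {
        assert(new_filter != nullptr);
        // First start managers that the old filter rejected but the new one allows.
        for (const auto& ep : _known) {
            if (!_filter(ep) && new_filter(ep)) {
                _running.insert(ep);
            }
        }
        // Then stop managers that the new filter rejects.
        for (auto it = _running.begin(); it != _running.end();) {
            it = new_filter(*it) ? std::next(it) : _running.erase(it);
        }
        _filter = std::move(new_filter);
    }

    const std::set<endpoint>& running() const { return _running; }
};
```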
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization to and deserialization from the format of the
hinted_handoff_enabled configuration/CLI option.
This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
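A rough sketch of what such a filter could look like, assuming the option value is either "true", "false", or a comma-separated list of DC names. `host_filter_sketch` and its methods are hypothetical and much simpler than the real db::hints::host_filter.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <utility>

class host_filter_sketch {
    bool _enabled_for_all;
    std::set<std::string> _dcs;  // only meaningful when _enabled_for_all is false

    host_filter_sketch(bool enabled_for_all, std::set<std::string> dcs)
        : _enabled_for_all(enabled_for_all), _dcs(std::move(dcs)) {}
public:
    // Parses the assumed option format; whitespace is not trimmed in this sketch.
    static host_filter_sketch parse(const std::string& opt) {
        if (opt == "true") {
            return host_filter_sketch(true, {});
        }
        if (opt == "false") {
            return host_filter_sketch(false, {});
        }
        std::set<std::string> dcs;
        std::istringstream in(opt);
        std::string dc;
        while (std::getline(in, dc, ',')) {
            if (!dc.empty()) {
                dcs.insert(dc);
            }
        }
        return host_filter_sketch(false, std::move(dcs));
    }

    // Decides whether hints may be generated towards a node in `dc`.
    bool can_hint_for(const std::string& dc) const {
        return _enabled_for_all || _dcs.count(dc) > 0;
    }

    // Serializes back to the option format, with DC names in sorted order.
    std::string to_string() const {
        if (_enabled_for_all) {
            return "true";
        }
        if (_dcs.empty()) {
            return "false";
        }
        std::string out;
        for (const auto& dc : _dcs) {
            if (!out.empty()) {
                out += ',';
            }
            out += dc;
        }
        return out;
    }
};
```

Compared with a std::optional of a set of strings, a dedicated type names the three cases explicitly at the call site.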
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.
This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
Currently, the `directories` class is used exclusively during
initialization, in the main() function. This commit refactors this class
so that it is possible to use it to initialize directories much later
after startup.
The intent of this change is to make it possible for hints manager to
create directories for hints lazily. Currently, when Scylla is booted
with hinted handoff disabled, the `hints_directory` config parameter is
ignored and directories for hints are neither created nor verified.
Because we would like to preserve this behavior and introduce
possibility to switch hinted handoff on in runtime, the hints
directories will have to be created lazily the first time hinted handoff
is enabled.
* seastar 043ecec7...c861dbfb (3):
> Merge "memory: allow configuring when to dump memory diagnostics on allocation failures" from Botond
> perftune.py: support kvm-clock on tune-clock
> execution_stage: inheriting_concrete_execution_stage: add get_stats()
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes #7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.
The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).
This commit enforces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.
Fixes https://github.com/scylladb/scylla/issues/7610.
Closes #7619
An entry can be snapshotted before the outgoing message is sent, so the
message has to hold on to it to avoid a use-after-free.
Message-Id: <20201116113323.GA1024423@scylladb.com>
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.
Possibly refs #7572
Closes #7624
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_phaser and _streaming_flush_gate are
no longer used to synchronize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.
Closes #7454
The DEBIAN_FRONTEND environment variable was added only to prevent a dialog
from opening when running 'apt-get install mdadm'; no other program depends on it.
So we can move it inside apt_install()/apt_uninstall() and drop scylla_env,
since we don't have any other environment variables.
To pass the variable, add an env argument to run()/out().
"
This is a follow-up on 052a8d036d
"Avoid stalls in token_metadata and replication strategy"
The added mutate_token_metadata helper combines:
- with_token_metadata_lock
- get_mutable_token_metadata_ptr
- replicate_to_all_cores
Test: unit(dev)
"
* tag 'mutate_token_metadata-v1' of github.com:bhalevy/scylla:
storage_service: fixup indentation
storage_service: mutate_token_metadata: do replicate_to_all_cores
storage_service: add mutate_token_metadata helper
Replicate the mutated token_metadata to all cores on success.
This moves replication out of update_pending_ranges(mutable_token_metadata_ptr, sstring),
so add explicit call to replicate_to_all_cores where it is called outside
of mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace a repeating pattern of:
with_token_metadata_lock([] {
return get_mutable_token_metadata_ptr([] (mutable_token_metadata_ptr tmptr) {
// mutate token_metadata via tmptr
});
});
With a call to mutate_token_metadata that does both
and calls the function with the mutable_token_metadata_ptr.
A following patch will also move the replication to all
cores to mutate_token_metadata.
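The shape of such a helper can be sketched with the standard library alone. This is an assumption-laden illustration: a std::mutex plays the role of the token_metadata lock, a std::shared_ptr swap stands in for replication to all cores, and `storage_service_sketch` is a hypothetical class.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <set>
#include <utility>

// Hypothetical stand-in for token_metadata: just a set of tokens.
struct token_metadata {
    std::set<int> tokens;
};
using token_metadata_ptr = std::shared_ptr<const token_metadata>;
using mutable_token_metadata_ptr = std::shared_ptr<token_metadata>;

class storage_service_sketch {
    std::mutex _lock;  // serializes writers, like the token_metadata lock
    token_metadata_ptr _current = std::make_shared<const token_metadata>();
public:
    // Readers get an immutable snapshot that stays valid across updates.
    token_metadata_ptr get_token_metadata_ptr() {
        std::lock_guard<std::mutex> guard(_lock);
        return _current;
    }

    // Takes the lock, clones the current metadata, lets func mutate the
    // clone, and publishes the clone only if func returns without throwing.
    // func must not call back into this object while the lock is held.
    void mutate_token_metadata(const std::function<void(mutable_token_metadata_ptr)>& func) {
        assert(func != nullptr);
        std::lock_guard<std::mutex> guard(_lock);
        auto tmptr = std::make_shared<token_metadata>(*_current);
        func(tmptr);
        _current = std::move(tmptr);
    }
};
```

Callers then pass only the mutation itself, for example a lambda that inserts a token through tmptr.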
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The "dist" target fails as follows:
$ ./tools/toolchain/dbuild ninja dist
ninja: error: 'build/dev/scylla-unified-package-..tar.gz', needed by 'dist-unified-tar', missing and no known rule to make it
Fix two issues:
- Fix Python variable references to "scylla_version" and
"scylla_release", broken by commit bec0c15ee9 ("configure.py: Add
version to unified tarball filename"). The breakage went unnoticed
because ninja default target does not call into dist...
- Remove dependencies to build/<mode>/scylla-unified-package.tar.gz. The
file is now in build/<mode>/dist/tar/ directory and contains version
and release in the filename.
Message-Id: <20201113110706.150533-1-penberg@scylladb.com>
To test handling of connectivity issues and recovery add support for
disconnecting servers.
This is not full partitioning yet as it doesn't allow connectivity
across the disconnected servers (having multiple active partitions).
* https://github.com/alecco/scylla/pull/new/raft-ale-partition-simple-v3:
raft: replication test: connectivity partitioning support
raft: replication test: block rpc calls to disconnected servers
raft: replication test: add is_disconnected helper
raft: replication test: rename global variable
raft: replication test: relocate global connection state map
We currently keep a copy of scylla-package.tar.gz in "build/<mode>" for
compatibility. However, we've long since switched our CI system over to
the new location, so let's remove the duplicate and use the one from
"build/<mode>/dist/tar" instead.
Message-Id: <20201113075146.67265-1-penberg@scylladb.com>
Add to the DynamoDB compatibility document, docs/alternator/compatibility.md,
a mention that Alternator streams are still an experimental feature, and
how to turn it on (at this point CDC is no longer an experimental feature,
but Alternator Streams are).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112184436.940497-1-nyh@scylladb.com>
Drop the adjective "experimental" used to describe Alternator in
docs/alternator/getting-started.md.
In Scylla, the word "experimental" carries a specific meaning (no support
for upgrades, not enough QA, not ready for general use) and Alternator is
no longer experimental in that sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112185249.941484-1-nyh@scylladb.com>
This adds some test cases for ALTER KEYSPACE:
- ALTER KEYSPACE happy path
- ALTER KEYSPACE with invalid options
- ALTER KEYSPACE for non-existing keyspace
- CREATE and ALTER KEYSPACE using NetworkTopologyStrategy with
non-existing data center in configuration, which triggers a bug in
Scylla:
https://github.com/scylladb/scylla/issues/7595
Message-Id: <20201112073110.39475-1-penberg@scylladb.com>
Introduce partition update command consisting of nodes still seeing
each other. Nodes not included are disconnected from everything else.
If the previous leader is not part of the new partition, the first node
specified in the partition will become leader.
For other nodes to accept a new leader it has to have a committed log.
For example, if the desired leader is being re-connected and it missed
entries other nodes saw it will not win the election. Example A B C:
partition{A,C},entries{2},partition{B,C}
In this case node C won't accept B as a new leader as it's missing 2
entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In main.cc, we spawn a future which starts the hints manager, but we
don't wait for it to complete. This can have the following consequences:
- The hints manager does some asynchronous operations during startup,
so it can take some time to start. If it is started after we start
handling requests, and we admit some requests which would result in
hints being generated, those hints will be dropped instead because we
check if hints manager is started before writing them.
- Initialization of hints manager may fail, and Scylla won't be stopped
because of it (e.g. we don't have permissions to create hints
directories). The consequence of this is that hints manager won't be
started, and hints will be dropped instead of being written. This may
affect both regular hints manager, and the view hints manager.
This commit makes us wait until the hints manager starts and check whether there
were any errors during initialization.
Fixes #7598
Closes #7599
CDC is ready to be a non-experimental feature so remove the experimental flag for it.
Also, guard Alternator Streams with their own experimental flag. Previously, they were using CDC experimental flag as they depend on CDC.
Tests: unit(dev)
Closes #7539
* github.com:scylladb/scylla:
alternator: guard streams with an experimental flag
Mark CDC as GA
cdc: Make it possible for CDC generation creation to fail
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The following patch enables CDC by default, which means CDC has to work
with all clusters now.
There is a problematic case when an existing cluster with no CDC support
is stopped and all the binaries are updated to a newer version with
CDC enabled by default. In such a case, nodes know that they are already
members of the cluster but they can't find any CDC generation so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.
Before this patch such situation would lead to node failing to start.
After the change, the node will start but CDC generation will be
missing. This will mean CDC won't be able to work on such cluster before
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.
We still fail to bootstrap if the creation of CDC generation fails.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.
Closes #7597
This miniseries adds metrics which can help the users detect potential overloads:
* due to having too many in-flight hints
* due to exceeding the capacity of the read admission queue, on replica side
Closes #7584
* github.com:scylladb/scylla:
reader_concurrency_semaphore: add metrics for shed reads
storage_proxy: add metrics for too many in-flight hints failures
If interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9
To fix this problem, let's make sure that partition filtering will only
happen on the producer side.
Fixes #7590.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
"
This series is a rebased version of 3 patchsets that were sent
separately before:
1. [PATCH v4 00/17] Cleanup storage_service::update_pending_ranges et al.
This patchset cleans up service/storage_service use of
update_pending_ranges and replicate_to_all_cores.
It also moves some functionality from gossiping_property_file_snitch::reload_configuration
into a new method - storage_service::update_topology.
This prepares storage_service for using a shared ptr to token_metadata,
updating a copy out of line under a semaphore that serializes writers,
and eventually replicating the updated copy to all shards and releasing
the lock. This is a follow up to #7044.
2. [PATCH v8 00/20] token_metadata versioned shared ptr
Rather than keeping references on token_metadata use a shared_token_metadata
containing a lw_shared_ptr<token_metadata> (a.k.a token_metadata_ptr)
to keep track of the token_metadata.
Get token_metadata_ptr for a read-only snapshot of the token_metadata
or clone one for a mutable snapshot that is later used to safely update
the base versioned_shared_object.
token_metadata_ptr is used to modify token_metadata out of line, possibly with
multiple calls, that could be preempted in-between so that readers can keep a consistent
snapshot of it while writers prepare an updated version.
Introduce a token_metadata_lock used to serialize mutators of token_metadata_ptr.
It's taken by the storage_service before cloning token_metadata_ptr and held
until the updated copy is replicated on all shards.
In addition, this series introduces the token_metadata::clone_async() method
to copy the token_metadata class using an asynchronous function with
continuations to avoid reactor stalls as seen in #7220.
Fixes #7044
3. [PATCH v3 00/17] Avoid stalls in token_metadata and replication strategy
This series uses the shared_token_metadata infrastructure.
The first patches in the series deal with cloning token_metadata
using continuations to allow preemption while cloning (See #7220).
Then, the rest of the series makes sure to always run
`update_pending_ranges` and `calculate_pending_ranges_for_*` in a thread,
it then adds a `can_yield` parameter to the token_metadata and abstract_replication_strategy
`get_pending_ranges` and friends, and finally it adds `maybe_yield` calls
in potentially long loops.
Fixes #7313
Fixes #7220
Test: unit (dev)
Dtest: gating(dev)
"
* tag 'replication_strategy_can_yield-v4' of github.com:bhalevy/scylla: (54 commits)
token_metadata_impl: set_pending_ranges: add can_yield_param
abstract_replication_strategy: get rid of get_ranges_in_thread
repair: call get_ranges_in_thread where possible
abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
abstract_replication_strategy: define can_yield bool_class
token_metadata_impl: calculate_pending_ranges_for_* reindent
token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
token_metadata_impl: calculate_pending_ranges_for_* call in thread
token_metadata: update_pending_ranges: create seastar thread
abstract_replication_strategy: add get_address_ranges method for specific endpoint
token_metadata_impl: clone_after_all_left: sort tokens only once
token_metadata: futurize clone_after_all_left
token_metadata: futurize clone_only_token_map
token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
repair: replace_with_repair: use token_metadata::clone_async
storage_service: reindent token_metadata blocks
token_metadata: add clone_async
abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
boot_strapper: get_*_tokens: use token_metadata_ptr
...
Add a test that better clarifies what StartingSequenceNumber returned by
DescribeStream really guarantees (this question was raised in a review
of a different patch). The main thing we can guarantee is that reading a
shard from that position returns all the information in that shard -
similar to TRIM_HORIZON. This test verifies this, and it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112081250.862119-1-nyh@scylladb.com>
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
This change adds metrics for counting request message types
listed in the CQL v.4 spec under section 4.1
(https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec).
To organize things properly, we introduce a new cql_server::transport_stats
object type for aggregating the message and server statistics.
Fixes #4888
Closes #7574
Rewrite in a more readable way that will later allow us to split the WHERE expression in two: a storage-reading part and a post-read filtering part.
Tests: unit (dev,debug)
Closes #7591
* github.com:scylladb/scylla:
cql3: Rewrite need_filtering() from scratch
cql3: Store index info in statement_restrictions
The name of the utility function test_object_name() is confusing - by
starting with the word "test", pytest can think (if it's imported to the
top-level namespace) that it is a test... So this patch gives it a better
name - unique_name().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201111140638.809189-1-nyh@scylladb.com>
This patch adds a new document, docs/alternator/compatibility.md,
which focuses on what users switching from DynamoDB to Alternator
need to know about where Alternator differs from DynamoDB and which
features are missing.
The compatibility information in the old alternator.md is not deleted
yet. It probably should.
Fixes #7556
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110180242.716295-1-nyh@scylladb.com>
* seastar a62a80ba1d...043ecec732 (8):
> semaphore: make_expiry_handler: explicitly use this lambda capture
> configure: add --{enable,disable}-debug-shared-ptr option
> cmake: add SEASTAR_DEBUG_SHARED_PTR also in dev mode
> tls_test: Update the certificates to use sha256
> logger: allow applying a rate-limit to log messages
> Merge "Handle CPUs not attached to any NUMA nodes" from Pavel E
> memory: fix malloc_usable_size() during early initialization
> Merge "make semaphore related functions noexcept" from Benny
Makes it easier to understand, in preparation for separating the WHERE
expression into filtering and storage-reading parts.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To rewrite need_filtering() in a more readable way, we need to store
info on found indexes in statement_restrictions data members.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The functions can be simplified as they are all now being called
from a seastar thread.
Make them sequential, returning void, and yielding if necessary.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.
Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.
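The difference can be sketched as follows. This is a Python analogue under simplifying assumptions: endpoints_for is a hypothetical callback standing in for calculating the natural endpoints of a range, and plain lists stand in for the multimap.

```python
def address_ranges_all(token_ranges, endpoints_for):
    """Old approach: build the map for every endpoint, then look one up."""
    result = {}
    for rng in token_ranges:
        for endpoint in endpoints_for(rng):
            result.setdefault(endpoint, []).append(rng)
    return result


def address_ranges_for_endpoint(token_ranges, endpoints_for, endpoint):
    """New approach: collect only the ranges that belong to one endpoint."""
    return [rng for rng in token_ranges if endpoint in endpoints_for(rng)]
```

Both return the same ranges for the endpoint of interest, but the second never allocates entries for the other endpoints.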
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the sorted tokens are copied needlessly on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Does part of clone_async() using continuations to prevent stalls.
Rename synchronous variant to clone_only_token_map_sync
that is going to be deprecated once all its users will be futurized.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clone the input token_metadata asynchronously using
clone_async() before modifying it using update_normal_tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many code blocks using with_token_metadata_lock
and get_mutable_token_metadata_ptr now need re-indenting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Only replace_with_repair needs to clone the token_metadata
and update the local copy, so we can safely pass a read-only
snapshot of the token_metadata rather than copying it in all cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Perform replication in 2 phases.
First phase just clones the mutable_token_metadata_ptr on all shards.
Second phase applies the cloned copies onto each local_ss._shared_token_metadata.
That phase should never fail.
To add suspenders over the belt, in the impossible case we do get an
exception, it is logged and we abort.
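The two phases can be sketched in Python as below. Shard, replicate_to_all_shards and the clone parameter are hypothetical stand-ins; the real code runs the phases across seastar shards.

```python
import copy


class Shard:
    """Stand-in for one shard's local view of the token metadata."""

    def __init__(self, token_metadata):
        self.token_metadata = token_metadata


def replicate_to_all_shards(shards, updated, clone=copy.deepcopy):
    # Phase 1: clone for every shard. This may fail (e.g. on allocation),
    # but nothing has been made visible yet, so all shards stay consistent.
    pending = [clone(updated) for _ in shards]
    # Phase 2: apply the clones. These are plain reference swaps that are
    # not expected to fail.
    for shard, cloned in zip(shards, pending):
        shard.token_metadata = cloned
```

If cloning fails part way through, no shard has been updated, which is the property the split into phases is meant to provide.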
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shards via
do_update_pending_ranges().
Adjust get_token_metadata to get either the updated_token_metadata,
if available, or the base token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until it's replicated to all shards.
A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the replication to other shards happens later in `prepare_to_join`
that is called in `init_server`.
We should isolate the changes made by init_server and update them first
to all shards so that we can serialize them easily using a lock
and a mutable_token_metadata_ptr, otherwise the lock and the mutable_token_metadata_ptr
will have to be handed over (from this call path) to `prepare_to_join`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation to changing network_topology_strategy to
accept a const shared_token_metadata& rather than token_metadata&.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than accessing abstract_replication_strategy::_token_metadata directly.
In preparation to changing it to a shared_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that `replicate_tm_only` doesn't throw, we handle all errors
in `replicate_tm_only().handle_exception`.
We can't just proceed with business as usual if we failed to replicate
token_metadata on all shards and continue working with inconsistent
copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that mark also do_replicate_to_all_cores as noexcept.
The motivation to do so is to catch all errors in replicate_tm_only
and calling on_internal_error in the `handle_exception` continuation
in do_replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling invalidate_cached_rings and update_topology
on all shards do that only on shard 0 and then replicate
to all other shards using replicate_to_all_cores as we do
in all other places that modify token_metadata.
Do this in preparation to using a token_metadata_ptr
with which updating of token_metadata is done on a cloned
copy (serialized under a lock) that becomes visible only when
applied with replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.
With that we can make get_mutable_token_metadata private.
TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
keyspace_changed just calls update_pending_ranges (ignoring any
errors returned from it), so invoke it on shard 0, and with
that update_pending_ranges() is always called on shard 0
and it doesn't need to use `invoke_on` shard 0 by itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to assert in only 2 places:
do_update_pending_ranges, that updates token metadata,
and replicate_tm_only, that copies the token metadata
to all other shards.
Currently we throw errors if this is violated
but it should never happen and it's not really recoverable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently update_pending_ranges involves 2 serialized actions:
do_update_pending_ranges, and then replicate_to_all_cores.
These can be combined by calling do_replicate_to_all_cores
directly from do_update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It was introduced in 74b4035611
As part of the fix for #3203.
However, the reactor stalls have nothing to do with gossip
waiting for update_pending_ranges - they are related to it being
synchronous and quadratic in the number of tokens
(e.g. get_address_ranges calling calculate_natural_endpoints
for every token then simple_strategy::calculate_natural_endpoints
calling get_endpoint for every token)
There is nothing special in handle_state_leaving that requires
moving update_pending_ranges to the background, we call
update_pending_ranges in many other places and wait for it
so if gossip loop waiting on it was a real problem, then it'd
be evident in many other places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently _update_pending_ranges_action is called only on shard 0
and only later update_pending_ranges() updates shard 0 again and replicates
the result to all shards.
There is no need to wait between the two, and call _update_pending_ranges_action
again, so just call update_pending_ranges() in the first place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
so that the updated host_id (on shard 0) will get replicated to all shards
via update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's already done by each handle_state_* function
either by directly calling replicate_to_all_cores or indirectly, via
update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the updates to token_metadata are immediately visible
on shard 0, but not to other shards until replicate_to_all_cores
syncs them.
To prepare for converting to using shared token_metadata.
In the new world the updated token_metadata is not visible
until committed to the shared_token_metadata, so
commit it here and replicate to all other shards.
It is not clear this isn't needed presently too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.
Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe than sorry.
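The intended consumer-visible behaviour can be sketched with a toy reader. This is not the actual queue reader; the class and method names only mirror the terms used above.

```python
class ToyQueueReader:
    """Toy model of a reader fed by a producer; not the real implementation."""

    def __init__(self):
        self._buffer = []
        self._end_of_stream = False
        self._error = None

    def push_end_of_stream(self):
        # Normal termination: the producer has nothing more to send.
        self._end_of_stream = True

    def abort(self, error):
        # Deliberately leave _end_of_stream untouched, so the consumer
        # cannot mistake an aborted stream for one that ended normally.
        self._error = error

    def is_end_of_stream(self):
        return self._end_of_stream and not self._buffer

    def fill_buffer(self):
        if self._error is not None:
            raise self._error
```

A consumer that checks is_end_of_stream() first sees False after an abort and only learns about the failure from fill_buffer().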
Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
"
Today, if scylla crashes mid-way in sstable::idempotent-move-sstable
or sstable::create_links we may end up in an inconsistent state
where it refuses to restart due to the presence of the moved-
sstable component files in both the staging directory and
main directory.
This series hardens scylla against this scenario by:
1. Improving sstable::create_links to identify the replay condition
and support it.
2. Modifying the algorithm for moving sstables between directories
to never be in a state where we have two valid sstables with the
same generation, in both the source and destination directories.
Instead, it uses the temporary TOC file as a marker for rolling
backwards or forwards, and renames it atomically from the
destination directory back to the source directory as a commit
point. Before which it is preparing the sstable in the destination
dir, and after which it starts the process of deleting the sstable
in the source dir.
Fixes #7429
Refs #5714
"
* tag 'idempotent-move-sstable-v3' of github.com:bhalevy/scylla:
sstable: create_links: support for move
sstable_directory: support sstables with both TemporaryTOC and TOC
sstable: create_links: move automatic sstring variables
sstable: create_links: use captured comps
sstable: create_links: capture dir by reference
sstable: create_links: fix indentation
sstable: create_links: no need to roll-back on failure anymore
sstable: create_links: support idempotent replay
sstable: create_links: cleanup style
sstable: create_links: add debug/trace logging
sstable: move_to_new_dir: rm TOC last
sstable: move_to_new_dir: io check remove calls
test: add sstable_move_test
Previously, test/cql-pytest/run was a Python script, while
test/cql-pytest/run-cassandra (to run the tests against Cassandra)
was still a shell script - modeled after test/alternator/run.
This patch rewrites run-cassandra in Python.
A lot of the same code is needed for both run and run-cassandra
tools. test/cql-pytest/run was already written in a way that this
common code was separate functions. For example, functions to start a
server in a temporary directory, to check when it finishes booting,
and to clean up at the end. This patch moves this common code to
a new file, "run.py" - and the tools "run" and "run-cassandra" are
very short programs which mostly use functions from run.py (run-cassandra
also has some unique code to run Cassandra, that no other test runner
will need).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110215210.741753-1-nyh@scylladb.com>
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but it doesn't work for perftune.py, since it's not part of
Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which sets PATH for all python scripts.
Fixes #7350
After full cluster shutdown, the node which is being replaced will not have its
STATUS set to NORMAL (bug #6088), so listeners will not update _token_metadata.
The bootstrap procedure of replacing node has a workaround for this
and calls update_normal_tokens() on token metadata on behalf of the
replaced node based on just its TOKENS state obtained in the shadow
round.
It does this only for the replacing_a_node_with_same_ip case, but not
for replacing_a_node_with_diff_ip. As a result, replacing the node
with the same ip after full cluster shutdown fails.
We can always call update_normal_tokens(). If the cluster didn't
crash, token_metadata would get the tokens.
Fixes #4325
Message-Id: <1604675972-9398-1-git-send-email-tgrabiec@scylladb.com>
This patch introduces a new way to do functional testing on Scylla,
similar to Alternator's test/alternator but for the CQL API:
The new tests, in test/cql-pytest, are written in Python (using the pytest
framework), and use the standard Python CQL driver to connect to any CQL
implementation - be it Scylla, Cassandra, Amazon Keyspaces, or whatever.
The use of standard CQL allows the test developer to easily run the same
test against both Scylla and Cassandra, to confirm that the behaviour that
our test expects from Scylla is really the "correct" (meaning Cassandra-
compatible) behavior.
A developer can run Scylla or Cassandra manually, and run "pytest"
to connect to them (see README.md for more instructions). But even more
usefully, this patch also provides two scripts: test/cql-pytest/run and
test/cql-pytest/run-cassandra. These scripts automate the task of running
Scylla or Cassandra (respectively) in a random IP address and temporary
directory, and running the tests against it.
The script test/cql-pytest/run is inspired by the existing test run
scripts of Alternator and Redis, but rewritten in Python in a way that
will make it easy to rewrite - in a future patch - all these other run
scripts to use the same common code to safely run a test server in a
temporary directory.
"run" is extremely quick, taking around two seconds to boot Scylla.
"run-cassandra" is slower, taking 13 seconds to boot Cassandra (maybe
this can be improved in the future, I still don't know how).
The tests themselves take milliseconds.
Although the 'run' script runs a single Scylla node, the developer
can also bring up any size of Scylla or Cassandra cluster manually
and run the tests (with "pytest") against this cluster.
This new test framework differs from the existing alternatives in the
following ways:
dtest: dtest focuses on testing correctness of *distributed* behavior,
involving clusters of multiple nodes and often cluster changes
during the test. In contrast, cql-pytest focuses on testing the
*functionality* of a large number of small CQL features - which
can usually be tested on a single-node cluster.
Additionally, dtest is out-of-tree, while cql-pytest is in-tree,
making it much easier to add or change tests together with code
patches.
Finally, dtest tests are notoriously slow. Hundreds of tests in
the new framework can finish faster than a single dtest.
Slow and out-of-tree tests are difficult to write, and I believe
this explains why no developer loves writing dtests and maintainers
do not insist on having them. I hope cql-pytest can change that.
test/cql: The defining difference between the existing test/cql suite
and the new test/cql-pytest is that the new framework is programmatic
Python code, not a text file with desired output. Tests written as
code allow things like looping, or repeating the same test with different
parameters. Also, when a test fails, it makes it easier to understand
why it failed beyond just the fact that the output changed.
Moreover, in some cases, the output changes benignly and cql-pytest
may check just the desired features of the output.
Beyond this, the current version of test/cql cannot run against
Cassandra. test/cql-pytest can.
The primary motivation for this new framework was
https://github.com/scylladb/scylla/issues/7443 - where we had an
esoteric feature (sort order of *partitions* when an index is added),
which can be shown in Cqlsh to have what we think is incorrect behavior,
and yet: 1. We didn't catch this bug because we never wrote a test for it,
possibly because it was too difficult to contribute tests, and 2. We *thought*
that we knew what Cassandra does in this case, but nobody actually tested
it. Yes, we can test it manually with cqlsh, but wouldn't everything be
better if we could just run the same test that we wrote for Scylla against
Cassandra?
So one of the tests we add in this patch confirms issue #7443 in Scylla,
and that our hunch was correct and Cassandra indeed does not have this
problem. I also add a few trivial tests for keyspace create and drop,
as additional simple examples.
Refs #7443.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110110301.672148-1-nyh@scylladb.com>
I noticed that we require filtering for continuous clustering key, which is not necessary. I dropped the requirement and made sure the correct data is read from the storage proxy.
The corresponding dtest PR: https://github.com/scylladb/scylla-dtest/pull/1727
Tests: unit (dev,debug), dtest (next-gating, cql*py)
Closes #7460
* github.com:scylladb/scylla:
cql3: Delete some newlines
cql3: Drop superfluous ALLOW FILTERING
cql3: Drop unneeded filtering for continuous CK
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
This commit changes the build file generation and the package
creation scripts to be product aware. This will change the
relocatable package archives to be named after the product.
This commit deals with two main things:
1. Creating the actual Scylla server relocatable with a product
prefixed name - which is independent of any other change
2. Expecting all other packages to create product prefixed archives -
which is dependent upon the actual submodules creating
product prefixed archives.
If the support is not introduced in the submodules first this
will break the package build.
Tests: Scylla full build with the original product and a
different product name.
Closes #7581
Currently debian_files_gen.py mistakenly renames scylla-server.service to
"scylla-server." in environments with a non-standard product name, such as
scylla-enterprise. It should be fixed to use the correct filename.
Fixes #7423
This patch introduces many changes to the Scylla `CMakeLists.txt`
to enable building Scylla without resorting to pre-building
with a previous configure.py build, i.e. cmake script can now
be used as a standalone solution to build and execute scylla.
Submodules, such as Seastar and Abseil, are also dealt with
by importing their CMake scripts directly via `add_subdirectory`
calls. Other submodules, such as `libdeflate` now have a
custom command to build the library at runtime.
There are still a lot of things that are incomplete, though:
* Missing auxiliary packaging targets
* Unit-tests are not built (First priority to address in the
following patches)
* Compile and link flags are mostly hardcoded to the values
appropriate for the most recent Fedora 33 installation.
System libraries should be found via built-in `Find*` scripts,
compiler and linker flags should be observed and tested by
executing feature tests.
* The current build is aimed to be built by GCC, need to support
Clang since we are moving to it.
* Utility cmake functions should be moved to a separate "cmake"
directory.
The script is updated to use the most recent CMake version available
in Fedora 33, which is 3.18.
Right now this is more of a PoC rather than a full-fledged solution
but as far as it's not used widely, we are free to evolve it in
a relaxed manner, improving it step by step to achieve feature
parity with `configure.py` solution.
The value in this patch is that now we are able to use any
C++ IDE capable of dealing with CMake solutions and take
advantage of their built-in capabilities, such as:
* Building a code model to efficiently navigate code.
* Find references to symbols.
* Use pretty-printers, beautifiers and other tools conveniently.
* Run scylla and debug it right from the IDE.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201103221619.612294-1-pa.solodovnikov@scylladb.com>
DescribeTable should return a UUID "TableId" in its response.
We already had it for CreateTable, and now this patch adds it to
DescribeTable.
The test for this feature is no longer xfail. Moreover, I improved
the test to not only check that the TableId field is present - it
should also match the documented regular expression (the standard
representation of a UUID).
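The kind of check the improved test performs can be sketched as follows, assuming the documented pattern is the usual lowercase hexadecimal 8-4-4-4-12 form. The helper name is hypothetical and this is not the test's actual code.

```python
import re
import uuid

# Assumed pattern: the standard textual representation of a UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def looks_like_table_id(value):
    """Return True if value is a string that fully matches the UUID pattern."""
    return isinstance(value, str) and UUID_PATTERN.fullmatch(value) is not None
```

Using fullmatch() rather than search() makes sure that extra characters around a valid UUID are rejected.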
Refs #5026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201104114234.363046-1-nyh@scylladb.com>
When moving a sstable between directories, we would like to
be able to crash at any point during the algorithm with a
clear way to either roll the operation forwards or backwards.
To achieve that, define sstable::create_links_common that accepts
a `mark_for_removal` flag, implementing the following algorithm:
1. link src.toc to dst.temp_toc.
until removed, the destination sstable is marked for removal.
2. link all src components to dst.
crashing here will leave dst with both temp_toc and toc.
3.
a. if mark_for_removal is unset then just remove dst.temp_toc.
this commits the destination sstable and completes create_links.
b. if mark_for_removal is set then move dst.temp_toc to src.temp_toc.
this will atomically toggle recovery after crash from roll-back
to roll-forward.
here too, crashing at this point will leave src with both
temp_toc and toc.
Adjust the unit test for the revised algorithm.
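The steps above can be sketched in Python with hard links and an atomic rename. This is an illustration under simplifying assumptions, not the C++ implementation: the file names are simplified and hypothetical, and idempotent replay and error handling are left out.

```python
import os

TOC = "TOC.txt"                # simplified, hypothetical component names
TEMP_TOC = "TemporaryTOC.txt"


def create_links_common(src, dst, components, mark_for_removal):
    # 1. Link src TOC to dst temporary TOC. While it exists, the
    #    destination sstable is marked for removal.
    os.link(os.path.join(src, TOC), os.path.join(dst, TEMP_TOC))
    # 2. Link all src components (TOC included) to dst. Crashing here
    #    leaves dst with both the temporary TOC and the TOC.
    for name in components:
        os.link(os.path.join(src, name), os.path.join(dst, name))
    if not mark_for_removal:
        # 3a. Commit the destination sstable.
        os.remove(os.path.join(dst, TEMP_TOC))
    else:
        # 3b. Atomically switch recovery from roll-back to roll-forward:
        #     now the source is the one marked for removal.
        os.rename(os.path.join(dst, TEMP_TOC), os.path.join(src, TEMP_TOC))
```

At every point exactly one of the two directories holds a temporary TOC next to its TOC, which tells recovery which copy to delete.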
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keep descriptors in a map so it could be searched easily by generation.
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.
A following patch will add support to create_links for moving
sstables between directories. It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.
Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.
Add unit test for this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we use `idempotent_link_file` it'll no longer
fail with EEXIST in a replay scenario.
It may fail on ENOENT, and return an exceptional future.
This will be propagated up the stack. Since it may indicate
parallel invocation of move_to_new_dir, that deletes the source
sstable while this thread links it to the same destination,
rolling back by removing the destination links would
be dangerous.
For any other error, the node is going to be isolated
and stop operation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle the case where create_link is replayed after crashing in the middle.
In particular, if we restart when moving sstables from staging to the base dir,
right after create_links completes, and right before deleting the source links,
we end up with seemingly 2 valid sstables, one still in staging and the other
already in the base table directory, both are hard linked to the same inodes.
Make create_links idempotent so it can replay the operation safely if crashed and
restarted at any point of its operation.
Add unit tests for replay after partial create_links that is expected to succeed,
and a test for replay when an sstable exists in the destination that is not
hard-linked to the source sstable; create_links is expected to fail in this case.
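The replay rule can be sketched in Python as below. This is only an analogue of the idea: an existing destination is accepted when it is already a hard link to the source, and rejected otherwise. It is not the actual C++ helper.

```python
import os


def idempotent_link_file(src, dst):
    """Hard-link src to dst, tolerating a replay of the same operation."""
    try:
        os.link(src, dst)
    except FileExistsError:
        # Replay after a crash: dst exists. That is fine only if it is
        # the same inode as src; otherwise it is a different sstable.
        if not os.path.samefile(src, dst):
            raise
```

Calling it twice with the same arguments succeeds both times, while a destination that holds unrelated data still fails with EEXIST.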
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate cleanup on crash, first rename the TOC file to TOC.tmp,
and keep it until all other files are removed; finally, remove TOC.tmp.
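The removal order can be sketched in Python as follows. The function name and the simplified "TOC.txt" file name are hypothetical, and the io checks of the real code are omitted.

```python
import os


def remove_source_components(directory, components, toc="TOC.txt"):
    """Remove an sstable's files so that a crash leaves a clear marker."""
    toc_path = os.path.join(directory, toc)
    tmp_path = toc_path + ".tmp"
    # From here on the sstable is visibly "being deleted".
    os.rename(toc_path, tmp_path)
    for name in components:
        if name != toc:
            os.remove(os.path.join(directory, name))
    # The marker goes last, once nothing else is left to clean up.
    os.remove(tmp_path)
```

If the process dies in the middle, the leftover TOC.tmp tells the next startup that the remaining components belong to a partially deleted sstable.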
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This use-after-move was apparently exposed after switching to clang
in commit eb861e68e9.
The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.
This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
Older distributions such as CentOS7 do not support systemd user mode.
On such distributions nonroot mode does not work, so show a warning message and
skip running systemctl --user.
Fixes #7071
... to config descriptions
We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.
Closes #7565
The function is used by raft and fails with ubsan and clang.
The ub is harmless. Let's wait for it to be fixed in boost.
Message-Id: <20201109090353.GZ3722852@scylladb.com>
The retry mechanism didn't work when URLError happened. For example:
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
Let's catch URLError instead of HTTPError since URLError is a base exception
for all exceptions in the urllib module.
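A minimal sketch of such a retry loop, not the script's actual code. The function name, the retry count, the delay and the injectable opener parameter are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch url, retrying on any URLError (HTTPError is a subclass of it)."""
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay)
```

Catching only HTTPError would miss failures such as "Network is unreachable", which are raised as plain URLError before any HTTP response exists.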
Fixes: #7569
Closes #7567
If _offset falls beyond compound_type->types().size()
ignore the extra components instead of accessing out of the types
vector range.
FIXME: we should validate the thrift key against the schema
and reject it in the thrift handler layer.
Refs #7568
Test: unit(dev)
DTest: cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201108175738.1006817-1-bhalevy@scylladb.com>
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
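The intended behaviour can be illustrated with a toy model. The classes below are hypothetical stand-ins, not Scylla's keyspace or database classes.

```python
class Keyspace:
    """Toy keyspace holding only the durable_writes setting."""

    def __init__(self, durable_writes=True):
        self.durable_writes = durable_writes


class Database:
    """Toy database with a commitlog and a memtable."""

    def __init__(self):
        self.commitlog = []
        self.memtable = []

    def apply(self, keyspace, mutation):
        # Read the setting at apply time, so ALTER KEYSPACE takes effect
        # immediately instead of after a restart.
        if keyspace.durable_writes:
            self.commitlog.append(mutation)
        self.memtable.append(mutation)
```

Flipping durable_writes between two apply() calls changes whether the second mutation reaches the commitlog, with no restart in between.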
Fixes #3034
Closes #7533
This series provides assorted fixes which are a
pre-requisite for the joint consensus implementation
series which follows.
* scylla-dev/raft-misc:
raft: fix raft_fsm_test flakiness
raft: drop a waiter of snapshoted entry
raft: use correct type for node info in add_server()
raft: overload operator<< for debugging
An index that is waited on can be included in an installed snapshot in
which case there is no way to know if the entry was committed or not.
Abort such waiters with an appropriate error.
Overload operator<< for ostream and print relevant state for server, fsm, log,
and typed_uint64 types.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and components reference each other and, because of
this, start and stop in a specific order. This set does
this for the query processor.
The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.
* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
code: RIP global query processor instance
cql test env: Keep query processor reference on board
system distributed keyspace: Start sharded service earlier
schema_tables: Use qctx to make internal requests
transport: Keep sharded query processor reference on controller
thrift: Keep sharded query processor reference on controller
alternator: Use local query processor reference to get keys
alternator: Keep local query processor reference in server
The only purpose of this change is to compile (git-bisect
safety) and thus prove that the next patch is correct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.
Fixes #7527
Closes #7558
repair: Use single writer for all followers
Currently, repair master creates one writer for each follower to write
rows from the follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
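The saving is simple arithmetic, sketched below under the assumption stated above that each writer produces exactly one sstable per vnode range. The function name is hypothetical.

```python
def sstables_written_by_repair(vnode_ranges, rf, single_writer):
    """Number of sstables the repair master writes for the given ranges."""
    followers = rf - 1
    if followers <= 0:
        return 0  # nothing to repair from
    writers_per_range = 1 if single_writer else followers
    return vnode_ranges * writers_per_range
```

For example, with 256 vnode ranges and RF = 3 this gives 512 sstables with one writer per follower and 256 with a single writer.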
Fixes #7525
Closes #7528
* github.com:scylladb/scylla:
repair: No more vector for _writer_done and friends
repair: Use single writer for all followers
The default of DBUILD_TOOL=docker requires passwordless access to docker
by the user of dbuild. This is insecure, as any user with unconstrained
access to docker is root equivalent. Therefore, users might prefer to
run docker as root (e.g. by setting DBUILD_TOOL="sudo docker").
However, `$tool -e HOME` exports HOME as seen by $tool.
This breaks dbuild when `$tool` runs docker as another user.
`$tool -e HOME="$HOME"` exports HOME as seen by dbuild, which is
the intended behaviour.
Closes #7555
Instead of invoking `$tool`, as is done everywhere else in dbuild,
kill_it() invoked `docker` explicitly. This was slightly breaking the
script for DBUILD_TOOL other than `docker`.
Closes #7554
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().
The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:
scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.
This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.
The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.
Fixes#7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>
Fixes returned rows ordering to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens.
Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.
Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.
Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.
Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes
will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes.
Fix tests according to new token ordering and add one new test to validate this aspect explicitly.
Fixes#7443
Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.
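The difference between the two orderings can be illustrated with a small Python sketch. It is not the Scylla implementation: it assumes the legacy computation can be modeled as the token's 8-byte big-endian two's-complement encoding compared as raw bytes, and `legacy_sort_key` is a hypothetical helper name.

```python
import struct

def legacy_sort_key(token: int) -> bytes:
    # Assumption: model the legacy value as the 8-byte big-endian
    # two's-complement encoding of the token, compared byte by byte.
    return struct.pack(">q", token)

tokens = [-7, 5, -2, 1]

# Byte-wise (unsigned) comparison: negative tokens have the high bit set,
# so they sort after all non-negative tokens.
unsigned_order = sorted(tokens, key=legacy_sort_key)

# Signed 64-bit comparison: the ordering tokens are meant to have.
signed_order = sorted(tokens)

print(unsigned_order)  # [1, 5, -7, -2]
print(signed_order)    # [-7, -2, 1, 5]
```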
Closes#7534
* github.com:scylladb/scylla:
tests: add token ordering test of indexed selects
tests: fix tests according to new token ordering
secondary_index: use new token_column_computation
feature: add correct_idx_token_in_secondary_index
column_computation: add token_column_computation
token_column_computation: rename as legacy
The shared_from_this lw_shared_ptr must not be accessed
across shards. Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out-of-sync when incremented/decremented in parallel
on other shards with no synchronization).
This was introduced in 289a08072a.
The writer is not needed in the body of this lambda anyways
so it doesn't need to capture it. It is already held
by the continuations until the end of the chain.
Fixes#7535
Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
"
Since we are switching to clang due to raft, make it actually compile
with clang.
"
tgrabiec: Dropped the patch "raft: compile raft by default" because
the replication_test still fails in debug mode:
/usr/include/boost/container/deque.hpp:1802:63: runtime error: applying non-zero offset 8 to null pointer
* 'raft-clang-v2' of github.com:scylladb/scylla-dev:
raft: Use different type to create type dependent statement for static assertion
raft: drop use of <ranges> for clang
raft: make test compile with clang
raft: drop -fcoroutines support from configure.py
Now that both repair followers and repair master use a single writer, we
can get rid of the vector associated with _writer_done and friends.
Fixes#7525
Currently, the repair master creates one writer for each follower to write
rows from the follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes#7525
A gcc bug [1] caused objects built by different versions of gcc
not to interoperate. Gcc helpfully warns when it encounters code that
could be affected.
Since we build everything with one version, and as that version is far
newer than the last version generating incorrect code, we can silence
that warning without issue.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728
Closes #7495
Do not run tests which are not built.
For that, pass the test list from configure.py to test.py
via ninja unit_test_list target.
Minor cleanups.
* scylla-dev.git/test.py-list:
test: enable raft tests
test.py: do not run tests which are not built
configure.py: add a ninja command to print unit test list
test.py: handle ninja mode_list failure
configure.py: don't pass modes_list unless it's used
Add new test validating that rows returned from both non-indexed selects
and indexed selects return rows sorted in token order (making sure
that both positive and negative tokens are present to test if signed
comparison order is maintained).
Switches token column computation to (new) token_column_computation,
which fixes#7443, because new token column will be compared using
signed comparisons, not the previous unsigned comparison of CQL bytes
type.
This column computation type is only set if cluster supports
correct_idx_token_in_secondary_index feature to make sure that all nodes
will be able to compute (new) token_column_computation. Also old
indexes will need to be rebuilt to take advantage of this fix, as new
token column computation type is only set for new indexes.
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes, which was
incorrect as this column computation produced values that when compared
with unsigned comparison (CQL type bytes comparison) resulted in
different ordering than token signed comparison. See issue:
https://github.com/scylladb/scylla/issues/7443
Introduce new token_column_computation class which is intended to
replace legacy_token_column_computation. The new column computation
returns token as long_type, which means that it will be ordered
according to signed comparison (not unsigned comparison of bytes), which
is the correct ordering of tokens.
Rename token_column_computation to legacy_token_column_computation, as
it will be replaced with new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:
https://github.com/scylladb/scylla/issues/7443
meaningful
When computing moving average rates too early after startup, the
rate can be infinite. This is simply because the sample interval
since the system started is too small to generate meaningful results.
Here we check for this situation and keep the rate at 0 if it happens
to signal that there are still no meaningful results.
This incident is unlikely to happen since it can happen only during a
very small time window after restart, so we add a hint to the compiler
to optimize for that in order to have a minimum impact on the normal
use case.
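A minimal Python sketch of the guard described above, for illustration only. The function name `mean_rate` and the `min_interval` threshold are assumptions, not taken from the patch, and the compiler hint has no Python equivalent.

```python
def mean_rate(count: int, elapsed_seconds: float, min_interval: float = 1e-3) -> float:
    """Return events per second, or 0.0 while the interval is too small.

    min_interval is a hypothetical threshold chosen for this sketch.
    """
    # Too soon after startup: dividing by a near-zero interval would give
    # a meaningless (possibly infinite) rate, so report 0 instead.
    if elapsed_seconds < min_interval:
        return 0.0
    return count / elapsed_seconds
```

Returning 0 rather than an error keeps callers simple: a zero rate right after restart reads as "no meaningful result yet".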
Fixes#4469
The memory configuration for the database object was left at zero.
This can cause the following chain of failures:
- the test is a little slow due to the machine being overloaded,
and debug mode
- this causes the memtable flush_controller timer to fire before
the test completes
- the backlog computation callback is called
- this calculates the backlog as dirty_memory / total_memory; this
is 0.0/0.0, which resolves to NaN
- eventually this gets converted to an integer
- UBSAN doesn't like the conversion from NaN to integer, and complains
Fix by initializing dbcfg.available_memory.
Test: gossip_test(debug), 1000 repetitions with concurrency 6
Closes#7544
Fixes#7325
When building with clang on fedora32, calling the string_view constructor
of bignum generates broken ID:s (i.e. parsing borks). Creating a temp
std::string fixes it.
Closes#7542
Since 11a8912093, get_gossip_status
returns a std::string_view rather than a sstring.
As seen in dtest we may print garbage to the log
if we print the string_view after preemption (calling
_gossiper.reset_endpoint_state_map().get())
Test: update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_two_nodes_in_parallel_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201103132720.559168-1-bhalevy@scylladb.com>
"
Introduce a gentle (yielding) implementation of reserve for chunked
vector and use it when reserving the backing storage vector for large
bitset. Large bitset is used by bloom filters, which can be quite large
and have been observed to cause stalls when allocating memory for the
storage.
Fixes: #6974
Tests: unit(dev)
"
* 'gentle-reserve/v1' of https://github.com/denesb/scylla:
utils/large_bitset: use reserve_partial() to reserve _storage
utils/chunked_vector: add reserve_partial()
There's a perf_bptree test that compares B+ tree collection with
std::set and std::map ones. There will come more, also the "patterns"
to compare are not just "fill with keys" and "drain to empty", so
here's the perf_collection test, that measures timings of
- fill with keys
- drain key by key
- empty with .clear() call
- full scan with iterator
- insert-and-remove of a single element
for currently used collections
- std::set
- std::map
- intrusive_set_external_comparator
- bplus::tree
* https://github.com/xemul/scylla/tree/br-perf-collection-test:
test: Generalize perf_bptree into perf_collection
perf_collection: Clear collection between iterations
perf_collection: Add intrusive_set_external_comparator
perf_collection: Add test for single element insertion
perf_collection: Add test for destruction with .clear()
perf_collection: Add test for full scan time
To avoid stalls when reserving memory for a large bloom filter. The
filter creation already has a yielding loop for initialization, this
patch extends it to reservation of memory too.
A variant of reserve() which allows gentle reserving of memory. This
variant will allocate just one chunk at a time. To drive it to
completion, one should call it repeatedly with the return value of the
previous call, until it returns 0.
This variant will be used in the next patch by the large bitset creation
code, to avoid stalls when allocating large bloom filters (which are
backed by large bitset).
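The calling convention can be sketched in Python as follows. This is a toy model under stated assumptions: `ChunkedVector`, `gentle_reserve` and `yield_point` are hypothetical names, and the real container is a C++ template whose exact signature is not shown here.

```python
class ChunkedVector:
    """Toy model of a vector whose storage is a list of fixed-size chunks."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self._chunks = []

    def capacity(self) -> int:
        return len(self._chunks) * self.chunk_size

    def reserve_partial(self, n: int) -> int:
        """Allocate at most one chunk; return n if more is needed, else 0."""
        if self.capacity() < n:
            self._chunks.append([None] * self.chunk_size)
        return n if self.capacity() < n else 0


def gentle_reserve(vec: ChunkedVector, n: int, yield_point=lambda: None) -> int:
    """Drive reserve_partial() to completion, yielding between chunks.

    Returns the number of reserve_partial() calls made.
    """
    calls = 0
    while n:
        n = vec.reserve_partial(n)
        calls += 1
        yield_point()  # stand-in for giving other tasks a chance to run
    return calls
```

With a chunk size of 4, reserving 10 elements takes three calls and ends with a capacity of 12.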
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.
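As an illustration of the check, here is a short Python sketch. `MarshalException` and `ascii_from_text` are stand-in names for this example and do not mirror the actual C++ classes or functions.

```python
class MarshalException(Exception):
    """Stand-in for the marshal exception mentioned above."""


def ascii_from_text(text: str) -> bytes:
    """Convert text to an ascii-typed value, rejecting non-ASCII characters."""
    if not text.isascii():
        raise MarshalException(f"value is not valid ASCII: {text!r}")
    return text.encode("ascii")
```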
Fixes #5421
Closes #7532
When unit tests fail, test.py dumps their output on the screen. It is impossible
to read this output from the terminal, all the more so since the logs are saved in
the testlog/ directory anyway. At the same time, the names of the failed tests are all left
_before_ these logs, and if the terminal history is not large enough, it becomes
quite annoying to find the names out.
The proposal is not to spoil the terminal with raw logs -- just names and summaries.
Logs themselves are at testlog/$mode/$name_of_the_test.log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201031154518.22257-1-xemul@scylladb.com>
Otherwise all readers will be created with the default forwarding::yes.
This inhibits some optimizations (e.g. results in more sstable read-ahead).
It will also be problematic when we introduce mutation sources which don't support
forwarding::yes in the future.
Message-Id: <1604065206-3034-1-git-send-email-tgrabiec@scylladb.com>
Clang brings us working support for coroutines, which are
needed for Raft and for code simplification.
perf_simple_query as well as full system tests show no
significant performance regression.
Test: unit(dev, release, debug)
Closes#7531
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
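The idea of trimming a result to complete records can be sketched in Python. The row layout and the `end_of_batch` key are assumptions made for this example; they are not the actual CDC log schema or the Alternator code.

```python
def complete_records(rows):
    """Return the prefix of rows that ends at the last end-of-record marker.

    Rows after the last marker belong to a record that the query limit cut
    off, so they are dropped.
    """
    last_end = -1
    for i, row in enumerate(rows):
        if row["end_of_batch"]:
            last_end = i
    return rows[: last_end + 1]


rows = [
    {"change": "a", "end_of_batch": False},
    {"change": "a", "end_of_batch": True},
    {"change": "b", "end_of_batch": False},  # record "b" is incomplete
]
print([r["change"] for r in complete_records(rows)])  # ['a', 'a']
```

As the note above says, this may return fewer rows than the user's limit; a re-query loop would be needed to fill the page.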
Closes#7498
* github.com:scylladb/scylla:
alternator::streams: Reduce the query limit depending on cdc opts
alternator::streams: Use end-of-record info in get_records
This series cleans up the gossiper endpoint_state interface
marking methods const and const noexcept where possible.
To achieve that, endpoint_state::get_status was changed to
return a string_view rather than a sstring so it won't
need to allocate memory.
Also, the get_cluster_name and get_partitioner_name were
changed to return a const sstring& rather than sstring
so they won't need to allocate memory.
The motivation for the series stems from #7339
where an exception in get_host_id within a storage_service
notification handler, called from seastar::defer crashed
the server.
With this series, get_host_id may still throw exceptions on
logical error, but not from calling get_application_state_ptr.
Refs #7339
Test: unit(dev)
* tag 'gossiper-endpoint-noexcept-v2':
gossiper: mark trivial methods noexcept
gossiper: get_cluster_name, get_partitioner_name: make noexcept
gossiper: get_gossip_status: return string_view and make noexcept
gms/endpoint_state: mark methods using get_status noexcept
gms/endpoint_state: get_status: return string_view and make noexcept
gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
gms/endpoint_state: mark trivial methods noexcept
gms/heart_beat_state: mark methods noexcept
gms/versioned_value: mark trivial methods noexcept
gms/version_generator: mark get_next_version noexcept
fb_utilities.hh: mark methods noexcept
messaging: msg_addr: mark methods noexcept
gms/inet_address: mark methods noexcept
When logging in to a GCE instance created from the GCE image, it takes 10 seconds to determine that we are not running on AWS. Also, some unnecessary debug logging messages are printed:
```
bentsi@bentsi-G3-3590:~/devel/scylladb$ ssh -i ~/.ssh/scylla-qa-ec2 bentsi@35.196.8.86
Warning: Permanently added '35.196.8.86' (ECDSA) to the list of known hosts.
Last login: Sun Nov 1 22:14:57 2020 from 108.128.125.4
_____ _ _ _____ ____
/ ____| | | | | __ \| _ \
| (___ ___ _ _| | | __ _| | | | |_) |
\___ \ / __| | | | | |/ _` | | | | _ <
____) | (__| |_| | | | (_| | |__| | |_) |
|_____/ \___|\__, |_|_|\__,_|_____/|____/
__/ |
|___/
Version:
666.development-0.20201101.6be9f4938
Nodetool:
nodetool help
CQL Shell:
cqlsh
More documentation available at:
http://www.scylladb.com/doc/
By default, Scylla sends certain information about this node to a data collection server. For information, see http://www.scylladb.com/privacy/
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
[bentsi@artifacts-gce-image-jenkins-db-node-aa57409d-0-1 ~]$
```
this PR fixes this
Closes#7523
* github.com:scylladb/scylla:
scylla_util.py: remove unnecessary logging
scylla_util.py: make is_aws_instance faster
scylla_util.py: added ability to control sleep time between retries in curl()
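A retry loop with a configurable sleep between attempts might look like the Python sketch below. It does not reproduce the real `curl()` helper in scylla_util.py: `fetch_with_retries`, its parameters and the injected `fetch` and `sleep` callables are all hypothetical.

```python
import time


def fetch_with_retries(fetch, url, max_retries=5, retry_interval=5.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with a configurable pause.

    fetch and sleep are injected so that callers (and tests) can control
    both the transport and the waiting time.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except OSError as e:
            last_error = e
            # Do not sleep after the final failed attempt.
            if attempt + 1 < max_retries:
                sleep(retry_interval)
    raise last_error
```

A caller that only needs a quick "are we on AWS?" probe could pass a small `max_retries` and `retry_interval` so that the failure is detected quickly.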
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.
Fixes #7515
Closes #7516
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
The instructions are updated for multiarch images (images that
can be used on x86 and ARM machines).
Additionally,
- docker is replaced with podman, since that is now used by
developers. Docker is still supported for developers, but
the image creation instructions are only tested with podman.
- added instructions about updating submodules
- `--format docker` is removed. It is not necessary with
more recent versions of docker.
Closes#7521
connection_notifier.hh defines a number of template-specialized
variables in a header. This is illegal since you're allowed to
define something multiple times if it's a template, but not if it's
fully specialized. gcc doesn't care but clang notices and complains.
Fix by defining the variables as inline variables, which are
allowed to have definitions in multiple translation units.
Closes#7519
Some ARM cores are slow, and trip our current timeout of 3000
seconds in debug mode. Quadrupling the timeout is enough to make
debug-mode tests pass on those machines.
Since the timeout's role is to catch rare infinite loops in unsupervised
testing, increasing the timeout has no ill effect (other than to
delay the report of the failure).
Closes#7518
The main goal of this series is to fix issue #6951 - a Query (or Scan) with
a combination of filtering and projection parameters produced wrong results if
the filter needs some attributes which weren't projected.
This series also adds new tests for various corner cases of this issue. These
new tests also pass after this fix, or still fail because of some other missing
feature (namely, nested attributes). These additional tests will be important if
we ever want to refactor or optimize this code, because they exercise some rare
corner code paths at the intersection of filtering and projection.
This series also fixes some additional problems related to this issue, like
combining old and new filtering/projection syntaxes (should be forbidden), and
even one fix to a wrong comment.
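The essence of the filter and projection interaction can be shown with a small Python sketch. It is a simplified model, not the Alternator code: `query`, its parameters and the equality-only filter are assumptions for illustration.

```python
def query(items, projection, filter_attr, filter_value):
    """Filter on the full item first, then keep only projected attributes."""
    results = []
    for item in items:
        # The filter may need an attribute that is not in the projection,
        # so it has to be evaluated before the item is trimmed.
        if item.get(filter_attr) != filter_value:
            continue
        results.append({k: v for k, v in item.items() if k in projection})
    return results


items = [
    {"p": 1, "a": "x", "b": 10},
    {"p": 2, "a": "y", "b": 20},
]
# Filter on "a", which is not projected.
print(query(items, {"p", "b"}, "a", "x"))  # [{'p': 1, 'b': 10}]
```

Projecting first would remove "a" from every item, and the filter would then reject all of them.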
Closes#7328
* github.com:scylladb/scylla:
alternator test: tests for nested attributes in FilterExpression
alternator test: fix comment
alternator tests: additional tests for filter+projection combination
alternator: forbid combining old and new-style parameters
alternator: fix query with both projection and filtering
When calling curl and an exception is raised, we see unnecessary log messages that we can't control.
For example, when used in scylla_login, we can see the following messages:
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
These methods can return a const sstring& rather than
allocating a sstring. And with that they can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change get_gossip_status to return string_view,
and with that it can be noexcept now that it doesn't
allocate memory via sstring.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_status returns string_view, just compare it with a const char*
rather than making a sstring out of it, and consequently, can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get_status doesn't need to allocate a sstring, it can just
return a std::string_view to the status string, if found.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although std::map::find is not guaranteed to be noexcept,
it depends on the comparator used, and in this case comparing application_state
is noexcept. Therefore, we can safely mark get_application_state_ptr noexcept.
is_cql_ready depends on get_application_state_ptr and otherwise
handles any exceptions from boost::lexical_cast, so it can be marked
noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_next_version() is noexcept,
update_heart_beat can be noexcept too.
All others are trivially noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on gms::inet_address.
With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clang does not implement P1814R0 (class template argument deduction
for alias templates), so it can't deduce the template arguments
for range_bound, but it can for interval_bound, so switch to that.
Using the modern name rather than the compatibility alias is preferred
anyway.
Closes#7422
In commit de38091827 the two IO priority classes streaming_read
and streaming_write were merged into just one. The document docs/isolation.md
leaves a lot to be desired (hint, hint, to anyone reading this who
can write content!) but let's at least not have incorrect information
there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201101102220.2943159-1-nyh@scylladb.com>
This series adds more context to debugging information in case a view gets out of sync with its base table.
A test was conducted manually, by:
1. creating a table with a secondary index
2. manually deleting computed column information from system_schema.computed_columns
3. restarting the target node
4. trying to write to the index
Here's what's logged right after the index metadata is loaded from disk:
```
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Column idx_token in view ks.t_c_idx_index was not found in the base table ks.t
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Missing idx_token column is caused by an incorrect upgrade of a secondary index. Please recreate index ks.t_c_idx_index to avoid future issues.
```
And here's what's logged during the actual failure - when Scylla notices that there exists
a column which is not computed, but it's also not found in the base table:
```
ERROR 2020-10-30 12:31:25,709 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: seastar::internal::backtraced<std::runtime_error> (base_schema(): operation unsupported when initialized only for view reads. Missing column in the base table: idx_token Backtrace: 0x1d14513
0x1d1468b
0x1d1492b
0x109bbad
0x109bc97
0x109bcf4
0x1bc4370
0x1381cd3
0x1389c38
0xaf89bf
0xaf9b20
0xaf1654
0xaf1afe
0xb10525
0xb10ad8
0xb10c3a
0xaaefac
0xabf525
0xabf262
0xac107f
0x1ba8ede
0x1bdf749
0x1be338c
0x1bfe984
0x1ba73fa
0x1ba77a4
0x9ea2c8
/lib64/libc.so.6+0x27041
0x9d11cd
--------
seastar::lambda_task<seastar::execution_stage::flush()::{lambda()#1}>
```
Hopefully, this information will make it much easier to solve future problems with out-of-sync views.
Tests: unit(dev)
Fixes #7512
Closes #7513
* github.com:scylladb/scylla:
view: add printing missing base column on errors
view: simplify creating base-dependent info for reads only
view: fix typo: s/dependant/dependent
view: add error logs if a view is out of sync with its base
In certain CQL statements it's possible to provide a custom timestamp via the USING TIMESTAMP clause. Those values are accepted in microseconds, however, there's no limit on the timestamp (apart from type size constraint) and providing a timestamp in a different unit like nanoseconds can lead to creating an entry with a timestamp way ahead in the future, thus compromising the table.
To avoid this, this change introduces a sanity check for modification and batch statements that raises an error when a timestamp of more than 3 days into the future is provided.
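A sketch of such a check in Python, for illustration. The 3-day limit and the microsecond unit come from the description above; the function name `check_timestamp`, the use of `ValueError` and the error text are assumptions.

```python
import time

# 3 days expressed in microseconds, the unit USING TIMESTAMP expects.
MAX_FUTURE_US = 3 * 24 * 60 * 60 * 1_000_000


def check_timestamp(timestamp_us, now_us=None):
    """Reject timestamps more than 3 days ahead of the current time."""
    if now_us is None:
        now_us = time.time_ns() // 1000
    if timestamp_us > now_us + MAX_FUTURE_US:
        raise ValueError("timestamp is more than 3 days into the future")
    return timestamp_us
```

A timestamp mistakenly given in nanoseconds is about 1000 times larger than the current time in microseconds, so it lands far beyond the 3-day window and is rejected.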
Fixes #5619
Closes #7475
The constructors just set up the references, real start happens in .start()
so it is safe to do this early. This helps not carrying migration manager
and query processor down the storage service cluster joining code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.
To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).
The qctx itself is a global pointer, which waits for its fix too, of course.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an out-of-sync view is attempted to be used in a write operation,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
* seastar 6973080cd1...57b758c2f9 (11):
> http: handle 'match all' rule correctly
> http: add missing HTTP methods
> memory: remove unused lambda capture in on_allocation_failure()
> Support seastar allocator when seastar::alien is used
> Merge "make timer related functions noexcept" from Benny
> script: update dependecy packages for centos7/8
> tutorial: add linebreak between sections
> doc: add nav for the second last chap
> doc: add nav bar at the bottom also
> doc: rename add_prologue() to add_nav_to_body()
> Wrong name used in an example in mini tutorial.
An upcoming change in Seastar only initializes the Seastar allocator in
reactor threads. This causes imr_test and double_decker_test to fail:
1. Those tests rely on LSA working
2. LSA requires the Seastar allocator
3. Seastar is not initialized, so the Seastar allocator is not initialized.
Fix by switching to the Seastar test framework, which initializes Seastar.
Closes#7486
test.py estimates the amount of memory needed per test
in order not to overload the machine, but it underestimates
badly and so machines with many cores but not a lot of memory
fail the tests (in debug mode principally) due to running out
of memory.
Increase the estimate from 2GB per test to 6GB.
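To see why the estimate matters, consider this Python sketch of a memory-bounded job count. The formula and the name `max_parallel_tests` are assumptions for illustration; how test.py actually combines core count and memory is not described here.

```python
GIB = 2**30


def max_parallel_tests(cpu_count, total_memory_bytes, per_test_bytes=6 * GIB):
    """Bound the number of concurrent tests by both cores and memory."""
    by_memory = total_memory_bytes // per_test_bytes
    # Always allow at least one test to run.
    return max(1, min(cpu_count, by_memory))


# A 64-core machine with 64 GiB of memory:
print(max_parallel_tests(64, 64 * GIB, per_test_bytes=2 * GIB))  # 32
print(max_parallel_tests(64, 64 * GIB))                          # 10
```

Under this model, the larger per-test estimate cuts concurrency on memory-poor machines instead of letting tests run out of memory.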
Closes#7499
gcc collects all the initialization code for thread-local storage
and puts it in one giant function. In combination with debug mode,
this creates a very large stack frame that overflows the stack
on aarch64.
Work around the problem by placing each initializer expression in
its own function, thus reusing the stack.
Closes#7509
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.
Closes#7507
Gossip currently runs inside the default (main) scheduling group. It is
fine to run inside the default scheduling group, but from time to time we see
many tasks in the main scheduling group and we suspect gossip. It is best
to move gossip to a dedicated scheduling group, so that we can catch
bugs that leak tasks to the main group more easily.
After this patch, we can check:
scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"}
Fixes: #7154
Tests: unit(dev)
We require a kernel that is at least 3.10.0-514, because older
kernels have an XFS-related bug that causes data corruption. However,
this Requires: clause pulls in a kernel even in a Docker installation,
where it (and especially the associated firmware) occupies a lot of
space.
Change to a Conflicts: instead. This prevents installation when
the really old kernel is present, but doesn't pull it in for the
Docker image.
Closes#7502
Overview
Fixes#7355.
Before this change, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).
Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.
It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).
GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.
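The effect of the missing group-by information can be modeled with a short Python sketch. It is a deliberately simplified stand-in for `result_set_builder`: `select_grouped` is a hypothetical function, and an empty index list models the builder not being told about the grouping.

```python
from itertools import groupby

# (pk, ck, v) rows matching v = 3 in the example above.
rows = [(1, 2, 3), (1, 4, 3)]


def select_grouped(rows, group_by_indices, selected_index):
    """Emit one value per group of consecutive rows with equal group keys."""
    if not group_by_indices:
        # No grouping information: every row is emitted on its own.
        return [row[selected_index] for row in rows]

    def key(row):
        return tuple(row[i] for i in group_by_indices)

    # Take the first row of each run of rows sharing the same key.
    return [next(group)[selected_index] for _, group in groupby(rows, key=key)]


print(select_grouped(rows, [], 0))   # [1, 1]
print(select_grouped(rows, [0], 0))  # [1]
```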
Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).
The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
```
The second commit fixes this issue by properly trimming `row_ranges`.
The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.
The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).
The second commit fixes the first two failing examples from issue #7355.
Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.
Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).
I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact on those fixed bugs, therefore the tests validate a large number of those scenarios.
Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.
Closes#7497
* github.com:scylladb/scylla:
tests: Add secondary index aggregates tests
select_statement: Introduce internal_paging_size
select_statement: Fix paging on indexed selects
select_statement: Fix GROUP BY on indexed select
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).
The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.
Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; without this ant is not
able to find javac.
The toolchain is enabled for x86_64 and aarch64.
Extensively tests queries on tables with secondary indices with
aggregates and GROUP BYs. Tests three cases that are implemented
in indexed_table_select_statement::do_execute - partition_slices,
whole_partitions and (non-partition_slices and non-whole_partitions).
As some of the issues found were related to paging, the tests check
scenarios where the inserted data is smaller than a page, larger than
a page and larger than two pages (and some boundary scenarios).
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.
This change adds a new internal_paging_size variable, which is
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
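The set/reset pair described above can be pictured with a small sketch. It is written in Python rather than the project's C++, and only the names `internal_paging_size`, `set_internal_paging_size`, `reset_internal_paging_size` and `DEFAULT_COUNT_PAGE_SIZE` come from the commit message; the module-level storage, the positivity check and the `count_pages` helper are assumptions made for illustration.

```python
DEFAULT_COUNT_PAGE_SIZE = 10_000

# Hypothetical module-level state standing in for the internal variable.
_internal_paging_size = DEFAULT_COUNT_PAGE_SIZE


def internal_paging_size() -> int:
    """Return the page size currently used for internal paging."""
    return _internal_paging_size


def set_internal_paging_size(size: int) -> None:
    """Override the internal page size (intended for tests only)."""
    global _internal_paging_size
    if size <= 0:
        # Assumption: a non-positive page size is rejected.
        raise ValueError("paging size must be positive")
    _internal_paging_size = size


def reset_internal_paging_size() -> None:
    """Restore the default internal page size."""
    global _internal_paging_size
    _internal_paging_size = DEFAULT_COUNT_PAGE_SIZE


def count_pages(total_rows: int) -> int:
    """Hypothetical helper: internal pages needed to read total_rows rows."""
    page = internal_paging_size()
    return (total_rows + page - 1) // page
```

With a small page size a test needs only a handful of rows to span several internal pages, which is the motivation given above for making the value configurable.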
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.
The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:
    CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
    CREATE INDEX kst_index ON ks.t(c);
    INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
    INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
    INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
    SELECT COUNT(*) FROM ks.t WHERE c = 3;
     count
    -------
         2
This change fixes this issue by properly trimming row_ranges.
The second fixed problem is related to setting the paging_state
to internal_options. It was improperly set just after reading from
index, making the base query start from invalid paging_state.
This change fixes this issue by setting the paging_state after both
index and base table queries are done. Moreover, the paging_state is
now set based on paging_state of index query and the results of base
table query (as base query can return more rows than index query).
Fixes the first two failing examples from issue #7355.
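The idea of trimming the existing row ranges, instead of overwriting them with a new half-open range, can be sketched as a plain interval intersection. This is not the code from the commit: integers stand in for clustering key prefixes, the `Range` type and the `trim_ranges` function are hypothetical, and bounds are treated as inclusive for simplicity.

```python
from typing import List, Optional, Tuple

# Inclusive bounds; None as the upper bound means "unbounded above".
Range = Tuple[int, Optional[int]]


def trim_ranges(row_ranges: List[Range], new_range: Range) -> List[Range]:
    """Intersect new_range with each existing range, dropping empty results."""
    new_lo, new_hi = new_range
    trimmed: List[Range] = []
    for lo, hi in row_ranges:
        start = max(lo, new_lo)
        if hi is None:
            end = new_hi
        elif new_hi is None:
            end = hi
        else:
            end = min(hi, new_hi)
        # Keep the range only if it still contains at least one key.
        if end is None or start <= end:
            trimmed.append((start, end))
    return trimmed
```

In this model the restriction `c = 3` is the range `(3, 3)`. Intersecting it with a half-open range starting at 3 keeps `(3, 3)`, whereas replacing it with the half-open range would also admit the rows with `c = 4` and `c = 5`.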
Before the change, GROUP BY SELECTs with some WHERE restrictions on an
indexed column would return invalid results (same grouped column values
appearing multiple times):
    CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
    CREATE INDEX ks_t on ks.t(v);
    INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
    INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
    SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
     pk
    ----
      1
      1
This is fixed by correctly passing _group_by_cell_indices to
result_set_builder. Fixes the third failing example from issue #7355.
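Why the group-by indices matter to the result builder can be shown with a toy model. The sketch below is Python, not the actual `result_set_builder`; the `build_result` function is hypothetical, and it assumes rows arrive ordered so that rows of the same group are adjacent.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def build_result(rows: Sequence[Tuple], group_by_cell_indices: Sequence[int]) -> List[list]:
    """Collapse adjacent rows that share the same group-by cell values.

    Without group-by indices every input row is emitted unchanged, which is
    how the duplicated output above arises in this model.
    """
    if not group_by_cell_indices:
        return [list(row) for row in rows]

    def key(row: Tuple) -> Tuple:
        return tuple(row[i] for i in group_by_cell_indices)

    # Emit the first row of each run of rows with an equal grouping key.
    return [list(next(group)) for _, group in groupby(rows, key=key)]
```

For the two rows selected in the example, `build_result([(1,), (1,)], [0])` yields a single row, while passing an empty index list reproduces the duplicated `pk` value.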
This is the continuation of 30722b8c8e, so let me re-cite Rafael:
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:
    class repair_writer {
        _writer_done[node_idx] =
            mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
                ...
            }).handle_exception([this] {
                ...
            });
    }
The fiber accesses the repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.
To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.
Fixes #7406
Closes #7430
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.
Fixes #7493.
Closes #7494
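A minimal sketch of such a check is shown below. The commit message does not say where or in which language the check lives, so this Python version, the function name `ensure_no_ld_preload` and the error text are all assumptions for illustration.

```python
from typing import Mapping


def ensure_no_ld_preload(env: Mapping[str, str]) -> None:
    """Abort with an error if LD_PRELOAD is set to a non-empty value."""
    preload = env.get("LD_PRELOAD", "")
    if preload.strip():
        # Raising SystemExit with a string prints it to stderr and exits
        # with a non-zero status.
        raise SystemExit(
            f"error: LD_PRELOAD is set ({preload!r}); preloaded libraries "
            "are not supported in the relocatable environment"
        )
```

A launcher would call `ensure_no_ld_preload(os.environ)` before starting the relocatable binary.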
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.
The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.
Closes #7466
python3 has its own relocatable package, no need to include it
in scylla-package.tar.gz.
Closes #7467
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.
Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.
Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.
Refs #7239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`
It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays correct username iff `PasswordAuthenticator` is configured.
What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers
Refs #6946
Closes #7349
* github.com:scylladb/scylla:
transport: Update `connection_stage` in `system.clients`
transport: Retrieve driver's name and version from STARTUP message
transport: Notify `system.clients` about "protocol_version"
transport: On successful authentication add `username` to system.clients
This patch changes the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.
Fixes #7488
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
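The pattern of iterating over a copy of the names so that entries can be removed safely looks like this in Python. The function `remove_matching` and the use of a plain dictionary for the metrics object are illustrative assumptions, not the patched code.

```python
from typing import Callable, Dict, List


def remove_matching(metrics: Dict[str, object], predicate: Callable[[str], bool]) -> List[str]:
    """Remove every metric whose name satisfies predicate.

    list(metrics) snapshots the names first, so deleting entries inside the
    loop does not invalidate the iteration.
    """
    removed: List[str] = []
    for name in list(metrics):
        if predicate(name):
            del metrics[name]
            removed.append(name)
    return removed
```

Iterating over `metrics` directly while deleting from it raises `RuntimeError` in Python, which is the analogue of the unsafe iteration this patch avoids.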
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.
Closes #7483
Refs #7364
The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.
Closes #7433
Fixes #7435
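For readers unfamiliar with the data structure, a chunked vector stores its elements in fixed-size chunks so that growth never requires one large contiguous allocation. The Python class below only illustrates the indexing scheme; the `ChunkedVector` name, the chunk size and the method set are assumptions and do not mirror the C++ container used in the patch.

```python
class ChunkedVector:
    """Append-only sequence stored as a list of fixed-size chunks."""

    def __init__(self, chunk_size: int = 4) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self._chunk_size = chunk_size
        self._chunks = []
        self._size = 0

    def append(self, item) -> None:
        # A size that is a multiple of chunk_size means every existing chunk
        # is full (or there are none), so start a new chunk.
        if self._size % self._chunk_size == 0:
            self._chunks.append([])
        self._chunks[-1].append(item)
        self._size += 1

    def __len__(self) -> int:
        return self._size

    def __getitem__(self, index: int):
        if not 0 <= index < self._size:
            raise IndexError(index)
        chunk, offset = divmod(index, self._chunk_size)
        return self._chunks[chunk][offset]

    def chunk_count(self) -> int:
        return len(self._chunks)
```

Appending ten items with a chunk size of four produces three chunks, and element access stays constant-time through the `divmod` on the index.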
Adds an "eor" (end-of-record) column to cdc log. This is non-null only on
last-in-timestamp group rows, i.e. end of a singular source "event".
A client can use this as a shortcut to knowing whether or not he has a
full cdc "record" for a given source mutation (single row change).
Closes #7436
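A client-side use of the new column might look like the sketch below. Only the column name `eor` comes from the commit message; representing log rows as dictionaries, the `op` key in the usage example and the `group_records` function are hypothetical, and rows are assumed to arrive in log order.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def group_records(rows: List[Row]) -> Tuple[List[List[Row]], List[Row]]:
    """Split log rows into complete records and a trailing incomplete one.

    A record is complete once a row with a non-null "eor" value is seen.
    """
    records: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        current.append(row)
        if row.get("eor") is not None:
            records.append(current)
            current = []
    return records, current
```

Rows left in the second return value belong to a record whose end has not been read yet, so the client knows to fetch more before processing them.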
Makes files shorter while still keeping the lines under 120 columns.
Separate from other commits to make review easier.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Don't require filtering when a continuous slice of the clustering key
is requested, even if partition is unrestricted. The read command we
generate will fetch just the selected data; filtering is unnecessary.
Some tests needed to update the expected results now that we're not
fetching the extra data needed for filtering. (Because tests don't do
the final trim to match selectors and assert instead on all the data
read.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request
username, so users logged with it will remain as "anonymous" in
`system.clients`.
Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
In some cases a collection is used to keep several elements,
so it's good to know this timing.
For example, a mutation_partition keeps a set of rows, if used
in cache it can grow large, if used in mutation to apply, it's
typically small. Plain replacement of bst into b-tree caused
performance degradation of mutation application because b-tree
is only better at big sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This collection is widely used, any replacement should be
compared against it to better understand pros-n-cons.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Alternator does not yet support direct access to nested attributes in
expressions (this is issue #5024). But it's still good to have tests
covering this feature, to make it easier to check the implementation
of this feature when it comes.
Until now we did not have tests for using nested attributes in
*FilterExpression*. This patch adds a test for the straightforward case,
and also adds tests for the more elaborate combination of FilterExpression
and ProjectionExpression. This combination - see issue #6951 - means that
some attributes need to be retrieved despite not being projected (because
they are needed in a filter). When we support nested attributes there will
be special cases when the projected and filtered attributes are parts of
the same top-level attribute, so the code will need to handle those cases
correctly. As I was working on issue #6951 now, it is a good time to write
a test for these special cases, even if nested attributes aren't yet
supported - so we don't forget to handle these special cases later.
Both new tests pass on DynamoDB, and xfail on Alternator.
Refs #5024 (nested attributes)
Refs #6951 (FilterExpression with ProjectionExpression)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A comment in test/alternator/test_lsi.py wrongly described the schema
of one of the test tables. Fix that comment.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides two more tests for issue #6951. As this issue was
already fixed, the two new tests pass.
The two new tests check two special cases which were handled correctly
but not yet tested - when the projected attribute is a key attribute of
the table or of one of its LSIs. Having these two additional tests will
ensure that any future refactoring or optimizations in this area of
the code (filtering, projection, and its combination) will not break these
special cases.
Refs #6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API has for the Query and Scan requests two filtering
syntaxes - the old (QueryFilter or ScanFilter) and the new (FilterExpression).
Also for projection, it has an old syntax (AttributesToGet) and a new
one (ProjectionExpression). Combining an old-style and new-style parameter
is forbidden by DynamoDB, and should also be forbidden by Alternator.
This patch fixes, and removes the "xfails" tag, of two tests:
test_query_filter.py::test_query_filter_and_projection_expression
test_filter_expression.py::test_filter_expression_and_attributes_to_get
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
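The rule can be sketched as a simple request check. The parameter names are the DynamoDB ones listed above; the `check_no_mixed_syntax` function, the `ValidationError` class and the error text are hypothetical and are not Alternator's implementation.

```python
from typing import Mapping

OLD_STYLE = ("QueryFilter", "ScanFilter", "AttributesToGet")
NEW_STYLE = ("FilterExpression", "ProjectionExpression")


class ValidationError(Exception):
    """Hypothetical stand-in for the validation error returned to the client."""


def check_no_mixed_syntax(request: Mapping[str, object]) -> None:
    """Reject a Query/Scan request that mixes old- and new-style parameters."""
    old = [name for name in OLD_STYLE if name in request]
    new = [name for name in NEW_STYLE if name in request]
    if old and new:
        raise ValidationError(
            f"cannot combine non-expression parameters {old} "
            f"with expression parameters {new}"
        )
```

A request using only `QueryFilter` and `AttributesToGet`, or only the two expression parameters, passes the check unchanged.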
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.
The solution in this patch is to add to the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
The two tests
test_query_filter.py::test_query_filter_and_attributes_to_get
test_filter_expression.py::test_filter_expression_and_projection_expression
which failed before this patch now pass, so we drop their "xfail" tag.
Fixes #6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
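The three steps of the fix (read the projected attributes plus those the filter needs, run the filter, then strip the extras) can be sketched as follows. Items are modelled as plain dictionaries, and `query_items` with its parameters is a hypothetical illustration rather than Alternator's code.

```python
from typing import Callable, Dict, Iterable, List

Item = Dict[str, object]


def query_items(
    items: Iterable[Item],
    projected: Iterable[str],
    filter_attrs: Iterable[str],
    predicate: Callable[[Item], bool],
) -> List[Item]:
    """Apply a filter that may need attributes outside the projection."""
    projected_set = set(projected)
    needed = projected_set | set(filter_attrs)
    result: List[Item] = []
    for item in items:
        # Step 1: keep the projected attributes plus those the filter needs.
        candidate = {k: v for k, v in item.items() if k in needed}
        # Step 2: run the filter on the widened item.
        if predicate(candidate):
            # Step 3: drop the filter-only attributes before returning.
            result.append({k: v for k, v in candidate.items() if k in projected_set})
    return result
```

Projecting only `a` while filtering on `b` returns items containing just `a`, but the filter still sees `b`; filtering on the already-projected item would instead find `b` missing.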
2020-10-05 02:19:22 +03:00
1280 changed files with 26771 additions and 7780 deletions
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
# Reporting an issue
## Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
## Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),